Big Data Engineer

Location:
Cary, NC
Posted:
February 08, 2024

JANSEE KORAPATI

Email: ad3g8z@r.postjobfree.com

Cell: 984-***-****

PROFILE SUMMARY

Experienced Senior Data Engineer/Big Data Lead with over 15 years of IT experience, including 9+ years of development and lead experience on Big Data and cloud platforms.

Hands-on experience in Big Data technologies such as Spark, Scala, Spark SQL, Spark Streaming, Kafka, HDFS, Sqoop, Hive, Impala, YARN & Cassandra.

Strong experience in the Azure cloud (Event Hubs, Data Lakes, Blob Storage, Active Directory), Docker & Kubernetes clusters (AKS).

Deep understanding of partitioning & bucketing concepts in Hive and of designing both managed and external tables to optimize performance.

Good experience with different file formats such as JSON, Parquet, CSV, Avro & ORC.

Experience using various distributions such as Cloudera, DataStax, Azure & CDP.

Good technical and analytical skills & experience in designing data platform architectures.

Good experience in Grafana, CI/CD, Jenkins, Bitbucket, GitHub, Jupyter Notebooks, Databricks, Elasticsearch, ELK, Nexus & JFrog Artifactory.

Good experience in ETL tools such as Oozie & Spark.

Good experience in monolithic, microservices & serverless architectures.

Moderate experience in monitoring tools such as Prometheus, Grafana, Kubernetes dashboard, Hue, Slack & OpsGenie.

Good exposure to Druid, Azure Data Factory, Event Hub Capture & Python.

Good exposure to AWS cloud services such as S3 & RDS, and to PySpark.

Strong experience in automation testing using QTP.

Good working experience in Agile (Scrum based) development model.

Self-motivated and ready to learn innovative technologies; comfortable working in a fast-paced environment.

Received Signature Awards for improving the patching process, supporting the migration of engineering subscriptions from Honeywell to Resideo, and migrating data to the Gen2 Data Lake.

CERTIFICATIONS

Completed Azure AZ-900, DP-900 & DP-203 certifications.

EDUCATION

Bachelor's in Electrical & Electronics Communication Engineering from Jawaharlal Nehru Technological University (JNTU), India, 2007

TECHNICAL SKILLS

Big Data Technologies: Spark SQL, Spark Streaming, Hive, Sqoop, YARN, Cloudera, Impala, Kafka, ZooKeeper, Kubernetes clusters, Docker, Jenkins, StreamSets, Druid & ELK

Cloud Services: Azure Event Hubs, Data Lakes, Databricks, Data Factory, AWS S3, Glue, EMR, ECR, Boto3, CloudWatch & CDP

NoSQL Databases: Cassandra, Elassandra, HBase, MongoDB, DocumentDB

SQL Databases: SQL Server, MySQL & Oracle

Monitoring: Grafana, Prometheus, Hue, Kubernetes dashboard

IDEs & Tools: Eclipse, IntelliJ & VS Code

Scheduler Tools: Cron, Oozie, Kubernetes CronJobs, Control-M

Version Control Tools: Bitbucket, GitHub & SVN

PROFESSIONAL EXPERIENCE

Anthem/Elevance Health

September 2022 – Present

Big Data Lead

Big data tools used: Spark, Scala, Spark SQL, Hive 3, Cloudera CDP 7.1.8, Hue, AWS S3, AWS RDS, MongoDB, AWS DocumentDB, Control-M scheduler & Kafka.

Description: Elevance Health is an American health insurance provider whose services include medical, Medicare, dental & vision plans. It provides services to members according to their contact & communication preferences, such as email, text, phone, and address.

Roles and Responsibilities:

Creating Spark jobs in Scala to process data and meet customer requirements.

Successfully deployed 4 products into production.

Worked closely with architects on designing solutions for the on-prem to AWS cloud migration project.

Refactored Spark & Scala job code and reduced job execution time by 40%.

Created a new Spark job for 3 data sources from a new source system (cloud) that previously pointed to the on-prem Oracle database, completing job execution in under 4 minutes (the existing job consuming from Oracle took ~45 minutes).

Refactored the Oracle & file-extract Landing Zone Spark code and the Rules Engine Spark jobs.

Updated existing Spark jobs to change phone number levels and preferences.

Created a new Spark job to bring demographics data from the AWS cloud, detokenize it, and integrate it with on-prem Oracle data.

Creating external tables in Hive 3 (CDP 7.1.8) for new source files.

Created Spark jobs to consume data from S3 buckets into HDFS.

Creating Spark jobs to process members' daily delta preferences data from a new source (AWS RDS), enrich the membership data, generate unique keys for member & profile identification, and insert the results into Hive tables, as sketched below.
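
A minimal Scala sketch of this kind of delta-enrichment job; the landing path, table names (membership.member_profile, preferences.member_preferences) and columns (member_id, preference_type) are placeholders rather than the actual schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MemberPreferencesDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("member-preferences-delta")
      .enableHiveSupport()
      .getOrCreate()

    // Daily delta extract landed from AWS RDS into S3 (path is a placeholder)
    val delta = spark.read.parquet("s3a://landing-bucket/member_prefs_delta/")

    // Membership reference data already in Hive (table name is a placeholder)
    val members = spark.table("membership.member_profile")

    // Enrich the delta with membership attributes and derive surrogate keys
    val enriched = delta
      .join(members, Seq("member_id"), "left")
      .withColumn("profile_key",
        sha2(concat_ws("|", col("member_id"), col("preference_type")), 256))
      .withColumn("load_dt", current_date())

    // Append the day's preferences into a pre-created Hive target table
    enriched.write.mode("append").insertInto("preferences.member_preferences")

    spark.stop()
  }
}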

Refactored the Spark code that reads from Hive, applies rules and transformations to the data, and stores the results into MongoDB collections.

Worked on a POC to validate upsert functionality in Hive 3 for updating & deleting rows in Hive tables.

Implementing Hive 3 upsert functionality to improve performance by removing the overwrite logic in existing code; see the sketch below.
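
A minimal sketch of the Hive 3 upsert pattern, assuming the CDP Hive Warehouse Connector (plain Spark SQL cannot write Hive 3 ACID tables directly), an existing SparkSession named spark, and placeholder table/column names:

// Requires the Hive Warehouse Connector JAR on the classpath (CDP)
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// MERGE the daily delta into the ACID target instead of overwriting the whole table
hive.executeUpdate(
  """MERGE INTO preferences.member_preferences AS t
    |USING staging.member_preferences_delta AS s
    |  ON t.member_id = s.member_id AND t.preference_type = s.preference_type
    |WHEN MATCHED THEN UPDATE SET
    |  preference_value = s.preference_value, load_dt = s.load_dt
    |WHEN NOT MATCHED THEN INSERT VALUES
    |  (s.member_id, s.preference_type, s.preference_value, s.load_dt)""".stripMargin)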

Responsible for code reviews and code deployments.

Cox Communications

Oct 2021 – August 2022

Big Data Architect

Big data tools used: Spark, Spark Structured Streaming, Scala, Hive, Sqoop, Kafka, Cloudera, AWS Cloud, AWS S3, Databricks, HBase, Oozie, Hue

Description: Cox is enhancing the downstream process by including multiple OFDM blocks. The objective of this project is to handle multiple data ingestions into the EDLC space and expose the data. This involves consuming modem data from multiple sources such as Oracle and Kafka, then transforming the data and storing it into Hive & HBase tables.

Roles and Responsibilities:

Created Spark streaming jobs to consume data from Kafka topics & store the results into Hive & HBase tables; a sketch follows.
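
A minimal sketch of such a streaming job, assuming a JSON payload and placeholder broker, topic, schema, and table names; the HBase write would go through a separate connector and is omitted here:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("modem-events-stream")
  .enableHiveSupport()
  .getOrCreate()

// Illustrative schema for the modem events on the topic
val schema = new StructType()
  .add("modem_id", StringType)
  .add("event_type", StringType)
  .add("event_ts", TimestampType)

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
  .option("subscribe", "modem-events")               // placeholder topic
  .option("startingOffsets", "latest")
  .load()

val events = raw
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")

// Micro-batch sink: append each batch into a pre-created Hive table
events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write.mode("append").insertInto("edlc.modem_events")
  }
  .option("checkpointLocation", "/tmp/checkpoints/modem-events")
  .start()
  .awaitTermination()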

Created Spark jobs to read data from AWS S3 buckets & transform it in Databricks notebooks to perform analytics.

Wrote Spark jobs in Databricks notebooks & ran ad hoc and monthly reports.

Responsible for creating jobs that import data from Oracle into Hive tables using Sqoop, parse the data, and export it to the CCM DB using GoldenGate replication.

Created scripts to perform data quality checks on modem & interface data; a sketch follows.
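
A minimal sketch of this kind of data quality check, with placeholder table and column names; spark is an existing Hive-enabled SparkSession:

import org.apache.spark.sql.functions._

val modems = spark.table("edlc.modem_events") // placeholder table

// Null-rate check on the key column
val nullKeys = modems.filter(col("modem_id").isNull).count()

// Duplicate check on the business key
val duplicateKeys = modems
  .groupBy("modem_id", "event_ts")
  .count()
  .filter(col("count") > 1)
  .count()

// Overall row count, to reconcile against the source-side count
val rowCount = modems.count()
println(s"rows=$rowCount nullKeys=$nullKeys duplicateKeys=$duplicateKeys")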

Responsible for peer-reviewing code and merging it into the master branch, deploying code to dev & QA environments using Jenkins, and creating MOP documents for production deployments.

Scheduled jobs in Oozie, exported results to New Relic, and monitored jobs in Hue.

Deployed 3 big data projects into production successfully.

Resideo/Honeywell

March 2018 – Oct 2021

Technology Specialist

Project 2: Athena Data platform

Big data tools used: Spark, Spark Streaming, Scala, Cassandra, Kubernetes cluster, Microsoft Azure, Data Factory, DevOps (Jenkins, Groovy script), StreamSets, Druid, Kafka, Docker, Data Lake, Event Hubs, Blob storage, Airflow

Description: Processing and analyzing thermostat data (and other device data) from the LCC, TCC and CHIL subsystems using big data technologies; consuming data from Event Hubs and writing it into the data lake; building pipelines for monthly energy reports.

Roles and Responsibilities:

Created Spark ingestion jobs that consume data from Event Hubs, transform the data, and store it into the data lake for analytics, as sketched below.
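
A minimal sketch of such an ingestion job using the azure-event-hubs-spark connector; the connection string, consumer group, container, and paths are placeholders and would normally come from Key Vault:

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("eventhub-ingestion").getOrCreate()

// Placeholder Event Hubs connection details
val connStr = ConnectionStringBuilder("Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>")
  .setEventHubName("thermostat-telemetry")
  .build
val ehConf = EventHubsConf(connStr).setConsumerGroup("athena-ingestion")

val raw = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

// The connector delivers the payload as binary; cast it to string for downstream parsing
val events = raw.select(col("body").cast("string").as("json"), col("enqueuedTime"))

// Land the stream in the data lake as Parquet, partitioned by ingestion date
events.withColumn("ingest_date", to_date(col("enqueuedTime")))
  .writeStream
  .format("parquet")
  .partitionBy("ingest_date")
  .option("path", "abfss://raw@<storageaccount>.dfs.core.windows.net/thermostat/")
  .option("checkpointLocation", "abfss://raw@<storageaccount>.dfs.core.windows.net/_checkpoints/thermostat/")
  .start()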

Worked closely with architects on designing the architecture to migrate legacy applications to the new data platform.

Created Jenkins pipelines for all preprocessor jobs, Spark streaming jobs, Spark batch jobs and APIs, including running binary scans, Protecode SC and Black Duck scans on images and deploying them to the cluster for the selected environment.

Ran monthly energy report Spark jobs from Jenkins.

Involved in automating the CI/CD platform to trigger the pipeline as soon as code is committed to the branch & run all stages: code checkout from Git, build, binary scan, Protecode SC and Black Duck scans, & deploy.

Created consumer groups, Event Hubs, blob storage and key vaults, service principals in Azure Active Directory, Gen1 & Gen2 Data Lakes, and AKS clusters for the QA and production subscriptions in the Azure portal.

Worked on enhancements to the monthly energy report Spark jobs.

Worked on Spark ingestion job optimizations & file format choices.

Created Spark jobs in Jupyter notebooks to provide customers with analytics data for their requirements or issues.

Worked on consuming events from Event Hubs and storing them into the Druid database.

Created Docker images for all sbt jobs, Maven and .NET services, pushed them to Azure Container Registry, and deployed the images in the Kubernetes cluster.

Worked on StreamSets pipelines to consume data from Event Hubs, transform the data, and write it into the data lake in Parquet format.

Created Dev and QA environments in the Kubernetes cluster and worked on production issue resolution.

Project 1: Titan Data Analytics platform (Homes and Buildings)

Big data tools used: Kubernetes cluster (AKS), Microsoft Azure, Blob storage, Spark Streaming, Scala, Kafka, ZooKeeper, Schema Registry, Elassandra, Docker, Data Lake, Azure Data Factory, Databricks, Elasticsearch, Event Hubs, DevOps (Jenkins, Groovy script) and Grafana

Description: Processing and analyzing thermostat data (and other device data) from the LCC, TCC and CHIL subsystems using big data technologies; consuming data from Event Hubs, processing it with Kafka and Spark Streaming, and storing the processed data in the Elassandra database.

Roles and Responsibilities:

Worked on Spark streaming jobs to consume data from Kafka topics, transform the events, and store them into the Elassandra database; a sketch follows.
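
A minimal sketch of the Elassandra-side write for such a job, assuming the DataStax spark-cassandra-connector (Elassandra exposes the standard Cassandra write path) and placeholder keyspace/table names; events stands for a streaming DataFrame already parsed from Kafka, as in the earlier Kafka sketch:

import org.apache.spark.sql.DataFrame

// Requires spark-cassandra-connector and spark.cassandra.connection.host in the Spark conf
events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "telemetry", "table" -> "thermostat_events"))
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/thermostat-events")
  .start()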

Created Docker images for all sbt jobs, Maven and .NET services, pushed them to Azure Container Registry, and deployed the images in the Kubernetes cluster.

Created pipelines in Azure Data Factory to migrate data from the Gen1 Data Lake to the Gen2 Data Lake.

Executed Spark jobs in Jupyter and Databricks notebooks to provide customers with analytics data for their requirements or issues.

Involved in deploying microservices to web servers for the customer API web portal.

Reviewed developed code, approved pull requests, and merged code to the master branch in Bitbucket.

Creating stories and assigning them to the team in JIRA, and managing sprints on the JIRA board.

JPMC/Mphasis

April 2016 – Feb 2018

Module Lead

Project: LRI Evolution (Banking Domain)

Big data tools used: Spark, Hive, Impala, Python & Tableau

Description: LRI Data Analytics receives source files in Avro format from the Aggregate Engine, and Hadoop copies the files into the cluster. The data is cleansed and enriched using Spark and loaded into Impala tables; views are created per user requirements and dashboards are published.

Roles and Responsibilities:

Worked on Spark to ingest Avro files into Hive and process the data using various DataFrame functions, as sketched below.
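
A minimal sketch of such an ingestion, assuming the spark-avro module (Spark 2.4+; older versions used the Databricks Avro package) and placeholder paths, table, and column names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("lri-avro-ingest")
  .enableHiveSupport()
  .getOrCreate()

// Read the Avro extracts delivered by the Aggregate Engine (path is a placeholder)
val raw = spark.read.format("avro").load("/data/lri/landing/positions/")

// Cleanse and enrich with DataFrame functions
val cleansed = raw
  .filter(col("account_id").isNotNull)
  .withColumn("trade_date", to_date(col("trade_date"), "yyyy-MM-dd"))
  .withColumn("load_ts", current_timestamp())

// Store as Parquet in a Hive table that Impala can also query
cleansed.write
  .mode("append")
  .format("parquet")
  .saveAsTable("lri.positions_enriched")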

Created Impala tables to store the processed results in a tabular format

Working on reconciliation of horizontal, vertical and summary tables using Spark and Python to ensure data is ingested without failures.

Worked on Tableau to create dashboards on top of the Impala tables.

Involved in production deployment and support activities.

AIG/Mphasis

Project: AIG - Life and Retirement - Compliance Data Lake (Insurance domain)

Big data tools used: Spark, Sqoop, Hive, Pig, Tableau and Hortonworks

Roles and Responsibilities:

Worked extensively on Sqoop to import data from Oracle into HDFS using various import features such as import-all-tables, importing multiple tables with --exclude-tables, and importing subsets of data with query conditions.

Worked on Sqoop incremental loads using the last-modified value.

Worked on Spark to ingest files into Hive and process the data using various DataFrame functions.

Created external tables in Hive to avoid duplicating data already present in HDFS; a sketch follows.
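
A minimal sketch of this pattern, with a placeholder database, table, columns, and HDFS location; spark is an existing Hive-enabled SparkSession:

// External table: Hive tracks only the metadata, the files stay where they are in HDFS
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS compliance.policy_events (
    |  policy_id STRING,
    |  event_type STRING,
    |  event_ts TIMESTAMP
    |)
    |PARTITIONED BY (load_date STRING)
    |STORED AS ORC
    |LOCATION '/data/compliance/policy_events'""".stripMargin)

// Register partitions that already exist under the external location
spark.sql("MSCK REPAIR TABLE compliance.policy_events")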

Created shell scripts to execute Sqoop, Hive and Spark jobs.

Utilized optimization techniques in both Spark and Hive, such as parallel processing, early filtering, ORC storage in Hive, compression, and vectorization.

Involved in development, production deployment and support activities.

JPMC/Mphasis

Project: Banking Data Lake

Big Data Tools Used: Cloudera distribution CDH 5.7.1, Sqoop, HDFS, Spark SQL, Scala, Hive, Impala, QlikView and Talend

Involved in implementing the following use cases:

Data Ingestion, Data Transformation, Error Handling, Data Analytics, Flat Table, Calculations, Adjustment, Allocations, EOD, EOM, Metadata & Reporting

Roles & Responsibilities:

Worked on preparing Data Ingestion HLD & LLD documents.

Worked on importing data from the MySQL database as well as CSV files using Spark SQL.

Created DataFrames to hold RDBMS data and CSV file data.

Worked on calculating derived metrics using the map function in Scala.

Faced challenges importing incremental loads (CDC) with Sqoop when tables don't have a timestamp column, so implemented the CDC logic in Spark SQL instead; see the sketch below.
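
A minimal sketch of timestamp-free CDC by diffing full snapshots, with a placeholder key column (account_id) and paths; spark is an existing SparkSession:

// Yesterday's and today's full extracts of a table that has no timestamp column
val previous = spark.read.parquet("/data/lake/accounts/snapshot_prev/")
val current  = spark.read.parquet("/data/lake/accounts/snapshot_curr/")

// New or changed rows: present in today's snapshot but not identical in yesterday's
val upserts = current.except(previous)

// Deleted keys: present yesterday but missing today
val deletes = previous.select("account_id").except(current.select("account_id"))

upserts.write.mode("overwrite").parquet("/data/lake/accounts/cdc/upserts/")
deletes.write.mode("overwrite").parquet("/data/lake/accounts/cdc/deletes/")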

Worked on converting file formats such as Avro to Parquet, and on importing MySQL tables into HDFS directly in Avro, Parquet and SequenceFile formats.

Involved in the error handling component, capturing errors using the Error Handler and Error Util classes in the ingestion phase.

Involved in generating reports through QlikView.

Involved in generating data through the Talend tool.

Associate Tech Specialist, Tech Mahindra

May 2015 – April 2016

Tools Used: Sqoop, Pig, Hive, Oozie and QTP

Tech Lead, Tech Mahindra/ People Prime

Feb 2014 – Nov 2014

Senior Software Test Engineer, Value Labs

March 2011 – Feb 2014

Testing – Trainee, Computer Vision Labs

Nov 2008 – Feb 2011


