Priyanka K
Dallas, TX
*********.****@*****.***
Bigdata Engineer
Professional Summary:
8+ years of professional experience as a Big Data Engineer/Developer in analysis, design, development, implementation, maintenance, and support, with experience in Big Data, Hadoop development, Python, PL/SQL, Java, SQL, REST APIs, and the AWS cloud platform.
Hands-on experience with the Hadoop ecosystem, including HDFS, MapReduce, Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, and Zookeeper; also worked on Spark SQL, Spark Streaming, and AWS services such as EMR, S3, Airflow, Glue, and Redshift.
Hands-on experience developing data pipelines using Spark components: Spark SQL, Spark Streaming, and MLlib.
Hands-on experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
Hands-on experience implementing and designing large-scale data lakes, pipelines, and efficient ETL (extract, transform, load) workflows to organize, collect, and standardize data that generates insights and addresses reporting needs.
Experience in AWS cloud services such as Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), VPC, Simple Storage Service (S3), Redshift, CloudWatch, and Lambda functions.
Experience working with both Streaming and Batch data processing using multiple technologies.
Excellent understanding of Hadoop architecture and its various components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
Experienced in developing web-based applications using Python, Qt, C++, XML, CSS, JSON, HTML, DHTML.
Hands-on experience building, scheduling, and monitoring workflows using Apache Airflow with Python.
Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
Experienced in software methodologies such as Agile and SAFe, including sprint planning, daily standups, and story grooming.
Experience implementing CI/CD pipelines for DevOps: source code management using Git, unit testing, and build and deployment scripts.
Hands-on experience working with DevOps tools such as Jenkins, Docker, Kubernetes, GoCD, and the AutoSys scheduler.
Worked with Kafka streaming using KTables and KStreams, deploying them on Confluent and Apache Kafka environments.
Experience working in various Software Development Methodologies like Waterfall, Agile SCRUM and TDD.
Worked in complete Software Development Life Cycle in Agile model.
Strong problem-solving skills with an ability to isolate, deconstruct and resolve complex data challenges.
Technical Skills:
Big Data Ecosystem
HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Pig, Airflow, Impala, Sqoop, HBase, Flume, Oozie, Zookeeper
Hadoop Distributions
Cloudera, Hortonworks, Apache.
Cloud Environments
AWS EMR, EC2, S3, AWS Redshift, Airflow
Operating Systems
Linux, Windows
Languages
Python, SQL, Scala, Java
Databases
Oracle, SQL Server, MySQL, HBase, MongoDB, DynamoDB
ETL Tools
Informatica
Report & Development Tools
Eclipse, IntelliJ IDEA, Visual Studio Code, Jupyter Notebook, Tableau, Power BI.
Development/Build Tools
Maven, Gradle
Repositories
GitHub, SVN.
Scripting Languages
Bash/shell scripting (Linux/Unix)
Methodology
Agile, Waterfall
Professional Experience:
Medtronic Inc, Minneapolis, MN Apr 2023 – Current
Sr. Data Engineer – AWS, PySpark, Python, Hive
•Developed Streaming Applications using PySpark and Kafka to read messages from Kafka topics hosted on AWS, transforming and writing JSON data to AWS S3 buckets, ensuring efficient data storage and retrieval for downstream processing (see the sketch after this list).
•Optimized ETL Pipelines using Spark and PySpark to process large datasets, transforming and loading data into Amazon Redshift, utilizing AWS Glue for seamless integration and automation of data flows.
•Designed and Implemented a Notification System using AWS SNS to automatically send alerts to subscribers via email and SMS when specific events occur in the data pipeline, improving system monitoring and response times.
•Optimized Scala-based Spark jobs for cost-effective processing on AWS EMR by tuning configurations and resource allocation.
•Implemented real-time data processing applications using Python and AWS Kinesis for streaming analytics.
•Built machine learning models with Python and deployed them on AWS SageMaker for scalable model training and inference.
•Developed custom Python scripts for data extraction and transformation, interfacing with AWS DynamoDB for NoSQL operations.
•Utilized Python with AWS CloudFormation for infrastructure as code, automating the provisioning of cloud resources.
•Integrated Scala with AWS Glue to automate ETL processes and manage data cataloging.
•Developed Scala-based machine learning models using Spark MLlib and deployed them on AWS SageMaker.
•Used Scala to implement distributed data processing and analytics on AWS Redshift Spectrum, enabling queries on data stored in S3.
•Integrated AWS SNS with Lambda Functions to trigger automated workflows in response to incoming notifications, ensuring real-time processing and handling of critical events.
•Automated Infrastructure Provisioning on AWS using Terraform, reducing deployment time and ensuring consistent and repeatable infrastructure setups across multiple environments.
•Orchestrated Complex Workflows using AWS Step Functions to automate and manage the execution of serverless applications, reducing manual intervention and improving operational efficiency.
•Developed Asynchronous Processing Pipelines using AWS SQS to decouple and manage message queues between microservices, improving application scalability and fault tolerance.
•Built Real-Time ETL Processes using AWS Glue, Kinesis, and PySpark to continuously stream and process data, ensuring near real-time availability of data in Amazon Redshift, enhancing decision-making processes.
•Orchestrated Complex Workflows using AWS Step Functions, integrating various AWS services like Lambda, SQS, and SNS, to automate and streamline ETL processes and data pipeline operations.
•Automated Repetitive Tasks by developing tools using Python, Shell scripting, and XML, including the management and orchestration of AWS services, to increase operational efficiency.
•Accessed and Processed Hive Tables into Spark using PySpark scripts and Spark SQL, achieving faster data processing and enhanced performance in data analytics tasks.
•Developed and Automated ETL Processes to load data into Amazon Redshift from various sources such as S3, RDS, and DynamoDB using optimized SQL commands like COPY, ensuring timely and efficient data ingestion.
•Ingested Data from Various Sources into HDFS Data Lake using PySpark for both streaming and batch processing, ensuring scalable and efficient data storage and retrieval.
•Performed Performance Tuning of HQL Queries by analyzing query plans, optimizing indexing, and applying partition pruning, resulting in improved execution times and resource utilization.
•Implemented Spark Applications in Scala for real-time analysis and fast querying, utilizing Spark APIs and transformations such as reduceByKey to process large-scale datasets efficiently.
•Developed Back-End Web Services using Python and the Django REST framework, integrating with AWS services like Lambda and API Gateway for serverless application deployment.
•Enhanced Existing Hadoop Models by utilizing Spark Context, Spark-SQL, DataFrames, and Pair RDDs, optimizing them for better performance and scalability.
•Imported Data Using Sqoop from various relational databases like Oracle, Teradata, and MySQL into Hadoop, enabling large-scale data processing and analytics.
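The streaming bullet above can be illustrated with a minimal PySpark Structured Streaming sketch, assuming the spark-sql-kafka connector is available on the cluster; the broker, topic, bucket, and schema names below are hypothetical placeholders rather than the actual configuration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3-stream").getOrCreate()

# Assumed JSON event schema; replace with the real payload definition.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read raw messages from a Kafka topic (placeholder brokers and topic name).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "device-events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the Kafka value bytes as JSON and flatten into columns.
events = (
    raw.selectExpr("CAST(value AS STRING) AS json_str")
    .select(from_json(col("json_str"), event_schema).alias("e"))
    .select("e.*")
)

# Write the parsed records to S3 as Parquet with checkpointing (placeholder bucket).
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/events/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()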
Client: McKinsey, Dallas, TX
Duration: Aug 2022 – Mar 2023
Role: Big Data Engineer – AWS, PySpark, Python, Hive
Responsibilities:
Developed ETL data pipelines using Hadoop big data tools: HDFS, Hive, Presto, Apache NiFi, Sqoop, Spark, Elasticsearch, and Kafka
Enabled partitioning and bucketing for query optimization and used the ORC format for compression.
Optimized Spark applications to handle data volumes in the billions of records.
Developed a Spark application that reads tables from Hive and SQL Server, performs data validation, and reports the tables and records that have data issues to fix.
Responsible for designing data pipelines using ETL for effective data ingestion from existing data management platforms to enterprise Data Lake
Design and implement data pipelines using AWS services such as S3, Glue, and EMR
Created Airflow scheduling scripts and job flows in Python and automated jobs (see the DAG sketch after this list)
Developed data pipeline using Flume, Pig, Sqoop to ingest data and customer histories into HDFS for analysis
Designed and developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle
To meet specific business requirements, wrote UDFs in Scala and PySpark
Stayed up to date with the latest AWS data services and technologies and recommended new solutions to improve data engineering processes.
Developed a Python job that automatically places files into their respective target directories, creates Hive partitions in HDI for external tables, and then loads data from external tables into managed tables.
Developed Hive Python UDFs and Spark UDFs to explode arrays of structs into third normal form.
Created Hive views to transform data in external tables and place it in managed tables per business requirements.
Exported the data using Sqoop to RDBMS servers and processed that data for ETL operations
Built and maintained data warehouses and data lakes using AWS Redshift and Athena
Automated data processing and deployment using AWS Lambda and other serverless technologies
Developed and executed interface test scenarios and test scripts for complex business rules using available ETL tools
Developed Spark applications using Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns
Designed the ETL process by creating a high-level design document covering the logical data flows, source data extraction, database staging and extract creation, source archival, job scheduling, and error handling
Optimized PySpark code for better performance
Managed and supported enterprise data warehouse operations and big data advanced predictive application development using Cloudera & Hortonworks HDP
Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats like text, zip, XML, and JSON
Developed and designed a system to collect data from multiple portals using Kafka and then process it using Spark
Implemented MapReduce jobs in Hive by querying the available data
Created partitioned external tables in Hive, designed external and managed tables in Hive, and processed data to HDFS using Sqoop
Developed the Linux shell scripts for creating the reports from Hive data
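As a hedged illustration of the Airflow scheduling work above, the sketch below shows one way to define a daily DAG in Python; the DAG id, schedule, task ids, and script paths are illustrative assumptions.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

# Daily pipeline: run a Spark ingestion job, then a validation step.
with DAG(
    dag_id="daily_ingest_pipeline",            # hypothetical DAG id
    start_date=datetime(2022, 9, 1),
    schedule_interval="0 2 * * *",             # run every day at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="spark_ingest",
        bash_command="spark-submit /opt/jobs/ingest.py",    # placeholder path
    )
    validate = BashOperator(
        task_id="validate_counts",
        bash_command="python /opt/jobs/validate.py",        # placeholder path
    )
    ingest >> validate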
Environment: Hadoop, Spark, HDFS, Hive, Kedro Framework, AWS Glue, AWS, Big Data, Oozie, Sqoop, Kafka, PySpark, Python, Flume, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL, SQL
Client: DoorDash – San Francisco, CA
Duration: Sep 2021 – Jul 2022
Role: Sr Big Data Developer
Responsibilities:
Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala.
Developed ETL data pipelines using Hadoop big data tools – HDFS, Hive, Presto, Apache NiFi, Sqoop, Spark, Elasticsearch, Kafka.
Exported the data using Sqoop to RDBMS servers and processed that data for ETL operations.
Developed data pipeline using Flume, Pig, Sqoop to ingest data and customer histories into HDFS for analysis.
Designed and developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
Created EMR and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions.
Created a Virtual Data Lake using AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored on S3
Responsible for designing data pipelines using ETL for effective data ingestion from existing data management platforms to enterprise Data Lake.
Designed the ETL process by creating a high-level design document covering the logical data flows, source data extraction, database staging and extract creation, source archival, job scheduling, and error handling.
Developed and executed interface test scenarios and test scripts for complex business rules using available ETL tools.
Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
Developed Spark applications using Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
Developed code from scratch in Spark using Scala according to the technical requirements, and used both the Hive context and the SQL context of Spark for initial testing of the Spark job
Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
Set up Build Infrastructure with Jenkins and Subversion server in AWS.
Transferred data using Informatica tool from AWS S3 to AWS Redshift.
Created S3 buckets, managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
Worked with DevOps practices using AWS, Elastic Beanstalk, and Docker with Kubernetes.
Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and implemented the business rules in Spark/Scala to get the business logic in place.
Developed and designed a system to collect data from multiple portals using Kafka and then process it using Spark.
Triggered Lambda functions once data lands in S3 and sent notifications to teams using SNS (see the handler sketch after this list).
Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats like text, zip, XML, and JSON.
Created external tables with partitions using Hive, AWS Athena, and Redshift; designed external and managed tables in Hive and processed data to HDFS using Sqoop.
Developed the Linux shell scripts for creating the reports from Hive data.
Hands-on experience with data analytics services such as Athena, Glue Data Catalog & QuickSight.
Managed and supported enterprise data warehouse operations and big data advanced predictive application development using Cloudera & Hortonworks HDP.
Worked on Airflow on AWS as a job trigger, with code written in Hive Query Language and Scala, to read, backfill, and write data for particular time frames from Hive tables to HDFS locations.
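A minimal, hedged sketch of the S3-triggered Lambda/SNS pattern referenced above; the SNS topic ARN, environment variable, and bucket layout are hypothetical placeholders.

import json
import os

import boto3

sns = boto3.client("sns")
# Placeholder topic ARN, normally injected through the Lambda environment.
TOPIC_ARN = os.environ.get("TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:data-landing")

def lambda_handler(event, context):
    """Publish an SNS notification for every object created in the monitored bucket."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="New file landed in S3",
            Message=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"statusCode": 200}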
Environment: Hadoop, Spark, HDFS, Hive, Pig, HBase, Big Data, Apache Storm, Oozie, Sqoop, Kafka, Flume, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL, PySpark, SQL Server, AWS, Kubernetes, Docker, Python, AWS Redshift.
Client: AIG, New Jersey, NJ
Duration: Feb 2021 – Aug 2021
Role: Big Data Developer
Responsibilities:
Delivery experience on major Hadoop ecosystem components such as Hive, Spark, Kafka, Elasticsearch, and HBase, with monitoring through Cloudera Manager; worked on loading disparate data sets coming from different sources into the Hadoop environment using Spark.
Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables (see the sketch after this list).
Developed RESTful API services using Spring Boot to upload data from local to AWS S3, list S3 objects, and perform file manipulation operations.
Scheduled workflows through the Oozie engine to run multiple Hive and Pig jobs, and wrote shell scripts to automate the jobs in UNIX.
Used PySpark and Spark SQL for extracting, transforming, and loading the data according to the business requirements.
Developed UNIX scripts to create batch loads for bringing huge amounts of data from relational databases to the big data platform.
Developed Spark jobs to create DataFrames from the source systems and to process and analyze the data in those DataFrames based on business requirements.
Involved in creating Hive tables, loading them with data, and writing Hive queries, which invoke MapReduce jobs in the backend.
Developed Spark SQL scripts using PySpark to perform transformations and actions on DataFrames and Datasets for faster data processing.
Used AWS Data Pipeline to schedule an AWS EMR cluster to clean and process web server logs stored in Amazon S3 bucket.
Involved in enhancing the existing ETL data pipeline for better data migration with reduced data issues by using Apache Airflow.
Worked on monitoring, scheduling, and authoring Data Pipelines using Apache Airflow.
Utilized the Cloudera distribution of Apache Hadoop; monitored and debugged Spark jobs running on a Spark cluster using Cloudera Manager.
Fetched data and generated monthly reports, visualizing those reports using Tableau and Python.
Managed and supported enterprise data warehouse operations and big data advanced predictive application development using Cloudera.
Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near real-time analysis, and worked extensively on Hive to create, alter, and drop tables and write Hive queries.
Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig, and translated high-level design specs into simple ETL coding and mapping standards.
Created Reports with different Selection Criteria from Hive Tables on the data residing in Data Lake.
Extensively used ETL methodology to support data extraction, transformation, and loading using Hadoop.
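A hedged PySpark sketch of the Hive partitioning and bucketing pattern noted above; the database, table, and column names are illustrative assumptions rather than the client's actual schema.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-partition-bucket")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partition inserts into Hive tables.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df = spark.table("staging.transactions")    # assumed source table

# Write a managed Hive table partitioned by load date and bucketed by customer id.
(
    df.write.partitionBy("load_date")       # one partition directory per load_date
    .bucketBy(16, "customer_id")            # 16 buckets keyed on customer_id
    .sortBy("customer_id")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("analytics.transactions")  # assumed target table
)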
Environment: Hadoop, MapReduce, HDFS, Spark, PySpark, Java, Yarn, Hive 2.1, Sqoop, Cassandra, Oozie, Scala, Python, AWS, Flume, Kafka, Tableau, Linux, SQL Server, Shell Scripting, Apache Airflow, Cloudera.
Client: Cisin, India
Duration: Jun 2017 – Dec 2020
Role: Hadoop Developer
Responsibilities:
Understood client requirements and prepared design documents accordingly.
Loaded and transformed large sets of structured, semi-structured, and unstructured data. Imported and exported data between MySQL, HDFS, and NoSQL databases on a regular basis using Sqoop, and designed and developed Hive scripts to process data in batches for analysis.
Developed data pipelines using Sqoop, Pig and Hive to ingest customer data into HDFS to perform data analytics.
Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
Developed Sqoop scripts to handle change data capture, processing incremental records between newly arrived and existing data in RDBMS tables.
Implemented partitioning and bucketing in Hive as part of performance tuning, and built the workflow and coordination files using the Oozie framework to automate tasks.
Involved in developing data pipelines using Sqoop, Pig and Hive to ingest data into HDFS to perform data analytics.
Wrote Sqoop jobs to move data from various RDBMS systems into HDFS and vice versa.
Developed ETL pipelines to source data to Business intelligence teams to build visualizations.
Optimized Hive queries using various file formats such as Parquet, JSON, Avro, and CSV (see the sketch after this list).
Worked with Oozie workflow engine to schedule time-based jobs to perform multiple actions.
Involved in unit testing, interface testing, system testing and user acceptance testing of the workflow Tool.
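A minimal, hedged PySpark sketch of the file-format tuning above, converting raw CSV extracts into partitioned Parquet for faster Hive queries; the paths and partition column are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV extract (placeholder input path).
raw = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/raw/customers/")
)

# Rewrite as columnar Parquet, partitioned by an assumed ingest_date column,
# so Hive/Impala queries can prune partitions and read fewer bytes.
(
    raw.write.mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("hdfs:///data/curated/customers/")
)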
Environment: HDFS, Hive, Oozie, Cloudera Distribution with Hadoop (CDH4), MySQL, CentOS, Apache HBase, MapReduce, Hue, PIG, Sqoop, SQL, Windows, Linux.
Client: Fission Labs, India
Duration: Jan 2015 – May 2017
Role: SQL / Hadoop Developer
Responsibilities:
Worked with large amounts of data from various data sources and loaded it into the database.
Primary responsibilities included building scalable distributed data solutions using Hadoop ecosystem.
Imported and exported data using Sqoop from HDFS to relational database systems and vice versa.
Loaded datasets from Teradata to HDFS and Hive on a daily basis.
Developed and created database objects in Oracle and SQL Server, i.e., tables, indexes, stored procedures, views, functions, triggers, etc.
Wrote complex SQL queries to validate business scenarios and match rules.
Performed and conducted complex custom analytics as needed by clients.
Used Impala to query the Hadoop data stored in HDFS.
Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Teradata into HDFS using Sqoop.
Wrote stored procedures and packages per business requirements and scheduled jobs for health checks.
Developed complex queries to generate monthly and weekly reports and extract data for visualization in QlikView.
Designed and developed specific databases for collection, tracking and reporting of data.
Designed, coded, tested and debugged custom queries using Microsoft T-SQL and SQL Reporting Services.
Conducted research to collect and assemble data for databases, and was responsible for the design and development of relational databases for collecting data.
Wrote complex SQL queries and scripts to provide input to the reporting tools.
Environment: HDFS, Hive, Teradata, MapReduce, XML, JSON, Oracle, MySQL, Java, PL/SQL Developer, Stored procedures, Triggers, QlikView