Data Engineer Systems

Location: New Haven, CT
Posted: February 10, 2025

Resume:

Pavani Lakshmi Gadde

475-***-****

******************@*****.***

Sr. Data Engineer

Summary

Experienced and highly skilled Data Engineer with 9 years of experience designing, implementing, and optimizing complex data systems and pipelines. Experienced in building highly scalable, large-scale applications with cloud, big data, DevOps, and Spring Boot technologies, and in working in different environments such as Agile and Waterfall. Experience with multi-cloud, migration, and scalable application projects. Strong communicator with experience collaborating with data scientists, analysts, and business stakeholders to deliver tailored data solutions.

Professional Summary

•Built Spark data pipelines with various optimization techniques using Python and Scala (an illustrative sketch follows this list).

•Expert in ingesting data for incremental loads from various RDBMS tools using Apache Sqoop.

•Strong knowledge of Hadoop architecture and daemons such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode, and of MapReduce concepts.

•Experience in working with various cloud distributions like AWS, Azure and GCP.

•Experienced in data transfer between HDFS and RDBMS using tools like Sqoop, Talend and Spark.

•Developed scalable applications for real-time ingestions into various databases using Apache Kafka.

•Developed Pig Latin scripts and MapReduce jobs for large data transformations and loads.

•Experience in designing, developing, and maintaining data lake projects using different big data tool stacks.

•Experience in building Scala applications for loading data into NoSQL databases (MongoDB).

•Implemented various optimizing techniques in Hive and Spark scripts for data transformations.

•Expert in writing various scripts using shell and Python scripting.

•Experience in working with NoSQL Databases like HBase, Cassandra and MongoDB.

•Migrated data from different data sources like Oracle, MySQL, Teradata to Hive, HBase and HDFS.

•Experienced in building Jupyter notebooks using PySpark for extensive data analysis.

•Expertise in dealing with AWS services including S3, EC2, EMR, Glue, Athena, Aurora, SNS, and Crawler.

•Established Azure SQL databases, monitored and restored Azure SQL databases, and conducted migrations from Microsoft SQL Server to Azure SQL Database.

•Experience in working with various Hadoop distributions like Cloudera, Hortonworks, and MapR.

•Experience in ingesting and exporting data from Apache Kafka using Apache Spark Streaming.

•Implemented streaming applications to consume data from Event Hub and Pub/Sub.

•Experience in using different optimized file formats like Avro, Parquet, and SequenceFile.

•Experience in using Azure cloud tools like Azure Data Factory, Azure Data Lake, and Azure Synapse.

•Developed scalable applications using AWS tools like Redshift, DynamoDB.

•Worked on building pipelines using Snowflake for extensive data aggregations.

•Experience on GCP tools like BigQuery, Pub/Sub, Cloud SQL and Cloud functions.

•Built custom dashboards using Power BI for reporting purposes.

•Demonstrated mastery in Azure Data Factory (ADF), Integration Run Time (IR), and File System Data Ingestion, along with Relational Data Ingestion.

•Adept at real-time data processing, data modeling, and optimization of large-scale data systems for enhanced query performance.

•Experience in building continuous integration and deployments using Jenkins, Drone, Travis CI.

•Strong expertise in containerization and orchestration with Docker, Kubernetes, and CI/CD pipelines using Jenkins and Terraform.

•Expert in building containerized apps using tools like Docker, Kubernetes, and Terraform.

•Experience in building metrics dashboards and alerts using Grafana and Kibana.

•Involved in multiple deployments through Jenkins/CICD pipeline tools.

•Mastered implementing end-to-end (E2E) solutions on big data using the Hadoop framework.

•Architected and implemented a robust data lake on GCP, integrating BigQuery, Cloud Storage, and Dataproc for unified data storage and advanced analytics.

•Worked on containerization technologies like Docker and Kubernetes for scaling applications.

•Utilized BigQuery's capabilities to define and query views, and external tables in GCS buckets.

•Managed software configuration using GIT for efficient software development.

•Experience in various integration tools like Talend, NiFi for ingesting batch and streaming data.

•Experience in migrating data warehouse applications into Snowflake.

•Leveraged Jira for effective ticketing and tracking of issue progression.

•Proficiently navigated SDLC methodologies including agile, scrum, and waterfall models.
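
The bullets above describe Spark pipeline work in general terms; the following is a minimal, hypothetical PySpark sketch of such a pipeline. Bucket paths, table names, and column names are illustrative assumptions, not taken from any actual project.

```python
# Minimal PySpark sketch: read raw CSV files, apply basic cleanup, and write a
# partitioned Parquet dataset. Paths, table, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("sales-ingest-sketch")
    .config("spark.sql.shuffle.partitions", "200")  # tune to the actual data volume
    .getOrCreate()
)

# Incremental raw files landed by an upstream ingestion job (path is an assumption)
raw = spark.read.option("header", True).csv("s3://example-bucket/raw/sales/")

cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("amount").isNotNull())
)

# Repartition on the write key to avoid small files, then write partitioned Parquet
(
    cleaned.repartition("event_date")
           .write.mode("append")
           .partitionBy("event_date")
           .parquet("s3://example-bucket/curated/sales/")
)
```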

Technical Skills

•Bigdata Ecosystem: HDFS, MapReduce, YARN, Hive, HBase, Impala, Sqoop, Oozie, Tez, Spark, Pig.

•Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP), Databricks, ADF, ADLS, Snowflake, S3, EC2, BigQuery

•Databases (SQL/NoSQL): MySQL, PostgreSQL, Oracle, HBase

•Programming Languages: Scala, Python, Spark, SQL

•Tools: Tableau, Power BI, Terraform

•Containerization & Orchestration: Docker, Kubernetes

•Job Scheduler Tools: Airflow

•Version Control: Bitbucket, Git

•IDE: IntelliJ, PyCharm, Visual Studio

•Methodologies: Agile, Waterfall

Experience Summary

Client: American Express
Location: Phoenix, Arizona
Designation: Sr. Data Engineer
Duration: December 2022 – present

Responsibilities:

•Proficient in migrating SQL databases to AWS Data Lake, AWS SQL Database, Databricks, and AWS SQL Data Warehouse, and in controlling database access for on-premises database migration to AWS Data Lake using AWS Data Factory.

•Developed intricate SQL transformations to preprocess and enrich data as it flows through real-time pipelines, ensuring data quality and relevance.

•Built serverless data processing solutions using AWS Functions, automating data transformation and integration tasks.

•Implemented and optimized real-time data streaming solutions using tools like Apache Kafka and Spark to support immediate analytics (see the sketch after this list).

•Developed and deployed real-time data processing solutions using AWS Stream Analytics and AWS Event Hubs to handle high-velocity data streams efficiently.

•Used Spark to process data before ingesting it into HBase; both batch and real-time Spark jobs were created using Scala.

•Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as AWS SQL, Blob Storage, and AWS SQL Data Warehouse.

•Utilized AWS Key Vault as a central repository for maintaining secrets and referenced them in AWS Data Factory and Databricks notebooks.

•Defined and implemented the data architecture and design decisions, ensuring that data flows efficiently between systems and is accessible to business stakeholders and data scientists.

•Orchestrated Oozie integration with the Hadoop stack, facilitating various job types including MapReduce, Pig, Hive, Sqoop, Java programs, and shell scripts.

•Used HBase as the database to store application data, as HBase offers high scalability, a distributed column-oriented NoSQL model, and real-time data querying, among other features.

•Collaborated closely with application architects and DevOps teams.

•Experience migrating SQL databases to AWS Data Lake, AWS Data Lake Analytics, AWS SQL Database, Databricks, and AWS SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to the AWS Data Lake store using AWS Data Factory.

•Developed JSON scripts for deploying pipelines in AWS Data Factory (ADF) that process data using the SQL activity.

•Implemented CI/CD pipelines for data engineering projects using AWS DevOps, enabling automated testing, deployment, and infrastructure-as-code (IaC) management.

•Developed Python-based ETL (Extract, Transform, Load) processes to ingest, process, and transform data from diverse sources into real-time analytics-ready formats.

•Transformed Hadoop jobs for execution on the EMR cluster after meticulous cluster configuration according to data size.

•Designed and maintained data warehousing solutions, utilizing SQL-based databases like PostgreSQL, MySQL, or Google BigQuery for efficient data storage and retrieval.

•Responsible for resolving the issues and troubleshooting related to performance of Hadoop cluster.

•Applied Agile principles to the development of real-time analytics solutions, allowing for quick adaptation to changing data streams and business needs.

•Work closely with data scientists, analysts, business intelligence teams, and product managers to understand business requirements and translate them into effective data solutions.

•Perform root cause analysis of any failures in the data pipeline or infrastructure and implement preventive measures to avoid future occurrences.
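
As a rough illustration of the Kafka-to-Spark real-time pipeline with SQL transformations described above, here is a hedged PySpark Structured Streaming sketch; the topic name, event schema, and S3 paths are assumptions, and the job presumes the spark-sql-kafka connector is available on the cluster.

```python
# Sketch: consume JSON events from Kafka with Spark Structured Streaming, apply a
# SQL transformation, and write to a Parquet sink. Topic, schema, and paths are
# hypothetical; assumes the spark-sql-kafka connector package is on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField, StructType,
                               TimestampType)

spark = SparkSession.builder.appName("txn-stream-sketch").getOrCreate()

schema = StructType([
    StructField("txn_id", StringType()),
    StructField("merchant", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Enrich the stream with a SQL transformation before writing it out
events.createOrReplaceTempView("txns")
enriched = spark.sql("""
    SELECT txn_id, merchant, amount,
           CASE WHEN amount > 10000 THEN 'high' ELSE 'normal' END AS risk_band,
           event_ts
    FROM txns
    WHERE amount IS NOT NULL
""")

query = (
    enriched.writeStream.format("parquet")
    .option("path", "s3://example-bucket/curated/txns/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/txns/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```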

Environment: HDFS, Hive, Spark, Oozie, Python, Scala, Shell, Talend, Snowflake, AWS, Databricks, Grafana, Jenkins, AWS Data Lake, AWS SQL

Client: Optum
Location: Remote
Designation: Sr. Data Engineer
Duration: June 2021 – November 2022

Responsibilities:

•Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.

•Worked on Talend integrations to ingest data from multiple sources into Data Lake.

•Extensive experience in using Terraform for infrastructure as code (IaC), automating the provisioning of GCP resources, and maintaining infrastructure-as-code repositories.

•Developed various scripting functionality using shell scripts and Python.

•Experience in automating end-to-end Hadoop jobs using Oozie in an optimized way.

•Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.

•Implemented cloud integrations to GCP and Azure for bi-directional flow setups for data migrations.

•Developed an MVP exporting data to Snowflake to understand the usage and benefits of migration.

•Designed and implemented scalable Terraform configurations that dynamically adjust resource allocation to meet real-time data processing demands.

•Implemented code coverage and integrations using Sonar to improve code testability.

•Developed APIs for quick real-time lookup on top of HBase tables for transactional data.

•Built Jupyter notebooks using PySpark for extensive data analysis and exploration.

•Developed automated file transfer mechanism using python from MFT, SFTP to HDFS.

•Implemented robust Terraform state management strategies, including remote state storage and locking mechanisms, to ensure data integrity and collaboration.

•Pushed application logs and data stream logs to the Kibana server for monitoring and alerting purposes.

•Architect, deploy, and manage data solutions on cloud platforms (AWS, GCP, Azure) using services like S3, EC2, Redshift, BigQuery, and Dataflow.

•Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.

•Developed custom MapReduce code in Java for data cleansing and crunching for further usage.

•Developed Jenkins and Drone pipelines for continuous integration and deployment purpose.

•Extensive experience in using PySpark for large-scale data processing, transformation, and analysis, leveraging its powerful distributed computing capabilities.

•Integrated Terraform with version control systems like Git for efficient collaboration, code review, and automated deployments.

•Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.

•Implemented data quality checks, validation rules, and error handling mechanisms within PySpark workflows to ensure data accuracy and reliability.

•Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a minimal DAG sketch follows this list).

•Imported and exported data into HDFS and Hive using Sqoop.

•Maintain comprehensive documentation of data models, data pipelines, architecture, and processes to ensure transparency and facilitate future development.

•Demonstrated expertise in optimizing PySpark jobs for performance, including tuning configurations, parallelism, and query optimization for big data workloads.

•Built SFTP integrations using various VMware solutions for external vendor onboarding.
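
A minimal Airflow DAG sketch of the GCP ETL pattern referenced above; the DAG id, bucket, dataset, and table names are hypothetical, and the GCS-to-BigQuery operator assumes the apache-airflow-providers-google package is installed.

```python
# Sketch: load daily CSV files from GCS into BigQuery, then run a Python
# validation step. All names are placeholders, not from an actual project.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)


def validate_load(**context):
    # Placeholder for row-count or data-quality checks on the loaded partition
    print("validating load for run", context["ds"])


with DAG(
    dag_id="gcs_to_bq_etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_sales_csv",
        bucket="example-bucket",
        source_objects=["raw/sales/{{ ds }}/*.csv"],
        destination_project_dataset_table="example_project.analytics.sales",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",
    )

    validate = PythonOperator(task_id="validate_load", python_callable=validate_load)

    load_to_bq >> validate
```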

Environment: HDFS, Hive, Spark, GCP, Oozie, Python, Scala, Shell, Talend, Snowflake, Azure, Dataproc, Grafana, Jenkins, Airflow, Terraform, PySpark, BigQuery

Client: Dollar General
Location: Dickson, Tennessee
Designation: Sr. Data Engineer
Duration: April 2020 – May 2021

Responsibilities:

•Experience in developing scalable real-time applications for ingesting clickstream data using Kafka Streams and Spark Streaming.

•Worked on Talend integrations to ingest data from multiple sources into Data Lake.

•Experience in migrating existing legacy applications into optimized data pipelines using Spark with Scala and Python, supporting testability and observability.

•Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

•Pushed application logs and data stream logs to the Kibana server for monitoring and alerting purposes.

•Developed optimized and tuned ETL operations in Hive and Spark scripts using techniques such as partitioning, bucketing, vectorization, serialization, and configuring memory and the number of executors (see the sketch after this list).

•Experience with Python modules for regular expressions, collections, dates and times, unit testing, and load testing. Hands-on experience using NoSQL databases like MongoDB, Cassandra, and Redis, and relational databases like Oracle, SQLite, PostgreSQL, and MySQL.

•Exposure to Data Lake Implementation and developed Data pipelines and applied business logic utilizing Apache Spark.

•Engineered data pipelines employing Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.

•Experience designing solutions with Azure tools like Azure Data Factory, Azure Data Lake, SQL DWH, Azure SQL, Azure SQL Data Warehouse, and Azure Functions.

•Worked on migrating data from HDFS to Azure HDInsight and Azure Databricks.

•Orchestrated Oozie integration with the Hadoop stack, facilitating various job types including MapReduce, Pig, Hive, Sqoop, Java programs, and shell scripts.

•Migrated existing processes and data from our on-premises SQL Server and other environments to Azure Data Lake.

•Implemented multiple modules in microservices to expose data through RESTful APIs.

•Developed Jenkins pipelines for continuous integration and deployment purpose.

•Experience in analyzing the performance of Snowflake datasets.
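
A hedged sketch of the partitioned and bucketed Hive ORC write described above; table, column, and path names are hypothetical, and the executor settings shown inline would normally be supplied via spark-submit and tuned to the actual cluster.

```python
# Sketch of a tuned Spark ETL job writing a partitioned, bucketed ORC table.
# All names and settings are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders-etl-sketch")
    .config("spark.sql.shuffle.partitions", "400")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.instances", "10")
    .enableHiveSupport()
    .getOrCreate()
)

orders = spark.read.parquet("/data/raw/orders/")

daily = (
    orders.withColumn("order_date", F.to_date("order_ts"))
          .groupBy("order_date", "store_id")
          .agg(F.sum("amount").alias("total_amount"),
               F.countDistinct("customer_id").alias("customers"))
)

# Partition by date and bucket by store_id to speed up downstream filters and joins
(
    daily.write.mode("overwrite")
         .format("orc")
         .partitionBy("order_date")
         .bucketBy(16, "store_id")
         .sortBy("store_id")
         .saveAsTable("analytics.daily_store_sales")
)
```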

Environment: PySpark, Kafka, Spark, Sqoop, Hive, Azure, Databricks, Grafana, Jenkins, Azure Data Lake, Azure SQL, Python, Shell, Microservices, RESTful APIs

Client: Travelers
Location: Hartford, Connecticut
Designation: Data Engineer
Duration: April 2019 – March 2020

Responsibilities:

•Developed and tuned a Spark Streaming application using Scala for processing data from Kafka.

•Imported batch data using Sqoop to load data to HDFS on a regular basis from various sources.

•Create, manage, and optimize databases (relational and NoSQL) to store structured and unstructured data. This could involve using tools like MySQL, PostgreSQL, MongoDB, Cassandra, etc.

•Experience in working on cloud environments such as OpenShift and AWS.

•Imported trading and derivatives data into the Hadoop Distributed File System using ecosystem components MapReduce, Pig, Hive, and Sqoop.

•Responsible for writing Hive queries and Pig scripts for data processing.

•Involved in the development of Python APIs to dump the array structures in the processor at the failure point for debugging.

•Using Chef, deployed and configured Elasticsearch, Logstash, and Kibana (ELK) for log analytics, full-text search, and application monitoring, in integration with AWS Lambda and CloudWatch.

•Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.

•Ran Sqoop jobs for importing data from Oracle and other databases.

•Created Terraform scripts for EC2 instances, Elastic Load Balancers, and S3 buckets.

•Implemented Terraform to manage the AWS infrastructure and managed servers using configuration management tools like Chef and Ansible.

•Optimized scripts using ILLUSTRATE and EXPLAIN and used parameterized Pig scripts.

•Built an AWS-hosted Redis server for storing the data and performed defect analysis and interaction with business users during UAT/SIT.

•Developed RESTful microservices using Django and deployed them on AWS using EBS and EC2.

•Created Athena and Glue tables on existing CSV data using AWS Glue crawlers (see the sketch after this list) and extensively used AWS Lambda, Kinesis, and CloudFront for real-time data collection.

•Implemented various optimization techniques for Spark applications for improving performance.

•Involved in configuring HA, addressing Kerberos security issues, and performing NameNode failure restoration activities from time to time as part of maintaining zero downtime.

•Managed and prioritized the product backlog, focusing on delivering high-value features early and adjusting priorities based on real-time feedback.

•Used SVN as version control to check in the code, created branches and tagged the code in SVN.
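
A small boto3 sketch of creating a Glue crawler over CSV data in S3 and then querying the cataloged table from Athena, as referenced above; the bucket, IAM role, database, and table names are placeholders.

```python
# Sketch: catalog CSV data in S3 with a Glue crawler, then query it via Athena.
# Names and the role ARN are hypothetical; assumes boto3 credentials are configured.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="claims-csv-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="claims_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/claims/"}]},
)

glue.start_crawler(Name="claims-csv-crawler")

# Once the crawler has populated the Data Catalog, the table can be queried in Athena
athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString="SELECT claim_id, amount FROM claims LIMIT 10",
    QueryExecutionContext={"Database": "claims_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```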

Environment: HDP, HDFS, Hive, Spark, Oozie, HBase, AWS, Scala, Python, Bash, Kafka, Java, Jenkins, Spark Streaming, Tez, AWS Athena, Glue

Client: Excellent Web World
Location: India
Designation: Data Engineer
Duration: December 2017 – February 2019

Responsibilities:

•Experienced in building streaming jobs to process terabytes of XML-format data using Flume.

•Worked on batch data ingestion using Sqoop from various sources like Teradata, Oracle.

•Worked on various Pig Latin scripts for data transformations and cleansing.

•Proficient in containerization tools like Docker and container orchestration platforms like Kubernetes, enabling the deployment and scaling of real-time data applications.

•Collaborated with product owners and stakeholders to refine user stories, ensuring clear acceptance criteria and a shared understanding of project requirements.

•Used Kafka, a publish-subscribe messaging system, creating topics with producers and consumers to ingest data into the application for Spark to process, and created Kafka topics for application and system logs (a minimal sketch follows this list).

•Enhanced scripts of existing Python modules and developed APIs for loading processed data to HBase tables.

•Implemented multiple Impala scripts to expose data to Tableau.

•Expertise in working with data serialization formats such as JSON, Avro, and Protobuf for efficient data transfer in real-time systems.

•Involved in developing the Pig scripts to process the data coming from different sources.

•Developed advanced PL/SQL packages, procedures, triggers, functions, indexes, and collections to implement business logic using SQL Developer. Generated server-side PL/SQL scripts for data manipulation and validation, and materialized views for remote instances.

•Optimized ETL operations in Hadoop and Spark scripts, slashing data processing time by 40% through techniques such as partitioning, bucketing, serialization, and executor configuration.

•Produced various data visualizations using Python and Tableau.

•Demonstrated expertise in data modeling, encompassing logical and physical database design, normalization, and establishment of referential integrity constraints.

•Worked on data cleaning using Pig scripts and storing in HDFS.
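
A minimal publish-subscribe sketch of the Kafka usage described above, written against the kafka-python client (an assumption; any Kafka client would do); the broker address and topic name are hypothetical.

```python
# Sketch of a Kafka publish-subscribe flow using kafka-python. Broker address,
# topic name, and message contents are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = "broker:9092"

# Producer: publish application log events to a topic
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("app-logs", {"level": "INFO", "msg": "order placed", "order_id": 42})
producer.flush()

# Consumer: read the same topic (Spark or another service would do this downstream)
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers=BROKERS,
    group_id="log-readers",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.topic, record.value)
    break  # sketch: stop after one message
```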

Environment: Cloudera, HDFS, Hive, MapReduce, Pig, Sqoop, Oracle, Python, Oozie, Impala, Tableau

Education

•Bachelor of Technology, Jawaharlal Institute of Technology and Science (2016)


