Priya Chellagundla
Data Engineer
Email: *****************@*****.***
Mobile: +1-832-***-****
8+ years of professional experience in designing, developing, and optimizing end-to-end data pipelines across the Big Data ecosystem.
Expertise in ETL/ELT workflow design — data ingestion, cleansing, transformation, and orchestration using PySpark, Airflow, and AWS Glue.
Strong experience in data integration, migration, and modeling across structured and unstructured data sources.
Proficient in building scalable batch and streaming data pipelines using Spark (PySpark/Scala), Kafka, and Hive.
Extensive experience implementing data pipelines on AWS using EC2, S3, Glue, Lambda, Redshift, Step Functions, and CloudWatch for orchestration and monitoring.
Optimized Informatica mappings and sessions by implementing partitioning, pushdown optimization, and reusable transformations to improve performance and maintainability.
Created and maintained Informatica workflows and scheduling jobs using Workflow Manager for automated data integration and monitoring.
Experience in Python and Spark APIs for developing high-performance ETL workflows, performing transformations, and automating data validation.
Developed data quality and validation frameworks using PySpark, SQL, and Pytest to ensure pipeline accuracy and consistency across environments.
Implemented data partitioning, caching, and broadcast join optimization in Spark to reduce job runtime and improve cluster efficiency (illustrated in the sketch following this summary).
Experienced with CI/CD workflows in Databricks and GitHub Actions for automated code deployment, testing, and version control.
Hands-on experience with workflow schedulers and monitoring tools including Airflow, Oozie, and Datadog for job orchestration and alerting.
Collaborated with Data Scientists to deliver model-ready datasets, integrating ML workflows with AWS SageMaker and Databricks environments.
Proficient in SQL, Spark SQL, and DataFrame APIs for data preparation, aggregation, and analytical transformations in production-grade pipelines.
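Illustrative only: a minimal PySpark sketch of the partitioning, caching, and broadcast-join tuning referenced above; the table names, columns, and S3 paths are hypothetical, not drawn from a specific project.

# Hypothetical PySpark tuning sketch; tables, columns, and paths are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("tuning-sketch")
         .config("spark.sql.shuffle.partitions", "200")  # right-size shuffle parallelism
         .getOrCreate())

orders = spark.read.parquet("s3://bucket/orders/")      # large fact table (assumed)
stores = spark.read.parquet("s3://bucket/dim_store/")   # small dimension table (assumed)

# Broadcast the small dimension to avoid a shuffle-heavy sort-merge join.
enriched = orders.join(F.broadcast(stores), "store_id")

# Cache a reused intermediate result instead of recomputing it per action.
daily = (enriched.groupBy("order_date", "region")
         .agg(F.sum("amount").alias("daily_amount"))
         .cache())

# Partition output by date so downstream jobs can prune unneeded files.
daily.write.mode("overwrite").partitionBy("order_date").parquet("s3://bucket/curated/daily_sales/")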
TECHNICAL SKILLS:
Big Data Ecosystem
Spark, Kafka, Kafka Connect, Hive, Airflow, HBase
Hadoop Distributions
Apache Hadoop (HDFS, MapReduce, YARN), Cloudera CDP
Cloud Environment
Amazon Web Services (AWS), Azure, GCP
AWS
EC2, S3, EMR, Redshift, Glue, Lambda, Step Functions, Kinesis, SageMaker, CloudWatch, CloudFormation, IAM, VPC, EBS
Databases
MySQL, PostgreSQL, Amazon Redshift, DynamoDB, HBase, Hive, Oracle
Operating Systems
Linux, Unix, Windows
Software/Tools
Databricks, Jupyter Notebook, PyCharm, Visual Studio Code, IntelliJ, Jenkins, Datadog, AWS Glue, Docker, Postman, Hue, SAP
Reporting Tools/ETL Tools
Tableau, Power BI, AWS QuickSight, AWS Glue, Excel (Pivot & Power Query)
Programming Languages
Python (Pandas, NumPy, Scikit-Learn, PySpark), SQL (T-SQL, PL/SQL), Scala, Shell Scripting, C
Version Control
Git, Bitbucket
Development Tools
Eclipse, NetBeans, Microsoft Office Suite (Word, Excel, PowerPoint, Access)
PROFESSIONAL EXPERIENCE:
Toyota – Dallas, TX    Aug 2023 – Oct 2025
Sr. Data Engineer
Responsibilities:
Design and develop data pipelines using ETL (Extract, Transform, Load) processes to move and transform data between multiple systems.
Build scalable pipelines using PySpark and Spark SQL, applying advanced transformations and improving overall functionality and efficiency.
Collaborate with Data Scientists to understand data and model requirements, ensuring the pipeline supports analytical and forecasting needs.
Create and manage Databricks job workflows aligned with overall data pipeline design and automation standards.
Develop automated ETL jobs to load aggregated and transformed data from Databricks into Snowflake for downstream analytics and forecasting, leveraging Snowflake connectors for optimized data transfer.
Monitor data pipelines in both development and production environments using Databricks and AWS CloudWatch.
Investigate and resolve data pipeline failures by analyzing Databricks and CloudWatch logs and tracing issues from source to downstream systems.
Enhance pipelines to generate regional and national forecasts, supporting business-level insights.
Optimize pipeline performance by improving code efficiency, tuning Spark configurations, and reducing runtime using job clusters.
Develop and maintain Databricks notebooks for testing and validating code changes prior to deployment.
Configure Databricks to automatically sync the latest code from GitHub, ensuring production pipelines always run on the most recent version.
Leverage AWS S3, CloudWatch, and SageMaker for model registry, training, and version control of multiple series-based models.
Analyze and compare different model versions in SageMaker to validate performance metrics and ensure consistent accuracy.
Design Datadog alerts, dashboards, and monitors to track pipeline performance, detect failures, and visualize system health.
Integrate Datadog dashboards with data engineering and machine learning repositories for real-time monitoring.
Document all pipeline processes, tasks, and enhancements in Atlassian Confluence for team reference and knowledge sharing.
Manage and track JIRA tasks related to pipeline development, bug fixes, and enhancements to ensure timely delivery.
Implement Apache Airflow for authoring, scheduling, and monitoring data pipelines (a DAG sketch follows this list).
Schedule Airflow DAGs to run multiple Hive jobs, which execute independently based on time and data availability.
Follow Agile methodology, participating in daily Scrum meetings, sprint planning, showcases, and retrospectives.
Utilize Databricks Copilot to summarize notebooks, document pipeline logic, and assist in identifying potential issues or improvements in existing transformation scripts.
Participate in daily standups, biweekly sprint reviews, PR reviews and technical demos to communicate project progress and resolve blockers.
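Illustrative only: a minimal Airflow DAG sketch of the time-based scheduling described above; the DAG id, schedule, and task commands are assumptions for illustration, not the production configuration.

# Hypothetical Airflow DAG illustrating time-based scheduling of Spark/Hive steps.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="daily_forecast_pipeline",          # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",             # daily at 06:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw",
        bash_command="spark-submit ingest.py --date {{ ds }}",   # assumed script
    )
    transform = BashOperator(
        task_id="transform_hive",
        # connection details omitted; script name and hivevar are assumed
        bash_command="beeline -f transform.hql --hivevar run_date={{ ds }}",
    )
    ingest >> transform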
Environment: Python, PySpark, AWS (EC2, S3, EMR, RDS, Redshift, Glue, Lambda, CloudWatch, API Gateway), Databricks, Airflow, SageMaker, Shell Scripting, GitHub Actions, Docker, Datadog, Power BI
United Airlines – Houston, TX    Dec 2022 – Jul 2023
Data Engineer
Responsibilities:
Designed and deployed large-scale data pipelines and applications on AWS stack ensuring high availability and scalability.
Developed Spark applications in PySpark and Spark SQL for data extraction, transformation, and aggregation.
Built and optimized AWS Glue ETL jobs for schema mapping and transformations.
Integrated Informatica with AWS and Snowflake environments for hybrid ETL workflows, enabling seamless data movement between on-premises and cloud data platforms.
Implemented real-time streaming pipelines with Kafka, Spark, and Hive.
Modernized legacy SQL scripts into optimized Scala Spark workflows.
Designed data ingestion and processing frameworks on AWS EMR, Redshift, and S3.
Automated data validation and transformation workflows using AWS Lambda and DynamoDB Streams.
Created PySpark pipelines to convert raw Avro data into optimized ORC format (see the sketch after this list).
Implemented AWS CloudWatch monitoring and alerts to track pipeline performance, identify failures, and ensure system reliability.
Developed event-driven Lambda functions to automate triggers and actions across multiple AWS services, reducing manual intervention and latency.
Integrated CI/CD pipelines using GitHub Actions and AWS CodePipeline to automate data pipeline deployment and testing.
Optimized PySpark and Glue jobs through efficient partitioning, caching, and broadcast joins to reduce job runtime and cluster cost.
Collaborated with Data Science teams to operationalize ML scoring workflows using SageMaker and PySpark within production data pipelines.
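Illustrative only: a minimal sketch of the Avro-to-ORC conversion pattern noted above, assuming the spark-avro package is available on the cluster; paths, key column, and partition column are hypothetical.

# Hypothetical PySpark job converting raw Avro to partitioned ORC.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-to-orc")
         # spark-avro must be on the classpath (e.g. --packages org.apache.spark:spark-avro_2.12:3.4.1)
         .getOrCreate())

raw = spark.read.format("avro").load("s3://bucket/raw/events/")   # assumed path

curated = raw.dropDuplicates(["event_id"])        # assumed key column
(curated.write
    .mode("overwrite")
    .partitionBy("event_date")                    # assumed partition column
    .orc("s3://bucket/curated/events_orc/"))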
Environment: Python, PySpark, AWS (EC2, S3, EMR, RDS, Redshift, Glue, Lambda, CloudWatch, API Gateway), Airflow, Jenkins, Docker
KPMG    June 2020 – June 2021
Software Engineer
Responsibilities:
Built ETL pipelines to move data from flat files, Excel workbooks, and Oracle databases into enterprise data warehouses.
Managed large-scale datasets in HDFS and performed complex data transformations using PySpark.
Designed and queried analytical Hive tables to support business reporting and data analysis.
Involved in data migration using SQL, Azure SQL, Kafka, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
Implemented streaming data ingestion using Kafka and Spark Structured Streaming for near real-time workflows (see the sketch after this list).
Used Snowflake as a central data warehouse for curated datasets, enabling fast analytical queries and integration with BI tools.
Developed automated ETL jobs to load aggregated and transformed data into Snowflake for downstream analytics and reporting.
Leveraged AWS S3 and EMR for scalable data storage, distributed processing, and job orchestration.
Optimized Spark jobs using partitioning, caching, and configuration tuning for better performance.
Scheduled and monitored daily ETL workflows to ensure on-time delivery and data freshness.
Developed SQL scripts for data validation, reconciliation, and reporting to ensure data quality.
Automated lightweight data validation and metadata updates using AWS Lambda for operational efficiency.
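Illustrative only: a minimal Spark Structured Streaming sketch of the Kafka ingestion noted above; the broker address, topic, schema, and paths are assumptions for illustration.

# Hypothetical near-real-time ingestion: Kafka -> Spark Structured Streaming -> S3.
# Requires the spark-sql-kafka package on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([                      # assumed message schema
    StructField("txn_id", StringType()),
    StructField("account", StringType()),
    StructField("amount", DoubleType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "transactions")                # assumed topic
          .load())

parsed = (events.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("d"))
          .select("d.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://bucket/stream/transactions/")
         .option("checkpointLocation", "s3://bucket/checkpoints/transactions/")
         .trigger(processingTime="1 minute")
         .start())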
Environment: Python, PySpark, Hadoop (HDFS, Sqoop), Hive, SQL, Kafka, AWS (S3, EMR, Lambda), Oracle, Azure, Snowflake, Power BI
DXC Technology May 2016 – May 2020
Associate Professional Programmer Analyst
Responsibilities:
Designed and maintained ETL workflows to extract, transform, and load data from multiple sources into relational databases.
Wrote complex SQL queries, stored procedures, and views for data extraction, cleansing, and reporting.
Performed data validation and quality checks to ensure accuracy and consistency across systems.
Automated routine data-processing tasks using Python scripts for file handling, XML parsing, and data formatting (see the sketch after this list).
Developed and optimized database schemas (MySQL) with indexing and normalization for better performance.
Created dashboards and ad-hoc data extracts to support business analysis and decision-making.
Collaborated with analysts and developers to troubleshoot and resolve data pipeline or query performance issues.
Supported migration and integration of datasets between legacy systems and new data platforms.
Assisted in data documentation and establishing standard SQL coding practices across teams.
Built and maintained data pipelines to extract, transform, and load (ETL) data from various sources into SAP systems and data warehouses.
Designed and developed data models tailored to business needs, simplifying analysis of SAP data.
Optimized SAP data processes to ensure queries and reports run efficiently.
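Illustrative only: a minimal Python sketch of the file-handling and XML-parsing automation noted above; the folder layout, tag names, and output columns are assumptions, not from an actual engagement.

# Hypothetical script: parse incoming XML order files and flatten them to CSV.
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

INBOX = Path("/data/inbox")                    # assumed drop folder
OUTPUT = Path("/data/processed/orders.csv")    # assumed output file

rows = []
for xml_file in INBOX.glob("*.xml"):
    root = ET.parse(xml_file).getroot()
    for order in root.findall("order"):        # assumed tag structure
        rows.append({
            "order_id": order.findtext("id", default=""),
            "customer": order.findtext("customer", default=""),
            "amount": order.findtext("amount", default="0"),
        })

with OUTPUT.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "customer", "amount"])
    writer.writeheader()
    writer.writerows(rows)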
Environment: SAP, SQL (MySQL, Spark SQL), Python, Linux, ETL Tools, Apache Server
Education:
Master’s in Computer Science Aug 2021 – Dec 2022
Lamar University, Beaumont, TX