Priya Chellagundla
Data Engineer
Email: *****************@*****.***
Mobile: +1-832-***-****
8+ years of professional experience in designing, developing, and optimizing end-to-end data pipelines across the Big Data ecosystem.
Expertise in ETL/ELT workflow design — data ingestion, cleansing, transformation, and orchestration using PySpark, Airflow, and AWS Glue.
Strong experience in data integration, migration, and modeling across structured and unstructured data sources.
Proficient in building scalable batch and streaming data pipelines using Spark (PySpark/Scala), Kafka, and Hive.
Extensive experience implementing data pipelines on AWS using EC2, S3, Glue, Lambda, Redshift, Step Functions, and CloudWatch for orchestration and monitoring.
Optimized Informatica mappings and sessions by implementing partitioning, pushdown optimization, and reusable transformations to improve performance and maintainability.
Created and maintained Informatica workflows and scheduling jobs using Workflow Manager for automated data integration and monitoring.
Experience in Python and Spark APIs for developing high-performance ETL workflows, performing transformations, and automating data validation.
Developed data quality and validation frameworks using PySpark, SQL, and Pytest to ensure pipeline accuracy and consistency across environments.
Implemented data partitioning, caching, and broadcast join optimization in Spark to reduce job runtime and improve cluster efficiency (illustrated in the sketch following this summary).
Experienced with CI/CD workflows in Databricks and GitHub Actions for automated code deployment, testing, and version control.
Hands-on experience with workflow schedulers and monitoring tools including Airflow, Oozie, and Datadog for job orchestration and alerting.
Collaborated with Data Scientists to deliver model-ready datasets, integrating ML workflows with AWS SageMaker and Databricks environments.
Proficient in SQL, Spark SQL, and DataFrame APIs for data preparation, aggregation, and analytical transformations in production-grade pipelines.
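Illustrative only: a minimal PySpark sketch of the partitioning, caching, and broadcast-join tuning referenced above; the table names, columns, and S3 paths are hypothetical, not drawn from a specific project.

# Hypothetical PySpark tuning sketch; tables, columns, and paths are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("tuning-sketch")
         .config("spark.sql.shuffle.partitions", "200")  # right-size shuffle parallelism
         .getOrCreate())

orders = spark.read.parquet("s3://bucket/orders/")      # large fact table (assumed)
stores = spark.read.parquet("s3://bucket/dim_store/")   # small dimension table (assumed)

# Broadcast the small dimension to avoid a shuffle-heavy sort-merge join.
enriched = orders.join(F.broadcast(stores), "store_id")

# Cache a reused intermediate result instead of recomputing it per action.
daily = (enriched.groupBy("order_date", "region")
         .agg(F.sum("amount").alias("daily_amount"))
         .cache())

# Partition output by date so downstream jobs can prune unneeded files.
daily.write.mode("overwrite").partitionBy("order_date").parquet("s3://bucket/curated/daily_sales/")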
TECHNICAL SKILLS:
Big Data Ecosystem
Spark, Kafka, Kafka Connect, Hive, Airflow, HBase
Hadoop Distributions
Apache Hadoop (HDFS, MapReduce, YARN), Cloudera CDP
Cloud Environment
Amazon Web Services (AWS), Azure, GCP
AWS
EC2, S3, EMR, Redshift, Glue, Lambda, Step Functions, Kinesis, SageMaker, CloudWatch, CloudFormation, IAM, VPC, EBS
Databases
MySQL, PostgreSQL, Amazon Redshift, DynamoDB, HBase, Hive, Oracle
Operating Systems
Linux, Unix, Windows
Software/Tools
Databricks, Jupyter Notebook, PyCharm, Visual Studio Code, IntelliJ, Jenkins, Datadog, AWS Glue, Docker, Postman, Hue, SAP
Reporting Tools/ETL Tools
Tableau, Power BI, AWS QuickSight, AWS Glue, Excel (Pivot & Power Query)
Programming Languages
Python (Pandas, NumPy, Scikit-Learn, PySpark), SQL (T-SQL, PL/SQL), Scala, Shell Scripting, C
Version Control
Git, Bitbucket
Development Tools
Eclipse, NetBeans, Microsoft Office Suite (Word, Excel, PowerPoint, Access)
PROFESSIONAL EXPERIENCE:
Toyota – Dallas, TX    Aug 2023 – Oct 2025
Sr. Data Engineer
Responsibilities:
Design and develop data pipelines using ETL (Extract, Transform, Load) processes to move and transform data between multiple systems.
Build scalable pipelines using PySpark and Spark SQL, applying advanced transformations and improving overall functionality and efficiency.
Collaborate with Data Scientists to understand data and model requirements, ensuring the pipeline supports analytical and forecasting needs.
Create and manage Databricks job workflows aligned with overall data pipeline design and automation standards.
Develop automated ETL jobs to load aggregated and transformed data from Databricks into Snowflake for downstream analytics and forecasting, leveraging Snowflake connectors for optimized data transfer.
Monitor data pipelines in both development and production environments using Databricks and AWS CloudWatch.
Investigate and resolve data pipeline failures by analyzing Databricks and CloudWatch logs and tracing issues from source to downstream systems.
Enhance pipelines to generate regional and national forecasts, supporting business-level insights.
Optimize pipeline performance by improving code efficiency, tuning Spark configurations, and reducing runtime using job clusters.
Develop and maintain Databricks notebooks for testing and validating code changes prior to deployment.
Configure Databricks to automatically sync the latest code from GitHub, ensuring production pipelines always run on the most recent version.
Leverage AWS S3, CloudWatch, and SageMaker for model registry, training, and version control of multiple series-based models.
Analyze and compare different model versions in SageMaker to validate performance metrics and ensure consistent accuracy.
Design Datadog alerts, dashboards, and monitors to track pipeline performance, detect failures, and visualize system health.
Integrate Datadog dashboards with data engineering and machine learning repositories for real-time monitoring.
Document all pipeline processes, tasks, and enhancements in Atlassian Confluence for team reference and knowledge sharing.
Manage and track JIRA tasks related to pipeline development, bug fixes, and enhancements to ensure timely delivery.
Implement Apache Airflow for authoring, scheduling, and monitoring data pipelines (a DAG sketch follows this list).
Schedule Airflow DAGs to run multiple Hive jobs, which execute independently based on time and data availability.
Follow Agile methodology, participating in daily Scrum meetings, sprint planning, showcases, and retrospectives.
Utilize Databricks Copilot to summarize notebooks, document pipeline logic, and assist in identifying potential issues or improvements in existing transformation scripts.
Participate in daily standups, biweekly sprint reviews, PR reviews and technical demos to communicate project progress and resolve blockers.
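Illustrative only: a minimal Airflow DAG sketch of the time-based scheduling described above; the DAG id, schedule, and task commands are assumptions for illustration, not the production configuration.

# Hypothetical Airflow DAG illustrating time-based scheduling of Spark/Hive steps.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="daily_forecast_pipeline",          # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",             # daily at 06:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw",
        bash_command="spark-submit ingest.py --date {{ ds }}",   # assumed script
    )
    transform = BashOperator(
        task_id="transform_hive",
        # connection details omitted; script name and hivevar are assumed
        bash_command="beeline -f transform.hql --hivevar run_date={{ ds }}",
    )
    ingest >> transform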
Environment: Python, PySpark, AWS (EC2, S3, EMR, RDS, Redshift, Glue, Lambda, CloudWatch, API Gateway), Databricks, Airflow, SageMaker, Shell Scripting, GitHub Actions, Docker, Datadog, Power BI
United Airlines – Houston, TX    Dec 2022 – Jul 2023
Data Engineer
Responsibilities:
Designed and deployed large-scale data pipelines and applications on AWS stack ensuring high availability and scalability.
Developed Spark applications in PySpark and Spark SQL for data extraction, transformation, and aggregation.
Built and optimized AWS Glue ETL jobs for schema mapping and transformations.
Integrated Informatica with AWS and Snowflake environments for hybrid ETL workflows, enabling seamless data movement between on-premises and cloud data platforms.
Implemented real-time streaming pipelines with Kafka, Spark, and Hive.
Modernized legacy SQL scripts into optimized Scala Spark workflows.
Designed data ingestion and processing frameworks on AWS EMR, Redshift, and S3.
Automated data validation and transformation workflows using AWS Lambda and DynamoDB Streams.
Created PySpark pipelines to convert raw Avro data into optimized ORC format (see the sketch after this list).
Implemented AWS CloudWatch monitoring and alerts to track pipeline performance, identify failures, and ensure system reliability.
Developed event-driven Lambda functions to automate triggers and actions across multiple AWS services, reducing manual intervention and latency.
Integrated CI/CD pipelines using GitHub Actions and AWS CodePipeline to automate data pipeline deployment and testing.
Optimized PySpark and Glue jobs through efficient partitioning, caching, and broadcast joins to reduce job runtime and cluster cost.
Collaborated with Data Science teams to operationalize ML scoring workflows using SageMaker and PySpark within production data pipelines.
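Illustrative only: a minimal sketch of the Avro-to-ORC conversion pattern noted above, assuming the spark-avro package is available on the cluster; paths, key column, and partition column are hypothetical.

# Hypothetical PySpark job converting raw Avro to partitioned ORC.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-to-orc")
         # spark-avro must be on the classpath (e.g. --packages org.apache.spark:spark-avro_2.12:3.4.1)
         .getOrCreate())

raw = spark.read.format("avro").load("s3://bucket/raw/events/")   # assumed path

curated = raw.dropDuplicates(["event_id"])        # assumed key column
(curated.write
    .mode("overwrite")
    .partitionBy("event_date")                    # assumed partition column
    .orc("s3://bucket/curated/events_orc/"))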
Environment: Python, PySpark, AWS (EC2, S3, EMR, RDS, Redshift, Glue, Lambda, CloudWatch, API Gateway), Airflow, Jenkins, Docker
KPMG    June 2020 – June 2021
Software Engineer
Responsibilities:
Built ETL pipelines to move data from flat files, Excel workbooks, and Oracle databases into enterprise data warehouses.
Managed large-scale datasets in HDFS and performed complex data transformations using PySpark.
Designed and queried analytical Hive tables to support business reporting and data analysis.
Involved in data migration using SQL, Azure SQL, Kafka, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
Implemented streaming data ingestion using Kafka and Spark Structured Streaming for near real-time workflows (see the sketch after this list).
Used Snowflake as a central data warehouse for curated datasets, enabling fast analytical queries and integration with BI tools.
Developed automated ETL jobs to load aggregated and transformed data into Snowflake for downstream analytics and reporting.
Leveraged AWS S3 and EMR for scalable data storage, distributed processing, and job orchestration.
Optimized Spark jobs using partitioning, caching, and configuration tuning for better performance.
Scheduled and monitored daily ETL workflows to ensure on-time delivery and data freshness.
Developed SQL scripts for data validation, reconciliation, and reporting to ensure data quality.
Automated lightweight data validation and metadata updates using AWS Lambda for operational efficiency.
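Illustrative only: a minimal Spark Structured Streaming sketch of the Kafka ingestion noted above; the broker address, topic, schema, and paths are assumptions for illustration.

# Hypothetical near-real-time ingestion: Kafka -> Spark Structured Streaming -> S3.
# Requires the spark-sql-kafka package on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([                      # assumed message schema
    StructField("txn_id", StringType()),
    StructField("account", StringType()),
    StructField("amount", DoubleType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "transactions")                # assumed topic
          .load())

parsed = (events.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("d"))
          .select("d.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://bucket/stream/transactions/")
         .option("checkpointLocation", "s3://bucket/checkpoints/transactions/")
         .trigger(processingTime="1 minute")
         .start())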
Environment: Python, PySpark, Hadoop (HDFS, Sqoop), Hive, SQL, Kafka, AWS (S3, EMR, Lambda), Oracle, Azure, Snowflake, Power BI
DXC Technology May 2016 – May 2020
Associate Professional Programmer Analyst
Responsibilities:
Designed and maintained ETL workflows to extract, transform, and load data from multiple sources into relational databases.
Wrote complex SQL queries, stored procedures, and views for data extraction, cleansing, and reporting.
Performed data validation and quality checks to ensure accuracy and consistency across systems.
Automated routine data-processing tasks using Python scripts for file handling, XML parsing, and data formatting (see the sketch after this list).
Developed and optimized database schemas (MySQL) with indexing and normalization for better performance.
Created dashboards and ad-hoc data extracts to support business analysis and decision-making.
Collaborated with analysts and developers to troubleshoot and resolve data pipeline or query performance issues.
Supported migration and integration of datasets between legacy systems and new data platforms.
Assisted in data documentation and establishing standard SQL coding practices across teams.
Built and maintained data pipelines to extract, transform, and load (ETL) data from various sources into SAP systems and data warehouses.
Designed and developed data models tailored to business needs, simplifying analysis of SAP data.
Optimized SAP data processes to ensure queries and reports run efficiently.
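Illustrative only: a minimal Python sketch of the file-handling and XML-parsing automation noted above; the folder layout, tag names, and output columns are assumptions, not from an actual engagement.

# Hypothetical script: parse incoming XML order files and flatten them to CSV.
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

INBOX = Path("/data/inbox")                    # assumed drop folder
OUTPUT = Path("/data/processed/orders.csv")    # assumed output file

rows = []
for xml_file in INBOX.glob("*.xml"):
    root = ET.parse(xml_file).getroot()
    for order in root.findall("order"):        # assumed tag structure
        rows.append({
            "order_id": order.findtext("id", default=""),
            "customer": order.findtext("customer", default=""),
            "amount": order.findtext("amount", default="0"),
        })

with OUTPUT.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "customer", "amount"])
    writer.writeheader()
    writer.writerows(rows)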
Environment: SAP, SQL (MySQL, Spark SQL), Python, Linux, ETL Tools, Apache Server
Education:
Master’s in Computer Science Aug 2021 – Dec 2022
Lamar University, Beaumont, TX