
Data Engineer Quality

Location:
Richardson, TX, 75080
Posted:
October 15, 2025


Resume:

Pravalika Nagaraju

Data Engineer

Richardson, TX 603-***-**** *********.****@*****.***

SUMMARY

Data Engineer with 5+ years of experience designing, building, and optimizing large-scale data platforms in cloud (AWS) and on-premises environments. Proven expertise in ETL/ELT pipelines, real-time streaming, data quality frameworks, and data mesh architectures using Apache Spark, Kafka, dbt, Snowflake, and PySpark. Skilled in deploying production-grade CI/CD pipelines (Airflow, Jenkins, GitHub Actions, Terraform), driving automation, reducing latency, and improving data reliability at scale. Adept at collaborating cross-functionally with data scientists, ML engineers, and business stakeholders to deliver high-impact data solutions supporting machine learning, personalization, fraud detection, and analytics for enterprises such as Netflix and JPMorgan Chase. Known for implementing cost-effective, high-performance, governance-compliant data systems that reduce manual effort and drive measurable business outcomes.

EXPERIENCE

Netflix Data Engineer CA April 2024 – Present

• Optimized high-performance Apache Spark (EMR) jobs and real-time Kafka pipelines, reducing data latency from hours to seconds for over 50 million daily users and improving event-driven personalization and search.
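
A minimal, illustrative sketch of a Kafka-to-Spark Structured Streaming job of the kind described above; the broker address, topic, schema, and S3 paths are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("event-stream").getOrCreate()

    # Hypothetical schema for personalization/search events
    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Read events from Kafka and parse the JSON payload
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "user-events")                 # hypothetical topic
        .load()
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    # Write parsed events downstream in low-latency micro-batches
    query = (
        events.writeStream.format("parquet")
        .option("path", "s3://bucket/events/")              # hypothetical sink
        .option("checkpointLocation", "s3://bucket/checkpoints/events/")
        .trigger(processingTime="10 seconds")
        .start()
    )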

• Developed a Generative AI-powered data quality framework using AWS Bedrock and Python to auto-generate SQL, validate data quality rules, and document lineage, cutting manual governance tasks by 45% and boosting anomaly detection by 35%.
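
For illustration only, a sketch of calling an AWS Bedrock-hosted model from Python (boto3) to draft a candidate SQL quality rule. The model ID, table, and prompt are hypothetical, and request/response payloads differ by model family; this follows the Anthropic messages format used by Claude models on Bedrock:

    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    prompt = (
        "Write a SQL query that counts rows in events.playback where user_id is NULL "
        "or event_time falls outside the last 24 hours."
    )

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # hypothetical model choice
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )

    # Extract the generated SQL; it is reviewed before entering the quality framework
    generated_sql = json.loads(response["body"].read())["content"][0]["text"]
    print(generated_sql)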

• Architected and deployed a Delta Lake-based Medallion architecture on AWS (S3, EMR, Glue), improving data discoverability and reducing batch processing costs by 20%, while enabling schema evolution for ML workflows.
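
An illustrative bronze-to-silver Delta Lake step with schema evolution enabled, assuming a Spark runtime with Delta Lake available (e.g., EMR) and hypothetical S3 paths and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

    # Read raw (bronze) events -- hypothetical lake layout
    bronze = spark.read.format("delta").load("s3://lake/bronze/events/")

    # Light cleanup for the silver layer
    silver = bronze.dropDuplicates(["event_id"]).filter("user_id IS NOT NULL")

    # Append to the silver table, allowing new columns via schema evolution
    (
        silver.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("s3://lake/silver/events/")
    )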

• Designed and implemented a unified Data Mesh architecture to decentralize data ownership across business units (Content, Marketing, Analytics), enabling self-service data access and reducing cross-team data dependency by 45%.

• Created and maintained a scalable dbt Core pipeline for automated feature engineering and ML model monitoring, integrated with Metaflow to deliver training-ready datasets and detect feature drift, reducing prep time by 35%.

• Collaborated closely with data scientists and research engineers on the Personalization and Search teams to design and implement efficient data access and preprocessing strategies for large-scale A/B testing and model training.

JPMorgan Chase Data Engineer India January 2019 – November 2022

• Implemented real-time streaming data pipelines using Apache Kafka, AWS Kinesis, and Spark Structured Streaming, delivering fraud detection with sub-second latency and processing over 50K transactions/sec at 99.99% reliability.

• Tuned and optimized complex SQL queries and materialized views in Snowflake, reducing analytics latency from 10 minutes to under 30 seconds, supporting time-sensitive high-frequency trading operations.

• Engineered a unified CI/CD and orchestration platform using Apache Airflow, Jenkins, GitHub Actions, and Terraform to automate testing, deployment, and monitoring of data pipelines, cutting manual tasks by 60% and enabling audit-ready releases.
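
A minimal Airflow DAG sketch of the test-then-deploy flow described above; the DAG id and shell commands are hypothetical placeholders, and import paths assume Airflow 2.x:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="pipeline_ci_cd",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Run data/model tests before anything is promoted
        run_tests = BashOperator(task_id="run_tests", bash_command="dbt test")
        # Apply infrastructure changes only after tests pass
        deploy = BashOperator(task_id="deploy", bash_command="terraform apply -auto-approve")

        run_tests >> deploy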

• Designed and deployed 15+ scalable data pipelines across AWS (S3, Glue, Lambda) and on-prem Hadoop (HDFS, Hive, Spark) ecosystems, improving ETL efficiency, reducing latency by 30%, and ensuring adherence to financial data compliance standards.

• Contributed to the design of modular Python frameworks (Pandas, PySpark) for automated data quality validation and feature engineering, saving 20+ hours per week in manual QA and boosting ML model accuracy by 25%.
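
An illustrative sketch of small, reusable Pandas validation helpers of the kind described above; the column and key names are hypothetical:

    import pandas as pd

    def count_nulls(df: pd.DataFrame, columns: list) -> dict:
        """Return the number of null values per required column."""
        return {c: int(df[c].isna().sum()) for c in columns}

    def count_duplicate_keys(df: pd.DataFrame, key: str) -> int:
        """Return the number of duplicated key values."""
        return int(df[key].duplicated().sum())

    # Example usage on a tiny transactions frame
    tx = pd.DataFrame({"txn_id": [1, 2, 2], "amount": [10.0, None, 5.0]})
    print(count_nulls(tx, ["txn_id", "amount"]))   # {'txn_id': 0, 'amount': 1}
    print(count_duplicate_keys(tx, "txn_id"))      # 1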

• Mentored and trained 3 junior engineers in PySpark performance tuning, SQL optimization, and cloud-native architecture, boosting team output by 30% and enabling successful delivery of 3+ production-grade pipelines.

SKILLS

Programming Languages: Python, Java, Scala, SQL (Oracle, PostgreSQL, MySQL, SQL Server), NoSQL (MongoDB, DynamoDB), Shell Scripting

Big Data & Distributed Systems: Apache Spark, Apache Hive, Apache Pig, Hadoop (HDFS, MapReduce), Apache Sqoop

Streaming & Real-Time Processing: Apache Kafka, AWS Kinesis, Spark Structured Streaming, Apache Flink

ETL, Workflow Orchestration & Data Pipelines: Apache Airflow, AWS Glue, Azure Data Factory, Talend, SSIS, Prefect, dbt (Data Build Tool)

Cloud Platforms & Data Warehousing: AWS (S3, EC2, Redshift, EMR, Glue, Athena, RDS, Lambda, Bedrock), Azure (Synapse, Databricks, Data Lake), GCP (BigQuery, Cloud Storage), Snowflake

DevOps & Infrastructure as Code: Git, Jenkins, GitHub Actions, GitLab CI/CD, Docker, Kubernetes, Terraform

Machine Learning & AI Engineering: Feature Engineering, ML Model Monitoring, Feature Drift Detection, Model Readiness Pipelines, LLM Integration, MLOps (Metaflow), Generative AI (AWS Bedrock)

Libraries & Frameworks: Pandas, NumPy, Scikit-learn, TensorFlow, PyTorch, Keras, Matplotlib, Seaborn, SciPy

Data Visualization & Development Methodologies: Tableau, Power BI, QuickSight, QlikView, Microsoft Excel, Agile (Scrum), Waterfall, SDLC

EDUCATION

Master’s in Management Information Systems, Northern Illinois University, DeKalb, IL, USA.

CERTIFICATIONS

• Databricks Certified Data Engineer Associate

• Microsoft Certified: Power BI Data Analyst Associate

• Microsoft Certified: Fabric Data Engineer Associate


