
Data Engineer Machine Learning

Location: Denton, TX
Salary: $95,000
Posted: September 10, 2025


Anusha Yanna

Data Engineer

Denton, TX | 945-***-**** | ******.*******@*****.*** | LinkedIn

SUMMARY

Data Engineer with 5+ years of experience delivering scalable data pipelines and real-time streaming solutions across the entertainment, retail, and financial services industries. Skilled in cloud platforms (AWS, GCP), big data technologies (Hadoop, Spark, Kafka, Flink), and DevOps practices. Proven ability to enhance data reliability, optimize costs, and drive analytics and machine learning initiatives in fast-paced, cross-functional environments.

SKILLS

Programming Languages: Python, Java, Scala, SQL, PL/SQL, T-SQL, Shell Scripting

Big Data & Streaming Technologies: Hadoop, Spark, Hive, Pig, Sqoop, HBase, MapReduce, Flink, Kafka, Kinesis

ETL & Data Pipelines: Apache Airflow, AWS Glue, Informatica PowerCenter, SSIS, Talend, DataStage, Azure Data Factory, Luigi, Prefect

Cloud Platforms: AWS (EC2, S3, Redshift, Glue, EMR, Athena, DynamoDB, RDS), Azure (Data Factory, Databricks, Synapse, Blob Storage, Data Lake), GCP (BigQuery, Storage)

Data Warehousing: Redshift, BigQuery, Snowflake, Teradata, Synapse, Oracle, SQL Server, PostgreSQL, MySQL

Databases: MongoDB, DynamoDB, CosmosDB, DB2

DevOps & Infrastructure: Git, Jenkins, Docker, Kubernetes, Terraform, GitLab CI

Data Visualization & Reporting: Tableau, Power BI, QuickSight, IBM Cognos, QlikView, Microsoft Excel, SSRS, Seaborn, Plotly, Dash

Libraries & Frameworks: Pandas, NumPy, Scikit-learn, TensorFlow, PyTorch, Keras, Matplotlib, SciPy, NLTK, PyMC3, Requests, Boto3

Methodologies: Agile, Scrum, SDLC, Waterfall

Certificate: Data Science with R Programming - Simplilearn

EXPERIENCE

Netflix, CA | Senior Data Engineer | August 2024 – Present

Developed a high-throughput Flink and Kafka pipeline processing over 1M events per second, implementing end-to-end AES-256 encryption and IAM-based access controls to enable real-time recommendation systems and achieve SOC 2 compliance.

Architected and implemented a Delta Lake-based medallion architecture on AWS (S3, EMR, and Glue), enhancing data discoverability and reducing batch processing costs by 25%, while ensuring schema evolution support for machine learning teams.

Collaborated with ML engineers to productionize personalization features on Databricks using Spark, optimizing feature store pipelines (Feast) and reducing model training latency by 30%.

Led migration from legacy Hive/MapReduce workflows to Spark SQL and Iceberg, applying tag-based IAM access policies and query audit logging, resulting in $500K annual savings and full alignment with Netflix data governance standards.

Deployed OpenLineage integrated with AWS Macie for sensitive data classification and Great Expectations for anomaly detection, reducing incident resolution times by 50% and proactively mitigating over three data exposure risks per quarter.

Walmart, AR | Data Engineer | October 2023 – August 2024

Optimized large-scale data workflows within Walmart’s Hadoop and Spark ecosystem, utilizing YARN and Hive for resource management and batch processing, increasing supply chain analytics efficiency by 30%.

Developed interactive dashboards with Python (Plotly, Dash) and Tableau to visualize sales trends, inventory levels, and customer behavior, empowering merchandising teams with actionable insights.

Automated CI/CD pipelines for data applications using Jenkins and Docker, achieving consistent deployment of containerized analytics models across Walmart’s hybrid cloud environments (Google Cloud Platform and on-premises).

Designed scalable NoSQL data models in Google BigQuery and Cassandra to manage high-velocity data sources, including point-of-sale transactions and IoT sensor data, improving query performance by 40%.

Engineered ETL pipelines using Apache Airflow and Informatica to process terabytes of structured and unstructured vendor, store, and e-commerce data, reducing data ingestion latency by 25%.

JPMorgan Chase, India | Data Engineer | January 2019 – July 2022

Designed and deployed 15+ scalable data pipelines across AWS (S3, Glue, Lambda) and on-premises Hadoop environments (HDFS, Hive, Spark), optimizing ETL workflows, reducing latency by 35%, and ensuring compliance with financial data governance policies.

Built data processing frameworks using PySpark and SQL on Databricks and AWS EMR, implementing Delta Lake for ACID transactions, schema enforcement, and real-time streaming via Kafka, improving data reliability by 25%.

Enhanced Teradata and Oracle data warehouses through advanced SQL techniques such as partitioning, indexing, and query optimization, reducing report generation time by 40% for business analytics users.

Developed automated ETL workflows with Apache NiFi and Informatica, integrating diverse data sources including RDBMS (DB2, SQL Server), REST APIs, and market data feeds, while embedding data quality checks and audit logs.

Established CI/CD pipelines using Jenkins and GitLab CI, automating infrastructure provisioning with Terraform and Python scripts, cutting deployment times by 20% for data applications.

Partnered with risk analytics and trading teams to deliver real-time data solutions, accelerating regulatory reporting and improving customer insight generation.

EDUCATION

Master's in Information Systems and Technology,

University of North Texas, TX, USA.


