Data Engineer Machine Learning

Location:
Denton, TX
Salary:
70000
Posted:
October 15, 2025


Resume:

Sripriya Reddy

Data Engineer

Dallas, TX +1-972-***-**** ****************@*****.*** LinkedIn GitHub

SUMMARY

Data Engineer with 5+ years of experience architecting and optimizing cloud-native data ecosystems to enable scalable analytics, real-time insights, and advanced machine learning workflows. Designs end-to-end data pipelines leveraging Python, SQL, Apache Spark, and Airflow, with expertise in AWS, GCP, and Snowflake environments. Demonstrated success implementing data models, orchestration frameworks, and CI/CD pipelines that ensure reliability, performance, and governance across complex datasets. Committed to transforming data infrastructure into a strategic asset that accelerates innovation and drives intelligent decision-making.

EXPERIENCE

Data Engineer, Charles Schwab, TX, USA Dec 2024 – Present

Engineered large-scale ETL pipelines using AWS Glue, Lambda, and Spark, automating ingestion of transactional, portfolio, and market datasets supporting machine learning workflows and reducing manual ingestion time.
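A minimal sketch of a Glue ETL job of this shape (the catalog database, table, and S3 paths are placeholders, not the actual resources):

# Minimal AWS Glue job: read a cataloged source, remap columns, write Parquet.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Ingest transactional records registered in the Glue Data Catalog
# (database/table names are hypothetical).
frame = glue_context.create_dynamic_frame.from_catalog(
    database="transactions_db", table_name="raw_trades"
)

# Keep and rename only the columns downstream models need.
mapped = ApplyMapping.apply(
    frame=frame,
    mappings=[
        ("trade_id", "string", "trade_id", "string"),
        ("amount", "double", "amount", "double"),
        ("trade_ts", "string", "trade_ts", "timestamp"),
    ],
)

# Land curated Parquet for Spark and ML consumers.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-curated/trades/"},
    format="parquet",
)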

Architected and optimized Snowflake dimensional data models, improving analytical query speed by 45% while enforcing consistent governance across portfolio analytics, customer risk, and performance dashboards.

Implemented workflow orchestration with Apache Airflow to enhance scheduling reliability, manage dependencies dynamically, and cut end-to-end data pipeline runtime by 30% through parallel processing.
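An illustrative Airflow 2.x DAG showing the fan-out/fan-in parallelism described above (the DAG id, schedule, sources, and callables are assumed for the example):

# Independent extracts run in parallel, then a single load step fans in.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(source: str) -> None:
    print(f"extracting {source}")

def load() -> None:
    print("loading warehouse")

with DAG(
    dag_id="portfolio_ingest",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extracts = [
        PythonOperator(
            task_id=f"extract_{source}",
            python_callable=extract,
            op_args=[source],
        )
        for source in ("transactions", "portfolio", "market")
    ]
    load_task = PythonOperator(task_id="load_warehouse", python_callable=load)

    # Fan-in: the load runs only after all parallel extracts succeed.
    extracts >> load_task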

Partnered with cross-functional teams to integrate validation and lineage tracking using dbt and Great Expectations, ensuring traceability for over 10M financial records and maintaining regulatory data integrity.
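A hedged sketch of the validation step using Great Expectations' classic pandas-backed interface (the exact API varies considerably by version; the columns and rules are invented examples):

import pandas as pd
import great_expectations as ge

records = pd.DataFrame(
    {"account_id": ["A1", "A2", None], "balance": [100.0, -5.0, 250.0]}
)

df = ge.from_pandas(records)
checks = [
    df.expect_column_values_to_not_be_null("account_id"),
    df.expect_column_values_to_be_between("balance", min_value=0),
]

# Fail the pipeline run on any violation instead of loading bad records.
for check in checks:
    if not check.success:
        raise ValueError(f"Validation failed: {check.expectation_config}")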

Built AI-ready data pipelines delivering feature-rich datasets for model training and inference, enabling predictive analytics that forecasted customer risk exposure and market volatility trends with 92% model accuracy.

Automated CI/CD deployment of ETL workloads using Jenkins and Terraform, implementing infrastructure-as-code provisioning and version-controlled rollouts across multiple AWS environments, reducing deployment failures by 60%.

Data Engineer, Tata Consultancy Services, India May 2022 – May 2023

Implemented distributed data pipelines using PySpark and Kafka to process over 500 million financial transactions daily, ensuring sub-second data availability and horizontal scalability across multiple banking systems.
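A representative PySpark Structured Streaming job for this kind of Kafka ingestion (broker address, topic, schema, and sink paths are placeholders):

# Consume transactions from Kafka, parse JSON payloads, write Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("txn-stream").getOrCreate()

schema = StructType([
    StructField("txn_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)

# Kafka values arrive as bytes; decode and unpack the JSON payload.
parsed = raw.select(
    from_json(col("value").cast("string"), schema).alias("t")
).select("t.*")

query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/txn/")
    .option("checkpointLocation", "s3a://example-bucket/ckpt/")
    .start()
)
query.awaitTermination()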

Led the complete migration of legacy ETL pipelines to Google Cloud, optimizing cluster configurations and dynamic scaling, cutting compute costs, and improving job reliability and uptime by 35%.

Developed standardized Python and SQL transformation templates, reducing code redundancy across 8 projects and improving developer onboarding efficiency by 40% through modular, reusable design patterns.
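One possible shape for such reusable transformation templates: a registry of named pandas steps composed per project (the step names and logic are illustrative):

from typing import Callable, Dict
import pandas as pd

TRANSFORMS: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {}

def transform(name: str):
    """Register a DataFrame -> DataFrame step under a reusable name."""
    def register(fn: Callable[[pd.DataFrame], pd.DataFrame]):
        TRANSFORMS[name] = fn
        return fn
    return register

@transform("dedupe")
def dedupe(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates()

@transform("normalize_cols")
def normalize_cols(df: pd.DataFrame) -> pd.DataFrame:
    return df.rename(columns=str.lower)

def run_pipeline(df: pd.DataFrame, steps: list[str]) -> pd.DataFrame:
    # Each project declares its pipeline as a list of registered step names.
    for step in steps:
        df = TRANSFORMS[step](df)
    return df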

Automated Airflow workflows for batch and streaming jobs with SLA-based alerting and real-time recovery, reducing job failures by 70% and improving reporting predictability for financial analytics teams.
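A sketch of SLA-based alerting with retry-driven recovery on an Airflow 2.x task (the callbacks here just print; a real deployment would page or post to a channel):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def alert(context):
    # Hypothetical failure hook; fires only after retries are exhausted.
    print(f"ALERT: {context['task_instance'].task_id} failed")

def sla_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    print(f"SLA missed for: {task_list}")

with DAG(
    dag_id="batch_reporting",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    sla_miss_callback=sla_alert,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="build_report",
        python_callable=lambda: print("building report"),
        sla=timedelta(minutes=30),       # alert if the task runs long
        retries=3,                       # automatic recovery before alerting
        retry_delay=timedelta(minutes=5),
        on_failure_callback=alert,
    )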

Collaborated with DevOps engineers to build CI/CD pipelines using Jenkins, Terraform, and Git, integrating testing, versioning, and rollback mechanisms, enabling zero-downtime releases for production data environments.

Optimized BigQuery performance via partitioning, clustering, and cost tuning, reducing monthly query cost by 40% and improving average query runtime from 15s to under 5s across analytic workloads.

Data Analyst, Capgemini, India Jun 2019 – May 2022

Designed analytical data models using SQL Server and Azure Synapse Analytics, consolidating multiple data silos and enabling near real-time visibility into financial KPIs across retail, logistics, and engagement functions.

Developed advanced Power BI dashboards and data visualizations consumed by over 40 senior executives, identifying performance bottlenecks and supporting high-impact business decisions.

Automated weekly data refresh pipelines using Python and Airflow, reducing manual reporting workload by 40%, improving data availability timelines, and ensuring SLA adherence across global reporting teams.

Conducted detailed data validation and reconciliation on multi-terabyte datasets to ensure integrity between upstream ERP systems and analytical warehouses, reducing data discrepancy incidents by 35% and reprocessing costs.
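An illustrative reconciliation check between an upstream extract and its warehouse copy: compare row counts, key coverage, and numeric totals (the key column and checks are assumptions):

import pandas as pd

def reconcile_frames(source: pd.DataFrame, target: pd.DataFrame, key: str) -> dict:
    report = {
        "row_count_match": len(source) == len(target),
        "missing_in_target": source.loc[~source[key].isin(target[key]), key].tolist(),
    }
    # Compare numeric totals to catch silent truncation or duplication.
    for col in source.select_dtypes("number").columns:
        if col in target.columns:
            report[f"{col}_delta"] = float(source[col].sum() - target[col].sum())
    return report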

Migrated large-scale on-premises datasets to Azure Synapse, improving query performance by 50% and enabling cost-efficient data compression strategies for long-term analytics storage across multiple business units.

Performed exploratory data analysis and predictive modeling with Python (pandas, scikit-learn), uncovering patterns in customer churn that led to a 12% improvement in retention campaign efficiency.

Data Analyst Associate, Citius Tech, India Jan 2019 – Apr 2019

Extracted, transformed, and standardized healthcare data from SQL Server, Excel, and API sources, ensuring compliance with HIPAA standards and enabling automated reporting across 15+ clinical performance metrics.

Maintained Python-based cleansing scripts to validate data accuracy, handle missing values, and streamline regulatory datasets, reducing manual QA hours by 50% across patient data workflows.
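A representative cleansing step of the kind described (column names, imputation rules, and the plausibility range are invented for the example):

import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Coerce malformed dates/numbers to NaN rather than failing the load.
    df["visit_date"] = pd.to_datetime(df["visit_date"], errors="coerce")
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    # Impute where safe; drop rows whose identity is unrecoverable.
    df["age"] = df["age"].fillna(df["age"].median())
    df = df.dropna(subset=["patient_id", "visit_date"])
    # Flag implausible values for manual QA instead of silently fixing them.
    df["needs_review"] = ~df["age"].between(0, 120)
    return df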

Partnered with analysts to design and deploy Tableau dashboards visualizing patient outcomes and treatment timelines, improving visibility for healthcare executives and accelerating clinical decision-making processes.

Collaborated with engineering teams to document schema definitions, transformation logic, and lineage maps, improving knowledge transfer and maintainability of the enterprise data ecosystem for 10+ internal projects.

Created ETL documentation and monitoring scripts that improved data pipeline observability, enabling early error detection and reducing downtime in daily data loading operations by 30%.

Conducted exploratory data assessments and anomaly detection across multi-source datasets, identifying missing attributes and data inconsistencies that improved overall reporting reliability by 20%.

SKILLS

Programming & Scripting: Python, SQL, Scala, Bash, Shell Scripting, DAX, R
Data Processing & Big Data Frameworks: Apache Spark, PySpark, Hadoop, Kafka, Flink, Hive, Spark Streaming
Data Orchestration & Workflow Management: Apache Airflow, dbt, Luigi, Prefect
AI / Machine Learning: OpenAI API, LangChain, Hugging Face Transformers, Scikit-learn, TensorFlow, PyTorch, MLflow, XGBoost, LightGBM, Feature Engineering, Model Deployment, MLOps, Model Monitoring
Data Warehousing & Storage: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, Databricks, AWS S3
ETL / ELT Development: Data ingestion, integration, and automation; Informatica, Talend, Alteryx
Database Technologies: PostgreSQL, MySQL, MongoDB, Cassandra, DynamoDB
Data Modeling & Architecture: Dimensional Modeling, Star/Snowflake Schema, Data Lakes, Data Marts, Data Governance
Analytics & Business Intelligence: Power BI, Tableau, Looker, Excel (Pivot Tables, Power Query), Data Storytelling
Statistical & Analytical Methods: Exploratory Data Analysis (EDA), Hypothesis Testing, Regression Analysis, Time Series Forecasting
Cloud Platforms: AWS (Glue, Lambda, EMR, S3), GCP (Dataflow, Pub/Sub), Azure (Data Factory)
Project Management & Collaboration: JIRA, Asana, Trello, Agile (Scrum/Kanban)
DevOps & CI/CD: Docker, Kubernetes, Terraform, Jenkins, Git, GitHub, Bitbucket, CI/CD pipeline automation
Security & Compliance: IAM, Data Encryption, GDPR, SOC 2, Access Control

EDUCATION

Master of Science in Information Systems and Technologies Aug 2023 – May 2025
University of North Texas, Denton, Texas

CERTIFICATION

AWS Cloud Practitioner – Amazon

AWS AI Practitioner – Amazon

Fabric Data Engineer Associate DP-700 – Microsoft

Generative AI Fundamentals – Databricks

Databricks Fundamentals – Databricks

PROJECTS

Financial Report Anomaly Detector Automation

Built a CNN-based anomaly detection pipeline using AWS Textract, OpenCV, and TensorFlow, achieving 99.6% accuracy in detecting anomalies within scanned compliance documents.
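A toy version of the CNN classifier stage in Keras (the input shape and layer sizes are assumptions; the Textract OCR and OpenCV preprocessing steps are omitted):

# Small binary CNN: label document-page crops as normal vs. anomalous.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),      # grayscale page crops
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # anomaly probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])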

Automated reconciliation of 20K+ daily financial records with Python fuzzy matching, saving 50 hours/week and improving SOX compliance.
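A standard-library sketch of the fuzzy-matching reconciliation (the actual system may use a dedicated matching library; the similarity cutoff is an assumption):

# Pair ledger entries with bank records above a similarity threshold.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def reconcile(ledger: list[str], bank: list[str], cutoff: float = 0.9):
    matches, unmatched = [], []
    remaining = list(bank)
    for entry in ledger:
        best = max(remaining, key=lambda rec: similarity(entry, rec), default=None)
        if best is not None and similarity(entry, best) >= cutoff:
            matches.append((entry, best))
            remaining.remove(best)   # each bank record matches at most once
        else:
            unmatched.append(entry)  # route to manual review for SOX evidence
    return matches, unmatched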

Real-Time Fraud Detection System for National Australia Bank

Designed a scalable fraud detection pipeline with AWS (S3, Lambda, Glue, Redshift) and Apache Airflow, enhancing ingestion speed, reducing downtime, and improving real-time query performance for suspicious transaction analysis.

Supply Chain Optimization for McDonald’s

Built AWS- and Airflow-based pipelines to optimize inventory management and demand forecasting across 100+ stores.

Created interactive Power BI dashboards for rapid decision-making and cut cloud storage costs via S3 and Redshift optimizations.


