Pavan Yarlagadda
*****.*******@*****.*** Phone: 346-***-****
Professional Summary
6+ years of experience designing, developing, and deploying scalable data engineering and ML platforms using Python, PySpark, Java, and TensorFlow.
Built distributed Big Data and streaming pipelines across Hadoop, Spark (RDDs/DataFrames/Datasets), Kafka, Flink, NiFi, Kinesis, and Spark Streaming for batch and low-latency workloads.
Designed and optimized ETL/ELT data pipelines with Airflow, AWS Glue, Informatica, Oozie, Sqoop, Flume, Pig, and Databricks, backed by SQL and NoSQL stores (MySQL, Redshift, Snowflake, HBase, Cassandra).
Strong background in AWS (S3, EC2, RDS, Redshift, Lambda, Glue, EMR, SageMaker) and Google Cloud (BigQuery), with CI/CD automation using Terraform, Jenkins, Docker, Kubernetes, and GitHub Actions.
Delivered production ML systems for predictive analytics, anomaly detection, churn reduction, sentiment analysis, recommender engines, and LangChain-based LLM pipelines.
Leveraged advanced ML frameworks (TensorFlow, PyTorch, Scikit-learn) and natural language processing libraries (spaCy, NLTK, Gensim) for real-world applications.
Ensured secure, compliant data solutions with AWS IAM, KMS, encryption, data masking, and governance aligned with GDPR, HIPAA, and enterprise quality standards.
Created interactive dashboards and visualization integrations using Tableau, Power BI, and QuickSight, enabling executive leadership to access real-time metrics on data ingestion and operational KPIs.
Delivered near real-time decisioning platforms by integrating Kafka/Kinesis with Spark Streaming, Flink, and AWS Lambda for anomaly detection and operational insights (see the illustrative sketch after this summary).
Worked closely with business stakeholders, data scientists, and engineering teams to develop scalable data architectures and ML-driven insights aligned with business goals.
Agile/Scrum & SDLC practitioner, driving performance tuning, cost optimization, and rapid adoption of emerging tools in fast-paced AI/data environments.
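
To illustrate the streaming pattern above, here is a minimal PySpark Structured Streaming sketch; the broker address, topic name, schema, and threshold rule are hypothetical placeholders, not values from any engagement below.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Requires the spark-sql-kafka connector package on the classpath.
spark = SparkSession.builder.appName("anomaly-stream").getOrCreate()

# Hypothetical event schema for illustration.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "sensor-events")                 # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Toy threshold rule standing in for a real anomaly-detection model.
anomalies = events.filter(F.col("value") > 100.0)

query = (
    anomalies.writeStream.format("console")  # a real job would sink to a store or alerting system
    .outputMode("append")
    .start()
)
query.awaitTermination()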
Technical Skills
Languages: Python, Java, Scala, SQL, HiveQL, Spark SQL.
Scripting Languages: Bash, Python scripting, PowerShell.
Big Data Tools: Apache Spark (RDDs, DataFrames), Hadoop (HDFS, MapReduce, Hive), PySpark, Apache Airflow, NiFi, Kafka, Oozie, Databricks, Informatica, AWS Glue, AWS Data Pipeline, ETL pipelines.
Cloud Technologies: AWS (S3, EC2, RDS, Redshift, Lambda, Glue, Kinesis, EMR, SageMaker, IAM, KMS, Athena, CloudFormation), Terraform.
Machine Learning and Statistics: Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Random Forest, Support Vector Machines (SVM), K-Means Clustering, K-Nearest Neighbors (KNN), Gradient Boosting Trees, AdaBoost, PCA, LDA, NLP.
Machine Learning Technologies: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn, TensorFlow, Keras, PyTorch, Hugging Face, AWS ML, NLTK, spaCy, OpenCV.
Security & Compliance: AWS IAM, KMS, Encryption, HIPAA, GDPR Compliance.
DevOps & CI/CD: Git, GitHub, GitLab, Jenkins, AWS CodePipeline, Docker, Kubernetes.
Methodologies: Agile, Scrum, SDLC, Impact Analysis.
Professional Experience
Client: Thermo Fisher 2025 Jan - 2025 May
Role: Data Engineer (ML), Remote
Designed and deployed scalable data pipelines using Python, PySpark, Apache Kafka, and Airflow to ingest and process biomedical and diagnostic data in real-time and batch modes, supporting lab automation and scientific workflows (see the DAG sketch at the end of this section).
Architected AWS S3-based data lake solutions for secure, compliant storage of high-throughput lab instrumentation data, enabling downstream analytics, machine learning, and regulatory traceability.
Developed ETL workflows using Databricks, AWS Glue, and Informatica to transform clinical, genomic, and experimental datasets into analytics-ready formats, ensuring compliance with HIPAA and FDA 21 CFR Part 11.
Built reusable feature pipelines across S3, Snowflake, and hybrid SQL/NoSQL sources, accelerating development of ML models for applications like patient risk scoring and lab workflow automation.
Optimized distributed processing systems using Hadoop, Hive, and HBase to handle petabyte-scale genomic datasets, improving data accessibility for researchers and data scientists.
Implemented real-time ingestion pipelines using Apache NiFi and AWS Kinesis to integrate streaming data from lab instruments, ensuring minimal latency in high-throughput research settings.
Developed ML pipelines using TensorFlow, Scikit-learn, and PyTorch for predictive diagnostics, anomaly detection in lab instruments, and forecasting of clinical sample throughput.
Applied NLP techniques for analyzing lab notes, scientific publications, and support ticket logs using spaCy and NLTK, enabling semantic search and knowledge extraction.
Streamlined ML model deployment using MLflow, Docker, and Kubernetes, facilitating version-controlled, scalable inference in clinical and research environments.
Conducted exploratory and statistical data analysis with Pandas, NumPy, and Jupyter Notebooks to identify trends in biomedical data and drive R&D insights.
Packaged inference APIs using Flask/FastAPI and deployed them as containerized microservices on Kubernetes, enabling integration with internal lab management systems.
Automated infrastructure provisioning with Terraform across AWS environments, ensuring reproducibility and scalability of scientific computing workflows.
Built CI/CD pipelines using GitLab, Jenkins, and AWS CodePipeline to ensure rapid, validated deployment of data pipelines and ML models in a regulated environment.
Implemented strict access controls and encryption using AWS IAM, KMS, and S3 policies to enforce compliance with HIPAA and GDPR across all data workflows.
Created real-time dashboards with Power BI and AWS QuickSight, allowing scientists and lab managers to monitor pipeline health, sample status, and data quality metrics.
Enabled proactive incident response by integrating Elasticsearch and Kibana for centralized logging and rapid debugging of data/ML pipeline issues.
Collaborated with cross-functional teams including scientists, data analysts, and regulatory affairs to define requirements and deliver machine learning solutions aligned with healthcare and diagnostics goals.
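
A minimal Airflow DAG sketch of the daily ingest-and-transform pattern described in this section; the DAG id, task callables, and schedule are hypothetical placeholders (Airflow 2.4+ style).

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # Placeholder: pull raw instrument files from a landing zone.
    print("extracting raw lab data")

def transform(**context):
    # Placeholder: normalize schemas into analytics-ready tables.
    print("transforming to curated format")

with DAG(
    dag_id="lab_data_pipeline",      # hypothetical DAG id
    start_date=datetime(2025, 1, 1),
    schedule="@daily",               # Airflow 2.4+ keyword; earlier versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task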
Client: Citibank 2024 May - 2024 Dec
Role: Data Scientist, Remote
Built and productionized Python ML models (TensorFlow, PyTorch, Scikit-learn) for classification and anomaly detection, improving model AUC and reducing false positives.
Deployed inference microservices via Flask and FastAPI on Kubernetes with an MLflow registry and blue/green rollouts, cutting model release cycle time and enabling autoscaling at peak traffic (see the serving sketch at the end of this section).
Engineered real-time streaming pipelines (Kafka, Kinesis, NiFi, PySpark) that lowered data latency and powered near real-time risk and customer experience dashboards.
Designed an AWS S3-based data lake combined with a Snowflake feature store, integrating structured and unstructured data across sources, accelerating feature reuse and reducing model development lead time.
Optimized ETL/ELT workflows in Databricks, Glue, Informatica, and Airflow to process terabytes with cost-aware autoscaling, cutting compute spend while improving SLA adherence.
Implemented end-to-end observability and model performance monitoring (Elasticsearch, Prometheus/Grafana/Power BI) tracking latency, drift, and accuracy; reduced mean time to detect degradations.
Automated secure cloud infrastructure provisioning with Terraform, enforcing IAM/KMS policies to meet GDPR/HIPAA data protection controls.
Provided data modernization support by migrating legacy Hadoop/Hive systems to AWS, integrating governance frameworks that improved data quality and accelerated compliance audits.
Established scalable ML architecture blueprints from stakeholder use cases, empowering product and engineering teams to deliver modular, production-ready models with minimal rework and enhanced traceability.
Established CI/CD with MLOps practices (Git/GitLab, Jenkins, Docker, Kubernetes) with automated testing, model validation, and security scans, raising deployment frequency while lowering rollback incidents.
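
A minimal sketch of the FastAPI-plus-MLflow serving pattern described above; the registered-model URI, feature fields, and route are hypothetical placeholders.

import pandas as pd
import mlflow.pyfunc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical registered-model URI; a real deployment would resolve a
# governed stage or version from the MLflow registry.
model = mlflow.pyfunc.load_model("models:/fraud_classifier/Production")

class Features(BaseModel):
    amount: float          # illustrative feature
    merchant_risk: float   # illustrative feature

@app.post("/predict")
def predict(features: Features):
    # Wrap the single request in a one-row frame for the pyfunc interface.
    row = pd.DataFrame([features.dict()])
    score = model.predict(row)
    return {"score": float(score[0])}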
Client: Schlumberger 2020 Oct - 2023 Aug
Role: Data Engineer, Hybrid
Orchestrated distributed data pipelines using Apache Airflow on AWS to support reliable and automated ingestion of telemetry, drilling, and subsurface data, ensuring pipeline SLAs across global exploration datasets.
Developed DAGs for automated ETL workflows, reducing manual processing of seismic, geophysical, and equipment sensor data, improving data availability for real-time decision-making.
Led migration of legacy data warehouses and MDM platforms to AWS Redshift, enhancing scalability, query performance, and cross-domain analytics across E&P datasets.
Built scalable AWS cloud solutions using EC2, S3, Lambda, Redshift, and RDS, enabling secure and cost-effective storage and processing of petabyte-scale energy domain data.
Engineered event-driven ingestion and feature serving pipelines using Kinesis, Lambda, and Step Functions, supporting ML systems for equipment failure prediction and real-time drilling optimization.
Developed Spark-based transformation jobs in Python on EMR, improving data throughput and reducing processing time for high-volume geospatial and sensor datasets (see the job sketch at the end of this section).
Productionized incremental aggregation workflows on EMR Spark to serve demand forecasting and revenue optimization ML models for logistics, inventory, and fleet management.
Integrated application telemetry and user activity data into AWS Redshift, enabling real-time dashboards and analytics for internal software and customer-facing platforms.
Collaborated with ML engineers to deliver SageMaker-based pipelines, automating data preparation, feature engineering, and monitoring for predictive maintenance models.
Developed configurable data delivery pipelines in Python, synchronizing processed data to downstream AWS-hosted storage and analytics platforms used by reservoir engineers and field teams.
Designed cost-optimized AWS infrastructure for high availability and fault tolerance in energy analytics workloads, incorporating Auto Scaling and spot instance strategies.
Implemented a CI/CD pipeline using Docker, GitHub, and AWS CodePipeline, streamlining delivery of data engineering code and infrastructure-as-code deployments.
Performed rigorous data validation, cleansing, and schema enforcement on ingestion pipelines to ensure data quality and consistency across multi-source energy datasets.
Participated in Agile Scrum ceremonies, contributing to sprint planning, backlog grooming, and incremental delivery of data engineering features and enhancements.
Benchmarked and tuned AWS Glue and Athena queries, ensuring high-throughput, low-latency data access for analytics and regulatory reporting.
Documented technical design, data lineage, and operational runbooks, supporting compliance and audit-readiness across regulated data environments.
Partnered with domain experts and technical teams to maintain scalable AWS-based data infrastructure aligned with SLB's digital transformation and automation goals.
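
A minimal PySpark sketch of an EMR-style sensor aggregation job like those described above; the S3 paths, column names, and grouping keys are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-enrichment").getOrCreate()

# Placeholder S3 paths and column names; real jobs are parameterized per dataset.
raw = spark.read.parquet("s3://example-bucket/raw/sensors/")

daily = (
    raw.withColumn("event_date", F.to_date("event_ts"))
    .groupBy("rig_id", "event_date")  # hypothetical grouping keys
    .agg(
        F.avg("pressure").alias("avg_pressure"),
        F.max("vibration").alias("peak_vibration"),
    )
)

daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/sensor_daily/"
)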
Client: Target Retail 2018 Jul - 2020 Aug
Role: NLP Data Engineer, Remote
Improved demand forecasting using Python and XGBoost with engineered retail and seasonal features, lowering inventory carrying cost and reducing in-store waste exposure.
Operationalized NLP on customer conversations (VADER sentiment + key-phrase extraction) to track brand perception trends, feeding CX and marketing dashboards (see the sketch at the end of this section).
Built a graph-augmented, content-based product recommendation engine (Neo4j, purchase history, behavioral paths) powering product search and cross-sell personalization across digital channels.
Architected store-to-cloud data pipelines ingesting POS, loyalty, and e-commerce data into Snowflake, standardizing schemas and CDC patterns to support downstream ML feature stores.
Deployed ML-backed REST microservices for product search and recommendations using Flask + Scikit-learn (containerized, CI/CD-ready for EKS/ECS), enabling sub-second in-app inference.
Tuned large-scale Spark workloads on AWS EMR (autoscaling + Spot pricing), cutting compute cost and reducing batch runtimes for sales and behavioral enrichment jobs.
Implemented data quality and governance automation: Great Expectations validations in pipeline CI and Python automation of Alation catalog tagging for PII fields, halving manual report prep and strengthening GDPR/CCPA readiness.
Delivered interactive Tableau analytics unifying sentiment scores, forecast accuracy, CLV tiers, and product affinity graphs to guide merchandising, supply chain, and CX strategy.
Scaled team adoption and code quality by establishing Git-based peer review checklists, leading PoCs in graph DB and anomaly detection tech, and delivering virtual training on clustering, dimensionality reduction, and regularization techniques.
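
A minimal NLTK/VADER sketch of the sentiment-scoring step described above; the sample reviews are invented for illustration.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# Invented sample reviews for illustration.
reviews = [
    "Checkout was fast and the staff were helpful.",
    "My order arrived late and the item was damaged.",
]
for text in reviews:
    scores = sia.polarity_scores(text)  # neg/neu/pos plus compound in [-1, 1]
    print(round(scores["compound"], 3), text)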
Certification
Certified in AWS AI
Education
University of Texas: Master of Science in Computer Science, May 2025
GPA: 4.0/4.0