
Machine Learning Data Engineer

Location:
Cincinnati, OH, 45202
Posted:
June 02, 2025


Resume:

Sravanthi Guduru

972-***-**** | *.*************@*****.*** | LinkedIn | Location: Dallas, TX

Summary

• Data Engineer with 4+ years of experience in Hadoop, Spark, Kafka, Hive, and Snowflake, delivering scalable, cloud-integrated big data solutions across various industries

• Proficient in Python, PySpark, and Scala for big data processing, robust ETL development, and data analysis

• Utilized Python libraries including NumPy, Pandas, Scikit-Learn, TensorFlow, and SciPy for advanced data manipulation, modeling, and predictive analytics

• Worked with a range of databases including PostgreSQL, MongoDB, and MS SQL Server, optimizing performance, scalability, and reliability

• Engineered scalable enterprise data applications and ETL pipelines using the Hadoop ecosystem (HDFS, MapReduce, Hive, Pig, Sqoop), with workflow orchestration managed using tools like Zookeeper and Apache Airflow

• Optimized data warehousing solutions, encompassing Amazon Redshift, Snowflake, and traditional warehouses, emphasizing performance and cost-effectiveness

• Developed interactive dashboards and analytics using Power BI, Tableau, Power View, and Matplotlib

• Strong background in data warehousing, security, and governance, with a focus on performance optimization and automation

• Applied a wide range of machine learning models including Regression, Decision Trees, Random Forest, KNN, Neural Networks, and Clustering techniques

• Hands-on experience with AWS, Azure, and Google Cloud services to build scalable cloud-based data pipelines

• Implemented version control and CI/CD practices using Jenkins, Git, and GitHub for enhanced development reliability

Technical Skills

Programming : Python, SQL, Scala, Java, PySpark

Cloud Platforms : AWS (EC2, EMR, S3, Redshift, Lambda, Kinesis, Glue), Azure (Azure Databricks, Data Lake, Blob Storage, Azure Synapse Analytics), GCP (Google Cloud Storage, BigQuery, Cloud Dataflow, Compute Engine)

Big Data : Apache Spark, Hadoop (HDFS, MapReduce, Sqoop, Hive), Kafka, Airflow

Databases : PostgreSQL, MongoDB, SQL Server, DynamoDB, NoSQL

ETL Processes : Azure Data Factory (ADF), SSIS, Informatica, DBT, Alteryx

Data Warehousing : Amazon Redshift, Google BigQuery, Snowflake, SSRS, SSAS, Data Modeling

Data Analytics & Visualization : Pandas, NumPy, Matplotlib, TensorFlow, Power BI, Tableau

DevOps & Version Control : Azure DevOps, Git, GitHub, Docker, Jenkins, Kubernetes, CI/CD

Microservices & APIs : RESTful APIs, Swagger/OpenAPI, Microservices, Flask, FastAPI, API Versioning

Machine Learning Basics : Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis

Experience

Christus Health Aug 2024 – Current

Data Engineer Texas, USA

• Built and refined end-to-end big data pipelines incorporating Hadoop, Apache Spark, and Apache Kafka, improving data processing efficiency by 42%

• Monitored and optimized Azure Data Factory pipelines using Azure Monitor and Log Analytics, reducing system downtime by 25% and ensuring 99.9% reliability

• Led key initiatives in the organization’s strategic migration from Azure to Google Cloud Platform (GCP), developing comprehensive migration strategies and implementation roadmaps

• Designed hybrid cloud architecture to facilitate seamless data flow between Azure and GCP environments during the phased migration process

• Migrated select workloads from Azure Data Factory to Cloud Composer and Dataflow on GCP, maintaining continuity of operations while improving processing efficiency by 15%

• Implemented BigQuery as a complementary data warehouse solution alongside Snowflake, optimizing cost management across cloud platforms and reducing infrastructure expenses by 30%

• Developed Python scripts for automated data loading and staging, streamlining deployments using both Azure DevOps and Google Cloud Build for CI/CD, reducing deployment time by 50%

• Established data pipelines utilizing GCP’s Pub/Sub and Dataflow services to replace Azure event processing systems, ensuring minimal disruption during transition
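
As a hedged illustration of that Pub/Sub-to-Dataflow pattern, the Apache Beam (Python) sketch below reads events from a subscription and appends them to BigQuery; the project, subscription, and table names are invented placeholders, and the target table is assumed to already exist.

    # Minimal Beam sketch: Pub/Sub -> parse -> BigQuery (names are illustrative).
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        options = PipelineOptions(streaming=True)  # runner/region come from CLI flags
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadEvents" >> beam.io.ReadFromPubSub(
                    subscription="projects/my-project/subscriptions/events-sub")
                | "ParseJson" >> beam.Map(json.loads)
                | "AppendToBQ" >> beam.io.WriteToBigQuery(
                    "my-project:analytics.events",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
            )

    if __name__ == "__main__":
        run()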

• Implemented data governance policies using Collibra, ensuring data ownership, stewardship, and compliance across data domains

• Designed and implemented RESTful APIs using Python Flask and FastAPI, enabling secure access to data services while reducing integration time by 40%
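
A minimal FastAPI sketch of such a data-service endpoint follows; the resource name, records, and versioned path are hypothetical, with an in-memory dict standing in for the real query layer.

    # Hypothetical read-only data-service endpoint with a versioned path.
    from fastapi import FastAPI, HTTPException

    app = FastAPI(title="Data Services", version="1.0")

    # In-memory stand-in for a real warehouse/database lookup.
    RECORDS = {"r001": {"id": "r001", "status": "active"}}

    @app.get("/v1/records/{record_id}")
    def get_record(record_id: str):
        record = RECORDS.get(record_id)
        if record is None:
            raise HTTPException(status_code=404, detail="record not found")
        return record

    # Run locally with: uvicorn app:app --reload  (assumes the file is app.py)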

• Built scalable data architectures spanning Azure Synapse and Google BigQuery, enabling cross-platform analytics and reporting capabilities

• Integrated dbt with both Snowflake and BigQuery environments, achieving 47% faster model execution and improved resource utilization

• Designed and implemented scalable data ingestion pipelines using Apache Airflow on Google Cloud Composer, handling 3x more data volume with 20% lower processing time
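
A skeletal Airflow DAG in the spirit of that ingestion work, runnable on Cloud Composer; the DAG id, schedule, and task callables are placeholders rather than the production pipeline.

    # Illustrative daily ingestion DAG; task bodies are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull from source")  # placeholder extract step

    def load():
        print("load into warehouse")  # placeholder load step

    with DAG(
        dag_id="daily_ingest",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task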

• Transitioned containerized workloads from Azure Kubernetes Service (AKS) to Google Kubernetes Engine (GKE), improving resource efficiency by 45%

• Leveraged Apache Spark on both Azure Databricks and Dataproc for large-scale data processing, reducing batch processing time by 55%

• Implemented data security protocols across both Azure and GCP environments, ensuring consistent governance during the migration phase

• Designed and implemented complex PostgreSQL database schemas, optimizing for performance and scalability, with 36% faster query response times

• Designed normalized OLTP schemas in PostgreSQL to support high-volume transactional applications, ensuring ACID compliance and real-time write efficiency
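
To make the PostgreSQL schema work in the last two bullets concrete, here is a hedged psycopg2 sketch of indexed OLTP DDL; the DSN, table, and index are illustrative, not the actual production schema.

    # Hypothetical OLTP table plus a covering index for the hot lookup path.
    import psycopg2

    conn = psycopg2.connect("dbname=app user=etl")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS orders (
                order_id    BIGSERIAL PRIMARY KEY,
                customer_id BIGINT NOT NULL,
                created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
                amount      NUMERIC(12, 2) NOT NULL
            );
        """)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_orders_customer_created
            ON orders (customer_id, created_at DESC);
        """)
    # The connection context manager commits the transaction on success.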

• Led Agile data teams with daily stand-ups and sprint planning, improving delivery times and documentation quality

FM Global Aug 2023 – Aug 2024

Data Engineer Texas, USA

• Designed and optimized enterprise-scale ETL pipelines using Azure Data Factory for cloud workflows and SSIS for legacy on-premise integrations, standardizing data processing and improving quality by 40%

• Accelerated data processing by 60% using Apache Spark and Hadoop (HDFS, MapReduce, Hive, Sqoop)
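
A small PySpark sketch of that batch pattern: read raw files from HDFS, aggregate, and persist a partitioned Hive table. Paths, columns, and table names are made up for illustration.

    # Toy Spark batch job: HDFS parquet -> aggregate -> partitioned Hive table.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("claims_batch")
             .enableHiveSupport()
             .getOrCreate())

    raw = spark.read.parquet("hdfs:///data/raw/claims/")  # placeholder path
    daily = (raw.groupBy("claim_date", "region")
                .agg(F.sum("amount").alias("total_amount")))
    (daily.write.mode("overwrite")
          .partitionBy("claim_date")
          .saveAsTable("analytics.daily_claims"))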

• Created and managed detailed reports in SQL Server Reporting Services (SSRS), delivering data insights that drove informed business decisions

• Implemented Azure Databricks with Delta Lake, enabling real-time analytics and accelerating queries by 50%

• Developed high-volume real-time data pipelines using Scala, Apache Kafka, and Airflow, ensuring zero data loss and improving workflow orchestration

• Developed extract scripts using data views with auditing, error handling, and data integrity checks

• Maintained field mapping documentation and managed databases via SQL Server Management Studio

• Enforced schema evolution and versioning in Delta Lake, maintaining 100% backward compatibility and data integrity
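
A minimal sketch of that schema-evolution behavior, assuming a Delta-enabled Spark session (e.g. on Databricks); the table path and columns are placeholders. Appending with mergeSchema adds new columns in place while existing data stays readable.

    # New batch arrives with a column the existing Delta table lacks.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta_demo").getOrCreate()

    df_new = spark.createDataFrame(
        [("e1", "2024-01-01", "web")], ["event_id", "event_date", "channel"])

    # mergeSchema evolves the table schema on append (backward compatible).
    (df_new.write.format("delta")
           .mode("append")
           .option("mergeSchema", "true")
           .save("/mnt/delta/events"))  # placeholder table path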

• Strengthened data security by developing IAM policies and role-based access control (RBAC), ensuring controlled access management

• Deployed Kubernetes clusters with auto-scaling and load balancing, optimizing resource utilization by 40%

• Ensured regulatory compliance and data lineage by implementing Azure Purview for data governance

• Created advanced T-SQL scripts for automated data validation, implementing error handling, logging, and dynamic parameter management to enhance data quality and reliability

• Enabled real-time analytics by integrating Azure Databricks with Snowflake, leveraging Delta Lake for structured streaming and up-to-date insights across the pipeline

• Integrated machine learning models using Scikit-Learn and TensorFlow, applying classification and clustering algorithms for business insights; leveraged SciPy for numerical optimization and preprocessing
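
A toy scikit-learn version of that classification workflow, with synthetic data standing in for the real features:

    # Synthetic-data stand-in for the real classification task.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))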

• Engineered reusable ETL mappings in Informatica PowerCenter, improving data consistency across 50+ data sources

• Led data migration efforts from Teradata into Azure Synapse, ensuring a seamless transition with minimal downtime

• Implemented OLAP cube structures in Azure Synapse and SSAS to support multi-dimensional analysis and business reporting with minimal query latency

• Implemented comprehensive API versioning and documentation using Swagger/OpenAPI, ensuring backward compatibility and streamlined developer onboarding

• Facilitated efficient project management by leveraging collaboration tools such as Git and Jira to streamline workflows and enhance team communication

• Automated data reconciliation with DBT and Python, reducing manual efforts by 70%
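
The Python half of that reconciliation can be sketched with pandas as below; in practice both frames would come from warehouse queries (the dbt side is not shown), and the key and columns are placeholders.

    # Outer-join reconciliation: rows missing on either side, plus value drift.
    import pandas as pd

    source = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
    target = pd.DataFrame({"id": [1, 2, 4], "amount": [10.0, 25.0, 40.0]})

    merged = source.merge(target, on="id", how="outer",
                          suffixes=("_src", "_tgt"), indicator=True)
    missing = merged[merged["_merge"] != "both"]
    mismatched = merged[(merged["_merge"] == "both")
                        & (merged["amount_src"] != merged["amount_tgt"])]
    print(f"{len(missing)} missing rows, {len(mismatched)} value mismatches")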

• Enabled hybrid analytics by transforming OLTP data into dimensional OLAP models using DBT and Snowflake Streams and Tasks, enhancing analytical query performance by 40%

• Streamlined deployments using Azure DevOps, GitHub Actions, and CI/CD, ensuring faster go-to-market

Oakspro Software Solution Jan 2019 - Jul 2021

Data Engineer Hyderabad, India

• Built and refined ETL pipelines using Python and PySpark on AWS EMR, while separately utilizing AWS Lambda and Step Functions for lightweight, serverless data processing workflows

• Leveraged Apache Airflow, Snowflake, and DBT to streamline data workflows and enhance system efficiency

• Designed serverless architectures using AWS Lambda, Kinesis, and S3, reducing compute costs by 30%
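
One plausible shape for such a serverless flow is sketched below: a Python Lambda handler that decodes Kinesis records and lands them in S3. The bucket and key scheme are invented for illustration.

    # Lambda handler for a Kinesis event source; writes each record to S3.
    import base64
    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            key = f"events/{record['kinesis']['sequenceNumber']}.json"
            s3.put_object(Bucket="my-data-bucket",  # placeholder bucket
                          Key=key, Body=json.dumps(payload))
        return {"processed": len(event["Records"])}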

• Built and maintained REST APIs and microservices, enabling seamless data integration across enterprise systems

• Processed millions of events per second in Apache Spark (Scala) and Kafka for real-time analytics
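
The original consumers were written in Scala; for consistency with the other sketches here, the same consume-and-parse shape is shown with PySpark Structured Streaming, assuming the Kafka connector package is on the classpath and using placeholder broker, topic, and checkpoint values.

    # Structured Streaming read from Kafka; console sink for illustration.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka_stream").getOrCreate()

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # Kafka values arrive as bytes; cast before any downstream parsing.
    parsed = events.select(F.col("value").cast("string").alias("json"))

    query = (parsed.writeStream.format("console")
             .option("checkpointLocation", "/tmp/ckpt")
             .start())
    query.awaitTermination()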

• Integrated OLTP systems with real-time ETL pipelines using Apache Kafka and Airflow to synchronize transactional data into OLAP layers in Snowflake

• Improved legacy on-prem data ingestion pipelines using SSIS, increasing reliability and performance by 50%

• Migrated a legacy on-premises data warehouse to Snowflake on AWS, reducing infrastructure costs by 30%

• Enhanced query performance and reduced run time by optimizing SQL queries in Amazon Redshift

• Implemented CDC and incremental loading in SQL Server using native CDC features and integrated with Apache Flink via Kafka for real-time replication

• Refined dimensional data models in Erwin, enhancing reporting efficiency by 40%

• Reduced query execution time by 50% with fine-tuned SSRS reports and query tuning

• Developed complex T-SQL stored procedures to optimize data retrieval and transformation, improving query performance by implementing efficient indexing strategies

• Developed and maintained project documentation on databases, ensuring compliance with departmental procedures, utilizing AWS RDS for PostgreSQL and Athena for querying and reporting

• Monitored and optimized data pipelines using AWS CloudWatch and Datadog, resolving performance bottlenecks and ensuring high system uptime

• Automated data transformation and ETL workflows using Alteryx, reducing manual efforts by 50% and improving data accuracy

• Automated workflows by integrating Python scripts in AWS Lambda with Glue, S3 services, and Jenkins to enhance data processing efficiency
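
A hedged sketch of that Lambda-to-Glue handoff: an S3-triggered handler starts a Glue job run for each new object. The Glue job name and argument key are hypothetical.

    # S3 event -> start a Glue job run for the uploaded object.
    from urllib.parse import unquote_plus
    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            glue.start_job_run(
                JobName="transform-job",  # placeholder job name
                Arguments={"--input_path": f"s3://{bucket}/{key}"})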

• Wrote Unix scripts for automation, reducing system downtime and enhancing operational reliability

• Ensured data security and compliance by implementing AWS IAM policies, KMS encryption, and robust security controls like row-level security and column encryption, meeting GDPR and CCPA regulatory requirements

• Followed Agile methodologies, engaging in daily stand-ups and sprint planning to enhance delivery timelines

Education

Master of Science in Data Science (CGPA of 3.9) Aug 2021 - May 2023

University of North Texas, Denton, Texas

Bachelor of Science in Computer Science and Engineering (CGPA of 9.7) Aug 2016 - Sep 2020

Malla Reddy Engineering College, Hyderabad, India


