
Data Engineer Machine Learning

Location:
United States
Salary:
80000
Posted:
October 15, 2025

Resume:

SAHITHI SHANAGONDA

Data Engineer

Texas 325-***-**** ***********@*****.*** LinkedIn

SUMMARY

Data Engineer with 5+ years of experience designing and optimizing scalable data architectures across healthcare, finance, and enterprise environments. Adept at managing large-scale datasets and developing high-performance pipelines that drive operational efficiency and real-time analytics.

Proficient in Python, SQL (T-SQL, SSIS, SSMS, SSRS), R, and SAS for advanced data engineering, statistical analysis, and ETL development, enabling seamless data integration and business insights.

Extensive hands-on experience building distributed data processing solutions using Hadoop ecosystems (Hive, Pig, Spark) and machine learning libraries (Scikit-Learn, TensorFlow, OpenCV) to support predictive analytics and modeling.

Architected robust, cloud-native pipelines on AWS (EC2, S3, RDS, CloudWatch, DynamoDB, and Athena), Azure HDInsight, and EMR, streamlining data ingestion, storage, and processing in compliance with enterprise security standards.

Skilled in developing and deploying ETL solutions using SSIS, Informatica, Talend, and Apache Airflow, ensuring data quality, consistency, and automation across workflows.

Strong command over relational and NoSQL databases, including SQL Server, MySQL, PostgreSQL, MongoDB, and Cassandra, with expertise in schema design, query optimization, and performance tuning.

Collaborated with cross-functional teams including data scientists, analysts, and business stakeholders to deliver actionable dashboards and reports via Tableau, Power BI, and SharePoint.

Delivered scalable pipelines processing 10,000+ records daily from diverse data sources, enabling data-driven decision-making and boosting operational efficiency by 20%.

Applied DevOps practices, leveraging Docker and Jenkins for CI/CD, accelerating deployment cycles and enhancing reliability of production data workflows.

Demonstrated a deep understanding of data governance, security, and compliance while migrating on-premises data warehouses to cloud platforms.

CORE QUALIFICATIONS

Languages: Python, SQL, Shell Scripting, R, T-SQL, Scala

Data Quality and Governance: Apache Atlas, Collibra, Informatica Data Quality, Talend Data Quality, Trifacta

Cloud Services: AWS (S3, EC2, EMR, Redshift, Athena, IAM), Azure (Databricks, ADF, Synapse Data Warehouse, Stream Analytics), Kafka

CI/CD and Version Control: Jenkins, Git, GitLab, Bitbucket

Databases: MySQL, PostgreSQL, Snowflake, Oracle, DB2, Hive, Redshift, Azure SQL DW, Databricks Database

Data Visualization and Reporting: Tableau, Power BI, Looker, QlikView, Google Data Studio

Python Libraries: NumPy, Pandas, Beautiful Soup, PySpark, SQLAlchemy, PyTest, Apache Airflow

Big Data Technologies: Hadoop (HDFS, MapReduce, YARN), Apache Spark, Apache Kafka, Apache Hive, Sqoop

EXPERIENCE

Data Engineer Aug 2024 – Current

Molina Healthcare, TX

Crafted Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation, enhancing data processing speed by 25% (illustrated in the first sketch below).

Utilized Azure cloud services, including HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB, and SQL DWH.

Managed data from multiple third-party sources, applying transformations using PySpark and custom-built packages, ensuring a 99% data accuracy rate and 95% consistency across datasets.

Employed DataStage Director for scheduling, validating, running, and monitoring jobs, achieving a 98% success rate in data pipeline execution.

Engineered and fine-tuned complex MapReduce and Spark jobs for data processing, transformation, and enrichment, maximizing distributed computing capabilities and reducing computational costs by 20%.

Integrated advanced analytics models and libraries such as PyTorch into the Hadoop environment, enhancing machine learning outcomes and increasing predictive accuracy by 15%.

Undertook proof of concepts (POCs) for delta table package integration and removing specific Active Directory groups, resulting in a 25% improvement in data management practices.

Formulated automated Databricks workflows for parallel data loads, improving data processing speed by 40% and reliability by 30%.

Implemented and optimized SSIS, SSMS, and SSRS solutions to streamline data integration, management, and reporting processes, enhancing operational efficiency and decision-making capabilities.

Created database objects like tables, views, stored procedures, triggers, packages, and functions using T-SQL for efficient data management and structure.

Synchronized Kafka with Spark Streaming for real-time data analysis, enabling a 50% faster response rate in decision-making processes (illustrated in the second sketch below).

Participated in Agile development methodologies, managing sprint planning, daily stand-ups, and incremental deployments that contributed to the rollout of three critical applications, including HIPAA-compliant real-time monitoring platforms.

Established an ETL framework using Spark and Hive, reducing overall data processing time by 30% and enhancing the efficiency of data retrieval operations.
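The two sketches below illustrate the kinds of Spark work described in this role; they are minimal, hypothetical examples rather than code from these projects. The first corresponds to the PySpark/Spark-SQL extraction, transformation, and aggregation bullet; the bucket paths, column names, and claims-style schema are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-etl").getOrCreate()

# Extract: read raw records from an illustrative data lake path
claims = spark.read.parquet("s3://example-bucket/raw/claims/")

# Transform: de-duplicate, cast types, and drop invalid rows
clean = (
    claims
    .dropDuplicates(["claim_id"])
    .withColumn("paid_amount", F.col("paid_amount").cast("double"))
    .filter(F.col("paid_amount") > 0)
)

# Aggregate with Spark SQL
clean.createOrReplaceTempView("claims_clean")
summary = spark.sql("""
    SELECT member_id,
           COUNT(*)         AS claim_count,
           SUM(paid_amount) AS total_paid
    FROM claims_clean
    GROUP BY member_id
""")

# Load: write the aggregate back to a curated zone
summary.write.mode("overwrite").parquet("s3://example-bucket/curated/claims_summary/")
```

The second corresponds to the Kafka and Spark Streaming bullet, shown here with Spark Structured Streaming. The broker address, topic, and event schema are assumptions, and the job requires the spark-sql-kafka connector on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Hypothetical event schema; the real topics and fields are not in the resume
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("metric", DoubleType()),
])

# Read the raw stream from Kafka
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Parse the JSON payload and aggregate over 1-minute event-time windows
events = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")
windowed = (
    events
    .withWatermark("event_time", "5 minutes")
    .groupBy(F.window("event_time", "1 minute"))
    .agg(F.avg("metric").alias("avg_metric"))
)

# Write running aggregates to the console sink for demonstration purposes
query = windowed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```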

Data Engineer Oct 2020 - Jul 2022

KPMG, India

Revitalized a 10GB+ warehouse infrastructure, elevating automation to 60%. This overhaul minimized downtime, ensuring maximized data accessibility and operational continuity.

Utilized Spark's in-memory capabilities to handle large datasets on an S3 data lake: loaded data into S3 buckets, then filtered it and loaded it into Hive external tables (illustrated in the first sketch below).

Strong hands-on experience creating and modifying SQL stored procedures, functions, views, indexes, and triggers.

Performed ETL operations using Python, SparkSQL, S3, and Redshift on terabytes of data to obtain customer insights.

Involved heavily in setting up the CI/CD pipeline using Jenkins, Terraform, and AWS.

Performed end-to-end architecture and implementation assessments of various AWS services, including Amazon EMR, Redshift, and S3.

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Good understanding of other AWS services such as S3, EC2, IAM, and RDS.

Experience with orchestration and data pipeline services such as AWS Step Functions, Data Pipeline, and Glue.

Transformed data using AWS Glue DynamicFrames with PySpark, cataloged the transformed data with crawlers, and scheduled the job and crawlers using the Glue workflow feature (illustrated in the second sketch below).

Designed and managed public/private cloud infrastructures using Amazon Web Services (AWS), including EC2, S3, CloudFront, Elastic File System, and IAM, enabling automated operations.

Deployed CloudFront to deliver content, further reducing the load on the servers.
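The two sketches below illustrate the S3-to-Hive loading and AWS Glue bullets above; they are minimal, hypothetical examples. Bucket paths, database, table, and column names are assumptions, and the Hive external table is assumed to already exist.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive support lets Spark write to external tables registered in the metastore
spark = (
    SparkSession.builder
    .appName("s3-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Read raw files previously landed in an S3 bucket (illustrative path)
raw = spark.read.option("header", "true").csv("s3a://example-bucket/raw/transactions/")

# Filter incomplete records before exposing them to downstream consumers
filtered = raw.filter(F.col("amount").isNotNull() & (F.col("status") == "COMPLETE"))

# Append into a Hive external table assumed to point at a curated S3 location
filtered.write.mode("append").insertInto("curated.transactions")
```

The Glue sketch assumes a source table already cataloged by a crawler and runs only inside the Glue job environment, where the awsglue library is available.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered by a crawler in the Glue Data Catalog (illustrative names)
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"
)

# Rename and cast columns with a mapping transform
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("order_total", "string", "order_total", "double"),
    ],
)

# Write Parquet back to S3; a second crawler can catalog the output,
# and a Glue workflow can chain the job and crawlers on a schedule
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```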

Data Analyst May 2018 - Sep 2020

Trigent Software, India

Leveraged R's integration with big data tools (like SparkR, RHIPE) for analyzing massive datasets, resulting in a 20% improvement in handling large-scale data and a 15% speed increase in data processing.

Expert in Python with a focus on Pandas, NumPy, and scikit-learn for efficient data analysis; achieved a 30% boost in data processing efficiency through optimized code (illustrated in the sketch below).

Skilled in managing MySQL databases, including data storage, retrieval, and complex query formulation. Ensured data accuracy and integrity, resulting in a 20% improvement in database performance.

Experienced in creating dynamic and insightful visualizations using Power BI. Effectively enhanced data storytelling and analysis, leading to a 15% improvement in data presentation and understanding.

Implemented Collibra for enhanced data governance and management, leading to a 20% improvement in data quality and a 15% increase in compliance with data standards.
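A minimal sketch of the kind of Pandas/NumPy optimization alluded to in the "optimized code" bullet above; the dataset and the specific optimization are assumptions, shown here as replacing a row-by-row apply with vectorized arithmetic.

```python
import numpy as np
import pandas as pd

# Illustrative data; the real columns are not described in the resume
df = pd.DataFrame({
    "quantity": np.random.randint(1, 10, size=100_000),
    "unit_price": np.random.uniform(1.0, 50.0, size=100_000),
})

# Slower pattern: row-by-row apply
slow_total = df.apply(lambda row: row["quantity"] * row["unit_price"], axis=1)

# Faster pattern: vectorized arithmetic computes the same result in one pass
fast_total = df["quantity"] * df["unit_price"]

assert np.allclose(slow_total, fast_total)
```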

EDUCATION

Master's: CS Aug 2022 – May 2024

Clark University, Worcester, Massachusetts, USA

Bachelor's: CSE Sep 2012 – May 2016

Vidya Jyothi Institute of Technology, Hyderabad, India

PROJECTS

• Credit Card Fraud Detection (Python): Developed a machine learning model with an accuracy rate of 95% using SMOTE, Random Forest, and AdaBoost to detect fraudulent transactions (see the first sketch after the project list).

• YouTube Data Analysis (AWS Lambda, S3, Glue, Athena, QuickSight, and Python): Implemented an ETL pipeline, improved query performance by 50%, and visualized data with QuickSight dashboards.

• Tuberculosis Prediction (Python, NumPy, Pandas, Scikit-learn, TensorFlow): Explored 5 distinct architectures combining a traditional CNN, VGG16, and random forest, achieving 97% accuracy; further enhanced the VGG16 and CNN models with TensorFlow and Scikit-learn for an optimized outcome.

• Stack Overflow Data Analysis (PySpark, Python, Hive, Pig, and GCP): Analyzed 20+ business KPIs, designed a data pipeline from BigQuery to cloud storage, and implemented text classification models with 88% accuracy using Logistic Regression and Random Forests (see the second sketch after the project list).

• Group Attendance System (Python, React, Scikit-learn, TensorFlow, OpenCV, and SMTP): Augmented dataset capacity by 90% and developed an automatic attendance system with 92% accuracy using a CNN-based machine learning and computer vision architecture.
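The sketches below outline two of the projects above in minimal form; they are illustrative reconstructions, not the original project code. The fraud-detection sketch assumes a CSV file named creditcard.csv with a "Class" label column (both assumptions) and pairs SMOTE from imbalanced-learn with a Random Forest; the AdaBoost model mentioned in the project is omitted for brevity.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Illustrative load; file name and label column are assumptions
data = pd.read_csv("creditcard.csv")
X = data.drop(columns=["Class"])
y = data["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority (fraud) class on the training split only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Train a Random Forest on the balanced training set
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf.fit(X_res, y_res)

# Evaluate on the untouched, imbalanced test split
print(classification_report(y_test, clf.predict(X_test)))
```

The second sketch corresponds to the Stack Overflow project's Logistic Regression text classifier; the example posts and tags are toy placeholders, not the project's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy posts and tags standing in for Stack Overflow data
posts = [
    "how to join two dataframes in pyspark",
    "segmentation fault with pointer arithmetic",
    "groupby aggregation in pandas",
    "undefined behavior when dereferencing a null pointer",
]
labels = ["python", "c++", "python", "c++"]

# TF-IDF features feeding a Logistic Regression classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(posts, labels)

print(model.predict(["optimize a pandas dataframe merge"]))
```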


