
APURVA ASHOK SARODE

Data Engineer

+1-732-***-****

**************@*****.***

PROFESSIONAL SUMMARY:

Over 10 years of IT experience in Data Engineering and Software Development, with a strong focus on AI/ML applications across various business domains.

Strong expertise in building ETL pipelines for batch and streaming data using PySpark and Spark SQL.

Extensive experience with AWS cloud technologies, including EC2, EMR, S3, Lambda, Athena, and Redshift.

Proficient in data analysis using Python, with robust problem-solving and troubleshooting skills.

Skilled in Data Engineering technologies, such as Hadoop 2, Spark, and Elastic MapReduce.

Adept at performing Exploratory Data Analysis (EDA), Root Cause Analysis, and Impact Analysis on large datasets.

Proficient in deploying and managing applications on the Microsoft Azure cloud platform.

Experienced in performance tuning, query optimization, data transformation services, and database security.

Skilled in data visualization using Python and Tableau.

Experienced in Data Migration from RDBMS to Snowflake cloud data warehouse.

Hands-on experience in data manipulation with Python and SAS.

Familiar with AI and deep learning platforms like TensorFlow and PyTorch.

Extensive experience in text analytics and creating data visualizations and dashboards with Python and Tableau.

Strong knowledge of Base SAS, SAS/Stat, SAS/Access, SAS/Macros, SAS/ODS, and SAS/SQL in Windows environments.

Proficient in normalization and de-normalization techniques for OLTP and OLAP systems, with a solid understanding of relational and multidimensional modeling concepts.

Experienced in dimensional modeling using snowflake schema methodologies for data warehouse and integration projects.

Experience in modeling and implementing relational and multidimensional data warehouses.

Expertise in building and automating scalable ETL/ELT pipelines using Control-M Scheduler.

Skilled in T-SQL programming, including creating stored procedures, user-defined functions, constraints, queries, joins, keys, indexes, data import/export, triggers, tables, views, and cursors.

Designed and scheduled data workflows for batch and real-time processing using Control-M integrated with AWS and Azure services.

Expert in evaluating and reviewing test plans against functional requirements, design documents, policies, procedures, and regulatory requirements.

EDUCATION:

MS in Computer Science

BS in Computer Science

TECHNICAL SKILLS:

Languages

Python, Java, Scala, SQL

OLAP Tools

OLAP Suite, Business Objects

ETL Tools

SSIS, AWS Glue, Apache Airflow, EC2, Vertex AI, EMR, S3, Lambda, Athena, Redshift, Azure Data Lake Storage (ADLS), Azure Synapse Analytics, Azure Databricks, and Azure Data Factory

Databases

Oracle, MS SQL Server, MS Access, DB2

Cloud Platform

AWS, Azure, GCP

Big Data

Apache Spark, Apache Kafka, HBase, Hive, MongoDB, Hadoop

Operating Systems

Windows, Linux, MacOS

CERTIFICATIONS

Certified Azure Data Engineer

Certified as a Big Data Engineer

PROFESSIONAL EXPERIENCE

NeueHealth Inc., Minneapolis, MN Dec 2024 – Present

Data Engineer

Responsibilities:

Develop and maintain Spark-Scala ETL pipelines to efficiently extract, transform, and load large healthcare datasets (e.g., patient records, clinical data, claims data).

Ensure data integrity, cleanliness, and consistency in pipelines.

Work with large-scale datasets to apply complex transformations using Spark (such as aggregations, joins, filtering, and cleaning).

Implement data ingestion from diverse sources (RDS, S3, DynamoDB, on-prem, APIs) into AWS Glue or EMR.

Build and orchestrate workflows using AWS Glue Workflows, Step Functions, or Apache Airflow.

Collaborate with Data Scientists and ML Engineers to prepare and curate large-scale structured and unstructured datasets for AI/ML model training and deployment using Vertex AI.

Ensure transformations meet business requirements and quality standards.

Integrate data from multiple healthcare data sources (such as EHR, claims, and other third-party systems) into Azure cloud environments using Azure Data Factory, Databricks, or other relevant tools.

Work on building and managing the bronze, silver, and gold layers in the data lake, ensuring proper storage, data processing, and accessibility of healthcare data (a PySpark sketch of this flow follows the environment list below).

Apply a strong understanding of Spark architecture to develop PySpark and Spark-SQL applications on Azure Databricks.

Architect and maintain data lakes on Amazon S3, ensuring ACID compliance with tools like Apache Hudi, Delta Lake, or Iceberg.

Migrated key systems from on-premises hosting to Azure Cloud Services; wrote SnowSQL queries against Snowflake.

Bronze: Raw data ingestion layer (e.g., unprocessed data from various sources).

Silver: Cleaned, structured, and transformed data (processed and refined for analytics).

Gold: Aggregated, business-critical data used for insights and reporting.

Developed PySpark scripts for large-scale data processing, enhancing the organization’s analytics capabilities.

Used Python (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.

Implemented job orchestration and dependency management through Control-M to streamline complex multi-system workflows.

Designed ML-ready datasets for predictive healthcare analytics, ensuring proper feature engineering and data preparation.

Utilize services like Azure Data Lake Storage (ADLS), Azure Synapse Analytics, Azure Databricks, and Azure Data Factory to implement the Medallion Architecture.

Design and implement a fully operational, production-grade, large-scale data solution on the Snowflake Data Warehouse.

Collaborated with DevOps to integrate Control-M scheduling into CI/CD pipelines for data workflows.

Ensure data flow between layers is automated, efficient, and scalable.

Environment: Azure Data Lake Storage (ADLS), Control-M, PySpark, Azure Synapse Analytics, Azure Databricks, Azure Data Factory, ETL, Snowflake, Spark, Scala, AWS Glue, EMR, S3, Athena, Redshift, Lambda, Step Functions, CloudWatch, Apache Airflow, Glue Workflows
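Below is a minimal, illustrative PySpark sketch of the bronze/silver/gold flow described above, assuming Delta Lake tables on Databricks. The storage paths, claims dataset, and column names (claim_id, member_id, claim_amount, plan_id, service_date) are hypothetical placeholders, not the actual production schema.

# Hypothetical medallion flow: raw claims -> cleaned claims -> daily aggregates.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-claims").getOrCreate()

# Bronze: land raw claim files as-is, tagged with ingestion metadata.
bronze = (spark.read.json("abfss://raw@lakeacct.dfs.core.windows.net/claims/")
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").save("/mnt/lake/bronze/claims")

# Silver: deduplicate, enforce types, and drop rows missing the join key.
silver = (spark.read.format("delta").load("/mnt/lake/bronze/claims")
          .dropDuplicates(["claim_id"])
          .withColumn("claim_amount", F.col("claim_amount").cast("double"))
          .filter(F.col("member_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/claims")

# Gold: business-ready daily aggregates for reporting.
gold = (silver.groupBy("plan_id", F.to_date("service_date").alias("service_dt"))
        .agg(F.sum("claim_amount").alias("total_paid"),
             F.countDistinct("member_id").alias("members")))
gold.write.format("delta").mode("overwrite").save("/mnt/lake/gold/claims_daily")

Persisting each layer as its own Delta table keeps the gold aggregates reproducible and easy to audit.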

Premier Bank, Defiance, OH Jun 2023 – Nov 2024

Data Engineer

Responsibilities:

Implemented a comprehensive project covering data acquisition, wrangling, exploratory data analysis, model development, and evaluation.

Handled both structured and unstructured data, performing cleaning, descriptive analysis, and dataset preparation.

Automated Metadata management using AWS Glue Crawlers and maintained a centralized Data Catalog to support compliance and discoverability.

Developed machine learning and predictive analytics modules using PySpark on Hadoop within AWS, including a Python-based distributed random forest.

Utilized commercial data mining tools, primarily Python, for various job requirements.

Created diverse data visualizations with Python libraries and Tableau.

Processed multiple data formats such as JSON, XML, CSV, and XLSX, applying machine learning algorithms in Python.

Conducted statistical analysis and employed suitable data visualization techniques to derive meaningful insights.

Leveraged Python to develop models and algorithms for analytic purposes.

Executed data profiling to understand data patterns, locations, dates, and times.

Developed NLP methods to process large unstructured datasets, extracting signals from noise and providing personalized patient-level insights to enhance our analytics platform.

Worked on migrating objects from DB2 to Snowflake.

Automated ETL processes using PySpark 3.2 in AWS Glue and Apache Airflow 2.7, complemented by cron scripting on UNIX for seamless job scheduling (a simplified DAG sketch follows this section's environment list).

Automated job monitoring, alerts, and recovery procedures via Control-M to ensure SLA compliance and minimize downtime.

Monitored Glue and EMR job performance via CloudWatch, identifying bottlenecks and reducing job failure rates by 50%.

Applied logistic regression, random forest, decision tree, and SVM to classify delivery outcomes for new routes.

Used SAS for data preprocessing, SQL queries, analysis, reporting, graphics, and statistical analyses.

Ensured data accuracy through data profiling and validation between the warehouse and source systems.

Utilized AWS cloud services for big data machine learning tasks.

Updated Python scripts to align training data with our database in AWS Cloud Search.

Developed unsupervised machine learning models on AWS EC2 within a Hadoop environment.

Designed and developed a scalable ML pipeline using Spark and HDFS on AWS EMR.

Built a Spark data pipeline for feature extraction utilizing Hive.

Created audit trails and job logs in Control-M for compliance reporting and operational transparency.

Migrated legacy data warehouses and other databases (SQL Server, Oracle, Teradata, and DB2) to Snowflake.

Created and executed quality scripts using SQL and Hive to ensure successful data loads and maintain data quality.

Designed SQL tables with referential integrity and developed advanced queries with stored procedures and functions in SQL Server Management Studio.

Actively engaged in the analysis, development, and unit testing of data, ensuring delivery assurance within an Agile environment.

Environment: Machine Learning, Deep Learning, Snowflake, Python, PySpark, Tableau, SAS, JSON, XML, SQL, Control-M, Hive, Agile, AWS Glue (ETL, Crawlers, Workflows), Amazon EMR (Spark, Hadoop), Amazon S3, AWS Lambda, AWS Step Functions, AWS Lake Formation, AWS KMS, IAM, CloudWatch, Amazon RDS, Amazon Redshift, Amazon Athena
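A simplified sketch of how an Airflow 2.x DAG could trigger and monitor the Glue-based PySpark ETL mentioned above, using boto3 directly rather than a provider operator. The DAG id, Glue job name, AWS region, and schedule are hypothetical.

# Hypothetical nightly DAG: start a Glue PySpark job and poll it to completion.
from datetime import datetime
import time

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_glue_job(**_):
    glue = boto3.client("glue", region_name="us-east-1")
    run_id = glue.start_job_run(JobName="loan_facts_etl")["JobRunId"]
    while True:  # poll until the job reaches a terminal state
        state = glue.get_job_run(JobName="loan_facts_etl", RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(60)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Glue job ended in state {state}")


with DAG(
    dag_id="nightly_loan_facts_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # cron-style nightly run, mirroring the UNIX cron setup
    catchup=False,
) as dag:
    PythonOperator(task_id="run_glue_job", python_callable=run_glue_job)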

Invacare, Elyria, OH Aug 2022 – May 2023

Data Engineer

Responsibilities:

Built models using statistical techniques such as Bayesian HMM and machine learning classification models like XGBoost, SVM, and Random Forest.

Engaged in data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, Unix commands, NoSQL, MongoDB, and Hadoop.

Set up storage and data analysis tools within the Amazon Web Services (AWS) cloud infrastructure.

Utilized Python libraries including pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, and NLTK for developing various machine learning algorithms.

Conducted detailed studies on potential third-party data handling solutions, ensuring compliance with internal needs and stakeholder requirements.

Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and pipeline stability.

Executed large-scale data conversions for integration into HDInsight.

Wrote data processing notebooks in Python, PySpark, and Spark-SQL on Databricks, enhancing big data analysis capabilities (a representative notebook sketch follows this section's environment list).

Designed and implemented effective database solutions using Azure Blob Storage for data storage and retrieval.

Developed advanced analytics ranging from descriptive to predictive models and machine learning techniques.

Monitored incoming data analytics requests and distributed results to support IoT hub and streaming analytics.

Prepared documentation and analytic reports, delivering summarized results, analysis, and conclusions to stakeholders.

Implemented data validation, reconciliation, and data quality checks as part of post-execution Control-M jobs.

Communicated new or updated data requirements to the global team.

Developed database architectural strategies at the modeling, design, and implementation stages to address business or industry requirements.

Worked with Snowflake utilities, Snowflake SQL, Snowpipe, etc.

Employed data cleansing methods, significantly enhancing data quality.

Oversaw data integration across the entire group.

Integrated Control-M with Snowflake, Redshift, and Azure Synapse for managing data load and transformation tasks.

Created Azure Service Bus topics and Azure Functions to handle abnormal data found in the streaming analytics service.

Created SQL databases for storing application information.

Established blob storage to save raw data from streaming analytics.

Constructed Azure DocumentDB to store application-related information.

Connected blob storage to HDInsight.

Gained experience architecting data intelligence solutions around the Snowflake Data Warehouse.

Deployed Azure Data Factory for creating data pipelines to orchestrate data into SQL databases.

Utilized Jupyter Notebook and MySQL for data integration and storage technologies.

Provided a clean, usable interface to check status, accessible via mobile devices or web clients.

Environment: Azure, MDM, Git, Unix, Snowflake, PySpark, Python, MLlib, SAS, Regression, Logistic Regression, Hadoop, OLTP, OLAP, HDFS, NLTK, SVM, Control-M, JSON, XML.
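An illustrative Databricks notebook cell showing the PySpark/Spark-SQL pattern used against Azure Blob Storage in this role. The storage account, container, secret scope, and telemetry column names are assumed placeholders.

# Hypothetical notebook cell: read raw telemetry from Azure Blob Storage,
# clean it with PySpark, and expose it to Spark SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# dbutils is available inside Databricks notebooks; names below are placeholders.
spark.conf.set(
    "fs.azure.account.key.devicestore.blob.core.windows.net",
    dbutils.secrets.get(scope="kv-scope", key="blob-storage-key"),
)

raw = spark.read.json("wasbs://telemetry@devicestore.blob.core.windows.net/2023/")

clean = (raw.filter(F.col("device_id").isNotNull())
         .withColumn("reading_ts", F.to_timestamp("reading_ts"))
         .dropDuplicates(["device_id", "reading_ts"]))

clean.createOrReplaceTempView("telemetry_clean")
spark.sql("SELECT device_id, COUNT(*) AS readings "
          "FROM telemetry_clean GROUP BY device_id").show()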

The Home Depot, USA Nov 2021 – Jul 2022

Data Engineer

Responsibilities:

Gathered, analyzed, and translated business requirements into analytic approaches.

Identified inconsistencies in data from various sources and collaborated with business stakeholders using SQL, reducing risk factors by 63%.

Developed machine learning algorithms in Python using TensorFlow, Keras, Theano, Pandas, NumPy, SciPy, Scikit-learn, and NLTK, including neural networks, linear regression, multivariate regression, naïve Bayes, random forests, decision trees, SVMs, K-means, and KNN.

Created data pipelines with AWS S3 to extract data, store it in HDFS, and deploy machine learning models (a brief pipeline-and-model sketch follows this section's environment list).

Applied clustering and factor analysis for data classification using machine learning algorithms.

Developed custom Control-M scripts using shell and Python to extend native job capabilities and API automation.

Designed, built, and maintained scalable ETL data pipelines from various sources into data storage systems or data warehouses.

Working knowledge of the AWS environment, including Spark on AWS, Snowflake, Lambda, and EC2.

Worked on cloud data migration using AWS and Snowflake.

Optimized scheduling strategies by configuring calendars, conditions, and resources in Control-M to improve throughput and job efficiency.

Conducted risk analysis, root cause analysis, cluster analysis, correlation, and optimization, and utilized K-means for data clustering.

Collaborated with data scientists and senior technical staff to identify client needs.

Implemented data ingestion processes, data transformation logic, and data quality checks.

Environment: SQL, ETL, AWS, Snowflake, Machine Learning, Control-M, Python
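A brief, hedged sketch of the S3-to-model pattern referenced above: pull a training extract from S3 with boto3, then fit one of the listed classifiers with scikit-learn. The bucket, key, and the binary "returned" target column are illustrative, not actual project artifacts.

# Hypothetical extract-and-train pipeline.
import io

import boto3
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="retail-analytics-extracts", Key="training/orders.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

X = df.drop(columns=["returned"])  # feature columns
y = df["returned"]                 # binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))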

PayPal, USA Aug 2020 – Nov 2021

Data Engineer

Responsibilities:

Constructed a data pipeline to process semi-structured data using Apache Spark, integrating 100 million raw records from 14 different data sources.

Utilized machine learning algorithms, including neural networks, linear regression, logistic regression, SVMs, and decision trees, for group classification and variable significance analysis.

Developed Pareto charts using SAS to identify high-impact categories in modules for workforce distribution, and created various data visualization charts.

Designed a scalable data pipeline architecture for a new product, rapidly growing from 0 to 60,000 daily users with a user retention rate of 73%.

Developed Snowflake views to load and unload data to and from an AWS S3 bucket, and promoted the code to production (a connector-based load/unload sketch follows this section's environment list).

Built parameterized job templates in Control-M to handle environment-specific configurations dynamically.

Generated detailed reports by validating graphs with Python and adjusting variables for model fitting.

Collaborated with product managers on cohort analysis, identifying an opportunity to reduce pricing by 35% for a user segment, boosting yearly revenue by $320K.

Performed root cause analysis on job failures and provided permanent solutions to reduce recurrence using Control-M logs and job flow tracing.

Responsible for loading data into S3 buckets from the internal server and the Snowflake data warehouse.

Worked with big data frameworks and technologies such as Hadoop, Spark, and NoSQL databases to efficiently process and analyze large data volumes. Developed distributed computing solutions and optimized data processing workflows.

Environment: Hadoop, Spark, NoSQL, Oracle, Snowflake, Windows, Control-M, Python, SAS
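A rough sketch of the S3 load/unload pattern behind the Snowflake work described above, using the Snowflake Python connector with external stages. The account, credentials, stage, table, and view names are placeholders.

# Hypothetical load/unload between S3 stages and Snowflake tables.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="etl_user", password="***",
    warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# Load: copy staged S3 files into a table through an external stage.
cur.execute(
    "COPY INTO payments FROM @s3_payments_stage "
    "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
)

# Unload: export a curated view back to S3 for downstream consumers.
cur.execute(
    "COPY INTO @s3_export_stage/payments_summary "
    "FROM (SELECT * FROM payments_summary_vw) OVERWRITE = TRUE"
)

cur.close()
conn.close()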

Circana LLC, Pune, India Aug 2018 – Jan 2020

Data Engineer

Responsibilities:

Collaborated with a team of four to establish a cloud-first data ingestion system, utilizing Azure, Apache Spark, and Kafka to ingest data from diverse sources, enhancing data processing speed by 74% (a simplified streaming-ingestion sketch follows this section's environment list).

Developed performance analysis metrics using Kibana to monitor project implementation, resulting in a 15% cost reduction over two years.

Effectively utilized Power Map and Power View to visually represent data for technical and non-technical users.

Wrote Spark code to process and parse data from various sources, storing parsed data in HBase and Hive through HBase-Hive Integration.

Integrated data from internal and external sources to ensure seamless data flow and enable comprehensive data analysis.

Loaded and transformed data continuously from Amazon S3 buckets into Snowflake using Snowpipe and the Snowflake Spark Connector.

Transformed raw data into processed data, performing tasks such as merging, outlier detection, error correction, trend analysis, handling missing values, and data distribution assessment.

Conducted data analysis, visualization, feature extraction, selection, and engineering using Python.

Applied business forecasting, segmentation analysis, and data mining techniques, preparing management reports that define problems, document analyses, and recommend actionable strategies for optimal outcomes.

Designed SQL tables with referential integrity and developed advanced queries using stored procedures and functions in SQL Server Management Studio.

Conducted knowledge transfer sessions on Control-M job creation, monitoring, and troubleshooting for cross-functional teams.

Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.

Engaged in API integration with third-party systems and established data synchronization processes.

Environment: Hadoop, Spark, NoSQL, Oracle, Windows, SQL Server, Azure, Snowflake, Kafka, Control-M, HBase, Hive.
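A simplified Spark Structured Streaming sketch of the Kafka ingestion path described in this role. The broker address, topic, message schema, and output paths are assumptions for illustration.

# Hypothetical streaming ingest: Kafka topic -> parsed events -> parquet files.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("pos-ingest").getOrCreate()

schema = StructType([
    StructField("store_id", StringType()),
    StructField("sku", StringType()),
    StructField("amount", DoubleType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "pos-transactions")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

(events.writeStream
 .format("parquet")
 .option("path", "/data/silver/pos")
 .option("checkpointLocation", "/data/checkpoints/pos")
 .trigger(processingTime="1 minute")
 .start())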

Bajaj Allianz Life Insurance Company Limited, Pune, India Jan 2017 – Jul 2018

Data Engineer

Responsibilities:

Defined source-to-business rules, target data mappings, and data definitions for various projects.

Conducted data validation and reconciliation between disparate source and target systems.

Identified customer and account attributes necessary for MDM implementation from diverse sources, creating detailed documentation.

Produced data visualization reports for management using Tableau.

Utilized a range of statistical packages, including MLlib, Python, and others.

Cleaned data using Python, employed correlation matrix, step-wise regression, and Random Forest for variable selection.

Applied multinomial logistic regression, Random Forest, decision trees, and SVM to predict timely package delivery for new routes.

Provided technical input and recommendations to business analysts, BI engineers, and data scientists.

Segmented customers based on demographics using K-means clustering (illustrated in the sketch after this section's environment list).

Leveraged T-SQL queries to extract data from disparate systems and data warehouses across various environments.

Employed MS Excel extensively for data validation purposes.

Environment: Python, ETL, BI, T-SQL, SQL, Machine Learning, and Windows.
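An illustrative scikit-learn sketch of the K-means segmentation described above; the input file and demographic feature names are hypothetical.

# Hypothetical demographic segmentation with K-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customer_demographics.csv")
features = customers[["age", "annual_income", "policy_count"]]

# Standardize so no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=7)
customers["segment"] = kmeans.fit_predict(scaled)

# Profile each segment to make the clusters interpretable for the business.
print(customers.groupby("segment")[["age", "annual_income", "policy_count"]].mean())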

Velotio Technologies Pvt Ltd, Pune, India Jun 2015 – Dec 2016

Data Engineer

Responsibilities:

Designed, developed, and maintained scalable data pipelines using modern data engineering tools and frameworks (e.g., Apache Spark, Apache Kafka, AWS Glue) to support business intelligence, analytics, and data science initiatives.

Collaborated with cross-functional teams, including data scientists, analysts, and software developers, to understand data requirements and deliver high-quality data solutions that meet business objectives.

Implemented ETL (Extract, Transform, Load) processes to ingest, clean, transform, and store structured and unstructured data from various sources such as APIs, databases, and flat files into data warehouses and data lakes (e.g., Amazon Redshift, Snowflake, Azure Synapse).

Supported senior data engineers in data integration tasks, including the collection, transformation, and migration of data from multiple data sources such as APIs, databases, and cloud platforms.

Developed and optimized database schema and data models to support efficient data storage and retrieval, ensuring data integrity, consistency, and security.

Automated routine data tasks using scripting languages (e.g., Python, SQL, Bash) and orchestrated workflows using tools like Apache Airflow, Luigi, or Prefect.

Participated in the design and implementation of data security and privacy policies to ensure compliance with data protection regulations (e.g., GDPR, CCPA).

Monitored and troubleshot data pipeline performance and reliability, implementing enhancements to improve data processing efficiency and reduce latency.

Ensured data quality and compliance by implementing data validation, cleansing, and governance practices across data workflows, maintaining data accuracy and reliability (a minimal validation sketch follows this section's environment list).

Wrote and optimized SQL queries to extract relevant data for analysis and reporting, improving query performance and data retrieval speeds.

Engaged in agile development processes, participating in sprint planning, daily stand-ups, and retrospective meetings to ensure the timely delivery of data engineering solutions.

Environment: Python, GDPR, CCPA, SQL, ETL, Apache Spark, Apache Kafka, AWS Glue, Oracle, Windows.
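A minimal sketch of the kind of pre-load data-quality checks described above, written with pandas; the file name, columns, and thresholds are illustrative assumptions.

# Hypothetical validation step run before loading a daily extract.
import pandas as pd


def validate_extract(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations found in an extract."""
    problems = []
    if df.empty:
        problems.append("extract is empty")
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if df["order_total"].lt(0).any():
        problems.append("negative order_total values")
    null_rate = df["customer_id"].isna().mean()
    if null_rate > 0.01:  # tolerate at most 1% missing customer keys
        problems.append(f"customer_id null rate {null_rate:.2%} exceeds 1%")
    return problems


df = pd.read_csv("daily_orders.csv")
issues = validate_extract(df)
if issues:
    raise ValueError("data quality check failed: " + "; ".join(issues))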


