Data Engineer
Siva Reddy
Email:*************@*****.***
Contact:+1-972-***-****
Summary:
Data Engineer with around 5 years of experience delivering innovative and scalable data solutions across e-commerce, healthcare, and geospatial domains. Proficient in cloud platforms (AWS, Azure, GCP), big data tools (Apache Spark, Kafka, Databricks), and advanced data modeling techniques to support business-critical analytics and machine learning initiatives.
Hands-on experience with AWS (Glue, Lambda, S3, Redshift, RDS), Azure (Synapse Analytics, Data Factory), and GCP (BigQuery, Cloud Storage, Dataflow, BigQuery GIS) for building and optimizing scalable data architectures.
Proficient in AWS KMS and Azure Security protocols for data encryption and compliance.
Designed and implemented robust ETL/ELT pipelines processing terabytes of data using tools like Apache Spark, PySpark, and Databricks.
Expertise in real-time data streaming using Apache Kafka and event-driven architectures.
Integrated structured and unstructured data from diverse sources like IoT sensors, CRM systems, and SQL databases.
Built and optimized high-performance workflows for big data processing, reducing pipeline latency by up to 40%.
Proficient in data lakes (Delta Lake), partitioning, indexing, and query optimization for high-throughput analytics.
Designed star (denormalized) and snowflake (normalized) schemas to support complex data analytics.
Extensive experience with relational and geospatial databases (PostgreSQL, SQL Server, Oracle, PostGIS).
Created dynamic dashboards using Power BI, Tableau, and ArcGIS, delivering actionable insights for stakeholders.
Collaborated with data scientists on customer segmentation and predictive modeling using Python and scikit-learn.
Ensured compliance with HIPAA and GDPR, implementing data encryption, RBAC, and audit trails to safeguard sensitive information.
Advanced skills in GIS tools (ArcGIS, QGIS, PostGIS) and geospatial data processing using GeoPandas and FME Workbench for utility and urban planning projects.
Automated complex workflows using Apache Airflow and Azure Data Factory, ensuring seamless data ingestion and transformation.
Developed reusable Airflow DAGs and cron jobs for batch processing, improving efficiency and reducing manual intervention.
Proficient in Python, SQL, and Java (for AWS Lambda), with a focus on creating modular, reusable, and scalable code.
Integrated REST APIs for real-time data ingestion and external system connectivity.
Implemented AWS CloudWatch, ELK Stack, and custom monitoring solutions to track pipeline health and resolve issues proactively.
Adept at working with cross-functional teams, aligning technical solutions with business goals, and delivering high-value outcomes.
Documented workflows and processes in JIRA and Confluence to ensure team-wide knowledge sharing.
Used Adobe Analytics to integrate real-time customer behavior data into the Customer Behavior Analytics Platform, optimizing personalized engagement strategies at QVC.
Technical Skills:
Cloud Platforms: AWS (Glue, Redshift, S3, RDS, CloudWatch), Azure (Synapse Analytics, Data Factory), Google Cloud Platform (BigQuery, Google Cloud Storage, Google Earth Engine, Google Cloud Dataflow)
Data Engineering: Apache Spark, PySpark, AWS Glue, Apache Kafka, Apache Airflow, Databricks, Delta Lake
Data Modeling & Databases: PostgreSQL, Amazon Redshift, SQL Server, Delta Lake, Oracle, PostGIS
Data Processing & ETL: ETL/ELT Pipelines, Apache Spark, Python (Pandas, PySpark), Talend, AWS Glue, Azure Data Factory, FME Workbench
Data Visualization: Power BI, Tableau, Custom Dashboards
Programming Languages: Python, SQL, Java (for AWS Lambda), PySpark
Big Data Tools: Apache Spark, Apache Kafka, Databricks, AWS Glue, Apache Airflow
Machine Learning: scikit-learn (customer segmentation models)
Version Control: Git
Security & Compliance: AWS KMS, Data Encryption (in transit and at rest), Role-Based Access Control (RBAC), HIPAA
Automation & Orchestration: Apache Airflow, AWS Lambda
Monitoring & Logging: AWS CloudWatch, ELK Stack (Elasticsearch, Logstash, Kibana), Custom Dashboards
Collaboration Tools: JIRA, Confluence
Data Integration & APIs: REST APIs, IoT Device Integration, AWS Lambda, Apache NiFi, Python (for API integration)
GIS & Geospatial Tools: ArcGIS, QGIS, GeoPandas, PostGIS, Shapely, FME Workbench, AutoCAD, GeoJSON
Professional Experience:
Client: QVC, PA May 2024 - Present
Role: ETL Engineer
Project: Customer Behavior Analytics Platform Enhancement
Project Overview: As a Data Engineer at QVC, I contribute to the Customer Behavior Analytics Platform, which leverages big data tools and cloud technologies to process and analyze customer interaction data from diverse touchpoints. The goal is to deliver personalized recommendations, optimize inventory management, and improve overall customer engagement.
Responsibilities:
Architected and implemented ETL pipelines using AWS Glue and Apache Spark to extract, transform, and load large datasets from various sources, including website logs, customer data, and transaction history into a central data warehouse.
Used SQL and Python to create custom data transformations, ensuring data consistency and high-quality data integration.
Developed real-time data processing workflows using Apache Kafka and Apache Spark Streaming, enabling continuous ingestion and analysis of live customer behavior data for up-to-the-minute insights (a minimal streaming sketch follows this list).
Designed and optimized data models for transactional and behavioral data in Amazon Redshift and PostgreSQL to support scalable, high-performance analytics.
Implemented indexing, partitioning, and query optimization techniques to improve database performance and ensure efficient querying of large datasets.
Integrated data from multiple systems, including CRM, e-commerce platforms, and inventory management systems, using AWS S3, AWS RDS, and Redshift, creating a seamless data pipeline for advanced analytics.
Ensured data consistency and accuracy by developing data validation rules and quality checks.
Used Apache Airflow to automate scheduling, monitoring, and orchestration of complex data workflows, ensuring that data is processed and made available for analysis with minimal manual intervention.
Designed reusable Airflow DAGs (Directed Acyclic Graphs) for batch processing, reducing processing time and manual errors (a simplified DAG sketch follows the tool list below).
Collaborated with Data Scientists to develop customer segmentation models and advanced machine learning algorithms using Python and scikit-learn, improving personalization strategies and marketing campaigns.
Provided data insights through SQL queries and Python scripting, assisting stakeholders in making data-driven decisions.
Implemented performance monitoring for data pipelines using AWS CloudWatch and custom-built monitoring solutions, tracking pipeline health and setting up alerts for failures or slowdowns.
Troubleshot and optimized data workflows to reduce pipeline latency by 30%, ensuring faster insights for marketing and business teams.
Utilized Git for version control to manage and deploy changes to codebases and configuration files across teams.
Implemented FinOps best practices by optimizing AWS Glue and Spark ETL pipelines, reducing compute costs by 20% through auto-scaling and efficient resource allocation.
Monitored AWS Redshift and S3 storage costs, identifying unused resources and reducing cloud spend by 15% through optimized partitioning and compression techniques.
Deployed and orchestrated Apache Spark jobs on Kubernetes, enabling auto-scaling and fault tolerance for large-scale ETL pipelines, reducing compute costs by 20%.
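Illustration - a minimal sketch of the Kafka ingestion flow described above, written with Spark Structured Streaming (assumed here in place of the DStream API); the broker address, topic name, event schema, and S3 staging paths are illustrative placeholders rather than QVC specifics.
# Minimal sketch of real-time ingestion from Kafka with Spark Structured Streaming.
# Broker, topic, schema, and sink paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Assumed schema for a customer interaction event
event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Subscribe to the raw event topic (placeholder broker and topic)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "customer-events")
       .load())

# Parse the JSON payload and append micro-batches to the staging area
events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://analytics-staging/customer-events/")
         .option("checkpointLocation", "s3a://analytics-staging/_chk/customer-events/")
         .outputMode("append")
         .start())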
Used Tools: AWS, Apache Spark, Apache Airflow, AWS Glue, Apache Kafka, Amazon Redshift, PostgreSQL, Amazon RDS, Pandas, PySpark, SQL, scikit-learn, AWS CloudWatch, Custom Dashboards, Git, JIRA, Confluence.
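Illustration - a simplified sketch of a reusable batch DAG of the kind described in the responsibilities above; the DAG id, schedule, and task bodies are illustrative assumptions, not the production pipeline.
# Simplified reusable batch DAG; schedule, task names, and logic are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_extract(**context):
    # Placeholder: row-count and schema checks on the day's extract
    pass

def load_to_redshift(**context):
    # Placeholder: COPY/upsert of the validated extract into Redshift
    pass

default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="daily_customer_batch",
    start_date=datetime(2024, 6, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    validate = PythonOperator(task_id="validate_extract", python_callable=validate_extract)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)
    validate >> load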
Client: Kaiser, California Feb 2023 - April 2024
Role: Azure Data Engineer
Project: Patient Data Analytics and Reporting System
Project Overview: The project aimed to develop a centralized data analytics and reporting platform to optimize patient care, reduce operational costs, and ensure compliance with healthcare regulations. It involved designing, implementing, and managing data pipelines to integrate data from multiple systems like Electronic Health Records (EHR), patient feedback systems, and operational databases.
Responsibilities:
Designed and implemented scalable ETL/ELT pipelines using Apache Spark and Databricks, processing terabytes of data from multiple sources, including HL7 EHR feeds, SQL databases, and Azure Blob Storage.
Automated data ingestion and transformation workflows using Azure Data Factory, enabling real-time data flow into Azure Synapse Analytics for downstream processing.
Developed star and snowflake schema data models to optimize storage and query performance in Delta Lake and SQL Server (a partitioned Delta Lake load sketch follows the tool list below).
Utilized PySpark for distributed data processing to handle high-volume healthcare datasets, ensuring scalability and performance.
Implemented robust data validation and reconciliation workflows using Great Expectations, achieving 99.9% accuracy in critical datasets (a minimal validation sketch follows this list).
Designed dynamic Power BI dashboards to provide clinical teams and executives with actionable insights into patient health trends, readmission rates, and service efficiency.
Partnered with data science teams to prepare clean, feature-rich datasets for predictive modeling, enhancing the accuracy of health risk predictions.
Implemented secure data storage solutions using AWS Secrets Manager and Azure Key Vault to ensure the safety of sensitive patient data, supporting HIPAA compliance and data integrity.
Automated the extraction of patient data from Adobe PDFs using Python libraries to streamline reporting and data integration into the Patient Data Analytics platform.
Applied encryption, access controls, and audit trails to meet HIPAA and organizational security standards during data processing and storage.
Optimized Spark jobs and Synapse queries to reduce processing time by 40% and minimize cloud infrastructure costs.
Worked closely with product managers, clinical stakeholders, and IT teams to align technical solutions with business requirements and deliver value-driven outcomes.
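Illustration - a minimal validation sketch using the classic Great Expectations pandas API, in the spirit of the checks above; the file path, column names, and thresholds are illustrative assumptions.
# Minimal data-quality sketch with the classic Great Expectations pandas API.
# Path, column names, and thresholds are illustrative assumptions.
import pandas as pd
import great_expectations as ge

records = pd.read_parquet("curated/patient_encounters.parquet")  # placeholder path
batch = ge.from_pandas(records)

# Core integrity checks on critical fields
batch.expect_column_values_to_not_be_null("patient_id")
batch.expect_column_values_to_be_unique("encounter_id")
batch.expect_column_values_to_be_between("length_of_stay_days", min_value=0, max_value=365)

result = batch.validate()
if not result.success:
    raise ValueError("Validation failed; blocking the downstream load")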
Used Tools: Python, SQL, PySpark, Apache Spark, Databricks, Delta Lake, Azure, Power BI, Great Expectations, SQL Server, Azure DevOps, Adobe.
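Illustration - a brief sketch of writing a conformed fact table to Delta Lake with date partitioning, as in the schema work above; the storage paths, table, and column names are assumptions.
# Sketch of a partitioned Delta Lake fact-table load; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fact-encounters-load").getOrCreate()

staged = spark.read.parquet("abfss://staging@datalake.dfs.core.windows.net/encounters/")

fact_encounters = (staged
    .withColumn("admit_date", F.to_date("admit_ts"))
    .select("encounter_id", "patient_key", "facility_key", "admit_date", "total_charge"))

(fact_encounters.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("admit_date")
    .save("abfss://curated@datalake.dfs.core.windows.net/fact_encounters/"))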
Client: Cadsys, India Jan 2021 - Aug 2022
Role: Data Engineer
Project: Geospatial Data Pipeline for Utility Infrastructure Management
Project Overview: The project aimed to design and implement a scalable data pipeline to handle and process geospatial and non-spatial data for a municipal utility services client. This included integrating data from various sources like GIS, IoT sensors, CAD systems, and traditional RDBMS to support city-wide asset tracking, resource optimization, and urban planning. The solution provided real-time data analytics, visualization, and reporting, enabling the municipal corporation to monitor utility infrastructure such as water pipelines, gas distribution networks, and electrical grids effectively.
Responsibilities:
Designed and implemented ETL processes using Talend, extracting geospatial and non-spatial data from sources like shapefiles, GeoJSON, CAD drawings, and legacy databases (Oracle, SQL Server).
Developed data transformation logic in Talend to preprocess, clean, and structure data, ensuring compatibility across both AWS and GCP cloud platforms for seamless integration.
Stored and secured geospatial data in AWS S3, applying AWS KMS for encryption, ensuring compliance with security standards for sensitive infrastructure data.
Created and optimized geospatial data models in AWS RDS with PostGIS, ensuring high-performance querying and scalability for large datasets (a GeoPandas-to-PostGIS load sketch follows this list).
Built real-time data ingestion pipelines using AWS Lambda and Apache Kafka, enabling continuous data flow from IoT sensors and utility meters into a central repository for processing (an illustrative handler sketch follows the tool list below).
Automated data processing workflows using Apache Airflow, reducing manual intervention and ensuring timely execution of complex data tasks.
Integrated Google Cloud Storage (GCS) for cost-effective, scalable storage of geospatial data, ensuring high availability and easy integration with other GCP analytics tools.
Leveraged Google BigQuery for large-scale geospatial data processing and Google Earth Engine for advanced spatial analysis, providing insights to optimize utility infrastructure management.
Implemented real-time and batch data processing workflows using Google Cloud Dataflow, processing streams of IoT sensor data and providing actionable insights for infrastructure optimization.
Utilized Git for version control and documented workflows in JIRA and Confluence, ensuring smooth collaboration and tracking of project progress across teams.
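Illustration - a compact sketch of loading a utility-asset shapefile into PostGIS via GeoPandas, in line with the geospatial modeling above; the file path, target SRID, table name, and connection string are illustrative assumptions.
# Sketch of a shapefile-to-PostGIS load with GeoPandas; all identifiers are placeholders.
import geopandas as gpd
from sqlalchemy import create_engine

# Read the source shapefile and normalize to a common projection
pipelines = gpd.read_file("data/water_pipelines.shp").to_crs(epsg=4326)

engine = create_engine("postgresql://user:password@db-host:5432/utilities")

# Write to a PostGIS-enabled table for downstream spatial queries
pipelines.to_postgis("water_pipelines", engine, if_exists="replace", index=False)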
Used Tools: Python, SQL, Talend, Apache NiFi, FME Workbench, PostgreSQL, PostGIS, Oracle, Tableau, Power BI, ArcGIS, Apache Airflow, ELK Stack, AWS, GCP, Pandas, GeoPandas, Shapely, QGIS, AutoCAD, Linux, Windows
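Illustration - a hypothetical AWS Lambda handler forwarding IoT meter readings to Kafka, matching the real-time ingestion described above; the broker address, topic, and payload field names are assumptions.
# Hypothetical Lambda handler pushing sensor readings to Kafka (kafka-python).
# Broker, topic, and payload field names are illustrative assumptions.
import json
from kafka import KafkaProducer

# Created once per container and reused across warm invocations
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handler(event, context):
    # Each record is assumed to be one reading forwarded by the device gateway
    readings = event.get("readings", [])
    for record in readings:
        producer.send("utility-sensor-readings", {
            "meter_id": record["meter_id"],
            "reading": record["value"],
            "recorded_at": record["timestamp"],
        })
    producer.flush()
    return {"forwarded": len(readings)}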
Education:
Master's: The University of Texas at Arlington - USA
Bachelor’s: GITAM - India
Certifications: Microsoft Azure Data Engineer Associate.