Data Engineer Senior

Location:
United States
Posted:
June 04, 2025


Resume:

Senior Data Engineer

Dileep M

Phone: +1-512-***-****

Email: ******************@*****.***

PROFESSIONAL SUMMARY:

Senior Data Engineer with 9+ years of experience designing scalable data pipelines and automating workflows across cloud platforms (AWS, Azure, GCP) using Python, PySpark, and REST APIs.

Proven success in cloud migration projects, including Snowflake to Databricks (Azure) and on-prem SQL/Hadoop to GCP, with a focus on performance, governance, and compliance.

Expertise in building real-time and batch ETL/ELT pipelines using PySpark, Delta Lake, Databricks, Apache Spark, Snowflake, and Redshift.

Strong communicator and self-starter with a history of remote project delivery, cross-functional collaboration, and aligning technical solutions with business priorities in healthcare, finance, and insurance domains.

Extensive experience with Azure Data Factory (ADF) for orchestrating ETL workflows, Azure Databricks for scalable data processing, Synapse Analytics for enterprise data warehousing, and Logic Apps for automating event-driven integrations.

Proficient in AWS Glue for ETL development, Lambda for serverless computing, EMR for distributed data processing, Lake Formation for secure data governance, S3 for object storage, and Step Functions for orchestration of serverless workflows.

Hands-on experience using BigQuery for large-scale analytics, Dataflow for streaming and batch data pipelines, Cloud Composer for workflow orchestration, Dataproc for Apache Spark processing, and Cloud Functions for event-driven automation.

Implemented Medallion Architecture, Delta Live Tables (DLT), and Apache Hudi/Iceberg to support schema evolution, incremental loads, and time travel capabilities.

Strong knowledge of data governance and security, using Unity Catalog, RBAC, Lake Formation, IAM, KMS, and enforcing SOX, HIPAA, PCI DSS, and GDPR standards.

Worked on end-to-end integration and transformation of healthcare claims data, ensuring strict adherence to HIPAA compliance standards and optimizing data pipelines for regulatory reporting and analytics.

Hands-on experience integrating Google Sheets and Excel for data validation, stakeholder collaboration, and scripting-based automation to support data quality and reporting needs.

Applied Change Data Capture (CDC) and Slowly Changing Dimensions (SCD Type 1 and 2) for real-time and historical data tracking.

Built automated and event-driven workflows using Apache Airflow, AWS Step Functions, and Azure Logic Apps, improving pipeline observability and scheduling.

Developed and maintained CI/CD pipelines using Azure DevOps, Terraform, GitHub Actions, and AWS CodePipeline for infrastructure and ETL deployment automation.

Developed serverless pipelines using AWS Lambda and GCP Cloud Functions for real-time data processing and ingestion.

Proficient in Docker, Kubernetes, and container-based deployment strategies for scalable ETL environments.

Built interactive dashboards and reporting solutions using Power BI, Tableau, and Looker, integrated with Databricks SQL and Snowflake.

Collaborated cross-functionally with business analysts, compliance officers, DevOps, and risk management teams to align technical solutions with business needs.

Applied advanced performance tuning techniques for Spark jobs, cluster autoscaling, partitioning, and query optimization to reduce costs and latency.

Strong background in data modeling, profiling, unit testing, and documentation, including mapping specifications, STMs, and test result archives.

TECHNICAL SKILLS:

Programming Languages

Python, Scala, Java

Big Data & Distributed Processing

PySpark, Apache Spark, Spark SQL, Hive, YARN, HDFS, MapReduce, Kafka Streams, Spark Streaming

Data Engineering & Integration Tools

Azure Data Factory (ADF), AWS Glue, Google Cloud Dataflow, Databricks, Apache Airflow, dbt, Apache Kafka, Apache Hudi, Apache Iceberg

Hadoop Platform

Hortonworks, Cloudera, AWS EMR, EC2

Cloud Platforms

Azure: ADF, Azure Databricks, Synapse Analytics, ADLS, HDInsight, AKS, Logic Apps

AWS: S3, EMR, Glue, Redshift, Lambda, Lake Formation, Athena, Step Functions

GCP: BigQuery, Dataproc, Cloud Composer, Cloud Functions, GCS, Cloud SQL

Data Modeling & Warehousing

Snowflake, Redshift, BigQuery

Databases

MS SQL Server, Oracle, Teradata, MySQL, DynamoDB

CI/CD & DevOps

Azure DevOps, GitHub Actions, Terraform, Jenkins, AWS CodePipeline, Git

Orchestration & Scheduling

Apache Airflow, AWS Step Functions, Azure Logic Apps, Cloud Composer, Control-M

Visualization & BI Tools

Power BI, Tableau, Looker

Containers & Deployment

Docker, Kubernetes, ECS, EKS

Data Governance & Security

Unity Catalog, IAM, RBAC, KMS, AWS Lake Formation, Data Masking, Encryption (at-rest and in-transit), SOX, GDPR, HIPAA, PCI DSS

Development & Productivity Tools

VS Code, Eclipse, IntelliJ, Oracle SQL Developer, Google Sheets (API integration, automation scripting), PuTTY, Erwin Data Modeler, Excel (advanced formulas, macros, pivot tables for data validation and transformation)

PROFESSIONAL EXPERIENCE:

U.S. Bancorp – Minneapolis, Minnesota Feb 2022 – Present

Senior Data Engineer

Responsibilities:

As part of U.S. Bancorp’s enterprise modernization strategy, I led the migration of critical financial data workloads from Snowflake to Azure Databricks, optimizing analytical performance, data governance, and real-time reporting pipelines for banking operations, credit risk analysis, and regulatory compliance.

Migrated structured and semi-structured data from Snowflake to Databricks, including financial transactions, loan records, AML logs, and KYC datasets. Ensured schema consistency, data accuracy, and minimal downtime throughout the migration.

Designed scalable ETL pipelines in Azure Databricks using PySpark, Delta Lake, and Delta Live Tables (DLT), enabling real-time and batch data processing for financial analytics and fraud detection.

Implemented Databricks Medallion Architecture to streamline ingestion, transformation, and enrichment of data across Bronze, Silver, and Gold layers.
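
For illustration, a minimal Delta Live Tables sketch of that Bronze/Silver/Gold flow is shown below; the landing path, table names, and columns (txn_id, txn_ts, amount) are placeholders rather than the actual production schema.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw transactions ingested as-is via Auto Loader")
def bronze_transactions():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/transactions")  # hypothetical landing path
    )

@dlt.table(comment="Silver: typed, de-duplicated records")
@dlt.expect_or_drop("valid_amount", "amount IS NOT NULL AND amount >= 0")
def silver_transactions():
    return (
        dlt.read_stream("bronze_transactions")
        .withColumn("txn_date", F.to_date("txn_ts"))
        .dropDuplicates(["txn_id"])
    )

@dlt.table(comment="Gold: daily aggregates for reporting")
def gold_daily_summary():
    return (
        dlt.read("silver_transactions")
        .groupBy("txn_date")
        .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("txn_count"))
    )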

Developed CI/CD pipelines using Azure DevOps and Terraform to automate infrastructure provisioning and ETL deployments, improving release cycles and operational reliability.

Integrated Snowflake Streams & Tasks with Delta Live Tables for continuous data updates, reducing reporting latency for anti-money laundering (AML) and risk modeling use cases.
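
A rough sketch of the Snowflake side of that integration, issued through the Snowflake Python connector; the object names, schedule, and MERGE logic are illustrative placeholders rather than the production definitions.

import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",  # placeholder account identifier
    user="etl_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ETL_WH",
    database="FINANCE",
    schema="STAGING",
)
cur = conn.cursor()

# Capture row-level changes on the staging table
cur.execute("CREATE OR REPLACE STREAM TXN_STREAM ON TABLE RAW_TRANSACTIONS")

# Merge captured changes into the curated table on a short schedule
cur.execute("""
    CREATE OR REPLACE TASK MERGE_TXN_TASK
      WAREHOUSE = ETL_WH
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('TXN_STREAM')
    AS
      MERGE INTO CURATED.TRANSACTIONS t
      USING TXN_STREAM s ON t.txn_id = s.txn_id
      WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.updated_at = CURRENT_TIMESTAMP()
      WHEN NOT MATCHED THEN INSERT (txn_id, amount, updated_at)
        VALUES (s.txn_id, s.amount, CURRENT_TIMESTAMP())
""")

cur.execute("ALTER TASK MERGE_TXN_TASK RESUME")  # tasks are created suspended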

Orchestrated data pipelines using Azure Data Factory and monitored them through Azure Event Hubs and Logic Apps, enabling real-time event-driven data processing.

Applied Unity Catalog and RBAC policies to enforce role-based access and regulatory compliance with standards such as SOX, PCI DSS, and GDPR.

Developed post-migration validation frameworks in Python and Databricks Notebooks for automated reconciliation and data quality assurance.
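
The shape of one such check, roughly as it would appear in a validation notebook; the connector options, table names, and key columns are placeholders, and the full framework also compared schemas, null rates, and aggregate totals.

from pyspark.sql import functions as F

def table_fingerprint(df, key_cols):
    """Row count plus an order-independent hash over the key columns."""
    return df.agg(
        F.count("*").alias("row_count"),
        F.sum(F.xxhash64(*key_cols)).alias("key_hash"),
    ).first()

snowflake_opts = {
    "sfUrl": "example_account.snowflakecomputing.com",  # placeholder
    "sfUser": "etl_user",
    "sfPassword": dbutils.secrets.get("etl", "snowflake-password"),
    "sfDatabase": "FINANCE",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}
snow_df = (spark.read.format("snowflake").options(**snowflake_opts)
           .option("dbtable", "TRANSACTIONS").load())
delta_df = spark.table("finance.silver.transactions")

src = table_fingerprint(snow_df, ["TXN_ID"])
tgt = table_fingerprint(delta_df, ["txn_id"])

assert src["row_count"] == tgt["row_count"], "Row counts differ after migration"
assert src["key_hash"] == tgt["key_hash"], "Key hashes differ - check for missing or duplicate rows"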

Optimized Databricks clusters and cost management strategies through autoscaling and workload-aware configuration, reducing compute costs and improving performance.

Created Power BI dashboards by integrating Databricks SQL and Snowflake for real-time monitoring of credit risk, fraud patterns, and payment activity.

Containerized ETL components using Docker and deployed services via Kubernetes for scalable, reproducible data processing environments.

Collaborated with banking analysts, compliance officers, and DevOps engineers to align pipeline design with business needs and ensure uninterrupted operations.

Tracked project progress using JIRA and Agile ceremonies to maintain momentum, manage deliverables, and complete the Snowflake-to-Databricks migration initiative on time.

Environment: Azure Databricks, Databricks SQL, Delta Lake, Delta Live Tables (DLT), PySpark, Spark SQL, Snowflake, SnowSQL, Snowpipe, Azure Data Factory (ADF), Azure Synapse Analytics, Azure SQL Database, Azure Data Lake Storage (ADLS), Unity Catalog, Azure DevOps, Terraform, GitHub Actions, Jenkins, Docker, Kubernetes, Power BI, Tableau, Apache Kafka, Apache Airflow, Azure Event Hubs, Azure Functions, Azure Logic Apps, Python, Scala, Bash, SQL, CI/CD, Data Governance, RBAC, SOX, GDPR, PCI DSS

MetLife – New York City, NY Aug 2020 – Jan 2022

Senior Data Engineer

Responsibilities:

As part of MetLife's enterprise data modernization initiative, I led the migration and modernization of legacy ETL workflows and analytical systems to a cloud-native architecture on AWS and Databricks. The project focused on handling structured and semi-structured data from actuarial, claims, and customer domains to enable secure, compliant, and high-performance data processing for business insights.

Migrated on-premises SAP ASE databases and mainframe data sources (including customer, claims, and premium data) to AWS S3 and EC2 using AWS DMS, SQL Loader, and custom ingestion frameworks.

Designed and implemented scalable ETL pipelines using AWS Glue, PySpark, and Lambda to ingest, process, and transform data in Parquet, JSON, and Avro formats.
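
A condensed skeleton of one such Glue PySpark job; the catalog database, table, field mappings, and S3 output path are placeholders for illustration.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw claims registered in the Glue Data Catalog
claims = glue_context.create_dynamic_frame.from_catalog(
    database="claims_raw", table_name="daily_claims"
)

# Rename and retype a few fields, then write curated, partitioned Parquet to S3
curated = ApplyMapping.apply(
    frame=claims,
    mappings=[
        ("claim_id", "string", "claim_id", "string"),
        ("claim_amt", "string", "claim_amount", "double"),
        ("claim_dt", "string", "claim_date", "date"),
    ],
)
glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={"path": "s3://example-curated/claims/", "partitionKeys": ["claim_date"]},
    format="parquet",
)
job.commit()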

Developed reusable Python automation scripts to validate claim ingestion files, format raw data, and push curated outputs to S3 for actuarial analysis.

Integrated REST APIs with AWS Lambda to retrieve external policy and claims data, triggering real-time ingestion pipelines and downstream notifications.
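
A hedged sketch of that handler pattern: fetch from the external API, land the payload in S3, and notify the downstream queue. The environment variable names, bucket, and queue are invented for the example.

import json
import os
from datetime import datetime, timezone

import boto3
import urllib3

http = urllib3.PoolManager()
s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def handler(event, context):
    # Pull the latest policy/claims payload from the external REST endpoint
    resp = http.request("GET", os.environ["CLAIMS_API_URL"])

    # Land the raw payload in the S3 landing zone, keyed by timestamp
    key = f"landing/claims/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
    s3.put_object(Bucket=os.environ["LANDING_BUCKET"], Key=key, Body=resp.data)

    # Notify the ingestion pipeline that a new file is ready
    sqs.send_message(
        QueueUrl=os.environ["INGEST_QUEUE_URL"],
        MessageBody=json.dumps({"bucket": os.environ["LANDING_BUCKET"], "key": key}),
    )
    return {"statusCode": 200, "body": key}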

Collaborated with non-technical business users and data analysts to define data mapping templates using Excel and Google Sheets, standardizing metadata inputs and transformations.

Independently delivered and maintained a fully automated data validation pipeline that ensured clean, accessible data for reporting and audits.

Created serverless workflows with AWS Lambda and SQS for event-driven data ingestion, reducing batch processing time by 40% and improving SLA adherence.

Built a Medallion Architecture (Bronze-Silver-Gold) on AWS S3 and Databricks to centralize and streamline data enrichment for downstream analytics.

Implemented Apache Hudi and Iceberg to support incremental data loading, schema evolution, and time-travel capabilities.
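
For example, an incremental Hudi upsert and a follow-up incremental read might look roughly like this; the option keys reflect common Hudi defaults, while the table path, key columns, and the incoming DataFrame incremental_claims_df are assumptions for the sketch.

hudi_options = {
    "hoodie.table.name": "claims",
    "hoodie.datasource.write.recordkey.field": "claim_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "claim_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the latest batch of changed claim records into the silver table
(incremental_claims_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lake/silver/claims/"))

# Incremental read: only commits made after a given instant (time travel)
changes = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
    .load("s3://example-lake/silver/claims/"))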

Enforced data governance and compliance controls using AWS Lake Formation and Glue Data Catalog, enabling role-based access and PII protection aligned with HIPAA standards.

Developed Airflow DAGs and Glue Workflows with integrated data quality rules (schema checks, null tracking, field thresholds), resulting in a 65% reduction in data issue escalations.
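
A simplified version of that DAG pattern is sketched below; it assumes the Airflow Amazon provider package, and the job names, schedule, and quality-check body are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

def check_nulls(**_):
    # Placeholder for the real checks (schema drift, null tracking, field thresholds);
    # raising here fails the task and stops the DAG before the publish step.
    violations = 0  # e.g. query Athena for NULL counts on required fields
    if violations > 0:
        raise ValueError(f"{violations} required fields contained NULLs")

with DAG(
    dag_id="claims_daily_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",
    catchup=False,
) as dag:
    transform = GlueJobOperator(task_id="transform_claims", job_name="claims_transform")
    quality_gate = PythonOperator(task_id="quality_gate", python_callable=check_nulls)
    publish = GlueJobOperator(task_id="publish_curated", job_name="claims_publish")

    transform >> quality_gate >> publish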

Used Spark SQL and complex functions (e.g., map_agg, array_agg) to optimize transformations in Snowflake and Redshift, reducing query latency by 80%.
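
Those aggregate functions are warehouse-side SQL; an approximate PySpark equivalent of the same pattern, collapsing claim line items into per-claim arrays and maps, is shown below with made-up column names and an assumed line_items_df DataFrame.

from pyspark.sql import functions as F

per_claim = (line_items_df
    .groupBy("claim_id")
    .agg(
        # array_agg-style: all procedure codes for the claim in one array column
        F.collect_list("procedure_code").alias("procedure_codes"),
        # map_agg-style: procedure code -> line amount (assumes one amount per code per claim)
        F.map_from_entries(
            F.collect_list(F.struct("procedure_code", "line_amount"))
        ).alias("amount_by_procedure"),
    ))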

Built Dockerized ETL batch subsystems for historical data access and ad-hoc analytics, supporting self-service use cases for actuarial and business teams.

Created interactive dashboards in Tableau using semantic models defined in collaboration with data science and actuarial stakeholders.

Automated deployment of ETL workflows and infrastructure using Terraform, GitLab CI/CD, and AWS CodePipeline, enabling version-controlled promotion across dev, test, and prod environments.

Monitored workflow execution with AWS CloudWatch and Step Functions, ensuring resilience, observability, and automated error handling.

Environment: Linux, AWS (S3, Glue, EMR, Lambda, Lake Formation), Snowflake, Databricks, Apache Hudi, Apache Iceberg, AWS Glue Data Catalog, AWS Athena, Redshift, Terraform, GitLab CI/CD, AWS CodePipeline, AWS CloudWatch, Datadog, IAM, Data Masking, Python 3.6, Pandas, Scala, Hadoop 2.7.4, SAP CRM, SRM, SQL, Tableau, SSIS, Airflow, Spark SQL, REST APIs, Excel, Google Sheets

United Healthcare – San Francisco, CA March 2018 – July 2021

Data Engineer

Responsibilities:

At United Healthcare, I contributed to the digital transformation of healthcare data systems by migrating on-prem SQL and Hadoop workloads to Google Cloud Platform (GCP) and building cloud-native data pipelines to support advanced analytics, customer engagement, and compliance across healthcare and retail analytics initiatives.

Led the migration of on-prem MS SQL Server and Hadoop-based PySpark pipelines to GCP using Cloud Dataflow, Cloud Composer, and Dataproc, improving data processing speed and storage efficiency.

Built scalable data pipelines and ELT jobs using dbt and Snowflake, enabling analytics for retail customer behavior, inventory tracking, and performance forecasting.

Designed and implemented workflow orchestration using Apache Airflow and Cloud Composer, automating ETL execution across daily and event-triggered jobs.

Built Python-based utilities for automated reconciliation of BigQuery tables with legacy SQL Server sources, integrating API-driven ingestion triggers via GCP Cloud Functions.
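
In outline, the utility compared per-table row counts (and, beyond this sketch, column checksums) between the legacy SQL Server source and the migrated BigQuery dataset; the connection string, project, and table names are placeholders.

import pyodbc
from google.cloud import bigquery

TABLES = ["claims", "members", "providers"]

bq = bigquery.Client(project="example-project")
mssql = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=legacy-host;DATABASE=claims;"
    "UID=readonly;PWD=example"  # placeholder credentials
)

def sqlserver_count(table):
    cur = mssql.cursor()
    cur.execute(f"SELECT COUNT(*) FROM dbo.{table}")
    return cur.fetchone()[0]

def bigquery_count(table):
    query = f"SELECT COUNT(*) AS n FROM `example-project.claims.{table}`"
    return next(iter(bq.query(query).result())).n

for table in TABLES:
    src, tgt = sqlserver_count(table), bigquery_count(table)
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{table}: sqlserver={src} bigquery={tgt} [{status}]")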

Led internal efforts to standardize Google Sheets templates for QA and data governance teams to track pipeline-level metrics and validation flags.

Communicated regularly with cross-functional teams including engineers, business analysts, and compliance to align on data rules and pipeline SLAs.

Worked on end-to-end integration and transformation of healthcare claims data, ensuring strict adherence to HIPAA compliance standards and optimizing data pipelines for regulatory reporting and analytics.

Worked in a remote, Agile environment managing timelines independently and prioritizing issues based on impact and business requirements.

Executed end-to-end data validation strategies by reconciling legacy SQL Server datasets with BigQuery, ensuring accuracy post-migration.

Engineered Spark-based pipelines using Scala, PySpark, and Spark SQL to process large-scale structured and semi-structured data for use cases including customer sentiment analysis and real-time fraud detection.

Developed scheduled and event-based triggers using Cloud Functions, Cloud Scheduler, and Composer for automating data workflows in pricing and promotional analytics.
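
One such trigger, sketched as a Pub/Sub-driven Cloud Function (scheduled via Cloud Scheduler) that loads the latest pricing extract from GCS into BigQuery; the bucket, dataset, and table IDs are illustrative.

from google.cloud import bigquery

def load_pricing_extract(event, context):
    """Background Cloud Function invoked by a Pub/Sub message from Cloud Scheduler."""
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        "gs://example-retail-landing/pricing/latest/*.csv",
        "example-project.retail.pricing_daily",
        job_config=job_config,
    )
    load_job.result()  # block until the load finishes so errors surface in the function logs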

Created and maintained optimized storage structures using partitioned GCS hierarchies, improving query performance for historical and behavioral datasets.

Built robust ELT pipelines for high-volume retail transactions and loyalty program analytics, leveraging T-SQL, PL/SQL, and stored procedures.

Integrated BigQuery Data Transfer Service and Cloud Dataflow to support multi-channel retail data fusion and supply chain analytics.

Modeled data using dimensional modeling and Medallion Architecture, supporting reporting and segmentation use cases in Looker and Power BI.

Automated data ingestion and transformation workflows using Spark for bulk and incremental loads across POS, product catalog, and feedback systems.

Collaborated with QA teams to optimize pipeline performance and resolve data discrepancies for POS and e-commerce reporting.

Supported real-time analytics and ML model readiness through Spark pipelines for fraud detection and customer segmentation.

Managed cloud resources using IAM and MFA, ensuring secure access and compliance; containerized services using Docker and orchestrated with ECS/EKS for scalable deployments.

Leveraged AWS EMR, Athena, and Glue Metastore alongside GCP services for hybrid cloud integration and SKU-level reporting.

Environment: Google Cloud Platform (GCP), BigQuery, Cloud Dataflow, Cloud Composer, Dataproc, GCS, Cloud SQL, Cloud Functions, Cloud Scheduler, REST APIs, dbt, PySpark, Apache Spark, Spark SQL, T-SQL, PL/SQL, SQL Server, Scala, Python, AWS EMR, Glue, Athena, Looker, Power BI, Docker, ECS, EKS, Unix, Apache Kafka, Google Sheets, Excel, Data Governance, Medallion Architecture, CI/CD, Data Modeling, Retail Analytics.

Info Logitech Systems – Hyderabad, India May 2014 – Nov 2017

Data Engineer

Responsibilities:

At Info Logitech Systems, I was responsible for designing and developing data pipelines across hybrid cloud environments using Azure and AWS. My work focused on building secure, high-performance ETL solutions to support business intelligence, real-time data processing, and operational analytics.

Designed and implemented complex ETL pipelines using Azure Data Factory (ADF), AWS Glue, and Databricks for data integration and transformation across cloud and on-prem sources.

Developed and executed data profiling routines using PySpark and SQL to define business logic, data mappings, and model schemas.

Implemented Change Data Capture (CDC) and Slowly Changing Dimensions (SCD Type 1 and Type 2) in pipelines to manage delta loads and historical tracking in fact tables.
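
A condensed sketch of the SCD Type 2 flow as a Delta Lake merge-and-append; the dimension, key, and tracked columns are placeholders, and incoming_df stands in for the staged delta load.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forName(spark, "dw.dim_customer")
current = spark.table("dw.dim_customer").where("is_current = true")

# Rows that are new, or whose tracked attributes changed versus the current version
changed = (incoming_df.alias("u")
    .join(current.alias("d"), F.col("u.customer_id") == F.col("d.customer_id"), "left")
    .where("d.customer_id IS NULL OR u.address <> d.address OR u.plan_type <> d.plan_type")
    .select("u.*")
    .withColumn("effective_from", F.current_date())
    .withColumn("effective_to", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
    .cache())  # cached so it is not recomputed after the expire step below

# Step 1: close out (expire) the superseded current rows
(dim.alias("d")
    .merge(changed.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "effective_to": "current_date()"})
    .execute())

# Step 2: append the new row versions (including brand-new customers)
changed.write.format("delta").mode("append").saveAsTable("dw.dim_customer")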

Created and deployed scripts in Python and Bash for data validation, secure file movement, and automation between AWS S3 and Azure Blob Storage.
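
The file-movement piece of that tooling, sketched with boto3 and the Azure Blob SDK; the bucket, container, object key, and connection-string environment variable are placeholders.

import os

import boto3
from azure.storage.blob import BlobServiceClient

s3 = boto3.client("s3")
blob_service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONN_STR"])

def copy_extract(bucket: str, key: str, container: str) -> None:
    """Validate an S3 extract is non-empty, then stage it in Azure Blob Storage."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    if not body:
        raise ValueError(f"s3://{bucket}/{key} is empty - aborting transfer")

    blob = blob_service.get_blob_client(container=container, blob=key)
    blob.upload_blob(body, overwrite=True)

copy_extract("example-etl-extracts", "daily/orders_2017-06-01.csv", "staging")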

Built and orchestrated workflows using Apache Airflow and AWS Step Functions, enabling scheduled and event-driven processing.

Supported QA teams during system and integration testing phases, resolving defects and ensuring reliable data delivery.

Wrote comprehensive unit tests and maintained documentation including mapping specs, system technical manuals (STM), and test result logs.

Applied performance optimization techniques such as data partitioning, query tuning, and Spark job enhancements to improve runtime efficiency.

Developed ELT pipelines in Snowflake, utilizing SQL transformations for analytics, reporting, and ad-hoc querying.

Implemented serverless data workflows using AWS Lambda to process real-time streaming data for time-sensitive applications.

Environment: Azure Data Factory, Databricks, Snowflake, AWS Glue, AWS Lambda, PySpark, SQL, Apache Airflow, AWS Step Functions, Azure Blob Storage, Amazon S3, Bash, Python, Erwin, Control-M, Data Warehousing, CDC/SCD, Performance Tuning, Data Modeling, ETL Automation.


