Senior Data Engineer, Healthcare Data Platforms Expert

Location:
Edison, NJ
Salary:
$70/hr
Posted:
May 04, 2026

Resume:

Sr. Data Engineer

Name: Alekhya G

Email: ****************@*****.***

Ph: +1-848-***-****

LinkedIn: www.linkedin.com/in/alekhya-g-5243a9160

Professional Summary:

10+ years of experience designing and delivering scalable data engineering and database solutions across Healthcare/Life Sciences, Banking, and Capital Markets domains, supporting clinical operations, regulatory compliance, and analytics platforms.

Strong expertise in Azure Data Factory, Synapse Analytics, SQL Server, and healthcare data integration (EHR/LIS, HL7/FHIR) with deep focus on compliance, performance optimization, and secure PHI data handling.

Experienced in building data pipelines using Azure Data Factory (ADF v2), Azure Synapse Analytics (SQL Pools, Spark Pools), and SQL Server (2012–2022) for large-scale healthcare datasets.

Proven experience designing and optimizing data lake and data warehouse architectures on Azure Synapse and ADLS Gen2 supporting analytics, reporting, and machine learning workloads.

Extensive experience integrating Electronic Health Records (EHR), Laboratory Information Systems (LIS), and healthcare data sources using HL7 (v2.x), FHIR (R4), and JSON/XML formats (a minimal parsing sketch follows this summary).

Strong expertise in T-SQL, SSIS (2012–2019), stored procedures, indexing, and query tuning for high-performance SQL Server database systems.

Hands-on experience implementing data quality, validation, auditing, and reconciliation frameworks ensuring integrity of clinical and regulatory datasets.

Proven ability to design scalable data models (star/snowflake) optimized for healthcare analytics, reporting, and interoperability.

Strong experience in database administration including backup/recovery, disaster recovery (HA/DR), performance tuning, and monitoring using Azure Monitor, SQL Profiler, and Log Analytics.

Expertise in implementing data security, encryption, masking, and access controls aligned with HIPAA, HITECH, PHI protection, GDPR, and FDA 21 CFR Part 11 compliance standards.

Experience working with Azure SQL Database, Azure Blob Storage, ADLS Gen2, and Synapse Pipelines for end-to-end data engineering solutions.

Hands-on experience implementing CI/CD pipelines using Azure DevOps and GitHub Actions for data engineering and database deployments.

Strong collaboration with clinicians, analysts, data scientists, and application teams to deliver healthcare data solutions and analytics platforms.

Excellent problem-solving skills delivering secure, compliant, and high-performance data systems aligned with healthcare regulatory standards.
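
To make the FHIR integration work described in this summary concrete, here is a minimal sketch that flattens a FHIR R4 Patient resource into a tabular record using only the Python standard library. The sample resource and field choices are fabricated for illustration and do not come from any production system.

    import json

    # Minimal FHIR R4 Patient resource (hypothetical sample data).
    patient_json = """
    {
      "resourceType": "Patient",
      "id": "example-001",
      "name": [{"family": "Doe", "given": ["Jane"]}],
      "birthDate": "1980-04-12",
      "gender": "female"
    }
    """

    def flatten_patient(resource):
        # Flatten the nested FHIR structure into one tabular record.
        name = (resource.get("name") or [{}])[0]
        return {
            "patient_id": resource.get("id"),
            "family_name": name.get("family"),
            "given_name": " ".join(name.get("given", [])),
            "birth_date": resource.get("birthDate"),
            "gender": resource.get("gender"),
        }

    print(flatten_patient(json.loads(patient_json)))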

Technical Skills:

Programming Languages: Python (2.7–3.10), SQL (Advanced), PySpark, Spark SQL, PL/SQL, Shell Scripting (Bash), T-SQL, SSIS (2012–2019)

Big Data & Distributed Processing: Apache Spark (1.6–3.4+), Spark Structured Streaming, EMR (5.x–6.x), Hadoop (HDFS, Hive, MapReduce)

Cloud Platforms: AWS (primary), Microsoft Azure

Azure Data Services: Azure Data Factory (ADF v2), Azure Synapse Analytics (SQL Pools, Spark Pools), ADLS Gen2, Azure SQL Database, Azure Blob Storage

AWS Data Services: S3, EMR, Glue (2.x–4.0), Lambda, Redshift (RA3), RDS, Athena, Lake Formation

Streaming & Messaging: Apache Kafka (0.9–3.x), Amazon MSK (Kafka), Kinesis Data Streams

Data Warehousing: Amazon Redshift, Oracle, SQL Server, Amazon RDS (PostgreSQL/MySQL), Azure Synapse SQL Pools

Data Modeling: Dimensional Modeling (Star/Snowflake), Normalized Models (3NF), Data Partitioning, Performance Tuning, Workload Management (WLM)

ETL/ELT & Orchestration: Apache Airflow (2.x), AWS Glue, Spark Jobs, dbt Core (1.3–1.6+), ADF Pipelines, Synapse Pipelines

Data Quality & Governance: Great Expectations, AWS Deequ, AWS Glue Data Catalog, Hive Metastore, Data Auditing, Data Validation Frameworks

DevOps & Infrastructure as Code: Terraform (0.12–1.6+), CloudFormation, Jenkins, GitHub Actions, Git, Azure DevOps

Monitoring & Observability: CloudWatch, CloudTrail, AWS Config, Azure Monitor, Log Analytics, SQL Profiler

Security & Compliance: IAM, Lake Formation RBAC, KMS (AES-256), TLS 1.2+, Data Masking, SOX, SEC, FINRA, Basel III, HIPAA, FDA 21 CFR Part 11, PCI-DSS, GDPR, HITECH, PHI, HL7, FHIR

Machine Learning Integration: AWS SageMaker (integration), XGBoost (basic usage)

API & Application Integration: REST APIs, AWS Lambda Integrations

BI & Visualization: Tableau, Power BI, Amazon QuickSight

Healthcare Integration: EHR, LIS, HL7 v2.x, FHIR R4, JSON/XML APIs

Methodologies: Agile/Scrum, CI/CD, DataOps

Professional Experience:

GOLDMAN SACHS

Sr. Data Engineer Jan 2025 – Present

Responsibilities:

Architect and maintain scalable, fault-tolerant ETL/ELT pipelines using AWS Glue 4.0, EMR 6.x (Spark 3.4+), and Python to process high-volume capital markets data including trade lifecycle, market data feeds, KYC, AML, liquidity risk, and regulatory reporting datasets.

Design and implement data pipelines using Azure Data Factory and Synapse Analytics for regulated financial and healthcare-aligned datasets.

Develop SQL-based transformations in T-SQL and optimize queries for high-performance reporting systems.

Implement data lineage, auditing, and compliance frameworks supporting regulated datasets and audit requirements.

Design and optimize enterprise data lakehouse and data warehouse architecture using Amazon S3 (Parquet, Delta Lake), AWS Lake Formation, Amazon Redshift RA3, and Amazon Athena to support risk aggregation, PnL reporting, treasury analytics, and regulatory compliance reporting.

Develop and manage workflow orchestration using Apache Airflow 2.7+ for scheduling, dependency management, and end-to-end pipeline automation across batch and streaming data workloads (a DAG sketch follows this list).

Engineer real-time data pipelines using Amazon MSK (Kafka) and Spark Structured Streaming to enable near real-time trade surveillance, fraud detection, and market risk monitoring (a streaming sketch appears at the end of this section).

Implement robust data quality, validation, and reconciliation frameworks using dbt Core and Great Expectations to ensure high data integrity, data consistency, and audit-ready datasets aligned with SOX and regulatory requirements.

Develop high-performance PySpark transformation pipelines and optimize complex SQL queries in Amazon Redshift to improve performance for large-scale risk analytics, PnL calculations, and compliance reporting workloads.

Enforce secure data access and governance using AWS IAM roles, Lake Formation row- and column-level security, and AWS KMS encryption to protect sensitive financial, trading, and regulatory data.

Build and maintain centralized metadata management and data catalog solutions using AWS Glue Data Catalog to support data lineage, governance, and regulatory audit traceability across financial data platforms.

Implement monitoring, alerting, and logging solutions using Amazon CloudWatch to ensure data pipeline reliability, SLA adherence, and proactive detection of failures in production environments.

Integrate upstream and downstream data systems including trading platforms, risk engines, and regulatory systems to deliver curated, reconciled, and analytics-ready datasets.

Collaborate with cross-functional stakeholders including risk, treasury, compliance, and audit teams to deliver high-quality datasets, KPIs, and reporting solutions aligned with financial regulatory standards and business objectives.
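
As referenced in the orchestration bullet above, the following is a minimal Airflow 2.x DAG sketch of a daily extract-transform-load flow. The DAG id, task names, and schedule are illustrative placeholders, not the actual production pipeline.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull trade and market-data files from the landing zone")

    def transform():
        print("run PySpark transformations and enrichments")

    def load():
        print("load curated datasets into Redshift")

    # Daily batch pipeline; uses the Airflow 2.4+ "schedule" argument.
    with DAG(
        dag_id="trade_data_pipeline",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> transform_task >> load_task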

Environment: AWS (Glue 4.0, EMR 6.x, S3, Redshift RA3, Lake Formation, Athena, MSK Kafka, IAM, KMS, CloudWatch), Apache Airflow 2.7+, Spark 3.4+, PySpark, SQL, Python 3.x, dbt Core, Great Expectations, Delta Lake, Terraform, GitHub Actions, Agile/Scrum, Jira, SOX, SEC, FINRA, Basel III, AML, GDPR.
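
A minimal PySpark Structured Streaming sketch of the kind of Kafka-based pipeline described in this section. The broker address and topic name are placeholders, and it assumes the spark-sql-kafka connector is on the classpath; a production job would parse the payload and write to durable storage with checkpointing rather than to the console.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("trade-stream").getOrCreate()

    # Broker address and topic are placeholders.
    trades = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "trade-events")
        .load()
        .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    )

    # Console sink for inspection; a production job would write to S3/Delta
    # with checkpointing enabled.
    query = trades.writeStream.format("console").option("truncate", "false").start()
    query.awaitTermination()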

MORGAN STANLEY

Sr. Data Engineer Mar 2023 – Jan 2025

Responsibilities:

Designed and developed large-scale ETL/ELT pipelines using AWS Glue, EMR (Spark 3.3+), AWS Lambda, and Amazon Redshift to process high-volume trading, market data, liquidity risk, and AML reporting datasets.

Developed data pipelines using Azure Synapse and SQL Server for reporting and analytics use cases.

Built optimized SQL models and stored procedures supporting data validation, reconciliation, and compliance reporting.

Built scalable workflow orchestration using Apache Airflow 2.x for scheduling, dependency management, and monitoring of enterprise data pipelines across batch and streaming workloads.

Developed optimized data warehouse models in Amazon Redshift supporting regulatory reporting, risk aggregation, and PnL analytics aligned with SOX, SEC, and Basel III frameworks.

Engineered real-time and batch ingestion pipelines using Amazon MSK (Kafka) and Spark Structured Streaming enabling low-latency processing of trading event streams and market data feeds.

Implemented enterprise data validation, reconciliation, and quality frameworks using dbt Core, Great Expectations, and AWS Deequ, improving the accuracy and consistency of financial regulatory reports (a simplified reconciliation sketch follows this list).

Designed secure AWS-based data lake architecture using Amazon S3, AWS Lake Formation, IAM policies, and KMS encryption ensuring compliance with enterprise data governance and security standards.

Developed reusable PySpark-based transformation frameworks and modular dbt models accelerating onboarding and standardization of new trading and risk datasets.

Implemented metadata management and data cataloging using AWS Glue Data Catalog, enabling data lineage, governance, and audit traceability across financial data platforms (a catalog sketch appears at the end of this section).

Implemented monitoring, logging, and alerting solutions using Amazon CloudWatch ensuring reliability, SLA adherence, and proactive failure detection in production data pipelines.

Built interactive dashboards and reporting solutions using Amazon QuickSight and Power BI supporting risk, compliance, and regulatory reporting analytics.

Partnered with enterprise architecture, risk, and compliance teams to align data engineering solutions with financial regulatory requirements and enterprise data standards.
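
As referenced above, here is a deliberately simplified pandas stand-in for the kind of reconciliation rule that dbt tests or Great Expectations suites encode: join source-system and warehouse totals on a business key and flag breaks beyond a tolerance. The figures are fabricated.

    import pandas as pd

    # Hypothetical source-system and warehouse extracts of daily trade totals.
    source = pd.DataFrame({"trade_date": ["2024-06-03", "2024-06-04"],
                           "notional": [1_250_000.0, 980_000.0]})
    warehouse = pd.DataFrame({"trade_date": ["2024-06-03", "2024-06-04"],
                              "notional": [1_250_000.0, 975_000.0]})

    # Join on the business key and flag rows whose totals diverge beyond
    # a small tolerance -- the core of a reconciliation rule.
    merged = source.merge(warehouse, on="trade_date", suffixes=("_src", "_dwh"))
    merged["break"] = (merged["notional_src"] - merged["notional_dwh"]).abs() > 0.01

    print(merged[merged["break"]])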

Environment: AWS (Glue 3.x/4.0, EMR 6.x, Lambda, Redshift, S3, Lake Formation, MSK Kafka, IAM, KMS, CloudWatch, Athena), Apache Airflow 2.5+, Spark 3.3+, PySpark, SQL, Python 3.9, dbt Core, Great Expectations, AWS Deequ, Terraform, GitHub Actions, QuickSight, Power BI, Agile/Scrum, Jira, SOX, SEC, FINRA, Basel III, AML, GDPR.
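
A minimal boto3 sketch of the cataloging work described in this section: listing tables and their columns registered in the AWS Glue Data Catalog. It assumes configured AWS credentials; the region and database name ("trading_curated") are placeholders.

    import boto3

    # List tables registered in a Glue database; names are placeholders.
    glue = boto3.client("glue", region_name="us-east-1")

    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName="trading_curated"):
        for table in page["TableList"]:
            cols = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
            print(table["Name"], cols)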

AbbVie, Vernon Hills, IL

Sr. Data Engineer Sep 2020 – Feb 2023

Responsibilities:

Designed and developed large-scale ETL pipelines using AWS Glue, EMR (Spark 3.1–3.3), and Amazon S3 to process multi-terabyte clinical trial, patient, and healthcare datasets.

Designed and developed ETL pipelines using Azure Data Factory (ADF v2), Azure Synapse Pipelines, and SQL Server (2016–2019) to process large-scale clinical, patient, and laboratory datasets.

Integrated healthcare data sources including EHR, LIS, and clinical systems using HL7 v2.x, FHIR R4, JSON, and API-based ingestion frameworks (an HL7 parsing sketch appears at the end of this section).

Developed and optimized T-SQL stored procedures, indexing strategies, and query tuning techniques improving database performance and scalability.

Implemented data warehouse solutions using Azure Synapse Analytics (Dedicated SQL Pools) and ADLS Gen2 supporting clinical analytics and reporting.

Built data validation, auditing, and reconciliation frameworks ensuring data accuracy, consistency, and regulatory compliance for PHI datasets.

Administered SQL Server databases including backup/recovery, HA/DR strategies, performance tuning, and monitoring using SQL Profiler and Azure Monitor.

Implemented data security controls including PHI masking, encryption, access controls, and role-based security aligned with HIPAA, HITECH, and FDA 21 CFR Part 11.

Built scalable PySpark-based data processing frameworks to transform structured and semi-structured healthcare data including EHR, HL7, and JSON formats.

Implemented enterprise AWS data lake architecture using Amazon S3, AWS Lake Formation, IAM, and KMS encryption aligned with HIPAA and FDA 21 CFR Part 11 compliance standards.

Developed real-time and batch ingestion pipelines using AWS Lambda and Kinesis Data Streams enabling near real-time processing of healthcare and clinical event data.

Implemented data quality validation, reconciliation, and testing frameworks using Great Expectations, PyTest, and SQL-based validation rules ensuring accuracy and integrity of clinical research data.

Designed and optimized data warehouse models in Amazon Redshift supporting healthcare analytics, regulatory reporting, and clinical research insights.

Integrated machine learning model outputs from Amazon SageMaker into data pipelines for predictive patient risk analytics and clinical insights.

Developed metadata management and data catalog solutions using AWS Glue Data Catalog enabling data lineage, governance, and audit traceability for regulated datasets.

Implemented secure data protection controls including PHI masking, encryption-at-rest, and encryption-in-transit to safeguard sensitive healthcare information (a masking sketch follows this list).

Orchestrated workflows using Apache Airflow 2.4+ for scheduling, dependency management, and monitoring of data pipelines across batch and streaming workloads.

Collaborated with clinical analysts, data stewards, and compliance teams to deliver high-quality, validated datasets supporting clinical research and regulatory analytics.
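
A minimal Python sketch of the PHI-protection pattern referenced above: pseudonymize stable identifiers with a salted one-way hash so they remain usable as join keys, and mask display fields. The salt literal and sample record are illustrative; a real deployment would source the salt from a secrets manager.

    import hashlib

    SALT = "rotate-me"  # illustrative; in practice a managed secret, not a literal

    def pseudonymize(value):
        # One-way hash so the identifier can still serve as a join key
        # without exposing the underlying PHI.
        return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

    def mask_ssn(ssn):
        # Keep only the last four digits for display purposes.
        return "***-**-" + ssn[-4:]

    record = {"mrn": "MRN-0042", "ssn": "123-45-6789"}  # fabricated
    safe = {"mrn": pseudonymize(record["mrn"]), "ssn": mask_ssn(record["ssn"])}
    print(safe)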

Environment: AWS (Glue 2.x/3.x, EMR 6.x, Lambda, Redshift, S3, Kinesis Data Streams, Lake Formation, IAM, KMS, CloudWatch, Athena), Apache Airflow 2.4+, Spark 3.1–3.3, PySpark, Python 3.8/3.9, SQL, Great Expectations, Terraform 1.3+, GitHub Actions, Tableau, Agile/Scrum, Jira, HIPAA, FDA 21 CFR Part 11, GxP, GDPR, CI/CD.
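
A minimal sketch of the HL7 v2.x parsing referenced in this section, using only the Python standard library: split segments on carriage returns and fields on pipes, then pick out patient and observation fields. The message and identifiers are synthetic.

    # Minimal HL7 v2.x message (synthetic, with fabricated identifiers).
    hl7_message = "\r".join([
        "MSH|^~\\&|LIS|LAB|EHR|HOSP|202301150830||ORU^R01|MSG0001|P|2.5",
        "PID|1||PAT-001^^^HOSP||DOE^JANE||19800412|F",
        "OBX|1|NM|GLU^Glucose||95|mg/dL|70-110|N",
    ])

    def parse_segments(message):
        # HL7 v2 separates segments with carriage returns, fields with pipes.
        for segment in message.split("\r"):
            fields = segment.split("|")
            yield fields[0], fields

    for seg_id, fields in parse_segments(hl7_message):
        if seg_id == "PID":
            print("patient id:", fields[3].split("^")[0])
        elif seg_id == "OBX":
            print("result:", fields[3].split("^")[1], fields[5], fields[6])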

Charter Communications, Detroit, Michigan

Data Engineer Aug 2018 – Jul 2020

Responsibilities:

Developed scalable ETL pipelines using AWS Glue, Amazon S3, AWS Lambda, and Amazon Redshift to process telecom datasets including subscriber usage, CDR logs, and network telemetry data.

Designed SQL-based data models and pipelines supporting analytics and reporting use cases.

Implemented data validation and performance optimization for large-scale datasets.

Built real-time ingestion pipelines using Apache Kafka, enabling streaming analytics for telecom network monitoring and event-driven processing (a consumer sketch appears at the end of this section).

Implemented PySpark-based data transformation jobs on EMR clusters to process high-volume telecom datasets and optimize data processing performance.

Designed and optimized data warehouse schemas in Amazon Redshift and Amazon RDS supporting churn analytics, customer insights, and operational reporting.

Automated infrastructure provisioning using Terraform and AWS CloudFormation enabling consistent and repeatable deployment of data engineering environments.

Implemented CI/CD pipelines using Jenkins to automate build, testing, and deployment of data workflows.

Implemented data quality validation frameworks using SQL-based validation rules ensuring accuracy and consistency of telecom reporting data.

Optimized storage and query performance using S3 partitioning, Parquet file formats, and Redshift Spectrum for efficient querying of large-scale datasets (a partitioning sketch follows this list).

Integrated machine learning model outputs from Amazon SageMaker into data pipelines supporting telecom churn prediction and network analytics use cases.

Implemented monitoring and logging solutions using Amazon CloudWatch and AWS CloudTrail ensuring reliability and stability of data pipelines.

Enforced enterprise security using IAM roles, KMS encryption, and VPC networking controls aligned with telecom compliance standards.
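
A minimal PySpark sketch of the partitioned-Parquet layout referenced above. It writes to a local path for runnability; the production equivalent targeted an s3a:// bucket. The dataset is fabricated.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("usage-partitioning").getOrCreate()

    # Fabricated subscriber-usage rows.
    usage = spark.createDataFrame(
        [("sub-1", "2020-05-01", 1240),
         ("sub-2", "2020-05-01", 380),
         ("sub-1", "2020-05-02", 990)],
        ["subscriber_id", "usage_date", "mb_used"],
    )

    # Partitioning by date prunes scans to the days a query touches.
    # Local path for runnability; production wrote to an s3a:// bucket path.
    usage.write.mode("overwrite").partitionBy("usage_date").parquet("/tmp/subscriber_usage")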

Environment: AWS (Glue 1.x/2.x, S3, Lambda, Kinesis Data Streams, Redshift, RDS, EMR 5.x/6.x, IAM, VPC, CloudWatch, CloudTrail), Apache Spark 2.4–3.0, PySpark, Python 3.7, SQL, Kafka 2.x, Jenkins, Terraform 0.12+, CloudFormation, Tableau, SOC 2, PCI-DSS, FCC Compliance.
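
A minimal consumer sketch of the Kafka ingestion described in this section, assuming the kafka-python client. The broker, topic, field names, and threshold are placeholders.

    import json

    from kafka import KafkaConsumer

    # Consume network telemetry events; broker and topic are placeholders.
    consumer = KafkaConsumer(
        "network-telemetry",
        bootstrap_servers=["broker1:9092"],
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
        auto_offset_reset="earliest",
        group_id="telemetry-monitor",
    )

    for event in consumer:
        record = event.value
        # Flag links whose packet loss exceeds an illustrative threshold.
        if record.get("packet_loss_pct", 0) > 5:
            print("degraded link:", record.get("node_id"))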

PNB Bank, India

Data Engineer May 2014 – Nov 2017

Responsibilities:

Developed enterprise ETL pipelines using Hadoop (HDFS, Hive) and Apache Spark to process large-scale banking datasets supporting risk analytics, transaction monitoring, and regulatory reporting.

Developed SQL Server and Oracle-based data pipelines supporting regulated financial datasets and reporting systems.

Implemented data security, encryption, and compliance aligned with SOX, Basel III, and banking regulations.

Built batch and near real-time ingestion frameworks using Apache Kafka and Sqoop to ingest transactional data, customer data, and fraud detection feeds into distributed data platforms.

Designed and optimized data warehouse models in Oracle and SQL Server supporting banking analytics, regulatory reporting, and financial data aggregation.

Developed Spark and PySpark-based data transformation pipelines for processing financial datasets including transaction logs, customer profiles, and risk scoring data.

Implemented secure data architectures using role-based access controls, encryption mechanisms, and network security practices ensuring compliance with banking regulations.

Built CI/CD pipelines using Jenkins and Git automating deployment and version control of data engineering workflows.

Implemented data lineage and metadata management using Hive Metastore and audit logging frameworks ensuring traceability of financial datasets.

Optimized query and processing performance using Hive partitioning, indexing, and efficient data modeling techniques for large-scale banking data (a partition-pruning sketch appears at the end of this section).

Supported basic machine learning workflows using Python for credit risk analysis and customer segmentation analytics (a scoring sketch follows this list).

Implemented data governance controls aligned with SOX, PCI-DSS, Basel III, AML, and KYC regulatory frameworks.

Collaborated with risk, compliance, and audit teams to ensure data accuracy, consistency, and regulatory reporting compliance.
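
A minimal sketch of the kind of rule-based credit-risk scoring referenced above, standing in for a trained model. The weights, thresholds, and customer records are fabricated for illustration.

    # Toy customer records (fabricated).
    customers = [
        {"id": "C1", "utilization": 0.92, "missed_payments": 3},
        {"id": "C2", "utilization": 0.35, "missed_payments": 0},
        {"id": "C3", "utilization": 0.61, "missed_payments": 1},
    ]

    def risk_score(customer):
        # Simple weighted score standing in for a trained model.
        history = min(customer["missed_payments"] / 3, 1.0)
        return 0.6 * customer["utilization"] + 0.4 * history

    for customer in customers:
        score = risk_score(customer)
        band = "high" if score > 0.7 else "medium" if score > 0.4 else "low"
        print(customer["id"], round(score, 2), band)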

Environment: Hadoop (HDFS, Hive, MapReduce), Apache Spark 1.6–2.x, PySpark, Kafka 0.9–0.10, Sqoop, Oracle, SQL Server, Python 2.7/3.x, SQL, Jenkins, Git, Linux, Tableau, SOX, PCI-DSS, Basel III, AML, KYC, ISO 27001, Agile/Scrum, Jira.
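
A minimal PySpark sketch of the partition-pruning pattern referenced in this section: writing data Hive-style with partitionBy so month-filtered reads touch only the matching directories. Paths and data are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("txn-partitions").getOrCreate()

    # Fabricated transaction rows.
    txns = spark.createDataFrame(
        [("acct-1", "2016-01", 5000.0),
         ("acct-2", "2016-01", 120.0),
         ("acct-1", "2016-02", 87.5)],
        ["account_id", "txn_month", "amount"],
    )

    # partitionBy lays data out Hive-style (txn_month=2016-01/...), so a
    # month-filtered read scans only the matching directories.
    txns.write.mode("overwrite").partitionBy("txn_month").parquet("/tmp/txns")

    spark.read.parquet("/tmp/txns").where("txn_month = '2016-01'").show()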


