Hema Aravind Kommu
Data Engineer | 5+ Years
*****************@*****.*** +1-412-***-**** www.linkedin.com/in/kommu-a323882b0
PROFESSIONAL SUMMARY
Data Engineer with 5+ years of experience designing and building robust, scalable, and secure data pipelines using PySpark, Databricks, and Azure Data Factory for enterprise applications.
Specializing in cloud-native data architecture using Azure, managing data ingestion, transformation, orchestration, and delivery across healthcare, banking, and aviation domains.
Proficient in implementing ETL/ELT frameworks using ADF, Apache Spark, and SQL, ensuring accuracy, performance, and maintainability of complex data workflows.
Experienced in building Delta Lake architectures with bronze, silver, and gold zones to support structured data access, governance, and ACID-compliant data operations.
Developed dynamic ingestion frameworks that support schema evolution, multiple file formats (CSV, JSON, Avro), and automatically handle validation, transformation, and load failures.
Created real-time ingestion pipelines using Kafka and Azure Event Hubs, processing transactional and streaming events with sub-second latency for near real-time analytics.
Engineered data pipelines using Databricks Workflows and Jobs API for orchestration, chaining notebook execution with dependency management and automated alerts on success/failure.
Integrated and managed Azure Key Vault, RBAC, and user-based access controls to ensure data security, compliance, and credential protection across enterprise pipelines.
Worked with Azure Purview and Apache Atlas to implement data lineage, glossary, and classification, enabling strong data governance and audit transparency.
Tuned Spark jobs for performance using partitioning strategies, broadcast joins, and caching, reducing compute time and improving cluster cost efficiency.
Developed and version-controlled ADF pipelines and Databricks notebooks using Azure DevOps, supporting CI/CD and promoting infrastructure-as-code standards across teams.
Collaborated with data analysts and business teams to design semantic layers, data marts, and views using SQL, enabling self-service BI and Power BI reports.
Created robust data validation scripts and test cases to validate schema, record counts, duplicates, nulls, and referential integrity during batch and streaming loads.
Built Power BI dashboards from curated gold data layers, visualizing KPIs, metrics, and trends for executives and operational staff across departments.
Migrated legacy SSIS and SQL-based processes into modern Spark-based pipelines in Databricks, improving performance, maintainability, and scalability.
Integrated REST APIs from systems like Salesforce, ServiceNow, and SAP into Azure using ADF HTTP connectors and dynamic REST mapping patterns.
Developed metadata-driven frameworks for reusability and standardization of pipelines, reducing development time and enforcing consistency in naming and partition logic.
Participated in business requirement gathering sessions, translated requirements into technical design documents, and ensured traceability from source to reporting layers.
Implemented Type 2 Slowly Changing Dimensions (SCD) to track changes in historical data attributes such as customer demographics, pricing tiers, and operational statuses (a simplified merge sketch follows this summary).
Optimized cloud usage and costs by implementing auto-scaling, job clusters, and resource tagging strategies in Azure Databricks and Azure Data Lake.
Created internal documentation on pipeline architecture, failure recovery processes, and job schedules using Confluence and internal Wiki spaces for onboarding new team members.
Enabled end-to-end monitoring and alerts using Log Analytics, Azure Monitor, and custom metrics dashboards, helping stakeholders proactively manage SLA and failures.
Performed data profiling using PySpark, capturing field-level statistics, value distributions, null ratios, and anomaly detection for new data onboarding.
Ensured regulatory compliance, including HIPAA, GDPR, and SOX, by applying encryption, masking, and secure storage controls across data zones.
Followed Agile methodology, attending daily standups, sprint planning, retrospectives, and providing timely updates and technical inputs throughout delivery cycles.
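Illustrative sketch of the SCD Type 2 pattern referenced above, using a Delta Lake MERGE in PySpark. This is a simplified example, not the production framework: the paths, table, and column names (dim_customer, customer_id, tier) are hypothetical, and it assumes the update feed carries the dimension's business columns.

# Simplified SCD Type 2 sketch with Delta Lake MERGE; names and paths are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

updates = spark.read.format("delta").load("/lake/silver/customer_updates")
dim_path = "/lake/gold/dim_customer"
dim = DeltaTable.forPath(spark, dim_path)

# Keep only records that are new or whose tracked attribute changed.
current = dim.toDF().filter("is_current = true").select("customer_id", "tier")
changed = (updates.alias("u")
           .join(current.alias("c"), "customer_id", "left")
           .filter(F.col("c.tier").isNull() | (F.col("u.tier") != F.col("c.tier")))
           .select("u.*"))

# Step 1: expire the current version of each changed key.
(dim.alias("d")
    .merge(changed.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the new version as the current row (assumes matching dimension columns).
(changed
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").save(dim_path))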
TECHNICAL SKILLS
Programming Languages: Python, SQL, PySpark, Shell Scripting, Scala (basic)
Big Data Frameworks: Apache Spark, Hive, Hadoop, Kafka, Oozie, Sqoop, Flume
Cloud Platforms: Azure (ADF, ADLS, Databricks, Purview, DevOps, Key Vault), AWS (S3, EMR, Lambda)
Data Integration Tools: Azure Data Factory, Apache NiFi, SSIS, Databricks Jobs API
Data Storage: Azure Data Lake Gen2, HDFS, Delta Lake, Snowflake, SQL Server
Databases: Oracle, MySQL, PostgreSQL, SQL Server, Hive Metastore
File Formats: CSV, JSON, Parquet, ORC, Avro, XML
Monitoring & Logging: Azure Monitor, Log Analytics, CloudWatch, ELK Stack
DevOps & CI/CD: Azure DevOps, GitHub Actions, Git, Bitbucket, YAML Pipelines
Visualization Tools: Power BI, Tableau, Jupyter Notebooks
Governance & Quality: Great Expectations, Apache Atlas, Azure Purview, RBAC
Scheduling Tools: Oozie, Airflow (basic), ADF Triggers
ETL/ELT Concepts: SCD Type 1 & 2, Data Warehousing, Star/Snowflake Schema, Ingestion Frameworks
PROFESSIONAL EXPERIENCE
Client 3: Citibank
Duration: May 2024 – Present
Role: Data Engineer
Developed scalable pipelines using Azure Data Factory (ADF) and Databricks to ingest, cleanse, and structure massive volumes of financial transaction and customer behavioral data into Azure Data Lake.
Engineered real-time data streaming using Kafka to capture transactional logs for fraud detection and monitoring, reducing detection time from hours to minutes across 20+ datasets.
Implemented Delta Lake architecture (bronze/silver/gold) with ACID support, enabling controlled historical tracking, schema evolution, and rollback for banking data sources.
Created modular parameter-driven ADF pipelines to automate ingestion from SFTP, APIs, databases, and Blob storage, reducing manual intervention by 60% across 50+ workflows.
Applied PySpark transformations to create aggregated views, calculated fields, and analytical KPIs for credit scoring and customer retention analysis.
Designed Power BI dashboards for compliance and audit teams with visualizations on suspicious activity flags, KYC violations, and account anomalies.
Integrated Azure Key Vault, RBAC, and Managed Identity for secure authentication, ensuring encryption and least-privilege access across all pipeline components.
Developed CI/CD pipelines in Azure DevOps, promoting reusable notebooks, ADF configurations, and linked service deployments via YAML and ARM templates.
Tuned Spark cluster performance using dynamic partitioning, caching, and job parallelism, reducing heavy ETL job runtime by 40%.
Implemented data quality checks and logging in ADF and Databricks to capture null values, duplicates, and threshold alerts, pushing outputs to Log Analytics.
Integrated REST APIs to ingest external credit bureau data, validating and joining scores into customer profiles for model readiness.
Led migration of legacy SSIS-based pipelines into Databricks jobs, ensuring fidelity of transformations and aligning with cloud modernization strategy.
Collaborated with governance teams to enable Azure Purview integration, defining classification tags, glossary terms, and lineage for regulatory datasets.
Created job orchestration flows in Databricks with retry logic and dependency control, enhancing the reliability of sequential multi-stage workflows.
Developed metadata-driven frameworks for ingestion rules, file naming conventions, and schema mapping using SQL config tables and JSON mappings.
Built Python utilities to track schema drift, generate ingestion reports, and raise alerts for pipeline failures or data mismatches (a simplified drift-check sketch follows this section).
Delivered curated datasets and views to data science teams to support model development for customer risk, churn, and upsell prediction.
Created Confluence documentation with architecture diagrams, job flow charts, error troubleshooting guides, and onboarding materials for team members and stakeholders.
Participated in sprint demos, backlog grooming, and daily scrum to maintain agile momentum and deliver biweekly data releases.
Automated retry alerts via Azure Monitor, reducing downtime and improving SLA metrics for mission-critical banking ingestion pipelines.
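Simplified sketch of the schema-drift check mentioned above. It is illustrative only: the expected-schema JSON contract, its format (a list of {"name", "type"} entries), the input path, and the fail-fast behavior are all hypothetical stand-ins for the real framework's config tables and alerting.

# Hypothetical schema-drift check: compares an incoming feed against a stored schema contract.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def detect_schema_drift(df, expected_schema_path):
    """Return the added / removed / retyped columns versus a JSON schema contract."""
    with open(expected_schema_path) as fh:
        expected = {c["name"]: c["type"] for c in json.load(fh)}
    actual = {fld.name: fld.dataType.simpleString() for fld in df.schema.fields}

    added = sorted(set(actual) - set(expected))
    removed = sorted(set(expected) - set(actual))
    retyped = sorted(c for c in set(actual) & set(expected) if actual[c] != expected[c])
    return {"added": added, "removed": removed, "retyped": retyped}

incoming = spark.read.parquet("/lake/bronze/credit_bureau/2024-05-01/")
drift = detect_schema_drift(incoming, "config/credit_bureau_schema.json")
if any(drift.values()):
    # The real pipeline would push an alert (e.g., to Log Analytics); the sketch fails fast
    # so the load does not proceed with drifted data.
    raise ValueError(f"Schema drift detected: {drift}")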
Client 2: Ascension Health
Duration: December 2020 – August 2022
Role: Data Engineer
Designed and developed scalable data pipelines using Azure Data Factory (ADF) and Databricks, transforming patient admissions, treatments, and discharge data into analytics-ready formats.
Built parameterized pipelines in ADF to dynamically ingest 40+ data feeds from hospitals and clinical departments, supporting different file types and custom business rules.
Used PySpark in Databricks to cleanse, standardize, and enrich healthcare datasets with ICD codes, diagnosis info, and timestamp formatting to enable consistent downstream usage.
Implemented Delta Lake storage architecture with bronze, silver, and gold zones for optimal data versioning, rollback, and data validation.
Developed real-time streaming pipelines using Azure Event Hubs to monitor patient vitals from medical IoT devices and raise alerts for anomalies.
Applied schema enforcement and evolution logic to handle changing data structures from external hospital systems without causing pipeline failures or manual intervention.
Created data quality checks using Python scripts and validation rules to monitor null values, field mismatches, duplication, and business logic violations (a simplified validation sketch follows this section).
Automated job execution with Databricks Workflows, configuring dependencies and email alerts for success/failure notifications and SLA breaches.
Developed dashboards in Power BI displaying hospital census, wait times, staff-to-patient ratios, and ER overcapacity alerts in real time for operational leadership.
Integrated on-prem clinical systems like Cerner and Epic into Azure using self-hosted integration runtimes (IR), managing hybrid data workflows.
Tuned Spark cluster performance using broadcast joins, partition filters, and adaptive execution to reduce job durations across batch processing jobs.
Configured Azure Monitor and Log Analytics to collect logs, track metrics, and visualize pipeline executions across various layers and environments.
Built and maintained metadata-driven frameworks that governed dynamic folder paths, ingestion frequency, and pipeline parameters using control tables.
Collaborated with healthcare analysts to deliver SQL views and data marts for use in patient outcome tracking, length-of-stay metrics, and readmission reports.
Implemented access controls via Azure AD and RBAC to restrict PHI access based on user roles and compliance guidelines.
Migrated legacy SSIS and stored procedure-based ETLs to modern Spark-based pipelines, reducing maintenance overhead and improving performance.
Participated in sprint planning, backlog grooming, and daily scrums to deliver incremental data enhancements aligned with agile practices.
Worked with QA to design regression test scripts and validate source-to-target mappings across staging, transformation, and reporting layers.
Built reusable notebooks in Databricks for data exploration, schema profiling, and ad-hoc quality investigations used by data scientists and analysts.
Documented architecture diagrams, pipeline runbooks, and data dictionary terms using Confluence and SharePoint for better knowledge management.
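Simplified sketch of the data-quality validation pattern referenced above. The dataset path, required and key columns, and the fail-on-violation behavior are hypothetical examples; the production version logged results to Azure Monitor / Log Analytics rather than raising directly.

# Hypothetical PySpark data-quality checks: null rates per required column and duplicate business keys.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/lake/silver/encounters")

required_cols = ["patient_id", "admit_ts", "facility_id"]
key_cols = ["patient_id", "admit_ts"]

total = df.count()

# Null count per required column.
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in required_cols]
).first().asDict()

# Duplicate business keys.
duplicates = (df.groupBy(key_cols).count()
                .filter(F.col("count") > 1)
                .count())

failures = {c: n for c, n in null_counts.items() if n > 0}
if duplicates > 0:
    failures["duplicate_keys"] = duplicates

# Fail the job when any rule is violated (alerting is handled elsewhere in the real pipeline).
if failures:
    raise ValueError(f"Data quality violations on {total} rows: {failures}")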
Client 1: JetBlue Airways
Duration: January 2019 – November 2020
Role: Data Engineer
Built and managed scalable ETL pipelines using Apache Spark and Azure Data Factory to ingest flight schedules, bookings, crew assignments, and delay logs into centralized data lakes.
Designed and implemented data pipelines for integrating on-prem airline systems with Azure, handling data from flight sensors, maintenance logs, and passenger reservations.
Engineered Delta Lake tables with merge logic to manage late-arriving data and maintain consistency in operational KPIs such as on-time performance and passenger load factors (a simplified merge sketch follows this section).
Developed real-time ingestion of flight status updates using Kafka, enabling up-to-the-minute updates to dashboards and airline notification systems.
Cleaned and normalized customer data from loyalty programs, flight check-ins, and third-party vendors, applying PySpark-based transformation rules to enable accurate downstream analysis.
Built Power BI dashboards that visualized KPIs like flight punctuality, average delay by airport, baggage claim times, and customer satisfaction metrics.
Created Python-based data validation scripts to audit ingestion files for completeness, null checks, duplication, and schema mismatches before promoting to the silver zone.
Optimized Spark performance through partitioning strategies, broadcast joins, and data bucketing to manage large datasets from multiple aircraft logs and telemetry feeds.
Collaborated with DevOps to automate cluster management and job deployment via CI/CD pipelines using Azure DevOps and Git.
Created lookup and reference data mappings for airport codes, aircraft types, and service classes to enrich analytical models and operational reporting datasets.
Worked with data governance teams to implement sensitive data handling policies, including masking of customer PII, frequent flyer details, and payment tokens.
Enabled automated archival and purging mechanisms for aged raw data using lifecycle policies in Azure Data Lake Gen2 to optimize storage usage.
Conducted schema profiling using PySpark and SQL to ensure data readiness for predictive models around delay prediction and fuel consumption optimization.
Participated in data migration from on-prem SQL Server and Hadoop systems into Azure-based data lake and Synapse warehouse environment.
Maintained data catalog using Apache Atlas and shared documentation for every table and transformation logic to improve discoverability and reduce rework.
Integrated external API data (weather, airport traffic, FAA notices) to enhance decision-making around flight delays, reroutes, and crew rescheduling workflows.
Developed event-driven triggers using Azure Functions to kick off post-ingestion validations, send notifications, and push results to downstream analytics teams.
Collaborated with the QA team to build reusable test datasets and assertion logic to validate every stage of data movement and transformation.
Participated in monthly compliance reviews to ensure alignment with FAA data audit rules, retention policies, and operational risk controls.
Delivered weekly presentations and walkthroughs to airline business stakeholders explaining the impact and usage of new dashboards, data assets, and KPIs.
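Simplified sketch of the late-arriving-data merge pattern referenced above, using a Delta Lake upsert that only overwrites a flight-leg row when the incoming event is newer. The paths, table, and column names (flight_status, flight_id, leg_date, event_ts) are hypothetical examples, not the airline's actual schema.

# Hypothetical Delta Lake upsert for late-arriving flight status events.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

late_events = spark.read.format("delta").load("/lake/silver/flight_status_updates")
flights = DeltaTable.forPath(spark, "/lake/gold/flight_status")

# Update an existing flight-leg row only if the incoming event is newer; insert new legs;
# older duplicate events fall through the condition and are ignored.
(flights.alias("t")
    .merge(late_events.alias("s"),
           "t.flight_id = s.flight_id AND t.leg_date = s.leg_date")
    .whenMatchedUpdateAll(condition="s.event_ts > t.event_ts")
    .whenNotMatchedInsertAll()
    .execute())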
EDUCATION
Master's in Information Systems and Management
Robert Morris University