
Data Engineer Azure

Location:
Jersey City, NJ
Posted:
October 15, 2025

Resume:

Kranthi Kakarla

830-***-**** *******.*********@*****.*** LinkedIn

PROFESSIONAL SUMMARY

• Data Engineer with 7+ years of experience designing and implementing scalable ETL pipelines, data lakes, and cloud-native analytics solutions across AWS, Azure, and GCP.

• Extensive hands-on expertise with AWS (S3, Glue, Redshift, Lambda, Kinesis, MSK) for building high-performance batch and streaming data workflows.

• Proficient in Azure Data Factory (ADF), Databricks (PySpark, Delta Lake, Unity Catalog), Synapse Analytics, and ADLS, supporting large-scale migrations and operational analytics.

• Skilled in GCP (BigQuery, Dataflow, Dataproc, Pub/Sub) for cost-effective real-time ingestion and analytics, and integrating Apache Beam for time-series transformations.

• Strong experience with big data and orchestration tools such as Apache Spark, PySpark, Hadoop (HDFS, Hive), Kafka, Airflow, Matillion, Delta Lake, dbt, and Informatica for processing complex, high-volume datasets.

• Delivered AI/ML and GenAI solutions using Azure OpenAI, Hugging Face Transformers, LangChain, GPT-4 APIs, SageMaker, Spark ML, XGBoost, and MLflow, supporting predictive analytics, anomaly detection, and intelligent automation.

• Experienced in developing and deploying infrastructure solutions using Terraform, with working knowledge of CloudFormation and CDK.

• Familiar with network architecture principles, including VPNs, firewalls, routing, and security protocols for cloud-hosted environments.

• Proficient in SQL and NoSQL databases including Snowflake, Redshift, BigQuery, Synapse, PostgreSQL, DynamoDB, and Cassandra, with strong query optimization and data modeling skills (Star/Snowflake Schema).

• Built interactive dashboards and advanced KPIs using Power BI, Tableau, Looker, and QuickSight, delivering actionable insights to business and executive teams.

• Experienced in data governance and compliance using Azure Purview, IBM IGC, Ataccama DQ, Key Vault, Row-Level Security (RLS), and Data Sensitivity Labels, ensuring GDPR, HIPAA, HITRUST, and SOC 2 compliance.

• Skilled in CI/CD and infrastructure automation with Terraform, Jenkins, GitLab CI/CD, GitHub Actions, and Docker, enabling seamless provisioning and deployment of pipelines and ML workflows.

• Adept at observability and monitoring using DataDog, Grafana, Azure Monitor, and CloudWatch, building proactive alerting, lineage tracking, and SLA enforcement for mission-critical workloads.

• Collaborates closely with cross-functional teams, including AI/ML, DevOps, Bioinformatics, and Business stakeholders, to deliver enterprise-scale, multi-cloud-ready analytics and AI-driven solutions.

EDUCATION

• Master's in Information Science - Trine University, MI.

CERTIFICATIONS

• Microsoft Certified: Azure Data Fundamentals

• AWS Certified: DevOps Engineer

• Google Cloud Certified: Cloud Engineer

TECHNICAL SKILLS

• Cloud Platforms & Services: AWS (S3, Glue, Redshift, Lambda, Kinesis, MSK, EMR), Azure (Data Factory, Databricks, Synapse, ADLS, Purview, Event Hubs, Event Grid, Cosmos DB, API Management), GCP (BigQuery, Dataflow, Dataproc, Pub/Sub), Snowflake, Palantir Foundry

• Big Data & ETL Processing: Spark, PySpark, Scala, Matillion, Hadoop (HDFS, Hive), Kafka, Airflow, Delta Lake, dbt, Apache Beam, Informatica, AWS Step Functions

• Databases & Data Warehousing: Snowflake, Redshift, BigQuery, Synapse, PostgreSQL, MongoDB, Couchbase, DynamoDB, Cassandra

• Streaming & Real-Time Analytics: Kinesis, Spark Streaming, Azure Stream Analytics, Pub/Sub, Flink, Apache Pulsar, Materialize

• DataOps & Infrastructure: Terraform, Jenkins, Docker, Kubernetes, GitLab CI/CD, GitHub Actions

• BI, Visualization & Monitoring: Power BI, Tableau, Looker, QuickSight, DataDog, Grafana

• Programming & Data Modeling: Python, SQL, PySpark, Scala, Java, SQL Mesh, GeoPandas, Shapely

• Networking & Security: VPNs, TCP/IP, IAM, RBAC, Azure Key Vault, Firewalls

• LLMs & GenAI: Azure OpenAI, OpenAI API, Hugging Face Transformers, LangChain, Prompt Engineering, BERT, GPT-2, GPT-3.5/4

• ML & Advanced Analytics: Azure ML, SageMaker, Spark MLlib, XGBoost, Prophet, scikit-learn

• Data Governance & Quality: Azure Purview, Ataccama DQ, IBM IGC, Great Expectations, Data Sensitivity Labels, Row-Level Security (RLS)

PROFESSIONAL EXPERIENCE

American Airlines, Dallas, TX Aug 2024 – Present
Azure Data Engineer

Responsibilities:

• Designed and deployed enterprise-grade, end-to-end ETL pipelines using Azure Data Factory (ADF), Databricks (PySpark, Delta Lake, Unity Catalog), Snowflake, and Synapse Analytics, integrating structured, semi-structured, streaming, and API-based datasets (Kafka, Event Hubs, REST APIs) to support real-time and batch processing across flight operations, customer, and finance domains.

• Implemented a medallion architecture (Bronze, Silver, Gold layers) within Databricks and Delta Lake, transforming raw streaming and transactional datasets into curated and aggregated layers optimized for predictive analytics, ML model consumption, and operational reporting.
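
A minimal PySpark sketch of such a Bronze/Silver/Gold flow on Delta Lake; the paths, table names, and columns are illustrative assumptions rather than the production pipeline.

```python
# Illustrative Bronze -> Silver -> Gold flow on Databricks Delta Lake.
# Paths, table names, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land raw flight-event JSON as-is, preserving source fidelity.
bronze = (spark.read.json("/mnt/raw/flight_events/")
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.flight_events")

# Silver: de-duplicate and conform types for downstream joins.
silver = (spark.table("bronze.flight_events")
          .dropDuplicates(["event_id"])
          .withColumn("event_ts", F.to_timestamp("event_ts"))
          .filter(F.col("flight_number").isNotNull()))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.flight_events")

# Gold: aggregate to the grain that BI and ML consumers need.
gold = (spark.table("silver.flight_events")
        .groupBy("flight_number", F.to_date("event_ts").alias("event_date"))
        .agg(F.avg("delay_minutes").alias("avg_delay_minutes"),
             F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.flight_delay_daily")
```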

• Developed and optimized complex T-SQL and Snowflake SQL procedures, window functions, materialized views, and dynamic pivot queries, improving BI dashboard performance for mission-critical reports used by revenue management and operations teams.

• Integrated Generative AI (GenAI) and LLM pipelines using Azure OpenAI, LangChain, Hugging Face Transformers, and GPT-4 APIs for automated anomaly detection, natural language summarization of operational metrics, and intelligent data classification.

• Built and trained machine learning models using Azure ML, Spark MLlib, and scikit-learn for flight delay prediction, revenue forecasting, and customer behavior analysis, leveraging Databricks MLflow for experiment tracking and automated deployment pipelines.
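
A small MLflow tracking sketch in the spirit of the experiment workflow above, assuming a scikit-learn model and hypothetical feature names and training extract.

```python
# Minimal MLflow experiment-tracking sketch for a delay-prediction model.
# Feature names and the training extract are hypothetical.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_parquet("flight_training_data.parquet")  # assumed export of the Gold layer
X = df[["dep_hour", "carrier_delay_hist", "weather_score"]]
y = df["is_delayed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("flight-delay-prediction")
with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=200, max_depth=4)
    model.fit(X_train, y_train)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")  # picked up by downstream deployment jobs
```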

• Migrated multiple legacy SSIS and on-prem SQL Server ETL processes into cloud-native solutions using ADF, Databricks, and Snowflake.

• Designed and optimized data models, schema, and indexing strategies for MongoDB and AWS DynamoDB, ensuring low-latency queries and scalability for real-time applications.

• Implemented sharding, replication, and partitioning strategies to support distributed data workloads and high availability.

• Collaborated with DevOps teams to automate backup, recovery, and HA/DR strategies using Terraform and Ansible, enhancing fault tolerance.

• Integrated Kafka streaming with NoSQL databases for real-time analytics and event-driven data processing.

• Automated data quality, validation, and observability using dbt, Great Expectations, Azure Monitor, and DataDog, establishing alerting, lineage, and SLA enforcement across mission-critical pipelines.
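
A minimal data-quality sketch using the classic Great Expectations pandas API; the dataset, columns, and thresholds are assumed for illustration, and the alerting hook is simplified to a raised error.

```python
# Data-quality spot checks with the classic Great Expectations pandas API;
# the extract, columns, and thresholds are illustrative assumptions.
import great_expectations as ge
import pandas as pd

bookings = pd.read_parquet("silver_bookings_sample.parquet")  # hypothetical extract
df = ge.from_pandas(bookings)

results = [
    df.expect_column_values_to_not_be_null("booking_id"),
    df.expect_column_values_to_be_unique("booking_id"),
    df.expect_column_values_to_be_between("fare_amount", min_value=0, max_value=20000),
]

failed = [r for r in results if not r.success]
if failed:
    # In the real pipeline this would feed Azure Monitor / DataDog alerting;
    # here the run simply fails so the orchestrator marks it unhealthy.
    raise ValueError(f"{len(failed)} data-quality expectations failed")
```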

• Created self-service Snowflake data marts for Finance, Operations, and Marketing, implementing role-based access, time-travel, and data sharing features, enabling faster insights for analysts and data scientists.

• Developed interactive Power BI dashboards with DAX measures and DirectQuery connections to Snowflake/Synapse, delivering real-time KPI monitoring and predictive trend analysis for senior executives.

• Applied data security, governance, and compliance frameworks using Azure Purview, Key Vault, Private Endpoints, Row-Level Security (RLS), and Dynamic Data Masking, ensuring GDPR, CCPA, and HIPAA compliance for sensitive operational datasets.

• Designed CI/CD workflows using Terraform and GitHub Actions for provisioning infrastructure on multi-cloud platforms, applying DevOps best practices and cost-optimization strategies.

• Built fully automated CI/CD workflows using Jenkins, GitHub Actions, Terraform, and Docker for provisioning ADF pipelines, Databricks clusters, and Snowflake resources.

• Collaborated with DevOps, AI/ML teams, and business stakeholders to design scalable, multi-cloud-ready architectures, supporting streaming analytics, ML pipelines, and GenAI-driven data solutions for enterprise decision-making.

Mylon Laboratories, India Aug 2020 – Jul 2023

Azure Data Engineer

Responsibilities:

• Designed and implemented high-throughput ETL and ELT pipelines using Matillion, Apache NiFi, and Azure Data Factory (ADF) to integrate EHR datasets (HL7/FHIR), claims feeds, and external provider networks into a HIPAA-compliant Azure Data Lakehouse built on ADLS Gen2 and Snowflake.

• Architected real-time streaming pipelines using Azure Event Hubs, Event Grid, and Stream Analytics, enabling ingestion of telehealth, IoT device, and pharmacy feeds into Databricks Delta tables for downstream care analytics.

• Leveraged Azure Dataverse and Cosmos DB for handling semi-structured, JSON, and document-based medical records, enabling care teams to query unstructured data at scale with minimal latency.

• Developed multi-layered star and snowflake schemas optimized for population health analytics and predictive modeling, enabling actuarial teams to run risk scoring and utilization forecasting with high performance.

• Automated data validation, PHI masking, and quality scoring using Azure Logic Apps, Power Automate, dbt, and Great Expectations, providing end-to-end lineage and SLA reporting for compliance audits.

• Built ML and statistical pipelines for fraud detection, readmission risk, and chronic disease progression modeling using Azure ML, Spark MLlib, XGBoost, and Prophet, integrating MLflow for tracking, deployment, and drift detection.

• Migrated critical datasets from traditional RDBMS (SQL Server/Oracle) to Cassandra and Azure Cosmos DB, improving horizontal scalability and reducing query response times by 40%.

• Integrated IAM, private endpoints, VNet/Subnet configurations, and encryption standards to enforce enterprise security and compliance.

• Tuned NoSQL performance via query optimization, data partitioning, and caching strategies using Redis.

• Developed automation scripts in Python and Shell for database health monitoring, capacity planning, and alerting mechanisms.
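
An illustrative health-check script along these lines, using redis-py and a generic webhook; host names, thresholds, and the webhook URL are hypothetical.

```python
# Illustrative database health-check and alerting script.
# Hosts, thresholds, and the webhook URL are hypothetical placeholders.
import requests
import redis

ALERT_WEBHOOK = "https://example.com/alerts"   # assumed Teams/Slack-style webhook
MEMORY_THRESHOLD_BYTES = 4 * 1024**3           # alert when the cache exceeds ~4 GB

def check_redis(host="cache-prod", port=6379):
    """Return a list of human-readable problems found on the Redis node."""
    problems = []
    try:
        info = redis.Redis(host=host, port=port, socket_timeout=5).info()
        if info["used_memory"] > MEMORY_THRESHOLD_BYTES:
            problems.append(f"Redis memory high: {info['used_memory']} bytes")
        if info.get("connected_clients", 0) == 0:
            problems.append("Redis reports zero connected clients")
    except redis.RedisError as exc:
        problems.append(f"Redis unreachable: {exc}")
    return problems

if __name__ == "__main__":
    issues = check_redis()
    if issues:
        requests.post(ALERT_WEBHOOK, json={"text": "\n".join(issues)}, timeout=10)
```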

• Deployed FHIR-compliant APIs via Azure API Management to enable secure, scalable consumption of curated clinical datasets by provider and research teams.

• Created interactive Power BI dashboards with Direct Lake Mode and composite models to deliver claims cost insights, readmission KPIs, and predictive alerts directly to care managers and executives.

• Configured advanced governance and security using Azure Purview, Key Vault, Private Endpoints, Data Sensitivity Labels, and Role-Based Access Control (RBAC) to ensure compliance with HIPAA, HITRUST, and SOC 2 standards.

• Migrated legacy Informatica and Hadoop-based ETL pipelines into cloud-native Databricks and Snowflake workflows, improving processing throughput and reducing operational overhead.

• Implemented GitLab CI/CD and Terraform automation to provision ADF pipelines, Databricks clusters, Snowflake warehouses, and Event Hub namespaces, streamlining deployments across environments.

• Collaborated with clinical data scientists, care management teams, and actuarial groups to deliver predictive analytics solutions that improved care gap closure rates and reduced claims leakage.

Rwaltz Group, India Sep 2018 – Jul 2020

AWS Data Engineer

Responsibilities:

• Designed and deployed enterprise-grade ETL pipelines using Python, Scala, AWS Glue (Crawlers, Catalog, Jobs), and Spark to process large-scale genomic, clinical trial, and EHR datasets, integrating multi-terabyte feeds from lab systems, clinical partners, and public genomic repositories into AWS S3 data lakes for downstream analytics.

• Orchestrated streaming and batch ingestion using Kafka Connect and AWS MSK, enabling ingestion of HL7/FHIR messages, real-time trial results, and sensor data into S3, Redshift Spectrum, and Snowflake, ensuring low-latency data delivery for research teams.

• Built PySpark ETL frameworks in AWS Glue with dynamic partitioning, bucketing, and windowed transformations, applying custom serialization and predicate pushdown optimizations to cut processing times across clinical data pipelines.
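
A condensed AWS Glue job sketch showing catalog predicate pushdown and a partitioned Parquet write; the database, table, and column names are illustrative assumptions.

```python
# Sketch of a Glue PySpark job with catalog pushdown and partitioned output;
# database, table, columns, and S3 paths are hypothetical.
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Predicate pushdown: read only the catalog partitions the job needs.
raw = glue_ctx.create_dynamic_frame.from_catalog(
    database="clinical_raw",
    table_name="trial_results",
    push_down_predicate="ingest_date >= '2020-01-01'",
)

# Column conformance and de-duplication in plain Spark.
curated = (raw.toDF()
           .withColumn("result_value", F.col("result_value").cast("double"))
           .dropDuplicates(["subject_id", "visit_id", "test_code"]))

# Partitioned write back to the lake for Redshift Spectrum / Athena consumers.
glue_ctx.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(curated, glue_ctx, "curated"),
    connection_type="s3",
    connection_options={"path": "s3://clinical-lake/curated/trial_results/",
                        "partitionKeys": ["ingest_date"]},
    format="parquet",
)
```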

• Developed PostgreSQL, Redshift, and Snowflake schemas with materialized views, distribution keys, and clustering strategies for efficient querying of multi-billion-row trial datasets, enabling faster exploratory research and regulatory reporting.

• Integrated Informatica Cloud for ingestion and transformation of structured and semi-structured EHR (HL7, CCD, FHIR) data, implementing schema drift detection, automated validation, and PHI masking, ensuring full HIPAA and 21 CFR Part 11 compliance.

• Implemented ontology-driven semantic models in Palantir Foundry, mapping clinical, genomic, and trial data to standardized ontologies, while leveraging lineage tracking and metadata versioning to streamline audits and regulatory submissions.

• Automated multi-stage workflows using AWS Step Functions to orchestrate Glue ETL jobs, Redshift/Snowflake loads, and Foundry data transformations, achieving resilient, event-driven orchestration across multi-environment pipelines.

• Created PySpark-based log and anomaly detection pipelines for schema validation, PHI exposure monitoring, and trial data consistency checks, integrating alerts via AWS CloudWatch, SNS, and Slack for rapid incident response.

• Integrated IBM IGC (Information Governance Catalog) with Ataccama DQ to enforce data quality rules, metadata lineage tracking, and PHI classification, improving regulatory transparency and reducing downstream errors.

• Developed fully automated CI/CD pipelines with GitLab, Terraform, and Docker, provisioning AWS Glue jobs, Redshift clusters, Palantir Foundry transformations, and IAM roles across multiple environments, reducing deployment cycles.

• Partnered with bioinformatics, data science, and compliance teams to deliver curated, governed data products that accelerated drug discovery timelines and improved clinical trial insights through advanced analytics and machine learning-ready datasets.

Siemens, India Jun 2017 – Aug 2018

Data Engineer

Responsibilities:

• Designed and maintained batch and streaming data pipelines using Databricks (PySpark, Delta Lake), Hadoop (HDFS, Hive, Spark Streaming), and Kafka to process industrial IoT sensor telemetry, equipment logs, and ERP system data, enabling predictive maintenance and anomaly detection.
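
A minimal Structured Streaming sketch of such a Kafka-to-Delta telemetry flow; broker addresses, topic, schema, and storage paths are hypothetical.

```python
# Minimal Kafka -> Delta Structured Streaming sketch for sensor telemetry.
# Brokers, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("machine_id", StringType()),
    StructField("sensor", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

telemetry = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("subscribe", "plant.sensor.telemetry")
             .load()
             .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Write to Delta with checkpointing so the stream recovers cleanly after failures.
(telemetry.writeStream.format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/telemetry")
 .outputMode("append")
 .start("/mnt/delta/bronze/sensor_telemetry"))
```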

• Migrated legacy Hadoop (HDFS/YARN) clusters to GCP Dataproc, integrating with Pub/Sub for streaming ingestion and BigQuery for scalable analytics, achieving 25% cost savings on infrastructure.

• Enhanced Delta Lake performance by enabling ACID transactions, schema evolution, and time-travel queries in Databricks, boosting real-time data reliability for operational dashboards.

• Developed Airflow DAGs to orchestrate Spark/Hadoop ETL pipelines, implementing checkpointing, retry mechanisms, and Slack alerts for SLA adherence and proactive error recovery.
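
A short Airflow DAG sketch with retries and a failure callback in this spirit; the task body and webhook endpoint are placeholder assumptions.

```python
# Sketch of an Airflow DAG with retries and an on-failure notification.
# The task body and webhook URL are hypothetical.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_failure(context):
    # Post a short message to a chat webhook when any task fails.
    requests.post(
        "https://example.com/slack-webhook",  # assumed endpoint
        json={"text": f"ETL task failed: {context['task_instance'].task_id}"},
        timeout=10,
    )

def run_spark_etl():
    # Placeholder for the spark-submit / Dataproc job trigger.
    print("submitting Spark ETL job")

with DAG(
    dag_id="plant_telemetry_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={
        "retries": 3,                        # automatic retry on transient failures
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_failure,
    },
) as dag:
    PythonOperator(task_id="run_spark_etl", python_callable=run_spark_etl)
```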

• Created partitioned and clustered BigQuery tables with materialized views and federated queries, improving cost efficiency and query speeds for plant operations analytics by 30%.

• Incorporated geospatial analytics using Python (GeoPandas, Shapely) for equipment hotspot detection, enabling proactive interventions and reducing equipment downtime.
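
An illustrative GeoPandas/Shapely hotspot check along these lines; the input file, grid size, and fault-count threshold are assumed for the example.

```python
# Illustrative hotspot detection: bin equipment fault events into grid cells
# and flag dense cells. Input file, CRS, and threshold are assumptions.
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

events = pd.read_csv("fault_events.csv")  # hypothetical extract with lon/lat columns
gdf = gpd.GeoDataFrame(
    events,
    geometry=[Point(xy) for xy in zip(events["lon"], events["lat"])],
    crs="EPSG:4326",
)

# Project to meters, then snap each event to a 50 m grid cell.
gdf = gdf.to_crs(epsg=3857)
gdf["cell_x"] = (gdf.geometry.x // 50).astype(int)
gdf["cell_y"] = (gdf.geometry.y // 50).astype(int)

hotspots = (gdf.groupby(["cell_x", "cell_y"]).size().reset_index(name="fault_count")
            .query("fault_count >= 20"))  # assumed threshold for a "hotspot"
print(hotspots.sort_values("fault_count", ascending=False).head())
```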

• Integrated Grafana and DataDog for cluster health monitoring, ETL job metrics, and custom SLA alerts, ensuring early detection of failures and resource optimization.

• Built streaming analytics dashboards by combining Databricks Delta Live Tables with BigQuery BI Engine, enabling near real-time tracking of factory KPIs for operations leadership.

• Leveraged Apache Beam within Dataproc to process complex IoT time-series transformations, improving downstream model readiness for predictive analytics teams.

• Implemented CI/CD automation with Jenkins and Terraform for Dataproc clusters, Airflow DAGs, and Databricks jobs, accelerating environment deployments by 40%.


