
Senior Data Engineer – Databricks & Azure Cloud Expert

Location: McKinney, TX

SIDDHARTHA KOTHI

+1-972-***-**** • **********.***@*****.***

PROFESSIONAL SUMMARY

• Senior Data Engineer with 10+ years of experience specializing in Databricks (PySpark & Scala) and the Azure cloud, supporting large-scale analytics across Azure and AWS.
• Strong background in translating business requirements into scalable data models, including OLTP, OLAP, and dimensional structures used for finance, operations, and regulatory reporting.
• Skilled in developing high-performance ETL/ELT pipelines using Azure Data Factory, Databricks, AWS Glue, and SQL-based transformation frameworks.
• Hands-on experience optimizing PySpark and SQL workloads, improving reliability, throughput, and compute efficiency across distributed systems.
• Experienced in cloud migrations involving Azure SQL, Synapse, Snowflake, and S3-based data lakes, with a focus on secure architecture, schema optimization, and resilient data flows.
• Adept at building metadata-driven ingestion frameworks, automating orchestration, and improving data quality through validation, drift detection, and lineage tracking (a configuration-driven ingestion loop is sketched after this summary).
• Working understanding of AI/ML data workflows, including preparing clean, feature-ready datasets, managing historical snapshots, and supporting downstream model training and scoring processes.
• Hands-on experience supporting ML use cases by building reliable data pipelines, validating feature inputs, and ensuring data consistency for analytics, risk, and forecasting models.
• Comfortable working across engineering, BI, analytics, and ML teams to deliver model-ready datasets, governed semantic layers, and Power BI dataflows that support interactive reporting at scale.
• Recognized for balancing technical depth with clear communication, and for delivering data solutions that are reliable, maintainable, and aligned with business priorities.
• Skilled in Azure Data Factory, ADLS, Synapse, Databricks (architectural design), Azure SQL, Purview, and metadata governance.
• Experienced at integrating complex source systems, including SAP ECC, into scalable Azure lakehouse and warehouse structures.
• Proven ability to define modeling standards, build data dictionaries, document lineage, and support mission-critical migrations in Agile/SAFe environments.
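
As context for the ingestion-framework bullet above, the following is a minimal, illustrative sketch of a configuration-driven loader in PySpark. The source list, paths, and table names are hypothetical placeholders, and the sketch assumes a Databricks/Delta Lake runtime rather than any specific production framework.

# Minimal sketch of a metadata-driven ingestion loop (PySpark).
# The config entries, paths, and table names below are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata_driven_ingestion").getOrCreate()

# In practice this configuration would live in a control table or ADF parameter set.
sources = [
    {"name": "customers", "format": "csv", "path": "/landing/customers/", "options": {"header": "true"}},
    {"name": "orders", "format": "json", "path": "/landing/orders/", "options": {}},
]

for src in sources:
    df = (spark.read.format(src["format"])
          .options(**src["options"])
          .load(src["path"]))
    # Land each feed as a Delta table in the bronze layer.
    (df.write.format("delta")
       .mode("append")
       .save(f"/bronze/{src['name']}"))

In practice the loop would also run per-source validation (row counts, schema checks) before the write, which is where the drift detection mentioned above fits in.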

TECHNICAL SKILLS

Azure Cloud & Data Engineering: Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics, Azure SQL Database, Azure Analysis Services, Azure Data Lake Storage Gen2, Azure Blob Storage, Azure Event Hubs, Azure Stream Analytics, Azure Service Bus, Azure Functions, Azure Integration Services, Azure Purview

AWS & Big Data: AWS EMR, AWS Glue, AWS Lambda, Amazon S3, Glue Data Catalog, IAM, Kafka, Hadoop (HDFS, Hive), Spark on EMR

ETL / ELT & Data Integration: ADF Pipelines, SSIS, AWS Glue ETL, Informatica PowerCenter concepts, IICS patterns, Batch & Streaming ETL, Change Data Capture (CDC), Data Acquisition Frameworks, REST APIs, Webhooks, File-based ingestion (CSV/JSON/XML), Real-time ingestion

Lakehouse & Storage: Delta Lake, Unity Catalog, Medallion Architecture (Bronze / Silver / Gold), Databricks Autoloader, Iceberg fundamentals

Programming & Data Processing: PySpark, Spark SQL, Scala, Python (Pandas, PySpark utilities), SQL (advanced tuning), C# (data integration), PowerShell

Data Modeling & Warehousing: Dimensional Modeling (Kimball), Star/Snowflake Schemas, Conceptual/Logical/Physical Data Models, OLTP/OLAP Modeling, Oracle Data Warehousing, Semantic Layer Design

Databases: Snowflake, SQL Server, Oracle, Azure SQL Database, Synapse Dedicated Pools, PostgreSQL, MySQL, Cosmos DB, MongoDB

Orchestration & CI/CD: Azure DevOps, Git, Jenkins, Airflow, DBT (basic), Automated Deployments, Build/Release Pipelines

Monitoring & Debugging: Azure Monitor, Application Insights, Log Analytics, Databricks Job Logs, Spark UI, CloudWatch, Cluster Tuning

Governance & Data Quality: RBAC, Unity Catalog governance, Azure AD Authentication, Azure Key Vault, Encryption (at rest/in transit), Data Masking, Metadata Management, Schema Drift Handling, Data Quality Frameworks

Big Data & Streaming: Spark (batch + streaming), Kafka, Structured Streaming, Hadoop, MapReduce, NoSQL ecosystems

BI & Reporting Tools: Power BI (DAX, Dataflows, Semantic Models), Tableau, SSRS, SSAS

Automation & Scheduling: Autosys, Airflow DAGs, ADF Triggers, Scheduled Pipelines

Tools & Platforms: Visual Studio, VS Code, JIRA, Confluence, MS Office, Postman, GitHub

PROFESSIONAL EXPERIENCE

JP Morgan Chase – Columbus, OH Sep 2023 – Present

Senior Data Engineer

• Designed and maintained scalable Azure Data Factory (ADF) pipelines ingesting data from SQL systems, APIs, and cloud applications into ADLS and Snowflake, increasing throughput and reducing execution time by ~40%.
• Built a structured Databricks Lakehouse (Bronze → Silver → Gold) improving data lineage, reconciliation accuracy, and auditability across finance, risk, and analytics domains.
• Developed distributed PySpark pipelines supporting daily and intraday fraud, payments, and liquidity analytics, while also preparing feature-ready datasets consumed by downstream analytics and ML workflows.
• Implemented near real-time ingestion using Event Hubs, Autoloader, and Delta Live Tables (DLT), reducing dashboard and model-input data lag from hours to minutes (an Auto Loader ingestion pattern is sketched below).
• Redesigned Snowflake loading patterns using Streams, Tasks, and clustering keys, improving refresh reliability, lowering compute usage, and stabilizing datasets used for reporting and model scoring.
• Integrated Unity Catalog to enforce RBAC, fine-grained permissions, table lineage, and centralized governance for PCI-regulated and analytics-critical datasets.
• Built automated data validation frameworks detecting schema drift, null anomalies, distribution mismatches, and late-arriving data, helping ensure consistent inputs for analytics and ML models.
• Developed monitoring dashboards using Azure Monitor, Log Analytics, and custom metrics, improving system reliability and reducing downtime by ~25%.
• Partnered with machine learning and analytics teams to translate feature definitions into Spark and SQL transformations, supporting training and inference datasets with consistent logic.
• Supported model-scoring pipelines by orchestrating batch and incremental data refreshes using Delta Lake patterns, ensuring point-in-time correctness and reproducibility (a minimal merge-based refresh is sketched after the environment summary below).
• Standardized CI/CD deployments for notebooks, SQL objects, and ADF pipelines using Azure DevOps, improving release consistency and reducing manual errors across analytics workflows.
• Worked with platform engineering to optimize cluster autoscaling, instance pools, and job-cluster configurations, reducing compute cost by ~20% for both analytics and ML-support workloads.
• Supported SOX, CCAR, and regulatory audits through lineage documentation, architecture diagrams, and system-flow artifacts covering analytics and model-input data paths.
• Collaborated with upstream system owners to remediate schema mismatches, retire unused data feeds, and enforce stronger data-quality SLAs impacting downstream analytics and ML use cases.
• Authored reusable ingestion templates, schema-mapping standards, and transformation frameworks that reduced onboarding time for new feeds from weeks to days.
• Crafted semantic Snowflake models representing positions, transactions, and balances, improving BI performance and analytical feature access.
• Led root-cause investigations for production incidents involving partition skew, corrupt payloads, data drift, and inconsistent upstream schedules impacting analytics outputs.
• Designed optimized SQL models for financial, risk, and analytics teams supporting complex analytical and exploratory workloads.
• Implemented encryption at rest and in transit using Key Vault, secure credential flows, and Snowflake masking policies for PII/PCI and model-related datasets.
• Built automated alerting for pipeline failures, SLA breaches, and data freshness issues affecting dashboards and ML scoring jobs using Azure Functions and Monitor.
• Documented runbooks, SLA guides, transformation logic, and lineage references enabling support teams and data scientists to troubleshoot data issues independently.
• Implemented checkpointing, idempotent writes, and ACID-compliant Delta Lake merges, ensuring reliable historical datasets for backtesting and trend analysis.
• Conducted internal Spark performance workshops, mentoring engineers on tuning, debugging, and distributed computing principles relevant to analytics and ML data pipelines.
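
The near real-time ingestion bullets above can be illustrated with a small Auto Loader sketch. It is a minimal example rather than the production pipeline: the landing path, checkpoint locations, and table name are placeholder assumptions, and the cloudFiles source requires a Databricks runtime.

# Sketch: near real-time ingestion with Databricks Auto Loader into a Bronze Delta table.
# Paths, checkpoint locations, and the table name are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Let Auto Loader track and evolve the schema as upstream feeds drift.
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/payments_schema")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/landing/payments/")
)

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/payments_bronze")
    .outputMode("append")
    .trigger(availableNow=True)   # or a short processingTime trigger for continuous micro-batches
    .toTable("bronze.payments"))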

Environment: Azure Data Factory, Azure Databricks (PySpark), ADLS Gen2, Snowflake, Azure SQL, Synapse, Event Hubs, Delta Lake, Unity Catalog, Azure Functions, Azure Monitor, Log Analytics, Power BI
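
The incremental-refresh and idempotent-write bullets in this role can be illustrated with a merge-based Delta Lake sketch. The table names, join key, and time-travel version below are hypothetical; the pattern simply shows an upsert that can be re-run safely, plus a point-in-time read for backtesting.

# Sketch: idempotent incremental upsert into a Delta table using MERGE.
# Table names and the join key are illustrative, not the production schema.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.table("silver.transactions_increment")   # hypothetical staged increment
target = DeltaTable.forName(spark, "gold.transactions")

(target.alias("t")
    .merge(updates.alias("u"), "t.txn_id = u.txn_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Point-in-time reads for backtesting can use Delta time travel (version number is illustrative):
as_of = spark.read.option("versionAsOf", 42).table("gold.transactions")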

Verizon – Ashburn, VA Aug 2019 – Sep 2023

Data Engineer

• Built large-scale Spark pipelines on AWS EMR to process batch and streaming datasets from operational systems, application logs, API integrations, and event streams.
• Migrated legacy Hadoop workloads to EMR and Amazon S3, reducing processing time, improving job reliability, and simplifying data lifecycle management.
• Designed ingestion workflows moving data from Hive → S3 → Snowflake, improving refresh SLAs while reducing overall maintenance overhead.
• Developed reusable Spark modules for partition pruning, file compaction, and join optimization, increasing computational efficiency and lowering EMR cluster costs.
• Implemented streaming pipelines using Kafka and Structured Streaming, providing real-time visibility into application-level events used for monitoring and analytics (see the streaming sketch at the end of this role).
• Built multi-source ingestion frameworks integrating REST APIs, Kafka topics, application logs, and internal systems, ensuring consistent schema handling and metadata capture.
• Designed Snowflake staging and dimensional models enabling faster analytical workloads and improving query responsiveness.
• Implemented incremental data processing using Snowflake Streams and Tasks, lowering the dependency on full refreshes and supporting near real-time dashboards.
• Applied strong governance using IAM, KMS encryption, RBAC, and secure data access patterns across S3, EMR, Glue Catalog, and Snowflake.
• Created automated validation frameworks for schema drift, late-arrival detection, referential integrity, and value distribution checks, reducing operational escalations.
• Orchestrated 40+ batch and streaming workflows using Apache Airflow, improving dependency management, SLA monitoring, and error-handling coverage (a minimal DAG layout is sketched after the environment summary below).
• Integrated AWS-native services such as Lambda, Glue Catalog, CloudWatch metrics, and EMRFS to automate ingestion tasks and enhance operational monitoring.
• Improved job stability by tuning EMR configurations including instance types, autoscaling policies, and cluster sizing strategies for large workloads.
• Participated in platform optimization initiatives, evaluating EMR runtime configurations, storage formats (Parquet), and partitioning strategies to reduce compute cost.
• Developed internal dashboards and reporting utilities for pipeline health, data latency, and ingestion trends, supporting leadership visibility and quicker debugging cycles.
• Ensured data quality and consistency across pipeline stages by implementing profiling rules, timestamp checks, null-rate controls, and anomaly detection logic.
• Collaborated with engineering, analytics, and operations teams to onboard new data feeds, align on schema standards, and document end-to-end data flows.
• Mentored junior engineers on Spark debugging, efficient AWS data patterns, Airflow DAG design, and overall big-data engineering workflows.
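
As a simple illustration of the Kafka plus Structured Streaming work described above, the sketch below reads JSON events from a topic and lands them as Parquet on S3. Broker addresses, the topic name, schema fields, and paths are placeholder assumptions, and the job would need the spark-sql-kafka connector available.

# Sketch: reading application events from Kafka with Structured Streaming and writing Parquet to S3.
# Broker addresses, topic names, schema fields, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events_stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "app-events")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers bytes; parse the JSON payload into typed columns.
events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

(events.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/events/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start())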

Environment: AWS EMR, Amazon S3, AWS Lambda, AWS Glue Data Catalog, AWS CloudWatch, Kafka, Hadoop (HDFS/Hive), Apache Spark (Scala/PySpark), Apache Airflow, Snowflake, Python, SQL, Linux
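
A minimal Airflow sketch of the orchestration pattern referenced above, assuming Airflow 2.x. The DAG id, schedule, and task callables are illustrative only; real tasks would submit EMR/Spark jobs and validation checks rather than print statements.

# Sketch: a small Airflow DAG wiring an ingestion step ahead of a validation step.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_feed(**context):
    # Placeholder for the Spark/EMR submission logic.
    print("submitting ingestion job")

def validate_feed(**context):
    # Placeholder for schema and freshness checks.
    print("running validation checks")

with DAG(
    dag_id="example_feed_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_feed", python_callable=ingest_feed)
    validate = PythonOperator(task_id="validate_feed", python_callable=validate_feed)
    ingest >> validate   # validation only runs after ingestion succeeds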

Optum (UnitedHealth Group) – Eden Prairie, MN Jan 2018 – Aug 2019

Data Engineer

• Designed and developed ETL pipelines for claims, membership, and provider datasets using ADF, Azure Functions, and Python, reducing manual ingestion efforts by 50%.
• Built the foundational Confidential Claims Data Cube using Spark and Scala, improving actuarial reporting performance and claims analytics turnaround time by 40%.
• Engineered ingestion workflows connecting REST APIs, SQL Server, flat files, and EDI/JSON healthcare feeds into Azure Data Lake Gen2, establishing a unified clinical and claims repository.
• Implemented real-time ingestion patterns using Event Hubs → ADF → Databricks, enabling sub-hour visibility into provider events and care-coordination metrics.
• Optimized Spark jobs by tuning partitions, caching logic, and join strategies, reducing cluster execution cost and improving throughput for payer workloads.
• Designed Snowflake staging and curated layers for underwriting, risk adjustment, and care-management analytics, improving model-readiness of healthcare datasets.
• Implemented incremental data loading using Snowflake Streams & Tasks, eliminating full loads and improving refresh frequency for claims and adjudication tables.
• Used Snowflake VARIANT columns to flatten complex EDI/JSON claim structures into analytical tables, accelerating diagnosis grouping, DRG calculations, and risk-scoring workflows (an analogous PySpark flattening pattern is sketched at the end of this role).
• Developed secure access patterns in Snowflake using RBAC, masking policies, and PHI-compliant access tiers, ensuring HIPAA alignment across engineering teams.
• Created reusable ETL components in Databricks for data cleansing, code-set validation (ICD, DRG, CPT), referential checks, and encounter normalization, improving accuracy across downstream models.
• Built automated quality layers detecting schema drift, null-rate anomalies, duplicate claim submissions, and late-arriving encounters, improving data reliability by 30% (basic duplicate and null-rate checks are sketched after the environment summary below).
• Developed Python scripts to ingest, transform, and validate financial, eligibility, and provider datasets, supporting Optum’s enterprise risk and quality reporting.
• Supported actuarial, clinical, and operations teams with curated datasets for risk adjustment, care-gap identification, utilization analysis, and cost-of-care dashboards.
• Implemented secure credential management using Key Vault, MSI, and parameterized pipelines, improving security posture across ETL processes.
• Built healthcare-specific transformation logic for claim lineage, billing event sequencing, member attribution, and diagnosis rollups, ensuring dataset integrity for BI and modeling.
• Collaborated with BI teams to publish curated datasets into reporting platforms, improving accessibility for care-management leadership and clinical operations.
• Authored detailed documentation covering healthcare data flows, validation rules, schema definitions, and SLAs, improving cross-team adoption of data assets.
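
The VARIANT-based flattening described above has a close PySpark analogue, sketched below for nested JSON claim documents. The input path and column names (claim_id, member_id, diagnoses) are hypothetical and stand in for the actual EDI/JSON structures.

# Sketch: flattening nested JSON claim payloads in PySpark (analogous to Snowflake VARIANT + FLATTEN).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

claims = spark.read.json("/mnt/landing/claims/")   # nested claim documents (path is illustrative)

# One row per diagnosis code, carrying the claim-level identifiers alongside it.
diagnosis_lines = (claims
    .select(
        col("claim_id"),
        col("member_id"),
        explode(col("diagnoses")).alias("dx"))
    .select("claim_id", "member_id",
            col("dx.code").alias("icd_code"),
            col("dx.sequence").alias("dx_seq")))

diagnosis_lines.write.format("delta").mode("overwrite").save("/mnt/silver/claim_diagnoses")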

Environment: Azure Data Factory, Azure Data Lake Gen2, Azure Databricks (Scala/PySpark), Azure SQL, Snowflake, Delta Lake, Event Hubs, Azure Functions, Application Insights, Azure Monitor, Azure Key Vault, Autosys, Python, SQL.
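
A simplified sketch of the duplicate-claim and null-rate checks mentioned in this role. Table names, key columns, and the 2% threshold are illustrative assumptions; a production framework would log metrics and raise alerts instead of printing.

# Sketch: simple data-quality checks for duplicate claim submissions and null-rate anomalies.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
claims = spark.table("silver.claims")   # hypothetical curated claims table

# Duplicate submissions: more than one row per claim_id + adjudication version.
duplicates = (claims.groupBy("claim_id", "adjudication_version")
                    .count()
                    .filter(F.col("count") > 1))

# Null-rate check on a critical column; flag if more than 2% of rows are null.
total = claims.count()
null_member = claims.filter(F.col("member_id").isNull()).count()
null_rate = null_member / total if total else 0.0

if duplicates.count() > 0 or null_rate > 0.02:
    # Placeholder for the real alerting/failure path.
    print(f"DQ check failed: duplicate groups={duplicates.count()}, member_id null rate={null_rate:.2%}")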

Standalone IT Solutions – Hyderabad, India Oct 2015 – Dec 2017

Data Analyst

• Analyzed multi-source product, customer, and operational datasets using SQL and Python, delivering insights that shaped product decisions and reporting workflows.
• Built interactive dashboards using React, Highcharts, and D3.js visualizing KPIs, funnel metrics, user cohorts, and operational patterns for product and business teams.
• Designed SQL-based transformations, stored procedures, and analytical datasets, improving reporting accuracy and reducing manual data preparation time.
• Automated core data cleansing, deduplication, and standardization routines using Python, improving data reliability across dashboard pipelines.
• Performed exploratory data analysis (EDA) to uncover churn indicators, activation drivers, retention patterns, and behavioral segmentation trends.
• Managed A/B testing datasets, ran statistical comparisons, and delivered experiment insights that improved onboarding flows and feature adoption (a two-sample comparison is sketched after the environment summary below).
• Developed reusable ingestion utilities for JSON, CSV, and XML data, reducing onboarding time for new data sources by 50%.
• Built automated quality checks (missing-data scans, schema mismatch alerts, threshold monitors), increasing trust in BI assets and reporting accuracy.
• Created automated daily and weekly reporting workflows, cutting manual reporting effort by 60% across the analytics team.
• Designed and maintained KPI documentation, metric dictionaries, and business rules, ensuring alignment across product, engineering, and analytics teams.
• Built anomaly detection alerts using Python, allowing teams to detect data shifts or application issues early (a rolling z-score alert is sketched below).
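
A minimal sketch of the rolling z-score style of anomaly alert referenced in the last bullet, using pandas. The metric name, window, and threshold are placeholder choices rather than the values used in production.

# Sketch: flag days whose metric deviates sharply from its recent rolling baseline.
import pandas as pd

def detect_anomalies(daily: pd.DataFrame, metric: str = "events",
                     window: int = 14, z_thresh: float = 3.0) -> pd.DataFrame:
    """Return the rows whose metric is more than z_thresh standard deviations from the rolling mean."""
    rolling = daily[metric].rolling(window, min_periods=window)
    z = (daily[metric] - rolling.mean()) / rolling.std()
    flagged = daily.assign(zscore=z, is_anomaly=z.abs() > z_thresh)
    return flagged[flagged["is_anomaly"]]

# Example usage with a synthetic daily series:
# df = pd.DataFrame({"date": pd.date_range("2017-01-01", periods=60), "events": range(60)})
# alerts = detect_anomalies(df)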

Environment: SQL, Python, Pandas, MySQL, MongoDB, Node.js, JavaScript, React, Highcharts, D3.js, REST APIs, Git
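
For the A/B testing bullet above, a hedged sketch of a two-sample comparison using SciPy. The column names and the 0.05 significance level are assumptions for illustration, not the actual experiment setup.

# Sketch: Welch's t-test between control (A) and treatment (B) on an experiment metric.
import pandas as pd
from scipy import stats

def compare_variants(df: pd.DataFrame, metric: str = "converted", group_col: str = "variant") -> dict:
    """Compare the metric between variants A and B and report significance at the 5% level."""
    a = df.loc[df[group_col] == "A", metric]
    b = df.loc[df[group_col] == "B", metric]
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    return {"t_stat": t_stat, "p_value": p_value, "significant": p_value < 0.05}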

EDUCATION

Bachelor of Technology in Computer Science

Kakatiya University, Warangal, Telangana, India


