SIDDHARTHA KOTHI
+1-972-***-**** • **********.***@*****.***
PROFESSIONAL SUMMARY
• Senior Data Engineer with 10+ years of experience specializing in Databricks (PySpark & Scala) and cloud data platforms, supporting large-scale analytics across Azure and AWS.
• Strong background in translating business requirements into scalable data models, including OLTP, OLAP, and
dimensional structures used for finance, operations, and regulatory reporting.
• Skilled in developing high-performance ETL/ELT pipelines using Azure Data Factory, Databricks, AWS Glue, and
SQL-based transformation frameworks.
• Hands-on experience optimizing PySpark and SQL workloads, improving reliability, throughput, and compute
efficiency across distributed systems.
• Experienced in cloud migrations involving Azure SQL, Synapse, Snowflake, and S3-based data lakes, with a focus
on secure architecture, schema optimization, and resilient data flows.
• Adept at building metadata-driven ingestion frameworks, automating orchestration, and improving data
quality through validation, drift detection, and lineage tracking.
• Working understanding of AI/ML data workflows, including preparing clean, feature-ready datasets, managing
historical snapshots, and supporting downstream model training and scoring processes.
• Hands-on experience supporting ML use cases by building reliable data pipelines, validating feature inputs,
and ensuring data consistency for analytics, risk, and forecasting models.
• Comfortable working across engineering, BI, analytics, and ML teams to deliver model-ready datasets,
governed semantic layers, and Power BI dataflows that support interactive reporting at scale.
• Recognized for balancing technical depth with clear communication, and for delivering data solutions that are
reliable, maintainable, and aligned with business priorities.
• Skilled in Azure Data Factory, ADLS, Synapse, Databricks (architectural design), Azure SQL, Purview, and
metadata governance.
• Experienced at integrating complex source systems, including SAP ECC, into scalable Azure lakehouse and
warehouse structures.
• Proven ability to define modeling standards, build data dictionaries, document lineage, and support mission-critical migrations in Agile/SAFe environments.
TECHNICAL SKILLS
Azure Cloud & Data Engineering: Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics, Azure SQL Database, Azure Analysis Services, Azure Data Lake Storage Gen2, Azure Blob Storage, Azure Event Hubs, Azure Stream Analytics, Azure Service Bus, Azure Functions, Azure Integration Services, Azure Purview
AWS & Big Data: AWS EMR, AWS Glue, AWS Lambda, Amazon S3, Glue Data Catalog, IAM, Kafka, Hadoop (HDFS, Hive), Spark on EMR
ETL / ELT & Data Integration: ADF Pipelines, SSIS, AWS Glue ETL, Informatica PowerCenter concepts, IICS patterns, Batch & Streaming ETL, Change Data Capture (CDC), Data Acquisition Frameworks, REST APIs, Webhooks, File-based ingestion (CSV/JSON/XML), Real-time ingestion
Lakehouse & Storage: Delta Lake, Unity Catalog, Medallion Architecture (Bronze / Silver / Gold), Databricks Autoloader, Iceberg fundamentals
Programming & Data Processing: PySpark, Spark SQL, Scala, Python (Pandas, PySpark utilities), SQL (advanced tuning), C# (data integration), PowerShell
Data Modeling & Warehousing: Dimensional Modeling (Kimball), Star/Snowflake Schemas, Conceptual/Logical/Physical Data Models, OLTP/OLAP Modeling, Oracle Data Warehousing, Semantic Layer Design
Databases: Snowflake, SQL Server, Oracle, Azure SQL Database, Synapse Dedicated Pools, PostgreSQL, MySQL, Cosmos DB, MongoDB
Orchestration & CI/CD: Azure DevOps, Git, Jenkins, Airflow, DBT (basic), Automated Deployments, Build/Release Pipelines
Monitoring & Debugging: Azure Monitor, Application Insights, Log Analytics, Databricks Job Logs, Spark UI, CloudWatch, Cluster Tuning
Governance & Data Quality: RBAC, Unity Catalog governance, Azure AD Authentication, Azure Key Vault, Encryption (at rest/in transit), Data Masking, Metadata Management, Schema Drift Handling, Data Quality Frameworks
Big Data & Streaming: Spark (batch + streaming), Kafka, Structured Streaming, Hadoop, MapReduce, NoSQL ecosystems
BI & Reporting Tools: Power BI (DAX, Dataflows, Semantic Models), Tableau, SSRS, SSAS
Automation & Scheduling: Autosys, Airflow DAGs, ADF Triggers, Scheduled Pipelines
Tools & Platforms: Visual Studio, VS Code, JIRA, Confluence, MS Office, Postman, GitHub
PROFESSIONAL EXPERIENCE
JP Morgan Chase – Columbus, OH Sep 2023 – Present
Senior Data Engineer
• Designed and maintained scalable Azure Data Factory (ADF) pipelines ingesting data from SQL systems, APIs,
and cloud applications into ADLS and Snowflake, increasing throughput and reducing execution time by ~40%.
• Built a structured Databricks Lakehouse (Bronze → Silver → Gold) improving data lineage, reconciliation accuracy, and auditability across finance, risk, and analytics domains.
• Developed distributed PySpark pipelines supporting daily and intraday fraud, payments, and liquidity analytics,
while also preparing feature-ready datasets consumed by downstream analytics and ML workflows.
• Implemented near real-time ingestion using Event Hubs, Autoloader, and Delta Live Tables (DLT), reducing
dashboard and model-input data lag from hours to minutes.
• Redesigned Snowflake loading patterns using Streams, Tasks, and clustering keys, improving refresh reliability,
lowering compute usage, and stabilizing datasets used for reporting and model scoring.
• Integrated Unity Catalog to enforce RBAC, fine-grained permissions, table lineage, and centralized governance
for PCI-regulated and analytics-critical datasets.
• Built automated data validation frameworks detecting schema drift, null anomalies, distribution mismatches,
and late-arriving data, helping ensure consistent inputs for analytics and ML models.
• Developed monitoring dashboards using Azure Monitor, Log Analytics, and custom metrics, improving system
reliability and reducing downtime by ~25%.
• Partnered with machine learning and analytics teams to translate feature definitions into Spark and SQL
transformations, supporting training and inference datasets with consistent logic.
• Supported model-scoring pipelines by orchestrating batch and incremental data refreshes using Delta Lake
patterns, ensuring point-in-time correctness and reproducibility.
• Standardized CI/CD deployments for notebooks, SQL objects, and ADF pipelines using Azure DevOps,
improving release consistency and reducing manual errors across analytics workflows.
• Worked with platform engineering to optimize cluster autoscaling, instance pools, and job-cluster
configurations, reducing compute cost by ~20% for both analytics and ML-support workloads.
• Supported SOX, CCAR, and regulatory audits through lineage documentation, architecture diagrams, and
system-flow artifacts covering analytics and model-input data paths.
• Collaborated with upstream system owners to remediate schema mismatches, retire unused data feeds, and
enforce stronger data-quality SLAs impacting downstream analytics and ML use cases.
• Authored reusable ingestion templates, schema-mapping standards, and transformation frameworks that
reduced onboarding time for new feeds from weeks to days.
• Crafted semantic Snowflake models representing positions, transactions, and balances, improving BI
performance and analytical feature access.
• Led root-cause investigations for production incidents involving partition skew, corrupt payloads, data drift,
and inconsistent upstream schedules impacting analytics outputs.
• Designed optimized SQL models for financial, risk, and analytics teams supporting complex analytical and
exploratory workloads.
• Implemented encryption-at-rest and in-transit using Key Vault, secure credential flows, and Snowflake
masking policies for PII/PCI and model-related datasets.
• Built automated alerting for pipeline failures, SLA breaches, and data freshness issues affecting dashboards
and ML scoring jobs using Azure Functions and Monitor.
• Documented runbooks, SLA guides, transformation logic, and lineage references enabling support teams and
data scientists to troubleshoot data issues independently.
• Implemented checkpointing, idempotent writes, and ACID-compliant Delta Lake merges (a minimal sketch follows these bullets), ensuring reliable historical datasets for backtesting and trend analysis.
• Conducted internal Spark performance workshops, mentoring engineers on tuning, debugging, and
distributed computing principles relevant to analytics and ML data pipelines.
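A minimal PySpark sketch of the idempotent, ACID-compliant Delta Lake merge pattern referenced above; the paths, table layout, and key columns (transaction_id, as_of_date) are illustrative assumptions, not the production code.

# Minimal sketch of an idempotent Delta Lake merge; paths and columns are assumptions.
# Assumes a Databricks runtime or a local delta-spark installation.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-idempotent-merge")
    # Delta configs are preset on Databricks; needed only for a local delta-spark setup.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Incremental batch landed in the Bronze layer (hypothetical path and schema).
updates = spark.read.format("parquet").load("/mnt/bronze/transactions_batch")

target = DeltaTable.forPath(spark, "/mnt/silver/transactions")  # hypothetical Silver table

# Merging on the business key plus effective date makes reruns idempotent:
# replaying the same batch updates existing rows instead of appending duplicates.
(
    target.alias("t")
    .merge(
        updates.alias("s"),
        "t.transaction_id = s.transaction_id AND t.as_of_date = s.as_of_date",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)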
Environment: Azure Data Factory, Azure Databricks (PySpark), ADLS Gen2, Snowflake, Azure SQL, Synapse, Event Hubs,
Delta Lake, Unity Catalog, Azure Functions, Azure Monitor, Log Analytics, Power BI
Verizon – Ashburn, VA Aug 2019 – Sep 2023
Data Engineer
• Built large-scale Spark pipelines on AWS EMR to process batch and streaming datasets from operational
systems, application logs, API integrations, and event streams.
• Migrated legacy Hadoop workloads to EMR and Amazon S3, reducing processing time, improving job reliability,
and simplifying data lifecycle management.
• Designed ingestion workflows moving data from Hive → S3 → Snowflake, improving refresh SLAs while reducing overall maintenance overhead.
• Developed reusable Spark modules for partition pruning, file compaction, and join optimization, increasing
computational efficiency and lowering EMR cluster costs.
• Implemented streaming pipelines using Kafka and Structured Streaming (see the sketch after these bullets), providing real-time visibility into application-level events used for monitoring and analytics.
• Built multi-source ingestion frameworks integrating REST APIs, Kafka topics, application logs, and internal
systems, ensuring consistent schema handling and metadata capture.
• Designed Snowflake staging and dimensional models enabling faster analytical workloads and improving
query responsiveness.
• Implemented incremental data processing using Snowflake Streams and Tasks, lowering the dependency on
full refreshes and supporting near real-time dashboards.
• Applied strong governance using IAM, KMS encryption, RBAC, and secure data access patterns across S3, EMR,
Glue Catalog, and Snowflake.
• Created automated validation frameworks for schema drift, late-arrival detection, referential integrity, and
value distribution checks, reducing operational escalations.
• Orchestrated 40+ batch and streaming workflows using Apache Airflow, improving dependency management,
SLA monitoring, and error-handling coverage.
• Integrated AWS-native services such as Lambda, Glue Catalog, CloudWatch metrics, and EMRFS to automate
ingestion tasks and enhance operational monitoring.
• Improved job stability by tuning EMR configurations including instance types, autoscaling policies, and cluster
sizing strategies for large workloads.
• Participated in platform optimization initiatives, evaluating EMR runtime configurations, storage formats
(Parquet), and partitioning strategies to reduce compute cost.
• Developed internal dashboards and reporting utilities for pipeline health, data latency, and ingestion trends,
supporting leadership visibility and quicker debugging cycles.
• Ensured data quality and consistency across pipeline stages by implementing profiling rules, timestamp checks,
null-rate controls, and anomaly detection logic.
• Collaborated with engineering, analytics, and operations teams to onboard new data feeds, align on schema
standards, and document end-to-end data flows.
• Mentored junior engineers on Spark debugging, efficient AWS data patterns, Airflow DAG design, and overall
big-data engineering workflows.
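A minimal PySpark sketch of the Kafka plus Structured Streaming pattern described above; the broker address, topic name, event schema, and S3 paths are illustrative assumptions, and the job assumes the spark-sql-kafka connector is on the cluster classpath.

# Minimal Kafka -> Structured Streaming sketch; broker, topic, schema, and sink are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-event-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "app-events")                 # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into typed columns.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Write partitioned Parquet with a checkpoint so the stream restarts safely after failures.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-bucket/events/")                    # hypothetical sink
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .partitionBy("event_type")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()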
Environment: AWS EMR, Amazon S3, AWS Lambda, AWS Glue Data Catalog, AWS CloudWatch, Kafka, Hadoop
(HDFS/Hive), Apache Spark (Scala/PySpark), Apache Airflow, Snowflake, Python, SQL, Linux
Optum (UnitedHealth Group) – Eden Prairie, MN Jan 2018 – Aug 2019
Data Engineer
• Designed and developed ETL pipelines for claims, membership, and provider datasets using ADF, Azure
Functions, and Python, reducing manual ingestion efforts by 50%.
• Built the foundational Confidential Claims Data Cube using Spark and Scala, improving actuarial reporting
performance and claims analytics turnaround time by 40%.
• Engineered ingestion workflows connecting REST APIs, SQL Server, flat files, and EDI/JSON healthcare feeds
into Azure Data Lake Gen2, establishing a unified clinical and claims repository.
• Implemented real-time ingestion patterns using Event Hubs → ADF → Databricks, enabling sub-hour visibility into provider events and care-coordination metrics.
• Optimized Spark jobs by tuning partitions, caching logic, and join strategies, reducing cluster execution cost
and improving throughput for payer workloads.
• Designed Snowflake staging and curated layers for underwriting, risk adjustment, and care-management
analytics, improving model-readiness of healthcare datasets.
• Implemented incremental data loading using Snowflake Streams & Tasks, eliminating full loads and improving
refresh frequency for claims and adjudication tables.
• Used Snowflake VARIANT columns to flatten complex EDI/JSON claim structures into analytical tables, accelerating diagnosis grouping, DRG calculations, and risk-scoring workflows.
• Developed secure access patterns in Snowflake using RBAC, masking policies, and PHI-compliant access tiers,
ensuring HIPAA alignment across engineering teams.
• Created reusable ETL components in Databricks for data cleansing, code-set validation (ICD, DRG, CPT), referential checks, and encounter normalization, improving accuracy across downstream models.
• Built automated quality layers detecting schema drift, null-rate anomalies, duplicate claim submissions, and late-arriving encounters (sketched after these bullets), improving data reliability by 30%.
• Developed Python scripts to ingest, transform, and validate financial, eligibility, and provider datasets,
supporting Optum’s enterprise risk and quality reporting.
• Supported actuarial, clinical, and operations teams with curated datasets for risk adjustment, care-gap
identification, utilization analysis, and cost-of-care dashboards.
• Implemented secure credential management using Key Vault, MSI, and parameterized pipelines, improving
security posture across ETL processes.
• Built healthcare-specific transformation logic for claim lineage, billing event sequencing, member attribution,
and diagnosis rollups, ensuring dataset integrity for BI and modeling.
• Collaborated with BI teams to publish curated datasets into reporting platforms, improving accessibility for
care-management leadership and clinical operations.
• Authored detailed documentation covering healthcare data flows, validation rules, schema definitions, and
SLAs, improving cross-team adoption of data assets.
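A minimal PySpark sketch of the automated quality checks described above (schema drift, null-rate anomalies, duplicate claims); the claims path, expected column set, and thresholds are illustrative assumptions.

# Minimal sketch of automated claim-quality checks; path, columns, and thresholds are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-quality-checks").getOrCreate()

EXPECTED_COLUMNS = {"claim_id", "member_id", "service_date", "billed_amount"}
NULL_RATE_THRESHOLD = 0.02  # flag columns with more than 2% nulls

claims = spark.read.format("delta").load("/mnt/bronze/claims")  # hypothetical path
issues = []

# 1. Schema drift: columns added or removed relative to the expected contract.
actual_columns = set(claims.columns)
if actual_columns != EXPECTED_COLUMNS:
    issues.append(f"schema drift: missing={EXPECTED_COLUMNS - actual_columns}, "
                  f"unexpected={actual_columns - EXPECTED_COLUMNS}")

# 2. Null-rate anomalies on required columns.
total = claims.count()
for col_name in sorted(EXPECTED_COLUMNS & actual_columns):
    nulls = claims.filter(F.col(col_name).isNull()).count()
    if total and nulls / total > NULL_RATE_THRESHOLD:
        issues.append(f"null-rate breach: {col_name} = {nulls / total:.2%}")

# 3. Duplicate claim submissions on the business key.
if "claim_id" in actual_columns:
    dupes = claims.groupBy("claim_id").count().filter(F.col("count") > 1).count()
    if dupes:
        issues.append(f"duplicate claims: {dupes} claim_ids appear more than once")

# In a real pipeline these results would feed alerting and quarantine logic; the sketch prints them.
for issue in issues:
    print("DATA QUALITY:", issue)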
Environment: Azure Data Factory, Azure Data Lake Gen2, Azure Databricks (Scala/PySpark), Azure SQL, Snowflake,
Delta Lake, Event Hubs, Azure Functions, Application Insights, Azure Monitor, Azure Key Vault, Autosys, Python, SQL.
Standalone IT Solutions – Hyderabad, India Oct 2015 – Dec 2017
Data Analyst
• Analyzed multi-source product, customer, and operational datasets using SQL and Python, delivering insights
that shaped product decisions and reporting workflows.
• Built interactive dashboards using React, Highcharts, and D3.js visualizing KPIs, funnel metrics, user cohorts,
and operational patterns for product and business teams.
• Designed SQL-based transformations, stored procedures, and analytical datasets improving reporting accuracy
and reducing manual data preparation time.
• Automated core data cleansing, deduplication, and standardization routines using Python, improving data
reliability across dashboard pipelines.
• Performed exploratory data analysis (EDA) to uncover churn indicators, activation drivers, retention patterns,
and behavioral segmentation trends.
• Managed A/B testing datasets, ran statistical comparisons, and delivered experiment insights that improved
onboarding flows and feature adoption.
• Developed reusable ingestion utilities for JSON, CSV, and XML data, reducing onboarding time for new data sources by 50%.
• Built automated quality checks (missing data scans, schema mismatch alerts, threshold monitors), increasing
trust in BI assets and reporting accuracy.
• Created automated daily and weekly reporting workflows, cutting manual reporting effort by 60% across the
analytics team.
• Designed and maintained KPI documentation, metric dictionaries, and business rules, ensuring alignment
across product, engineering, and analytics teams.
• Built anomaly detection alerts using Python (sketched below), allowing teams to detect data shifts or application issues early.
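A minimal Pandas sketch of the kind of threshold-based anomaly alerting described above; the metric, rolling window, and z-score cutoff are illustrative assumptions.

# Minimal anomaly-alert sketch; the metric, window, and cutoff are assumptions.
import numpy as np
import pandas as pd

Z_CUTOFF = 3.0  # flag points more than 3 standard deviations from the recent baseline

def detect_anomalies(df: pd.DataFrame, value_col: str, window: int = 7) -> pd.DataFrame:
    """Return rows whose metric deviates sharply from the prior rolling baseline."""
    baseline = df[value_col].shift(1).rolling(window=window, min_periods=window)
    z = (df[value_col] - baseline.mean()) / baseline.std()
    return df[z.abs() > Z_CUTOFF]

if __name__ == "__main__":
    # Hypothetical daily row counts for a dashboard feed, with one simulated bad load.
    rng = np.random.default_rng(42)
    counts = rng.normal(10_000, 150, size=30).round()
    counts[25] = 2_500  # simulated partial load
    daily = pd.DataFrame({"date": pd.date_range("2017-01-01", periods=30, freq="D"),
                          "row_count": counts})
    for _, row in detect_anomalies(daily, "row_count").iterrows():
        print(f"ALERT: {row['date'].date()} row_count={int(row['row_count'])}")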
Environment: SQL, Python, Pandas, MySQL, MongoDB, Node.js, JavaScript, React, Highcharts, D3.js, REST APIs, Git
EDUCATION
Bachelor of Technology in Computer Science
Kakatiya University, Warangal, Telangana, India