Data Engineer

Location:

Jersey City, NJ

Posted:

September 10, 2025

Resume:

Jajwalya Chukka

Data Engineer

Email: **************@*****.***

Ph. No: 201-***-****

PROFESSIONAL SUMMARY:

Built and ran Databricks (PySpark/SQL) ETL jobs that process 10M+ records per day.

Organized data in Delta Lake (bronze/silver/gold) with partitioning and file layout chosen to speed up queries.

Built streaming pipelines with Spark Structured Streaming and Kafka / Azure Event Hubs for near-real-time data.

Scheduled and monitored dataflows using Apache Airflow (DAGs, retries, SLAs), Azure Data Factory, and AWS Step Functions.

Created data quality checks in Python/SQL (nulls, duplicates, ranges) and sent alerts through CloudWatch / Azure Monitor.

Built curated tables and views in Snowflake, Azure Synapse, and AWS Redshift for reporting and ad-hoc analysis.

Used Snowflake warehouses efficiently (right sizing, auto-suspend/resume) and simple tasks/streams for ELT.

Improved Spark performance through appropriate join strategies, partitioning, caching, and autoscaling, which shortened runtimes and lowered cost.

Handled schema changes safely and used Delta time-travel for stable historical queries.

Wrote reusable, parameterized notebooks/jobs to support multiple teams with one code path.

Added error handling and safe retries so pipelines recover cleanly from failures and backfills.

Prepared clean, well-modeled data for Power BI and Tableau; supported incremental refresh where needed.

Managed governance and access with Databricks Unity Catalog and Azure Purview (RBAC, lineage, tags).

Ingested data from SQL Server / Oracle / PostgreSQL via ADF, JDBC in Spark, and AWS Glue; used Sqoop for older Hadoop paths.

Published job health and freshness dashboards; kept SLAs visible to stakeholders.

For banking, built feeds for fraud/AML/risk using Spark + Kafka and Cassandra/Synapse (dedupe, late data handling, reconciliations).

For healthcare, processed claims and HEDIS datasets with consistent validation for reporting.

Migrated older MapReduce/Hive workloads to Spark on Azure Databricks / AWS EMR to simplify and speed up processing.

Secured credentials with Key Vault / Secrets Manager; used IAM roles and table-level policies for data access.

Supported data science by delivering feature datasets and tracking runs with MLflow.

Applied basic ML where appropriate: Linear/Logistic Regression, Decision Trees, Random Forests, XGBoost/LightGBM, K-Means, and PCA, with clear train/validate/test steps.

Exposed lake data using Synapse serverless views and Redshift external tables to cut costs when full warehouses weren’t needed.

Shipped changes through CI/CD with Azure DevOps / Jenkins / Git; used config-driven deploys across Dev, UAT, and Prod.

Built and maintained infra with Terraform (and ARM on Azure) for storage, compute, networking, and identities.

Reduced end-to-end latency by ~40% and query times by ~65% through storage/layout and job tuning, which also lowered compute costs.

Wrote runbooks/playbooks for incidents, reruns, and backfills; kept data contracts and SLOs clear.

Handled ad-hoc SQL/PySpark analyses and compliance extracts without disrupting regular schedules.

Added simple reconciliation tools (row counts, hash checks, sample diffs) to verify data loads.

Deployed selected data tasks on GKE (GCP) where container-based scaling helped.

Worked in Agile teams, did code reviews, and documented rules and lineage so others can support the pipelines easily.

TECHNICAL SKILLS:

Languages

Python, SQL, Scala, Java, Bash/Shell scripting

ML / AI / NLP

Spark MLlib, XGBoost, MLflow, Pandas/NumPy (for feature engineering & EDA)

Big Data & ETL

Spark (PySpark, SparkSQL), Kafka, Delta Lake, Hive, Impala, HDFS, Sqoop, Airflow, AWS Glue, Databricks, Azure Data Factory (ADF), Synapse

Cloud & MLOps

AWS (S3, Lambda, Glue, Redshift, Step Functions, CloudWatch, EMR), Azure (ADLS, ADF, Synapse, Databricks, Purview), GCP (GKE), Terraform, Jenkins, Azure DevOps (CI/CD)

Databases

Redshift, PostgreSQL, SQL Server, Oracle, HBase, Cassandra

Visualization & BI

Power BI, Tableau, Hive/Impala ad hoc queries, Matplotlib, Seaborn

WORK EXPERIENCE:

Client: John Deere, Moline, IL Dec 2022 - Present

Role: Data Engineer

Built and maintained enterprise-scale ETL pipelines in Databricks (PySpark/SQL) to process millions of daily records across procurement, logistics, and operations.

Designed and optimized Delta Lake architectures (bronze/silver/gold layers) with partitioning, Z-Ordering, and caching, improving query efficiency and reducing compute costs.

Developed streaming pipelines using Spark Structured Streaming and Kafka for near real-time ingestion and analytics.

Orchestrated end-to-end jobs with Apache Airflow (DAGs, sensors, retries, SLAs) alongside Databricks Job Workflows for reliable scheduling and recovery.

Automated ETL workflows in Databricks to process 10M+ records/day, reducing latency by ~40% for business-critical datasets.

Integrated AWS services including S3 (data lake), Glue (metadata catalog), Redshift (analytics warehouse), Lambda (automation), and Step Functions (workflow orchestration).

Implemented a Snowflake warehouse layer by creating external stages to S3, configuring Snowpipe for continuous loads, and using tasks and streams for ELT and incremental updates.

Prepared curated datasets in Databricks and Snowflake for Power BI dashboards, enabling stakeholders to access live analytics and reports.

Developed and maintained semantic models in Power BI for consistent reporting across 20+ dynamic data sources.

Supported KPI reporting pipelines, contributing to ~$1.2M in operational cost savings through data-driven decision-making.

Partnered with data science teams to provide feature-ready datasets for ML models such as forecasting and risk analysis.

Wrote data validation and reconciliation scripts in Python and SQL to ensure data accuracy, completeness, and consistency.

Implemented Delta Lake time travel and schema evolution to manage historical data and adapt to changing source structures.

Monitored pipelines using Airflow task status, Databricks jobs, AWS CloudWatch, and Datadog alerts to ensure SLA compliance.

Applied Spark performance tuning (broadcast joins, partition pruning, cluster optimization) to reduce runtimes and improve efficiency.

Managed data governance, lineage, and access controls in Databricks using Unity Catalog to support compliance and audit readiness.

Adopted CI/CD pipelines with Git, Azure DevOps, and Terraform to automate deployment of jobs and infrastructure provisioning.

Deployed parameterized pipelines to reuse ETL logic across multiple business units with minimal customization.

Designed error-handling and retry logic within data pipelines (including Airflow on-failure callbacks) to minimize downtime and ensure reliability.

Provided ad hoc SQL and PySpark support to analysts for one-off reporting and urgent data requests.

Documented ETL workflows, transformation logic, and best practices, improving knowledge sharing across the team.

Collaborated in Agile ceremonies with business users, analysts, and engineering teams to deliver scalable, production-ready solutions aligned with evolving needs.

Client: TD Bank March 2020 - Jan 2022

Role: Azure Data Engineer

Designed and developed scalable ETL pipelines using Apache Spark (Scala/Python) to process high-volume transactional, credit card, mortgage, and trading data across multiple banking divisions.

Migrated legacy on-prem pipelines to Microsoft Azure, building a modern data platform with Azure Data Lake Storage (ADLS), Azure Databricks, and Azure Data Factory (ADF) to improve scalability and reduce latency.

Automated real-time data ingestion from Kafka and Azure Event Hubs into HDFS, HBase, and Cassandra, supporting fraud detection, AML, and suspicious activity monitoring.

Built curated data marts and reporting datasets on Azure Synapse Analytics to deliver insights for liquidity monitoring, risk analytics, and financial performance tracking.

Implemented PySpark and SQL transformations for cleansing, enrichment, and reconciliation of financial datasets to ensure accuracy for downstream analytics.

Optimized Spark workloads using partitioning, bucketing, Delta Lake, and caching strategies, reducing runtime for large datasets used in banking analytics.

Orchestrated pipelines using Apache Airflow and ADF, enabling dependency management, SLA monitoring, and automated failure alerts.

Leveraged Azure Purview and Glue Data Catalog for data governance, lineage tracking, and enterprise metadata management.

Built and supported data lakes on Azure (ADLS + Databricks) and delivered curated datasets to Power BI and Tableau dashboards for stakeholders across risk, fraud, and operations teams.

Integrated Google Kubernetes Engine (GKE) at TD Securities for scalable deployment of selected data processing workloads as part of a multi-cloud strategy (Azure + GCP).

Utilized advanced SQL (T-SQL, PL/SQL) and NoSQL queries (Cassandra, HBase) for data extraction, transformation, and reconciliations across multiple sources.

Deployed infrastructure-as-code solutions using Terraform and ARM templates to provision and manage Azure resources consistently.

Adopted CI/CD practices with Git, Jenkins, and Azure DevOps pipelines to automate deployment of Spark and data pipelines.

Implemented Delta Lake ACID transactions and schema evolution to manage historical and evolving financial datasets.

Created data quality dashboards to monitor SLAs, missing data, and anomalies across transaction and fraud detection pipelines.

Partnered with data scientists, risk analysts, and business teams to translate banking use cases into scalable Spark/SQL pipelines.

Provided ad-hoc queries and datasets to support analytics for fraud detection, customer risk scoring, and liquidity reporting.

Participated in Agile sprints, peer code reviews, and design discussions, ensuring reusable and maintainable solutions.

Documented end-to-end pipelines, controls, and lineage to support internal data governance and audit readiness.

Client: Elevance Health (Anthem) July 2019 - Feb 2020

Role: Data Analyst/Engineer

Processed and validated healthcare claims, enrollment, and HEDIS quality datasets to support compliance reporting and performance analytics.

Ingested structured data from Oracle and SQL Server into Hadoop (HDFS, Hive) using Sqoop, enabling incremental refreshes of claims and member data.

Built and optimized HiveQL and Impala queries to perform aggregations, joins, and reporting on large-scale healthcare datasets for NCQA/HEDIS measures.

Developed PySpark jobs to clean, transform, and enrich claims and encounter data, ensuring readiness for downstream analytics and quality checks.

Assisted in implementing Kafka pipelines to capture real-time healthcare transaction events and integrated them with Spark Streaming for near real-time analytics.

Wrote Python and Shell scripts to automate ETL workflows, perform schema validation, and track healthcare data quality issues across multiple sources.

Leveraged AWS EMR clusters for scalable processing of healthcare workloads and stored processed outputs in Amazon S3 for data lake integration.

Designed and executed data validation and profiling checks to identify anomalies, missing values, and duplicates in claims and eligibility datasets.

Used Jupyter Notebooks with Pandas/NumPy for exploratory analysis, trend detection, and prototyping healthcare data transformations.

Contributed to the migration of legacy MapReduce jobs into Spark-based transformations, reducing runtime and simplifying code maintenance.

Produced ad hoc SQL reports and OLAP summaries to support analysts in provider performance tracking, claims cost analysis, and HEDIS reporting cycles.

Collaborated with data architects and healthcare SMEs to understand HEDIS requirements and translate them into scalable SQL queries and Spark workflows.

Maintained Git repositories, participated in Agile sprint ceremonies, and documented ETL workflows and data lineage to support audit and compliance requirements.

Education:

Master of Engineering in Applied AI

Stevens Institute of Technology
