Data Engineer - PySpark, Kafka, Databricks

Location:

Boston, MA

Salary:

85000

Posted:

June 19, 2026

Resume:

Keshika Arunkumar

Boston, MA +1-857-***-**** ******************@*****.*** LinkedIn GitHub

SUMMARY

Data Engineer with over 3 years of experience building ETL pipelines, distributed data processing workflows, and analytics

platforms across mobility, marketing technology, and enterprise infrastructure domains. Skilled in PySpark, Apache Kafka, Databricks,

Snowflake, AWS Glue, Hadoop, SQL, dbt, Airflow, and BigQuery. Supported more than 250 recurring reporting and ingestion

workflows across enterprise data operations while reducing batch processing interruptions and reporting delays.

PROFESSIONAL EXPERIENCE

Data Engineer USA

Uber Feb 2026 - Present

● Built data pipelines processing roughly 4M ride and trip events daily using PySpark and Apache Kafka across Snowflake

environments, reducing operational reporting delays by 4 hours for marketplace analytics teams.

● Automated Delta Lake ingestion workflows using Databricks, Airflow, and SQL validation checks for rider and driver datasets,

removing more than 850 recurring weekly data quality exceptions from reporting systems.

● Refined PostgreSQL and BigQuery query performance through indexing, partitioning, and ETL optimization techniques,

accelerating 140+ recurring analytics workloads supporting operations and financial reporting teams.

● Standardized Terraform automation and CI/CD deployment processes for containerized analytics services, reducing manual

release validation effort by 9 hours across monthly production deployments.

● Coordinated data validation initiatives with product managers, analysts, and backend engineers using Python and dbt

monitoring models, resolving 320+ schema and transformation issues before executive dashboard releases.

Data Engineer USA

Epsilon Jan 2025 - Aug 2025

● Engineered customer segmentation pipelines using Spark SQL and Amazon Redshift for multi-channel campaign data,

processing nearly 3.8M customer engagement records during quarterly targeting cycles.

● Consolidated batch ingestion processes through AWS Glue, Python, and Medallion Architecture transformation layers,

reducing the records lost to recurring advertising data load failures by nearly 120 across weekly audience delivery runs.

● Integrated event-stream pipelines with Apache NiFi and MongoDB for loyalty and engagement platforms, reducing

turnaround time for campaign attribution reporting across 6 regional marketing programs.

● Migrated legacy ETL mappings into Informatica PowerCenter and Oracle SQL pipelines, cutting nightly customer analytics

refresh windows by nearly 3 operational hours across production reporting environments.

● Validated audience and conversion reporting layers using dbt, Tableau, and automated reconciliation scripts, identifying more

than 90 mapping inconsistencies before downstream media performance reviews.

Associate Data Engineer IND

Dell Technologies Jul 2021 - Aug 2023

● Developed server log ingestion pipelines using Hadoop, Hive, and Sqoop for infrastructure monitoring data, processing nearly

1.6M system records supporting weekly storage and server performance reviews.

● Administered ETL scheduling routines through Unix Shell Scripting and Control-M, reducing delayed batch execution incidents

across 45 recurring monthly processing jobs within production environments.

● Transformed hardware inventory feeds using PySpark and HDFS staging layers, improving reconciliation coverage across 12

enterprise asset tracking reports used by infrastructure support teams.

● Optimized transactional queries within MySQL and SQL Server reporting environments, shortening dashboard refresh

intervals by nearly 90 minutes for supply chain analytics and inventory operations users.

● Streamlined data extraction procedures through Talend and XML parsing workflows, reducing manual remediation effort tied

to more than 80 malformed vendor records during quarterly compliance audits.

● Monitored production data validation checks using Power BI, Excel, and automated exception tracking scripts, preventing

nearly 60 reporting discrepancies from reaching infrastructure compliance review cycles.

TECHNICAL SKILLS

● Programming Languages: Python (Pandas, NumPy, PySpark), SQL, T-SQL, R, Unix Shell Scripting

● Big Data & Data Engineering: Apache Kafka, Hadoop, Hive, Sqoop, HDFS, ETL/ELT Pipelines, Delta Lake, Batch Processing,

Event-Stream Processing, Data Validation, Data Reconciliation, Data Quality, Query Optimization, Ingestion Frameworks

● Data Modeling & Warehousing: Star Schema, Dimensional Modeling, SCD Type 1, SCD Type 2, Medallion Architecture

● Cloud & Lakehouse Platforms: AWS (S3, EKS), Azure (Data Factory, SQL Database, ADLS Gen2), GCP, Microsoft Fabric

(OneLake, Lakehouse), Databricks, Snowflake, Amazon Redshift, BigQuery

● Databases & Warehouses: PostgreSQL, MySQL, SQL Server, Oracle SQL, MongoDB

● ETL & Orchestration Tools: Apache Airflow, Azure Data Factory, Informatica PowerCenter, Talend, dbt, Great Expectations

● CI/CD & DevOps Tools: Terraform, Docker, Kubernetes, Git, GitHub Actions, CI/CD Pipelines

● Visualization & Reporting Tools: Power BI, Tableau, Excel

● Data Formats & Processing: XML, Indexing, Partitioning

EDUCATION

Northeastern University, Boston, MA: Master of Science in Data Analytics Engineering, Dec 2025

PROJECTS

RideStream: Real-Time Uber Data Pipeline

● Constructed real-time ingestion pipelines using Azure Event Hubs, PySpark Structured Streaming, and Databricks, processing

more than 10K ride events into Bronze, Silver, and Gold Delta Lake layers.

● Modeled analytical datasets through Star Schema, SCD Type 1 & 2, and Medallion Architecture design patterns, reducing

query joins across downstream reporting workflows.

OutbreakLens: Global Disease Surveillance Pipeline

● Deployed automated ingestion workflows through Apache Airflow and PostgreSQL, loading more than 260K public health

records from 3 external data sources through 5 scheduled DAGs.

● Curated transformation models using dbt and automated data quality validations, executing 27 validation checks supporting

outbreak trend monitoring and reporting accuracy.

Financial Data Quality Monitor (SQL, Airflow, Great Expectations)

● Built automated data quality pipelines using Airflow, SQL, and Great Expectations, validating 5.6K+ financial records across

multiple data sources.

● Developed anomaly detection and monitoring workflows with 29 automated validation checks, identifying 230+ data quality

issues through reconciliation and health scoring.

CERTIFICATIONS

● Microsoft Certified: Fabric Data Engineer Associate

● Databricks – Generative AI Application Development

● Databricks – GenAI Application Evaluation and Governance

● Databricks – Generative AI Application Deployment and Monitoring

● Google Data Analytics

● Tableau Certified (Simplilearn)

● AWS Introduction to Generative AI
