Keshika Arunkumar
Boston, MA +1-857-***-**** ******************@*****.*** LinkedIn GitHub
SUMMARY
Data Engineer with over 3 years of experience building ETL pipelines, distributed data processing workflows, and analytics
platforms across mobility, marketing technology, and enterprise infrastructure domains. Skilled in PySpark, Apache Kafka,
Databricks, Snowflake, AWS Glue, Hadoop, SQL, dbt, Airflow, and BigQuery. Supported more than 250 recurring reporting and
ingestion workflows across enterprise data operations while reducing batch processing interruptions and reporting delays.
PROFESSIONAL EXPERIENCE
Data Engineer USA
Uber Feb 2026 - Present
● Built data pipelines processing nearly 4M ride and trip events daily using PySpark and Apache Kafka across Snowflake
environments, reducing operational reporting delays by 4 hours for marketplace analytics teams.
● Automated Delta Lake ingestion workflows using Databricks, Airflow, and SQL validation checks for rider and driver datasets,
removing more than 850 recurring weekly data quality exceptions from reporting systems.
● Refined PostgreSQL and BigQuery query performance through indexing, partitioning, and ETL optimization techniques,
accelerating 140+ recurring analytics workloads supporting operations and financial reporting teams.
● Standardized Terraform automation and CI/CD deployment processes for containerized analytics services, reducing manual
release validation effort by 9 hours across monthly production deployments.
● Coordinated data validation initiatives with product managers, analysts, and backend engineers using Python and dbt
monitoring models, resolving 320+ schema and transformation issues before executive dashboard releases.
Data Engineer USA
Epsilon Jan 2025 - Aug 2025
● Engineered customer segmentation pipelines using Spark SQL and Amazon Redshift for multi-channel campaign data,
processing nearly 3.8M customer engagement records during quarterly targeting cycles.
● Consolidated batch ingestion processes through AWS Glue, Python, and Medallion Architecture transformation layers,
cutting failed advertising data loads by nearly 120 records across weekly audience delivery runs.
● Integrated event-stream pipelines with Apache NiFi and MongoDB for loyalty and engagement platforms, reducing
turnaround time for campaign attribution reporting across 6 regional marketing programs.
● Migrated legacy ETL mappings into Informatica PowerCenter and Oracle SQL pipelines, cutting nightly customer analytics
refresh windows by nearly 3 operational hours across production reporting environments.
● Validated audience and conversion reporting layers using dbt, Tableau, and automated reconciliation scripts, identifying more
than 90 mapping inconsistencies before downstream media performance reviews.
Associate Data Engineer IND
Dell Technologies Jul 2021 - Aug 2023
● Developed server log ingestion pipelines using Hadoop, Hive, and Sqoop for infrastructure monitoring data, processing nearly
1.6M system records supporting weekly storage and server performance reviews.
● Administered ETL scheduling routines through Unix Shell Scripting and Control-M, reducing delayed batch execution incidents
across 45 recurring monthly processing jobs within production environments.
● Transformed hardware inventory feeds using PySpark and HDFS staging layers, improving reconciliation coverage across 12
enterprise asset tracking reports used by infrastructure support teams.
● Optimized transactional queries within MySQL and SQL Server reporting environments, shortening dashboard refresh
intervals by nearly 90 minutes for supply chain analytics and inventory operations users.
● Streamlined data extraction procedures through Talend and XML parsing workflows, reducing manual remediation effort tied
to more than 80 malformed vendor records during quarterly compliance audits.
● Monitored production data validation checks using Power BI, Excel, and automated exception tracking scripts, preventing
nearly 60 reporting discrepancies from reaching infrastructure compliance review cycles.
TECHNICAL SKILLS
● Programming Languages: Python (Pandas, NumPy, PySpark), SQL, T-SQL, R, Unix Shell Scripting
● Big Data & Data Engineering: Apache Kafka, Hadoop, Hive, Sqoop, HDFS, ETL/ELT Pipelines, Delta Lake, Batch Processing,
Event-Stream Processing, Data Validation, Data Reconciliation, Data Quality, Query Optimization, Ingestion Frameworks
● Data Modeling & Warehousing: Star Schema, Dimensional Modeling, SCD Type 1, SCD Type 2, Medallion Architecture
● Cloud & Lakehouse Platforms: AWS (S3, EKS), Azure (Data Factory, SQL Database, ADLS Gen2), GCP, Microsoft Fabric
(OneLake, Lakehouse), Databricks, Snowflake, Amazon Redshift, BigQuery
● Databases & Warehouses: PostgreSQL, MySQL, SQL Server, Oracle SQL, MongoDB
● ETL & Orchestration Tools: Apache Airflow, Azure Data Factory, Informatica PowerCenter, Talend, dbt, Great Expectations
● CI/CD & DevOps Tools: Terraform, Docker, Kubernetes, Git, GitHub Actions, CI/CD Pipelines
● Visualization & Reporting Tools: Power BI, Tableau, Excel
● Data Formats & Processing: XML, Indexing, Partitioning
EDUCATION
Master of Science in Data Analytics Engineering Boston, MA
Northeastern University Dec 2025
PROJECTS
RideStream: Real-Time Uber Data Pipeline
● Constructed real-time ingestion pipelines using Azure Event Hubs, PySpark Structured Streaming, and Databricks, processing
more than 10K ride events into Bronze, Silver, and Gold Delta Lake layers.
● Modeled analytical datasets through Star Schema, SCD Type 1 & 2, and Medallion Architecture design patterns, reducing
query joins across downstream reporting workflows.
OutbreakLens: Global Disease Surveillance Pipeline
● Deployed automated ingestion workflows through Apache Airflow and PostgreSQL, loading more than 260K public health
records from 3 external data sources through 5 scheduled DAGs.
● Curated transformation models using dbt and automated data quality validations, executing 27 validation checks supporting
outbreak trend monitoring and reporting accuracy.
Financial Data Quality Monitor: SQL, Airflow, Great Expectations
● Built automated data quality pipelines using Airflow, SQL, and Great Expectations, validating 5.6K+ financial records across
multiple data sources.
● Developed anomaly detection and monitoring workflows with 29 automated validation checks, identifying 230+ data quality
issues through reconciliation and health scoring.
CERTIFICATIONS
● Microsoft Certified: Fabric Data Engineer Associate
● Databricks – Generative AI Application Development
● Databricks – GenAI Application Evaluation and Governance
● Databricks – Generative AI Application Deployment and Monitoring
● Google Data Analytics
● Tableau Certified (Simplilearn)
● AWS Introduction to Generative AI