VARUNKUMAR SONAWANE
*********@*****.*** | 812-***-**** | linkedin.com/in/varun-sonawane | GitHub Portfolio

Education
Indiana University Bloomington, IN, USA | Aug 2024 – May 2026
Master of Science in Computer Science | CGPA: 3.83/4
Pune University, MH, India | Aug 2019 – May 2023
Bachelor’s Degree in Computer Science | CGPA: 9/10
Certifications
AWS Certified Developer – Associate
Microsoft Azure AI Fundamentals

Experience
Associate Instructor, Data Science on Ramp & Data Representations | Jan 2025 – Present
Indiana University, Bloomington, IN
• Advised 120+ students as a technical mentor on end-to-end data engineering workflows, guiding architecture decisions for distributed PySpark pipelines, SQL data models, and cloud-native deployments on GCP.
• Translated complex BigQuery schema design, normalization, and ETL/ELT concepts into accessible labs for non-technical audiences, reducing data preparation errors by 20% and improving downstream analytical readiness.
• Built 10+ hands-on modules covering Apache Spark distributed processing, DataFrames, MapReduce-style transformations, and large-scale aggregations using Scala, reinforcing production-grade big data engineering practices.

Data Engineer Intern | May 2025 – Aug 2025
The Commons XR, San Diego, CA
• Designed and managed scalable GCP data pipelines migrating 5M+ records/week from OLTP PostgreSQL to OLAP BigQuery via Datastream, enforcing data quality validation checks to ensure pipeline reliability and 99%+ ingestion accuracy.
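The ingestion-accuracy check described above can be sketched as a simple source-vs-target count comparison (a minimal illustration only; the function names, sample counts, and use of a 99% threshold are assumptions drawn from the bullet, not the pipeline's actual code):

```python
# Hypothetical sketch of a row-count validation for a PostgreSQL -> BigQuery
# replication. In practice the counts would come from psycopg2 and
# google-cloud-bigquery queries; here they are plain integers so the
# snippet runs standalone.

def ingestion_accuracy(source_count: int, target_count: int) -> float:
    """Fraction of source rows that landed in the target table."""
    if source_count == 0:
        return 1.0
    return min(target_count, source_count) / source_count

def validate(source_count: int, target_count: int, threshold: float = 0.99) -> bool:
    """Fail the pipeline run when ingestion accuracy drops below threshold."""
    return ingestion_accuracy(source_count, target_count) >= threshold

print(validate(5_000_000, 4_990_000))  # 99.8% accuracy -> True
```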
• Built and deployed an LLM-powered RAG pipeline on Vertex AI to automate post-session behavioral analytics, replacing 92% of manual reporting for 2 XR sessions/week and generating structured, session-level efficiency insights for both technical and non-technical stakeholders.
• Created 10+ intuitive dashboards in Python (Plotly, Dash) providing clear visibility into XR platform efficiency metrics, translating complex operational data into actionable summaries for cross-functional stakeholders.
• Architected a BigQuery cloud data warehouse using Medallion Architecture with layered SQL transformations, delivering analytics-ready datasets aligned to downstream reporting and resource-planning requirements.

Software Engineering Intern | July 2023 – Jan 2024
CodeClause, Pune, India
• Architected and deployed a cloud-native data pipeline on AWS (S3, Lambda, DynamoDB, ECS Fargate) using Docker and CI/CD workflows, processing 10K+ daily records with structured logging and observability across distributed microservices.
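The structured-logging pattern in a pipeline like this might look roughly like the following Lambda-style record handler (the handler shape and field names are assumptions; the DynamoDB write and all boto3 calls are omitted so the sketch runs standalone):

```python
# Illustrative sketch of structured logging around per-record processing.
# Field names ("Records", "body") follow the common SQS-triggered Lambda
# event shape, which is an assumption about this pipeline.
import json
import logging

logger = logging.getLogger("pipeline")
logging.basicConfig(level=logging.INFO)

def handler(event: dict, context=None) -> dict:
    records = event.get("Records", [])
    processed = 0
    for record in records:
        try:
            body = json.loads(record["body"])
            # ... transform and persist (e.g. DynamoDB put) would go here ...
            processed += 1
        except (KeyError, json.JSONDecodeError) as exc:
            # Structured error log: machine-parseable JSON, not free text.
            logger.error(json.dumps({"event": "record_failed", "error": str(exc)}))
    logger.info(json.dumps({"event": "batch_done", "processed": processed}))
    return {"processed": processed, "total": len(records)}

print(handler({"Records": [{"body": '{"id": 1}'}, {"body": "not-json"}]}))
# -> {'processed': 1, 'total': 2}
```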
• Stabilized 5+ REST API modules by adding unit tests and error handling within a Git-based CI/CD workflow, troubleshooting technical issues across services and reducing recurring runtime failures throughout the SDLC.

Projects
DataLens – AI-Powered Data Storytelling Platform | Python, Vertex AI, Gemini Pro, BigQuery, GCS, Cloud Run
• Designed and deployed a production GCP-native data platform on Google Cloud Run, orchestrating Vertex AI Vector Search and Gemini Pro to ingest, process, and transform raw datasets into real-time narrative insights via SSE streaming.
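The SSE streaming step can be framed roughly as below (a hedged sketch: the chunk source is a plain list standing in for the actual Gemini Pro streaming response, and the function names are illustrative):

```python
# Minimal sketch of server-sent-events (SSE) framing for streaming
# narrative chunks to a browser over text/event-stream.

def sse_frame(data: str, event: str = "") -> str:
    """Format one SSE message per the text/event-stream wire format."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # Multi-line payloads become repeated "data:" lines per the SSE spec.
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"

def stream(chunks):
    """Yield each narrative chunk as an SSE frame, then a terminator."""
    for chunk in chunks:
        yield sse_frame(chunk, event="narrative")
    yield sse_frame("[DONE]", event="end")

print("".join(stream(["Revenue rose 12%", "driven by Q3 sales"])))
```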
• Built automated data validation and session persistence workflows in GCS and Firestore, ensuring pipeline reliability and accuracy across multimodal outputs for both technical and non-technical stakeholder audiences.

IRIS – Intelligent Retail & Ingredient Scanner | Node.js, Claude Sonnet Vision, ElevenLabs, Docker, Fly.io | Hackathon Winner
• Engineered a multi-source data enrichment pipeline querying 3 external APIs in parallel via Promise.all, merging structured product and nutrition data before passing enriched payloads to Claude Sonnet Vision for safety evaluation.
• Designed a structured output reliability layer with a 4-step JSON parsing fallback ensuring zero pipeline crashes on imperfect LLM output across all 8 backend API endpoints.
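A JSON-recovery fallback of this kind could look roughly like the sketch below, written in Python for consistency with the rest of this document (the actual IRIS layer is Node.js, and its exact four steps are not specified here; these stages are plausible assumptions):

```python
# Hypothetical 4-step fallback for parsing imperfect LLM output
# without ever raising: parse as-is, strip code fences, extract the
# first JSON object, then fall back to a safe default.
import json
import re
from typing import Optional

def parse_llm_json(raw: str, default: Optional[dict] = None) -> dict:
    # Step 1: parse as-is.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Step 2: strip markdown code fences and retry.
    stripped = re.sub(r"```(?:json)?", "", raw).strip()
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        pass
    # Step 3: extract the first {...} block and retry.
    match = re.search(r"\{.*\}", stripped, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Step 4: never crash; return a safe default.
    return default if default is not None else {"safe": False, "reason": "unparseable"}

print(parse_llm_json('Here you go:\n```json\n{"safe": true}\n```'))
# -> {'safe': True}
```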
• Optimized end-to-end scan-to-spoken-verdict latency to 2 seconds by routing voice endpoints to Claude Haiku and implementing an LRU TTS cache (150 entries, MD5 hashing) to cut redundant ElevenLabs API calls.

Big Data Analytics Platform | PySpark, Hadoop, MapReduce, AWS S3, EC2, Glue, SageMaker, Power BI
• Processed 100K+ records through an end-to-end Hadoop/PySpark pipeline on EC2, applying MapReduce-style distributed transformations, AWS Glue schema management, and S3 partitioned storage for downstream analytics.
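The MapReduce-style aggregation pattern can be illustrated with a simplified pure-Python sketch (record fields are hypothetical; in the actual pipeline PySpark would distribute the same map and reduce steps across executors):

```python
# Pure-Python illustration of map -> shuffle -> reduce over toy records.
from collections import defaultdict
from functools import reduce

records = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 80},
    {"region": "east", "sales": 50},
]

# Map: emit (key, value) pairs from each record.
mapped = [(r["region"], r["sales"]) for r in records]

# Shuffle: group values by key (Spark does this across the cluster).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: fold each group's values into one aggregate per key.
totals = {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}
print(totals)  # {'east': 170, 'west': 80}
```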
• Enabled KPI-driven reporting by triggering SageMaker AutoML on processed data and publishing efficiency dashboards in Power BI with SNS pipeline alerts, providing stakeholders with clear visibility into model outputs.

Weather Data ETL Pipeline | Airflow, Snowflake, Python, dbt, Astronomer
• Orchestrated Airflow DAGs with dependency management, retries, and scheduling to extract, transform, and load time-series weather data into Snowflake for 10+ locations, applying dbt transformations for analytics-ready datasets.

Technical Skills
Languages: Python, SQL, Java, Scala, Bash.
GCP & Cloud: BigQuery, Vertex AI, Cloud Run, Dataproc, GCS, Datastream, Firestore, AWS (S3, Glue, Lambda, ECS Fargate), Azure, Snowflake.
Big Data & ETL: Apache Spark, PySpark, Hadoop, MapReduce, Hive, Airflow, dbt, Kafka, Kinesis, Iceberg, Datadog.
Data Engineering: ELT/ETL Pipelines, Data Quality, Schema Design, Data Governance, CI/CD, Docker, Terraform.
AI/ML & Analytics: Vertex AI (RAG, Vector Search), LangChain, TensorFlow, Scikit-learn, Pandas, NumPy, Power BI, Tableau.