VATSAL PATEL
Data Engineer
Location: CA | Email: ******.******@*****.*** | Phone: (408) 309-0780 | LinkedIn

SUMMARY
Data Engineer with 4+ years of experience designing and delivering large-scale data pipelines, ETL processes, and analytical data models across insurance, utilities, and financial services. Skilled in building real-time data ingestion and transformation workflows using Apache Kafka, PySpark, Hadoop, and Informatica, with strong expertise in Snowflake, AWS (S3, RDS), and SQL/Teradata. Experienced in workflow orchestration with Apache Airflow, CI/CD automation using GitLab/Jenkins, and infrastructure deployment via Terraform. Adept at implementing data quality checks (Great Expectations), monitoring with Grafana/Splunk, and enforcing governance with Collibra, role-based access, and compliance controls. Proven ability to collaborate with cross-functional teams, including data science, BI, and risk/compliance, while mentoring junior engineers and delivering reliable, regulatory-compliant, and business-impactful data solutions.
SKILLS
Programming & Scripting: Python (Pandas, PySpark), SQL, Scala, Shell Scripting
Big Data & Processing: Apache Spark, Databricks, PySpark, Hadoop, Kafka, Informatica, Batch & Real-time ETL
Databases & Warehousing: PostgreSQL, MySQL, Oracle DB, Snowflake, BigQuery
Visualization & BI: Tableau, Power BI, Matplotlib, Seaborn
Machine Learning & Analytics: Scikit-learn, ML pipelines, Feature Engineering, Predictive Modeling
Cloud & Deployment: AWS (S3, RDS, EC2, Lambda, Glue), GCP, Azure Data Factory (exposure), Docker, Kubernetes, Terraform, Jenkins
Workflow & Monitoring Tools: Apache Airflow, Git/GitLab, JIRA, Confluence, Grafana, Splunk, ELK Stack

EXPERIENCE
Data Engineer | Berkshire Hathaway, CA | Jun 2025 – Present
Designed and implemented data ingestion pipelines using Apache Kafka and PySpark to stream claims, utility, and asset data into AWS S3 with near real-time availability (<5 minutes).
Developed ETL processes to cleanse, deduplicate, and transform large datasets, loading curated data into Snowflake for analytics and reporting.
Automated orchestration workflows with Apache Airflow, ensuring SLA compliance and end-to-end pipeline monitoring.
Implemented data quality checks with Great Expectations and built monitoring dashboards in Grafana to improve pipeline reliability and reduce data issues.
Designed analytical data models and marts in Snowflake using dbt to support BI dashboards and ML feature stores.
Built interactive Tableau dashboards for executives and risk managers, providing insights into claims patterns, energy consumption, and asset utilization.
Enforced data governance and security controls including encryption, role-based access, and compliance with regulatory standards.
Deployed pipelines and infrastructure using Terraform and CI/CD (GitLab), enabling faster, more reliable releases.

Data Engineer Intern | Berkshire Hathaway, CA | Feb 2025 – May 2025
Gathered raw datasets from insurance policy systems and smart meter logs; documented source formats (CSV, JSON, relational tables) in a central knowledge base for senior engineers.
Performed exploratory data profiling using SQL and Pandas, identifying missing values, inconsistent date formats, and duplicate policy IDs, which informed data cleaning rules.
Created lightweight Python scripts to batch-load daily claims extracts into AWS RDS and scheduled runs via cron, ensuring reliable data availability for testing and analysis.
Designed Tableau prototype dashboards to visualize claims settlement times and regional energy consumption, providing early insights into potential operational bottlenecks.
Sr. Data Engineer | Capgemini (Client: Morgan Stanley), India | Jun 2020 – Dec 2023
Documented source-to-target mappings and metadata for 5+ external financial data providers; leveraged Collibra to maintain lineage, ensuring compliance with regulatory audits.
Performed profiling of heterogeneous datasets (market data, client portfolios, and risk metrics) in SQL and Teradata, identifying anomalies and redundancies that improved data integrity by 30%.
Designed and automated end-to-end ETL pipelines using PySpark, Hadoop, and Informatica, delivering large-scale ingestion and transformation of financial datasets.
Implemented workflows in Apache Airflow and developed shell script automations to schedule daily ingestion tasks, enforce dependencies, and reduce manual intervention.
Partnered with data science teams to integrate curated datasets into risk-model pipelines; automated delivery with Git & Jenkins CI/CD, reducing ML deployment time by 40%.
Monitored production jobs with Splunk dashboards and performed Tier 2 support, introducing proactive alerting and resolution strategies that ensured 99.9% uptime for mission-critical data services.
Guided junior engineers on PySpark/Hadoop coding standards and collaborated with risk & compliance teams to translate regulatory requirements into scalable data solutions on Snowflake and Oracle.

EDUCATION
Master's in Computer Science | May 2025
San Francisco Bay University

Bachelor of Technology in Computer Science | May 2021
Parul University, India
PROJECTS
Smart Retail Checkout System – Capstone
Developed and trained an object detection model using YOLOv8 on the RPC dataset, achieving 87% mean Average Precision (mAP) for automated product recognition.
Utilized Python, PyTorch, OpenCV, and Google Colab to design an end-to-end pipeline for real-time retail checkout automation.

Zomato Review Sentiment Analysis
Collected and preprocessed 10K+ customer reviews through web scraping, applied TF-IDF vectorization, and built classification models for sentiment analysis.
Achieved an 87% F1-score using Scikit-learn, Beautiful Soup, and Seaborn, enabling insights into customer satisfaction trends.