
Senior Data Engineer ETL/Delta Lake/Databricks Expert

Location:
Boston, MA
Posted:
January 07, 2026


Resume:

SYONA JAIMY

Boston, MA (willing to relocate) | Phone: 872-***-**** | ***********@*****.*** | LinkedIn | Github

Professional Summary

Data Engineer with 5.5+ years of experience building end-to-end ETL, streaming, and lakehouse pipelines using PySpark, Kafka, Airflow, and Databricks across AWS, Azure, and GCP, processing 850M+ monthly records in enterprise-scale ecosystems.

Experienced in Big Data engineering with deep expertise across the Hadoop ecosystem and large-scale distributed data processing.

Experienced in building modular dbt transformations, reusable SQL models, and semantic layers with automated tests across Snowflake, Redshift, and Databricks.

Implemented Databricks Lakehouse architectures using Delta Lake, Unity Catalog, and optimized clusters to enable scalable ETL, ACID-compliant processing, and cost-efficient pipelines.
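The ACID upsert that Delta Lake's `MERGE INTO` provides can be sketched, logic-wise, as a keyed update-or-insert. This stdlib-only toy (a dict keyed on an illustrative `id` column; not actual Databricks code) shows the merge semantics that Delta executes atomically at scale:

```python
def merge_upsert(target, updates, key="id"):
    """Toy MERGE: update matching rows, insert the rest.

    `target` is a dict keyed on `key`; a real Delta Lake MERGE
    performs this atomically with ACID guarantees.
    """
    for row in updates:
        target[row[key]] = {**target.get(row[key], {}), **row}
    return target

table = {1: {"id": 1, "score": 640}, 2: {"id": 2, "score": 710}}
merge_upsert(table, [{"id": 2, "score": 725}, {"id": 3, "score": 580}])
```

Matched keys are overwritten in place and unmatched keys are appended, which is exactly the WHEN MATCHED / WHEN NOT MATCHED split of a Delta MERGE statement.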

Skilled in data modeling, governance, and data quality using AWS Glue, Lake Formation, Great Expectations, and Spark to support compliant, high-availability risk infrastructures.
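The kind of declarative checks Great Expectations automates can be illustrated with two hand-rolled expectations over a list of rows (column names like `loan_id` and `status` are invented for the example, not taken from any real dataset):

```python
def expect_not_null(rows, column):
    """Great Expectations-style check: every row has a non-null value."""
    failures = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

def expect_values_in_set(rows, column, allowed):
    """Check that every value falls in an allowed domain."""
    failures = [i for i, r in enumerate(rows) if r.get(column) not in allowed]
    return {"success": not failures, "failed_rows": failures}

rows = [
    {"loan_id": "A1", "status": "approved"},
    {"loan_id": None, "status": "pending"},    # fails the not-null check
    {"loan_id": "A3", "status": "unknown"},    # fails the value-set check
]
null_check = expect_not_null(rows, "loan_id")
set_check = expect_values_in_set(rows, "status", {"approved", "pending", "rejected"})
```

Returning the failing row indices rather than just a boolean is what makes such checks useful for audit trails: the validation result points directly at the offending records.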

Experienced in integrating machine learning and LLM-driven pipelines with LangChain, FAISS, and Databricks, supporting predictive modeling and RAG-based decision intelligence across business domains.
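The retrieval half of a RAG pipeline boils down to nearest-neighbor search over embeddings. A stdlib-only toy (two hand-made 3-dimensional "embeddings"; FAISS does the same ranking at scale with approximate indexes) sketches the idea:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector store": precomputed embeddings for two documents.
store = {
    "credit policy": [0.9, 0.1, 0.0],
    "fraud playbook": [0.1, 0.9, 0.2],
}

def retrieve(query_vec, k=1):
    """Return the k most similar documents to the query embedding."""
    ranked = sorted(store, key=lambda d: cosine(query_vec, store[d]), reverse=True)
    return ranked[:k]
```

In a real LangChain + FAISS setup the embeddings come from a model and the store holds millions of vectors, but the ranking step is this same similarity sort.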

Strong experience across SQL and NoSQL databases, including DynamoDB, MongoDB, and Cosmos DB, with hands-on expertise in data modeling, performance tuning, and reliability engineering.

Proficient in DevOps and orchestration using Docker, Terraform, Kubernetes, and GitHub Actions to automate CI/CD and ensure reliable, observable multi-cloud data operations.

Experience

Community Dreams Foundation, MA Oct 2025 – Present

Data Engineer

Architected end-to-end ETL/ELT pipelines using Airflow, PySpark, and Databricks to process 850M+ monthly financial and risk records, enabling real-time credit, fraud, and compliance analytics.

Designed dimensional models and integrated SQL/NoSQL, API, AWS S3, and Kafka data (Parquet/ORC/Avro) into Snowflake and Redshift for unified financial, volunteer, and grant analytics.
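A dimensional (star-schema) model separates descriptive attributes into dimension tables that fact rows reference by surrogate key. This stdlib toy (table and column names invented for illustration) shows the join-and-rollup pattern such models exist to serve:

```python
# Dimension table keyed on a surrogate key; fact rows reference it.
dim_market = {
    1: {"market_key": 1, "region": "Northeast"},
    2: {"market_key": 2, "region": "Midwest"},
}
fact_grants = [
    {"market_key": 1, "amount": 5000},
    {"market_key": 2, "amount": 3000},
    {"market_key": 1, "amount": 2000},
]

def rollup_by_region(facts, dim):
    """Star-schema rollup: join each fact to its dimension row, sum by region."""
    totals = {}
    for f in facts:
        region = dim[f["market_key"]]["region"]
        totals[region] = totals.get(region, 0) + f["amount"]
    return totals
```

In Snowflake or Redshift this is a fact-to-dimension join with `GROUP BY`; keeping the descriptive attributes in one narrow dimension table is what makes that query cheap.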

Implemented data quality, validation, and data lineage frameworks using Great Expectations, SQL checks, and GitHub Actions, ensuring auditability and compliance with SOX, GDPR, IRDAI, and Basel controls.

Optimized ETL performance through Spark Structured Streaming, partitioning, indexing, and parallel processing, improving SLA adherence and reliability of daily/intraday risk-scoring pipelines.

Amazon Robotics, MA Jul 2024 – Dec 2024

Data Engineer (BIE)

Built batch and streaming pipelines using AWS Glue (PySpark), EMR, Spark Streaming, and Kinesis, processing 5.6TB of telemetry daily from 9,000+ autonomous robots for real-time operational insights.

Developed optimized warehouse models in Amazon Redshift using Spectrum tables, sort/distribution key tuning, and partition pruning, cutting robotics analytics query time from 12s to under 4s.
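Partition pruning, one of the tuning techniques named above, means a query touches only the partitions its predicate selects instead of scanning everything. A stdlib sketch (date-partitioned layout and column names are illustrative):

```python
# Data laid out by partition column (event_date), as on S3 / Redshift Spectrum.
partitions = {
    "2024-07-01": [{"robot": "r1", "faults": 2}],
    "2024-07-02": [{"robot": "r2", "faults": 0}],
    "2024-07-03": [{"robot": "r1", "faults": 1}],
}

def scan(dates_wanted):
    """Partition pruning: read only the partitions the predicate selects."""
    scanned = [p for p in partitions if p in dates_wanted]
    rows = [r for p in scanned for r in partitions[p]]
    return scanned, rows

scanned, rows = scan({"2024-07-02", "2024-07-03"})
```

The query-time win comes from `scanned` being a strict subset of the layout: a date filter over a year of daily partitions reads two files instead of 365.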

Implemented data quality validation and metadata governance using Great Expectations, Glue Catalog, and lineage tools, improving observability for 48+ critical datasets and ensuring PII and compliance coverage.

Partnered with ML and operations teams to build Delta-style analytical layers on S3 using Hudi/Iceberg, enabling predictive maintenance models that reduced robot fault-detection time by 35 minutes per incident and improved fleet uptime.

Northeastern University, MA Sep 2023 – Apr 2024

Graduate Teaching Assistant

Architected an Azure Databricks lakehouse integrating ERP, CRM, and academic data using dbt for transformations, processing 50,000+ engagement and fitness records and reducing query runtime by 6 seconds via a Medallion (Bronze–Silver–Gold) design.
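The Medallion pattern stages data through three layers: Bronze lands raw records, Silver cleans and types them, Gold aggregates to the grain dashboards consume. A stdlib-only sketch of that flow (field names like `student_id` and the `gym_api` source tag are invented for illustration):

```python
def bronze(raw):
    """Bronze: land records as-is, tagging provenance."""
    return [dict(r, _source="gym_api") for r in raw]

def silver(bronze_rows):
    """Silver: drop malformed rows, normalize types."""
    return [
        dict(r, minutes=int(r["minutes"]))
        for r in bronze_rows
        if r.get("student_id") and str(r.get("minutes", "")).isdigit()
    ]

def gold(silver_rows):
    """Gold: aggregate to the reporting grain (total minutes per student)."""
    totals = {}
    for r in silver_rows:
        totals[r["student_id"]] = totals.get(r["student_id"], 0) + r["minutes"]
    return totals

raw = [
    {"student_id": "s1", "minutes": "30"},
    {"student_id": None, "minutes": "15"},   # malformed: dropped at Silver
    {"student_id": "s1", "minutes": "45"},
]
result = gold(silver(bronze(raw)))
```

Each layer being a pure function of the previous one is the point of the design: a bad record can be traced back through Silver to its untouched Bronze copy.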

Built Power BI dashboards on Databricks SQL, Redshift, and Azure, powered by dbt-modeled SQL layers.

Provided retention, attendance, and engagement insights that helped program leaders identify at-risk students and improve support strategies.

Enhanced an LLM-powered RAG chatbot using LangChain and FAISS, enabling natural-language access to structured campus datasets and improving research accessibility for faculty and students.

Airbnb, India Mar 2021 – Dec 2022

Data Engineer

Built scalable ETL pipelines in AWS Glue (PySpark) and Redshift, processing Parquet/Avro booking and host-activity data from seven regional markets to deliver real-time insights and cut batch runtimes by 2.4 hours daily.

Designed data models in Snowflake and Redshift, implementing dimensional schemas that enhance user segmentation, personalized recommendations, and marketplace growth analytics.

Optimized transformation logic and SQL queries in Databricks and Spark, reducing ETL workload time by 38% and improving data availability for pricing and operations teams.

Automated validation, anomaly detection, and monitoring workflows using Python and AWS Glue triggers, ensuring continuous data quality across 120+ production datasets.
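A lightweight form of the anomaly detection described above is flagging any dataset whose daily row count drifts far from its historical mean. A stdlib sketch using a z-score threshold (dataset names and numbers invented for illustration):

```python
import statistics

def row_count_anomalies(history, today, threshold=3.0):
    """Flag datasets whose row count today deviates more than
    `threshold` standard deviations from the historical mean."""
    flagged = []
    for dataset, counts in history.items():
        mean = statistics.mean(counts)
        stdev = statistics.stdev(counts)
        if stdev and abs(today[dataset] - mean) / stdev > threshold:
            flagged.append(dataset)
    return flagged

history = {
    "bookings": [1000, 1020, 980, 1010],
    "hosts": [500, 505, 495, 500],
}
today = {"bookings": 1005, "hosts": 120}
```

In production such a check would run as a Glue trigger after each load and page the on-call when a dataset lands suspiciously small or large.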

Collaborated with data scientists to develop feature-store pipelines in S3, enabling reproducible ML training datasets and accelerating model deployment cycles by 40 minutes per iteration.

Gartner, India Jun 2019 – Jan 2021

Data Scientist / Data Engineer

Developed and maintained ETL pipelines in Azure Data Factory, Databricks (PySpark), and SQL Server, integrating client and workforce data from multiple regions into a unified analytics environment used by 50+ stakeholders.

Automated data extraction and transformation workflows using Python (Pandas, Requests, and BeautifulSoup), cutting manual sourcing time by 30 hours weekly and improving research data freshness.
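The extraction step of such a scraping workflow can be sketched with the standard library's `html.parser` (a stand-in here for the BeautifulSoup equivalent; the sample HTML is invented):

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Minimal scrape step: collect the text of every <td> cell."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed("<table><tr><td>Gartner</td><td>2020</td></tr></table>")
```

BeautifulSoup replaces the state machine with `soup.find_all("td")`, but the extract-then-normalize shape of the pipeline is the same.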

Implemented scalable analytics models and dashboards in Power BI and Azure Synapse Analytics, reducing report generation time from 9 hours to under 4 hours and enabling real-time business intelligence for leadership teams.

Core Qualifications

Programming & Scripting: Python, Pandas, PySpark, NumPy, SQL, Bash, Shell Scripting, Java, Scala

Data Engineering & Processing: Apache Spark, Spark SQL, PySpark, Spark Streaming, Databricks, Airflow, AWS Glue, Kafka, Kinesis, Delta Lake, Medallion Architecture, dbt, Data Lakehouse, Batch and Streaming Pipelines

Cloud Platforms: AWS, S3, Redshift, Kinesis, Lambda, EMR, Glue, Azure, Data Factory, Synapse, Databricks, Analysis Services, Google Cloud Platform, BigQuery, Dataflow, Pub/Sub

Data Warehousing & Databases: Snowflake, Redshift, BigQuery, PostgreSQL, MySQL, MongoDB, Cassandra, Hive, HBase

Data Modeling & Architecture: Star Schema, Snowflake Schema, Dimensional Modeling, Feature Store Design, ETL and ELT Frameworks, Data Quality and Validation, Metadata Management, Data Catalogs including AWS Glue Catalog, Collibra

Machine Learning & LLM Integration: Feature Engineering, Predictive Modeling, MLOps Integration, LLM and RAG Pipelines using LangChain, FAISS, Vector Stores, ML-Driven ETL

DevOps, Orchestration & Infrastructure: Docker, Kubernetes, Terraform, GitHub Actions, Jenkins, CI/CD Pipelines, Monitoring using Prometheus, Grafana, CloudWatch

Analytics & BI Tools: Tableau, Power BI, Looker, QuickSight, Metabase, KPI Dashboards, Real-Time Operational Dashboards

Big Data Ecosystem: Hadoop, HDFS, Spark Streaming, Kafka Streaming, Flink, Scalable Distributed Processing

Collaboration & Methodologies: Agile Scrum, Cross-Functional Collaboration, Stakeholder Communication, Requirement Gathering, Performance Optimization

Education

NORTHEASTERN UNIVERSITY, Boston, MA Jan 2023 – May 2025 Master of Science in Data Science GPA: 3.9/4.0

MOUNT CARMEL COLLEGE, Belagavi, India Jun 2016 – Apr 2019 Bachelor of Science in Data Analytics GPA: 8.8/10.0

Projects

InboxAI - Intelligent Email Assistant (LangChain, Llama3/OpenAI, Gmail/Outlook API, Airflow, ChromaDB)

• Built an AI-powered email assistant for real-time retrieval, summarization, and response generation. Integrated Airflow pipelines and ChromaDB vector embeddings for semantic search and context-aware enterprise communication.

YouTube Data Engineering Pipeline (AWS S3, Glue, Lambda, Athena, QuickSight)

• Designed a scalable pipeline to ingest and process structured and semi-structured YouTube trend data across regions. Built interactive dashboards to visualize audience retention, category trends, and geographic engagement patterns for content strategy insights.

Policy Recommendation System using LLMs (LangChain, Streamlit, CSV/Text/Image/Hyperspectral Data)

• Upgraded data ingestion and preprocessing workflows for agricultural, climate, and policy datasets, and built a LangChain + Streamlit pipeline for real-time policy recommendations, achieving 4th place among 100 teams at the Cloudera Climate & Sustainability Hackathon.

Stock Market Real-Time Analysis (Apache Kafka, PySpark, Spark SQL, Docker)

• Built Python-based ETL and Spark streaming pipelines on AWS, processing 10M+ YouTube records and 3,000+ stock market events/sec with end-to-end load times under 5 minutes, enabling distributed real-time analytics and Grafana visualization.
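Sustaining thousands of events per second comes down to windowed aggregation: count only the events inside a sliding time window. A stdlib toy of that core (a real pipeline does this with Spark Streaming windows over Kafka offsets):

```python
from collections import deque

class WindowRate:
    """Toy sliding-window counter: events within the last `window` seconds."""
    def __init__(self, window=1.0):
        self.window = window
        self.times = deque()

    def record(self, t):
        """Record an event at time t; return the count inside the window."""
        self.times.append(t)
        # Evict events that have slid out of the window.
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()
        return len(self.times)

rate = WindowRate(window=1.0)
counts = [rate.record(t) for t in (0.0, 0.2, 0.5, 1.1, 1.4)]
```

The deque keeps eviction O(1) per event, which is why the same pattern scales when Spark shards the stream by key across executors.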
