
Data Engineer Senior

Location: Ahmedabad, Gujarat, India
Posted: September 19, 2025


GAYATRI GUPTA

SENIOR DATA ENGINEER | AWS | PYSPARK | SNOWFLAKE | SQL | ML-READY PIPELINES | DATABRICKS

Orange County, CA | ****************@*****.*** | +1-949-***-**** | LinkedIn

PROFESSIONAL SUMMARY

I am a Senior Data Engineer with about nine and a half years of experience designing and deploying scalable, cloud-native data solutions. I specialize in building end-to-end, ML-ready data pipelines using AWS, PySpark, Snowflake, SQL, and Databricks. My expertise lies in transforming raw data into actionable insights, optimizing performance for large-scale data workloads, and driving data platform modernization across cross-functional teams. I am passionate about leveraging best-in-class tools to enable advanced analytics and support business-critical decision-making.

SKILLS

Cloud Platforms: AWS (S3, Glue, Lambda, EMR, Redshift, Athena, IAM), Azure Data Lake, Azure Databricks, Snowflake

Data Engineering: PySpark, Spark (RDD, DataFrames, Structured Streaming), SQL, ETL/ELT Pipelines, dbt, Delta Lake, Apache Kafka, Apache NiFi, Airflow

ML/AI Pipeline Support: ML-Ready Pipelines, Feature Engineering, SageMaker Integration, MLflow, TensorFlow Extended (TFX), Model Registry

Data Warehousing: Snowflake, Redshift, Azure Synapse, Dimensional Modeling, Star and Snowflake Schema Design

DevOps & Automation: CI/CD (GitHub Actions, Jenkins), Terraform, Docker, Kubernetes, Apache Airflow, CloudFormation, Infrastructure as Code (IaC)

Collaboration Tools: Jira, Confluence, Trello, Slack, Microsoft Teams, Notion

Monitoring & Security: CloudWatch, Datadog, Splunk, AWS CloudTrail, VPC, IAM Policies, KMS, GDPR, HIPAA, Data Masking, Access Control

Visualization & Reporting: Power BI, Tableau, QuickSight, Looker, SQL-based Dashboards

Soft Skills: Problem Solving, Cross-Functional Collaboration, Agile/Scrum, Stakeholder Communication, Analytical Thinking, Leadership, Documentation Writing

PROFESSIONAL EXPERIENCE

JPMORGAN CHASE Nov 2024 – May 2025

SENIOR DATA ENGINEER

Led the design and deployment of a unified data engineering platform on Databricks to support the migration of 4TB+ of financial and market data for FRB reporting and compliance use cases.

Developed scalable batch ETL pipelines using PySpark and Delta Lake with schema enforcement, partitioning, and time travel, enabling traceability and ML reproducibility.
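
Illustrative sketch of the Delta Lake batch pattern described above (paths, table names, and columns are hypothetical, and a Databricks runtime with the delta format is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("market_data_batch").getOrCreate()

# Read a raw batch drop and enforce the expected schema by casting to known columns
raw = spark.read.parquet("s3://example-bucket/raw/market_data/")
curated = raw.selectExpr(
    "CAST(trade_id AS STRING) AS trade_id",
    "CAST(trade_ts AS TIMESTAMP) AS trade_ts",
    "CAST(notional AS DECIMAL(18,2)) AS notional",
    "trade_date",
)

# Delta enforces the table schema on write; partitioning by trade_date keeps scans narrow
(curated.write.format("delta")
    .mode("append")
    .partitionBy("trade_date")
    .save("s3://example-bucket/curated/market_data/"))

# Time travel: re-read the exact version a model or regulatory report was built from
as_of_v3 = (spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("s3://example-bucket/curated/market_data/"))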

Orchestrated 40+ ETL jobs via Databricks Workflows and Apache Airflow, reducing operational overhead and ensuring SLA compliance for downstream risk and fraud analytics.
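
A minimal Airflow sketch of this kind of orchestration, assuming the Databricks provider package is installed; the DAG name, connection, and job IDs are illustrative:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=2),  # SLA misses surface in Airflow's SLA reporting
}

with DAG(
    dag_id="risk_feeds_daily",
    start_date=datetime(2024, 11, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    # Each task triggers an existing Databricks Workflows job by its job ID
    ingest = DatabricksRunNowOperator(
        task_id="ingest_market_data",
        databricks_conn_id="databricks_default",
        job_id=1234,
    )
    curate = DatabricksRunNowOperator(
        task_id="curate_delta_tables",
        databricks_conn_id="databricks_default",
        job_id=5678,
    )
    ingest >> curate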

Integrated AWS Glue for metadata cataloging and Lambda for triggering API-based ingestions with retry logic, audit logging, and alerting.
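
A hedged sketch of the Lambda-side ingestion trigger; the bucket, event fields, and vendor endpoint are hypothetical:

import json
import time
import urllib.request

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Pull one payload from a vendor API and land it in S3, with bounded retries."""
    url = event["source_url"]   # supplied by the triggering event/rule
    key = event["target_key"]   # e.g. raw/vendor_x/2025-05-01.json
    last_err = None
    for attempt in range(3):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                payload = resp.read()
            s3.put_object(Bucket="example-raw-bucket", Key=key, Body=payload)
            print(json.dumps({"status": "ok", "key": key, "attempt": attempt}))  # audit log line
            return {"statusCode": 200}
        except Exception as err:  # broad catch for the sketch; narrow in production
            last_err = err
            time.sleep(2 ** attempt)
    # Raising lets a CloudWatch alarm on function errors drive the alerting
    raise RuntimeError(f"ingestion failed after retries: {last_err}")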

Built curated data layers in Snowflake using Snowpipe-based loading and query optimization via clustering and materialized views.
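
For illustration, the Snowflake side of this pattern can be scripted with the Python connector; the account, stage, and table names below are placeholders, not the actual environment:

import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="***",            # would come from a secrets manager in practice
    warehouse="CURATED_WH",
    database="FINANCE",
    schema="CURATED",
)
cur = conn.cursor()

# Snowpipe: continuously COPY newly staged files into the raw table
cur.execute("""
    CREATE PIPE IF NOT EXISTS raw_trades_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_trades FROM @raw_trades_stage FILE_FORMAT = (TYPE = 'PARQUET')
""")

# Clustering plus a materialized view keep the hot reporting queries cheap
cur.execute("ALTER TABLE curated_trades CLUSTER BY (trade_date)")
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_trade_summary AS
    SELECT trade_date, COUNT(*) AS trade_count, SUM(notional) AS total_notional
    FROM curated_trades
    GROUP BY trade_date
""")

cur.close()
conn.close()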

Migrated legacy on-prem batch jobs to Databricks on AWS, improving processing efficiency by 40% and simplifying operational maintenance.

Established CI/CD pipelines using Terraform, GitLab, and Jenkins, driving repeatable infrastructure and rapid environment provisioning.

Ensured compliance with SOX, FINRA, and GDPR via IAM-based access controls, S3 encryption, Delta audit history, and robust data handling policies.

Enabled proactive observability through CloudWatch, Datadog, and DAG monitoring, reducing MTTR for pipeline incidents.

WALMART Feb 2023 – Nov 2024


SENIOR DATA ENGINEER

Designed and implemented a cloud-native, scalable data platform for enterprise-wide sales reporting, processing 6TB+ of daily retail, promotional, and eCommerce data across 10+ internal and vendor sources.

Developed scalable PySpark pipelines on Databricks to cleanse, normalize, and enrich multi-channel POS feeds, generating ML-ready datasets for churn modeling, basket analysis, and customer segmentation.

Engineered ingestion workflows using AWS Glue and Airflow, orchestrating batch and micro-batch pipelines with improved throughput and 50% lower execution time.

Utilized Delta Lake with schema evolution, Z-order clustering, and time travel to enable consistent historical analysis and robust audit trails.
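
A short sketch of the schema evolution and Z-ordering steps, assuming a Databricks/Delta Lake runtime that supports OPTIMIZE ... ZORDER BY; the DataFrame, path, and columns are illustrative:

# Append a feed whose schema gained new columns; mergeSchema evolves the Delta table
(enriched_pos_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .partitionBy("sale_date")
    .save("s3://example-bucket/delta/pos_enriched/"))

# Z-order co-locates rows on the columns dashboards filter by most
spark.sql(
    "OPTIMIZE delta.`s3://example-bucket/delta/pos_enriched/` ZORDER BY (store_id, sku_id)"
)

# Time travel by timestamp gives auditors the table exactly as it looked on a given date
snapshot = (spark.read.format("delta")
    .option("timestampAsOf", "2024-06-30")
    .load("s3://example-bucket/delta/pos_enriched/"))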

Built high-performance Snowflake pipelines using Snowpipe for real-time promotional and pricing data; applied clustering and caching to optimize BI queries and dashboard performance.

Partnered with analysts and product owners to create curated business marts in Snowflake that powered actionable dashboards in Amazon QuickSight for 20+ retail leaders.

Implemented dynamic partitioning and late-arriving data handling in Spark to support frequent schema changes and evolving business rules without breaking SLA windows.
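
A minimal sketch of dynamic partition overwrites plus a late-arriving-data upsert, assuming a Delta Lake release that supports both; the table paths and join keys are hypothetical:

from delta.tables import DeltaTable

# Overwrite only the partitions present in today's batch instead of the whole table
(daily_sales_df.write.format("delta")
    .mode("overwrite")
    .option("partitionOverwriteMode", "dynamic")
    .partitionBy("sale_date")
    .save("s3://example-bucket/delta/daily_sales/"))

# Late-arriving rows are merged by key, so reprocessing a day never double-counts
target = DeltaTable.forPath(spark, "s3://example-bucket/delta/daily_sales/")
(target.alias("t")
    .merge(late_rows_df.alias("s"), "t.order_id = s.order_id AND t.sale_date = s.sale_date")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())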

Enabled proactive monitoring using Airflow sensors, Lambda-based alerting, and CloudWatch logs, reducing escalations and improving DAG reliability by 60%.

Designed cost-optimized compute strategy using Databricks autoscaling clusters and AWS spot instances, achieving up to 35% reduction in compute spend.

Automated infrastructure and deployment using Terraform, GitHub Actions, and Jenkins, enabling repeatable, multi-environment rollouts across dev, QA, and prod.

Ensured data governance and compliance by integrating Great Expectations validations, enforcing column-level constraints, and applying PII protection policies using IAM and S3 encryption.
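
The validations themselves used Great Expectations; as a lighter-weight stand-in, this plain PySpark sketch shows the same column-level idea (table and column names are hypothetical):

from pyspark.sql import functions as F

df = spark.read.format("delta").load("s3://example-bucket/delta/pos_enriched/")

# Column-level constraints: required keys present, amounts non-negative, raw emails absent
checks = {
    "null_order_id": df.filter(F.col("order_id").isNull()).count(),
    "negative_amount": df.filter(F.col("sale_amount") < 0).count(),
    "unmasked_email": df.filter(F.col("customer_email").rlike("@")).count(),
}

failures = {name: n for name, n in checks.items() if n > 0}
if failures:
    # In the real pipeline this fails the task and notifies the on-call channel
    raise ValueError(f"data quality checks failed: {failures}")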

CITIBANK Jan 2021 – Jun 2021

CLOUD DATA ENGINEER

Designed and implemented cloud-native data ingestion pipelines using AWS Glue, Lambda, and Kinesis to process over 3TB of financial transaction data daily from various internal and third-party sources.

Developed ETL workflows using PySpark and AWS Glue Studio, transforming raw data into curated datasets stored in Amazon S3 and cataloged in AWS Glue Data Catalog for downstream analytics.

Migrated legacy on-premise SQL and Oracle workloads to Amazon Redshift, reducing query execution times by 45% and improving scalability for large-scale reporting.

Integrated AWS Step Functions with Lambda and Glue Jobs to orchestrate complex data workflows, enhancing automation and reducing manual intervention.

Built and deployed Terraform-based infrastructure as code templates to provision cloud resources across Dev, QA, and Prod environments, ensuring consistency and reusability.

Implemented data quality validation rules and custom monitoring logic using CloudWatch and SNS alerts, which improved data reliability across financial systems.
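
A small sketch of that monitoring hook using boto3; the namespace, dimensions, and topic are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

def report_quality(job_name: str, failed_rows: int, topic_arn: str) -> None:
    """Publish a custom metric per run and alert when a validation rule fails."""
    cloudwatch.put_metric_data(
        Namespace="DataPipelines/Quality",
        MetricData=[{
            "MetricName": "FailedRowCount",
            "Dimensions": [{"Name": "Job", "Value": job_name}],
            "Value": float(failed_rows),
            "Unit": "Count",
        }],
    )
    if failed_rows > 0:
        sns.publish(
            TopicArn=topic_arn,
            Subject=f"[data-quality] {job_name}: {failed_rows} failing rows",
            Message=f"Validation failures detected in {job_name}; see the CloudWatch metric for details.",
        )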

Collaborated with Data Governance and Compliance teams to enforce encryption, IAM roles, and audit logging, ensuring alignment with SOX and internal risk management policies.

Tuned Redshift Spectrum queries and optimized storage formats like Parquet, enhancing performance for analytical workloads and reducing unnecessary data scans.

Created reusable Athena queries, CloudFormation templates, and Python scripts to streamline reporting, provisioning, and cloud resource automation.

Participated in code reviews, cloud security assessments, and cost optimization initiatives to ensure best practices in performance, security, and efficiency.

Authored detailed technical documentation, shared design patterns, and led knowledge transfer sessions to promote team proficiency in cloud data engineering.

UNITEDHEALTH GROUP Nov 2018 – Jan 2021

BIG DATA ENGINEER


Engineered and maintained scalable big data pipelines using Apache Spark, Hive, and HBase to process and transform over 10TB of healthcare claims and clinical data daily for enterprise analytics.

Designed and implemented ETL workflows using PySpark, Kafka, and AWS Glue to automate ingestion and transformation of real-time and batch data from internal and external sources.
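
A minimal Structured Streaming sketch of the real-time half of that ingestion; brokers, topic, schema, and paths are illustrative:

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("claims_stream").getOrCreate()

claim_schema = T.StructType([
    T.StructField("claim_id", T.StringType()),
    T.StructField("member_id", T.StringType()),
    T.StructField("claim_amount", T.DoubleType()),
    T.StructField("service_date", T.StringType()),
])

# Read the claims topic, parse the JSON payload, and land partitioned Parquet on S3
claims = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "claims_events")
    .option("startingOffsets", "latest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), claim_schema).alias("claim"))
    .select("claim.*"))

query = (claims.writeStream.format("parquet")
    .option("path", "s3://example-bucket/raw/claims/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/claims/")
    .partitionBy("service_date")
    .trigger(processingTime="1 minute")
    .start())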

Developed Spark-based data validation and cleansing frameworks, ensuring data quality standards were met across 100+ production jobs, which improved analytics accuracy by 30%.

Integrated AWS Redshift and S3 for optimized storage and querying of structured and semi-structured healthcare data, reducing query execution time by 40%.

Led the migration of data processing workloads from on-premise Hadoop clusters to AWS EMR, increasing processing speed and reducing infrastructure costs by 25%.

Collaborated with data science and analytics teams to develop ML-ready datasets for healthcare models focused on patient risk prediction and care optimization.

Created and managed data catalogs using AWS Glue Data Catalog and Apache Atlas, improving metadata governance and enabling better data discoverability.

Monitored and orchestrated data pipelines using Apache Airflow, CloudWatch, and Datadog, resolving performance bottlenecks and ensuring system reliability.

Enforced data encryption, role-based access control, and compliance checks to align data workflows with HIPAA and organizational data governance policies.

Worked collaboratively with Data Architects, Security Engineers, and Business Analysts to define data models, integration strategies, and quality standards for healthcare platforms.

Authored detailed technical documentation, led peer code reviews, and mentored junior engineers on big data tools, cloud engineering, and best practices.

COX COMMUNICATIONS Apr 2017 – Nov 2018

SPARK DEVELOPER

Developed and maintained distributed Spark applications using PySpark and Scala on Hadoop YARN clusters, processing over 5TB of telecom usage and customer interaction data daily.

Designed and implemented ETL pipelines to extract data from Kafka, transform it using Spark SQL, and load it into Hive tables for downstream analytics, reducing data latency by 35%.

Optimized Spark jobs by implementing broadcast joins, caching, and partition pruning, which improved job execution time by 40% on average.
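
Sketch of those optimizations in PySpark; the dataset layout and column names are hypothetical:

from pyspark.sql import functions as F

# Filtering on the partition column first lets Spark prune partitions before the join
usage = (spark.read.parquet("s3://example-bucket/telecom/usage/")
    .filter(F.col("event_date") == "2018-06-01"))

# The customer dimension is small, so broadcasting it avoids a full shuffle join
customers = spark.read.parquet("s3://example-bucket/telecom/customers/")
joined = usage.join(F.broadcast(customers), "customer_id")

# Cache the joined set once because several aggregations reuse it downstream
joined.cache()
by_plan = joined.groupBy("plan_type").agg(F.sum("minutes_used").alias("total_minutes"))
by_region = joined.groupBy("region").agg(F.count("*").alias("interactions"))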

Collaborated with Data Architects to migrate legacy MapReduce workflows to Apache Spark, reducing code complexity and improving maintainability.

Utilized AWS S3 as a data lake for raw and processed datasets and integrated with AWS Glue and EMR for scalable Spark job execution.

Implemented unit testing, conducted code reviews, and managed version control using Git, Bitbucket, and Jenkins, ensuring CI/CD best practices.

Executed data validation and anomaly detection using Spark DataFrames, incorporating logic to identify and flag inconsistencies in customer-related data.

Built reusable Spark modules for data standardization, enrichment, and aggregation, supporting multiple downstream analytics initiatives.

Automated Spark job scheduling and monitoring using Apache Airflow, ensuring timely data delivery to platforms such as Tableau and Power BI.

Partnered with Data Scientists to prepare machine learning-ready datasets for models including churn prediction and customer segmentation.

Created comprehensive technical documentation and led knowledge-sharing sessions to enhance Spark proficiency across engineering and analytics teams.

SAMMONS Mar 2014 – Apr 2017

HADOOP DEVELOPER


Developed and maintained scalable data pipelines using Hadoop ecosystem tools including HDFS, MapReduce, Hive, Pig, and Sqoop to process and analyze over 10TB of structured and unstructured data across multiple business domains.

Designed and implemented ETL workflows to ingest large volumes of data from Oracle, SQL Server, and flat files into HDFS, reducing manual data processing efforts by 60%.

Built and optimized Hive queries to enable ad hoc reporting and analytics for internal teams, improving query performance by 40% through partitioning, bucketing, and indexing.

Integrated Apache Sqoop to efficiently import/export data between Hadoop and RDBMS, achieving 30% faster data migration and streamlining batch data ingestion pipelines.

Collaborated with Data Scientists to develop custom MapReduce jobs for complex transformations and analytics use cases, supporting over 5 production use cases monthly.
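
For illustration, a custom aggregation of this kind can be expressed as a Hadoop Streaming job in Python; the field positions and the premium-by-policy-type semantics are hypothetical. The two scripts below would be passed to the standard hadoop-streaming jar as the mapper and reducer.

# mapper.py: emit (policy_type, premium) for each tab-delimited input record
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        continue  # skip malformed rows
    policy_type, premium = fields[2], fields[4]
    print(f"{policy_type}\t{premium}")

# reducer.py: sum premiums per policy_type (the framework delivers input sorted by key)
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total:.2f}")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total:.2f}")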

Conducted extensive data profiling and cleansing using Pig scripts and custom UDFs, ensuring higher data quality for downstream applications.

Utilized Apache Oozie to schedule and manage job workflows, automating data pipeline execution and dependency management across the environment.

Participated in configuration and performance tuning of the Cloudera Hadoop cluster, including resource allocation, job tracking, and system monitoring.

Shared reusable components such as HiveQL scripts, Pig macros, and workflow templates to improve collaboration and reduce development effort within the team.

Documented technical specifications and conducted knowledge transfer sessions to onboard new developers and educate stakeholders on data engineering processes.

Worked cross-functionally with BI Analysts, Data Architects, QA Engineers, and Product Owners to ensure data accuracy and integrity throughout the lifecycle.

EDUCATION

Bachelor of Information Technology, BPUT University, 2013.

CERTIFICATIONS

● AWS Certified Data Engineer – Associate

● AWS Certified AI Practitioner

● Post Graduate Program in Artificial Intelligence and Machine Learning: Business Applications – The University of Texas at Austin


