SAROJ UPRETI
Data Engineer | Houston, TX | ***** | **************@*****.*** | LinkedIn
PROFESSIONAL SUMMARY
Results-driven Data Engineer with over 7 years of experience designing and implementing large-scale data platforms across AWS, Azure, and Google Cloud Platform (GCP). Proven expertise in building robust ETL/ELT pipelines, managing data lakes and warehouses (Snowflake, BigQuery, Synapse), and developing real-time processing solutions using Spark, Dataflow, Databricks, and Kafka. Adept at leveraging tools like dbt, Airflow, and Terraform to automate workflows and infrastructure. Skilled in advanced SQL, Java, and Python, with a strong focus on data quality, governance, and performance optimization. Demonstrated ability to collaborate with cross-functional teams to deliver data solutions that drive strategic decision-making in industries including finance, healthcare, and retail.
PROFESSIONAL EXPERIENCE
JPMorgan Chase Concord, NC
Data Engineer Nov 2022 – Jun 2025
• Designed and developed ETL pipelines in Azure Data Factory, automating data ingestion from diverse sources.
• Leveraged Azure Synapse Analytics to create high-performance data warehouses for reporting and analytics.
• Built dbt models to transform raw data into analytics-ready datasets aligned with business logic and reporting needs.
• Utilized Power BI to create visually appealing and insightful dashboards, enabling data-driven decision-making.
• Optimized PostgreSQL database structures and queries, achieving a 35% reduction in query execution time.
• Built real-time data processing solutions in Databricks using Scala, ensuring accurate and timely data delivery.
• Automated infrastructure deployment using Terraform, enabling consistent and efficient environment provisioning.
• Implemented data partitioning strategies in Hive, enhancing performance of large-scale analytical queries.
• Developed Java-based data processing modules integrated within ETL pipelines to support the ingestion, cleansing, and transformation of customer and claims datasets.
• Developed conformed data models to integrate claims, customer, and provider data into a centralized lakehouse for analytics and compliance.
• Worked closely with data scientists and analysts to ensure data pipelines support machine learning workflows.
• Established data quality frameworks, integrating validation checks into data pipelines to ensure accuracy.
• Migrated legacy systems to Azure Synapse, reducing operational overhead and improving scalability.
• Configured and managed Hadoop clusters, ensuring optimal resource utilization and uptime.
• Automated repetitive data processing tasks with Apache Airflow, improving workflow efficiency and reducing manual intervention.
• Worked closely with analytics teams to identify patterns and relationships in large datasets, driving key operational insights.
• Integrated third-party APIs for real-time data enrichment and validation, improving data accuracy and compliance with financial regulations.
• Designed and developed interactive Tableau dashboards and reports to visualize complex datasets, providing actionable insights to business stakeholders and decision-makers.
• Designed data lake architectures to support unstructured and semi-structured data storage.
• Conducted performance tuning for Databricks notebooks, reducing execution times significantly.
• Provided detailed documentation for ETL workflows and processes, enabling knowledge transfer and maintenance.
• Implemented row-level security in Power BI, ensuring data confidentiality across user groups.
• Conducted POCs for new tools and technologies, recommending best-fit solutions for organizational needs.
• Ensured compliance with data governance policies by implementing auditing and encryption protocols.
United Healthcare Austin, TX
Data Engineer Mar 2020 – Oct 2022
• Designed and implemented an enterprise-grade Data Lake on AWS, supporting diverse use cases including scalable data storage, real-time processing, analytics, and reporting of large and dynamic datasets.
• Extracted data from multiple sources including Amazon S3, Redshift, and RDS, and built centralized metadata repositories using AWS Glue Crawlers and AWS Glue Data Catalog.
• Leveraged AWS Glue Crawlers to classify and catalog data from S3, enabling SQL-based analytics using Amazon Athena.
• Developed and optimized ETL pipelines using AWS Glue to ingest and transform data from external sources (e.g., S3, Parquet, CSV) into Amazon Redshift.
• Authored PySpark scripts within AWS Glue to merge datasets from various tables and automated cataloging with Glue Crawlers for metadata management.
• Implemented monitoring and observability for AWS Glue Jobs and Lambda functions using Amazon CloudWatch with custom metrics, alarms, logs, and automated notifications.
• Migrated on-premises applications to AWS, utilizing EC2 and S3 for data processing and storage, and maintained Hadoop clusters on Amazon EMR.
• Integrated HL7-compliant data exchange using Amazon API Gateway and AWS Lambda to securely transmit EHR messages between systems and ensure interoperability.
• Collaborated with business users to gather requirements and translate them into effective Tableau visualizations.
• Engineered real-time data pipelines using Amazon Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics, delivering processed data into S3, DynamoDB, and Redshift.
• Designed and developed scalable data pipelines to ingest, transform, and load data into Snowflake, optimizing warehouse performance and enabling advanced analytics and BI reporting.
• Utilized Python for data analysis, transformation, and reporting, employing libraries like Pandas and NumPy for efficient data manipulation.
• Developed interactive and visually compelling Power BI dashboards to support data-driven decision-making across business units.
• Ensured all protected health information (PHI) stored in Amazon S3 was encrypted in transit and at rest, with access controls managed via AWS IAM and audit logging enabled using AWS CloudTrail to maintain HIPAA compliance.
• Automated routine AWS infrastructure tasks, such as snapshot management and resource cleanup, using Python scripting.
• Installed and configured Apache Airflow and developed DAGs to orchestrate and automate workflows involving AWS S3 and other cloud-native services.
• Managed highly available production environments across multiple Kubernetes clusters, ensuring scalability and resilience.
• Deployed FHIR APIs for healthcare data access and sharing using Amazon API Gateway and AWS Lambda, ensuring structured, secure, and standards-based communication between healthcare applications.
• Optimized SQL query performance by analyzing execution plans, indexing, and rewriting inefficient queries.
• Utilized Agile project management tools such as Jira and Rally to manage user stories, track progress, and facilitate clear communication across teams.
• Built Spark applications for data validation, cleansing, transformation, and advanced aggregation, leveraging Spark SQL for in-depth analytics.
• Performed comprehensive data integrity checks using Hive, Hadoop, and Spark.
• Enhanced Hive query performance through partitioning, clustering, and use of optimized storage formats like Parquet.
Walmart Bentonville, AR
Data Engineer Jan 2018 – Feb 2020
• Designed and implemented scalable data ingestion pipelines to support Walmart’s enterprise-wide analytics solutions for supply chain, inventory, and customer behavior analysis.
• Automated end-to-end ETL/ELT processes to ingest, cleanse, and transform large volumes of retail data from disparate sources into structured formats for analytics.
• Built and managed data warehouses and data lakes on Google Cloud Platform (GCP), ensuring optimized storage, performance, and cost-effective scaling.
• Developed and maintained dbt models to apply Walmart-specific business logic, ensuring trusted, analytics-ready datasets across reporting layers.
• Wrote complex SQL queries for data extraction, cleansing, aggregation, and validation across retail, inventory, and transactional datasets spanning billions of records.
• Built and managed Snowflake data models using star and snowflake schemas to support scalable, analytics-ready datasets for business units.
• Built and supported real-time and batch processing pipelines using Dataflow, enabling timely insights on point-of-sale, logistics, and eCommerce activity.
• Leveraged Dataproc to process massive datasets using Spark, performing transformations, joins, and aggregations across historical transactional data.
• Orchestrated complex data workflows using Cloud Composer and Airflow, implementing retry logic, SLA monitoring, and alerting for critical pipelines.
• Developed dynamic Power BI dashboards visualizing real-time KPIs such as store performance, product availability, and online traffic conversion rates.
• Applied DevOps practices to support CI/CD pipelines and infrastructure-as-code for consistent deployment of data engineering environments.
• Queried and optimized petabyte-scale datasets in BigQuery, leveraging partitioning, clustering, and materialized views to reduce query cost and improve performance.
• Managed structured and unstructured datasets in Cloud Storage, integrating raw data layers with downstream tools like BigQuery and Looker.
• Used Pandas, NumPy, and PySpark for advanced data manipulation and large-scale processing within Walmart’s data lake and batch processing environments.
• Containerized key data services using Docker and deployed workloads via Kubernetes, ensuring Walmart’s data platforms remained scalable and resilient.
• Built telemetry data pipelines integrating with Kafka and Splunk, enabling real-time monitoring and log analytics across distributed systems.
• Optimized slow-performing queries by analyzing execution plans, creating indexes, and rewriting logic to meet SLA requirements in large-scale production systems.
• Designed and maintained MongoDB databases for semi-structured product catalogs and customer interaction data.
• Integrated RabbitMQ into asynchronous data flows for event-driven processing between internal services supporting pricing and promotions systems.
EDUCATION
Bachelor’s in Data Science, University of North Carolina at Charlotte
SKILLS
• Programming Languages: Python, Scala, Java, SQL, Bash
• Big Data & Processing Frameworks: Apache Spark, Apache Hadoop, Apache Beam, PySpark, Apache Hive, Kafka
• Cloud Platforms: AWS (S3, EMR, RDS), Google Cloud Platform (BigQuery, Dataflow, Pub/Sub, Cloud Composer), Microsoft Azure (Data Factory, Databricks, Synapse Analytics, Blob Storage)
• Data Warehousing & Databases: PostgreSQL, MySQL, SQL Server, Google BigQuery, Azure Synapse, Oracle, Snowflake, Databricks
• ETL & Workflow Orchestration: Apache Airflow, Cloud Composer, Azure Data Factory, SSIS, dbt
• DevOps & CI/CD: Jenkins, Git, Terraform, Docker, Azure DevOps
• Business Intelligence & Reporting: Power BI, Looker, Tableau
• Data Modeling & Design: Star/Snowflake Schema, Dimensional Modeling, Data Lake Architecture, Data Vault
• Tools & Technologies: Linux, Git, REST APIs
REFERENCES
Available upon request.