
Data Engineer (Real-Time)

Location:
United States
Salary:
$90,000
Posted:
September 10, 2025


Resume:

Srikrupa G

**********@*****.***

+1-219-***-****

Data Engineer | Azure, AWS, Snowflake | Real-Time & Batch Pipelines | PySpark, Kafka, Airflow | Scalable Cloud Data Architectures

Summary

Data Engineer with around 4 years of hands-on experience designing, building, and maintaining scalable data pipelines and cloud data platforms using Azure, AWS, and Snowflake. Specialized in real-time and batch data integration, stream processing, advanced SQL/PySpark transformations, and implementing data warehouse architectures. Strong background in operationalizing ELT frameworks, optimizing cloud costs, automating deployments via CI/CD, and delivering data solutions aligned with business-critical SLAs.

Technical Skills:

Cloud Platforms: Azure, AWS, Snowflake

Languages: Python, SQL, PySpark, Scala

Big Data & Streaming: Apache Spark, Kafka, Azure Stream Analytics, AWS Kinesis

Data Warehousing: Snowflake, Redshift, Synapse Analytics, Dimensional Modeling

Databases: Azure SQL, AWS RDS, PostgreSQL, MongoDB, Cosmos DB

DevOps/Tools: GitHub, Azure DevOps, Terraform, CloudWatch, Azure Monitor

Security/Governance: Azure Key Vault, AWS IAM, Snowflake RBAC, data masking

Visualization: Power BI, Tableau

Professional Experience

Citadel (Financial Services), Miami, FL

Data Engineer, Aug 2024 – Present

Designed and implemented end-to-end real-time streaming pipelines using Azure Event Hubs and AWS Kinesis for ingestion of large-scale trading, pricing, and order book data, ensuring millisecond-level data capture.

Built and orchestrated real-time trading data pipelines using Apache Airflow, automating DAGs to ingest from Azure Event Hubs and AWS Kinesis and process through Snowflake and Databricks.
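
A minimal sketch of this orchestration pattern, assuming hypothetical DAG and task names and a 5-minute schedule (the real ingestion and load logic is not shown here):

    # Minimal Airflow DAG sketch; DAG id, callables, and schedule are
    # illustrative assumptions, not the production pipeline.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest_event_batch(**context):
        # Placeholder: pull a micro-batch from the Event Hubs / Kinesis landing zone.
        pass

    def load_to_snowflake(**context):
        # Placeholder: run COPY INTO / trigger Snowpipe for the staged files.
        pass

    with DAG(
        dag_id="trading_data_ingest",  # hypothetical DAG id
        start_date=datetime(2024, 8, 1),
        schedule_interval=timedelta(minutes=5),
        catchup=False,
    ) as dag:
        ingest = PythonOperator(task_id="ingest", python_callable=ingest_event_batch)
        load = PythonOperator(task_id="load", python_callable=load_to_snowflake)
        ingest >> load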

Containerized PySpark jobs and Snowflake UDF environments using Docker to enable standardized, portable deployments across dev, QA, and production.

Deployed container-based streaming microservices and data APIs on Azure Kubernetes Service (AKS), using Helm charts and Kubernetes-native scaling for high availability.

Implemented Redis caching layers to serve low-latency reference data (security master, FX rates) for Spark Streaming jobs, reducing external lookups and improving pipeline performance.
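
A sketch of the cache-aside lookup this describes, assuming illustrative key names, a 5-minute TTL, and a stubbed external source:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)  # placeholder connection

    def fetch_fx_rate_from_source(pair: str) -> dict:
        # Stub standing in for the external reference-data lookup.
        return {"pair": pair, "rate": 1.0}

    def get_fx_rate(pair: str) -> dict:
        cached = r.get(f"fx:{pair}")
        if cached is not None:
            return json.loads(cached)  # cache hit: no external call needed
        rate = fetch_fx_rate_from_source(pair)
        r.setex(f"fx:{pair}", 300, json.dumps(rate))  # expire after 5 minutes
        return rate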

Developed Scala-based Spark applications in Databricks for millisecond-level transformations and parsing of high-frequency JSON and Avro message streams.

Developed Snowflake ELT pipelines using Snowpipe, Streams, and Tasks to automate ingestion from AWS S3, followed by incremental loading and transformation for analytics and compliance reports.
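
A sketch of the Streams + Tasks pattern behind this, issued through the Snowflake Python connector; credentials, object names, and the load statement are illustrative assumptions:

    import snowflake.connector

    conn = snowflake.connector.connect(account="...", user="...", password="...")  # placeholders
    cur = conn.cursor()

    # Stream captures new rows landed by Snowpipe into the raw table.
    cur.execute("CREATE STREAM IF NOT EXISTS raw_trades_stream ON TABLE raw_trades")

    # Task periodically drains the stream into the curated table.
    cur.execute("""
    CREATE TASK IF NOT EXISTS load_trades
      WAREHOUSE = etl_wh
      SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW_TRADES_STREAM')
    AS
      INSERT INTO curated_trades
      SELECT * FROM raw_trades_stream
    """)
    cur.execute("ALTER TASK load_trades RESUME")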

Configured Snowflake clustering keys, micro-partitioning strategies, and query result caching to improve query performance for reporting dashboards used by traders and risk managers.

Integrated financial and operational datasets from APIs, FTP servers, Azure SQL, and RDS into Snowflake using Python loaders and automated Glue workflows for scalable batch ingestion.

Designed fault-tolerant data pipelines with retry logic, error tables, and dead-letter queues to ensure no data loss or processing failure during outages or schema drift.

Tuned Spark configurations (e.g., parallelism, caching, broadcast joins) in Databricks jobs to handle 100M+ daily records within defined SLA.
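
A sketch of the kind of tuning mentioned: shuffle parallelism, caching, and an explicit broadcast join; partition counts, paths, and column names are illustrative assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (
        SparkSession.builder.appName("trade-enrichment")
        .config("spark.sql.shuffle.partitions", "400")  # tuned to cluster size
        .getOrCreate()
    )

    trades = spark.read.parquet("s3://bucket/trades/")          # hypothetical path
    security_master = spark.read.parquet("s3://bucket/secmaster/")

    trades.cache()  # reused across multiple downstream aggregations

    # Broadcasting the small dimension avoids shuffling the large fact table.
    enriched = trades.join(broadcast(security_master), "security_id")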

Created and managed Azure Data Factory pipelines to orchestrate batch and daily feeds into both Azure Synapse and Snowflake, ensuring appropriate load windows and backfills.

Developed PySpark data validation scripts for quality checks such as null checks, range validations, and referential integrity between fact and dimension tables.
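
A sketch of such checks, assuming hypothetical column names; the referential-integrity check uses a left anti-join to count orphaned fact rows:

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def run_quality_checks(fact: DataFrame, dim: DataFrame) -> dict:
        # Null check on the join key.
        null_keys = fact.filter(F.col("security_id").isNull()).count()
        # Range validation on a numeric measure (bounds are illustrative).
        bad_prices = fact.filter((F.col("price") <= 0) | (F.col("price") > 1e6)).count()
        # Referential integrity: fact keys with no matching dimension row.
        orphans = fact.join(dim, "security_id", "left_anti").count()
        return {"null_keys": null_keys, "bad_prices": bad_prices, "orphans": orphans}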

Implemented RBAC policies in Snowflake and Azure using roles, masking policies, and secure views to control access to sensitive trading data.

Metrolinx (via Infosys), Toronto, ON

Data Engineer, May 2022 – July 2023

Created near-real-time IoT data pipelines to capture bus and train GPS signals, trip timings, and fare transactions using Azure Stream Analytics and AWS Kinesis Firehose.

Engineered Snowflake data warehouse architecture using a hybrid star schema to store transportation KPIs, ridership statistics, and historical travel patterns.

Automated ingestion of S3-based CSV and JSON transit updates into Snowflake using Snowpipe with error handling and alerting through Lambda and SNS.
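
A sketch of the alerting side of this: a Lambda handler that forwards a Snowpipe load-failure event to SNS; the topic ARN and event shape are illustrative assumptions:

    import json
    import boto3

    sns = boto3.client("sns")
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:snowpipe-alerts"  # placeholder

    def handler(event, context):
        # Assumes the event carries file name and error details from the
        # pipe's error notification integration.
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Snowpipe load failure",
            Message=json.dumps(event),
        )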

Automated near-real-time transportation data pipelines using Apache Airflow to orchestrate Kinesis Firehose, Snowflake Snowpipe, and PySpark transformations.

Packaged Python-based ETL tools and validation scripts into Docker containers for consistent execution in cloud-based and on-prem environments.

Used Redis as a transient data store to aggregate live trip data and power dynamic dashboard updates on vehicle arrival predictions.

Enhanced PySpark streaming pipelines with Scala UDFs to improve performance for time-window aggregations and metadata enrichment logic.

Built PySpark transformation jobs to cleanse, filter, and enrich streaming data with route metadata and time-window aggregations for late-arrival prediction models.

Implemented CDC-based incremental loading in Snowflake using Streams and Tasks to track data changes from PostgreSQL and RDS sources.

Developed reusable SQL templates for fact table loading with surrogate key generation and slowly changing dimensions (SCD Type 2) in Snowflake.
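
A sketch of one such template, parameterized by table name and rendered before execution via the Snowflake connector; the key and hash columns are illustrative assumptions. The MERGE closes the current version of a changed row, and the INSERT adds the new version:

    # Rendered with .format(dim=..., staging=...) and executed via cursor.execute().
    SCD2_CLOSE_TEMPLATE = """
    MERGE INTO {dim} d
    USING {staging} s
      ON d.business_key = s.business_key AND d.is_current = TRUE
    WHEN MATCHED AND d.attr_hash <> s.attr_hash THEN
      UPDATE SET d.is_current = FALSE, d.valid_to = CURRENT_TIMESTAMP()
    """

    SCD2_INSERT_TEMPLATE = """
    INSERT INTO {dim} (business_key, attr_hash, valid_from, valid_to, is_current)
    SELECT s.business_key, s.attr_hash, CURRENT_TIMESTAMP(), NULL, TRUE
    FROM {staging} s
    LEFT JOIN {dim} d
      ON d.business_key = s.business_key AND d.is_current = TRUE
    WHERE d.business_key IS NULL
    """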

Managed schema drift and source variability by building metadata-driven ingestion frameworks with dynamic schema registration in Glue Catalog and Snowflake.

Deployed and monitored Glue workflows and Azure Data Factory pipelines for daily data pulls and quality rule enforcement across structured and semi-structured datasets.

Tuned long-running queries and large table joins using Snowflake’s clustering keys, materialized views, and query profiling tools.

Created dynamic dashboards in Power BI and QuickSight for operations managers to track vehicle delays, route occupancy, and on-time performance.

Built automated quality checks in Python to compare record counts, missing fields, and partition mismatches between landing zones and Snowflake tables.

Infosys, Bengaluru, India

Data Engineer Intern, July 2021 – May 2022

Developed scalable ETL frameworks to ingest application logs, user activity data, and IoT telemetry into Azure SQL and Cosmos DB using Azure Data Factory.

Built Snowflake data models to centralize customer behavioral data, combining batch uploads and Snowpipe-based near real-time tracking logs.

Managed batch and streaming pipelines using Apache Airflow for scheduled ingestion from SFTP, APIs, and databases into Snowflake and Cosmos DB.

Supported client’s Kubernetes-based data platforms by developing container specs and scaling rules for ETL workloads and API connectors.

Integrated Redis into data reconciliation frameworks to track in-flight record counts and detect load mismatches in near real-time.

Wrote Scala-based batch Spark jobs in Databricks to process application logs and user activity data at scale with optimized join and partitioning strategies.

Created a library of stored procedures in Snowflake to implement deduplication, normalization, and exception logging logic for product usage analytics.

Implemented Spark jobs in Databricks to perform batch enrichment joins across streaming event logs and historical customer data.

Used time-travel and zero-copy cloning features in Snowflake for dev/test refresh and disaster recovery validations.
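
A sketch of what such a refresh can look like through the Python connector; database, table names, and the one-hour offset are illustrative assumptions:

    import snowflake.connector

    conn = snowflake.connector.connect(account="...", user="...", password="...")  # placeholders
    cur = conn.cursor()

    # Zero-copy clone for a dev/test refresh: no data is physically duplicated.
    cur.execute("CREATE OR REPLACE DATABASE dev_analytics CLONE prod_analytics")

    # Time Travel: recreate a table as it existed one hour ago.
    cur.execute("CREATE OR REPLACE TABLE events_restored CLONE events AT(OFFSET => -3600)")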

Created schema validation scripts to detect breaking changes in upstream sources before data was committed to production tables.

Built and automated a reconciliation framework in Snowflake to detect and alert on mismatched totals and record discrepancies across sources.

Designed visualizations in Power BI over Snowflake models to display KPIs on product activation, churn rate, and feature adoption.

Wrote test cases using PyTest to validate each transformation and perform row-level comparisons post-load in Snowflake.
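
A sketch of such tests, with placeholder loaders standing in for the real landing-zone extract and Snowflake query; file paths and the key column are illustrative assumptions:

    import pandas as pd

    def load_source_extract() -> pd.DataFrame:
        # Placeholder: the real tests read from the landing zone.
        return pd.read_csv("landing/orders.csv")

    def load_snowflake_table() -> pd.DataFrame:
        # Placeholder: the real tests queried Snowflake via the connector.
        return pd.read_csv("exports/orders_loaded.csv")

    def test_row_counts_match():
        assert len(load_source_extract()) == len(load_snowflake_table())

    def test_rows_match_on_key():
        source = load_source_extract().set_index("order_id").sort_index()
        target = load_snowflake_table().set_index("order_id").sort_index()
        pd.testing.assert_frame_equal(source, target, check_dtype=False)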

HCL, Vijayawada, India

Data Engineer Intern, July 2020 – May 2021

Assisted in building ADF pipelines for scheduled ingestion of Excel, CSV, and XML files into Azure SQL DB and Snowflake staging tables.

Participated in refactoring SQL logic into modular stored procedures and views for use in Snowflake reporting layers.

Conducted basic data validation and reconciliation tests for nightly batch loads, comparing staging and final Snowflake tables.

Wrote simple Python scripts to load flat files from on-prem directories into S3 buckets and trigger Snowpipe ingestion.
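
A sketch of that loader, assuming an auto-ingest Snowpipe wired to the bucket's S3 event notification; bucket name and paths are illustrative:

    import os
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "transit-landing-zone"  # placeholder bucket name

    def upload_directory(local_dir: str, prefix: str) -> None:
        for name in os.listdir(local_dir):
            path = os.path.join(local_dir, name)
            if os.path.isfile(path):
                # The S3 event notification triggers the auto-ingest Snowpipe.
                s3.upload_file(path, BUCKET, f"{prefix}/{name}")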

Contributed to Power BI dashboards built on top of Snowflake tables displaying basic operational KPIs and data availability metrics.

Participated in daily Agile scrums and contributed test case reviews and SQL code reviews.

Education

Master's in Information Technology, Valparaiso University

Bachelor of Science (B.Sc.) in Maths, Statistics, and Computer Science


