
Data Engineer Real-Time

Location:
Denton, TX
Salary:
90,000
Posted:
July 28, 2025


Resume:

Hema Anumala

Data Engineer

Email: *************@*****.*** Mobile: +1-940-***-****

LinkedIn: https://www.linkedin.com/in/anumalahema/

Professional Summary

Data Engineer with 3+ years of end-to-end pipeline ownership, designing, building, and maintaining distributed data systems that serve analytics, ML, and real-time applications.

Fluent in Python, SQL, and Scala, leveraging Pandas, PySpark, and dbt to transform terabyte-scale datasets in both batch and streaming workloads.

Hands-on expertise in cloud data platforms — AWS (S3, Glue, Redshift), Azure (Data Factory, Synapse), and GCP (BigQuery, Dataflow) — with cost-optimization best practices.

Proven track record implementing Apache Spark and Kafka clusters, achieving sub-second latency for high-volume event pipelines (>50K messages/sec).
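
A minimal sketch of such a Kafka-fed Spark Structured Streaming pipeline, for illustration only; the broker address, topic name, and event schema below are hypothetical placeholders, not details from this resume.

# Minimal Spark Structured Streaming consumer for a Kafka event topic.
# Requires the spark-sql-kafka connector package on the Spark classpath.
# Broker, topic, and schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("event-pipeline-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_ts", LongType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                       # placeholder topic
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("console")       # swap for a real sink (Delta, warehouse staging, etc.)
         .outputMode("append")
         .trigger(processingTime="1 second")
         .start())

query.awaitTermination()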

Designed Star & Snowflake schemas and data vault models in cloud warehouses, boosting query performance for BI teams by 40%.

Automated ELT orchestration with Apache Airflow and Prefect, driving >95% SLA adherence across 200+ daily DAG runs.

Champion of DevOps & IaC: Terraform, Docker, GitLab CI/CD; cut deployment times from hours to minutes while enforcing reproducibility.

Implemented data quality & observability (Great Expectations, Monte Carlo), reducing production incidents by 70%.

Collaborated with data scientists to deliver feature stores and online lookup services, accelerating model deployment cycles by 30%.

Enforced governance & security (row-level policies, GDPR, HIPAA, KMS encryption) to satisfy internal audits and external compliance.

Experienced with real-time analytics (Kinesis, Kafka Streams, Flink) enabling live dashboards and anomaly detection.

Adept in stakeholder communication, translating complex technical details into clear business value for Product, Finance, and Executive teams.

Practitioner of Agile/Scrum and DataOps, driving backlog grooming, sprint planning, and continuous integration of analytics deliverables.

TECHNICAL SKILLS:

Languages: Python, SQL, Scala, Bash

Frameworks & Engines: Apache Spark, PySpark, Kafka, Flink, dbt, Airflow, Prefect

Cloud & Warehouses: AWS (S3, Glue, Redshift, EMR, Kinesis), Azure (Data Factory, Synapse), GCP (BigQuery, Dataflow, Pub/Sub)

Databases: PostgreSQL, MySQL, Snowflake, DynamoDB, MongoDB

DevOps & IaC: Terraform, Docker, Kubernetes (basic), GitLab CI/CD, Jenkins

Monitoring & Quality: Great Expectations, Datadog, Prometheus, Monte Carlo, Grafana

BI & Visualization: Tableau, Power BI, Looker, Superset

Methodologies: Data Modeling, ETL/ELT, Streaming, Data Governance, Agile/Scrum, DataOps

EXPERIENCE

Client: IBM, USA Duration: Jan 2025 - Present

Role: Data Engineer

Architected a real-time transaction pipeline on AWS (Kinesis → Flink → Redshift) processing 2B events/day with <5-second end-to-end latency.
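
A sketch of the ingestion edge of such a Kinesis pipeline: publishing JSON transaction events with boto3. The stream name, region, and payload fields are illustrative assumptions, not details from the project.

# Publish one transaction event to a Kinesis stream with boto3.
# Stream name, region, and payload shape are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_transaction(txn: dict) -> None:
    """Write one event; PartitionKey spreads records across shards."""
    kinesis.put_record(
        StreamName="transactions",           # placeholder stream name
        Data=json.dumps(txn).encode("utf-8"),
        PartitionKey=str(txn["account_id"]),
    )

publish_transaction({"account_id": 42, "amount": 19.99, "currency": "USD"})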

Led migration from on-prem Hadoop to Databricks on AWS, cutting cluster costs 25% via autoscaling and spot instances.

Built 150+ dbt models with documented lineage; reduced analyst onboarding time by 60%.

Implemented Great Expectations suites and Airflow operators, catching 98% of schema drifts before they hit production tables.
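
A minimal sketch of the kind of schema-drift check such suites encode. It uses plain pandas rather than the Great Expectations API, and the expected column list is hypothetical.

# Plain-pandas schema-drift check of the sort a data-quality suite codifies
# and an Airflow task would run before loading. Expected schema is hypothetical.
import pandas as pd

EXPECTED = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of drift problems; an empty list means the frame conforms."""
    problems = []
    for column, dtype in EXPECTED.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    extra = set(df.columns) - EXPECTED.keys()
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    return problems

df = pd.DataFrame({"order_id": [1], "customer_id": [7], "amount": [9.5]})
assert check_schema(df) == []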

Introduced Terraform modules for S3 bucket policies, IAM roles, and Redshift provisioning, enabling one-click environment replication.

Developed Python micro-service exposing a feature store API backed by DynamoDB & Lambda, powering four ML models in production.
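
A hedged sketch of a Lambda handler for an online feature lookup backed by DynamoDB; the table name, key schema, and response shape are hypothetical.

# Lambda handler serving online feature lookups from a DynamoDB table.
# Table name, key, and event shape (API Gateway proxy) are assumptions.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("feature-store")          # placeholder table name

def handler(event, context):
    entity_id = event["pathParameters"]["entity_id"]
    item = table.get_item(Key={"entity_id": entity_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    # default=str handles DynamoDB Decimal values during serialization
    return {"statusCode": 200, "body": json.dumps(item, default=str)}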

Tuned Spark jobs (partitioning, broadcast joins) to shave nightly batch duration from 4 h to 45 min.
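
A short PySpark sketch of the two tuning techniques named above: repartitioning the large fact table on its join key and broadcasting the small dimension table. Paths and column names are hypothetical.

# Repartition the large side on the join key and broadcast the small side
# to avoid shuffling the fact table. Paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("batch-tuning-sketch").getOrCreate()

facts = spark.read.parquet("s3://bucket/facts/")    # large table (placeholder path)
dims = spark.read.parquet("s3://bucket/dims/")      # small lookup table

joined = (facts
          .repartition(200, "customer_id")          # partition count tuned per cluster
          .join(broadcast(dims), "customer_id"))    # broadcast join skips the big shuffle

joined.write.mode("overwrite").parquet("s3://bucket/output/")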

Partnered with InfoSec to encrypt PII using AWS KMS & Lake Formation tags, achieving SOC 2 Type II compliance.

Created observability dashboards in Datadog for DAG runtimes, memory usage, and throughput, reducing MTTR by 40%.

Piloted CDC ingest with Debezium + Kafka Connect, keeping Redshift replica <5 s behind source OLTP systems.
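
A sketch of registering a Debezium source connector through the Kafka Connect REST API from Python. The host, credentials, table list, and exact property names (which differ across Debezium versions) are assumptions to verify against the Debezium documentation.

# Register a Debezium Postgres source connector via the Kafka Connect REST API.
# Connection details and property names are illustrative assumptions.
import requests

connector = {
    "name": "orders-cdc",                            # placeholder connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "oltp-host",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "table.include.list": "public.orders",
        "topic.prefix": "oltp",                      # "database.server.name" on older Debezium
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()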

Deployed CI/CD pipelines (GitLab) with automatic unit tests, linting, and Blue-Green deployments for Airflow and dbt projects.

Collaborated with Product to define SLAs/SLOs for data products, aligning uptime metrics with OKRs.

Drove quarterly capacity planning, projecting storage & compute growth to inform a $1.2M annual budget.

Client: Cognizant, India Duration: Feb 2022 - June 2023

Role: Data Engineer

Designed ELT workflows in Azure Data Factory to ingest 30+ vendor feeds into Synapse, enabling unified merchandising analytics.

Built Kafka Streams application in Scala for clickstream enrichment, powering personalization models with <1-min freshness.

Modeled a Snowflake data vault and BI-friendly star schemas, reducing Tableau dashboard load times by 50%.

Automated inventory demand forecasts by orchestrating PySpark jobs in Databricks, improving accuracy 12% YoY.

Instituted row-level security and dynamic masking in Synapse to meet GDPR requirements across EU markets.

Led a POC for a serverless Fivetran + dbt Cloud stack; demonstrated a 35% TCO reduction vs. existing ETL scripts.

Embedded unit & integration tests (pytest, sqlfluff) into Azure DevOps pipelines, raising deployment success to 99%.
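
A minimal pytest-style sketch of the kind of unit test wired into such a pipeline; the transformation helper under test is hypothetical, not code from the project.

# Unit test of a small transformation helper, the kind of check run in CI
# before deployment. The helper and column names are hypothetical.
import pandas as pd

def add_order_value(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a line-level order value column."""
    out = df.copy()
    out["order_value"] = out["quantity"] * out["unit_price"]
    return out

def test_add_order_value():
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [5.0, 1.5]})
    result = add_order_value(df)
    assert result["order_value"].tolist() == [10.0, 4.5]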

Coordinated with the BI team to migrate 200+ legacy reports to Looker, introducing governed explores and modeling best practices.

Implemented S3/Snowflake external tables for semi-structured JSON, eliminating ETL lag on product reviews data.

Conducted quarterly data lineage mapping workshops, increasing documentation coverage to 85%.

Served as Scrum master for the Data Platform squad, accelerating sprint velocity by 20%.

Organized internal brown-bag sessions on Spark tuning and cost-aware data partitioning.

Client: HDFC Bank, India Duration: Jan 2021 - Feb 2022

Role: Junior Data Engineer

Assisted in building batch ETL pipelines (Python + SQL) that loaded CRM, Salesforce, and marketing data into Redshift.

Developed Airflow DAGs with SLA monitors, raising pipeline reliability from 88% to 97%.
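
A minimal Airflow 2.x sketch of attaching an SLA to a task and a miss callback to the DAG; the DAG id, schedule, and alert function are hypothetical.

# Airflow 2.x DAG with a per-task SLA and an SLA-miss callback.
# DAG id, schedule, and the alerting logic are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    print(f"SLA missed for: {task_list}")    # stand-in for a real email/Slack alert

def load_crm_data():
    print("loading CRM extract")             # stand-in for the real load step

with DAG(
    dag_id="crm_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    sla_miss_callback=notify_sla_miss,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_crm",
        python_callable=load_crm_data,
        sla=timedelta(hours=1),               # flag runs that exceed one hour
    )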

Wrote optimized SQL UDFs and window functions to support ad-hoc analyst queries.

Maintained Hive tables on EMR, enforcing table partitioning and compression to cut storage 30%.

Deployed Dockerized Spark jobs via Jenkins, standardizing build & release workflows.

Created Grafana dashboards tracking job failures and cluster utilization, reducing over-provisioning by 15%.

Implemented basic data masking for PII fields in staging environments to satisfy HIPAA guidelines for healthcare clients.

Collaborated with data scientists to prototype a churn model, supplying feature engineering pipelines that trimmed training time by half.

Authored detailed runbooks and wiki pages, improving team onboarding experience and knowledge retention.

EDUCATION

Master's in Information Systems and Technologies - University of North Texas (2025)

Bachelor's in Information Technology - V R Siddhartha Engineering College (2022)

CERTIFICATIONS

Microsoft Azure Fundamentals

Databricks Data Engineering Associate


