Data Engineer

Location:
Lubbock, TX
Salary:
$90,000
Posted:
September 10, 2025

Resume:

Venkata Teja Kumar Gurram

Data Engineer

Email: *****************@*****.***

Mobile: 405-***-****

LinkedIn: www.linkedin.com/in/venkata-gurram-7a56b7200

PROFESSIONAL SUMMARY

Data Engineer with over 5 years of experience in designing, building, and managing robust data pipelines using tools like Apache Spark, Airflow, and Python across multiple domains.

Proficient in developing scalable ETL solutions using Apache Spark, Scala, and SQL to process structured and semi-structured data from various sources for analytics and reporting.

Hands-on expertise in cloud technologies such as AWS (S3, EMR, Redshift), GCP (BigQuery, Dataflow), and Azure Data Factory to architect end-to-end data solutions.

Built and optimized complex data warehouse models using Snowflake, Amazon Redshift, and Google BigQuery to support high-performance analytics and BI workloads.

Strong background in data modeling, designing star and snowflake schemas, and implementing partitioning and clustering strategies for large datasets to optimize query performance.

Experience managing data lakes and ingesting streaming and batch data using Kafka, Apache Flink, and AWS Glue, supporting real-time data applications.

Proficient in SQL, Python, and Scala for data transformation, cleansing, and processing, ensuring high-quality data pipelines across cloud and on-prem environments.

Skilled in implementing CI/CD pipelines using Git, Jenkins, and Terraform for infrastructure as code and streamlined deployment of data engineering projects.

Built reusable ETL/ELT components using Apache Airflow, implementing DAGs for scheduled data transformations with robust logging, retry logic, and failure notifications.
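As a minimal illustration of that pattern, the sketch below shows an Airflow DAG with retries, logging, and a failure callback; the DAG name, schedule, and notification hook are placeholder assumptions, not production code.

```python
from datetime import datetime, timedelta
import logging

from airflow import DAG
from airflow.operators.python import PythonOperator

log = logging.getLogger(__name__)

def notify_on_failure(context):
    # Hypothetical alerting hook; in practice this might post to Slack or PagerDuty.
    log.error("Task %s failed for run %s", context["task_instance"].task_id, context["run_id"])

def transform_daily_sales(**context):
    # Placeholder transformation step; real logic would read, transform, and load data.
    log.info("Running transformation for %s", context["ds"])

default_args = {
    "owner": "data-engineering",
    "retries": 3,                           # retry logic
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_on_failure,
}

with DAG(
    dag_id="daily_sales_transform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",          # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="transform_daily_sales",
        python_callable=transform_daily_sales,
    )
```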

Experience with both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Cassandra) databases, with a focus on optimizing data access and storage strategies.

Applied principles of data governance, including data lineage, cataloging, masking, and role-based access controls, ensuring secure and compliant data systems.

Proven expertise in performance tuning SQL queries, optimizing Spark jobs, and reducing data processing costs in cloud environments by over 30%.

Collaborated with cross-functional teams, including data scientists, analysts, and business stakeholders to understand data requirements and build scalable solutions.

Developed and maintained data quality frameworks using tools like Great Expectations, ensuring accuracy, completeness, and timeliness of critical data.

Automated data validation and pipeline monitoring using Python, Airflow, and custom alerting mechanisms, improving incident response time by over 40%.

Strong problem-solving abilities in diagnosing data issues, debugging data pipelines, and resolving anomalies in large, distributed systems.

Adept at communicating complex technical ideas to non-technical stakeholders, bridging gaps between data engineering, analytics, and business teams.

Holds AWS Certified Data Analytics, Google Professional Data Engineer, and Snowflake SnowPro Core certifications, demonstrating expertise across platforms.

TECHNICAL SKILLS

Programming Languages: Python, Java, Scala, SQL

Big Data Technologies: Apache Spark, Apache Flink, Kafka, Hadoop

ETL Tools: Apache Airflow, AWS Glue, NiFi, Informatica

Data Warehousing: Snowflake, Amazon Redshift, Google BigQuery, Teradata

Cloud Platforms: AWS, GCP, Azure

Databases: PostgreSQL, MySQL, MongoDB, Cassandra

DevOps & CI/CD: Git, Jenkins, Terraform, Docker

Monitoring & Logging: AWS CloudWatch, Stackdriver, Datadog, Prometheus

Data Governance & Security: RBAC, Data Masking, KMS, IAM, Audit Logging

Visualization Tools: Tableau, Power BI, QuickSight, Looker

Others: Apache Hive, Athena, Linux, SFTP, Confluence, Jira

PROFESSIONAL EXPERIENCE

Client: Citibank – Data Engineer May 2023 – Present

Built and maintained distributed data pipelines using Apache Spark, Scala, and Python on Hadoop, ingesting transactional, trade, and credit risk data from multiple global business systems.

Migrated complex on-prem Hadoop workflows to Google Cloud Platform (GCP) using Cloud Dataflow, BigQuery, and Cloud Storage, enabling scalable, cost-effective batch and streaming processing.

Developed near real-time feature engineering pipelines for fraud detection models using PySpark, Kafka, and S3, ensuring rapid ingestion and transformation of live banking events.
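A simplified sketch of such a streaming feature pipeline in PySpark Structured Streaming; the broker address, topic name, event schema, and S3 paths are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fraud-feature-stream").getOrCreate()

# Hypothetical schema for incoming banking events
event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "banking-events")               # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Simple windowed features: transaction count and amount per account over 10 minutes
features = (
    events
    .withWatermark("event_time", "15 minutes")
    .groupBy(F.window("event_time", "10 minutes"), "account_id")
    .agg(F.count("*").alias("txn_count"), F.sum("amount").alias("txn_amount"))
)

query = (
    features.writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/features/")                 # placeholder S3 path
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/features/")
    .outputMode("append")
    .start()
)
```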

Created custom Airflow DAGs in Cloud Composer to orchestrate ETL flows across systems like Oracle, SFTP, and Kafka, enabling efficient scheduling, dependency handling, and alerting.

Designed financial data models in BigQuery, building well-structured layers (raw, staging, curated) and implementing partitioning, clustering, and materialized views for optimal performance.
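An illustrative example of a partitioned, clustered curated-layer table and a materialized view, issued through the google-cloud-bigquery client; the project, dataset, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative DDL for a curated-layer table, partitioned by date and clustered by account
ddl = """
CREATE TABLE IF NOT EXISTS `my-project.curated.transactions`
PARTITION BY DATE(transaction_ts)
CLUSTER BY account_id, product_code
AS
SELECT account_id, product_code, transaction_ts, amount
FROM `my-project.staging.transactions`
"""
client.query(ddl).result()

# Materialized view over the curated layer for a common aggregation
mv = """
CREATE MATERIALIZED VIEW IF NOT EXISTS `my-project.curated.daily_exposure` AS
SELECT account_id, DATE(transaction_ts) AS txn_date, SUM(amount) AS total_amount
FROM `my-project.curated.transactions`
GROUP BY account_id, txn_date
"""
client.query(mv).result()
```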

Implemented robust data masking, tokenization, and encryption-at-rest strategies in Snowflake and BigQuery, securing PII and aligning all pipelines with internal security standards.

Built reusable data ingestion modules using Python, PySpark, and SQLAlchemy, allowing onboarding of new data sources with minimal additional engineering effort.

Integrated pipeline monitoring using Stackdriver, Pub/Sub, and custom alerting mechanisms, allowing rapid detection and escalation of failures across the data infrastructure.

Worked closely with financial compliance and audit teams to maintain data lineage, using tools like Apache Atlas and custom metadata tracking frameworks across ingestion layers.

Delivered curated BigQuery datasets for credit exposure and capital adequacy dashboards, supporting risk analysts with clean, trusted, and timely financial reporting data.

Created RBAC policies and column-level access control in Snowflake and GCP IAM, ensuring strict governance over who could view or query sensitive financial datasets.

Automated schema evolution handling using Python and metadata comparison logic, reducing manual intervention when source data structures changed from upstream banking systems.
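A minimal sketch of that metadata-comparison idea using SQLAlchemy inspection; the connection strings, table name, and the simplified nullable TEXT column type are assumptions.

```python
from sqlalchemy import create_engine, inspect, text

# Placeholder connection strings
source = create_engine("postgresql://user:pass@source-host/bank")
target = create_engine("postgresql://user:pass@warehouse-host/analytics")

def detect_new_columns(table: str) -> list:
    """Compare source and target schemas and return columns missing in the target."""
    src_cols = {c["name"]: c for c in inspect(source).get_columns(table)}
    tgt_cols = {c["name"] for c in inspect(target).get_columns(table)}
    return [col for name, col in src_cols.items() if name not in tgt_cols]

def apply_new_columns(table: str) -> None:
    """Add any new source columns to the target table as nullable TEXT (simplified)."""
    with target.begin() as conn:
        for col in detect_new_columns(table):
            conn.execute(text(f'ALTER TABLE {table} ADD COLUMN "{col["name"]}" TEXT'))

apply_new_columns("transactions")
```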

Wrote complex SQL transformations in BigQuery to denormalize transactional and portfolio data, enabling analytics teams to build dashboards in Looker and Power BI.

Tuned Spark jobs by optimizing shuffle partitions, memory configurations, and data formats like Parquet, reducing resource consumption while maintaining reliability under high data loads.
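The kind of tuning described might look like the following PySpark session setup; the specific values are illustrative and would be sized to the actual cluster and data volume.

```python
from pyspark.sql import SparkSession

# Illustrative tuning settings for a batch job over Parquet data
spark = (
    SparkSession.builder
    .appName("portfolio-batch")
    .config("spark.sql.shuffle.partitions", "400")        # match shuffle width to data size
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.sql.files.maxPartitionBytes", "256m")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/raw/trades/")   # placeholder columnar input

# Repartition on the join key before a heavy shuffle to reduce skew
trades = df.repartition(400, "account_id")
trades.write.mode("overwrite").parquet("s3a://example-bucket/curated/trades/")
```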

Participated in regular security and architecture reviews, providing input on data encryption standards, access review procedures, and secure transport mechanisms for sensitive banking data.

Maintained internal documentation on pipeline logic, environment configuration, data dictionaries, and ingestion workflows using Confluence and GitHub, improving onboarding and collaboration.

Client: Geisinger Health System – Data Engineer July 2021 – Aug 2022

Built secure ETL pipelines using Apache Airflow and Python to ingest and process patient data from multiple hospital systems including EHR, lab systems, and external pharmacy feeds.

Modeled healthcare data into Snowflake warehouse using star and snowflake schemas, enabling BI teams to analyze clinical outcomes, billing, and patient demographics with high efficiency.

Implemented data masking, field-level encryption, and pseudonymization using Snowflake security features to ensure HIPAA compliance while still enabling downstream analysis and machine learning.

Created data ingestion frameworks using Python, SQLAlchemy, and Airflow, automating data collection from SFTP, APIs, and HL7 sources into staging and curated zones.

Designed data quality checks using Great Expectations, verifying column ranges, null counts, duplicates, and row consistency before loading critical data into reporting environments.
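A small example of such checks using the legacy pandas-style Great Expectations API; the dataset path, column names, and thresholds are placeholders.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical extract of a lab-results feed staged as a DataFrame
df = pd.read_parquet("staging/lab_results.parquet")
batch = ge.from_pandas(df)   # legacy pandas-dataset style API

# Column-level checks of the kind described above: nulls, ranges, duplicates
batch.expect_column_values_to_not_be_null("patient_id")
batch.expect_column_values_to_be_between("result_value", min_value=0, max_value=10000)
batch.expect_compound_columns_to_be_unique(["patient_id", "test_code", "collected_at"])

results = batch.validate()
if not results.success:
    raise ValueError("Data quality checks failed; blocking load to reporting schema")
```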

Collaborated with clinical teams to map ICD-10, CPT, and lab code systems to standardized formats for easier use in dashboards, reporting, and predictive analytics.

Monitored daily batch pipelines using Airflow, Slack alerts, and custom Python logging, ensuring all ETL tasks completed and resolving any data delays promptly.

Developed reusable SQL templates and Python modules to standardize transformations across chronic condition cohorts, inpatient, outpatient, and emergency room visits.

Built dashboards and summary tables in Snowflake to track COVID-19 testing, vaccination rates, and hospitalization trends using curated and validated data.

Optimized Snowflake performance by applying clustering, using appropriate warehouse sizing, and designing schemas that align with query patterns from BI teams.

Partnered with internal compliance teams to maintain a data catalog using Snowflake information schema and external tools, documenting all production data assets.

Wrote complex SQL queries for clinical researchers to extract longitudinal data on patients, treatments, and diagnosis codes to support research projects and data studies.

Enabled self-service analytics by preparing trusted datasets with clearly defined metrics, dimensions, and filters in Snowflake accessible through Tableau and Power BI.

Automated backup and archival of critical datasets using Snowflake Tasks, Streams, and Time Travel, supporting rollback, recovery, and historical analysis.
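An illustrative sequence of the Snowflake Streams, Tasks, and Time Travel features mentioned, run through the Snowflake Python connector; the object names, schedule, and credentials are placeholders.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="etl_user", password="***",   # placeholder credentials
    warehouse="ETL_WH", database="CLINICAL", schema="CURATED",
)
cur = conn.cursor()

# Capture changes on a critical table via a stream
cur.execute("CREATE STREAM IF NOT EXISTS encounters_stream ON TABLE encounters")

# Nightly task that archives the captured changes
cur.execute("""
CREATE TASK IF NOT EXISTS archive_encounters
  WAREHOUSE = ETL_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO encounters_archive SELECT * FROM encounters_stream
""")
cur.execute("ALTER TASK archive_encounters RESUME")

# Time Travel lets a bad load be rolled back to an earlier point in time
cur.execute("""
CREATE OR REPLACE TABLE encounters_restore CLONE encounters
  AT (TIMESTAMP => DATEADD('hour', -6, CURRENT_TIMESTAMP()))
""")
```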

Participated in regular architecture meetings to review data security, access controls, audit logging, and encryption best practices for clinical data systems.

Provided training sessions and documentation on Airflow, Snowflake, and Python-based data pipelines, supporting team onboarding and process standardization.

Client: AT&T – Data Engineer Jan 2020 – July 2021

Developed and maintained robust ETL pipelines using Apache Spark and Python, processing call detail records and internet usage logs into a centralized data lake for advanced network analytics.

Designed end-to-end data ingestion frameworks using Apache Kafka and AWS Glue, enabling near real-time streaming and batch processing of customer activity logs and service events.

Orchestrated data workflows using Apache Airflow, automating dependencies across PySpark scripts, ingestion layers, and notification systems with clear logging and retry logic.

Migrated legacy ETL jobs to AWS EMR, leveraging S3 and Redshift for cloud-native scalability and easier data sharing between analytics and product teams.

Built analytical datasets in Amazon Redshift by transforming raw telecom data into well-modeled tables using complex SQL queries and custom transformation scripts in Python.

Collaborated with architects to design a scalable data lake architecture in AWS S3, enabling raw, curated, and consumer layers with proper partitioning strategies.

Implemented role-based access control in AWS IAM and set up KMS encryption for sensitive customer records, aligning data infrastructure with internal security requirements.

Created automated testing pipelines for data validation and schema checks using custom Python scripts and integrated them into the CI/CD workflow.

Tuned Apache Spark jobs for performance using optimized partitioning strategies, cache persistence, and parallel execution settings for faster batch processing.

Partnered with business analysts and BI teams to translate requirements into optimized SQL queries and views for dashboards and reporting systems.

Built dashboard-ready datasets by applying complex transformations in Spark SQL, supporting KPIs around customer churn, data usage, and service quality.

Integrated Apache Hive Metastore with EMR clusters for structured querying over raw and semi-structured data sources.

Conducted root-cause analysis of failed ETL jobs by reviewing Airflow logs, Spark driver errors, and cluster-level metrics using AWS CloudWatch.

Developed dynamic parameterized pipelines whose configurations were driven by metadata tables in PostgreSQL, reducing code duplication across environments.
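A minimal sketch of that metadata-driven configuration lookup against PostgreSQL using psycopg2; the connection details, table, and column names are assumptions.

```python
import psycopg2

# Placeholder metadata store holding one row of config per pipeline and environment
conn = psycopg2.connect("dbname=etl_metadata user=etl host=metadata-host")

def load_pipeline_config(pipeline_name: str, env: str) -> dict:
    """Read source path, target table, and load options from the metadata table."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT source_path, target_table, load_mode, partition_column
            FROM pipeline_config
            WHERE pipeline_name = %s AND environment = %s
            """,
            (pipeline_name, env),
        )
        source_path, target_table, load_mode, partition_column = cur.fetchone()
    return {
        "source_path": source_path,
        "target_table": target_table,
        "load_mode": load_mode,
        "partition_column": partition_column,
    }

# The same pipeline code can then run unchanged in dev, test, and prod
config = load_pipeline_config("usage_logs", "prod")
```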

Participated in regular code reviews, ensuring Python, SQL, and Spark scripts followed modular design and scalability standards.

Documented data flows, schema definitions, and pipeline dependencies using internal Confluence pages and the AWS Glue Data Catalog, making onboarding easier for new engineers.

EDUCATION

Master of Science (M.S.) in Computer Science / Data Engineering

Lindsey Wilson University – 2024


