
Senior Data Engineer

Location:
Irving, TX
Posted:
July 17, 2025


Resume:

Srikesh Chelakalapally

Sr. Data Engineer

469-***-****

**.*********@*****.***

SUMMARY:

Seasoned Senior Data Engineer with 7 years of experience in architecting, developing, and optimizing data-driven solutions. Specialized in building reliable ETL pipelines using SQL, Python, and cloud-native technologies.

Skilled in designing scalable ETL workflows and data integration processes using tools such as Talend, Python, and RESTful APIs.

Hands-on expertise in full-cycle ETL/ELT development with platforms like Apache Spark, AWS Glue, Matillion, Informatica PowerCenter, Talend, and SSIS.

Strong background in implementing ELT strategies on cloud data warehouses including Snowflake and Amazon Redshift, with a focus on performance tuning and storage optimization.

Developed efficient ELT pipelines in Snowflake using native SQL, Streams, Tasks, and Stored Procedures to support incremental and near real-time data processing.

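Illustrative sketch (not taken from any specific project above): a minimal Snowflake incremental-load setup driven from Python, pairing a Stream with a scheduled Task as this bullet describes. The account, credentials, warehouse, and the raw.orders / curated.orders object names are placeholders.

import snowflake.connector

# Connection parameters and all object names below are placeholders.
conn = snowflake.connector.connect(
    account="xy12345", user="etl_user", password="***",
    warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
)

statements = [
    # Capture row-level changes on the source table
    "CREATE OR REPLACE STREAM orders_stream ON TABLE raw.orders",
    # Scheduled task that merges only when the stream has new data
    """
    CREATE OR REPLACE TASK merge_orders
      WAREHOUSE = ETL_WH
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
    AS
      MERGE INTO curated.orders t
      USING orders_stream s ON t.order_id = s.order_id
      WHEN MATCHED THEN UPDATE SET t.status = s.status
      WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status)
    """,
    "ALTER TASK merge_orders RESUME",
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.close()
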
Built scalable data pipelines on Databricks (Apache Spark) to process high-volume structured and semi-structured datasets for analytics use cases.

Led modernization efforts by migrating legacy ETL systems from Informatica to Databricks and Snowflake, resulting in improved performance and reduced operational overhead.

Engineered data ingestion frameworks in Python to extract data from APIs, flat files, relational databases, and streaming platforms like Kafka and Kinesis.

Automated data workflows and task dependencies using Apache Airflow, integrating with Snowflake, Redshift, and S3 for seamless orchestration.

Developed reusable Python libraries for logging, error handling, and file system abstraction to standardize and streamline pipeline development.

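A minimal sketch of the kind of reusable utilities this bullet refers to, limited here to a shared logger factory and a retry decorator for error handling; the names and defaults are illustrative, not the actual library.

import functools
import logging
import time

def get_logger(name: str) -> logging.Logger:
    # Consistent log format shared across pipeline modules
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

def retry(attempts: int = 3, delay_seconds: float = 5.0):
    # Retry a flaky extract/load step before surfacing the error
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise
                    get_logger(func.__module__).warning(
                        "attempt %d/%d failed, retrying in %.0fs",
                        attempt, attempts, delay_seconds,
                    )
                    time.sleep(delay_seconds)
        return wrapper
    return decorator
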
Designed ingestion pipelines into HDFS using Sqoop, Flume, and Kafka, enabling efficient data flow from diverse sources into the Hadoop ecosystem.

Tuned Hadoop clusters by optimizing job execution parameters, memory allocation, and HDFS storage configurations to enhance performance.

Created resilient Oozie workflows to orchestrate MapReduce, Hive, Sqoop, and Spark jobs across distributed environments.

Built serverless, event-driven ingestion pipelines using AWS SQS and SNS to enable decoupled architecture and real-time alerting.

Integrated Apache Zookeeper for distributed coordination and service discovery within Hadoop-based systems.

Deployed Airflow on Amazon MWAA and Kubernetes clusters hosted on EC2 to ensure scalable and secure orchestration of data workflows.

Designed and implemented big data processing pipelines on Amazon EMR using Spark, Hive, and Hadoop for batch ETL workloads.

Built serverless ETL pipelines with AWS Glue, developing Jobs, Crawlers, and Workflows to transform and catalog data from S3, RDS, and external systems.

Specialized in building both real-time and batch data pipelines using Apache Spark (PySpark/Scala) integrated with Kafka for event-driven processing.

Developed streaming ingestion frameworks using Kafka, Spark Structured Streaming, and AWS services like Kinesis, Lambda, and S3.

Proficient with AWS tools such as QuickSight, Lambda, and CloudWatch for analytics, reporting, and operational monitoring.

Led data validation, UAT, and documentation efforts for AWS-based reporting pipelines, ensuring data accuracy and consistency across environments.

Experienced with MongoDB for managing large-scale, schema-flexible datasets, leveraging aggregation pipelines, sharding, and replica sets.

Strong command of relational databases including MySQL, Oracle, and PostgreSQL, with expertise in complex SQL, stored procedures, triggers, & performance tuning.

Conducted performance tuning and query optimization across SQL and NoSQL systems to enhance responsiveness and reduce resource consumption.

Integrated and orchestrated data from multiple sources into data lakes (S3) and data warehouses (Redshift, Athena) to support analytics and reporting.

Implemented cost-effective AWS solutions by applying S3 lifecycle policies, data tiering strategies, and parallel job execution techniques.

Collaborated with DevOps and security teams to build compliant infrastructure using Terraform, AWS Config, CloudTrail, and GuardDuty.

Experienced in setting up CI/CD pipelines using Jenkins, GitHub Actions, AWS CodePipeline, and GitLab CI/CD to streamline deployments.

Proficient in version control using Git and Bitbucket, applying Git Flow branching strategies and managing collaborative development workflows.

Developed unit tests using PyTest and TestNG, and contributed to backend development using frameworks like Spring Boot, Django, and Flask.

Worked in Agile environments (Scrum/Kanban), actively participating in sprint planning, daily standups, and iterative delivery cycles.

TECHNICAL SKILLS:

Languages & Scripting: Python, SQL, Scala, PySpark

ETL / ELT Tools: Apache Spark, Databricks, AWS Glue, Talend, Informatica PowerCenter, Matillion, SSIS, Oozie, Sqoop, Flume

Data Warehousing: Snowflake, Amazon Redshift, PostgreSQL, Oracle, MySQL

Big Data Ecosystem: Hadoop, Hive, HDFS, Spark Structured Streaming, MapReduce, Kafka, Zookeeper

Cloud Platforms: AWS (S3, Lambda, Redshift, RDS, Kinesis, Glue, EMR, Athena, SQS, SNS, MWAA, QuickSight, CloudWatch, Config, GuardDuty)

Orchestration & Scheduling: Apache Airflow (including MWAA), Oozie

DevOps & IaC: Terraform, Jenkins, GitHub Actions, GitLab CI/CD, AWS CloudFormation, AWS CodePipeline

Version Control: Git, Bitbucket

Testing & QA: PyTest, TestNG, Unit Testing, UAT

Frameworks: Django, Flask, Spring Boot

NoSQL Databases: MongoDB, Apache Cassandra

Streaming & Messaging: Apache Kafka, AWS Kinesis, SQS, SNS

Data Integration & APIs: REST APIs, Python-based ingestion, flat files, relational databases

Monitoring & Logging: AWS CloudWatch, custom Python logging utilities

Methodologies: Agile (Scrum, Kanban), DevOps

CI/CD: Jenkins, GitHub Actions, GitLab CI/CD, AWS CodePipeline

EDUCATION:

University of Texas

Master’s degree in Information Technology

PROFESSIONAL EXPERIENCE:

First National Bank (Banking)

Pittsburgh, PA August 2023 - Present

Sr. Data Engineer

Roles & Responsibilities:

Led the complete modernization of legacy ETL systems, transitioning from Informatica PowerCenter and SSIS to cloud-native platforms like Databricks (Apache Spark) and Snowflake. This initiative improved data processing speeds by 25%.

Engineered scalable and resilient ELT pipelines on Databricks using PySpark and Scala, capable of processing structured, semi-structured (e.g., JSON, Avro), and unstructured data formats. These pipelines supported high-throughput ingestion and transformation for downstream analytics.

Designed high-performance ELT workflows in Snowflake, leveraging advanced features such as Streams, Tasks, and Stored Procedures to implement incremental loads, CDC mechanisms, and near real-time data delivery for business-critical reporting.

Developed modular Python libraries for common ETL operations—logging, exception handling, data validation, and file system abstraction—enabling rapid development and consistent implementation across multiple data engineering projects.

Built a robust data validation framework using Python and PyTest, integrated into Apache Airflow DAGs. This framework automated rule-based checks (e.g., nulls, ranges, schema conformity) to ensure data integrity post-migration.

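Hypothetical example of the rule-based checks such a validation framework might run from an Airflow DAG: simple null and range validations expressed as PyTest tests. The column names and sample rows are invented for illustration.

import pytest

def has_no_nulls(rows, column):
    # Rule: every row must carry a value in the given column
    return all(row.get(column) is not None for row in rows)

def within_range(rows, column, low, high):
    # Rule: numeric values must fall inside the expected range
    return all(low <= row[column] <= high for row in rows)

# Sample data standing in for a post-migration extract
SAMPLE = [
    {"account_id": "A1", "balance": 120.5},
    {"account_id": "A2", "balance": 98.0},
]

def test_account_id_not_null():
    assert has_no_nulls(SAMPLE, "account_id")

@pytest.mark.parametrize("column,low,high", [("balance", 0, 1_000_000)])
def test_balance_within_range(column, low, high):
    assert within_range(SAMPLE, column, low, high)
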
Collaborated with QA and business teams to define data quality benchmarks and validation rules, leading UAT efforts and producing detailed reconciliation reports to confirm alignment between legacy and modernized systems.

Implemented a centralized metadata and lineage tracking solution by integrating AWS Glue Data Catalog with Amundsen, providing full visibility into data sources, transformation logic, and downstream dependencies.

Automated infrastructure provisioning using Terraform, creating reusable modules for AWS resources such as S3, IAM, Databricks clusters, and Snowflake configurations, ensuring secure, consistent, and repeatable deployments across all environments.

Designed and managed complex DAGs in Apache Airflow, deployed on Amazon MWAA, to orchestrate multi-stage ETL workflows with built-in retries, SLA monitoring, and alerting—reducing pipeline failures by 25%.

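A minimal Airflow DAG sketch showing how retries, a task-level SLA, and failure alerting can be declared, in the spirit of the MWAA workflows described above; the DAG id, schedule, alert address, and task callables are placeholders.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # placeholder for the real extraction step

def load():
    pass  # placeholder for the real load step

default_args = {
    "retries": 3,                          # automatic retries on failure
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),             # per-task SLA monitoring
    "email_on_failure": True,
    "email": ["data-alerts@example.com"],  # placeholder alert address
}

with DAG(
    dag_id="multi_stage_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
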
Established CI/CD pipelines using Jenkins, GitHub Actions, and AWS CodePipeline to automate testing, deployment, and configuration of ETL jobs and infrastructure, accelerating delivery cycles and minimizing manual errors.

Optimized Spark and Snowflake performance through advanced techniques such as partitioning, broadcast joins, predicate pushdown, and caching—achieving a 35% reduction in query execution times and improved resource efficiency.

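Sketch of the Spark side of these optimizations, under assumed S3 paths and column names: partition pruning via a filter on the partition column, a broadcast join against a small dimension table, and caching of the reused result.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# Filtering on the (assumed) partition column lets Spark read only matching partitions
transactions = (spark.read.parquet("s3://example-bucket/transactions/")
                .filter(col("txn_date") == "2024-01-01"))

# Small dimension table: broadcast it to avoid a shuffle join
accounts = spark.read.parquet("s3://example-bucket/dim_accounts/")

joined = transactions.join(broadcast(accounts), "account_id").cache()
joined.count()  # materializes the cache before repeated downstream use
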
Worked closely with DevOps and security teams to enforce compliance and governance by integrating Terraform-managed infrastructure with AWS Config, CloudTrail, and GuardDuty, ensuring secure and auditable data operations.

Created and maintained role-based access control (RBAC) models in Snowflake to enforce granular data security aligned with regulatory and compliance standards such as FINRA, SOC 2, and PCI-DSS.

Identified performance bottlenecks in existing HDFS storage and job scheduling, and re-engineered the data flows using Spark SQL and Delta Lake, reducing batch ETL durations by 35%.

Developed automated validation scripts to compare data fidelity between Hadoop outputs and Snowflake tables, enabling confidence in the phased decommissioning of HDFS workloads.

Documented legacy Hadoop job logic and designed target state architecture diagrams to facilitate enterprise-wide cloud migration planning and governance.

Documented all aspects of the modernization effort, including architecture diagrams, testing strategies, deployment procedures, and operational runbooks, enabling smooth knowledge transfer and long-term maintainability.

Actively contributed to Agile Scrum processes, collaborating with product owners, data scientists, and stakeholders to prioritize features, resolve blockers, and deliver iterative improvements aligned with business goals.

Mentored junior engineers and analysts, providing guidance on Spark optimization, ETL best practices, cloud automation, and data quality assurance—fostering a high-performing and collaborative team culture.

Continuously evaluated emerging technologies in the data engineering space, recommending innovations such as serverless ETL and containerized processing to enhance scalability and future-proof the platform.

UnitedHealth Group (Healthcare insurance)

Addison, TX April 2020 - July 2023

Data Engineer

Roles & Responsibilities:

Architected and deployed real-time claims ingestion and fraud detection pipelines using Apache Kafka and Spark Structured Streaming on AWS EMR, enabling sub-second anomaly detection and alerts across payer-provider ecosystems.

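A simplified Structured Streaming sketch of this kind of pipeline: reading claim events from Kafka, parsing JSON, applying a placeholder anomaly rule, and writing flagged records to S3. The broker address, topic, schema, threshold, and paths are assumptions, and the Kafka source requires the spark-sql-kafka connector package on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("claims-stream-sketch").getOrCreate()

# Assumed shape of a claim event
schema = StructType([
    StructField("claim_id", StringType()),
    StructField("provider_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "claims")                       # placeholder topic
       .load())

claims = (raw.select(from_json(col("value").cast("string"), schema).alias("c"))
          .select("c.*"))

# Simplistic stand-in for a real anomaly-scoring rule
flagged = claims.filter(col("amount") > 50000)

query = (flagged.writeStream.format("parquet")
         .option("path", "s3://example-bucket/flagged-claims/")
         .option("checkpointLocation", "s3://example-bucket/checkpoints/flagged-claims/")
         .start())
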
Designed and implemented streaming ingestion frameworks integrating Kafka, AWS Kinesis, and PySpark to process high-volume claim and member events from EHR systems and APIs into S3 and Snowflake.

Built event-driven data workflows using AWS Lambda, SNS, and SQS to orchestrate fraud alert notifications and trigger downstream compliance verification services with minimal latency.

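A hedged sketch of an SQS-triggered Lambda handler that publishes high-risk claims to an SNS alert topic, matching the pattern this bullet describes at a high level; the topic ARN, payload fields, and scoring threshold are invented.

import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:fraud-alerts"  # placeholder ARN

def handler(event, context):
    # SQS trigger delivers a batch of records; each body is assumed to be a claim JSON payload
    for record in event.get("Records", []):
        claim = json.loads(record["body"])
        if claim.get("fraud_score", 0) > 0.9:  # illustrative threshold
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="Potential fraud detected",
                Message=json.dumps(claim),
            )
    return {"status": "ok"}
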
Developed modular Python-based ingestion modules to unify healthcare data from flat files, HL7/FHIR APIs, RDBMS, and streaming sources, enabling reliable data flow into Snowflake and HDFS.

Automated orchestration of regulatory reporting and claims transformation pipelines using Apache Airflow on Amazon MWAA, managing interdependent DAGs with SLA tracking and retries.

Leveraged Apache Zookeeper for coordination across Kafka and Hadoop components, supporting secure, fault-tolerant delivery of PHI and real-time audit logs.

Tuned Kafka clusters and consumer group configurations for optimized throughput and low-latency processing of claims and compliance events, implementing topic compaction and tiered storage.

Designed hybrid processing flows on Amazon EMR using Hive and Spark to enrich claims data with historical provider performance, enabling unified analytics for HEDIS and fraud detection use cases.

Built Snowflake ELT pipelines using SQL, Streams, and Tasks to support incremental HEDIS measure calculation, anomaly scoring, and reporting dashboards integrated with external auditing systems.

Developed Glue Jobs, Crawlers, and Workflows to transform raw streaming healthcare events into curated S3 zones, ensuring schema conformity and compliance with HIPAA data partitioning standards.

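Illustrative Glue job script of the shape described here, reading a cataloged raw table and writing partitioned Parquet into a curated S3 zone; it assumes the Glue runtime (which provides the awsglue libraries) and uses hypothetical database, table, and bucket names.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical catalog database and table populated by a crawler
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_events", table_name="claims_stream"
)
curated = raw.drop_fields(["_corrupt_record"])  # drop unparseable rows

glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={"path": "s3://curated-bucket/claims/", "partitionKeys": ["event_date"]},
    format="parquet",
)
job.commit()
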
Implemented monitoring solutions with AWS CloudWatch, Lambda, and QuickSight to visualize real-time fraud alerts, audit trail activity, and compliance processing SLAs across environments.

Tuned performance in SQL and MongoDB environments, optimizing queries for real-time fraud scoring pipelines and reducing compliance report generation latency by 35%.

Integrated Kafka, Sqoop, and Flume to ingest legacy claims datasets into HDFS, supporting hybrid analysis across batch Medicaid submissions and streaming patient records.

Utilized MongoDB to manage dynamic healthcare datasets such as eligibility updates and lab results, leveraging aggregation pipelines and sharding for high-availability access.

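A small pymongo sketch of the aggregation-pipeline usage this bullet mentions, summarizing eligibility documents per member; the connection string, database, collection, and field names are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["healthcare"]                          # hypothetical database name

pipeline = [
    {"$match": {"status": "active"}},
    {"$group": {"_id": "$member_id", "latest_update": {"$max": "$updated_at"}}},
    {"$sort": {"latest_update": -1}},
    {"$limit": 100},
]

# allowDiskUse lets large group stages spill to disk instead of failing
for doc in db.eligibility.aggregate(pipeline, allowDiskUse=True):
    print(doc)
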
Partnered with DevOps teams to provision IaC modules using Terraform, automating deployment of Kafka clusters, Airflow, EMR, Snowflake roles, and secure S3 environments with AWS GuardDuty and Config.

Implemented CI/CD pipelines using GitHub Actions, Jenkins, and AWS CodePipeline to automate build, test, and deployment processes for regulatory ETL and fraud detection workflows.

Applied Git and Bitbucket version control strategies, including Git Flow and pull request reviews, to ensure release stability and collaborative development for healthcare compliance teams.

Created PyTest and TestNG-based test suites to validate transformation logic in HEDIS reports and fraud scoring models, achieving 90%+ coverage across core data paths.

Actively participated in Agile delivery teams using Scrum and Kanban frameworks, contributing to sprint planning, story estimation, and retrospectives for continuous improvement of real-time data quality and compliance SLAs.

Provided technical leadership in building HIPAA-compliant, streaming-first data architectures for claims, offering mentorship on Kafka, Spark, and AWS-native best practices to cross-functional teams.

Thoughtworks (E-commerce)

Santa Clara, CA Jan 2017 - March 2020

Data Engineer

Roles & Responsibilities:

Developed automated data ingestion and transformation scripts in Python to extract datasets from REST APIs, flat files, and relational databases, streamlining data preparation workflows.

Created reusable Python modules for logging, exception handling, and data quality validations, improving consistency and reducing duplication across ETL pipelines.

Wrote efficient SQL queries for data joins, filtering, and aggregation across Snowflake and Amazon Redshift, supporting business reporting and analytics. Also developed stored procedures to encapsulate transformation logic within ELT processes.

Collaborated with senior engineers to build and maintain scalable ETL pipelines that extracted data from MySQL, PostgreSQL, and flat files, applied transformations using SQL and Python, and loaded results into Redshift.

Monitored and debugged scheduled ETL jobs, ensuring timely data delivery and accuracy. Gained practical experience integrating traditional ETL processes with cloud-based storage and data warehouse platforms.

Developed serverless ingestion pipelines using AWS Lambda + API Gateway to capture real-time order placement and customer activity from web and mobile applications into S3.

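A minimal sketch, assuming an API Gateway proxy integration, of a Lambda handler that lands incoming order events as JSON objects in S3; the bucket name and key layout are hypothetical.

import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "raw-events-bucket"  # placeholder bucket name

def handler(event, context):
    # With proxy integration, API Gateway passes the request body as a JSON string
    payload = json.loads(event.get("body") or "{}")
    key = f"orders/raw/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode("utf-8"))
    return {"statusCode": 200, "body": json.dumps({"stored_as": key})}
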
Built Athena-based interactive analytics on raw and transformed data in S3, enabling product teams to self-serve reporting with minimal engineering overhead.

Leveraged AWS Glue Crawlers to catalog semi-structured product feeds (JSON, XML), ensuring data discoverability and schema evolution tracking across ingestion zones.

Assisted in big data ingestion efforts, moving structured and semi-structured data into HDFS using tools like Sqoop and Flume.

Contributed to PySpark-based batch processing jobs on Hadoop, applying basic transformations and leveraging Spark DataFrame APIs for data manipulation.

Applied performance tuning techniques such as partitioning and caching in Spark jobs to improve execution efficiency in development and test environments.

Executed data loading and transformation tasks in Snowflake, using Streams, Tasks, and scheduled workflows to automate incremental data processing. Supported schema mapping and normalization of datasets ingested from S3 into curated Snowflake layers.

Worked on Redshift-based reporting solutions, loading raw data using COPY commands and writing SQL transformations to produce clean, analysis-ready tables. Tuned queries by optimizing sort keys and using analyze/vacuum operations.

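Sketch of the Redshift loading pattern described here, issued from Python with psycopg2: a COPY from S3 using an IAM role, followed by ANALYZE and an autocommitted VACUUM. The cluster endpoint, credentials, schema, and IAM role ARN are placeholders.

import psycopg2

# Connection parameters are placeholders
conn = psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                        dbname="analytics", user="etl_user", password="***")

copy_sql = """
    COPY staging.orders
    FROM 's3://raw-bucket/orders/2024-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

# COPY and ANALYZE run inside a transaction that commits on exit
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
    cur.execute("ANALYZE staging.orders;")

# VACUUM cannot run inside a transaction block, so issue it with autocommit
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("VACUUM staging.orders;")
conn.close()
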
Contributed to proof-of-concept projects involving real-time ingestion using PySpark and Kafka, simulating log data pipelines and processing streaming events for downstream analytics.

Gained hands-on experience with Kafka, including basic producer/consumer setup and structured streaming integration with Spark.

Used Git and Bitbucket for version control, following Git Flow branching strategies and submitting code changes through structured pull requests.

Utilized Snowflake Tasks and Streams to enable near real-time CDC and incremental processing, improving SLA compliance for critical dashboards.

Wrote unit tests using PyTest to validate core ETL logic, helping reduce bugs and improve reliability during production deployments.

Participated in Agile development processes, contributing to sprint planning, daily standups, and demo sessions while collaborating with engineering and QA teams to review and refine pipeline outcomes.


