
Senior Data Engineer

Location:
Dallas, TX
Salary:
$120,000
Posted:
October 15, 2025

Contact this candidate

Resume:

Sai Pankaj Varma Dendukuri

Senior Data Engineer

Email: **********@*****.*** Phone: +1-817-***-****

PROFESSIONAL SUMMARY:

Senior Data Engineer with 7+ years of experience designing and delivering large-scale data solutions across the healthcare, finance, subscription, and technology domains. Skilled in building high-performance ETL/ELT and streaming pipelines on AWS, Snowflake, Hadoop, and Databricks, enabling real-time analytics and regulatory compliance. Adept at data modelling, governance, and security, with a strong record of partnering with data science and BI teams to deliver actionable insights. Experienced in automating infrastructure and CI/CD pipelines, containerization, and orchestration to ensure scalable, reliable, and cost-efficient data platforms.

Extensive hands-on experience with the Hadoop ecosystem (HDFS, Hive, Sqoop, Oozie, Spark, Zookeeper), Databricks, and Azure Data Factory for advanced data transformations, interactive exploration, orchestration, and cluster management.

Built and maintained cloud-based data warehouses in Snowflake, Redshift, and Azure Synapse Analytics, implementing dimensional modelling (star/snowflake schemas), zero-copy cloning, and performance optimization through partitioning, clustering, and resource tuning.

Developed real-time streaming pipelines with Spark Structured Streaming, Kafka, and Kinesis for fraud detection, claims processing, and subscription event tracking.

Implemented data quality checks, lineage tracking, and governance frameworks to ensure data integrity, security, and compliance with regulatory standards (HIPAA, financial regulations).

Engineered cloud infrastructure using Terraform and CloudFormation to deliver repeatable, secure, and cost-optimized environments that scaled automatically with business demand.

Developed and managed large-scale data workflow orchestration with Apache Airflow, designing, scheduling, and monitoring dozens of DAGs across multiple domains to ensure timely, reliable data processing.

Partnered with data science and analytics teams to deliver curated datasets, feature engineering pipelines, and KPI dashboards supporting population health, risk scoring, and churn/fraud analysis.

Designed and deployed containerized Spark jobs and microservices using Docker and Kubernetes/EKS, and implemented CI/CD pipelines with Jenkins and Maven integrated with Git for automated build, test, and deployment of data engineering solutions.

Hands-on experience with data ingestion from relational and NoSQL sources (RDBMS, APIs, flat files) into centralized data lakes, ensuring seamless data flow and scalability, and skilled in developing insightful reports and dashboards with Tableau and Power BI.

TECHNICAL SKILLS:

Programming Languages: Python, SQL, Scala, Java, Bash, Shell scripting

Data Warehousing: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse Analytics

Databases: Oracle, MySQL, MS SQL Server, DB2, DynamoDB

ETL/ELT Tools: Apache Airflow, AWS Glue, dbt, Apache NiFi, Azure Data Factory

Data Processing: Apache Spark, Kafka, Hadoop, Databricks, Flink, AWS Lambda, Google Dataflow

Big Data Technologies: HDFS, YARN, MapReduce, Pig, Hive, Sqoop, Oozie, ZooKeeper, Parquet, Avro, Kinesis, Flume

Cloud Platforms: AWS (EC2, S3, EMR, RDS, Redshift), GCP (BigQuery, Dataflow, Cloud Functions), Azure (Synapse, Data Factory, Blob Storage)

Orchestration Tools: AWS Step Functions, Apache Airflow, NiFi, Prefect

Monitoring & Logging: CloudWatch, ELK Stack (Elasticsearch, Logstash, Kibana), Prometheus, Grafana

Infrastructure & CI/CD: Terraform, Jenkins, Ansible, Maven

Data Visualization: Power BI, Tableau, Looker, QuickSight

Containerization/Orchestration: Docker, Kubernetes

OS & Version Control: Linux, Unix, Windows; Git, GitHub, GitLab

PROFESSIONAL EXPERIENCE:

SENIOR DATA ENGINEER

Beal Bank, Plano, TX | Jan 2023 – Present

Pioneered the integration of multi-cloud data platforms, enhancing Beal Bank’s data processing capabilities and enabling seamless interoperability across AWS, Snowflake, and on-prem environments.

Migrated legacy Hadoop batch workflows to low-latency Spark streaming on EMR, integrating Kafka producers/consumers and automating ingestion from on-prem RDBMS via Apache Sqoop and Flume into S3/Redshift for analytics.

Utilized the Hadoop ecosystem (HDFS, Hive, MapReduce, YARN, Spark) for scalable storage and analysis of large banking datasets, ensuring compliance with internal and external regulatory standards.

Designed scalable data pipelines using AWS Glue dynamic frames and PySpark to cleanse, transform, and enrich complex nested loan, credit, and transactional data, documenting reusable ingestion, transformation, and alerting modules across multiple payment channels and reducing processing time by 40%.
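
The bullet above describes nested-record cleansing with AWS Glue dynamic frames and PySpark; the sketch below illustrates that general pattern. The database, table, column, and bucket names are hypothetical placeholders, not actual Beal Bank objects.

# Illustrative AWS Glue job skeleton (PySpark); all object names are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read nested loan records registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="lending_raw", table_name="loan_transactions")

# Flatten the nested payment structure and apply basic cleansing rules.
df = raw.toDF()
clean = (
    df.select("loan_id", "customer_id",
              F.col("payment.amount").alias("amount"),
              F.col("payment.channel").alias("channel"), "posted_ts")
      .where(F.col("amount").isNotNull() & (F.col("amount") > 0))
      .withColumn("posted_date", F.to_date("posted_ts"))
)

# Write curated output back to S3 as partitioned Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(clean, glue_context, "clean_loans"),
    connection_type="s3",
    connection_options={"path": "s3://example-curated/loans/",
                        "partitionKeys": ["posted_date"]},
    format="parquet",
)
job.commit()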

Built Databricks Spark pipelines on Delta Lake ingesting 150M+ transactions/day from Kinesis into AWS S3/Redshift and Azure Blob, enabling near-real-time fraud scoring and cutting data availability lag from 6 hours to 15 minutes.

Built data ingestion pipelines using Azure Data Factory to load high-volume transactional data from on-prem SQL Server and partner APIs into Azure Data Lake Storage Gen2, orchestrating transformations in Databricks for cross-cloud fraud analytics and regulatory reporting.

Developed and implemented real-time fraud detection pipelines with Spark Structured Streaming, Kafka, and AWS Kinesis, integrating AWS SNS/SQS to trigger instant alerts on high-risk transactions and partnering with data science teams on advanced fraud models.
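
The streaming alert flow described above can be sketched roughly as follows. The Kafka topic, risk rule, and SNS topic ARN are illustrative assumptions; a production pipeline would join in real scoring models rather than a single threshold rule.

# Minimal Spark Structured Streaming fraud-alert sketch; names are placeholders.
import boto3
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("fraud-alerts").getOrCreate()

schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("country", StringType()),
])

# Consume card transactions from Kafka and parse the JSON payload.
txns = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "card-transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
    .select("t.*")
)

# Simple rule-based flag; real scoring models would be applied here.
flagged = txns.where((F.col("amount") > 10000) | (F.col("country") != "US"))

def publish_alerts(batch_df, batch_id):
    """Push each high-risk transaction in the (small) micro-batch to SNS."""
    sns = boto3.client("sns")
    for row in batch_df.collect():
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:111122223333:fraud-alerts",
            Message=f"High-risk txn {row.txn_id} on account {row.account_id}",
        )

query = (flagged.writeStream
         .option("checkpointLocation", "/tmp/fraud-alerts-ckpt")
         .foreachBatch(publish_alerts)
         .start())
query.awaitTermination()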

Designed Spark Structured Streaming applications that combined anomaly detection with rule-based logic to identify unusual customer behaviors, feeding the advanced fraud models built with data science teams.

Integrated Kafka producers and consumers to ensure low-latency, fault-tolerant data transfer between fraud monitoring platforms and the bank’s Amazon S3 data lake.

Partnered with risk modeling teams to support feature engineering pipelines, developing Python pipelines to mask sensitive PII and generate fraud-risk features (IP, device ID, geolocation) while integrating real-time scoring models with ingestion workflows.
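
A minimal sketch of the masking and feature-derivation step described above, assuming hypothetical column names and a placeholder salt (a real pipeline would source the salt from a secrets manager):

# Hedged PII-masking and feature example; columns and salt handling are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii-features").getOrCreate()
events = spark.read.parquet("s3://example-curated/auth-events/")

features = (
    events
    # Replace direct identifiers with salted one-way hashes.
    .withColumn("customer_hash",
                F.sha2(F.concat_ws("|", "ssn", F.lit("static-salt")), 256))
    .drop("ssn", "full_name", "email")
    # Derive device/geo features used by downstream risk models.
    .withColumn("ip_octet1", F.split("ip_address", "\\.").getItem(0))
    .withColumn("is_new_device", (F.col("device_age_days") < 7).cast("int"))
    .withColumn("geo_mismatch",
                (F.col("card_country") != F.col("ip_country")).cast("int"))
)
features.write.mode("overwrite").parquet("s3://example-features/fraud/")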

Developed complex SQL queries for data extraction, cleansing, and transformation across Snowflake, Redshift, Oracle, and MySQL databases, supporting data analysis and enabling analysts to deliver regulatory reports 30% faster.

Implemented PySpark-based data quality checks and leveraged AWS Glue crawlers for schema validation, ensuring accurate ingestion of structured and semi-structured banking data.
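
A minimal example of the kind of PySpark data-quality gate described above; the input path, key column, and tolerances are assumptions for illustration:

# Simple data-quality checks; fail the run if any check is breached.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://example-raw/transactions/")

total = df.count()
null_keys = df.where(F.col("txn_id").isNull()).count()
dupes = total - df.dropDuplicates(["txn_id"]).count()
bad_amounts = df.where(F.col("amount") <= 0).count()

failures = {
    "null_txn_id": null_keys,
    "duplicate_txn_id": dupes,
    "non_positive_amount": bad_amounts,
}
breaches = {name: n for name, n in failures.items() if n > 0}
if breaches:
    raise ValueError(f"Data quality checks failed: {breaches}")
print(f"All checks passed for {total} rows")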

Designed S3-based data lake zones (raw, curated, trusted) to securely store transaction logs, fraud detection results, and risk scores, supporting compliance and audit requirements.

Designed and optimized data warehousing solutions in Snowflake for fraud analytics, customer risk scoring, and regulatory reporting, applying automated scaling and advanced clustering to improve performance.

Designed and optimized Azure Synapse Analytics data marts to support customer risk scoring and compliance dashboards, implementing partitioning, materialized views, and RBAC for secure, high-performance access.

Utilized tools such as Chef, Terraform, and Jenkins in conjunction with custom Python and PowerShell scripts to automate infrastructure deployment and auditing.

Provisioned secure, repeatable AWS infrastructure (EMR, S3, IAM) with modular Terraform code embedded in CI/CD pipelines to support compliant banking data systems.

Containerized Spark jobs and fraud validation services using Docker and integrated GitHub with Jenkins for automated CI/CD of data workflows across development, QA, and production.

Provisioned and optimized EMR clusters for distributed fraud and transaction data processing, tuning performance to meet SLAs while reducing operational costs.

Automated AWS interactions using Python’s Boto3, including data transfers, EC2 management, and CloudWatch monitoring.
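
Representative Boto3 calls for the automation described above; the bucket names, instance ID, and metric namespace are placeholders.

# Boto3 automation sketch: S3 transfer, EC2 management, CloudWatch metric.
import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

# Copy a curated extract between buckets.
s3.copy_object(
    Bucket="example-archive",
    Key="fraud/daily/2025-10-15.parquet",
    CopySource={"Bucket": "example-curated", "Key": "fraud/daily/2025-10-15.parquet"},
)

# Stop idle development instances outside business hours.
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])

# Publish a custom pipeline health metric for CloudWatch dashboards and alarms.
cloudwatch.put_metric_data(
    Namespace="DataPipelines/Fraud",
    MetricData=[{"MetricName": "RowsProcessed", "Value": 1250000, "Unit": "Count"}],
)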

Established data governance frameworks with audit trails and implemented RBAC policies within Snowflake and Oracle to ensure secure, compliant access to sensitive banking data.

Improved pipeline performance and cost by 35% through caching/resource tuning and enhanced observability with CloudWatch dashboards and custom logging, reducing MTTD and MTTR for fraud pipelines.

Orchestrated fraud detection workflows using Apache Airflow, managing dependencies across ingestion, scoring, and notification tasks.
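
An illustrative Airflow DAG wiring ingestion, scoring, and notification tasks in sequence; the task callables, schedule, and retry settings are hypothetical.

# Airflow DAG sketch for the ingestion -> scoring -> notification dependency chain.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_transactions(**context):
    # Pull the latest files from the landing zone (placeholder logic).
    print("ingesting")

def score_transactions(**context):
    # Apply the fraud-scoring job to the new batch (placeholder logic).
    print("scoring")

def send_notifications(**context):
    # Publish alerts for transactions above the risk threshold (placeholder logic).
    print("notifying")

with DAG(
    dag_id="fraud_detection_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_transactions)
    score = PythonOperator(task_id="score", python_callable=score_transactions)
    notify = PythonOperator(task_id="notify", python_callable=send_notifications)

    ingest >> score >> notify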

Deployed microservices on Kubernetes using Helm charts, ensuring consistency and scalability of fraud and risk applications in cloud-native environments.

Mentored junior engineers, providing guidance on secure coding practices, Spark optimization, and AWS best practices in high-stakes financial data environments.

AWS DATA ENGINEER

Cox Enterprises, Irving, TX | Oct 2020 – Dec 2022

Built end-to-end ETL workflows using AWS Glue, Lambda, S3, Redshift, and PySpark to ingest, transform, and validate large-scale customer, subscription, and billing data from CRM platforms, APIs, and on-prem systems into a centralized data lake for analytics and reporting.

Processed large-scale subscriber, advertising, and network usage datasets using Hadoop (HDFS, Hive, Spark) for ETL and historical trend analysis to support revenue optimization and customer experience initiatives.

Built and managed AWS EMR clusters for distributed data processing with Apache Spark, HBase, and Hive, using HiveQL to process and analyze structured and semi-structured data stored in S3.

Developed and optimized Spark jobs on EMR clusters to process batch datasets containing hundreds of millions of records daily, tuning performance via partitioning and memory management.

Reduced ETL processing time by 60% by migrating legacy SSIS and shell-script workflows to AWS Glue and Spark, streamlining data ingestion and transformation for customer, subscription, and billing pipelines.

Developed reusable PySpark and Python modules for normalization, enrichment, and transformation of operational and billing data, and built data transformation frameworks for customer billing records, product reconciliation, and partner settlements, reducing redundancy and cutting development time by 25%.

Designed and executed data validation scripts using SQL to compare source and target datasets, ensuring data quality post-ingestion.

Leveraged AWS Glue crawlers, Data Catalog, and Snowflake (with Snowpipe) for schema management, metadata tracking, and near real-time ingestion to support analytics and reporting workflows.

Designed and maintained data warehousing pipelines in Snowflake analyzing 80M+ customer, subscription, and advertising records sourced from S3, APIs, and on-prem RDBMS, cutting query times from 20 minutes to under 2 minutes through SQL query optimization while ensuring compliance.

Implemented Delta Lake versioning and audit logging in Databricks to track data changes, reducing rollback time from days to hours, ensuring compliance with data regulations, and improving overall data quality for analysis.

Used Spark Streaming with Python to consume real-time data from Kafka and persist the streams to HDFS and to NoSQL databases such as HBase and Cassandra.

Built event-driven data pipelines and real-time validation layers to detect missing or inconsistent records in customer and billing feeds, improving data quality by 40% and enabling faster real-time analytics for operational and reporting teams.

Collaborated with QA teams to develop automated validation suites, ensuring accuracy of critical fields such as account IDs, subscription dates, and transaction timestamps.

Worked on subscription eligibility and product verification pipelines, integrating with third-party systems and partner APIs for real-time validation.

Built dashboards and reconciliation reports to support BI teams, ensuring actionable insights and end-to-end traceability across customer, subscription, and advertising datasets.

Implemented audit logging, lineage tracking, and RBAC policies to comply with Cox Enterprises’ data governance and security standards.

Built data quality and reconciliation frameworks in Python and SQL to ensure reliability of reporting and analytics deliverables.
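
A minimal reconciliation sketch in PySpark comparing a source extract with its curated target; the paths, key column, and tolerance are assumptions.

# Reconcile row counts and charge totals per account between source and target.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconcile-billing").getOrCreate()
source = spark.read.parquet("s3://example-landing/billing/")
target = spark.read.parquet("s3://example-curated/billing/")

src_agg = source.groupBy("account_id").agg(
    F.count("*").alias("src_rows"), F.sum("charge_amount").alias("src_total"))
tgt_agg = target.groupBy("account_id").agg(
    F.count("*").alias("tgt_rows"), F.sum("charge_amount").alias("tgt_total"))

# Keep only accounts where counts or totals disagree beyond a small tolerance.
mismatches = (
    src_agg.join(tgt_agg, "account_id", "full_outer")
    .where(
        (F.coalesce(F.col("src_rows"), F.lit(0)) != F.coalesce(F.col("tgt_rows"), F.lit(0)))
        | (F.abs(F.coalesce(F.col("src_total"), F.lit(0.0))
                 - F.coalesce(F.col("tgt_total"), F.lit(0.0))) > 0.01)
    )
)
mismatches.write.mode("overwrite").parquet("s3://example-audit/billing-reconciliation/")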

Designed and configured CI/CD pipelines with Jenkins, GitLab, and Docker to automate deployment of Spark applications on AWS EMR, using AWS CodePipeline for continuous integration and delivery.

Containerized transformation logic and reporting services using Docker, ensuring consistent deployment across development, QA, and production.

Deployed Kubernetes-based microservices for real-time analytics, supporting scalable and on-demand data processing for internal teams and customer-facing platforms.

Managed container orchestration with AWS Fargate, allowing serverless container deployments and scaling without managing underlying infrastructure.

Tuned and optimized Snowflake SQL queries using partition pruning, caching, and proper join strategies.

Mentored junior engineers and interns, providing code reviews, documentation, and training in AWS data lake architecture, Spark optimization, and secure data engineering practices.

BIG DATA ENGINEER

Molina Healthcare, Long Beach, CA | Feb 2018 – Sep 2020

Developed and maintained scalable data pipelines to process claims, pharmacy billing, eligibility, and member/provider data using AWS Glue, PySpark, and Scala, ensuring high data availability and low latency.

Designed and implemented AWS Glue/Spark pipelines ingesting 25M+ healthcare records/day from EHR systems, payer APIs, and pharmacy systems into a HIPAA-compliant S3 data lake, reducing data availability lag from 4 hours to 30 minutes and enabling population health analytics and regulatory reporting.

Participated in data model design reviews, ensuring SQL-based transformations aligned with business logic and compliance needs.

Created dynamic SQL scripts for metadata-driven transformations and automated schema migrations.
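
A small illustration of metadata-driven SQL generation in Python; the metadata dictionary, table names, and null-handling rule are hypothetical.

# Render an INSERT ... SELECT from column metadata, casting and null-guarding each field.
COLUMN_METADATA = {
    "claims_stg": [
        {"name": "claim_id", "type": "VARCHAR(32)", "nullable": False},
        {"name": "member_id", "type": "VARCHAR(32)", "nullable": False},
        {"name": "billed_amount", "type": "DECIMAL(12,2)", "nullable": True},
    ],
}

def build_insert_select(target, source, columns):
    """Build the transformation SQL for one table from its column metadata."""
    select_exprs = []
    for col in columns:
        expr = f"CAST({col['name']} AS {col['type']})"
        if not col["nullable"]:
            # Placeholder null-handling rule for required columns.
            expr = f"COALESCE({expr}, '<MISSING>')"
        select_exprs.append(f"{expr} AS {col['name']}")
    return (
        f"INSERT INTO {target} ({', '.join(c['name'] for c in columns)})\n"
        f"SELECT {', '.join(select_exprs)}\nFROM {source};"
    )

print(build_insert_select("claims_curated", "claims_stg", COLUMN_METADATA["claims_stg"]))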

Built provider and claims normalization layers in PySpark, harmonizing data across multiple internal systems and third-party partners for consistency in analytics and reporting.

Designed member behavior analytics pipelines, integrating claims history, pharmacy files, and care management data to support population health initiatives and targeted interventions.

Built real-time streaming pipelines using Apache Spark Structured Streaming to monitor claim submissions, eligibility updates, and provider network changes across Molina’s systems.

Leveraged Hadoop HDFS for scalable storage of structured and semi-structured healthcare data including claims, pharmacy feeds, and provider rosters.

Implemented event-driven architecture using AWS SQS and SNS, enabling real-time notifications for rejected claims, member eligibility changes, and high-value transactions to trigger downstream processing.
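
A sketch of the SNS/SQS pattern for rejected-claim events; the topic ARN, queue URL, and payload fields are placeholders, and raw message delivery is assumed on the SQS subscription.

# Publish a rejected-claim event to SNS and consume it from a subscribed SQS queue.
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Producer side: emit the event for downstream consumers.
sns.publish(
    TopicArn="arn:aws:sns:us-west-2:111122223333:claims-events",
    Message=json.dumps({"claim_id": "C-1001", "status": "REJECTED", "reason": "INVALID_NPI"}),
    MessageAttributes={"event_type": {"DataType": "String", "StringValue": "claim_rejected"}},
)

# Consumer side: poll the queue that drives the rework pipeline.
queue_url = "https://sqs.us-west-2.amazonaws.com/111122223333/claims-rework"
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    event = json.loads(msg["Body"])  # raw message delivery assumed; otherwise unwrap the SNS envelope
    # Route the claim to reprocessing here, then acknowledge the message.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])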

Supported data science and analytics teams by building feature pipelines, automating KPI reporting, and preparing curated datasets for population health, risk adjustment, and care quality studies.

Reduced claim data processing latency by 19% by optimizing PySpark transformations and partitioning strategies, enabling faster population health analytics and KPI reporting.

Orchestrated and scheduled critical data workflows using Apache Airflow, managing 40+ DAGs supporting claims, eligibility, and provider reporting pipelines.

Implemented data quality checks and anomaly detection to identify duplicate claims, invalid member IDs, mismatched provider NPI numbers, and coding inconsistencies, significantly improving accuracy.

Migrated legacy ETL workflows from on-premises Hadoop infrastructure to AWS-native services, reducing processing times and improving maintainability.

Worked on DevOps and compliance-related tasks, including containerizing ETL jobs, ensuring HIPAA-compliant pipelines, and collaborating on pipeline deployment and monitoring.

Actively participated in sprint planning and data architecture discussions, contributing to design decisions on partitioning, metadata management, and pipeline optimization.

EDUCATION:

Master of Science in Information Systems, University of North Texas, USA

Bachelor of Technology in Computer Science, Vel Tech University, India


