
Senior Data Engineer - PySpark, Databricks, Azure/AWS

Location:
United States
Salary:
65000
Posted:
May 06, 2026

Contact this candidate

Resume:

Sai Jallu

Dallas, Texas, USA +1-415-***-**** *************@*****.*** LinkedIn

SUMMARY

Data Engineer with 5+ years of experience building scalable data pipelines and distributed data systems across AWS and Azure environments. Strong expertise in PySpark, Databricks, and ETL frameworks, with hands-on experience in real-time streaming using Kafka and Kinesis. Proven track record of improving pipeline performance by up to 40%, optimizing cloud costs, and delivering reliable data solutions for healthcare and large-scale analytics use cases.

WORK EXPERIENCE

BCBS Texas, USA

Senior Data Engineer Aug 2025 – Present

Owned end-to-end ETL pipelines, optimizing PySpark and Spark SQL jobs to process large-scale healthcare claims and member eligibility data and reducing data processing time by 25%.

Built data pipelines on Azure (Databricks, Data Factory, ADLS, Synapse Analytics), enabling scalable processing of claims, provider, and clinical datasets.

Processed 2M+ records per month using Hadoop ecosystem tools (Hive, HBase, MapReduce), improving data ingestion efficiency and pipeline stability.

Migrated 50+ complex Hive/SQL queries to Spark (PySpark/Scala), improving query performance and reducing execution time by 25% for downstream analytics.

Designed and delivered Power BI dashboards for claims analysis, member utilization, and cost reporting, supporting business and operational decision-making.

Engineered ETL workflows in Azure Databricks to integrate data from multiple sources including EHR systems, claims platforms, and third-party data providers.

Developed a reusable Spark and Hive-based ETL framework, reducing overall data processing time by 30% and improving data consistency across pipelines.

Implemented automated Databricks workflows for parallel data processing, increasing pipeline throughput by 40% and improving job reliability by 30%.

Built and optimized SSIS/SSRS solutions for data integration and reporting, supporting enterprise-level healthcare reporting requirements.

Integrated PyTorch-based machine learning models into data pipelines to support patient risk scoring and predictive analytics, improving model accuracy by 15%.

Created PowerShell scripts to monitor and manage Azure resources, improving resource utilization and supporting cost optimization efforts.

George Mason University Virginia, USA

Research Analyst/Teaching Assistant May 2024 – May 2025

Supported instruction and facilitation of a graduate-level course covering business analytics, data ethics, and data visualization, assisting with assignments, student queries, and case-based discussions for 80+ students.

Designed and demonstrated an AWS-based data pipeline architecture (S3, Glue, Lambda) to help students understand how enterprise-scale data flows and downstream systems operate, improving conceptual clarity and hands-on learning.

Taught and graded MSBA 610: Essentials for Business Analytics, covering topics in data ethics and data-driven decision-making.

Built Tableau dashboards using data generated from AWS pipelines to illustrate real-world analytics use cases, enabling students to better interpret KPIs, trends, and business metrics.

Tata Consultancy Services India

Data Engineer Oct 2021 – Feb 2023

Applied open-source big data technologies such as Apache Hadoop, Hive, Spark, Impala, HDFS, and YARN, alongside Teradata, for large-scale data processing.

Automated pre-processing tasks such as cleaning and formatting using Python, VB, and PowerShell scripts and Excel macros, reducing manual workload by up to 85%.
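The kind of pre-processing automation described above can be sketched in a few lines of plain Python; the field names and cleanup rules below are illustrative, not taken from the original pipeline:

```python
import re

def clean_record(record):
    """Normalize one raw record: trim whitespace, collapse internal
    runs of spaces, and map empty strings to None (illustrative rules)."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = re.sub(r"\s+", " ", value).strip()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned

raw = {"name": "  Jane   Doe ", "dept": "", "id": 42}
print(clean_record(raw))  # {'name': 'Jane Doe', 'dept': None, 'id': 42}
```

Applying a function like this per record keeps the cleanup rules in one place, which is what makes the batch automation repeatable.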

Developed ad-hoc financial reports and interactive dashboards on BI reporting tools like Microsoft Power BI, SSRS, SSAS, SAP BO and Excel.

Designed and optimized complex ETL workflows in Informatica PowerCenter, enhancing data integration and processing efficiency across multiple data sources, leading to a 30% reduction in data processing time.

Performed ETL operations using Python, Spark SQL, S3, and Redshift on terabytes of data to obtain customer insights. Heavily involved in setting up the CI/CD pipeline using Jenkins, Terraform, and AWS.

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Transformed data using AWS Glue dynamic frames with PySpark; cataloged the transformed data with Glue crawlers and scheduled the job and crawler using the Glue workflow feature.

Designed and managed public/private cloud infrastructure using Amazon Web Services (AWS), including EC2, S3, CloudFront, Elastic File System, and IAM, enabling automated operations. Deployed CloudFront to deliver content, reducing load on the origin servers.

Developed and managed data integration workflows using Informatica Cloud (IICS), streamlining data migration and synchronization across multiple cloud and on-premises systems.

Airbnb India

Junior Data Engineer May 2020 – Sep 2021

● Engineered scalable ETL pipelines using PySpark, AWS Glue, and Airflow, processing over 180 million user event logs monthly to support data-driven recommendations across the Airbnb marketplace.

● Optimized data ingestion workflows from PostgreSQL and Snowflake into AWS S3 and Redshift, reducing pipeline execution time by 3.5 hours daily while enhancing collaboration between data science and analytics teams.

● Developed automated data quality validation scripts in Python (Pandas, NumPy) and SQL, improving reliability of analytics dashboards used by 20+ business analysts for host engagement reporting.
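A minimal sketch of the kind of automated data-quality validation described above, in plain Python; the check rules and field names are illustrative, not the production scripts:

```python
def validate_rows(rows, required_fields, key_field):
    """Run basic data-quality checks: missing required fields and
    duplicate keys (rules are illustrative, not the production set)."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                errors.append(f"row {i}: missing {field}")
        key = row.get(key_field)
        if key in seen_keys:
            errors.append(f"row {i}: duplicate {key_field}={key}")
        seen_keys.add(key)
    return errors

rows = [
    {"listing_id": 1, "host": "a"},
    {"listing_id": 1, "host": ""},
]
print(validate_rows(rows, ["listing_id", "host"], "listing_id"))
# ['row 1: missing host', 'row 1: duplicate listing_id=1']
```

Running checks like these before data lands in a dashboard is what keeps downstream reports trustworthy: bad rows surface as explicit errors instead of silently skewing metrics.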

● Integrated real-time streaming data from Kafka and Kinesis into AWS EMR, ensuring near real-time updates for pricing insights and boosting operational efficiency for revenue management teams.

● Collaborated with cross-functional teams in Agile sprints using Git and Jenkins CI/CD, ensuring consistent deployment of data pipeline updates and improving stakeholder communication efficiency.

PROJECT EXPERIENCE

Text Annotation and Inter-Rater Agreement Analysis for NLP Tasks

Constructed a labeled dataset of 150 text samples for classification, maintaining balanced categories and annotation consistency validated with a raw agreement of 92% and a Cohen's Kappa of 0.88 using pandas and scikit-learn. Executed NLP experiments leveraging PyTorch and GPU acceleration, cutting model-training time by 40 minutes per epoch and achieving an F1 score of 0.94 on the final test dataset.
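The two agreement metrics reported above can be computed directly; a short plain-Python sketch (the toy labels below are illustrative, not the project data):

```python
from collections import Counter

def raw_agreement(a, b):
    """Fraction of items on which the two annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance."""
    n = len(a)
    po = raw_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if each annotator labeled at random with
    # their observed category frequencies.
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (po - pe) / (1 - pe)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(raw_agreement(a, b), 3), round(cohens_kappa(a, b), 3))
# 0.833 0.667
```

Kappa sits below raw agreement because it discounts the agreement two annotators would reach by chance alone, which is why both numbers are worth reporting.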

Dev Finder – Android Application

Spearheaded end-to-end Android app development with MySQL and Firebase integration, incorporating real-time chat, location-based matching, and user profiles, leading to a 4.5-star internal rating after 80 beta tests. Implemented feature-driven workflows that reduced user onboarding and task completion time by 90 seconds per session, improving developer collaboration efficiency across 200+ active testers.

CERTIFICATIONS & AWARDS

● AWS Certified Data Engineer – Associate

● AWS Certified Cloud Practitioner – Foundational

● Business Ally Award (TCS): Improved Data Lake efficiency by migrating from Striim to Qlik CDC.

● Cost Optimization Award (TCS): Cut AWS billing by 40% through ETL automation with Lambda and Glue.

EDUCATION

George Mason University Aug 2023 – May 2025

Master's, Computer and Information Sciences

Bennett University July 2017 – May 2021

Bachelor's, Computer Science and Engineering

SKILLS & INTERESTS

Languages: Python, SQL, Scala, Shell Scripting, R, T-SQL

Big Data & Processing: Apache Spark, PySpark, Hadoop (HDFS, YARN, MapReduce), Hive, Airflow, Kafka

Cloud Platforms:

AWS: S3, EMR, Glue, Redshift, Lambda, IAM

Azure: Databricks, Data Factory, Synapse, ADLS

Databases & Warehousing: Snowflake, PostgreSQL, MySQL, Oracle, DB2, Redshift, Hive

Data Visualization: Power BI, Tableau, Looker, QlikView

CI/CD & DevOps: Jenkins, Terraform, Git, GitLab

Data Governance & Quality: Apache Atlas, Collibra, Informatica Data Quality, Talend, Trifacta


