Sai Jallu
Dallas, Texas, USA +1-415-***-**** *************@*****.*** LinkedIn
SUMMARY
Data Engineer with 5+ years of experience building scalable data pipelines and distributed data systems across AWS and Azure environments. Strong expertise in PySpark, Databricks, and ETL frameworks, with hands-on experience in real-time streaming using Kafka and Kinesis. Proven track record of improving pipeline performance by up to 40%, optimizing cloud costs, and delivering reliable data solutions for healthcare and large-scale analytics use cases.
WORK EXPERIENCE
BCBS Texas, USA
Senior Data Engineer Aug 2025 – Present
Owned end-to-end ETL pipelines, optimizing PySpark and Spark SQL jobs that process large-scale healthcare claims and member eligibility data, reducing data processing time by 25%.
Built data pipelines on Azure (Databricks, Data Factory, ADLS, Synapse Analytics), enabling scalable processing of claims, provider, and clinical datasets.
Processed 2M+ records per month using Hadoop ecosystem tools (Hive, HBase, MapReduce), improving data ingestion efficiency and pipeline stability.
Migrated 50+ complex Hive/SQL queries to Spark (PySpark/Scala), improving query performance and reducing execution time by 25% for downstream analytics.
Designed and delivered Power BI dashboards for claims analysis, member utilization, and cost reporting, supporting business and operational decision-making.
Engineered ETL workflows in Azure Databricks to integrate data from multiple sources including EHR systems, claims platforms, and third-party data providers.
Developed a reusable Spark and Hive-based ETL framework, reducing overall data processing time by 30% and improving data consistency across pipelines.
Implemented automated Databricks workflows for parallel data processing, increasing pipeline throughput by 40% and improving job reliability by 30%.
Built and optimized SSIS/SSRS solutions for data integration and reporting, supporting enterprise-level healthcare reporting requirements.
Integrated PyTorch-based machine learning models into data pipelines to support patient risk scoring and predictive analytics, improving model accuracy by 15%.
Created PowerShell scripts to monitor and manage Azure resources, improving resource utilization and supporting cost optimization efforts.
George Mason University, Virginia, USA
Research Analyst/Teaching Assistant May 2024 – May 2025
Supported instruction and facilitation of a graduate-level course covering business analytics, data ethics, and data visualization, assisting with assignments, student queries, and case-based discussions for 80+ students.
Designed and demonstrated an AWS-based data pipeline architecture (S3, Glue, Lambda) to help students understand how enterprise-scale data flows and downstream systems operate, improving conceptual clarity and hands-on learning.
Taught and graded MSBA 610: Essentials for Business Analytics, covering topics in data ethics and data-driven decision-making.
Built Tableau dashboards using data generated from AWS pipelines to illustrate real-world analytics use cases, enabling students to better interpret KPIs, trends, and business metrics.
Tata Consultancy Services India
Data Engineer Oct 2021 – Feb 2023
Worked with open-source big data technologies including Apache Hadoop, Hive, Spark, Impala, HDFS, and YARN, as well as Teradata.
Automated pre-processing tasks such as cleaning and formatting with Python, VB, and PowerShell scripts and Excel macros, reducing manual workload by up to 85%.
Developed ad-hoc financial reports and interactive dashboards using BI tools such as Microsoft Power BI, SSRS, SSAS, SAP BO, and Excel.
Designed and optimized complex ETL workflows in Informatica PowerCenter, enhancing data integration and processing efficiency across multiple data sources, leading to a 30% reduction in data processing time.
Performed ETL operations using Python, Spark SQL, S3, and Redshift on terabytes of data to derive customer insights; heavily involved in setting up CI/CD pipelines using Jenkins, Terraform, and AWS.
Used AWS EMR to transform and move large volumes of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Transformed data using AWS Glue dynamic frames with PySpark; cataloged the transformed data with Glue Crawlers and scheduled jobs and crawlers using the Glue Workflows feature.
Designed and managed public/private cloud infrastructure on Amazon Web Services (AWS), including EC2, S3, CloudFront, Elastic File System, and IAM, enabling automated operations; deployed CloudFront to deliver content and reduce load on origin servers.
Developed and managed data integration workflows using Informatica Cloud (IICS), streamlining data migration and synchronization across multiple cloud and on-premises systems.
Airbnb India
Junior Data Engineer May 2020 – Sep 2021
● Engineered scalable ETL pipelines using PySpark, AWS Glue, and Airflow, processing over 180 million user event logs monthly to support data-driven recommendations across the Airbnb marketplace.
● Optimized data ingestion workflows from PostgreSQL and Snowflake into AWS S3 and Redshift, reducing pipeline execution time by 3.5 hours daily while enhancing collaboration between data science and analytics teams.
● Developed automated data quality validation scripts in Python (Pandas, NumPy) and SQL, improving reliability of analytics dashboards used by 20+ business analysts for host engagement reporting.
● Integrated real-time streaming data from Kafka and Kinesis into AWS EMR, ensuring near real-time updates for pricing insights and boosting operational efficiency for revenue management teams.
● Collaborated with cross-functional teams in Agile sprints using Git and Jenkins CI/CD, ensuring consistent deployment of data pipeline updates and improving stakeholder communication efficiency.
PROJECT EXPERIENCE
Text Annotation and Inter-Rater Agreement Analysis for NLP Tasks
Constructed a labeled dataset of 150 text samples for classification, maintaining balanced categories and annotation consistency, validated with 92% raw agreement and a Cohen's Kappa of 0.88 using pandas and scikit-learn. Executed NLP experiments leveraging PyTorch and GPU acceleration, cutting model training time by 40 minutes per epoch and achieving an F1 score of 0.94 on the final test dataset.
Dev Finder – Android Application
Spearheaded end-to-end Android app development with MySQL and Firebase integration, incorporating real-time chat, location-based matching, and user profiles, leading to a 4.5-star internal rating after 80 beta tests. Implemented feature-driven workflows that reduced user onboarding and task completion time by 90 seconds per session, improving developer collaboration efficiency across 200+ active testers.
CERTIFICATIONS & AWARDS
● AWS Certified Data Engineer – Associate
● AWS Certified Cloud Practitioner – Foundational
● Business Ally Award (TCS): Improved Data Lake efficiency by migrating from Striim to Qlik CDC.
● Cost Optimization Award (TCS): Cut AWS billing by 40% through ETL automation with Lambda and Glue.
EDUCATION
George Mason University Aug 2023 – May 2025
Master's, Computer and Information Sciences
Bennett University July 2017 – May 2021
Bachelor's, Computer Science and Engineering
SKILLS & INTERESTS
Languages: Python, SQL, Scala, Shell Scripting, R, T-SQL
Big Data & Processing: Apache Spark, PySpark, Hadoop (HDFS, YARN, MapReduce), Hive, Airflow, Kafka
Cloud Platforms:
AWS: S3, EMR, Glue, Redshift, Lambda, IAM
Azure: Databricks, Data Factory, Synapse, ADLS
Databases & Warehousing: Snowflake, PostgreSQL, MySQL, Oracle, DB2, Redshift, Hive
Data Visualization: Power BI, Tableau, Looker, QlikView
CI/CD & DevOps: Jenkins, Terraform, Git, GitLab
Data Governance & Quality: Apache Atlas, Collibra, Informatica Data Quality, Talend, Trifacta