Bala Sai Irrinki
Data Engineer
United States
***************@*****.***
https://www.linkedin.com/in/narayana-de/
Summary
Results-driven Senior Data Engineer with 4+ years of experience architecting scalable data pipelines, ETL/ELT processes, and real-time streaming analytics across cloud platforms (AWS, Azure, GCP). Proven expertise in big data technologies (Apache Spark, Hadoop, Kafka), data warehousing (Snowflake, Redshift, BigQuery), and machine learning operations (MLOps) serving the healthcare, financial services, e-commerce, retail, manufacturing, and telecommunications industries. Strong background in data governance, business intelligence, DevOps practices, and enterprise data architecture, with a demonstrated ability to optimize performance, reduce costs, and drive data-driven decision making.
Core Technical Expertise: Data Pipeline Engineering • ETL/ELT Development • Real-time Stream Processing • Cloud Data Architecture • Big Data Analytics • Machine Learning Engineering • Data Warehousing • Business Intelligence • DevOps • Database Optimization • API Development • Data Governance
Experience
Data Engineer | PNC Bank, TX | SEP 2023 – PRESENT
• Architected enterprise-scale data pipelines processing 500+ million financial transactions daily using Apache Spark, AWS EMR, Hadoop ecosystem, and Kafka streams, implementing real-time fraud detection algorithms, automated reconciliation processes, and reducing processing latency by 60% while ensuring PCI DSS, SOX, and Basel III compliance
• Established comprehensive data governance frameworks using Great Expectations, Apache Griffin, Apache Atlas, and DataHub with automated data validation, lineage tracking, and metadata management, achieving 99.95% accuracy across regulatory reporting (CCAR, DFAST) and enabling self-service analytics for 500+ business users
• Designed real-time streaming architectures using Apache Kafka, AWS Kinesis, Apache Flink, and Storm for fraud detection, anti-money laundering (AML), risk monitoring, and algorithmic trading, processing 10M+ events per second with sub-50ms latency and improving fraud identification accuracy by 35%
• Integrated machine learning models (XGBoost, Random Forest, Neural Networks) with streaming pipelines using AWS SageMaker, MLflow, Kubeflow, and Kubernetes, enabling real-time credit scoring, risk assessment, customer segmentation, and A/B testing with 95% prediction accuracy
• Built automated ETL/ELT workflows using AWS Glue, Apache Airflow, Databricks, and custom Python frameworks for regulatory reporting, data warehouse automation, and business intelligence, implementing CDC, incremental loading, and dimensional modeling, reducing processing time by 70%
• Optimized data warehouse performance on AWS Redshift, Snowflake, and BigQuery by implementing columnar storage, partitioning strategies, materialized views, query optimization, and automated maintenance procedures, resulting in 40% faster query execution and 25% cost reduction
• Developed machine learning feature engineering pipelines and MLOps infrastructure using PySpark, Apache Spark MLlib, AWS SageMaker, Docker, and CI/CD integration (Jenkins, GitLab) for credit risk modeling, predictive analytics, automated model training, validation, and deployment reducing time-to-production from weeks to hours
• Led cloud migration initiative from legacy on-premises infrastructure to AWS cloud-native architecture using Infrastructure as Code (Terraform, CloudFormation), lift-and-shift strategies, disaster recovery automation, and cost optimization, achieving 30% operational cost reduction while implementing monitoring and alerting with CloudWatch and Grafana
• Collaborated with cross-functional teams including Risk Management, Compliance, Quantitative Research, and Business Intelligence; mentored 3 junior data engineers; established coding standards and best-practices documentation; and delivered data-driven insights using Tableau, Power BI, and custom dashboards, improving team productivity by 40%
Key Technologies: AWS (S3, EMR, Glue, Redshift, Lambda, Kinesis, SageMaker), Apache Spark, Kafka, Airflow, Python, SQL, Snowflake, Databricks, Docker, Kubernetes, Terraform, Jenkins, Tableau
Data Engineer | DataOne Solutions, India | JAN 2020 – JUN 2022
• Designed and implemented scalable ETL/ELT pipelines using Apache Spark, Python, the Hadoop ecosystem, and multi-cloud environments (AWS, Azure, GCP), processing 100+ TB of healthcare (EHR, clinical trials), financial (trading, banking), e-commerce (customer analytics), manufacturing (IoT sensors), and telecommunications (CDR) data daily with 99.9% uptime
• Built real-time streaming data architectures using Apache Kafka, Kinesis, Storm, Flink, and Pulsar for fraud detection, patient monitoring, customer personalization, predictive maintenance, and network optimization, achieving sub-second latency and supporting 10M+ concurrent data streams across multiple industry verticals
• Developed machine learning data pipelines and feature engineering workflows using PySpark, Databricks, MLflow, Kubeflow, scikit-learn, and TensorFlow for predictive analytics including retail inventory optimization, credit risk assessment, healthcare outcome prediction, customer churn modeling, and personalized recommendation systems
• Implemented comprehensive data quality monitoring, validation frameworks, and data governance using Great Expectations, Apache Griffin, dbt, Apache Atlas, and custom Python scripts, ensuring HIPAA, SOX, GDPR, FDA compliance and reducing data errors by 85% across healthcare, financial, and e-commerce domains
• Built enterprise data warehousing solutions using Snowflake, Redshift, BigQuery, Azure Synapse, and Teradata with dimensional modeling, star schema design, partitioning strategies, performance optimization, and automated ETL processes for business intelligence, regulatory reporting, and advanced analytics across multiple business units
• Architected microservices-based data processing platforms using Docker, Kubernetes, Apache Airflow, and cloud-native services supporting concurrent data operations for manufacturing IoT sensors, telecom CDRs, retail POS systems, financial market data feeds, and healthcare monitoring devices
• Created comprehensive data lake architectures using Delta Lake, Apache Iceberg, AWS S3, Azure Data Lake, and GCS with schema evolution, ACID transactions, and time travel capabilities, managing structured, semi-structured, and unstructured data from APIs, databases, files, streaming platforms, IoT devices, and external data sources
• Implemented CI/CD pipelines, DevOps practices, and Infrastructure as Code using Jenkins, GitLab CI, Azure DevOps, Terraform, and Ansible, automating deployment of data engineering workflows across development, staging, and production environments while optimizing database performance across PostgreSQL, MongoDB, Cassandra, Elasticsearch, and Redis clusters improving response times by 60%
• Developed REST APIs, GraphQL endpoints, data services, and self-service analytics platforms using FastAPI, Django, Spring Boot, React, and D3.js, enabling 500+ business users across analytics, data science, reporting, and operations teams to consume standardized datasets, real-time metrics, and interactive dashboards for executive reporting and strategic decision making
Key Technologies: Python, Apache Spark, Kafka, Airflow, Hadoop, Snowflake, AWS, Azure, GCP, Docker, Kubernetes, PostgreSQL, MongoDB, Elasticsearch, Databricks, MLflow, Tableau, Power BI, Terraform, Jenkins
Education
Master of Science in Computer Science, University of Texas at Arlington, USA
Bachelor of Technology in Computer Science, CMR College of Engineering & Technology, India
Technical Skills
• Programming Languages & Frameworks: Python, SQL, PySpark, Scala, Java, R, Shell Scripting, Pandas, NumPy, FastAPI, Django, Flask
• Big Data & Streaming Technologies: Apache Spark, Hadoop (HDFS, YARN, MapReduce), Apache Kafka, Apache Storm, Apache Flink, Apache Beam, Databricks, Apache Airflow, Apache NiFi, Kinesis Data Streams, Event Hubs
• Cloud Platforms & Services:
o AWS: S3, EC2, EMR, Glue, Lambda, Redshift, Kinesis, CloudFormation, IAM, RDS, DynamoDB, Athena, SageMaker, Step Functions, CloudWatch
o Azure: Data Factory, Synapse Analytics, Blob Storage, HDInsight, Stream Analytics, Power BI, Machine Learning, Event Hubs, Cosmos DB
o GCP: BigQuery, Cloud Dataflow, Cloud Storage, Dataprep, Cloud Composer, Pub/Sub, Cloud Functions, Dataproc, Cloud SQL
• Data Warehousing & Analytics: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, Teradata, Oracle Exadata, Dimensional Modeling, Star Schema, OLAP, Data Marts
• Databases & Storage: PostgreSQL, MySQL, MongoDB, Cassandra, Redis, Elasticsearch, Neo4j, InfluxDB, HBase, Delta Lake, Apache Iceberg, Parquet, Avro, ORC
• Data Visualization & BI Tools: Tableau, Power BI, Looker, Grafana, Apache Superset, D3.js, Matplotlib, Seaborn, Plotly, Qlik Sense
• DevOps & Infrastructure: Docker, Kubernetes, Jenkins, GitLab CI/CD, Terraform, Ansible, AWS CloudFormation, Azure Resource Manager, Infrastructure as Code (IaC), Monitoring & Alerting
• Machine Learning & AI: Scikit-learn, TensorFlow, PyTorch, MLflow, Kubeflow, Feature Engineering, Model Deployment, A/B Testing, Statistical Analysis, Time Series Analysis
• Data Governance & Quality: Apache Atlas, DataHub, Great Expectations, Apache Griffin, Data Lineage, Metadata Management, GDPR, HIPAA, SOX Compliance, Data Cataloging
Technical Projects
Multi-Cloud Enterprise Data Platform
• Architected and implemented an enterprise data platform spanning AWS, Azure, and GCP using Terraform, Docker, and Kubernetes, processing 50+ TB daily with automated data cataloging, lineage tracking, cross-cloud replication, and disaster recovery. Integrated Apache Spark, Kafka, and Airflow with cloud-native services, reducing vendor lock-in risk and achieving 99.95% uptime.
Real-time Customer Intelligence & Analytics Platform
• Built an end-to-end streaming analytics platform using Apache Kafka, Spark Streaming, Elasticsearch, and Kibana providing real-time customer behavior insights, personalization, and predictive analytics across web, mobile, and IoT channels for a major retail chain. Implemented machine learning models for churn prediction, recommendation engines, and dynamic pricing optimization.
Financial Regulatory Compliance Automation
• Developed an automated compliance reporting system using Apache Airflow, PySpark, and cloud data warehouses generating CCAR, DFAST, Basel III, and AML reports with 99.9% accuracy. Implemented data lineage tracking, automated testing, and regulatory change management, reducing manual effort from 200 hours to 20 hours monthly and ensuring audit readiness.
Healthcare Data Integration & Analytics
• Created a HIPAA-compliant healthcare data platform integrating EHR, clinical trial, genomics, and IoT medical device data using HL7 FHIR standards, Apache Kafka, and cloud data lakes. Built machine learning pipelines for clinical decision support, patient outcome prediction, and population health analytics serving 2M+ patient records.