Anudeep Banala
Frisco, TX | Open to Relocate | Ph: +1-774-***-**** | E-mail: *************.**@*****.*** | LinkedIn | GitHub

PROFESSIONAL SUMMARY
Data Engineer with 4+ years of experience delivering $1M+ in business value across finance and healthcare. Proven track record of building end-to-end AI-driven data solutions, deploying models on cloud platforms (AWS, Azure, GCP), and driving measurable business impact through fraud detection, Gen AI automation, and real-time analytics. Expert in leveraging Generative AI, RAG, and OpenAI integration for intelligent analytics while optimizing high-performance data pipelines with PySpark, Databricks, and Snowflake. Skilled at translating complex financial and healthcare data into actionable insights that improve KPIs, reduce costs by 40-60%, and enhance regulatory compliance.

SKILLS
Programming & Frameworks: Python (PySpark, Pandas, NumPy), SQL, Scala, Java, Bash, dbt, Apache Beam, TensorFlow, PyTorch, Scikit-learn, FastAPI
AI & Advanced Analytics: Generative AI (Gen AI), Retrieval-Augmented Generation (RAG), Prompt Engineering, OpenAI, LangChain, Machine Learning Integration, LLMs (BERT, GPT), NLP
Big Data & Cloud: Apache Spark, Databricks, Delta Lake, Snowflake, Hadoop, AWS (S3, Redshift, Glue, EMR, Kinesis, Lambda, SageMaker), Azure Synapse, GCP BigQuery, Kafka, Docker, Kubernetes
Data Engineering & Storage: ETL/ELT Pipelines, Apache Airflow, Change Data Capture (CDC), PostgreSQL, MySQL, MongoDB, Cassandra, Redis, DynamoDB, Apache Flink
Data Modeling & Optimization: Star & Snowflake Schemas, Partitioning, Indexing, Query Optimization, Performance Tuning, Clustering Keys, Materialized Views
Security & Compliance: IAM, RBAC, HIPAA, GDPR, SOC 2, PCI DSS, Data Masking, Metadata Management (Apache Atlas, Microsoft Purview, Collibra)
Visualization & BI: Tableau, Power BI, Streamlit, Matplotlib, Seaborn

EXPERIENCE
Data Engineer | Citigroup | Dallas, TX | Jan 2024 - Present
• Built real-time streaming data pipeline using Apache Kafka and Amazon Kinesis to ingest and process 22 million monthly banking transactions, reducing data processing latency from 3 minutes to 70 seconds for downstream fraud detection systems.
• Optimized Snowflake data warehouse performance by implementing clustering keys and materialized views, cutting average query execution time from 12 seconds to 5 seconds for daily portfolio dashboards serving 150+ investment analysts.
• Developed scalable ETL workflows in Databricks using PySpark to process 5 TB of transactional and customer data weekly, transforming raw data into clean, structured formats for credit risk analysis and regulatory reporting.
• Automated data ingestion and transformation pipelines using AWS Glue and Lambda functions, eliminating 350+ hours of manual data processing work annually while maintaining 99.7% data accuracy for financial reconciliation.
• Implemented Generative AI chatbot using OpenAI and RAG architecture to automate financial analyst queries, reducing manual research time by 15-20 hours per month through automated data retrieval and reporting.
• Designed cloud infrastructure using Terraform and GitLab CI/CD pipelines, reducing environment deployment time from 3 days to 8 hours for compliance-critical data systems and ensuring consistent data pipeline deployments.
• Created comprehensive data validation framework using Great Expectations and dbt to ensure data quality across loan management and CRM systems, achieving sub-10-second sync latency between source and target systems.
• Built automated ETL pipelines for regulatory reporting using Apache Airflow, orchestrating complex data workflows to ensure SOC 2 and PCI DSS compliance while processing 500K+ daily financial transactions.
Data Engineer | Accenture | Hyderabad, India | Aug 2020 - Dec 2022
• Designed and maintained HIPAA-compliant ETL pipelines using Apache Airflow to integrate EHR, lab, and insurance claims data from six hospital networks into centralized Snowflake data warehouse, enabling clinical teams to access patient data within 15 seconds.
• Built data processing workflows in Databricks using PySpark to handle 2.3 million patient encounters annually, creating automated data quality checks and validation rules for healthcare analytics and reporting systems.
• Developed automated reporting system using Power BI and SQL to process HL7/FHIR data feeds for patient admission, discharge, and transfer (ADT) workflows, reducing manual data compilation by 80+ hours monthly.
• Created real-time data integration pipelines connecting billing, insurance eligibility, and patient payment systems using Apache Kafka, enabling finance teams to monitor hospital revenue streams in real time.
• Built automated data reconciliation processes for insurance premium calculations and policy management, consolidating multiple data sources through standardized ETL workflows to improve billing accuracy and reduce errors.
• Engineered data pipeline architecture for underwriting and risk assessment workflows, integrating multiple insurance systems and enabling automated profitability analysis for 15+ high-value accounts.
• Implemented comprehensive data security framework including PHI encryption, role-based access controls, and audit logging across all healthcare data pipelines, ensuring HIPAA compliance during external audits.
• Developed data integration solution to standardize and process 7.4 million patient records from various public health databases, creating unified data model for epidemiological analysis and reporting.
• Created automated data monitoring and reconciliation system for insurance claims, reserves, and reinsurance workflows across SAP and external systems, maintaining 97% accuracy in financial data processing.

KEY PROJECTS
Real-Time ETL Pipeline for E-commerce Data
• Built streaming ETL pipeline using Kafka and Spark to process user activity data from e-commerce platform with 10K+ daily events
• Implemented data transformations and loaded results into PostgreSQL and Redis for real-time recommendations and analytics
Financial Data Warehouse with Snowflake
• Created dimensional data warehouse using Snowflake to analyze stock market data with automated daily ETL jobs using Apache Airflow
• Designed star schema and built data quality checks using dbt, achieving 99% data accuracy for financial reporting
Fraud Detection ML Pipeline
• Built end-to-end ML pipeline using Python and Scikit-learn to detect fraudulent transactions with 92% accuracy on credit card dataset
• Implemented data preprocessing, feature engineering, and model training workflows using Apache Airflow for automated retraining
Healthcare Claims Data Pipeline
• Built ETL pipeline using Python and Pandas to process insurance claims CSV files and load clean data into MySQL database
• Implemented data validation rules and automated reporting using Power BI for claims analysis and fraud detection

CERTIFICATIONS
AWS Certified Cloud Practitioner (CLF-C02)
Microsoft Power BI Data Analyst (PL-300)
Python Programming - Coursera

EDUCATION
University of North Texas, Denton, TX May 2024
Master’s degree in Data Science