SINDHU REDDY KANDALA
+1-630-***-**** *********@*****.***
PROFESSIONAL SUMMARY
Data Analyst / Data Engineer with 3+ years of experience leveraging large-scale healthcare, pharmaceutical, and enterprise datasets to uncover actionable insights, improve business performance, and support data-driven decision-making. Strong expertise in SQL, Python (Pandas, NumPy), statistical analysis, exploratory data analysis (EDA), data visualization, and stakeholder communication. Experienced in building scalable data pipelines, transforming complex datasets, identifying behavioral and business patterns, and delivering analytical solutions through Power BI and Tableau dashboards.
Proven ability to collaborate with product, business, and technical teams to define metrics, execute data initiatives, and translate analytical findings into strategic recommendations. Hands-on experience with cloud platforms including AWS and Azure, distributed data processing technologies such as Spark and Kafka, and healthcare data environments. Holds a Master of Science in Business Analytics and a Bachelor of Science in Statistics & Computer Science, with a strong passion for solving complex business problems through data.
TECHNICAL SKILLS
•Programming Languages:
Python, SQL (T-SQL, Spark SQL), PySpark, R (basic), Scala (basic)
•Cloud Platforms:
AWS: S3, Glue, Lambda, Redshift, EMR, EC2, Athena
Azure: Databricks, Data Factory (ADF), Data Lake, Azure SQL, Azure DevOps CI/CD
GCP: BigQuery, GCP SDKs
•Big Data & Streaming:
Apache Spark, PySpark, Hadoop, Hive, Kafka, Spark Streaming, MapReduce
•ML & AI Tools:
Feature Engineering, Data Preprocessing, Spark ML, SageMaker (basic), Pandas, NumPy, Scikit-learn (basic), Statistical Analysis, Predictive Analytics, GenAI & LLM Application Awareness (RAG patterns, prompt engineering concepts, agentic workflows)
•Orchestration & DevOps:
Apache Airflow, GitLab, Git, Bitbucket, Docker, Kubernetes, Terraform, CI/CD pipelines, SonarQube, GitHub Copilot (AI-assisted development)
•Databases & Warehouses:
Snowflake, Redshift, PostgreSQL, MySQL, SQL Server, Oracle, MongoDB, CosmosDB, Cassandra
•Data Formats & Ingestion:
JSON, Parquet, REST APIs, Relational & Cloud Databases, Fixed-width / Mainframe files
•Visualization & BI:
Power BI, Tableau
•Data Governance:
Data Quality, Metadata Management, Data Lineage, Privacy Compliance, HIPAA Compliance, PHI Data Management
•Data Modeling:
Dimensional Modeling (Star & Snowflake Schemas), OLAP Modeling, Data Warehouse Design, Data Lakehouse Architecture
•Healthcare Data:
EDI 835, EDI 837, Claims Processing, HL7, FHIR, Healthcare Analytics
•Workflow Orchestration:
Apache Airflow, Prefect
PROFESSIONAL EXPERIENCE
Cloud Data Engineer, Infovity (Jul 2025 – Present)
•Designed and deployed end-to-end, ML-ready data pipelines using Azure Data Factory, AWS Glue, and PySpark to support predictive analytics and model training, reducing time-to-insight by 30%.
•Built and maintained scalable ETL/ELT pipelines ingesting structured and semi-structured data (JSON, Parquet, REST APIs, Oracle, cloud databases) into distributed data lakes on AWS S3 and Azure Data Lake.
•Developed feature engineering pipelines with PySpark and Pandas to generate statistical indicators and aggregates for downstream machine learning models, improving baseline model accuracy.
•Orchestrated complex multi-step data workflows with Apache Airflow, automating scheduling and monitoring of 15+ production pipelines and reducing manual intervention by ~40%.
•Used GitHub Copilot to accelerate development of PySpark transformation scripts and SQL queries, reducing time spent on boilerplate pipeline code.
•Processed high-volume datasets using Spark, Kafka, and Hadoop, optimizing Spark transformations to cut data latency by 25% across ML preparation workloads.
•Built analytical datasets in Redshift and Snowflake to support forecasting, reporting, and BI use cases; tuned SQL queries to cut query runtimes by 20%.
•Designed and maintained dimensional data models, data warehouse structures, and lakehouse environments supporting Tableau reporting, predictive analytics, and enterprise business intelligence initiatives.
•Improved database performance through SQL tuning and indexing strategies, speeding up reporting and removing execution bottlenecks.
•Developed Python-based REST API ingestion scripts to extract and load data into AWS S3 for downstream analytics and model development.
•Applied data security best practices including hashing and encryption to protect sensitive data within AI/ML systems, supporting data governance and privacy compliance.
•Created Tableau dashboards visualizing model outputs, user trends, and business KPIs for non-technical stakeholders, accelerating data-driven decision-making.
•Partnered with business users, analytics teams, and executive stakeholders to understand business processes, gather requirements, and deliver scalable data engineering solutions aligned with organizational objectives.
•Implemented data quality monitoring, validation rules, and governance controls to improve data reliability, consistency, and accuracy across enterprise platforms.
•Collaborated cross-functionally with data scientists, product managers, and business stakeholders, translating complex business problems into scalable data and AI solutions.
Environment: AWS (S3, Glue, Redshift, Lambda, EC2), Azure Data Factory, Apache Airflow, Kafka, PySpark, Spark, Hadoop, Snowflake, Python, SQL, Tableau, GitLab, MongoDB, Cassandra, GitHub Copilot
Data Engineer, UCB (Feb 2024 – Mar 2025)
•Built scalable end-to-end pipelines using Azure Databricks and Spark to process large datasets supporting machine learning and advanced analytics for pharmaceutical research.
•Architected training and inference data lakes with Raw, Harmonized, Refined, Curated, and Analytic layers, enabling structured ML experimentation and model reproducibility.
•Performed feature extraction and transformation using Spark SQL for downstream predictive models, reducing feature preparation time by approximately 35%.
•Implemented real-time and batch pipelines using Kafka and Spark Streaming for near-real-time model data ingestion, cutting data-availability lag from hours to minutes.
•Migrated on-premises datasets to cloud-based ML platforms using Azure Data Lake and Snowflake, decommissioning legacy on-prem infrastructure and reducing storage costs by 20%.
•Supported migration of Mainframe batch data and fixed-width file formats into cloud-based analytical platforms, integrating legacy enterprise datasets with modern ETL pipelines.
•Participated in converting SAS and Teradata workloads into Spark-based pipelines, enabling AI-ready data processing and reducing batch runtimes by 30%.
•Implemented data validation and quality checks ensuring accuracy and consistency of ML training datasets; maintained data lineage, metadata management, and governance for ML compliance.
•Processed healthcare and pharmaceutical datasets while ensuring HIPAA compliance and secure handling of PHI-sensitive information across cloud-based analytics environments.
•Supported ingestion and transformation of healthcare claims and remittance-related datasets aligned with EDI 835 and EDI 837 transaction structures for downstream reporting and analytics.
•Worked with HL7 and FHIR healthcare data standards to integrate clinical and operational datasets into centralized analytical platforms.
•Collaborated with analytics and business teams to prepare healthcare datasets for predictive modeling, reporting, and operational performance measurement.
•Performed data profiling, reconciliation, and quality analysis to identify data patterns, improve reliability, and ensure consistency across healthcare data assets.
•Worked in a cross-functional Agile squad with engineers, analysts, data scientists, and business stakeholders, communicating clearly and focusing on practical solutions.
Environment: Azure Databricks, Azure Data Lake, Azure Pipelines (CI/CD), Kafka, Spark SQL, Spark Streaming, Snowflake, Teradata, Azure SQL, S3, Python, PySpark, Airflow, HL7, FHIR, EDI 835/837, HIPAA
Data Engineer, Amazon (Jun 2022 – Aug 2023)
•Designed and implemented Python-based end-to-end pipelines generating training datasets for predictive sales analytics models, cutting manual dataset preparation time by 40%.
•Cleaned and transformed high-volume transactional data for ML model development, applying normalization, deduplication, and null-handling logic across hundreds of millions of records.
•Designed Snowflake schemas optimized for ML feature storage and retrieval, improving query performance by 35% and enabling faster model iteration cycles.
•Automated Python-based anomaly detection to identify unusual sales patterns, reducing false-positive alerts by 25%.
•Improved pipeline efficiency by 40% by automating feature generation workflows, eliminating redundant manual steps and standardizing transformation logic.
•Built Power BI dashboards to visualize model predictions, sales forecasts, and business KPIs for leadership and non-technical stakeholders.
•Implemented data quality checks ensuring consistency of AI training data across heterogeneous source systems.
•Supported batch-oriented data processing and integration of legacy enterprise datasets into modern reporting pipelines.
•Developed complex SQL transformations and optimized reporting datasets to support downstream analytics, dashboarding, and business intelligence initiatives.
•Collaborated with cross-functional teams to automate reporting workflows and improve accessibility of enterprise data assets.
•Collaborated with data analysts and data scientists to prepare datasets for forecasting and trend prediction models, contributing to improved forecast accuracy.
Environment: Python, SQL Server, Azure Data Factory, Snowflake, Power BI, Git, AWS S3
EDUCATION
Master of Science in Business Analytics — Lewis University, Romeoville, IL (Aug 2023 – May 2025)
Bachelor of Science in Statistics & Computer Science — Nizam College, India (Jun 2019 – May 2022)
ACADEMIC PROJECTS
Predictive Healthcare Analytics Using Machine Learning, Lewis University (Jan 2025 – May 2025)
•Built an end-to-end healthcare analytics solution using Python, SQL, Pandas, and Scikit-learn to analyze patient records and predict potential health risks, demonstrating full-lifecycle ML pipeline development.
•Performed data cleansing, feature engineering, and exploratory data analysis (EDA) on structured healthcare datasets, improving data quality and boosting model performance by refining input feature sets.
•Developed and evaluated multiple ML classification models; applied hyperparameter tuning and cross-validation to optimize prediction accuracy, achieving measurable improvement over baseline models.
•Designed interactive Power BI dashboards to visualize patient trends, risk scores, and healthcare KPIs, enabling data-driven decision-making for non-technical stakeholders.
•Presented analytical findings and actionable recommendations to faculty stakeholders, demonstrating the business value of predictive analytics in improving healthcare outcomes and resource planning.
Tools: Python, SQL, Pandas, Scikit-learn, Power BI, Feature Engineering, Statistical Analysis, Machine Learning
Retail Sales Data Analysis and Forecasting System, Bachelor's Degree (Aug 2021 – Dec 2021)
•Developed an end-to-end data analytics solution using Python, SQL, and Excel to analyze historical retail sales data and surface actionable business trends across multiple product categories.
•Designed and optimized SQL queries to extract, transform, and analyze large datasets from multiple data sources, reducing report generation time and improving data accessibility.
•Applied statistical analysis and forecasting techniques (time series analysis, trend extrapolation) to predict future sales performance, directly supporting business planning and inventory strategy.
•Created interactive reports and visualizations tracking revenue, customer behavior, and product performance metrics, enabling leadership to monitor KPIs in real time.
•Collaborated with faculty mentors and team members to deliver actionable insights, strengthening understanding of data-driven business strategies and cross-functional communication.
Tools: Python, SQL, Excel, Statistical Analysis, Data Visualization, Forecasting
HONORS & AWARDS / CERTIFICATIONS
Data Engineering Professional Certification (DEPC)
Certified Data Pipeline Specialist (CDPS)