Santosh Chaitanya Reddy
DATA ENGINEER
USA | +1-512-***-**** | *********************@*****.*** | LinkedIn

SUMMARY
Data Engineer with over 4 years of experience building scalable data pipelines, cloud-based solutions, and big data platforms. Expertise in designing and implementing data-intensive applications on the Hadoop ecosystem, Spark, Kafka, Flink, and cloud services such as Azure Data Lake and Databricks. Skilled in both batch and real-time data processing, delivering high-throughput, low-latency analytics solutions. Proficient in migrating data from on-premises SQL databases to cloud environments while optimizing performance, security, and cost efficiency. Strong background in data warehousing, data modeling, and reporting, with hands-on experience in tools such as Hive, HBase, Sqoop, and Oozie. Adept at delivering end-to-end data engineering solutions that improve decision-making, streamline operations, and support enterprise analytics initiatives.

SKILLS
Methodologies: SDLC, Agile, Waterfall
Programming Language: Python, R, SQL, Scala
IDEs: PyCharm, Jupyter Notebook
Packages: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow, ggplot2
Databases: MySQL, SQL Server, PostgreSQL, Oracle
Big Data Ecosystem: Hadoop, MapReduce, Hive, Pig, Apache Spark, Sqoop, PySpark, Snowflake, HDFS
ETL Tools: SSIS, Apache NiFi, Apache Kafka, Talend, Apache Airflow, Informatica
Cloud Technologies: AWS, Azure, GCP, Databricks
Reporting Tools: Tableau, Power BI, SSRS
Version Control Tools: Git, GitHub, GitLab
Other Skills: Data Cleaning, Data Wrangling, Critical Thinking, Communication & Presentation Skills, Problem-solving, Data Management
Operating Systems: Windows, Linux, Mac
EXPERIENCE
Optum, USA Sr. Data Engineer Jan 2025 – Present
Designed and optimized data pipelines using Hadoop and Hive to process and store large-scale healthcare claims data, reducing overall processing time by 30% and improving data accessibility.
Built predictive analytics workflows with Apache Spark by implementing Random Forest and Gradient Boosting models in Python, enabling anomaly detection and handling high-volume monthly financial datasets.
Developed scalable data processing scripts in Azure Synapse notebooks leveraging PySpark to clean, transform, and normalize JSON data, storing results in Parquet format on ADLS Gen2 for efficient querying and analytics (see the sketch after this role's bullets).
Implemented end-to-end ETL pipelines with Azure Data Factory to integrate data from Epic EHR systems into the enterprise data warehouse, ensuring optimized data flow, integrity, and quality.
Created interactive Power BI dashboards and reports for investment analytics, applying K-Means Clustering and Time Series Forecasting to uncover trends and support data-driven investment decisions.
Engineered custom Apache Airflow operators and sensors in Python to automate validation, enrichment, and transformation processes, enhancing data quality and reducing processing time by 20%.
Maintained and enhanced supply chain data pipelines using Apache Spark and Hive, improving accuracy in reporting and delivering a 25% reduction in processing time across analytics workloads.
Extracted and migrated data from SQL Server into HDFS, leveraging Hive and Pig for preprocessing and retrieval, supporting advanced model development and downstream analytics.
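For illustration, a minimal sketch of the PySpark JSON-to-Parquet normalization pattern referenced above; the storage account, container paths, and column names are hypothetical placeholders, not the production pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("claims-normalize").getOrCreate()

    # Read raw JSON from a hypothetical ADLS Gen2 container.
    raw = spark.read.json("abfss://raw@examplestore.dfs.core.windows.net/claims/")

    # Clean and normalize: deduplicate, coerce types, drop records with null keys
    # (claim_id, service_date, and amount are assumed column names).
    cleaned = (
        raw.dropDuplicates(["claim_id"])
           .withColumn("service_date", F.to_date("service_date"))
           .withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("claim_id").isNotNull())
    )

    # Persist as partitioned Parquet for efficient downstream querying.
    (cleaned.write
            .mode("overwrite")
            .partitionBy("service_date")
            .parquet("abfss://curated@examplestore.dfs.core.windows.net/claims/"))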
Goldman Sachs, USA Data Engineer Jul 2024 – Dec 2024
Designed and optimized Spark-based data transformations, reducing pipeline processing time by 30% and improving overall efficiency by 15%.
Implemented and managed CI/CD pipelines using Jenkins, deploying 10+ applications with a 32% increase in deployment speed and reliability.
Built and deployed real-time analytics pipelines using Azure Stream Analytics and Azure Event Hubs, enabling near real-time insights and event-driven processing.
Improved Hive schema design through performance tuning, reducing data processing time by 30% and enhancing query performance by 45%.
Configured and optimized AWS services (EC2, S3, Redshift, Lambda), achieving a 25% reduction in processing time and a 20% boost in system performance.
Automated and scheduled data workflows with Apache Airflow, ensuring reliable orchestration, error handling, and timely data delivery across the infrastructure.
Developed data normalization and consolidation pipelines using AWS Glue, cutting redundancy by 40%, improving data quality by 50%, and enhancing ML model accuracy by 15%.
Architected and deployed a Kafka-based streaming framework to handle large-scale real-time data ingestion and distribution across enterprise applications.
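A hedged sketch of the kind of Kafka producer such a streaming framework builds on, using kafka-python; the broker addresses, topic name, and event fields are placeholders rather than the deployed system.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092", "broker2:9092"],  # placeholder brokers
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",    # wait for full acknowledgement for durability
        linger_ms=10,  # small batching window to improve throughput
    )

    # Hypothetical event pushed to a hypothetical "trades" topic.
    event = {"trade_id": "T-1001", "symbol": "XYZ", "qty": 100}
    producer.send("trades", value=event)
    producer.flush()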
Virtusa, India Data Engineer Jun 2021 – Aug 2022
Designed and optimized ETL pipelines using Python, SQL, and Hadoop, improving data processing efficiency by 40% for structured healthcare datasets.
Built and deployed real-time data streaming solutions with Apache Kafka, integrating schema validation to enhance data quality by 30% for pharmacy inventory analytics (see the sketch after this role's bullets).
Migrated on-premises SSIS workflows to AWS Glue and S3, reducing infrastructure costs and enabling scalable, cloud-native data processing.
Optimized batch processing on AWS EMR with Apache Spark, decreasing pipeline execution time by 30% and improving accuracy of downstream analytics.
Developed automated CI/CD pipelines using Jenkins and Docker, standardizing deployments and reducing release errors by 25%.
Performed ad-hoc and large-scale analytics with Amazon Athena and built interactive dashboards in Amazon QuickSight, enabling data-driven healthcare decisions through real-time KPI visibility.
Applied Agile practices in data engineering projects, accelerating delivery of healthcare data initiatives and shortening development cycles by 20% through improved team collaboration.
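For illustration, a minimal sketch of the schema-validation step described in the Kafka bullet above, pairing kafka-python with jsonschema; the topic, brokers, and record schema are hypothetical.

    import json
    from jsonschema import validate, ValidationError
    from kafka import KafkaConsumer

    # Hypothetical pharmacy-inventory record schema.
    INVENTORY_SCHEMA = {
        "type": "object",
        "properties": {
            "sku": {"type": "string"},
            "qty_on_hand": {"type": "integer", "minimum": 0},
        },
        "required": ["sku", "qty_on_hand"],
    }

    consumer = KafkaConsumer(
        "pharmacy-inventory",              # hypothetical topic
        bootstrap_servers=["broker1:9092"],
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    for message in consumer:
        try:
            validate(instance=message.value, schema=INVENTORY_SCHEMA)
            # ...hand valid records to the downstream pipeline...
        except ValidationError as err:
            # In practice, route rejects to a dead-letter path instead of printing.
            print(f"Rejected record: {err.message}")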
Nefroverse Technologies, India Jr. Data Engineer Jun 2019 – Jun 2021
Designed and optimized ETL pipelines using Python, SQL, and Hadoop, improving processing efficiency of structured healthcare datasets by 40%.
Built real-time streaming architectures with Apache Kafka, integrating schema validation to enhance data quality by 30% for pharmacy inventory analytics.
Migrated on-premises SSIS workflows to AWS Glue and S3, reducing infrastructure costs while enabling scalable and cloud-native data processing.
Optimized batch processing with Apache Spark on AWS EMR, reducing pipeline execution time by 30% and improving downstream analytics accuracy.
Developed automated CI/CD pipelines using Jenkins and Docker, standardizing deployment workflows and decreasing deployment errors by 25%.
Performed ad-hoc and large-scale analytics with Amazon Athena and created interactive dashboards in Amazon QuickSight, enabling faster decision-making on key healthcare KPIs (see the sketch after this role's bullets).
Applied Agile practices across data engineering projects, reducing delivery cycles by 20% and driving better collaboration with cross-functional teams.
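A hedged sketch of the Athena ad-hoc query pattern via boto3, as referenced in the analytics bullet above; the database, table, query, and S3 output location are placeholders.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Submit an asynchronous query against a hypothetical healthcare database.
    response = athena.start_query_execution(
        QueryString="SELECT sku, SUM(qty_on_hand) AS total FROM inventory GROUP BY sku",
        QueryExecutionContext={"Database": "healthcare_db"},                # placeholder
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder
    )

    # Poll get_query_execution with this ID for status, then fetch results.
    print(response["QueryExecutionId"])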
EDUCATION
M.S. in Information Technology Project Management
Indiana Wesleyan University, IN
B. Tech in Computer Science
Sri Indu College of Engineering & Technology