Sai Nikhil Thappetla Data Engineer
**********@**********.*** 716-***-**** USA LinkedIn
Summary
Experienced Data Engineer with 4+ years building scalable data pipelines, optimizing ETL workflows, and implementing real-time data processing solutions. Proficient in big data technologies, SQL optimization, and data governance. Skilled in integrating machine learning for predictive analytics and anomaly detection. Adept at ensuring data security, compliance, and workflow orchestration. Passionate about leveraging automation and advanced analytics to drive data-driven decision-making and improve system efficiency across industries including finance, e-commerce, and healthcare.
Technical Skills
Programming & Scripting: Python (Pandas, NumPy, Scikit-learn, TensorFlow), SQL (PostgreSQL, HiveQL), Shell Scripting
Big Data & ETL: Apache Spark, Hadoop (HDFS, Hive, Impala, Presto), Apache Flink, Apache NiFi
Streaming & Messaging: Apache Kafka, Apache Flink
Databases & Query Optimization: PostgreSQL, SQL tuning (Window Functions, Indexing, Materialized Views)
Machine Learning & Analytics: Scikit-learn, TensorFlow, Apache Spark MLlib, AI-based anomaly detection (Isolation Forest)
Orchestration & Workflow Automation: Apache Airflow, AWS Step Functions
Data Storage & Governance: Apache Ranger, Apache Atlas, Azure Data Lake Storage, Azure Purview
Monitoring & Logging: Prometheus, Grafana, AWS CloudWatch
Security & Compliance: Role-Based Access Control (RBAC), Data Governance (HIPAA, Financial Compliance Standards)
Visualization & BI Tools: Metabase, Tableau, Looker, Power BI
Professional Experience
Data Engineer, Capital One 05/2024 – Present Remote, USA
Designed and deployed a high-performance ETL pipeline with AI-powered anomaly detection using Apache Spark, Python (Scikit-learn, Pandas), and Isolation Forest, detecting financial data anomalies with 95% accuracy and ensuring regulatory compliance and real-time data integrity.
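A minimal sketch of the Isolation Forest flagging step, assuming scikit-learn; the input file, feature columns, and contamination rate are illustrative, not production values.

```python
# Hypothetical sketch: flagging anomalous transactions with Isolation Forest.
import pandas as pd
from sklearn.ensemble import IsolationForest

# File name and feature columns are assumptions for illustration.
df = pd.read_parquet("transactions.parquet")
features = df[["amount", "merchant_risk_score", "txn_per_hour"]]

# contamination sets the expected share of anomalies; 1% is an assumed value.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
df["anomaly"] = model.fit_predict(features)  # -1 = anomaly, 1 = normal

flagged = df[df["anomaly"] == -1]
print(f"Flagged {len(flagged)} of {len(df)} transactions for review")
```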
Developed scalable ETL pipelines leveraging Apache Spark and Hadoop, optimizing data transformation for 500 billion+ records, reducing processing latency by 60% through efficient partitioning and Parquet-based storage in HDFS.
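A hedged PySpark sketch of the partitioned Parquet write pattern behind the latency claim; HDFS paths and column names are assumptions.

```python
# Sketch of date-partitioned Parquet output in PySpark; paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

raw = spark.read.json("hdfs:///data/raw/transactions/")  # hypothetical path

cleaned = (
    raw.withColumn("txn_date", F.to_date("txn_ts"))
       .dropDuplicates(["txn_id"])
)

# Partitioning by date lets downstream queries prune irrelevant files,
# which is where most of the latency reduction typically comes from.
(cleaned.repartition("txn_date")
        .write.mode("overwrite")
        .partitionBy("txn_date")
        .parquet("hdfs:///data/curated/transactions/"))
```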
Automated real-time financial transaction ingestion using Apache Kafka and Apache Flink, enabling high-frequency data processing at 5 million transactions per second, ensuring fault tolerance and SLA compliance.
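A sketch of the producer side of this ingestion only (Flink consumes downstream), using the kafka-python client; the broker address, topic, and payload are assumptions.

```python
# Illustrative Kafka producer for transaction events (kafka-python).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",  # assumed broker endpoint
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",    # wait for in-sync replicas: durability over latency
    linger_ms=5,   # small batching window to raise throughput
)

txn = {"txn_id": "t-1001", "amount": 42.50, "currency": "USD"}  # illustrative
producer.send("transactions", value=txn)
producer.flush()
```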
Enhanced query performance in PostgreSQL with window functions (ROW_NUMBER, LAG, LEAD), materialized views, and indexing, reducing query execution time by 40% for real-time financial trend dashboards in Metabase.
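An illustrative version of the window-function materialized view pattern, executed via psycopg2; the table, columns, and connection string are hypothetical, not the actual schema.

```python
# Sketch: materialized view with window functions, plus a supporting index.
import psycopg2

MATERIALIZED_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_balance_trend AS
SELECT
    account_id,
    txn_date,
    balance,
    LAG(balance)  OVER w AS prev_balance,
    ROW_NUMBER()  OVER w AS day_rank
FROM account_balances
WINDOW w AS (PARTITION BY account_id ORDER BY txn_date);
"""

with psycopg2.connect("dbname=finance") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(MATERIALIZED_VIEW)
        # An index on the view keeps dashboard lookups fast.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_dbt_account "
            "ON daily_balance_trend (account_id, txn_date);"
        )
```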
Integrated AI-powered risk assessment models using TensorFlow and Scikit-learn, triggering automated alerts for financial anomalies, reducing risks by 30% via custom Python-based monitoring scripts and Prometheus/Grafana dashboards.
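A minimal sketch of surfacing model risk scores to Prometheus with prometheus_client; the metric names and alert threshold are assumptions, and the model call is stubbed with a random score.

```python
# Sketch: expose a model risk score for Prometheus scraping; Grafana
# alert rules would key off these metrics.
import time
import random
from prometheus_client import Counter, Gauge, start_http_server

risk_score = Gauge("txn_risk_score", "Latest model risk score")
alerts_total = Counter("risk_alerts_total", "Risk alerts fired")

ALERT_THRESHOLD = 0.8  # assumed threshold

def score_batch() -> float:
    # Stand-in for the TensorFlow/Scikit-learn model call.
    return random.random()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes this endpoint
    while True:
        score = score_batch()
        risk_score.set(score)
        if score > ALERT_THRESHOLD:
            alerts_total.inc()
        time.sleep(30)
```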
Orchestrated end-to-end data workflows using AWS Step Functions and AWS Lambda, ensuring real-time data availability, anomaly detection, and performance monitoring for 1TB+ of daily financial data pipeline throughput.
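A hedged sketch of triggering a Step Functions state machine from a Lambda handler with boto3; the state-machine ARN, event shape, and payload are placeholders.

```python
# Sketch: a Lambda-style handler that starts one pipeline execution
# per incoming event (e.g., an S3 object drop).
import json
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    response = sfn.start_execution(
        # Hypothetical ARN, not the real state machine.
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:daily-etl",
        input=json.dumps({"source_key": event.get("key", "unknown")}),
    )
    return {"executionArn": response["executionArn"]}
```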
Data Engineer, Dell 01/2023 – 07/2023 Remote, India
Engineered and optimized a scalable data warehouse for e-commerce analytics by migrating raw data into Apache Hive, applying partitioning and bucketing, and using Presto for fast query execution, improving report generation time by 60%.
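Illustrative HiveQL for a partitioned, bucketed table of this kind, submitted here via PyHive; the database, columns, bucket count, and endpoint are all assumptions.

```python
# Sketch: create a partitioned + bucketed Hive table for warehouse queries.
from pyhive import hive

DDL = """
CREATE TABLE IF NOT EXISTS sales.orders_curated (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS PARQUET
"""

conn = hive.connect(host="hive-server", port=10000)  # illustrative endpoint
cur = conn.cursor()
cur.execute(DDL)
cur.close()

# Presto can then prune partitions on queries such as:
#   SELECT customer_id, SUM(amount)
#   FROM sales.orders_curated
#   WHERE order_date = '2023-03-01'
#   GROUP BY customer_id;
```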
Built and implemented automated ETL pipelines using Apache Airflow and Apache Spark for efficient transformation of large-scale data, reducing data processing time by 50% and ensuring data consistency across multiple sources.
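A minimal Airflow DAG sketch of an extract-transform-load chain like the one described; the dag_id, schedule, and task bodies are placeholders.

```python
# Sketch: a three-step daily ETL DAG in Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from source systems")

def transform():
    print("submit Spark job to clean and join")

def load():
    print("write curated tables to the warehouse")

with DAG(
    dag_id="ecommerce_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_transform >> t_load
```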
Integrated machine learning algorithms into data workflows using Apache Spark MLlib to predict customer behavior, improving personalized marketing tactics and boosting engagement by 30% based on predictive models built on historical customer data.
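A sketch of a Spark MLlib propensity pipeline in the spirit of this work; the feature columns, label, and input path are assumptions.

```python
# Sketch: assemble features and fit a logistic regression in Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
history = spark.read.parquet("/data/customer_history/")  # hypothetical path

assembler = VectorAssembler(
    inputCols=["visits_30d", "avg_order_value", "days_since_last_order"],
    outputCol="features",
)
lr = LogisticRegression(labelCol="purchased_next_30d", featuresCol="features")

model = Pipeline(stages=[assembler, lr]).fit(history)
scored = model.transform(history).select("customer_id", "probability")
```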
Enhanced query performance in PostgreSQL using advanced indexing methods, window functions (ROW_NUMBER, LAG), and materialized views, reducing compute resource usage by 40% and improving analytical processing of sales and product data.
Implemented data governance and security measures using Azure Data Lake Storage and Azure Purview, ensuring secure data storage and compliance, automating metadata management and access controls across 20+ e-commerce data sources.
Associate Data Engineer, Cognizant Technology Solutions 01/2020 – 12/2022 Hyderabad, India
Worked on a scalable data lake architecture using Hadoop HDFS and Apache Parquet, enabling efficient storage and retrieval of structured and unstructured healthcare data while ensuring regulatory compliance and data governance.
Developed automated ETL workflows using Apache NiFi and Apache Spark, ingesting diverse data formats (CSV, JSON, XML) from multiple sources, reducing data processing time by 40% and improving operational efficiency in healthcare analytics.
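A PySpark sketch of normalizing mixed-format landings (NiFi routes files by type upstream); the paths and common schema are assumptions, and XML ingestion would typically need the spark-xml package, so only CSV and JSON are shown natively.

```python
# Sketch: read CSV and JSON drops, align schemas, and union into the lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

csv_df = spark.read.option("header", "true").csv("hdfs:///landing/csv/")
json_df = spark.read.json("hdfs:///landing/json/")

# Align to a common schema before union so downstream jobs see one shape.
common_cols = ["patient_id", "event_ts", "event_type"]
unified = csv_df.select(*common_cols).unionByName(json_df.select(*common_cols))
unified.write.mode("append").parquet("hdfs:///lake/events/")
```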
Engineered high-performance data processing pipelines using Apache Hive and Apache Impala, optimizing SQL queries with partitioning and indexing techniques, reducing query execution time by 50% for real-time patient trend analysis.
Implemented secure data access and governance by integrating Apache Ranger for role-based access control (RBAC) and Apache Atlas for metadata management, ensuring compliance with healthcare industry security and privacy regulations.
Directed in-depth data analysis on patient records using Python (Pandas, NumPy) and Scikit-learn, identifying trends in disease progression and treatment effectiveness, supporting data-driven decision-making for hospitals and medical research teams.
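An illustrative Pandas version of this kind of cohort trend analysis; the file name, columns, and grouping are assumed, not the real patient schema.

```python
# Sketch: average severity score per treatment group by quarter.
import pandas as pd

records = pd.read_csv("patient_records.csv", parse_dates=["visit_date"])

trend = (
    records
    .assign(quarter=records["visit_date"].dt.to_period("Q"))
    .groupby(["treatment_group", "quarter"])["severity_score"]
    .mean()
    .unstack("treatment_group")
)
print(trend.tail())
```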
Education
Master of Science, Professional Studies in Data Sciences and Applications, University at Buffalo - The State University of New York 08/2023 – 12/2024 Buffalo, NY, USA
Projects
US Traffic Accident Analysis and Hotspot Forecasting Python, SQL, Tableau, Data Analytics
Investigated 1.5M+ traffic accident records across 49 states (2016-2020) using Python (Pandas, NumPy) and SQL, applying statistical methods, regression modeling, and data visualization (Matplotlib, Seaborn) to identify accident hotspots.
Processed and transformed 3GB+ datasets with API integrations, creating Tableau dashboards that surfaced actionable insights.
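A hedged sketch of the hotspot aggregation; the column names follow the public US Accidents dataset conventions but are assumptions here.

```python
# Sketch: rank city-level hotspots by accident count and mean severity.
import pandas as pd

acc = pd.read_csv("US_Accidents.csv", usecols=["State", "City", "Severity"])

hotspots = (
    acc.groupby(["State", "City"])
       .agg(accidents=("Severity", "size"), avg_severity=("Severity", "mean"))
       .sort_values("accidents", ascending=False)
       .head(20)
)
print(hotspots)
```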
Smoking Data Pipeline and Chronic Disease Prediction Scala, ML, Python, Tableau
Engineered a high-performance data pipeline using Spark and Hadoop for Molina Healthcare, processing 100TB+ of smoking-related health records and achieving 94% accuracy in chronic disease predictions with ML models (Logistic Regression, Random Forest).
Automated ingestion workflows and created Tableau dashboards to deliver actionable insights, enhancing preventive care strategies and operational efficiency.