Data Engineer Machine Learning

Location:

Bhubaneswar, Odisha, India

Posted:

September 10, 2025


Resume:

Sannidhi Rao Ambaragonda

USA +1-872-***-**** ************@*****.*** LinkedIn

Summary

Experienced Data Engineer with 4+ years of expertise in designing and optimizing scalable data solutions. Proficient in building enterprise data lakes, developing high-performance ETL pipelines using Spark SQL in Databricks, and implementing complex data models (Star and Snowflake schema) for optimized query performance. Skilled in cloud-based data engineering with AWS and Azure, leveraging technologies such as Apache Spark, Kafka, and Snowflake to drive data-driven decision-making. Adept at leading cross-functional teams, automating workflows, and ensuring data integrity across large-scale distributed systems.

Skills

Programming Languages: Python, R, Scala, SQL, PySpark, Git, Java

Data Modeling and ETL: ETL Processes, Data Warehousing, Data Modeling, Apache NiFi, Informatica PowerCenter, Apache Flink, Apache Druid, Apache Beam, Medallion Architecture

Web Development: HTML5, JavaScript

Project Management Methodologies: Agile, Waterfall

Machine Learning: Logistic Regression, Decision Trees, Random Forests, PyTorch, AWS SageMaker

Databases: MySQL, PostgreSQL, AzureDB, SQL Server, NoSQL Databases (e.g., MongoDB, Cassandra)

Tools: Tableau, PowerBI, Excel, SAS, SQL Playground, GCP BigQuery, Apache Airflow, Splunk

Infrastructure as Code (IaC): Terraform

Cloud Services: AWS (S3, Redshift, Glue, EMR, EC2, Lambda), Azure (Databricks, Data Lake, Data Factory, Logic Apps, Azure Cosmos DB, ADLS, Azure Synapse Studio, Blob)

SQL Data Warehouse: Implemented in both Azure (SQL Data Warehouse) and AWS (Redshift)

Statistical Analysis: Linear Regression, ANOVA, Chi-Square

Big Data Technologies: Apache Spark, Apache Hadoop, Apache Kafka

Containerization and Orchestration: Docker, Kubernetes

CI/CD and Version Control: Jenkins, Git

Packages: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow

Experience

Data Engineer

MetLife – USA

Feb 2025 – Current

Designed and deployed an enterprise-grade data lake architecture for large-scale data processing using AWS S3, Glue, and Lake Formation, enabling real-time analytics and historical reporting.

Ensured data quality, consistency, and governance in relational databases through data profiling, cleansing, transformation, and integrity checks.

Developed high-performance ETL pipelines using Spark SQL and PySpark in Databricks, processing structured and semi-structured data (JSON, Parquet, ORC) across multiple data sources.

Created Star and Snowflake schema data models in ERWIN Data Modeler, improving query performance, BI reporting efficiency, and analytical scalability.

Utilized Amazon Athena and AWS Glue for serverless querying, generating interactive Power BI and QuickSight dashboards to drive business insights.

Designed and optimized NoSQL-based data pipelines using HBase, Cassandra, and MongoDB, supporting low-latency, high-throughput workloads.

Built robust big data pipelines in Hadoop, Hive, and Talend, ensuring fault tolerance and high-volume data ingestion for enterprise analytics.

Data Engineer

Tata 1mg – India

Jun 2021 – Jul 2023

Conducted ad-hoc analytics and statistical modeling on pharmaceutical data using Python (Pandas, NumPy, SciPy) to support regulatory submissions, reducing approval timelines by 15%.

Analyzed clinical trial datasets via Power BI dashboards, detecting anomalies and ensuring regulatory compliance (GxP, FDA 21 CFR Part 11).

Collaborated with cross-functional R&D and compliance teams to uncover trends and correlations in experimental datasets, driving data-driven decision-making.

Streamlined and optimized data ingestion pipelines using Azure Data Factory, cutting processing time by 30% while maintaining data accuracy and lineage tracking.

Automated regulatory data workflows using Python, Databricks, and ADF, ensuring compliance accuracy and reducing manual effort.

Designed and implemented scalable data architectures on Azure Data Lake and Synapse Analytics with robust data governance frameworks for security, auditing, and access control.

Built advanced statistical and machine learning models in Azure Databricks, applying hyperparameter tuning, A/B testing, and model deployment to predict adverse drug reactions and treatment outcomes, improving predictive accuracy by 20%.

Developed Python-based analytics pipelines integrating Power BI and Databricks, providing actionable insights for regulatory decision-makers.

Associate Data Engineer

OLX – India

Feb 2019 – May 2021

Migrated petabyte-scale datasets from legacy on-premises systems to cloud platforms (Azure, AWS), ensuring minimal downtime and zero data loss.

Designed and developed scalable ETL pipelines using PySpark, Spark SQL, and Databricks, improving data transformation and validation processes.

Built and orchestrated Apache Airflow DAGs for workflow automation, incremental loads, and near real-time data streaming.

Implemented Spark Streaming applications for event-driven data processing, integrating with Snowflake and DynamoDB for fast analytics.

Tuned Spark applications leveraging in-memory computation, partitioning, and custom aggregate functions, reducing job runtime by 40%.

Worked with structured and semi-structured data formats (JSON, XML, Avro, ORC, Parquet) for optimized storage and querying performance.

Created and managed Hive tables and partitions to support batch processing, reporting, and analytical use cases.

Designed data ingestion frameworks using Spark, Hive, and Sqoop, ensuring high-throughput and fault-tolerant processing.

Implemented database sharding, indexing, and partitioning strategies to enhance query performance and scalability.

Configured and optimized Databricks clusters for both batch and streaming workloads, ensuring cost-efficiency and resource utilization.

Education

Master's in Information Technology and Management, Data Analytics

Illinois Institute of Technology

08/2023-05/2025

USA
