
Data Engineer Integration

Location:
United States
Posted:
October 15, 2025


Resume:

MOUNIKA VATTIKONDA

DATA ENGINEER

Ohio, USA | 220-***-**** | ************@*****.***

https://www.linkedin.com/in/vattikonda-mounika-95330a2b4/

SUMMARY

• Data Engineer with 6+ years of expertise in designing, developing, and managing end-to-end data pipelines and ETL workflows across cloud and on-premises environments, ensuring seamless data integration from heterogeneous sources.

• Skilled in data modeling and architecture, including star and snowflake schemas, dimensional modeling, and building data warehouses and data lakes for structured and unstructured datasets.

• Proficient in leveraging Big Data technologies such as PySpark, Apache Spark, Hadoop, and Databricks to process, transform, and analyze large-scale datasets efficiently.

• Hands-on expertise in data quality, governance, and lineage, implementing frameworks and automated validations to ensure reliable, compliant, and audit-ready data.

• Strong experience in cloud-based data platforms, including Azure and AWS, utilizing tools such as Data Factory, Synapse, Databricks, S3, Redshift, and Glue for data ingestion, transformation, and storage.

• Adept at orchestration and workflow automation using Apache Airflow, Luigi, and other scheduling tools to manage complex data pipelines and ensure reliable delivery of curated datasets.

• Collaborated with data scientists, analysts, and business stakeholders to prepare analytics-ready datasets, support AI/ML model development, and deliver actionable business insights via visualization tools like Tableau and Power BI.

• Experienced in full SDLC of data engineering projects, including requirement gathering, design, development, testing, deployment, and production support, ensuring robust, scalable, and high-performance data solutions.

SKILLS

Programming & Scripting: Python (pandas, NumPy, PySpark), Java, Scala, SQL, Shell Scripting, R

Data Warehousing & Databases: Amazon Redshift, Snowflake, Google BigQuery, PostgreSQL, MySQL, Oracle, MongoDB, Cassandra

Big Data & Distributed Systems: Apache Spark, Hadoop (HDFS, MapReduce), Hive, Pig, Kafka, Flink, Airflow, Databricks, EMR

ETL & Data Integration: Apache NiFi, Talend, Informatica, SSIS, Fivetran, Matillion, DBT, DataStage

Cloud Platforms: AWS (S3, EC2, Lambda, Glue, Redshift, EMR, Kinesis), GCP (BigQuery, Dataflow, Pub/Sub), Azure (Data Factory, Synapse, Databricks)

Data Modeling & Architecture: Star & Snowflake Schemas, OLAP/OLTP Design, Dimensional Modeling, Data Lake & Lakehouse Architecture

Data Pipeline Orchestration: Apache Airflow, Prefect, Luigi, Dagster

Data Governance & Quality: Great Expectations, Apache Atlas, Collibra, Data Cataloging, Data Lineage, Data Profiling & Validation

DevOps & Version Control: Git, GitHub, GitLab, Jenkins, Docker, Kubernetes, CI/CD Pipelines

Visualization & Reporting: Tableau, Power BI, Looker, Qlik Sense, Advanced Excel (Pivot, Power Query, VBA)

Machine Learning & Analytics: Scikit-learn, TensorFlow, PyTorch, ML Pipelines, Predictive & Prescriptive Analytics

Other Tools & Technologies: REST APIs, JSON, XML, JSON Schema, Regex, Unix/Linux, Agile/Scrum

WORK EXPERIENCE

Novartis | New Jersey, USA

Data Engineer | Nov 2023 – Present

Data42 Platform – Enterprise Data Integration & Analytics

Project Overview: Data42 is Novartis’s enterprise-scale data platform designed to consolidate clinical, operational, and real-world evidence (RWE) data into a centralized ecosystem. It supports advanced analytics and machine learning, enabling faster, data-driven decisions across R&D, clinical trials, and business operations.

Key Contributions:

• Collaborated with clinical and R&D teams to define requirements and design Star and Snowflake schemas, supporting efficient access and retrieval of 500M+ patient and clinical records.

• Built scalable ETL workflows using Python (pandas, PySpark) and Apache NiFi, automating ingestion and transformation of 20+ heterogeneous data sources, reducing manual data prep by 40%.
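
For illustration, a minimal PySpark sketch of the kind of ingestion and cleanup step described above; the S3 paths, column names, and cleaning rules are hypothetical, not taken from the actual Data42 pipelines:

from pyspark.sql import SparkSession, functions as F

# Hypothetical ingestion job: read raw CSV extracts, de-duplicate,
# standardise dates, and write curated Parquet back to S3.
spark = SparkSession.builder.appName("clinical_ingest").getOrCreate()

raw = spark.read.option("header", True).csv("s3://example-bucket/raw/clinical/")

cleaned = (
    raw.dropDuplicates(["record_id"])
       .withColumn("visit_date", F.to_date("visit_date", "yyyy-MM-dd"))
       .filter(F.col("patient_id").isNotNull())
)

cleaned.write.mode("overwrite").parquet("s3://example-bucket/curated/clinical/")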

• Implemented Apache Spark on AWS EMR to process ~2 TB of daily clinical and operational data, reducing batch processing time from 12 hours to 3 hours, enabling near real-time analytics.

• Used AWS Glue and Spark SQL to merge structured and unstructured datasets, achieving 95% data accuracy, enhancing the reliability of downstream analytics and ML models.

• Applied Great Expectations and Collibra for automated data validation, profiling, and lineage tracking, ensuring HIPAA and FDA compliance, and reducing data discrepancies by 30%.
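
A hedged sketch of the kind of automated column-level checks Great Expectations supports, using its older pandas-based API; the file, columns, and value ranges are assumptions for illustration only:

import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so expectations can be declared directly on it
# (hypothetical patient extract and columns).
df = ge.from_pandas(pd.read_parquet("patients.parquet"))

df.expect_column_values_to_not_be_null("patient_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

results = df.validate()
print(results.success)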

• Migrated legacy ETL pipelines to AWS S3, Redshift, and Lambda, creating a centralized Data Lake, improving data availability for analytics and reducing storage/compute costs by 25%.

• Designed Apache Airflow pipelines to schedule and monitor 100+ daily ETL jobs, improving pipeline reliability by 35% and ensuring timely delivery of curated datasets.
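
A minimal Airflow DAG sketch showing how daily ETL jobs of this kind can be scheduled with retries; the DAG id, task callables, and schedule are hypothetical:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull from source systems

def transform():
    ...  # clean and conform the extracted records

with DAG(
    dag_id="daily_clinical_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task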

• Prepared clean datasets for Power BI dashboards, enabling actionable insights into patient outcomes and drug efficacy trends, reducing manual reporting efforts by 50%.

LTI Mindtree | India

Data Engineer | Jan 2018 – Oct 2022

Modernizing Banking with Cloud Data & AI

Project Overview: Modernized legacy banking data systems by building scalable pipelines and a cloud-integrated data platform. Ingested, transformed, and stored data from core banking, payments, and loan management systems, delivering analytics-ready datasets for AI-driven fraud detection, predictive analytics, and regulatory reporting.

Key Contributions:

• Developed ETL pipelines using PySpark and SQL, processing 5M+ daily transactions from multiple heterogeneous banking systems, reducing end-to-end processing time by 40%.

• Created optimized data warehouse schemas in Azure Synapse Analytics and Snowflake, using star and snowflake modeling, improving query performance for dashboards by 60%.

• Implemented data quality frameworks using Great Expectations and Talend Data Quality, applying automated validations and anomaly detection, reducing downstream errors by 35%.

• Built feature-engineered datasets for predictive models using Python (pandas, NumPy) and PySpark, improving fraud detection accuracy by 25% and enabling real-time risk scoring.
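
A hedged PySpark sketch of window-based feature engineering in this style; the table, columns, and derived features are assumptions, not the actual fraud-detection feature set:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical transaction table: compute rolling statistics per account
# over the last 10 transactions to feed a risk-scoring model.
txns = spark.table("curated.transactions")
w = Window.partitionBy("account_id").orderBy("txn_ts").rowsBetween(-9, 0)

features = (
    txns.withColumn("avg_amount_last_10", F.avg("amount").over(w))
        .withColumn("txn_count_last_10", F.count("amount").over(w))
        .withColumn("amount_vs_avg", F.col("amount") / F.col("avg_amount_last_10"))
)

features.write.mode("overwrite").saveAsTable("features.fraud_scoring_inputs")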

• Orchestrated ETL workflows using Apache Airflow and Luigi, scheduling 100+ daily jobs with error handling, logging, and retry mechanisms, reducing manual interventions by 30%.

• Performed pipeline monitoring and optimization using Spark UI, Grafana, and custom Python scripts, ensuring 99.9% uptime for production workflows.

• Used Azure Data Factory and Azure Databricks for cloud-based ingestion, transformation, and integration of on-premises and cloud sources, supporting a seamless hybrid data architecture.

• Implemented CI/CD using Git, Jenkins, and Docker, automating deployment of ETL scripts and notebooks, reducing release cycles by 20% and maintaining production stability.

• Executed full SDLC including requirement analysis, unit testing, integration testing, regression, and UAT using SQL, PyTest, and custom test scripts, ensuring audit-ready pipelines.
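
A small PyTest sketch of the style of unit test described above; the clean_transactions function and its columns are hypothetical:

import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing amounts and normalise currency codes.
    out = df.dropna(subset=["amount"]).copy()
    out["currency"] = out["currency"].str.upper()
    return out

def test_clean_transactions_drops_nulls_and_uppercases():
    raw = pd.DataFrame({"amount": [10.0, None], "currency": ["usd", "eur"]})
    cleaned = clean_transactions(raw)
    assert len(cleaned) == 1
    assert cleaned["currency"].iloc[0] == "USD"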

• Collaborated with data scientists and business analysts, using Tableau to validate datasets, deliver dashboards, and provide actionable insights for strategic decision-making.

EDUCATION

Master of Science in Computer Science

Youngstown State University, Ohio, USA

Bachelor of Technology in Computer Science and Engineering

MVR College of Engineering & Technology, India


