MOUNIKA VATTIKONDA
DATA ENGINEER
Ohio, USA 220-***-**** ************@*****.***
https://www.linkedin.com/in/vattikonda-mounika-95330a2b4/

SUMMARY
• Data Engineer with 6+ years of expertise in designing, developing, and managing end-to-end data pipelines and ETL workflows across cloud and on-premises environments, ensuring seamless data integration from heterogeneous sources.
• Skilled in data modeling and architecture, including star and snowflake schemas, dimensional modeling, and building data warehouses and data lakes for structured and unstructured datasets.
• Proficient in leveraging Big Data technologies such as PySpark, Apache Spark, Hadoop, and Databricks to process, transform, and analyze large-scale datasets efficiently.
• Hands-on expertise in data quality, governance, and lineage, implementing frameworks and automated validations to ensure reliable, compliant, and audit-ready data.
• Strong experience in cloud-based data platforms, including Azure and AWS, utilizing tools such as Data Factory, Synapse, Databricks, S3, Redshift, and Glue for data ingestion, transformation, and storage.
• Adept at orchestration and workflow automation using Apache Airflow, Luigi, and other scheduling tools to manage complex data pipelines and ensure reliable delivery of curated datasets.
• Collaborated with data scientists, analysts, and business stakeholders to prepare analytics-ready datasets, support AI/ML model development, and deliver actionable business insights via visualization tools like Tableau and Power BI.
• Experienced in the full SDLC of data engineering projects, including requirement gathering, design, development, testing, deployment, and production support, ensuring robust, scalable, and high-performance data solutions.

SKILLS
Programming & Scripting: Python (pandas, NumPy, PySpark), Java, Scala, SQL, Shell Scripting, R
Data Warehousing & Databases: Amazon Redshift, Snowflake, Google BigQuery, PostgreSQL, MySQL, Oracle, MongoDB, Cassandra
Big Data & Distributed Systems: Apache Spark, Hadoop (HDFS, MapReduce), Hive, Pig, Kafka, Flink, Airflow, Databricks, EMR
ETL & Data Integration: Apache NiFi, Talend, Informatica, SSIS, Fivetran, Matillion, dbt, DataStage
Cloud Platforms: AWS (S3, EC2, Lambda, Glue, Redshift, EMR, Kinesis), GCP (BigQuery, Dataflow, Pub/Sub), Azure (Data Factory, Synapse, Databricks)
Data Modeling & Architecture: Star & Snowflake Schemas, OLAP/OLTP Design, Dimensional Modeling, Data Lake & Lakehouse Architecture
Data Pipeline Orchestration: Apache Airflow, Prefect, Luigi, Dagster
Data Governance & Quality: Great Expectations, Apache Atlas, Collibra, Data Cataloging, Data Lineage, Data Profiling & Validation
DevOps & Version Control: Git, GitHub, GitLab, Jenkins, Docker, Kubernetes, CI/CD Pipelines
Visualization & Reporting: Tableau, Power BI, Looker, Qlik Sense, Advanced Excel (Pivot Tables, Power Query, VBA)
Machine Learning & Analytics: Scikit-learn, TensorFlow, PyTorch, ML Pipelines, Predictive & Prescriptive Analytics
Other Tools & Technologies: REST APIs, JSON, XML, JSON Schema, Regex, Unix/Linux, Agile/Scrum

WORK EXPERIENCE
Novartis New Jersey, USA
Data Engineer Nov 2023 – Present
Data42 Platform – Enterprise Data Integration & Analytics
Project Overview: Data42 is Novartis’s enterprise-scale data platform designed to consolidate clinical, operational, and real-world evidence (RWE) data into a centralized ecosystem. It supports advanced analytics and machine learning, enabling faster, data-driven decisions across R&D, clinical trials, and business operations.
Key Contributions:
• Collaborated with clinical and R&D teams to define requirements and design Star and Snowflake schemas, supporting efficient access and retrieval of 500M+ patient and clinical records.
• Built scalable ETL workflows using Python (pandas, PySpark) and Apache NiFi, automating ingestion and transformation of 20+ heterogeneous data sources, reducing manual data prep by 40% (see the sketch after this list).
• Implemented Apache Spark on AWS EMR to process ~2 TB of daily clinical and operational data, reducing batch processing time from 12 hours to 3 hours, enabling near real-time analytics.
• Used AWS Glue and Spark SQL to merge structured and unstructured datasets, achieving 95% data accuracy, enhancing the reliability of downstream analytics and ML models.
• Applied Great Expectations and Collibra for automated data validation, profiling, and lineage tracking, ensuring HIPAA and FDA compliance, and reducing data discrepancies by 30%.
• Migrated legacy ETL pipelines to AWS S3, Redshift, and Lambda, creating a centralized Data Lake, improving data availability for analytics and reducing storage/compute costs by 25%.
• Designed Apache Airflow pipelines to schedule and monitor 100+ daily ETL jobs, improving pipeline reliability by 35% and ensuring timely delivery of curated datasets.
• Prepared clean datasets for Power BI dashboards, enabling actionable insights into patient outcomes and drug efficacy trends, reducing manual reporting efforts by 50%.
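A minimal sketch of the ingest-standardize-curate pattern described above, assuming PySpark on EMR writing curated Parquet to S3. The bucket paths, column names, and completeness rule are hypothetical placeholders for illustration, not actual Data42 assets.

# Illustrative only: simplified ingest/clean/write job in the spirit of the
# pipeline above. Paths and columns are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clinical_ingest").getOrCreate()

# Ingest a raw clinical extract (CSV here; real sources varied by system).
raw = (
    spark.read
    .option("header", True)
    .csv("s3://example-bucket/raw/clinical/")
)

# Standardize identifiers and dates, then drop records that fail a minimal
# completeness check before they reach the curated zone.
clean = (
    raw
    .withColumn("patient_id", F.trim(F.col("patient_id")))
    .withColumn("visit_date", F.to_date("visit_date", "yyyy-MM-dd"))
    .filter(F.col("patient_id").isNotNull() & F.col("visit_date").isNotNull())
)

# Partitioned Parquet keeps downstream Redshift/Glue date-range scans cheap.
(
    clean.write
    .mode("overwrite")
    .partitionBy("visit_date")
    .parquet("s3://example-bucket/curated/clinical/")
)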
LTI Mindtree India
Data Engineer Jan 2018 – Oct 2022
Modernizing Banking with Cloud Data & AI
Project Overview: Modernized legacy banking data systems by building scalable pipelines and a cloud-integrated data platform. Ingested, transformed, and stored data from core banking, payments, and loan management systems, delivering analytics-ready datasets for AI-driven fraud detection, predictive analytics, and regulatory reporting.
Key Contributions:
• Developed ETL pipelines using PySpark and SQL, processing 5M+ daily transactions from multiple heterogeneous banking systems, reducing end-to-end processing time by 40%.
• Created optimized data warehouse schemas in Azure Synapse Analytics and Snowflake, using star and snowflake modeling, improving query performance for dashboards by 60%.
• Implemented data quality frameworks using Great Expectations and Talend Data Quality, applying automated validations and anomaly detection, reducing downstream errors by 35%.
• Built feature-engineered datasets for predictive models using Python (pandas, NumPy) and PySpark, improving fraud detection accuracy by 25% and enabling real-time risk scoring.
• Orchestrated ETL workflows using Apache Airflow and Luigi, scheduling 100+ daily jobs with error handling, logging, and retry mechanisms, reducing manual interventions by 30% (see the DAG sketch after this list).
• Performed pipeline monitoring and optimization using Spark UI, Grafana, and custom Python scripts, ensuring 99.9% uptime for production workflows.
• Used Azure Data Factory and Azure Databricks for cloud-based ingestion, transformation, and integration of on-premises and cloud sources, supporting a seamless hybrid data architecture.
• Implemented CI/CD using Git, Jenkins, and Docker, automating deployment of ETL scripts and notebooks, reducing release cycles by 20% and maintaining production stability.
• Executed the full SDLC, including requirement analysis, unit testing, integration testing, regression testing, and UAT using SQL, PyTest, and custom test scripts, ensuring audit-ready pipelines.
• Collaborated with data scientists and business analysts, using Tableau to validate datasets, deliver dashboards, and provide actionable insights for strategic decision-making.
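A minimal sketch of the orchestration pattern described above, assuming Apache Airflow 2.x. The DAG id, schedule, and task bodies are hypothetical placeholders standing in for the actual PySpark/SQL jobs.

# Illustrative only: a two-task daily DAG with the retry/alerting defaults
# described above. Names and schedule are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_transactions():
    print("extract daily transactions")  # stand-in for the PySpark/SQL step


def load_to_warehouse():
    print("load curated data")  # stand-in for the Synapse/Snowflake load


default_args = {
    "retries": 3,                         # retry transient failures automatically
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,             # alerting hook for on-call triage
}

with DAG(
    dag_id="daily_banking_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_transactions)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)
    extract >> load

Putting retries and alerting in default_args keeps failure handling uniform across every task in the DAG, which is what cuts manual interventions as the job count grows.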
EDUCATION

Master of Science in Computer Science
Youngstown State University, Ohio, USA
Bachelor of Technology in Computer Science and Engineering
MVR College of Engineering & Technology, India