Manoj Gudivaka
Data Engineer
***************@*****.***
PROFESSIONAL SUMMARY:
Senior Data Engineer with 8+ years of experience in designing and developing scalable data platforms using Azure Databricks, Snowflake, PySpark, AWS, Azure Data Factory, Azure Data Lake, and Teradata.
Expertise in building ETL/ELT pipelines using Python, Spark, SQL, and ADF for batch and real-time data processing.
Strong experience in cloud ecosystems including Azure, AWS, and GCP with hands-on expertise in Databricks, ADLS Gen2, Snowflake, Glue, BigQuery, and Dataflow.
Skilled in developing scalable big data and streaming solutions using Apache Spark, Kafka, Flink, Hive, and Airflow.
Hands-on experience in data warehousing, dimensional modeling, Star/Snowflake schemas, and Teradata performance optimization.
Expertise in PySpark, Spark SQL, Delta Lake, and Databricks optimization for high-performance data engineering solutions.
Experience implementing CI/CD pipelines, data governance, data quality frameworks, and cloud-native architecture best practices.
Strong proficiency in Python, Scala, Java, SQL, and PL/SQL with experience across Oracle, PostgreSQL, MySQL, MongoDB, and Teradata databases.
Experienced in ML pipeline orchestration using MLflow, Airflow, and feature engineering for AI/ML workloads.
Proficient in ETL tools including Informatica, DataStage, SSIS, Ab Initio, and Talend for enterprise-scale data integration.
Experience working with structured and semi-structured data formats including Parquet, Avro, JSON, XML, CSV, and ORC.
Strong collaboration skills with experience working in Agile environments alongside Data Scientists, Analysts, and Engineering teams to deliver scalable enterprise data solutions.
SKILLS:
Languages:
Scala, Java, Python, R, PySpark, SQL, HiveQL, Shell Scripting, C, C golago
IDEs:
PyCharm, Jupyter Notebook, IntelliJ IDEA.
Methodologies:
SDLC, Agile.
ETL Tools:
Talend, SSIS, Informatica, DataStage, Ab Initio
Cloud Technologies:
AWS, GCP (BigQuery, Cloud Run, Dataproc, Dataflow, Pub/Sub).
Python Libraries/Packages:
NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, TensorFlow, PyTables, DataFrames, httplib2, PyQuery
ML Algorithms:
Supervised Learning, Unsupervised Learning, Natural Language Processing.
Reporting Tools:
Tableau, Power BI, SSRS, SSAS, Amazon QuickSight.
Big Data Technologies:
Hadoop, Pig, Hive, Data Warehousing, Sqoop, Apache Storm, Kafka, Spark, PySpark, Spark Streaming, Spark SQL and DataFrames, GraphX, Scala, Elasticsearch, Avro, GCP (BigQuery, Dataproc, Cloud Run), AWS (EC2, S3).
Databases/ Servers:
MySQL, Oracle, Cassandra, DynamoDB, NoSQL systems, Redis, PostgreSQL, MongoDB, Apache Tomcat, JBoss, WebLogic
Versioning tools:
Git, GitLab, GitHub, SVN, and CVS
Orchestration Tools:
Apache NiFi, Apache Airflow, Kafka, Microsoft Purview (Collections, Scanners, Metadata Catalog, Classification, Glossary, Lineage), Data Governance, Data Lineage
EDUCATION: Master’s in Computer Science and Master’s in Data Science – Missouri State University, Springfield, MO
PROFESSIONAL EXPERIENCE:
Senior AWS Cloud Data Engineer Moody's (Cognizant), VA March 2025 – Present
Designed and developed scalable cloud-native ETL/ELT pipelines using AWS Glue, PySpark, S3, and Athena for large-scale data processing and analytics.
Implemented AWS-based Data Lake architecture using Bronze/Silver/Gold layers with Delta and Parquet formats for optimized data management.
Built serverless and event-driven data ingestion pipelines using AWS Lambda, API Gateway, and REST APIs for seamless integration across systems.
Developed and optimized Snowflake integration pipelines on AWS, enabling high-performance enterprise data warehousing solutions.
Leveraged Spark, Kafka, and Hadoop ecosystem technologies to build resilient, fault-tolerant distributed data processing systems.
Integrated ML and AI workloads using AWS SageMaker, Docker, and Kubernetes to support scalable AI-ready cloud platforms.
Designed and optimized PostgreSQL (AWS RDS) data models and Athena queries to improve query performance and reduce processing costs.
Monitored and optimized AWS workloads using CloudWatch, logging frameworks, and performance tuning techniques to enhance reliability and observability.
AWS Cloud Data Engineer Capital One (Cognizant), VA Oct 2024 – March 2025
Designed and implemented scalable ETL/ELT pipelines using Databricks, PySpark, Snowflake, and AWS services for large-scale enterprise data processing.
Re-architected legacy Informatica workflows into optimized Snowflake pipelines using Tasks, Streams, Snowpipe, and Stored Procedures.
Developed automated data ingestion and orchestration workflows using AWS Glue, Apache Airflow, S3, GitHub, and REST APIs.
Built high-performance Spark and Polars-based workflows with advanced optimization techniques for large-scale analytics and centralized enterprise calculations.
Developed complex transformations using Databricks, Spark SQL, Delta Lake, and Parquet supporting batch and near real-time processing.
Implemented event-driven architectures using AWS Lambda, S3 triggers, and CI/CD pipelines with GitHub and Jenkins for automated deployments.
Ensured data quality, governance, lineage, and validation using Spark-based frameworks, metadata tracking, and enterprise governance standards.
Optimized Snowflake and Athena performance through query tuning, partitioning, compression strategies, and cost optimization techniques.
Built and supported AI/ML data pipelines including Generative AI workflows, feature engineering, ML datasets, and vector search using FAISS.
Led cross-functional teams, mentored junior engineers, and collaborated with business stakeholders to deliver scalable cloud-native data solutions.
Data Engineer American Family Insurance, WI Oct 2023 – Sep 2024
Designed and developed scalable data ingestion and CDC pipelines from Oracle to Amazon Redshift using GCP, Scala, and custom ETL frameworks.
Built and optimized real-time streaming pipelines using Kafka, Flink, Spark Streaming, and GCP Dataflow to process billions of events with low latency.
Implemented CDC solutions using Qlik (Attunity), Oracle replication/snapshots, and GCS for reliable real-time data integration and management.
Developed and optimized large-scale data processing workflows using BigQuery, Dataflow, Cloud Run, and GCS for improved performance and cost efficiency.
Integrated ML features and feature-store concepts into production pipelines to support real-time analytics and intelligent data applications.
Designed REST APIs and self-service data workflows enabling real-time access and improved analytics accessibility for business users and analysts.
Monitored and optimized pipeline performance using Prometheus, Grafana, and operational monitoring frameworks to improve reliability and observability.
Collaborated with cross-functional teams, mentored junior engineers, and provided technical leadership on data engineering, ETL, and cloud-based big data solutions.
Environment: Airflow (workflow orchestration), Trino, Hive, Spark (batch and streaming), AWS (Glue, S3, Redshift), GCP (BigQuery, Dataflow), Python (analytical & transactional), SQL (dimensional modeling)
Data Engineer Walmart (Contract) Bentonville Aug 2022 – Aug 2023
Migrated an entire Oracle database to BigQuery and used Power BI for reporting; built data pipelines in Airflow on GCP for ETL jobs using a variety of Airflow operators.
Developed data-driven applications using Java and Python, focusing on performance, scalability, and maintainability.
Implemented data transformation processes in GCP using tools like Dataflow and Dataproc, optimizing for performance and cost-efficiency.
Developed ELT processes in GCP for files sourced from Ab Initio and Google Sheets, using Dataprep, Dataproc (PySpark), and BigQuery for compute.
Worked with GCP Dataproc, Dataflow, Pub/Sub, GCS, Cloud Functions, BigQuery, Stackdriver, Cloud Logging, IAM, and Data Studio for reporting.
Automated data ingestion pipelines by integrating Python scripts with Apache Spark, facilitating seamless extraction, transformation, and loading processes into BigQuery.
Used Apache Beam on Dataflow to build robust, scalable data pipelines for both real-time and batch processing, with data cataloging.
Automated deployment pipelines using Terraform, reducing manual intervention and improving deployment consistency.
Worked extensively with Google Cloud BigQuery for real-time analytics, enabling faster decision-making and reporting.
Developed and monitored Airflow DAGs for ingestion and transformation of large-scale datasets in GCP.
Introduced data monitoring alerts for early failure detection, enhancing data pipeline reliability.
Evaluated and implemented POCs for GCP Dataflow vs. Dataproc to determine cost and performance trade-offs.
Adapted quickly to new GCP services and delivered production-ready pipelines under tight deadlines.
Environment: Google Cloud Storage, BigQuery, Dataflow, Pub/Sub, Cloud Functions, Dataproc, Cloud Composer, Google Kubernetes Engine (GKE), Cloud Spanner, Cloud SQL, Data Studio, Looker, Python, Java, Scala, UNIX Shell Scripting, PySpark, Apache Spark, Docker.
Data Engineer Cognizant, India Feb 2017 – Aug 2021
Architected a legacy data migration project from on-premises systems to AWS Cloud.
Experience in Analysis, Design, Development and Testing of ETL methodologies in all the phases of Data Warehousing.
Developed the on-premises-to-cloud migration strategy and implemented best practices using AWS services such as AWS Database Migration Service and AWS Server Migration Service.
Used AWS EMR to process big data across Hadoop clusters of virtual servers backed by Amazon Simple Storage Service (S3).
Used Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and used Spark SQL to manipulate DataFrames in Scala.
Led migration of legacy ETL workflows from Informatica PowerCenter (IICS/PWC) to Snowflake, redesigning mappings into cloud-native ELT patterns using Snowflake SQL and Spark.
Prepared Python and Scala scripts to automate ingestion from various sources such as APIs, AWS S3, Teradata, and Snowflake.
Implemented validation and reconciliation frameworks to ensure data consistency during migration from on-prem Informatica to Snowflake.
Created Hive tables, loaded data in ORC, JSON, Avro, and CSV formats, and wrote Hive queries to analyze data using Spark SQL.
Developed real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop Cluster.
Developed code to extract data from Oracle Database and load it into the AWS platform using AWS Data Pipeline.
Worked with Informatica Cloud applications for Data Synchronization, Data Replication, Task Flows, and Mapping configurations.
Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
Utilized Amazon QuickSight to design and implement scalable BI solutions, ensuring accessibility to real-time insights.
Proficient with container systems such as Docker, container orchestration with EC2 Container Service and Kubernetes, and Terraform.
Set up continuous integration pipelines and build systems using Jenkins or equivalent tools.
Optimized and performance-tuned complex, advanced SQL scripts.
Used Jira for project management and GitHub as the code repository.
Actively involved in troubleshooting and optimizing data pipelines for performance and reliability, utilizing monitoring and logging tools to identify and resolve issues proactively.
Built and deployed containerized data solutions using Docker/Kubernetes in AWS ECS/EKS.
Integrated ML workflows and feature engineering pipelines to support predictive models.
Explored early adoption of Gen-AI frameworks to augment data workflows.
Environment: Python, Scala, Apache Spark, AWS, Spark MLlib, Spark SQL, PostgreSQL, Hive, MongoDB, Apache Storm, Kafka, GitHub, Jira, BI tools.