Pavani
857-***-**** ******************@*****.***
Professional Summary
Senior Data Engineer with 7+ years of experience designing and delivering cloud-native, large-scale data pipelines across AWS, Azure, and GCP. Specialized in streaming pipelines (Kafka, Spark Streaming), cloud migrations, and modern data warehouses (Databricks, Snowflake, Redshift, Synapse). Known for improving pipeline efficiency, enabling real-time analytics, and reducing infrastructure costs. Passionate about building reliable data ecosystems that directly drive smarter business decisions.
Education
Master's in Computer Science - Rivier University, NH
Technical Skills
• Programming Languages: Python, Scala, SQL, Java, R, Unix Shell Scripting, PowerShell
• Big Data & Streaming: Apache Spark (Core, SQL, Streaming, MLlib), Kafka, Airflow, Databricks, Hadoop (HDFS, YARN, MapReduce, Hive), Apache NiFi, Apache Oozie, Amazon Kinesis, Elasticsearch, Druid, Sqoop
• Data Engineering & ETL Tools: Apache Airflow, Apache NiFi, AWS Glue, Talend, Informatica, SSIS, Matillion ETL, Great Expectations
• Cloud Platforms & Services: AWS (EC2, S3, EMR, Redshift, Glue, Athena, DynamoDB, IAM, CloudWatch, Lambda, API Gateway), Azure (Data Factory, Databricks, Synapse Analytics, HDInsight, Stream Analytics, Event Hubs, Functions, Azure Monitor, Azure DevOps), GCP (BigQuery, DataProc, Dataflow, Cloud Storage, Composer), Snowflake, Delta Lake
• Data Warehousing & Databases: Snowflake, Redshift, MySQL, PostgreSQL, Oracle, Microsoft SQL Server, DB2, Teradata
• NoSQL Databases: MongoDB, Cassandra, HBase, DynamoDB
• DevOps & Build Tools: Git, GitHub, Jenkins, Docker, Kubernetes, Terraform, CloudFormation
• Visualization & Reporting: Power BI, Tableau, OBIEE, Microsoft Excel
• Machine Learning & NLP: Scikit-learn, NLTK, OpenNLP, Stanford NLP, TextBlob, Beautiful Soup, NumPy, Pandas, SciPy, Matplotlib
• Hadoop Platforms: Cloudera (CDH4/CDH5), Hortonworks, Amazon EMR
• IDEs & Development Tools: Eclipse, IntelliJ, NetBeans
• Operating Systems: Linux, UNIX, RedHat, Windows
• Methodologies: Agile (Scrum), Waterfall
Professional Experience
Wawa, Inc. Nashua, NH Dec 2024 - Present
Senior Data Engineer
• Developed robust ETL pipelines using PySpark and Apache Spark SQL on Amazon EMR to ingest and transform raw data from structured and unstructured sources including MySQL, PostgreSQL, MongoDB, SFTP folders, and customer APIs.
• Automated data ingestion workflows using Apache Airflow, integrating Hive, MySQL, and HDFS for structured data transformation and storage.
• Built real-time streaming pipelines using Kafka and Spark Streaming to process customer transactions, enabling live dashboards that improved marketing response time by 30%.
• Implemented real-time ingestion pipelines with Apache Kafka and Amazon Kinesis, applying stateless and stateful transformations in Spark Streaming, converting input into RDDs and DataFrames, and storing results in Parquet format on HDFS (see the pipeline sketch after this section).
• Migrated legacy batch jobs to AWS Glue + Redshift, cutting ETL runtime by 40% and saving approximately $100K/year in infrastructure costs.
• Utilized AWS Glue for ETL jobs to migrate and transform external data in formats like text and Parquet from Amazon S3 into AWS Redshift.
• Designed serverless pipelines using AWS Lambda, API Gateway, and DynamoDB to automate API ingestion, reducing manual effort by 20 hours per week.
• Built and optimized reusable, scalable ETL pipelines to streamline data processing and improve transformation efficiency using Apache Spark and Elasticsearch.
• Created and maintained Databricks notebooks integrated with S3 + Delta Lake, supporting analytics across three business domains and enabling near real-time insights for teams.
• Ingested data into Hive external tables and filtered results using Elasticsearch for downstream analytics and business intelligence use cases.
• Designed and implemented 25+ automated workflows with Airflow DAGs and CloudWatch monitoring, ensuring 99% pipeline reliability and seamless scheduling of batch/streaming pipelines.
• Installed, configured, and monitored the Apache Airflow cluster for optimal job performance and ensured automated daily execution of critical production pipelines.
• Utilized Oozie for orchestrating complex Hadoop workflows and integrated with Hive and Spark for end-to-end data processing.
• Enabled real-time system monitoring through AWS CloudWatch, creating alarms and notifications for EC2, EBS, RDS, ELB, S3, and SNS to proactively address system bottlenecks and failures.
• Delivered Tableau dashboards powered by curated data, enabling leadership to identify new customer behavior trends and make data-driven decisions.
• Generated additional business reports in Tableau by integrating transformed data from the Hadoop Data Lake, supporting strategic insights and trend analysis.
• Developed and deployed shell scripts for automation, environment setup, and deployment processes, contributing to the stability and efficiency of production pipelines.
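To illustrate the streaming ingestion pattern referenced above, the following is a minimal sketch of a Kafka-to-Parquet job written with the PySpark Structured Streaming API. The broker address, topic name, event schema, and storage paths are hypothetical placeholders, not the actual production configuration, and the job assumes the Spark Kafka connector package is available in the cluster.

```python
# Minimal sketch: Kafka -> Spark Structured Streaming -> Parquet on HDFS.
# Broker, topic, schema, and paths below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("transactions-stream").getOrCreate()

# Example schema for a customer transaction event (hypothetical)
schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw event stream from Kafka
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "customer-transactions")        # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload in the Kafka value column into typed columns
events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Append the parsed stream to Parquet with checkpointing for restartability
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/transactions/parquet")       # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/txn")    # placeholder path
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

The checkpoint location is what lets the stream recover after a restart without reprocessing or dropping records.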
HSBC Hyderabad, India Apr 2020 - Aug 2023
Senior Data Engineer
• Led migration of a 10TB on-premises data warehouse to Azure Synapse, reducing query times by 35% and cutting reporting cycles from hours to minutes.
• Developed robust ETL pipelines using PySpark and Apache Spark SQL on Amazon EMR to ingest and transform raw data from MySQL, PostgreSQL, MongoDB, SFTP folders, and customer APIs.
• Automated data ingestion workflows using Apache Airflow, integrating Hive, MySQL, and HDFS for structured data transformation and storage.
• Built streaming pipelines with Azure Event Hubs + Apache Spark, supporting near-real-time fraud detection use cases.
• Designed and developed custom Kafka producers and consumers to publish and subscribe messages across distributed systems.
• Migrated legacy on-premises Data Warehouse to Azure SQL Data Warehouse, enabling cost-effective and scalable cloud analytics.
• Developed and deployed scalable Apache Spark applications using PySpark and Spark SQL for large-scale transformations and aggregations across multiple formats.
• Created and maintained Hive external/staging tables with partitioning, dynamic partitioning, and bucketing to optimize query performance.
• Created Azure ADF pipelines integrating Blob Storage, SQL, and Delta Lake, standardizing enterprise data ingestion.
• Utilized Azure Databricks for unified batch and streaming processing, integrating with Delta Lake and notebooks for structured/semi-structured data.
• Built real-time streaming data pipelines with Apache Kafka, Spark Streaming, and Azure Event Hubs; transformed data into RDDs/DataFrames and stored results in Azure Data Lake (Parquet).
• Implemented a data quality framework with Great Expectations, improving data accuracy and trust across analytics teams.
• Built and optimized Apache Oozie and Airflow workflows for orchestration and monitoring of Hive, Pig, Spark, and Sqoop jobs.
• Partnered with BI teams to create Power BI dashboards on curated datasets, adopted by 200+ business users for critical decision-making.
• Designed Azure HDInsight clusters to scale Hadoop/Spark workloads enterprise-wide.
• Developed CI/CD data pipelines using Azure DevOps, enabling version control and automated deployments.
• Performed Spark job tuning using Spark UI and YARN for memory/disk optimization and parallelization.
• Implemented Apache HBase for storing/retrieving high volumes of semi-structured and time-series data; handled row key design, compression, and optimization.
• Designed scalable architecture for near real-time analytics using Event Hub, Stream Analytics, and Azure Functions.
• Utilized Azure Blob Trigger Functions for automated downstream data processing.
• Created Spark DataFrames for validation and analytics on top of Hive and Azure Data Lake storage layers.
• Built Power BI & Tableau dashboards delivering business-critical insights.
• Developed PowerShell, Shell, and Python scripts to automate deployments, ingestion, and monitoring.
• Monitored enterprise resources with Azure Monitor, configuring alarms and alerts across Blob Storage, Data Factory, Event Hubs, and SQL DW.
• Involved in all phases of SDLC including requirements gathering, analysis, design, development, testing, and deployment.
Agilon Health Bengaluru, India Jan 2017 - Mar 2020
Data Engineer
• Designed and implemented ETL pipelines using Google Cloud Platform (GCP) services such as Cloud Dataflow, Data Fusion, and Dataprep, enabling scalable and efficient data integration across distributed systems.
• Designed scalable pipelines with GCP Dataflow + BigQuery to handle healthcare data, supporting real-time reporting for 1M+ patient records.
• Developed and deployed automated data pipelines in Python to support seamless data ingestion and transformation from diverse sources into BigQuery for real-time analytics and reporting.
• Created optimized BigQuery views and used Data Definition Language (DDL) for managing schemas, improving query efficiency and aligning with analytical requirements.
• Migrated complex FHIR file structures into PostgreSQL, maintaining data integrity and enabling streamlined access for downstream applications.
• Migrated ETL from on-prem to GCP BigQuery & Snowflake, reducing infrastructure costs by 25%.
• Monitored and resolved pipeline execution issues using Airflow (Composer) by analyzing DAG logs, improving pipeline reliability and reducing downtime.
• Built ingestion workflows with Airflow + Terraform automation, ensuring reliable deployments across environments (see the DAG sketch after this section).
• Leveraged Spark Streaming for real-time processing of sales and transaction data from Kafka, storing results in Parquet format for efficient querying.
• Enhanced performance of Spark jobs by optimizing Spark SQL queries and Scala transformations, tuning RDD operations and joins, and improving daily pipeline performance by 2x.
• Built and maintained ETL orchestration workflows using Apache Oozie, coordinating Spark and Hive jobs to ensure timely execution of business-critical pipelines.
• Utilized Sqoop to import/export data between HDFS, Hive, and relational databases, supporting legacy system integration and historical data processing.
• Implemented data cleaning and transformation techniques using pandas, NumPy, and Spark SQL, enabling feature engineering and model readiness.
• Created Hive external and managed tables with partitioning and bucketing, optimizing performance for downstream analytics workloads.
• Designed Druid ingestion pipelines to extract and transform website database data for real-time dashboarding and user behavior insights.
• Used Matillion ETL for performance tuning of data pipelines by optimizing SQL scripts and resource parallelism.
• Wrote and automated Terraform scripts to manage EMR infrastructure and automate data loading into ScyllaDB.
• Collaborated with analytics teams to deliver real-time dashboards for provider performance and patient engagement.
• Migrated on-premises ETL processes to GCP using native services such as BigQuery, Cloud DataProc, and Cloud Storage, enhancing scalability and reducing infrastructure costs.
• Integrated Snowflake into GCP architecture for scalable warehousing and downstream consumption.
• Collaborated via Git repositories, ensuring version-controlled, modular, and testable pipeline frameworks.
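The DAG below is a minimal Cloud Composer (Airflow) sketch of the GCS-to-BigQuery ingestion pattern described in this role. The DAG id, bucket, dataset, table, and project names are hypothetical placeholders chosen only to illustrate the structure; it assumes the Google provider package for Airflow is installed.

```python
# Minimal sketch of a daily GCS -> BigQuery load-and-transform DAG (Cloud Composer).
# DAG id, bucket, dataset, table, and project names are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="patient_events_daily_load",      # placeholder DAG name
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},
) as dag:

    # Load the day's raw JSON files from Cloud Storage into a staging table
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw_to_staging",
        bucket="example-raw-bucket",                              # placeholder bucket
        source_objects=["patient_events/{{ ds }}/*.json"],
        destination_project_dataset_table="analytics.stg_patient_events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform staging data into the reporting table consumed by dashboards
    transform = BigQueryInsertJobOperator(
        task_id="transform_to_reporting",
        configuration={
            "query": {
                "query": """
                    SELECT patient_id, event_type, DATE(event_time) AS event_date
                    FROM `analytics.stg_patient_events`
                """,
                "destinationTable": {
                    "projectId": "example-project",               # placeholder project
                    "datasetId": "analytics",
                    "tableId": "patient_events",
                },
                "writeDisposition": "WRITE_APPEND",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> transform
```

Splitting the load and transform into separate tasks keeps failures isolated: a bad transform can be retried or fixed without re-pulling the raw files from Cloud Storage.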