
Azure Data Engineer

Location:
Alpharetta, GA
Posted:
March 17, 2025


Resume:

PAVANI GONGALLA

Contact: +1-770-***-****

Email: **************@*******.***

Sr. Data Engineer

LinkedIn: www.linkedin.com/in/pavani-gongalla-7b512112

Profile Summary:

●10+ years of professional IT experience developing, implementing, and configuring solutions with Python, Hadoop, and Big Data technologies.

●Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

●Expertise in Hadoop architecture and its components, including HDFS, YARN, Hive, Pig, High Availability, JobTracker, TaskTracker, NameNode, DataNode, Apache Cassandra, and the MapReduce programming paradigm.

●Strong experience working with Azure Cloud, Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight, Big Data technologies (Hadoop and Apache Spark), and Databricks.

●Performed performance tuning of PL/SQL queries to enhance execution speed and optimize resource usage, improving overall system performance.

●Experience in using various tools like Sqoop, Flume, Kafka, and Pig to ingest structured, semi-structured and unstructured data into the cluster.

●Hands-on experience in developing web applications and RESTful web services and APIs using Python, Flask, and Django.

●Good working knowledge of the Snowflake and Teradata databases.

●Strong experience migrating data from other databases to Snowflake.

●Proficient with the Apache Spark ecosystem, including Spark and Spark Streaming, using Scala and Python.

●Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.

●Constructed data staging layers and fast real-time systems to feed BI applications and machine learning algorithms.

●Hands-on experience with Spark architecture and its integrations, including the Spark SQL, DataFrames, and Datasets APIs.

●Utilized PL/SQL to manage and manipulate relational data, ensuring data integrity and optimizing performance during data ingestion processes.

●Created stored procedures and functions to encapsulate business logic, facilitating efficient data manipulation and reducing redundancy.

●Developed complex PL/SQL scripts to automate data transformations and validations, ensuring high data quality and consistency.

●3+ years of experience writing Python-based ETL frameworks and PySpark jobs to process large volumes of data daily (a representative sketch follows this summary).

●Hands-on experience in application deployment using CI/CD pipelines.

●Experience in implementing Spark using Scala and Spark SQL for faster processing of data.

●Strong experience extracting and loading data from different sources with complex business logic in Hive, and building ETL pipelines that process terabytes of data daily.

●Experienced in transporting and processing real-time event streams using Kafka and Spark Streaming.

●Designed and developed Spark pipelines to ingest real-time, event-based data from Kafka and other message-queue systems, and processed large data volumes with Spark batch jobs into the Hive data warehouse.

●Experience working with GitHub/Git 2.12 source and version control systems.

●Developed interactive reports and dashboards using Microsoft Power BI, providing stakeholders with real-time insights and advanced visualizations to support data-driven decision-making.

●Designed complex DAX (Data Analysis Expressions) formulas to create custom measures and calculated columns, enabling advanced filtering, time intelligence, and dynamic aggregations for metrics such as claim settlement times, claim aging, and fraud patterns.
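
A minimal PySpark sketch of the Python/PySpark ETL pattern referenced in the summary above; the paths, column names, and transformations are illustrative assumptions, not details from a specific engagement.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-etl").getOrCreate()

    # Extract: raw daily files (illustrative path)
    raw = spark.read.option("header", "true").csv("/data/raw/daily/")

    # Transform: cleanse, validate, and summarize
    clean = (raw.dropDuplicates()
                .filter(F.col("amount").isNotNull())
                .withColumn("amount", F.col("amount").cast("double")))
    summary = clean.groupBy("category").agg(F.sum("amount").alias("total_amount"))

    # Load: write curated and summary layers as Parquet
    clean.write.mode("overwrite").parquet("/data/curated/daily/")
    summary.write.mode("overwrite").parquet("/data/summary/daily/")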

Technical Skills:

Hadoop Ecosystem:

Hadoop, MapReduce, Hive, Pig, Impala, Sqoop, HDFS, HBase, Oozie, Spark, PySpark, Scala, and MongoDB

Cloud Technologies:

Azure: Analysis Services, Azure SQL Server, Synapse, Data Factory, Data Lake, Functions, SQL Data Warehouse, Databricks, HDInsight; AWS: S3, EC2, EMR, Lambda, SQS, DynamoDB, Step Functions, Glue, Athena, CloudWatch

DBMS:

Amazon Redshift, PostgreSQL, Oracle 9i, SQL Server, IBM DB2, and Teradata

ETL Tools:

DataStage, Talend, and Ab Initio

Deployment Tools:

Git, Jenkins, Terraform and CloudFormation

Programming Languages:

Java, Python, Scala

Scripting:

Unix Shell and Bash scripting

Education:

●Bachelor's in Computer Science, Jawaharlal Nehru Technological University, India, 2013

Certification:

●Microsoft Certified: Azure Data Engineer Associate. Credential ID: 724D15B45563911F.

Work Experience:

State Farm Insurance, Bloomington, IL Jun 2024 - Present

Role: Sr. Azure Data Engineer

Project: Real-time Data Ingestion and Processing for State Farm Claims

Description: The overall aim of the project was to consolidate on-premises data sources into a scalable cloud data platform and enhance the existing analytics environment for advanced visualization. The project scope included implementing robust data ingestion pipelines using Azure Data Factory, enabling the extraction of data from various sources and seamless loading into the designated data lake or databases. The firm was strategically positioned to make faster data-backed decisions and mitigate risks proactively while gaining a competitive edge within the dynamic financial services industry.

Responsibilities:

●Analyzed requirements and discussed functionality with managers and leads.

●Architected a scalable Azure data platform, leading data modeling efforts for efficient migration of claims data to Azure Data Lake Storage.

●Designed Self-hosted Integration Runtime (SHIR) for secure connectivity between on-premises systems and the Azure cloud.

●Leveraged Azure Data Lake Storage Gen2 for high-volume storage and backup of structured and unstructured data.

●Implemented Azure Service Fabric to orchestrate and manage microservices-based real-time data processing for claims data ingestion and analysis.

●Automated CI/CD pipelines using Azure DevOps and Service Fabric CLI, ensuring smooth deployment and updates without service disruptions.

●Integrated Azure Service Fabric with Azure Event Hubs and Stream Analytics to efficiently process and analyze real-time claim events.

●Designed and implemented ETL pipelines for migrating claims data from on-premises systems to Azure cloud services using Azure Data Factory, T-SQL, Spark SQL, and U-SQL. Managed data ingestion into Azure Data Lake, Azure Blob Storage, Azure SQL Database, and Synapse Analytics, leveraging Azure Databricks with Apache Spark for scalable real-time data processing.

●Developed distributed Spark applications in Python (PySpark) to load high-volume files with differing schemas into PySpark DataFrames, process them, and reload the results into Azure SQL DB tables (see the sketch after this section).

●Worked on secure service-to-service communication in Azure Service Fabric using managed identities and role-based access control (RBAC).

●Designed and implemented scalable data pipelines using Azure Data Factory and Snowflake for real-time claims processing and fraud detection.

●Implemented Azure Event Hubs for real-time ingestion of claims data and customer interactions, enabling seamless integration with the cloud data platform.

●Optimized Snowflake performance through query tuning, clustering, and compute scaling, ensuring high efficiency for analytics and reporting.

●Migrated on-premises claims data to Azure Data Lake Storage (ADLS) and Snowflake, ensuring cost efficiency, security, and compliance.

●Consolidated data into Azure Data Lake Storage Gen2 for scalable and secure cloud storage, ensuring efficient data retrieval and integration with Azure Synapse Analytics for complex queries and reporting.

●Set up Azure SQL Database for structured data storage and supported data warehousing with Azure Synapse Analytics to enable advanced data analytics and visualization.

●Implemented real-time data streaming pipelines with Apache Kafka and Azure Stream Analytics, optimizing transformation and timely delivery of insights.

●Utilized Databricks' Mosaic suite to customize and fine-tune Generative AI models, enhancing the capability to process and generate claims data.

●Implemented DBRX, Databricks' open-source large language model, to improve natural language understanding and generation within the claims processing system.

●Incorporated Generative AI models into existing ETL pipelines to automate data transformation and enrichment processes, reducing manual intervention and processing time.

●Automated data validation scripts using Python to ensure data integrity and consistency between on-premises and cloud data environments, minimizing data discrepancies and errors post-migration.

●Analyzed existing SQL scripts and designed the solution for implementation in PySpark.

●Leveraged Databricks' unified analytics platform to seamlessly integrate AI models with data engineering workflows, ensuring efficient data flow and accessibility.

●Developed custom aggregate functions using Spark SQL and performed interactive querying.

●Used Azure Monitor and Azure Log Analytics to track pipeline performance, troubleshoot issues, and optimize the cloud infrastructure for cost-efficiency and scalability.

●Integrated Azure Key Vault for secure management of encryption keys and credentials across data pipelines.

●Conducted performance tuning for Azure Synapse Analytics and Databricks to improve query response times and real-time data processing efficiency.

●Managed and administered Azure CosmosDB infrastructure, including account setup, partition key configuration, and multi-region replication, ensuring low-latency data access and high availability across global endpoints.

●Applied Generative AI techniques to analyze unstructured data, such as customer interactions and claims notes, extracting valuable insights for decision-making.

●Designed and optimized MongoDB schemas to efficiently store and retrieve unstructured claim-related data, ensuring seamless integration with Azure-based analytics workflows.

●Implemented high-performance queries and indexing strategies in MongoDB to accelerate claims data retrieval for real-time processing within Azure Databricks.

●Utilized PL/SQL to manage and manipulate relational data, ensuring data integrity and optimizing performance during data ingestion processes.

●Worked on Jenkins continuous integration for project deployment and deployed the project via Jenkins using the Git version control system.

●Designed a data catalog to improve data accessibility and discovery for analytics teams, promoting adherence to governance policies across multiple departments.

●Developed interactive reports and dashboards using Microsoft Power BI, providing stakeholders with real-time insights and advanced visualizations to support data-driven decision-making.

●Designed complex DAX (Data Analysis Expressions) formulas to create custom measures and calculated columns, enabling advanced filtering, time intelligence, and dynamic aggregations for metrics such as claim settlement times, claim aging, and fraud patterns.

●Built real-time monitoring dashboards by integrating Power BI with Azure Event Hubs and Streaming Analytics, giving instant visibility into claims processing.

●Configured Power BI alerts and subscriptions, enabling proactive decision-making by notifying users of anomalies or threshold breaches in claims metrics.

Environment: Azure SQL, Azure Data Warehouse, Azure Databricks, Azure Data Lake Analytics, GenAI, Azure Blob Storage, MapReduce, Snowflake, Python, PySpark, T-SQL, Git and GitHub, Power BI.
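
A minimal PySpark sketch of the multi-schema file load into Azure SQL DB described in this section; the storage paths, table name, and connection details are hypothetical placeholders, and the JDBC write assumes the SQL Server JDBC driver is available on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import current_timestamp, input_file_name

    spark = SparkSession.builder.appName("claims-multi-schema-load").getOrCreate()

    # Read claim files whose schemas differ; mergeSchema reconciles the column sets
    claims_df = (spark.read
                 .option("mergeSchema", "true")
                 .parquet("abfss://claims@examplelake.dfs.core.windows.net/raw/"))

    # Light cleansing and enrichment before the relational load
    curated_df = (claims_df
                  .dropDuplicates(["claim_id"])
                  .withColumn("source_file", input_file_name())
                  .withColumn("load_ts", current_timestamp()))

    # Reload into an Azure SQL Database table over JDBC (placeholder connection details)
    jdbc_url = "jdbc:sqlserver://example-server.database.windows.net:1433;database=claimsdb"
    (curated_df.write
     .format("jdbc")
     .option("url", jdbc_url)
     .option("dbtable", "dbo.curated_claims")
     .option("user", "<user>")
     .option("password", "<password>")
     .mode("append")
     .save())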

UnitedHealth Group, NY May 2021 - Apr 2024

Role: Azure Data Engineer

Project: Cloud-based Healthcare Data Integration and Processing

Description: The goal of the project was to design and implement a robust cloud-based data platform to streamline healthcare data integration, processing, and analytics. The platform aimed to consolidate data from various sources into Azure-based storage for enhanced analytics, improved compliance, and advanced insights. This initiative enabled Optum to reduce operational complexities, ensure data integrity, and comply with strict healthcare regulations like HIPAA.

Responsibilities:

●Collaborated with cross-functional teams, including compliance, security, and DevOps, to ensure adherence to HIPAA and healthcare data standards.

●Designed and implemented Azure-based data pipelines using Azure Data Factory for ETL workflows, integrating data from healthcare systems such as Epic, Workday, and Impact.

●Designed a scalable data platform using Azure Data Lake Storage Gen2, enabling the storage of structured and unstructured healthcare data for downstream analytics.

●Automated infrastructure provisioning with Terraform and Azure Resource Manager (ARM) templates to support efficient deployment and scalability.

●Utilized Azure Synapse Analytics for complex queries, facilitating advanced reporting and analytics to support healthcare claims and operational insights.

●Migrated on-premises healthcare data to Azure SQL Database and Azure Data Lake, ensuring seamless data transfer with minimal downtime.

●Implemented Azure Event Hubs and Stream Analytics for real-time ingestion and processing of patient data and claims transactions.

●Developed and executed Python scripts for automated data validation, ensuring accuracy and consistency of patient and claims data.

●Configured Azure Key Vault for secure management of credentials, certificates, and encryption keys, ensuring compliance with healthcare security standards.

●Optimized performance of Azure Databricks and Synapse Analytics by tuning queries and configurations to handle large healthcare datasets efficiently.

●Managed and monitored Azure CosmosDB infrastructure to support low-latency global access for patient data and real-time interactions.

●Conducted data quality checks using PySpark and Spark SQL, ensuring compliance with healthcare data governance policies (see the sketch after this section).

●Designed Power BI dashboards and interactive reports to visualize critical healthcare metrics such as claims processing time, patient engagement trends, and fraud detection insights.

●Built custom DAX measures for advanced filtering, time intelligence, and real-time analytics on claims and operational data.

●Integrated real-time streaming data with Power BI to provide dynamic monitoring of patient interactions and claims processing workflows.

●Drove adherence to cloud security, governance, and compliance frameworks for data privacy and regulatory compliance.

●Supported DevOps practices by integrating Azure DevOps pipelines for CI/CD processes, automating deployment of resources and data pipelines.

●Automated provisioning of Azure resources using Terraform, ensuring consistent infrastructure across environments.

●Monitored Azure pipelines with Azure Monitor and Log Analytics to troubleshoot and resolve issues, optimizing pipeline performance and reducing costs.

Environment: Azure Data Factory, Azure Synapse Analytics, Azure Data Lake Storage Gen2, Azure SQL Database, Azure Databricks, Python, PySpark, Terraform, Azure Key Vault, Power BI, Event Hubs, Stream Analytics, CosmosDB, T-SQL.
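
A minimal PySpark sketch of the kind of rule-based data quality check described in this section; the column names, paths, and failure threshold are illustrative assumptions, not details from the actual project.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("claims-data-quality").getOrCreate()

    claims = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/claims/")

    # Rule 1: required identifiers must be present
    null_ids = claims.filter(F.col("member_id").isNull() | F.col("claim_id").isNull()).count()

    # Rule 2: paid amount must be non-negative
    negative_paid = claims.filter(F.col("paid_amount") < 0).count()

    # Rule 3: service date must not be in the future
    future_dates = claims.filter(F.col("service_date") > F.current_date()).count()

    total = claims.count()
    failed = null_ids + negative_paid + future_dates
    print(f"{failed} of {total} rows failed quality rules")

    # Halt the run if more than 1% of rows violate any rule (illustrative threshold)
    if total > 0 and failed / total > 0.01:
        raise ValueError("Data quality threshold breached; halting downstream load")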

Baker & Taylor, Feb 2019 - Apr 2021

Role: Big Data engineer

Project: Data Infrastructure Enhancement

Description: The goal of Baker & Taylor was to create a comprehensive analytics and data processing platform for handling massive amounts of heterogeneous data. The project's tasks included ensuring effective data storage and transfer, streamlining data pipelines, and extracting data from multiple sources. Key tasks included developing and managing data workflows, optimizing data processing performance, and implementing solutions for advanced data analytics and reporting to support business insights and decision-making.

Responsibilities:

●Participated in extracting customer big data from various sources, transferring it to Azure Blob Storage for storage in data lakes, which included handling data from mainframes and databases.

●Created Azure Data Factory pipelines to copy the data from source to target systems.

●Developed UNIX shell scripts to load large numbers of files into HDFS from Linux File System.

●Involved in loading data from the UNIX file system and FTP to HDFS.

●Optimized MapReduce programs using combiners, partitioners, and custom counters to deliver the best results.

●Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

●Worked with different file formats such as TextFile, Avro, ORC, and Parquet for Hive querying and processing.

●Scheduled Oozie workflow engine to run multiple Hive and Pig jobs, which independently run with time and data availability.

●Migrated data from Client data scope to Azure SQL server and Azure Data Lake Gen2.

●Set up monitoring and logging for Azure Synapse to proactively identify and resolve data issues, optimizing data pipeline performance.

●Created multiple components using Java and employed Spring Batch for executing ETL batch processing tasks.

●Used Azure Blob Storage and Azure Data Factory (ADF) to effectively move data across databases, and Azure Event Hubs to feed server log data.

●Monitored daily Azure pipeline runs for different applications and supported multiple applications.

●Introduced partitioning and bucketing techniques in Hive to organize data for improved query efficiency (see the sketch after this section).

●Worked on complex SQL queries and PL/SQL procedures and converted them into ETL tool processes.

●Skilled in Power BI visualization, building high-quality Power BI reports and dashboards using DAX expressions and the Tabular Model.

●Used Azure Logic Apps workflow engine to manage interdependent jobs and to automate several types of Hadoop jobs, including Java MapReduce, Hive and Sqoop as well as system-specific jobs.

●Applied transformation rules such as data conversion, data cleansing, data aggregation, data merging, and data splitting.

●Worked with Linux systems and RDBMS database on a regular basis to ingest data using Sqoop.

●Worked with Hadoop ecosystem tools, including HDFS, Hive, and Sqoop for data management.

Environment: Azure services, HDFS, Map Reduce, Spark, YARN, Hive, Sqoop, Pig, Java, Python, Jenkins, SQL, ADF, Databricks, Data Lake, ADLS Gen2, Blob, MySQL, Azure Synapse, Power BI.
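
A minimal sketch of the Hive partitioning and bucketing approach mentioned in this section, expressed through Spark SQL in Python; the table and column names are illustrative assumptions, and fully Hive-compatible bucketed writes depend on the Spark and Hive versions in use.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partition-bucket-demo")
             .enableHiveSupport()
             .getOrCreate())

    # Partition by ingest date and bucket by customer_id so that date-range scans
    # prune partitions and joins on customer_id shuffle less data.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_curated (
            order_id    STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (ingest_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Load from a raw staging table using dynamic partitioning
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
    spark.sql("""
        INSERT INTO TABLE sales_curated PARTITION (ingest_date)
        SELECT order_id, customer_id, amount, ingest_date
        FROM sales_raw
    """)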

Wipro/BestBuy, India Mar 2017 - Jan 2019

Role: AWS Data Engineer

Project: E-commerce Solutions

Description: Wipro/BestBuy aimed to develop a scalable data processing and analytics platform to manage and analyze large volumes of retail data. The project involved designing, implementing, and optimizing ETL pipelines and cloud-based data storage solutions to enable advanced analytics and reporting. In order to improve operational insights and decision-making, key duties included creating frameworks for data ingestion, transferring data from on-premises systems, and enabling real-time data processing.

Responsibilities:

●Created an ETL framework in Python using Spark on AWS EMR.

●Developed a POC for migrating the project from an on-premises Hadoop MapReduce system to AWS.

●Worked on AWS Data Pipeline to configure data loads from S3 to Redshift.

●Involved in migrating tables from RDBMS into Hive tables using SQOOP and later generated data visualizations using Tableau.

●Designed and developed an ETL process in AWS Glue to migrate campaign data from external sources.

●Constructed AWS data pipelines using VPC, EC2, S3, Auto Scaling Groups, EBS, Snowflake, and IAM.

●Authored AWS Glue scripts to transfer data and ran Glue ETL jobs that performed aggregations in PySpark (see the sketch after this section).

●Created ingestion framework using Kafka, EMR, Cassandra in Python.

●Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

●Used AWS utilities such as EMR and S3 to run and monitor Hadoop/Spark jobs.

●Developed Cloud Functions in Python to process JSON files from source and load them into BigQuery.

●Worked with Spark for improving performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark-SQL, Data Frame, Pair RDD, Spark YARN.

●Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.

●Involved in converting Hive/SQL queries into Spark transformations using Spark RDD.

●Created S3 buckets, managed S3 bucket policies, and utilized S3 and Glacier for storage and backup on AWS.

●Installed and configured Apache Airflow for the S3 bucket and the Snowflake data warehouse.

●Developed a capability to implement audit logging at required stages while applying business logic.

●Used the AWS Glue Data Catalog with crawlers to catalog data from S3 and performed SQL query operations.

Environment: AWS Glue, S3, EC2, VPC, Redshift, EBS, EMR, Apache Spark, PySpark, SQL, Python, HDFS, Hive, Apache Kafka, Sqoop, YARN, Oozie, Shell scripting, Linux, Eclipse, Jenkins, Git, GitHub, MySQL, Cassandra, and Agile Methodologies.
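
A minimal sketch of an AWS Glue ETL job of the kind described in this section; the Data Catalog database, table name, and S3 output path are hypothetical placeholders.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the crawled source table from the Glue Data Catalog (placeholder names)
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="campaign_db", table_name="raw_campaigns")

    # Aggregate spend per campaign with plain PySpark
    df = dyf.toDF()
    agg = df.groupBy("campaign_id").agg(F.sum("spend").alias("total_spend"))

    # Write the result back to S3 as Parquet
    agg.write.mode("overwrite").parquet("s3://example-bucket/curated/campaign_spend/")

    job.commit()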

Datagaps, Hyderabad, India Feb 2015 - Feb 2017

Role: Python Developer

Project: Integrated Data Platform Development

Description: The Integrated Data Platform Development involved the creation of a robust and versatile data ecosystem to streamline processes, enhance data accessibility, and ensure seamless integration across diverse sources and formats. Mphasis spearheaded the architecture, development, and maintenance of this sophisticated platform.

Responsibilities:

●Developed frontend components using Python, HTML5, CSS3, AJAX, JSON, and jQuery for interactive web applications.

●Analyzed system requirements specifications and actively engaged with clients to gather and refine project requirements.

●Established and maintained automated continuous integration systems using Git, MySQL, and custom Python and Bash scripts.

●Successfully migrated Django databases between SQLite, MySQL, and PostgreSQL while ensuring complete data integrity.

●Developed custom directives using Angular.js and interfaced with jQuery UI using Python and Django for efficient content management.

●Created test harnesses and utilized Python's unit test framework for comprehensive testing to ensure application quality and reliability.

●Developed multi-threaded standalone applications in Python for various purposes, including viewing circuit parameters and performance analysis.

●Proficient in performing various mathematical operations, data cleaning, feature scaling, and feature engineering using Python libraries like Pandas and NumPy.

●Developed machine learning algorithms such as classification, regression, and deep learning using Python, and created and wrote result reports in different formats.

●Utilized Python libraries like Beautiful Soup for web scraping to extract data for analysis and visualization.

●Developed REST APIs using Python with the Flask and Django frameworks, integrating various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files (see the sketch after this section).

●Led the development of data platforms from scratch, participating in requirement gathering, analysis, and documentation phases of the project.

Environment: Python, Jupyter Notebook, PyCharm, Django, RDBMS, Shell scripting, SQL, PySpark, Pandas, NumPy, Matplotlib, MySQL, PostgreSQL, UNIX, Jenkins, Git, XML, HTML, CSS, JavaScript, Oracle, JSON.
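
A minimal Flask sketch of the kind of REST API work described in this section; the endpoint, resource fields, and in-memory data store are illustrative assumptions standing in for the actual data sources.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # In-memory store standing in for the RDBMS/JDBC sources mentioned above
    circuits = {1: {"id": 1, "name": "main", "voltage": 230}}

    @app.route("/api/circuits", methods=["GET"])
    def list_circuits():
        # Return all circuit records as JSON
        return jsonify(list(circuits.values()))

    @app.route("/api/circuits", methods=["POST"])
    def create_circuit():
        # Create a new circuit record from the posted JSON payload
        payload = request.get_json(force=True)
        new_id = max(circuits) + 1 if circuits else 1
        circuits[new_id] = {"id": new_id, **payload}
        return jsonify(circuits[new_id]), 201

    if __name__ == "__main__":
        app.run(debug=True)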

Magneto IT Solutions, India Dec 2013 - Jan 2015

Role: Software Engineer

Project: E-commerce Solutions

Description: Magneto IT Solutions focuses on innovative technology to empower businesses, improve operational efficiency, and enhance their digital presence in a competitive market. The project involved developing robust, scalable e-commerce platforms that enable businesses to establish and grow their online presence, facilitating seamless transactions and customer interactions.

Responsibilities:

●Used a microservices architecture to break down the monolithic application into independent components.

●Involved in developing REST web services using Spring MVC to extract client-related data from databases, and implemented microservices based on RESTful APIs utilizing Spring Boot with Spring MVC.

●Customized RESTful Web Service layer to interface with DB2 system for sending JSON format data packets between front-end and middle-tier controllers.

●Improved the performance of the backend batch processes using Multithreading and concurrent package API.

●Used jBPM to control some of the workflows in different modules of the application, providing the interface documents for approval.

●Developed the persistence layer using Hibernate Framework by configuring the various mappings in hibernate files and created DAO layer.

●Developed Hibernate and Spring Integration as the data abstraction to interact with the database of Oracle DB.

●Experience working with application servers such as Apache Tomcat.

●Used Maven as build automation tool for deploying the project on Tomcat Application Server.

●Used Jenkins alongside Maven to compile and build microservices code and configured the build triggers.

●Worked with Jira for user requirements and used Jira as the bug-tracking tool.

●Worked on Agile methodology for the software development process of functional and scalable applications.

●Used Git to maintain file versions and took responsibility for code merges and creating new branches when new feature implementation started.

Environment: Java, MySQL, Spring Boot, Oracle Database, Apache Tomcat, Maven, Jenkins, Jira, Git, GitHub, and Agile Methodologies.


