Pallavi Gudala
Senior Data Engineer
Email ID: *********.***@*****.***
Contact: +1-314-***-****
LinkedIn: www.linkedin.com/in/pallavi-g-95460333a
Professional Summary:
9+ years of experience as a Data Engineer with a strong background in designing, analyzing, and developing software applications. Expertise in Big Data ecosystems and Spark technologies, managing large-scale data pipelines and platforms using tools such as Spark, AWS, Azure, Talend, Hadoop, Hive, Impala, Oozie, Sqoop, HBase, Scala, Python, Kafka, Kudu, and NoSQL databases such as Cassandra and MongoDB.
Skilled in utilizing Spark, HDFS, MapReduce, YARN, HBase, Pig, Sqoop, Flume, Oozie, Impala, Zookeeper, Hive, NiFi, and Kafka for distributed computing, scalability, and high-performance solutions.
Proficient in designing and building end-to-end applications using modern programming languages and frameworks (e.g., Java, Python, Node.js, React). Skilled in creating user-centric, high-performance, and secure software solutions.
Experienced Data Integration Specialist with extensive expertise in designing, developing, and deploying SSIS (SQL Server Integration Services) packages for ETL processes.
Strong hands-on experience in Sybase to SQL Server schema conversion, including mapping of data types, constraints, and stored procedures.
Demonstrated success in migrating on-premises SSIS solutions to Azure, leveraging Azure Data Factory, Azure SQL Database, and Azure Data Lake for modern, cloud-based architectures.
Knowledgeable in Apache Spark, Kafka, Storm, NiFi, Talend, RabbitMQ, Elasticsearch, Apache Solr, Splunk, and BI tools like Tableau for real-time processing and business intelligence.
Collaborated with data architects to implement medallion architecture (Bronze, Silver, Gold layers) using AWS S3 and Redshift, ensuring data quality and structured data refinement at each stage.
Developed backend services to support data ingestion, transformation, and API delivery, integrating with MSSQL and NoSQL databases.
Hands-on experience with data architecture, encompassing data modeling, storage, processing, governance, and integration.
Proficient in using Sqoop to transfer data between HDFS and relational databases and working with partitioned Hive tables for optimized data storage and querying.
Created dashboards and reports that monitored claim trends, provider performance, and chronic disease prevalence using ICD and CPT classification logic.
Collaborated with reporting and UI development teams to refactor existing data access logic and adapt to updated database schemas.
Designed and implemented scalable ETL pipelines in Databricks using PySpark and Delta Lake, processing terabytes of structured and semi-structured data (a brief illustrative sketch follows this summary).
Skilled in extracting, transforming, and loading (ETL) data from multiple sources like SQL Server, Snowflake, Azure, and Excel into Power BI.
Developed and optimized distributed batch and streaming data pipelines using Apache Spark (Core, SQL, Streaming) in both on-prem and cloud environments (AWS EMR, Azure Databricks, GCP Dataproc).
Developed custom GenAI solutions for use cases such as document summarization, data classification, and natural language querying over structured databases.
Collaborated with business stakeholders to gather reporting requirements and translated them into visually compelling Power BI dashboards.
Developed Spark applications using Scala, Python, and Spark SQL/Streaming to process large datasets efficiently.
Implemented Spark Streaming jobs with Scala, creating resilient distributed datasets (RDDs), and using PySpark for real-time streaming data solutions.
Extensive experience in building enterprise-level solutions for batch processing (using Apache Pig) and stream processing (using Spark Streaming, Apache Kafka, and Apache Flink).
Designed and developed high-performance data pipelines in Snowflake using SQL, Streams, Tasks, and Snowpipe for real-time and batch data processing.
In-depth knowledge of NoSQL databases, particularly HBase and Cassandra.
Expertise in AWS services such as EC2, RDS, Glue, Redshift, AWS Lambda, Step Functions, Kinesis, SageMaker, and DynamoDB.
Extensive experience with Ab Initio components and data processing features.
Proficient in dimensional data modeling using tools like ER/Studio, Erwin, and Sybase Power Designer, with strong skills in Star and Snowflake schema modeling.
Hands-on expertise with DevOps tools like Jenkins, Maven, Terraform, Ansible, Docker, and Kubernetes, along with complex shell and Python scripting in Linux environments.
Extensive experience with data visualization tools like Tableau and Power BI, including advanced DAX functions and integrating these tools with various data sources.
Skilled in designing and developing data models for accessing Java applications using SQL, PL/SQL, and Hibernate ORM, with experience in NoSQL databases like MongoDB.
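A minimal, illustrative sketch of the Databricks/Delta Lake ETL pattern noted above, written in PySpark; the bucket paths, table layout, and column names are hypothetical and shown only to indicate the approach:

    # Minimal PySpark + Delta Lake ETL sketch; paths and columns are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("claims_etl").getOrCreate()

    # Ingest semi-structured JSON, apply light cleansing, and persist as a partitioned Delta table.
    raw = spark.read.json("s3://example-bucket/raw/claims/")
    clean = (raw.dropDuplicates(["claim_id"])
                .withColumn("ingest_date", F.current_date()))
    (clean.write.format("delta")
          .mode("overwrite")
          .partitionBy("ingest_date")
          .save("s3://example-bucket/curated/claims/"))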
Technical Skills:
Hadoop/Big Data: Spark, Kafka, Hive, HBase, Pig, HDFS, MapReduce, Sqoop, Oozie, Tez, Impala, Ambari, YARN
AWS Components: EC2, EMR, S3, RDS, CloudWatch, Athena, Redshift, DynamoDB, Lambda
Azure Components: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory (ADF), SQL DB, SQL DWH, Cosmos DB, Azure AD)
GCP Components: Dataproc, BigQuery, Compute Engine, Cloud Storage, Bigtable
Programming Languages: Python, Java, PySpark, PL/SQL, SQL, Spark
Operating Systems: Linux, Windows, CentOS, Ubuntu, RHEL, Unix
SQL Databases: MySQL, Oracle, MS SQL Server, Teradata
NoSQL Databases: HBase, DynamoDB, Bigtable
Web Technologies: Spring, Hibernate, Spring Boot
Tools: IntelliJ, Eclipse, Cloud Data Fusion
Scripting Languages: Python, Shell, Java
Education Details:
Bachelor's in Computer Science and Engineering (CSE), Gitam University, Hyderabad – 2016
Professional Experience:
Client: Citizens Bank - Providence, RI April 2023 to Present
Role: Senior Data Engineer
Responsibilities:
Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM), focusing on high availability, fault tolerance, and auto-scaling provisioned through AWS CloudFormation.
Designed and implemented stateful stream processing applications in Apache Flink to handle complex event processing for fraud detection and customer activity tracking, ensuring sub-second latency.
Migrated legacy on-prem databases and ETL processes to Snowflake cloud data warehouse, improving scalability and reducing infrastructure costs.
Designed and orchestrated ELT pipelines using AWS Glue for data extraction and transformation, storing raw data in S3 (Bronze), applying business rules (Silver), and loading cleansed datasets into Redshift (Gold) for analytics consumption.
Designed and implemented ETL workflows using AWS Glue to ingest and transform large volumes of structured and semi-structured data into Amazon S3.
Integrated Node.js backend with AWS services such as Lambda, S3, and EventBridge for serverless and event-driven architectures.
Implemented form validation, state management, and custom hooks in TypeScript, improving user experience and front-end reliability.
Delivered solutions in fast-paced investment banking environments, ensuring compliance, security, and precision in financial data handling.
Created and published datasets and dashboards to Power BI Workspaces, sharing insights across departments and teams.
Developed data marts in Amazon Redshift to support reporting and business intelligence use cases.
Optimized Power BI performance through efficient data modeling, query reduction, and incremental data loads.
Developed Flink SQL/Table API queries to accelerate prototyping and enhance maintainability of transformation logic across streaming and batch datasets.
Integrated Generative AI models (e.g., GPT-4, Claude, Gemini) into enterprise data workflows for automated insights, summarization, and intelligent data querying.
Automated deployment and version control of Spark jobs using Airflow, GitLab CI/CD, and containerization strategies (Docker/Kubernetes).
Developed AWS data pipelines from various data sources in AWS, using AWS API Gateway to receive responses from AWS Lambda functions that retrieve data, convert responses into JSON format, and store them in AWS Redshift.
Developed scalable AWS Lambda code in Python to process nested JSON files, including flattening, converting, comparing, and sorting (see the sketch following this section).
Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
Optimized the performance and efficiency of existing Spark jobs and converted the MapReduce script to Spark SQL.
Collected data from an AWS S3 bucket in near real time using Spark Streaming, applied the appropriate transformations and aggregations, and persisted the data in HDFS.
Implemented the AWS Glue Data Catalog with a crawler to catalog data landing in S3 and enable SQL query operations.
Developed robust and scalable data integration pipelines to transfer data from the S3 bucket to the RedShift database using Python and AWS Glue.
Built and maintained the Hadoop cluster on AWS EMR and used AWS services like EC2 and S3 to process and store small data sets.
Developed notebook-based data workflows using Python, SQL, and Scala to support real-time and batch data processing in Databricks.
Developed Python code for different tasks, dependencies, and time sensors for each job for workflow management and automation using the Airflow tool.
Scheduled Spark/Scala jobs using Oozie workflows on the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
Designed interactive Tableau reports and dashboards based on business requirements.
Environment: AWS EMR, S3, EC2, Lambda, Apache Spark, Spark-Streaming, Spark SQL, Python, Scala, Shell scripting, Snowflake, AWS Glue, Oracle, Git, Tableau.
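A minimal, illustrative sketch of the Lambda handler pattern referenced above (flattening and sorting nested JSON received via API Gateway before loading into Redshift); the payload shape and field names are assumptions, not taken from the project:

    import json

    def _flatten(obj, parent_key="", sep="."):
        """Recursively flatten nested dicts/lists into a single-level dict."""
        items = {}
        if isinstance(obj, dict):
            for key, value in obj.items():
                new_key = f"{parent_key}{sep}{key}" if parent_key else key
                items.update(_flatten(value, new_key, sep))
        elif isinstance(obj, list):
            for idx, value in enumerate(obj):
                items.update(_flatten(value, f"{parent_key}{sep}{idx}", sep))
        else:
            items[parent_key] = obj
        return items

    def lambda_handler(event, context):
        # API Gateway proxy integration delivers the payload as a JSON string in "body".
        payload = json.loads(event.get("body", "{}"))
        flat = _flatten(payload)
        # Deterministic key order simplifies downstream comparison and loading into Redshift.
        record = {key: flat[key] for key in sorted(flat)}
        return {"statusCode": 200, "body": json.dumps(record)}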
Client: Molina HealthCare, Bothell, WA May 2021 to March 2023
Role: Azure Data Engineer
Project: Dignify American Society (DAS)
Responsibilities:
Met with business/user groups to understand the business process, gather requirements, and analyze, design, develop, and implement solutions according to client requirements.
Leveraged Snowflake Data Marketplace and Secure Data Sharing for cross-organizational collaboration and access to third-party datasets.
Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
Designed and developed Azure Data Factory (ADF) pipelines extensively for ingesting data from different source systems, both relational and non-relational, to meet business functional requirements.
Integrated Power BI with diverse data sources such as SQL Server, Azure Data Lake, Excel, and REST APIs for unified business intelligence reporting.
Implemented data preprocessing pipelines to feed clean and relevant data into GenAI models, improving generation quality and business relevance.
Collaborated with BI teams to deliver Power BI and Tableau dashboards based on curated and transformed data from Databricks Lakehouse.
Designed and developed an event-driven architecture using blob triggers and Data Factory, creating pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
Developed a Spark streaming pipeline to batch real-time data, detect anomalies by applying business logic, and write the anomalies to an HBase table (see the sketch following this section).
Experience working with ERP, SAP, Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos NoSQL DB, and Databricks. Developed ETL models in Databricks using Scala to migrate data to Snowflake.
Proficient in optimizing SSIS workflows to enhance performance and ensure seamless data integration.
Adept at implementing hybrid data integration solutions, configuring Azure-SSIS Integration Runtime, and reengineering legacy SSIS packages for cloud compatibility.
Ingested huge volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Worked on Apache Flink to implement the transformation of data streams for filtering, aggregating, and updating state.
Integrated third-party APIs and libraries into Java applications, enhancing functionality and reducing development time.
Conducted code reviews to ensure code quality, maintainability, and adherence to project specifications.
Implemented security measures in Java applications, addressing vulnerabilities and ensuring data integrity.
Created and optimized algorithms in Java for improved data processing and computational efficiency.
Developed and maintained scalable web applications using Java Servlets and JSP.
Configured and customized Spring Boot applications using properties and YAML files to achieve flexibility and ease of maintenance.
Implemented asynchronous processing using Spring Boot's support for asynchronous tasks, improving system responsiveness and resource utilization.
Integrated third-party libraries and frameworks seamlessly into Spring Boot applications to leverage additional functionalities and tools.
Employed Spring Boot's dependency injection features to manage and organize application components, promoting code maintainability.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity. Performed data flow transformations using the Data Flow activity. Implemented an Azure self-hosted integration runtime in ADF. Developed streaming pipelines using Apache Spark with Python.
Improved performance by optimizing compute time for processing streaming data and reduced costs by optimizing cluster runtime. Implemented Terraform modules to deploy various applications across multiple cloud providers and manage the associated infrastructure.
Performed ongoing monitoring, automation, and refinement of data engineering solutions. Extensively used the SQL Server Import and Export Data tool. Worked with complex SQL views, stored procedures, triggers, and packages in large databases across various servers.
Created build and release pipelines in VSTS and performed deployments using an SPN (service principal) connection to implement CI/CD.
Environment & Tools: Big Data, Kafka, ETL, Apache Spark, Apache Druid, Azure Synapse Analytics, SQL Database, Azure Data Lake Storage (ADLS), Azure Data Factory, Azure Data Lake, Azure DevOps, Teradata, Java, Azure Data Share, MySQL, Jenkins, Git, PySpark with Databricks, and Apache Airflow.
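A minimal PySpark Structured Streaming sketch of the anomaly-detection flow referenced above; the broker, topic, schema, and threshold are hypothetical, and the sink is simplified to Delta for illustration (the pipeline described above wrote anomalies to HBase):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("claims_anomaly_stream").getOrCreate()

    schema = (StructType()
              .add("claim_id", StringType())
              .add("provider_id", StringType())
              .add("billed_amount", DoubleType()))

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
              .option("subscribe", "claims-events")               # assumed topic
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Example business rule: flag unusually large billed amounts.
    anomalies = events.filter(F.col("billed_amount") > 50000)

    (anomalies.writeStream
              .format("delta")                                     # simplified sink for the sketch
              .option("checkpointLocation", "/mnt/chk/anomalies")  # assumed checkpoint path
              .start("/mnt/curated/claim_anomalies"))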
Client: State of WA – Seattle, WA Nov 2019 – May 2021
Role: Big Data Engineer
Responsibilities:
Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
Prepared scripts to automate the ingestion process using PySpark and Scala as needed through various sources such as API, AWS S3, Teradata, and Redshift.
Collaborated with business stakeholders to gather reporting requirements and translated them into visually compelling Power BI dashboards.
Integrated AWS Step Functions with Glue and Lambda to orchestrate data workflows, manage job dependencies, and handle error retries across complex ELT pipelines.
Built data quality and validation frameworks using Databricks and Great Expectations to ensure accuracy and completeness of ingested data.
Utilized Snowflake's multi-cluster compute architecture to handle large-scale parallel processing and concurrent workloads without performance degradation.
Migrated databases to a cloud platform and performed performance tuning.
Used Spark and Hive to implement the transformations needed to convert the daily ingested data to historic data.
Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model, which gets the data from Kafka in near real time.
Managed Collibra DGC across the enterprise, driving governance activities for all participating business units and ensuring all work activities are completed on time.
Built different visualizations and reports in Tableau using Snowflake data.
Developed a reusable framework for future migrations that automates ETL from RDBMS systems to the data lake using Spark data sources and Hive data objects (see the sketch following this section).
Designed Star and Snowflake schemas for data warehouse and ODS architectures.
Designed BI applications in Tableau, Power BI, SSIS, SSRS, SSAS, Informatica, and DataStage.
Worked on AWS Data pipeline to configure data loads from S3 to Redshift.
Migrated on-premises database structure to Redshift data warehouse.
Extracted, transformed, and loaded data from various heterogeneous data sources and destinations using AWS Redshift.
Built data models for affordability and benefit programs used by Medicaid and public sector clients.
Developed ETL jobs in AWS Glue using Scala and PySpark/Python. The initiative was to migrate existing Informatica mapping and workflows to Glue jobs and to ingest data from several feeds to the data lake.
Built pipelines to move hashed and unhashed data from XML files to the Data Lake.
Analyzed data using Pig scripts, Hive queries, Spark (Python), and Impala.
Wrote real-time processing jobs using Spark Streaming with Kafka.
Performed Tableau performance tuning covering data fetching, parameters, custom fields, and transformations.
Involved in importing real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
Developed Pig program for loading and filtering the streaming data into HDFS using Flume.
Developed HBase data model on top of HDFS data to perform real-time analytics using Java API.
Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR, performing the necessary transformations based on the source-to-target mappings (STMs) developed.
Environment: Spark, Kafka, Hadoop, HDFS, AWS, Spark-SQL, Snowflake, Python, MapReduce, Tableau, Power BI, SSIS, SSRS, SSAS, Informatica, DataStage, MySQL, MongoDB, HBase, Oozie, Zookeeper.
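An illustrative sketch of the reusable RDBMS-to-data-lake ingestion pattern referenced above, written in PySpark with Hive support; the JDBC URLs, table names, and config list are hypothetical placeholders:

    # Generic JDBC-to-data-lake ingest sketch; URLs, tables, and databases are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("rdbms_to_datalake")
             .enableHiveSupport()
             .getOrCreate())

    def ingest_table(jdbc_url, source_table, target_db, target_table):
        """One-table ingest: JDBC source -> Hive-registered Parquet table in the data lake."""
        df = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", source_table)
              .option("fetchsize", 10000)
              .load())
        df.write.mode("overwrite").format("parquet").saveAsTable(f"{target_db}.{target_table}")

    # Driven by a simple config list so new feeds require no code changes.
    feeds = [("jdbc:postgresql://host:5432/claims", "public.member", "datalake_raw", "member")]
    for url, src, db, tgt in feeds:
        ingest_table(url, src, db, tgt)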
Client: Nationwide Mutual Insurance Company, Columbus, OH Feb 2017 to Nov 2019
Role: GCP Data Engineer
Project: Investigation and Fraud Settlement across NA.
Responsibilities:
Engaged in the full project lifecycle, including requirements gathering, analysis, design, development, change management, and deployment, focusing on Google Cloud Platform (GCP) infrastructure and services.
Expertise in building reproducible and automated workflows using Vertex AI Pipelines and integrating with other GCP services like Cloud Storage, BigQuery, and Cloud Functions.
Collaborated with cross-functional teams, including data engineers and operations, to implement ETL processes in GCP, optimizing SQL queries for efficient data extraction aligned with GCP BigQuery for analytical workloads.
Worked closely with Engagement Managers and Data Scientists using Agile methodology to understand project requirements, delivering scalable data solutions on GCP.
Designed and deployed Hadoop and Big Data components such as HDFS, Hive, and Apache Oozie within GCP, and migrated Hadoop workloads to GCP's Dataproc for managed clusters.
Proficient in the Data Science Project Life Cycle within GCP, contributing to data acquisition, cleaning, engineering, statistical modeling, testing, validation, and visualization using GCP's AI/ML tools like Vertex AI.
Responsible for building data pipelines in GCP using Dataflow, Pub/Sub, and Apache Kafka for real-time data streaming and ingestion.
Demonstrated proficiency in Java development on GCP, leveraging cloud-native services such as App Engine and Cloud Run for efficient application deployment.
Extensive experience with Spring Boot applications on GCP, leveraging Cloud SQL and Firestore for data persistence and RESTful APIs for backend communication.
Utilized Google Kubernetes Engine (GKE) for container orchestration, ensuring high availability and scalability of applications.
Developed real-time data processing applications using Apache Flink and PySpark on GCP, leveraging Google Cloud Storage (GCS) and BigQuery for data storage and analytics.
Implemented machine learning models and Big Data analytics using Apache Spark with Python on GCP, executing use cases under Spark MLlib.
Migrated on-premises data to Google Cloud Storage and BigQuery, leveraging GCP's Cloud Data Transfer and Pub/Sub for streaming ingestion.
Managed GCP cloud infrastructure using Compute Engine, GCS, Cloud SQL, and BigQuery, optimizing cost and performance.
Developed backend applications using Java Spring Boot on GCP, ensuring scalable and high-performance solutions through Cloud Spanner and Firestore.
Designed and developed ETL pipelines using Google Cloud Dataflow, Apache Beam, and Pub/Sub, facilitating seamless data integration across GCP services (see the sketch following this section).
Integrated Cloud Identity and Access Management (IAM) to secure applications and data, implementing robust authentication and authorization protocols in GCP.
Optimized database interactions using GCP's BigQuery and Firestore, ensuring efficient data access and processing.
Implemented caching solutions using Memorystore on GCP, improving the performance of applications by reducing data retrieval times.
Developed Spark and PySpark applications in GCP's Dataproc for data cleansing, formatting, and schema creation, integrating with GCS and BigQuery.
Built real-time data pipelines using Google Dataflow and Apache Flink to process unbounded streams, delivering insights through BigQuery and Looker.
Developed ETL workflows orchestrated through Cloud Composer (Apache Airflow) for automation, scheduling, and monitoring in GCP environments.
Strong proficiency in Data Visualization using Looker and integrating various GCP data sources like BigQuery, Cloud SQL, and Google Sheets.
Utilized Looker and Data Studio to create real-time dashboards and integrate data from GCP services such as BigQuery and Pub/Sub.
Expertise in writing complex SQL queries in BigQuery for analytics and reporting, optimizing query performance and cost.
Developed and maintained CI/CD pipelines in GCP for automated build and deployment to development, QA, and production environments. Tuned performance of data pipelines by optimizing PySpark jobs in GCP.
Environment: Google Dataproc, Spark, PySpark, Hive, GCP, GCS, BigQuery, Cloud Composer, Cloud Functions, Kafka, Bigtable, Unix.
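A minimal Apache Beam (Python) sketch of the Pub/Sub-to-BigQuery streaming pattern referenced above; the project, topic, table, and schema are hypothetical and shown only to indicate the shape of such a Dataflow pipeline:

    # Streaming Dataflow pipeline sketch; project, topic, table, and schema are hypothetical.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project="example-project",
        region="us-central1",
        temp_location="gs://example-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadClaims" >> beam.io.ReadFromPubSub(topic="projects/example-project/topics/claims")
         | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteToBQ" >> beam.io.WriteToBigQuery(
               "example-project:fraud.claim_events",
               schema="claim_id:STRING,provider_id:STRING,billed_amount:FLOAT",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))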
Client: DXC Technology - Chennai, India Sep 2015 to Jul 2016
Role: Python Developer/ Data Analyst
Responsibilities:
Engaged in a project involving machine learning, big data, data visualization, and Python development, with proficiency in Unix and SQL.
Conducted exploratory data analysis using NumPy, Matplotlib, and pandas.
Possessed expertise in quantitative analysis, data mining, and presenting data to discern trends and insights beyond numerical values.
Utilized Python libraries such as Pandas, NumPy, SciPy, and Matplotlib for data analysis.
Generated intricate SQL queries and scripts for extracting and aggregating data, ensuring accuracy in line with business requirements; gathered and translated business requirements into clear specifications and queries.
Produced high-level analysis reports using Excel, providing feedback on data quality, including identification of billing patterns and outliers.
Identified and documented data quality limitations that could impact the work of internal and external data analysts.
Implemented standard SQL queries for data validation and created analytical reports in Excel, including Pivot tables and Charts. Developed functional requirements using data modeling and ETL tools.
Extracted data from various sources like CSV files, Excel, HTML pages, and SQL, performing data analysis and writing to different destinations such as CSV files, Excel, or databases.
Utilized the pandas API for time-series analysis and created a regression-test framework for new code (see the sketch following this section).
Developed and managed business logic through backend Python code.
Worked on Django REST framework, integrating new and existing API endpoints.
Demonstrated extensive knowledge in loading data into charts using Python code.
Leveraged Highcharts to pass data and create interactive JavaScript charts for web applications.
Demonstrated proficiency in using Python libraries like OS, Pickle, NumPy, and SciPy.
Environment: Python, HTML5, CSS3, Django, SQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, Python libraries (NumPy), Bitbucket.
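An illustrative pandas sketch of the kind of billing-trend and outlier analysis referenced above; the file name, column names, and outlier threshold are hypothetical:

    import pandas as pd

    # Load billing extracts (hypothetical file and column names).
    df = pd.read_csv("billing_extract.csv", parse_dates=["invoice_date"])

    # Monthly billed totals per account, for trend analysis.
    monthly = (df.set_index("invoice_date")
                 .groupby("account_id")["billed_amount"]
                 .resample("M")
                 .sum()
                 .reset_index())

    # Flag outliers: amounts more than 3 standard deviations from each account's mean.
    stats = monthly.groupby("account_id")["billed_amount"].agg(["mean", "std"]).reset_index()
    monthly = monthly.merge(stats, on="account_id")
    monthly["is_outlier"] = (monthly["billed_amount"] - monthly["mean"]).abs() > 3 * monthly["std"]

    print(monthly[monthly["is_outlier"]].head())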