Uma Nagalla
www.linkedin.com/in/uma-nagalla-a*b**6293 *************@*****.*** 781-***-****
Profile
Data Engineer with 7 years of IT experience specializing in Big Data and cloud platforms (AWS and Azure), with domain expertise in healthcare, banking, and e-commerce. Backed by a Master's degree and related certifications, and bringing strong decision-making, a solid work ethic, and adaptability to dynamic environments, I have consistently contributed to project excellence and business success using a diverse range of tools and technologies.
Skills
Data Engineering and Big Data Analytics: Hadoop Ecosystem Tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Presto, Kafka, Flume, Oozie, Sqoop. Integration of Spark with Cassandra and Zookeeper.
AWS Services: EC2, S3, Redshift, Glue, Lambda functions, Step Functions, CloudWatch, SNS, DynamoDB, SQS, RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, SES.
Real-time Data Streaming Solutions: Apache Spark, Spark Streaming, Apache Kafka, Apache Flink.
Azure Data Platform Services: Azure Data Lake (ADLS), Azure Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL databases.
Google Cloud Platform Services: Google Cloud Storage, Google BigQuery, Google Cloud Data Fusion, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Cloud Dataproc, Databricks on Google Cloud, Google Cloud Bigtable, Google Cloud Firestore.
Text Analytics and Data Visualization: R and Python for text analytics; data visualization using Tableau, Power BI, Grafana, and Elasticsearch.
Development Approaches: Test-driven development (TDD), behavior-driven development (BDD), acceptance test-driven development (ATDD), Agile practices, Scrum, CI/CD, Jupyter and Linux/Unix.
Database Management: System analysis, E-R/Dimensional data modeling, database design. Proof of Concepts (PoCs), gap analysis. Normalization and de-normalization techniques.
File Formats: JSON, XML, YAML, CSV, Parquet, Excel, and Log files.
Programming/Scripting Languages: Python and R for data manipulation, statistical analysis, modeling, and data munging; Bash (shell scripting) and Perl. SQL across various dialects: MySQL, PL/SQL, PostgreSQL, Redshift, SQL Server, Oracle.
Experience
DATA ENGINEER CHASE JUL 2023 – PRESENT
Executed Extract, Transform, Load (ETL/ELT) pipelines using a combination of AWS Glue and AWS Step Functions, resulting in a 20% increase in data processing efficiency.
Enhanced AWS Lambda functions for extracting, transforming, and loading data from various sources, including databases, Restful APIs, and file systems. This optimization contributed to streamlined data processing and integration.
Engineered resilient and scalable data integration pipelines with AWS Glue, handling the extraction, transformation, and loading of data from diverse sources resulting in a 10% improvement in data ingestion speed.
Employed AWS Glue and Spark for data cleansing and transformations, improving data analysis efficiency by 20% and reducing data errors by 20%.
Automated workflows for daily incremental loads using AWS Step Functions, facilitating the smooth transfer of data from traditional RDBMS to AWS data lakes, such as Amazon S3.
Crafted database objects such as tables, views, stored procedures, triggers, packages, and functions in Amazon RDS and Amazon Redshift, contributing to efficient data management and structure.
Utilized AWS Glue and Spark SQL for Extract, Transform, and Load operations, ingesting data into AWS services such as Amazon S3, Amazon RDS, and Amazon Redshift, resulting in a 30% increase in data availability.
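Illustrative sketch of the kind of AWS Glue PySpark job used in these pipelines; the catalog database, table, column mappings, and S3 path are hypothetical placeholders rather than actual project resources.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Resolve standard job arguments passed by the Glue / Step Functions trigger
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (placeholder database/table names)
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="transactions"
)

# Apply simple column mappings and type conversions
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("txn_id", "string", "txn_id", "string"),
        ("amount", "double", "amount", "double"),
        ("txn_ts", "string", "txn_ts", "timestamp"),
    ],
)

# Write curated data to S3 as Parquet (placeholder bucket/path)
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/transactions/"},
    format="parquet",
)

job.commit()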
Conducted end-to-end data engineering tasks, employing dimensional modeling techniques to ensure high-quality data modeling and integrity.
Collaborated with cross-functional teams for the seamless integration of the AWS cloud platform, optimizing data storage using services like Amazon S3 and other relevant tools.
Joined forces with cross-functional teams to integrate Snowflake as a central data warehouse solution, leading to improved query optimization and data analysis capabilities.
Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and AWS cloud platforms.
Supported the creation of interactive Power BI dashboards and reports, integrating with Amazon QuickSight and other visualization tools as needed.
Enhanced data analytics capabilities by implementing Databricks on AWS, contributing to increased accuracy and speed in data analysis, resulting in a 20% increase in data analysis speed and accuracy.
Provided production on-call support, assisting various teams with their dependencies, ensuring zero downtime for critical data processes.
Integrated data from diverse sources, including MS-SQL Server, Oracle, PostgreSQL, DB2, MongoDB, and external APIs, using tools like Apache Nifi for efficient data ingestion, resulting in a 30% increase in ingestion efficiency.
Designed and implemented a Snowflake data warehouse for centralized storage of electronic health records (EHRs), claims data, and provider system data, optimizing for both structured and semi-structured data, with data integration and ETL processes managed through Talend, leading to a 20% improvement in data processing speed.
Implemented Big Data processing using Hadoop ecosystem components and enhanced real-time analytics capabilities by integrating Apache Flink for stream processing.
Set up real-time data streaming with Apache Kafka for streaming ingestion, incorporating the Confluent Platform for effective Kafka-based data integration and management.
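For illustration, a minimal Python producer sketch using the confluent-kafka client, in line with the Kafka/Confluent setup above; the broker address, topic name, and event payload are assumptions for the example.

import json
from confluent_kafka import Producer

# Placeholder broker address for the example
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    """Log per-message delivery results reported by the broker."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}")

event = {"event_id": "123", "status": "PROCESSED"}
producer.produce(
    "example-events",  # hypothetical topic name
    value=json.dumps(event).encode("utf-8"),
    callback=delivery_report,
)
producer.flush()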
Expanded NoSQL data storage using HBase for handling large-scale, distributed datasets with high throughput, while also incorporating Cassandra for additional NoSQL capabilities.
Utilized Amazon EMR for scalable and cost-effective Big Data processing, integrating with Apache Spark for advanced analytics, achieving a 15% reduction in processing costs; also leveraged AWS Glue for serverless ETL.
Executed workflow automation and coordination using Apache Oozie, complemented by dynamic and extensible workflow orchestration through Apache Airflow.
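A minimal Apache Airflow DAG sketch of the kind of orchestration described above; the DAG id, schedule, and task commands are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily orchestration DAG; task commands are placeholders
with DAG(
    dag_id="daily_incremental_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull incremental records from source RDBMS'",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'run Spark transformation job'",
    )
    load = BashOperator(
        task_id="load",
        bash_command="echo 'load curated data into S3/Redshift'",
    )

    extract >> transform >> load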
Leveraged YARN on Amazon EMR for resource management and integrated Amazon EKS (Elastic Kubernetes Service) for container orchestration, ensuring scalable and efficient data processing across the platform.
Implemented robust data governance policies, access controls, and encryption mechanisms across the unified platform, utilizing AWS IAM for identity management and AWS Key Management Service (KMS) for encryption. Employed Python for scripting and data manipulation tasks, enhancing flexibility and customizability within the data engineering workflows.
Established monitoring solutions using tools like Grafana and Amazon CloudWatch, seamlessly integrating performance tuning mechanisms, achieving a 10% improvement in system reliability. Utilized Amazon QuickSight for data exploration and visualization.
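A minimal boto3 sketch of the type of CloudWatch alarm wired into this monitoring setup; the region, metric, job name, and SNS topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical alarm on failed Glue tasks; names, thresholds, and the SNS ARN are placeholders
cloudwatch.put_metric_alarm(
    AlarmName="glue-etl-failure-alarm",
    Namespace="Glue",
    MetricName="glue.driver.aggregate.numFailedTasks",
    Dimensions=[{"Name": "JobName", "Value": "example-etl-job"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-alerts"],
)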
Proficient in developing and implementing Spark RDD-based data processing workflows using Scala, Java, or Python programming languages.
Experienced in optimizing Spark RDD performance by tuning various configuration settings, such as memory allocation, caching, and serialization.
Expertise in using Spark RDD transformations and actions to process large-scale structured and unstructured data sets, including filtering, mapping, reducing, grouping, and aggregating data.
Skilled in using Spark RDD persistency and caching mechanisms to reduce data processing overhead and improve query performance.
Familiarity with Spark RDD lineage and fault tolerance mechanisms and their impact on data processing reliability and performance.
Knowledge of Spark RDD optimization techniques, such as data partitioning, shuffle tuning, and pipelining, and their impact on query performance and resource utilization.
Strong understanding of Spark RDD integration with other big data technologies, such as Hadoop, Hive, and Kafka, and their impact on data processing workflows and performance.
Ability to troubleshoot common issues with Spark RDD, such as data processing errors, performance bottlenecks, and scalability limitations.
Experience working with Spark RDD in production environments and implementing performance monitoring and alerting systems to detect and resolve performance issues proactively.
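A short PySpark RDD sketch illustrating the filter/map/reduceByKey and caching patterns referenced above; the sample log lines are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical log lines; in practice these would come from sc.textFile("s3://...")
lines = sc.parallelize([
    "2024-01-01 ERROR payment timeout",
    "2024-01-01 INFO payment ok",
    "2024-01-02 ERROR payment timeout",
])

# Filter, map to (key, 1), and aggregate with reduceByKey; cache the reusable RDD
errors = lines.filter(lambda line: "ERROR" in line).cache()
error_counts = (
    errors.map(lambda line: (line.split()[0], 1))
          .reduceByKey(lambda a, b: a + b)
)
print(error_counts.collect())  # e.g. [('2024-01-01', 1), ('2024-01-02', 1)]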
Experienced in optimizing Spark DataFrame performance by tuning various configuration settings, such as memory allocation, caching, and serialization.
Expertise in using Spark DataFrame transformations and actions to process large-scale structured and semi-structured data sets, including filtering, mapping, reducing, grouping, and aggregating data.
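A companion PySpark DataFrame sketch showing the filtering, grouping, and aggregation patterns referenced above; the sample records and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Hypothetical claims records; real jobs would read Parquet/JDBC sources instead
df = spark.createDataFrame(
    [("C1", "approved", 120.0), ("C2", "denied", 80.0), ("C3", "approved", 200.0)],
    ["claim_id", "status", "amount"],
)

# Filter, group, and aggregate with the DataFrame API
summary = (
    df.filter(F.col("amount") > 50)
      .groupBy("status")
      .agg(F.count("claim_id").alias("claims"), F.sum("amount").alias("total_amount"))
)
summary.show()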
DATA ENGINEER THE HARTFORD SEP 2020 – JUL 2022
Built Data Pipelines with Azure Data Factory: Developed ETL pipelines using Azure Data Factory to orchestrate data movement and transformation across various data sources and destinations.
Designed Data Warehousing Solutions with Azure Synapse: Implemented and optimized data warehousing solutions using Azure Synapse Analytics, improving query performance and data integration.
Created and Managed Data Lakes with Azure Data Lake Storage: Designed and maintained scalable data lakes using Azure Data Lake Storage for storing and analyzing large volumes of structured and unstructured data.
Implemented Real-Time Data Processing with Azure Stream Analytics: Developed real-time analytics solutions using Azure Stream Analytics to process and analyze streaming data from various sources.
Utilized Azure Databricks for Big Data Processing: Leveraged Azure Databricks to perform big data processing and analytics, enabling data-driven insights with Apache Spark.
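An illustrative Databricks-style PySpark sketch for the big data processing described above; the storage account, container, paths, and table names are placeholders.

from pyspark.sql import SparkSession, functions as F

# Databricks provides a preconfigured session; getOrCreate() reuses it
spark = SparkSession.builder.getOrCreate()

# Placeholder ADLS Gen2 container and storage account
raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/policies/"

raw_df = spark.read.format("json").load(raw_path)

curated_df = (
    raw_df.dropDuplicates(["policy_id"])
          .withColumn("ingested_at", F.current_timestamp())
)

# Persist as a Delta table (available on Databricks) for downstream Synapse/BI use
curated_df.write.format("delta").mode("overwrite").saveAsTable("curated.policies")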
Automated Data Workflows Using Azure Logic Apps: Created automated workflows with Azure Logic Apps to integrate and process data across multiple systems.
Optimized Data Storage and Retrieval with Azure Cosmos DB: Configured and optimized Azure Cosmos DB for high-performance, globally distributed NoSQL data storage.
Developed Data Integration Solutions with Azure Data Share: Facilitated secure and efficient data sharing between organizations using Azure Data Share.
Designed Data Pipelines with Google Cloud Dataflow: Created and managed data processing pipelines using Google Cloud Dataflow for batch and stream data processing.
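A minimal Apache Beam (Python SDK) pipeline sketch of the kind run on Google Cloud Dataflow; the GCS buckets and parsing logic are placeholders, and Dataflow-specific options (runner, project, region, temp location) would be supplied at launch time.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs locally by default; pass --runner=DataflowRunner plus project/region/temp_location
# flags to execute on the Dataflow service
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadRaw" >> beam.io.ReadFromText("gs://example-raw-bucket/events/*.csv")
        | "ParseFields" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
        | "Reformat" >> beam.Map(lambda fields: ",".join(fields))
        | "WriteCurated" >> beam.io.WriteToText("gs://example-curated-bucket/events/part")
    )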
Built and Managed Data Warehouses with BigQuery: Developed and optimized data warehouses using Google BigQuery, enabling fast SQL queries and large-scale data analysis.
Implemented Data Lakes with Google Cloud Storage: Set up and maintained data lakes using Google Cloud Storage for scalable and secure data storage and retrieval.
Utilized Google Cloud Pub/Sub for Real-Time Data Streaming: Designed real-time data streaming solutions using Google Cloud Pub/Sub to handle high-throughput data feeds.
Developed Big Data Solutions with Google Dataproc: Leveraged Google Dataproc for Hadoop and Spark-based big data processing, streamlining data analysis and processing.
Automated Data ETL Processes with Google Cloud Composer: Created ETL workflows using Google Cloud Composer to automate data extraction, transformation, and loading.
Implemented Data Integration with Google Cloud Data Fusion: Used Google Cloud Data Fusion to build and manage data integration pipelines, enhancing data flow across systems.
Optimized Data Storage and Query Performance with Bigtable: Configured and optimized Google Bigtable for high-performance NoSQL data storage and real-time analytics.
Developed Data Visualization with Google Data Studio: Created interactive dashboards and reports using Google Data Studio to provide actionable insights from data.
DATA ENGINEER UNITED HEALTH GROUP AUG 2016 – AUG 2020
Collected, ingested, and integrated data from diverse sources into a centralized storage system using MSBI, Talend, Informatica, and Big Data tools, achieving a 30% increase in data ingestion efficiency.
Collaborated with teams to comprehend data requirements, utilizing SSAS for designing and implementing data models.
Administered and maintained databases with tools like MSBI, ensuring performance, security, and availability and delivering a 20% enhancement in database administration efficiency.
Implemented ETL processes through SSIS to transform raw data for analysis.
Employed data quality checks with tools like Talend and Informatica, addressing inconsistencies or errors.
Utilized Python for additional data manipulation, expanding capabilities and contributing to a more versatile data processing approach.
Preprocessed data using ETL tools to enhance quality and usability.
Assisted in setting up and maintaining data infrastructure, involving servers, clusters, and cloud services, with Big Data technologies.
Documented data processes, workflows, and configurations.
Implemented security protocols with tools like MSBI to protect sensitive data, resulting in a 20% improvement in data security efficiency.
Proficient in handling Hive partitions and buckets according to business requirements.
Experience in handling Hive schema evolution with the Avro file format.
Skilled in handling semi-structured/serialized data processing using Hive (Avro, Parquet, ORC).
Proficient in creating and managing Hive tables, including managed, external, and partitioned tables.
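An illustrative sketch of creating and querying an external, partitioned Hive table through a Hive-enabled Spark session; the database, table, columns, and storage location are hypothetical.

from pyspark.sql import SparkSession

# Hive-enabled Spark session; database, table, and path names are placeholders
spark = (
    SparkSession.builder.appName("hive-tables-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS claims_db")

# External, partitioned table stored as Parquet
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS claims_db.claims (
        claim_id STRING,
        member_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (claim_year INT, claim_month INT)
    STORED AS PARQUET
    LOCATION '/data/warehouse/claims'
""")

# Register a newly landed partition and query across partitions
spark.sql("ALTER TABLE claims_db.claims ADD IF NOT EXISTS PARTITION (claim_year=2020, claim_month=6)")
spark.sql("""
    SELECT claim_year, claim_month, COUNT(*) AS claim_count
    FROM claims_db.claims
    GROUP BY claim_year, claim_month
""").show()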
Expertise in querying Hive tables using SQL-like syntax and performing data analysis using tools like Apache Spark.
Skilled in integrating Hive tables with other big data technologies, such as Hadoop, HBase, and Impala.
Knowledge of Hive table formats, including ORC, Parquet, and Avro, and their advantages and disadvantages for different use cases.
Experienced in importing and exporting large datasets between Hadoop and relational databases using Sqoop.
Proficient in writing Sqoop commands to transfer data between Hadoop and various databases such as MySQL and SQL Server.
Skilled in configuring Sqoop jobs for incremental data transfers using Sqoop's incremental import feature.
Proficient in performing data validation and cleansing during data transfer using Sqoop's validation and cleansing options.
Adept in scheduling and automating Sqoop jobs for incremental runs.
Experienced in ETL (Extract, Transform, Load) testing methodologies and processes.
Proficient in testing data extraction processes from various sources, including databases, files, and APIs.
Skilled in validating and verifying data transformation rules and business logic applied during ETL processes.
Strong understanding of data warehouse concepts and testing data loading into data warehouse systems.
Knowledgeable about data integration and consolidation processes in ETL pipelines.
Familiarity with data quality and data cleansing techniques in ETL testing.
Expertise in designing and executing test cases for ETL processes to ensure data accuracy and completeness.
Proficient in SQL queries and scripting for data validation and verification during ETL testing.
Experienced in identifying and reporting data quality issues and data anomalies during ETL testing.
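A minimal Python sketch of the kind of source-to-target reconciliation check used in ETL testing; the connection objects, table names, and SQL are hypothetical stand-ins (real checks would run against the actual source and warehouse databases, e.g. via pyodbc or psycopg2).

def fetch_scalar(conn, query):
    """Run a query expected to return a single value (count, sum, etc.)."""
    cursor = conn.cursor()
    cursor.execute(query)
    return cursor.fetchone()[0]

def validate_load(source_conn, target_conn):
    """Compare source and target aggregates; return a list of failed checks."""
    checks = {
        "row_count": (
            "SELECT COUNT(*) FROM claims_staging",
            "SELECT COUNT(*) FROM dw.fact_claims",
        ),
        "amount_total": (
            "SELECT SUM(amount) FROM claims_staging",
            "SELECT SUM(claim_amount) FROM dw.fact_claims",
        ),
    }
    failures = []
    for name, (src_sql, tgt_sql) in checks.items():
        src_val = fetch_scalar(source_conn, src_sql)
        tgt_val = fetch_scalar(target_conn, tgt_sql)
        if src_val != tgt_val:
            failures.append(f"{name}: source={src_val} target={tgt_val}")
    return failures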
Leveraged advanced MS Excel techniques, including charts, graphs, pivot tables, and PowerPoint presentations, for creating and delivering comprehensive weekly and monthly reports.
Collaborated with cross-functional teams, including data scientists, analysts, and business stakeholders.
Education
MASTERS IN COMPUTER INFORMATION RIVIER UNIVERSITY AUG 2022 – DEC 2023
420 S Main St, Nashua, NH
Licenses & Certifications
Microsoft Certified: Azure Data Engineer Associate (DP-203), Microsoft
Databricks Lakehouse Fundamentals, Databricks