
Azure Data Engineer

Location:
Novi, MI
Posted:
December 04, 2023


Resume:

Name: LAKSHMI POOJITHA

Sr. GCP Data Engineer

Email Id: ad1oqw@r.postjobfree.com

Contact Details: 248-***-****

Professional Summary:

9+ years of experience in Data Engineering, data pipeline design, development, and implementation, and the full Software Development Life Cycle.

Experience in software and system analysis, design, development, testing, deployment, maintenance, enhancement, re-engineering, migration, troubleshooting, and support of multi-tiered web applications in high-performing environments.

Extensively worked with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.

Experienced in implementing cloud migrations and building analytical platforms using AWS services.

Experience creating a shared VPC with different tags in a single GCP project and reusing it across all projects; knowledge of Kubernetes service deployments in GCP.

Experienced in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Excellent understanding and knowledge of NoSQL databases such as MongoDB and Cassandra.

Hands-on experience designing and developing ETL transformations using PySpark, Spark SQL, S3, Lambda, and Java.

Expertise in using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, ZooKeeper, and Hue.

Responsible for loading processed data into AWS Redshift tables so that the business reporting team could build dashboards.

Created views in AWS Athena to allow secure and streamlined data analysis access to downstream business teams.

Worked extensively with the Data Science team to productionize machine learning models and to build feature datasets as needed for data analysis and modeling.

Hands-on experience with Python and PySpark implementations in AWS EMR, building data pipeline infrastructure to support deployments of machine learning models, performing data analysis and cleansing, and applying statistical models with extensive use of Python, Pandas, and NumPy, visualization with Matplotlib and Seaborn, and predictions with scikit-learn and XGBoost.

Experienced in working within the SDLC using Agile and Waterfall methodologies.

Experience developing Kafka producers and consumers for streaming millions of events per second.

Experience with Extraction, Transformation, and Loading (ETL) and API Gateway integration between homogeneous and heterogeneous sources using SSIS packages and DTS (Data Transformation Services).

Solid experience with Azure Data Lake Analytics, Azure Databricks, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases, and Azure SQL Data Warehouse, providing analytics and reports that improve marketing strategies.

Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, Composer, and APIs.

Experience in GCP with technologies such as Dataflow, Pub/Sub, BigQuery, GCS buckets, and related tools.

Experience creating visualizations in Tableau. Certified PowerCenter Developer; certified in Unix Korn shell.

Created and maintained a centralized repository using Informatica PowerCenter.

Experience transferring streaming data from different data sources into HDFS and NoSQL databases using Apache Flume, with cluster coordination services provided through ZooKeeper.

Technical Skills:

Big Data Tools

Hadoop, HDFS, Sqoop, HBase, Hive, MapReduce, Spark, Kafka

Cloud Technologies

Snowflake, SnowSQL, Azure, Databricks, AWS (EMR, EC2, S3, CloudWatch, EventBridge, Lambda, SNS)

ETL Tools

SSIS, Informatica Power Center

Modeling and Architecture Tools

Erwin, ER Studio, Star-Schema, Snowflake-Schema Modeling, FACT and dimension tables, Pivot Tables

Database

Snowflake Cloud Database, Oracle, MS SQL Server, MySQL, Cassandra, DynamoDB

Operating Systems

Microsoft Windows, Unix, Linux

Reporting Tools

MS Excel, Tableau, Tableau Server, Tableau Reader, Power BI, QlikView

Methodologies

Agile, UML, System Development Life Cycle (SDLC), Ralph Kimball, Waterfall Model

Python and R Libraries

R: tidyr, tidyverse, dplyr, lubridate, ggplot2, tseries; Python: Beautiful Soup, NumPy, SciPy, Matplotlib, Seaborn, pandas, scikit-learn

Programming Languages

SQL, R (Shiny, RStudio), Python (Jupyter Notebook, PyCharm IDE), Scala

Professional Experience:

Client: AbbVie, Vernon Hills, IL May 2022 to Present

Role: Sr. GCP Data Engineer

Responsibilities:

Managed infrastructure across multiple projects within the organization on the Google Cloud Platform using Terraform, adhering to the principles of Infrastructure as Code (IaC).

Enhanced the performance of existing BigQuery and Tableau reporting solutions through optimization techniques such as partitioning key columns and thorough testing under different scenarios.

Developed Extract, Load, Transform (ELT) processes utilizing various data sources, including Ab Initio, Google Sheets in GCP, and computing resources like Dataprep, Dataproc (PySpark), and BigQuery.

Successfully migrated an Oracle SQL ETL process to run on the Google Cloud Platform, leveraging Cloud Dataproc, BigQuery, and Cloud Pub/Sub for orchestrating Airflow jobs.

Proficiently utilized Presto, Hive, Spark SQL, and BigQuery alongside Python client libraries to craft efficient and interoperable programs for analytics platforms.

Extensive hands-on experience with Google Cloud Platform's big data-related services.

Deployed Apache Airflow within a GCP Composer environment to construct data pipelines, utilizing Airflow operators such as the BashOperator, Hadoop operators, PythonOperator (Python callables), and BranchPythonOperator.
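
A minimal sketch of such a Composer DAG (Airflow 2.x-style imports); the DAG id, schedule, and task callables are hypothetical placeholders, and the branching logic is illustrative only.

    # Hypothetical Composer/Airflow DAG: extract -> branch -> full or incremental load.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import BranchPythonOperator, PythonOperator


    def extract(**context):
        # Placeholder extract step; in practice this would pull from GCS or BigQuery.
        return "extracted"


    def choose_branch(**context):
        # BranchPythonOperator follows whichever task_id this callable returns.
        return "full_load" if datetime.utcnow().weekday() == 0 else "incremental_load"


    with DAG(
        dag_id="example_composer_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)
        full_load = BashOperator(task_id="full_load", bash_command="echo full load")
        incremental_load = BashOperator(task_id="incremental_load", bash_command="echo incremental load")

        extract_task >> branch >> [full_load, incremental_load]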

Developed innovative techniques for orchestrating Airflow pipelines and employed environment variables for project-level definition and password encryption.

Demonstrated competency in Kubernetes within GCP, focusing on devising monitoring solutions with Stackdriver's log router and designing reports in Data Studio.

Acted as an integrator, facilitating collaboration between data architects, data scientists, and other data consumers.

Translated SAS code into Python and Spark-based jobs for execution in Cloud Dataproc and BigQuery on the Google Cloud Platform.

Facilitated data transfer between BigQuery and Azure Data Warehouse through Azure Data Factory (ADF) and devised complex DAX language expressions for memory optimization in reporting cubes within Azure Analysis Services (AAS).

Utilized Cloud Pub/Sub and Cloud Functions for specific use cases, including workflow triggers upon incoming messages.
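
A minimal sketch of a Pub/Sub-triggered Cloud Function (first-generation Python runtime); the function name and the downstream action are hypothetical.

    # Hypothetical background Cloud Function: fires on each Pub/Sub message.
    import base64
    import json


    def trigger_workflow(event, context):
        """Entry point invoked per Pub/Sub message; kicks off downstream work."""
        payload = {}
        if event.get("data"):
            payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        # In the real pipeline this step would start a Composer DAG or Dataflow job.
        print(f"Received message {context.event_id}: {payload}")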

Crafted data pipelines using Cloud Composer to orchestrate processes and employed Cloud Dataflow to build scalable machine learning algorithms, while also migrating existing Cloud Dataprep jobs to BigQuery.

Participated in the creation of Hive tables, data loading, and the authoring of Hive queries, which were executed via MapReduce.

Implemented advanced Hive features such as Partitioning, Dynamic Partitions, and Buckets to optimize data storage and retrieval.

Developed code for importing and exporting data to and from HDFS and Hive using Apache Sqoop.

Demonstrated expertise in Hive SQL, Presto SQL, and Spark SQL to perform ETL tasks, choosing the most suitable technology for each specific job.

Authored Hive SQL scripts for creating sophisticated tables with high-performance attributes like partitioning, clustering, and skewing.

Engaged in the transformation and analysis of extensive structured and semi-structured datasets through the execution of Hive queries.

Collaborated with the Data Science team to implement advanced analytical models within the Hadoop cluster, utilizing large datasets.

Leveraged Power BI and SSRS to create dynamic reports, dashboards, and interactive functionalities for web clients and mobile apps.

Developed SAS scripts for Hadoop, providing data for downstream SAS teams, particularly for SAS Visual Analytics, an in-memory reporting engine.

Monitored data engines to define data requirements and access data from both relational and non-relational databases, including Cassandra and HDFS.

Created complex SQL queries and established JDBC connectivity to retrieve data for presales and secondary sales estimations.

Environment: GCP, PySpark, SAS, Hive, Sqoop, GCP Dataproc, BigQuery, Hadoop, GCS, Python, Snowflake, DynamoDB, Oracle Database, Power BI, SDKs, Dataflow, Glacier, EC2, EMR cluster, SQL Database, Databricks.

Client: Merck Pharma, Branchburg, NJ October 2019 to April 2022

Role: GCP Data Engineer

Responsibilities:

Created ETL pipelines within Google Cloud Platform (GCP) using Apache Beam and Dataflow, resulting in a 20% reduction in data processing time for real-time large-scale data.
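
A minimal Apache Beam sketch in Python of this kind of streaming pipeline; the project, topic, and table names are placeholders, and the target BigQuery table is assumed to already exist.

    # Hypothetical Beam streaming pipeline: Pub/Sub -> parse -> BigQuery.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def run():
        options = PipelineOptions(streaming=True)  # pass DataflowRunner flags to run on GCP
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
                | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    "my-project:analytics.events",  # placeholder table, assumed to exist
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                )
            )


    if __name__ == "__main__":
        run()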

Constructed and deployed data pipelines on GCP through Cloud Composer and Cloud Functions, enabling seamless integration with other GCP services like BigQuery, Pub/Sub, and Cloud Storage.

Implemented monitoring and alerting mechanisms in GCP data pipelines using Stackdriver, allowing for proactive identification and resolution of issues.

Designed and executed comprehensive testing strategies for GCP data pipelines, ensuring data accuracy and completeness from ingestion to analysis.

Employed DevOps practices and tools like Jenkins, Terraform, and Ansible to automate GCP infrastructure deployment and configuration, reducing deployment time by 50%.

Utilized Python, SQL, and Bash scripts to create custom data transformations and quality rules, resulting in a 25% decrease in data processing errors.

Developed and managed CI/CD pipelines on GCP using Cloud Build and Cloud Run, streamlining code deployment and testing in a controlled environment.

Implemented data versioning and lineage tracking with tools such as Data Catalog and Data Studio, ensuring the auditability and traceability of healthcare data in GCP.

Conducted capacity planning and scaling of GCP data pipelines using Kubernetes and Cloud Autoscaling, ensuring optimal performance and cost-efficiency.

Formulated multi-cloud strategies, combining GCP's Platform as a Service (PAAS) with Azure's Software as a Service (SAAS) offerings to enhance cloud capabilities.

Designed and implemented end-to-end data pipelines for batch processing, utilizing Scala to develop Spark jobs.

Developed data pipelines to ingest data from weblogs using Flume, Kafka, and Spark Streaming, applying necessary transformations.

Leveraged Spark SQL API within PySpark for data extraction, loading, and SQL querying.

Developed PySpark scripts to encrypt raw data, employing hashing algorithms on specified columns.
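
A minimal PySpark sketch of that column-level hashing, assuming hypothetical input/output paths and column names; sha2 stands in for whichever hashing algorithm was actually applied.

    # Hash selected sensitive columns before publishing the dataset.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, sha2

    spark = SparkSession.builder.appName("mask-columns").getOrCreate()

    df = spark.read.parquet("gs://example-bucket/raw/customers/")  # placeholder path
    sensitive_cols = ["ssn", "email"]                              # hypothetical columns

    for c in sensitive_cols:
        df = df.withColumn(c, sha2(col(c).cast("string"), 256))

    df.write.mode("overwrite").parquet("gs://example-bucket/masked/customers/")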

Designed, developed, and tested database components, including Stored Procedures, Views, and Triggers.

Built a Python-based API (RESTful Web Service) for revenue tracking and analysis.

Environment: GCP, BigQuery, Spark SQL, Python, PySpark, SQL, Pub/Sub, Cloud Storage, Azure, Kubernetes, Stackdriver, ETL, Apache Beam.

Client: Chevron Corporation, Santa Rosa, NM July 2017 to September 2019

Role: Sr. AWS Data Engineer

Responsibilities:

Extensive proficiency in working within the AWS cloud platform, with hands-on experience in AWS services like EC2, S3, EMR, Redshift, Lambda, and Glue.

Proficient in Spark, including expertise in Spark RDD, Data Frame API, Data Set API, Data Source API, Spark SQL, and Spark Streaming.

Developed Spark applications using Python, implementing Apache Spark data processing projects for handling data from various sources, including RDBMS and streaming platforms.

Utilized Spark for enhancing performance and optimizing existing algorithms within Hadoop.

Competent in Spark technologies such as Spark Context, Spark SQL, Spark MLlib, Data Frame, Pair RDD, and Spark YARN.

Employed Spark Streaming APIs to perform real-time transformations and actions, creating a common data model by ingesting data from Kafka and persisting it to Cassandra.

Developed a Python-based Kafka consumer API for consuming data from Kafka topics.
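
A minimal consumer sketch using the kafka-python client (the original API may have been built on a different client library); the topic, brokers, and group id are placeholders.

    # Hypothetical Kafka consumer: reads JSON payloads from a topic.
    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "events",                                  # placeholder topic
        bootstrap_servers=["localhost:9092"],      # placeholder brokers
        group_id="analytics-consumers",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        # Each record carries the deserialized payload plus Kafka metadata.
        print(message.topic, message.partition, message.offset, message.value)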

Processed Extensible Markup Language (XML) messages with Kafka and utilized Spark Streaming to capture User Interface (UI) updates.

Created preprocessing jobs using Spark Data Frames to flatten JSON documents into flat files.
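
A minimal PySpark sketch of flattening nested JSON into a flat file; the input path and field names are hypothetical.

    # Explode the nested array and promote struct fields to top-level columns.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    orders = spark.read.json("s3://example-bucket/raw/orders/")  # placeholder path

    flat = (
        orders
        .withColumn("item", explode(col("items")))          # one row per array element
        .select(
            col("order_id"),
            col("customer.id").alias("customer_id"),         # nested struct field
            col("item.sku").alias("sku"),
            col("item.qty").alias("qty"),
        )
    )

    flat.write.mode("overwrite").option("header", True).csv("s3://example-bucket/flat/orders/")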

Loaded D-Stream data into Spark RDD and performed in-memory data computations to generate output responses.

Extensive experience in developing live real-time processing and core jobs using Spark Streaming in conjunction with Kafka as a data pipeline system.

Successfully migrated an existing on-premises application to AWS, making use of AWS services like EC2 and S3 for data processing and storage.

Proficient in maintaining Hadoop clusters on AWS Elastic MapReduce (EMR).

Loaded data into S3 buckets using AWS Glue and PySpark, filtered data stored in S3 buckets using Elasticsearch, and loaded the data into Hive external tables.

Configured Snowpipe to extract data from S3 buckets and store it in Snowflake's staging area.

Created numerous ODI interfaces for loading into Snowflake DB.

Utilized Amazon Redshift to consolidate multiple data warehouses into a single data warehouse.

Designed columnar families in Cassandra, ingested data from RDBMS, performed data transformations, and exported transformed data to Cassandra as per business requirements.

Utilized the Spark Data Cassandra Connector for loading data to and from Cassandra.

Configured Kafka from scratch, including settings for managers and brokers.

Developed data models for clients' transactional logs and analyzed data from Cassandra tables using the Cassandra Query Language.

Conducted cluster performance testing using the Cassandra-stress tool to measure and enhance Read/Writes.

Employed Hive QL to analyze partitioned and bucketed data, executing Hive queries on Parquet tables stored in Hive for data analysis to meet business specifications.

Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering roles.

Implemented Kafka security measures and enhanced its performance.

Expertise in working with various data formats such as Avro, Parquet, RCFile, and JSON, including the development of user-defined functions (UDFs) in Hive.

Developed custom UDFs in Python and utilized them for data sorting and preparation.
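
A minimal sketch of a Python UDF registered for both the DataFrame API and Spark SQL; the normalization rule and table name are hypothetical stand-ins for the actual preparation logic.

    # Register a Python UDF and use it to prepare and sort a name column.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-prep").enableHiveSupport().getOrCreate()


    def normalize_name(raw):
        """Trim, collapse whitespace, and title-case a name field."""
        return " ".join(raw.split()).title() if raw else None


    normalize_name_udf = udf(normalize_name, StringType())               # DataFrame API
    spark.udf.register("normalize_name", normalize_name, StringType())   # Spark SQL / Hive queries

    df = spark.table("staging.customers")  # assumes an existing Hive table
    df.select(normalize_name_udf("full_name").alias("full_name")).orderBy("full_name").show()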

Worked on custom loaders and storage classes in Pig to process diverse data formats like JSON, XML, CSV, and generated bags for further processing using Pig.

Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.

Created Oozie coordinators to schedule Hive scripts, effectively establishing data pipelines.

Authored numerous MapReduce jobs using PySpark and Numpy and implemented Jenkins for continuous integration.

Conducted cluster testing and monitoring of HDFS, Hive, Pig, and MapReduce to facilitate access for new users.

Ensured the continuous monitoring and management of the Hadoop cluster through Cloudera Manager.

Environment: AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra, and Agile methodologies.

Client: Hudda Infotech Private Limited, Hyderabad, India November 2015 to April 2017

Role: Data Engineer

Responsibilities:

Orchestrated the seamless migration of data from legacy database systems to Azure databases.

Collaborated with external team members and stakeholders to assess the implications of their changes, ensuring smooth project releases and minimizing integration issues in Explore.MS application.

Conducted an in-depth analysis, design, and implementation of modern data solutions using Azure PaaS services to support data visualization. This involved a comprehensive understanding of the current production state and its impact on existing business processes.

Coordinated with external teams and stakeholders, ensuring that changes were thoroughly understood and comfortably integrated to prevent integration issues in the VL-In-Box application.

Assumed responsibility for reviewing VL-In-Box application's test plan and test cases during System Integration and User Acceptance testing phases.

Proficiently executed Extract, Transform, and Load (ETL) processes, extracting data from source systems and storing it in Azure Data Storage services. Leveraged a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics. Data was ingested into one or more Azure Services, including Azure Data Lake, Azure Storage, Azure SQL, and Azure Data Warehouse, with further data processing in Azure Databricks.
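
A minimal sketch of the Databricks processing step, assuming data has already been landed in Blob Storage by an ADF copy activity; the storage account, container, and field names are placeholders, and the key would normally come from a secret scope rather than being hard-coded.

    # Read raw JSON from Blob Storage, clean it, and write curated Parquet output.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

    spark.conf.set(
        "fs.azure.account.key.examplestorage.blob.core.windows.net",
        "<storage-account-key>",  # placeholder; use dbutils.secrets in practice
    )

    raw = spark.read.json("wasbs://landing@examplestorage.blob.core.windows.net/sales/")

    cleaned = (
        raw.dropDuplicates(["order_id"])
           .withColumn("order_date", to_date(col("order_date")))
           .filter(col("amount") > 0)
    )

    # Curated output feeds the downstream Azure SQL DW / Synapse load.
    cleaned.write.mode("overwrite").parquet(
        "wasbs://curated@examplestorage.blob.core.windows.net/sales/"
    )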

Designed and implemented migration strategies for traditional systems in Azure, utilizing approaches like Lift and Shift and Azure Migrate, alongside third-party tools.

Effectively used Azure Synapse to manage processing workloads and deliver data for business intelligence and predictive analytics needs.

Demonstrated experience in data warehouse and business intelligence project implementation using Azure Data Factory.

Collaborated with Business Analysts, Users, and Subject Matter Experts (SMEs) to elaborate on requirements and ensure their successful implementation.

Conceptualized and implemented end-to-end data solutions encompassing storage, integration, processing, and visualization within the Azure ecosystem.

Developed Azure Data Factory (ADF) pipelines, incorporating Linked Services, Datasets, and Pipelines for data extraction, transformation, and loading from diverse sources like Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tools.

Assumed responsibility for estimating cluster sizes and for monitoring and troubleshooting Spark Databricks clusters.

Applied performance tuning techniques to Spark Applications, optimizing factors such as Batch Interval time, Parallelism levels, and memory utilization.

Executed ETL processes using Azure Databricks, migrating on-premises Oracle ETL to Azure Synapse Analytics.

Developed custom User-Defined Functions (UDFs) in Scala and PySpark to meet specific business requirements.

Authored JSON scripts to deploy pipelines in Azure Data Factory (ADF), enabling data processing via SQL Activity.

Successfully created Build and Release processes for multiple projects in the production environment using Visual Studio Team Services (VSTS).

Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.

Proposed architectures with a focus on cost-efficiency within Azure, offering recommendations to right-size data infrastructure.

Established and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.

Innovatively developed conceptual solutions and proof-of-concepts to validate the feasibility of proposed solutions.

Implemented Copy activities and custom Azure Data Factory Pipeline Activities to enhance data processing workflows.

Assumed the role of creating Requirements Documentation for various projects.

Environment: Azure, Azure SQL, Blob storage, Azure SQL Data Warehouse, Azure Databricks, PySpark, Oracle, Azure Data Factory (ADF), T-SQL, Spark SQL.

Client: Maisa Solutions Private Limited, Hyderabad, India June 2014 to October 2015

Role: Hadoop Developer

Responsibilities:

Managed data ingestion from UNIX file systems to Hadoop Distributed File System (HDFS), leveraging contemporary data loading techniques.

Utilized Sqoop for importing and exporting data seamlessly between HDFS, Hive, and external data sources.

Assessed and translated business requirements into comprehensive specifications, adhering to modern project guidelines for program development.

Designed and implemented procedures that address complex business challenges, taking into account hardware and software capabilities, operating constraints, and desired outcomes.

Conducted extensive analysis of large datasets to identify optimal methods for data aggregation and reporting. Responded promptly to ad hoc data requests from internal and external clients, proficiently generating ad hoc reports.

Spearheaded the construction of scalable distributed data solutions using Hadoop, incorporating the latest tools and technologies in the field.

Played a hands-on role in Extract, Transform, Load (ETL) processes, ensuring the efficient handling of data across various stages.

Assumed responsibility for cluster maintenance, including node management, monitoring, troubleshooting, and the review of data backups and log files in the Hadoop ecosystem.

Managed the extraction of data from diverse sources, executed transformations using Hive and MapReduce, and effectively loaded data into HDFS.

Employed Sqoop to facilitate the extraction of data from Teradata and its seamless integration into HDFS, aligning with contemporary best practices.

Conducted in-depth data analysis by running Hive queries and executing Pig scripts to uncover user behavior patterns, such as shopping enthusiasts, travelers, and music lovers.

Exported the insights and patterns derived from the analysis back into Teradata using Sqoop.

Ensured the continuous monitoring and management of the Hadoop cluster through state-of-the-art tools like Cloudera Manager.

Integrated the Oozie workflow engine to orchestrate and automate multiple Hive workflows and processes.

Developed advanced Hive queries to process data and generate data cubes, facilitating data visualization and advanced analytics.

Environment: Hive, Pig, Apache Hadoop, Cassandra, Sqoop, Big Data, HBase, ZooKeeper, Cloudera, CentOS, NoSQL, Sencha Ext JS, JavaScript, AJAX, Hibernate, JMS, WebLogic Application Server, Eclipse, Web Services, Azure, Project Server, Unix, Windows.


