


Prabhas Raj Gupta Tadesetti

Senior Data Engineer

Phone: 636-***-**** Email: ad4wuq@r.postjobfree.com

http://linkedin.com/in/prabhas-raj-409379122

Professional Summary:

11+ years of IT experience spanning multiple technologies, with a focus on data engineering, including requirements gathering, data modeling, analysis, ETL development, and visualization reporting.

Microsoft Certified – Azure Data Engineer Associate.

Implemented various Azure Cloud Components such as Azure Data Factory, Azure Data Lake, Azure Blob storage, Azure Databricks, Azure Synapse Analytics, Azure SQL DB/DW, and Azure Cosmos DB.

Engaged in project roles covering the full project life cycle, from analysis and design to deployment and maintenance, utilizing SDLC and both Agile and Waterfall methodologies.

Proficient in T-SQL and PL/SQL, including DDL & DML, stored procedures, schema development, triggers, joins, indexes, and relational database models.

Developed Databricks notebooks using Python (PySpark) and SparkSQL for transforming data stored in Azure Data Lake Storage Gen2 across various zones.

Extensive experience with RDBMS concepts, including tables, indexes, views, functions, and stored procedures.

Skilled in creating and managing pipeline jobs, scheduling triggers, and mapping data flows using Azure Data Factory.

Proficient in utilizing Azure Event Hubs along with other Azure services to build real-time data streaming pipelines.

Expertise in complex SQL using Teradata functions, macros, and stored procedures.

Hands-on experience in various data patterns such as structured, semi-structured, and unstructured data.

Understanding of data vaults, data warehouses, and data lakehouses.

Strong proficiency in Spark Core, Spark SQL, Scala, Spark Streaming, and implementation of Snowflake architecture for unified data storage and analysis.

Proficient in Data Analytics using tools such as Tableau, Power BI, Microsoft Synapse Analytics, SSRS, Alteryx, and Microsoft Excel.

Well-versed in workflow orchestration tools like Oozie, Airflow, and Azure Data Factory, along with CloudFormation and Terraform.

Led teams in implementing new requirements, providing ETL mapping documents, and resolving technical issues.

Experienced in connecting AWS resources like S3 buckets, RDS, and Redshift, and creating pipelines for data movement to Azure.

Familiarity with various databases including MongoDB, CassandraDB, MySQL, PostgreSQL, and MS SQL Server.

In-depth knowledge of PySpark and experience in building Spark applications using Python.

Exposure to master data management (MDM) tools and practices, including data governance, data profiling, data quality, and data modeling.

Proficient in migrating on-premises databases to Microsoft Azure, including Blobs, Azure Data Warehouse, and Azure SQL Server.

Understanding of RDD operations in Apache Spark, including transformations, actions, and persistence.

Skilled in developing dashboards and parameterized reports using SSRS, Tableau, and Power BI.

Experience in Agile program management, translating user stories into deliverables, stakeholder management, and project delivery.

Worked on industrial use cases in data science using Machine Learning for measurement, forecasting, simulation, recommendations, and optimization.

Good understanding of Apache ZooKeeper and Kafka for monitoring and managing Hadoop jobs, and of Cloudera CDH4 and CDH5 for monitoring and managing Hadoop clusters.

Good working experience with Spark (Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, SparkR) using Scala and Kafka.

Integrated DBT with data warehouse technologies such as Snowflake, BigQuery, and Redshift to leverage their capabilities for efficient data processing and storage.

In-depth knowledge of testing methodologies, concepts, phases, and types of testing; experienced in developing test plans, test scenarios, test cases, and test reports, and in documenting test results after analyzing Business Requirements Documents (BRD) and Functional Requirement Specifications (FRS).

Skilled in advanced analytics solutions with data mining experience, combined with strong written and verbal communication skills for translating complex data into actionable insights.

Provided leadership, vision, and strategy to ensure alignment of development teams' operations with business goals.

Technical Skills:

Cloud Technologies

Azure Data Factory (ADF), Azure SQL Database, Azure Data Lake Storage Gen1 & Gen2, Azure Databricks, Azure SQL Data Warehouse, Azure Storage, Azure Blob Storage, Cosmos DB, HDInsight.

Tools

MySQL, MS Access, DB2, Azure SQL Server, Azure Synapse, MS SQL Server 2008 R2-2016, DBT, PuTTY

Big Data Technologies

HDFS, Sqoop, PySpark, Data Lake, Redshift, Spark, Hive

Data Analysis libraries

Pandas, NumPy, SciPy, Scikit-learn, NLTK, Plotly, Matplotlib

Data Modeling Tools

Toad Data Modeler, SQL Server Management Studio, MS Visio, SAP Power designer, Erwin 9.x

Databases

MySQL, Oracle 12c/11g, MS Access 2016/2010, Hive, SQL Server 2014/2016, Amazon Redshift, Azure SQL Database, Snowflake

Reporting Tools

Tableau, Informatica PowerCenter, Power BI

Languages

SQL, Spark SQL, Python (PySpark), R, Scala, Bash, PowerShell

Operating Systems

Windows Server 2012 R2/2016, UNIX, CentOS

Data Warehousing

Snowflake, Redshift, Teradata, Azure dedicated DW

Professional Experience:

Wipfli, Atlanta, GA July 2022 – present

Senior Azure Data Engineer

Responsibilities:

Developed Databricks notebooks to extract data in various file formats, transform and load detailed and aggregated data into Azure Data Lake, and transmit data to external data warehouses.
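As an illustration of this extract-transform-load pattern, the sketch below shows a minimal PySpark notebook cell that reads a raw CSV drop from an ADLS Gen2 landing zone, applies a simple cleansing and aggregation step, and writes detailed and aggregated Delta outputs to a curated zone. The storage account, container names, and column names are hypothetical, not taken from the engagement itself.

```python
# Minimal sketch of the notebook pattern described above; paths, containers,
# and columns are hypothetical illustrations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks

# Extract: read a raw CSV drop from the landing zone in ADLS Gen2
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("abfss://landing@examplelake.dfs.core.windows.net/sales/"))

# Transform: basic cleansing plus a daily aggregate
detailed = (raw.dropDuplicates(["order_id"])
               .withColumn("order_date", F.to_date("order_ts")))
aggregate = (detailed.groupBy("order_date", "region")
                     .agg(F.sum("amount").alias("total_amount")))

# Load: persist detailed and aggregated layers as Delta in a curated zone
detailed.write.format("delta").mode("overwrite").save(
    "abfss://curated@examplelake.dfs.core.windows.net/sales_detailed/")
aggregate.write.format("delta").mode("overwrite").save(
    "abfss://curated@examplelake.dfs.core.windows.net/sales_daily_agg/")
```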

End-to-end experience in the data modeling process, i.e., gathering business requirements, identifying entities, building the conceptual data model, finalizing attributes, designing the logical data model, and creating physical tables in the database.

Worked on migration of data from on-premises source systems to Azure cloud databases/storage such as Azure SQL, Azure Synapse Analytics, and Azure Data Lake.

Built ingestion pipelines in Azure Data Factory (ADF) using various activities, linked services, and datasets to extract data from different on-premises sources and write it back into Azure cloud storage and Google Cloud Storage.

Used Teradata utilities FastLoad, MultiLoad, and TPump to load data.

Implemented the design plan for Azure Synapse Analytics with optimization solutions.

Built transformations using Databricks, Spark SQL, and Scala/Python, storing the results in ADLS/Azure Blob Storage and BigQuery.

Designed and implemented real-time data streaming pipelines using Azure Event Hubs to ingest, process, and analyze large volumes of data.

Designed, built, and managed ETL pipelines leveraging Logic Apps, Python, Spark, DBT, and Informatica IICS.

Created Azure Data Factory (ADF) batch pipelines to ingest data from the end client's servers through Azure Event Hubs using Spark and load it into Azure Data Lake Storage Gen2.

Designed and developed linear task flows to combine various data integration tasks and run them in a specific order using Airflow.
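A minimal sketch of such a linear task flow, expressed as an Airflow DAG: each task runs only after the previous one succeeds. The DAG id, task names, and callables are hypothetical examples, not the actual production workflow.

```python
# Hypothetical Airflow DAG illustrating a strictly linear task flow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract from source")

def transform():
    print("transform staged data")

def load():
    print("load into the warehouse")

with DAG(
    dag_id="linear_integration_flow",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Run the integration tasks in a specific, sequential order
    t1 >> t2 >> t3
```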

Developed and maintained data pipelines using Azure Databricks to extract, transform, and load data from Unity Catalog for analytics and reporting purposes.

Optimized performance of data processing jobs in Azure Databricks to handle large volumes of Unity Catalog data efficiently.

Developed data capture and navigation desktop and mobile apps in Microsoft Power Apps that connect to distribution center solutions and feed analytics in Power BI.

Involved in extracting, transforming, loading, and ETL testing of data from DBT, Oracle OCI, and Teradata APIs using IBM InfoSphere DataStage jobs.

Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.

Created PySpark/Scala scripts in Databricks using Unity Catalog for additional data governance and security.

Worked on Data Mastering by ingesting required data from various sources, applying translation rules, writing functions, and publishing golden records in the destination.

Conducted data quality checks and implemented data governance processes to ensure the accuracy and reliability of Unity Catalog data in Azure Databricks.

Worked closely with stakeholders to understand business requirements and translate them into technical solutions using Azure Databricks and Unity Catalog data.

Developed UNIX shell scripts to run IBM InfoSphere DataStage jobs and transfer files to the different landing zones.

Worked in transitioning the Qlik Apps from On-prem Enterprise to Qlik SaaS Platform.

Used Kusto Explorer for log analytics and better query response, and created alerts using Kusto Query Language (KQL).

Executed process improvements in data workflows using Alteryx processing engine and SQL.

Worked in Agile methodology to deliver proof of concept and production implementation in iterative sprints.

Used Python/PySpark/Scala with DBT to perform data cleaning and transformation on all types of datasets.

Involved in creating the external tables and views in Azure Synapse Analytics (DW) using DBT and initiated stored procedures to move the data from external to internal tables.

Explored Snowflake and leveraged Snowpipe and SnowSQL to build ETL pipelines with DBT as per requirements.

Designed various Jenkins jobs to continuously integrate the processes and executed CI/CD pipeline using Jenkins.

Worked on dimensional modelling to enable Power BI reporting, query, and analysis.

Worked with the DB2 group to create a best-fit physical data model from the logical data model through forward engineering in Erwin, and was involved in normalization and de-normalization of existing tables for faster query retrieval.

Performed data profiling of data vault hubs, links, and satellites using Erwin-generated SQL scripts; designed the physical data model (PDM) using the Erwin data modeling tool; and managed metadata for data models using SQL and T-SQL.

Designed migration scripts from Informatica MDM to the new Profisee Master Data Management platform components, including supporting applications and interfaces.

Experienced in Git configuration and creating CI/CD pipelines and deployment tasks in Azure DevOps.

Utilized Kubernetes and docker for the runtime environment for the CI/CD system to build, test, and deploy.

Worked closely with business leads to understand and finalize the requirements.

Environment: Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics, Azure Data Lake, Python, PySpark, SQL, SparkSQL, Scala, Java, Power BI, Qlik, Alteryx, Oracle DB, DBT, PuTTY, Druid, SharePoint, Cosmos DB, Git, Snowflake, Linear Regression, Logistic Regression, Decision tree, SQL, JIRA.

Virtusa, Tampa, FL Oct 2020 – June 2022

Azure Data Engineer

Responsibilities:

Developed and maintained end-to-end data engineering pipelines to process large-scale data using Azure as the cloud platform.

Built Data Engineering Pipelines utilizing Python, Spark, Databricks, Airflow, and other technologies.

Researched and implemented various components like pipeline, activity, mapping data flows, data sets, linked services, triggers, and control flow.

Built ETL pipelines using Azure Data factory with Azure SQL as source and Azure Synapse Analytics / Snowflake as Data Warehouse.

Worked on building ETL and ELT pipelines using Databricks with Azure Data Lake storage (Gen 2) as source and Azure Cosmos DB as Destination.

Created ADF pipelines that run dynamically using parameters, and used Filter and Aggregate transformations to transform the data as per the requirement.

Hands-on experience in designing and developing cloud solutions based on Microsoft Power Platform using Power Apps, Power Automate, OBIEE, Power BI, Power Apps Portal, and Common Data Service (CDS).

Integrated Azure Event Hubs with Apache Kafka for event ingestion and processing in a scalable and fault-tolerant manner.
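One common way to wire this up is to read from the Event Hubs namespace through its Kafka-compatible endpoint with Spark Structured Streaming. The sketch below assumes that approach; the namespace, hub name, and connection string are placeholders, and in practice the connection string would come from a secret store rather than being hard-coded.

```python
# Sketch: consume an Event Hub via its Kafka-compatible endpoint with
# Spark Structured Streaming. All names and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

EH_NAMESPACE = "example-ns"                      # hypothetical namespace
EH_NAME = "clickstream"                          # hypothetical event hub (Kafka topic)
EH_CONN_STR = "<event-hubs-connection-string>"   # deliberately left elided

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers",
                  f"{EH_NAMESPACE}.servicebus.windows.net:9093")
          .option("subscribe", EH_NAME)
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.mechanism", "PLAIN")
          .option("kafka.sasl.jaas.config",
                  'org.apache.kafka.common.security.plain.PlainLoginModule required '
                  f'username="$ConnectionString" password="{EH_CONN_STR}";')
          .load())

# Kafka delivers the payload as bytes; cast before downstream processing
events = stream.selectExpr("CAST(value AS STRING) AS body", "timestamp")
```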

Integrated Azure Active Directory authentication to every Cosmos DB request sent and demoed feature to stakeholders.

Worked on Building ADF data flow using Azure SQL and Azure Synapse analytics.

Orchestrated complex data pipelines using ADF; the pipelines contained activities such as Copy Data, ADF Data Flow (with Azure SQL as source and Azure Synapse Analytics as target), and ForEach for baseline loads.

Performed performance tuning of ADF data flows and pipelines using custom integration runtimes and by reducing the shuffle partitions.

Provided technical support to the client in establishing and streamlining the centralized data warehouse deployment process in Snowflake via DBT.

Developed Shell scripts for running IBM InfoSphere DataStage Jobs and transferring files to other internal teams and External vendors.

Worked on extracting data from CassandraDB using spark connectors and processing using PySpark.

Handled ingestion of data from various client data sources into Azure Storage for the new "PriceFx" ecosystem using ADF Data Flows.

Designed a PySpark-based application to convert data from one format to another, like CSV to Parquet.
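A minimal sketch of such a format-conversion job: read CSV, write Parquet. The input/output paths, options, and partition count are illustrative assumptions, not the application's actual configuration.

```python
# Sketch: convert CSV input to Parquet output with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://raw@examplestore.dfs.core.windows.net/input/"))

(df.repartition(8)                      # control the number/size of output files
   .write.mode("overwrite")
   .parquet("abfss://staged@examplestore.dfs.core.windows.net/output/"))
```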

Designed and implemented applications using PySpark to read CSV files and dynamically generate tables in Azure data storage.

Used Git for source control, integrated Git with Jenkins to support build automation, and integrated with Jira to monitor commits.

Implemented ETL logic in PySpark and Spark SQL notebooks using DBT.

Orchestrated the corresponding workflows using ADF pipelines.

Implemented data ingestion from source RDBMS databases such as Postgres and Azure SQL using Spark over JDBC on Azure Databricks; the solution was designed using Databricks Secrets and PySpark.
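A hedged sketch of this JDBC ingestion pattern on Databricks, pulling credentials from a secret scope. The secret scope/key, hostnames, table names, and output path are hypothetical; spark and dbutils are assumed to be the objects provided inside a Databricks notebook.

```python
# Sketch: Spark-over-JDBC ingestion using a Databricks secret scope.
# Scope, key, host, and table names are placeholders.
jdbc_url = "jdbc:postgresql://example-host:5432/sales"
db_password = dbutils.secrets.get(scope="ingestion", key="pg-password")

orders = (spark.read
          .format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", db_password)
          .option("fetchsize", "10000")   # fewer round-trips for large tables
          .load())

# Land the ingested table as Delta in a bronze layer (path is illustrative)
orders.write.format("delta").mode("overwrite").save("/mnt/bronze/orders")
```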

Defined required Spark SQL Statements to create databases and tables as per data Lakehouse architecture using different providers or file formats such as Delta, Parquet, CSV, JSON, etc.
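The kind of Spark SQL DDL involved is sketched below from a notebook context (spark is the session Databricks provides); the database, table, columns, and storage location are placeholders, and the same pattern applies with USING PARQUET, CSV, or JSON for external tables over existing files.

```python
# Sketch: lakehouse-style database and Delta table definitions via Spark SQL.
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")

spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.orders_raw (
        order_id   STRING,
        order_ts   TIMESTAMP,
        amount     DOUBLE
    )
    USING DELTA
    LOCATION 'abfss://bronze@examplelake.dfs.core.windows.net/orders_raw'
""")
```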

Developed and deployed scheduled Databricks Workflows/Jobs using notebooks built on PySpark and Spark SQL, and orchestrated Databricks-based tasks using Databricks Jobs.

Implemented QlikView and Qlik Sense security by integrating version control systems such as SVN and Git.

Responsible for Ingestion of Data from Blob to Kusto and maintaining the PPE and PROD pipelines.

Involved in creating DataStage jobs that load data from source to staging and from staging to the data vault.

Implemented the Data Vault modeling concept, solving the problem of dealing with change in environments by separating the business keys and the associations between those business keys from the descriptive attributes of those keys, using hubs, links, and satellites.

Implemented a data lakehouse architecture, which helps reduce costs and deliver data pipelines faster.

Performed extensive debugging, data validation, error handling mechanism, transformation types and data clean up analysis within large datasets.

Experience using CI/CD processes for application software integration and deployment using Git.

Defined sample test cases to Test the application and performed Unit testing to validate the application against the requirements, with proper documentation of the results.

Followed Agile methodology with two-week sprints, involving backlog refinement and sprint planning.

Played the role of liaison between business users and development team and created technical specifications based on the business requirements.

Environment: Azure SQL Data Warehouse, Azure Data Factory, Azure Data Lake Store, Azure Databricks, Azure Storage account, Cosmos DB, Azure Automation, Druid, Alteryx, Oracle DB, DBT, PuTTY, CassandraDB, Qlik, PowerShell, Scala, Python, SQL, PySpark, SparkSQL, Power BI, Tableau, SSRS.

US Bank, Melbourne, FL May 2019 – Sep 2020

Azure Big Data Engineer

Responsibilities:

Specialized in Azure cloud solutions with strong knowledge of Azure Databricks, Spark, Azure Data Lake, Azure Data Factory, and Data Warehouse (Azure Synapse).

Involved in architecting cloud data analytics solutions in Azure and migrating the on-premises DWH to the Azure cloud using ADLS Gen2, Azure Data Factory, Azure Databricks, and Azure Synapse.

Experience in migration of on-premises data to Blob storage/Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).

Generated automation and deployment templates for relational and NoSQL databases including MSSQL (T-SQL) and Cosmos DB in Azure using Python.

Worked on building a Data Lake/Data Warehouse/Data mart on Snowflake hosted on Azure.

Developed Spark applications using PySpark and Spark SQL in Azure Databricks for ETL operations across multiple file formats, transforming and analyzing the data to uncover business insights.

Have extensive experience in creating pipeline jobs, scheduling triggers, and Mapping data flows using Azure Data Factory (V2) and using Key Vaults to store credentials.

Defined Spark programs using Python (PySpark) API for data wrangling and transformations.

Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with dependencies spanning multiple clouds.

Responsible for managing MongoDB environment with high availability, performance, and scalability using tools like Mongo Compass, Mongo Atlas Manager, Cloud Manager.

Wrote PL/SQL queries to retrieve data from the Oracle database.

Declared DataFrames and temporary views for creating Delta tables with ACID transactions.

Involved in creating VMs from disks; basic knowledge of Logic Apps.

Managed Azure DevOps pipelines for CI/CD and Release Management workflows.

Generated various Charts, pivot tables, straight tables as per the requirement of the client.

Monitored development, enhancement, support, change requests, testing, and UAT progress in JIRA, and maintained the respective documents in the repositories.

Performed repository management activities to manage multiple versions of dashboards, merging the latest branch version into the master branch to maintain the centralized repository.

Utilized Kubernetes and docker for the runtime environment for the CI/CD system to build, test, and deploy.

Environment: Python, PySpark, Databricks, Airflow, CI/CD, Postgres, Azure SQL, Azure DevOps, Data Lake, ADF, ADLS, Azure Synapse, Qlik Sense, Kubernetes (AKS), Docker, UAT, JIRA.

United Health Group, MN Nov 2017- Apr 2019

Data Engineer

Responsibilities:

Collaborated with business leaders on data initiatives, focusing on the use of data to optimize business KPIs such as revenue and circulation, alongside a team of data professionals with specific focus on Analytics & Insight, Data Engineering, and Data Science.

Used Informatica IICS extensively for ingesting data from disparate source systems.

Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.

Automated jobs using different triggers (Event, Scheduled and Tumbling) in ADF.

Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.

Analyzed the data flow from different sources to targets to provide the corresponding design architecture in the Azure environment. Implemented solutions for ingesting data from various sources and processing the data using big data technologies such as Hive, Pig, Sqoop, HBase, and MapReduce.

Converted existing workflows and forms using Power Apps and Microsoft Power Automate (Flow) to improve business processes.

Created Build definition and Release definition for Continuous Integration and Continuous Deployment.

Designed the Application Interface Document for downstream teams to create a new interface to transfer and receive files through Azure Data Share.

Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.

Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to deliver streaming analytics in Databricks.
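A minimal sketch of that classic DStream pattern: Spark Streaming slices the input into mini-batches, and each batch is exposed as an RDD that can be transformed. The socket source, 10-second batch interval, and record layout are illustrative assumptions only.

```python
# Sketch: mini-batch RDD transformations with the Spark Streaming DStream API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="minibatch_demo")
ssc = StreamingContext(sc, batchDuration=10)    # one RDD per 10-second mini-batch

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical text source

# RDD-style transformations applied to every mini-batch
events = (lines.map(lambda line: line.split(","))
               .filter(lambda fields: len(fields) == 3)
               .map(lambda fields: (fields[0], float(fields[2]))))

totals = events.reduceByKey(lambda a, b: a + b)
totals.pprint()                                  # print a sample of each batch

ssc.start()
ssc.awaitTermination()
```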

Provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters. Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.

Implemented numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems, using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.

Implemented a one-time migration of multi-state-level data from SQL Server to Snowflake using Python and SnowSQL.
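A hedged sketch of one way such a one-time migration can be scripted in Python: pull a table from SQL Server with pyodbc/pandas and bulk-load it into Snowflake with the Snowflake Python connector. Server names, credentials, and table names are placeholders, and the exact connector options may differ from what was actually used.

```python
# Sketch: one-time table copy from SQL Server into Snowflake.
import pandas as pd
import pyodbc
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Source: SQL Server (connection details are placeholders)
src = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=example-host;"
    "DATABASE=SalesDB;UID=etl_user;PWD=<password>"
)
df = pd.read_sql("SELECT * FROM dbo.state_sales", src)

# Target: Snowflake (account and credentials are placeholders)
sf = snowflake.connector.connect(
    account="example_account", user="ETL_USER", password="<password>",
    warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
)
# write_pandas stages the frame and issues a COPY INTO under the hood
write_pandas(sf, df, table_name="STATE_SALES", auto_create_table=True)
```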

Planned several Databricks Spark jobs with PySpark to perform table-to-table operations.

Extensively used the SQL Server Import and Export Data tool and leveraged PL/SQL queries in the Oracle database.

Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.

Environment: Hadoop, Hive, Oozie, Spark, Spark SQL, Python, PySpark, Azure Data Factory, Azure SQL, Azure Databricks, Azure DW, BLOB storage, vector DB, Scala, AWS, Linux, Maven, Oracle 11g/10g, Zookeeper, MySQL, Snowflake.

Ivy, Hyderabad Nov 2013 - Oct 2016

Big Data Engineer

Responsibilities:

Involved in complete project life cycle starting from design discussion to production deployment.

Installed Hadoop, MapReduce, HDFS, and AWS components, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.

Developed a job server (REST API, Spring boot, ORACLE DB) and job shell for job submission, job profile storage, job data (HDFS) query/monitoring.

Implemented solutions for ingesting data from various sources and processing the data using big data technologies such as Hive, Pig, Sqoop, HBase, Splunk, and MapReduce.

Designed and developed a daily process for incremental import of raw data from DB2 into Hive tables using Sqoop.

Involved in debugging MapReduce jobs using the MRUnit framework and optimizing MapReduce. Extensively used Hive/HQL queries to query data in Hive tables and loaded data into Hive tables.

Developed data pipeline using Flume, Sqoop, Pig and MapReduce to ingest data into HDFS for analysis. Used Oozie and Zookeeper for workflow scheduling and monitoring.

Effectively used Sqoop to transfer data from databases (SQL, Oracle) to HDFS, Hive.

Integrated Apache Storm with Kafka to perform web analytics.

Uploaded click stream data from Kafka to HDFS, Hbase and Hive by integrating with Storm.

Worked on big data integration & analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods.

Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.

Design & implement ETL process using Informatica to load data from Worked extensively with Sqoop for importing and exporting the data from HDFS to Relational Database systems/mainframe and vice-versa.

Developed PIG Latin scripts to extract the data from the web server output files to load into HDFS.

Created concurrent access for hive tables with shared/exclusive locks enabled by implementing Zookeeper in cluster.

Used SSIS to develop jobs for extracting, cleaning, transforming, and loading data into data warehouse.

Strongly recommended bringing in Elasticsearch and was responsible for its installation, configuration, and administration. Implemented Scala and SQL for faster testing and processing of data, and streamed data in real time using Kafka.

Built an automation framework using shell and Python scripts to validate source-to-target testing, generate reports, and publish them to the project dashboard.

Involved in designing and developing Kafka- and Storm-based data pipelines with the infrastructure team.

Worked on major components in Hadoop Ecosystem including Hive, PIG, HBase, HBase-Hive Integration, Scala, Sqoop and Flume.

Developed Hive scripts, Pig scripts, and Unix shell scripts for all ETL and ELT loading processes, converting the files into Parquet in the Hadoop file system. Worked with Oozie and ZooKeeper to manage job workflow and job coordination in the cluster.

Environment: Hadoop, Rest API, Maven, MRunit, Junit, Scala, Sqoop, Kafka, Pig, HDFS, Map Reduce, Hive, HBase, Oozie.

DigiPrise, Hyderabad, India Aug 2011- Oct 2013

Data Analyst

Responsibilities:

Worked with teams to identify and construct relations among various parameters to analyze customer response data.

Developed and improved bidding algorithms for daily optimization using Python, continuously analyzed and tested new data sources, and performed research analysis on bidding strategies.

Developed automated data pipelines from various external data sources (web pages, APIs, etc.) to the internal data warehouse (SQL Server, AWS), then exported to reporting tools like Tableau.
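A minimal sketch of the ingestion side of such a pipeline: pull JSON from an external API with requests, normalize it with pandas, and land it in a SQL Server warehouse table via SQLAlchemy, from which Tableau can then report. The URL, credentials, database, and table names are hypothetical.

```python
# Sketch: external API -> pandas -> SQL Server warehouse table.
import pandas as pd
import requests
from sqlalchemy import create_engine

resp = requests.get("https://api.example.com/v1/campaign-metrics", timeout=30)
resp.raise_for_status()

# Flatten the JSON payload into a tabular frame
df = pd.json_normalize(resp.json()["results"])

engine = create_engine(
    "mssql+pyodbc://etl_user:<password>@example-host/MarketingDW"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
df.to_sql("campaign_metrics", engine, if_exists="append", index=False)
# Tableau then connects to MarketingDW for reporting.
```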

Carried out various mathematical operations for calculation purposes using the Python libraries NumPy, Statsmodels, and Pandas.

Worked on Informatica, SSIS packages, and DTS import/export for transferring data from databases (Oracle and text-format data) to SQL Server by writing PL/SQL queries.

Recorded online users' data using TypeScript and Python Django forms, and implemented test cases using Pytest.

Configured various big data workflows to run on top of Hadoop; these workflows comprised heterogeneous jobs such as Pig, Hive, and Spark.

Designed and deployed reports with drill-down and drop-down menu options, along with parameterized and linked reports, using Tableau to ensure data matched the business requirements.

Defined and created cloud data strategies, including designing a multi-phased implementation using AWS and S3.

Developed views in Tableau Desktop that were published to the internal team for review, further data analysis, and customization using filters and actions, and used KPIs for business performance.

Environment: Python, API, NumPy, Pandas, Statsmodels, Hadoop, Pig, Hive, HiveQL, Tableau, AWS, S3, Redshift, JSON, KPI, ETL.

Education:

Bachelor of Technology, Computer Science, Jawaharlal Nehru Technological University, Kakinada, 2011.


