Data Engineer Azure

Location: Irving, TX 75063
Posted: February 06, 2024

Data Engineer

Vishnu Bolloju

Email: ad3ey0@r.postjobfree.com | Contact No: +1-972-***-****

LinkedIn: https://www.linkedin.com/in/vishnu-vardhana-0016a7257/

Data Engineer | Big Data | SQL Server | Azure | AWS | Azure Data Factory | Azure Databricks | Power BI | Data Wrangling | Data Governance | Python | AI/ML | Data Migration

PROFESSIONAL SUMMARY

Azure Data Engineer with over 11 years of experience and a strong background in data engineering and business intelligence.

Proven track record of migrating SQL databases to Azure Data Lake, Azure SQL Database, Databricks, Azure SQL Data Warehouse, and Azure Cosmos DB, ensuring seamless integration and efficient data management.

Extensive experience in controlling and granting database access, with a focus on security measures, and in migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Proficient in developing Spark applications using Spark-SQL in Databricks, specializing in data extraction, transformation, and aggregation from various file formats to derive meaningful insights into customer usage patterns.

In-depth understanding of Spark Architecture, encompassing Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors, and Tasks.

Expertise in Big Data Hadoop and YARN architecture, along with a comprehensive grasp of Hadoop daemons such as Job Tracker, Task Tracker, Name Node, Data Node, Resource/Cluster Manager, and Kafka for distributed stream processing.

Database design and development proficiency with Business Intelligence using SQL Server 2014/2016, Integration Services (SSIS), DTS Packages, SQL Server Analysis Services (SSAS), DAX, OLAP Cubes, Star Schema, and Snowflake Schema.

Outstanding communication skills coupled with excellent work ethics, known for being a proactive team player with a positive attitude.

Domain knowledge in Finance, Logistics, and Health Insurance, contributing to a holistic understanding of industry-specific data requirements.

Strong skills in visualization tools such as Power BI and Excel, including advanced knowledge of formulas, pivot tables, charts, and DAX commands.

Proven leadership in various phases of project life cycles, including Design, Analysis, Implementation, and Testing, ensuring successful project delivery.

Led efforts in database administration and performance tuning, providing scalability and accessibility, ensuring 24/7 availability of data, and resolving end-user reporting and accessibility issues.

Microsoft Azure Data Engineer Associate with a focus on implementing Azure data solutions, provisioning storage accounts, and designing relational and non-relational data stores on Azure.

Proficient in developing batch processing solutions using Azure Data Factory and Azure Databricks, including the implementation of clusters, notebooks, jobs, and auto-scaling.

In-depth knowledge of data auditing, data masking, and encryption for data at rest and in transit, as well as a good understanding of data warehouse concepts, the Hadoop ecosystem, SQL, relational databases, and Python. Experience working on the open-source Apache Hadoop distribution with technologies such as HDFS, MapReduce, Python, Spark, Spark Streaming, Storm, and Kafka.

SPECIALTIES

SQL Server / Azure SQL Database

Azure Synapse Analytics, Python

Performance Tuning & Optimization

Data modeling, OLTP/OLAP

Big data Hadoop, Hadoop Ecosystem

Azure Data Factory / ETL / ELT / SSIS

Azure Data Lake Storage, Azure Databricks

SSRS / Power BI / Snowflake

AWS, Data Engineering, Redshift

Hive, Sqoop, HBase

TECHNICAL SPECIFICATIONS

Programming & Scripting Languages

Python, Scala, Java, PL/SQL, Shell Scripting

Databases

MySQL, SQL Server, PostgreSQL, Oracle DB, Snowflake

Cloud Services

Azure, Azure Databricks, Azure DevOps, ADLS, Snowflake, AWS, S3, EMR, Glue

Big Data Technologies

Hadoop, HDFS, Hive, Sqoop, Spark, Machine Learning, Pandas, NumPy, Zookeeper, Flume, Airflow, Informatica, Snowflake, Databricks

Machine Learning and Statistics

Regression, Random Forest, Clustering, Time-Series Forecasting, Hypothesis Testing, Exploratory Data Analysis

Dashboarding / Reporting / Analytical Tools

Power BI

Version Control

Azure Git, GitHub

IDE & Build Tools

IntelliJ, Eclipse, PyCharm, Maven, Gradle

Platforms

Windows, Linux (Ubuntu), macOS, CentOS (Cloudera)

Software Methodologies

Agile, Scrum, Waterfall

EDUCATION DETAILS

Bachelor’s in Computer Science and Engineering May 2012

Vardhaman College of Engineering – 7.75 CGPA

Master’s in Computer Science December 2016

Cleveland State University – 3.7 CGPA

PROFESSIONAL EXPERIENCE

TIAA-CREF, Charlotte, NC July 2021 – Present

Sr. Azure Data Engineer

Project Description: TIAA is a financial services company that specializes in retirement plans, IRAs, mutual funds, and life insurance, and is dedicated to helping those who teach, heal, and serve achieve financial well-being. The data engineering team designs and develops efficient database solutions, creates ETL pipelines for seamless data movement, optimizes database performance, and ensures data security and compliance with industry standards. Collaborating with cross-functional teams, it integrates data from diverse sources, enforces data quality standards, and communicates effectively with stakeholders, drawing on SQL, relational databases, and cloud-based solutions such as Azure to support TIAA's commitment to financial well-being through robust data management.

Key Contributions: Responsible for building a scalable, distributed data solution in a Hadoop cluster environment on the Hortonworks distribution.

Responsibilities:

Spearheaded the development of ETL pipelines, orchestrating the seamless extraction, transformation, and loading of data from the data lake to various databases based on specific requirements.

Designed and implemented data applications utilizing a comprehensive tech stack, including HDFS, Hive, Spark, Scala, Sqoop, Atomic Scheduler, DB2, SQL Server, and Teradata.

Analyzed, designed, and implemented modern data solutions using Azure PaaS services, ensuring optimal data visualization and integration with existing business processes.

Conducted a thorough assessment of current production states, evaluating the impact of new implementations on business processes, and proposing effective optimization solutions.

Executed ETL processes from diverse source systems to Azure Data Storage services, leveraging Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.

Orchestrated data ingestion to Azure Services, including Azure Data Lake, Azure Storage, Azure SQL, and Azure Data Warehouse, with subsequent data processing in Azure Databricks.

Developed robust pipelines in Azure Data Factory, employing Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools (a minimal SDK sketch follows the Environment line below).

Created Spark applications using PySpark and Spark SQL for extracting, transforming, and aggregating data from multiple file formats, uncovering valuable insights into customer usage patterns (see the sketch after this list).

Played a pivotal role in estimating, monitoring, and troubleshooting Spark Databricks clusters, demonstrating expertise in performance tuning for optimal Batch Interval time, Parallelism levels, and memory utilization.

Designed and implemented User-Defined Functions (UDFs) in Scala and PySpark to meet specific business requirements, contributing to the effectiveness and customization of data processing workflows (illustrated in the same sketch below).

Designed, developed, and tested the ETL strategy to populate data from multiple source systems.

Experienced in developing data mappings, performance tuning, and identifying bottlenecks in sources, mappings, targets, and sessions.

Strong understanding of data modeling in data warehouse environments, including star and snowflake schemas.
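
The following is a minimal, illustrative PySpark sketch of the kind of Databricks job described above; the ADLS paths, column names, and UDF logic are assumptions for illustration, not the actual production pipeline.

    # Minimal PySpark sketch; the paths, columns, and UDF logic are assumptions.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Extract: two hypothetical source feeds in different file formats.
    events = spark.read.parquet("abfss://raw@account.dfs.core.windows.net/events/")
    profiles = spark.read.json("abfss://raw@account.dfs.core.windows.net/profiles/")

    # Small PySpark UDF of the kind referenced above (assumed business rule).
    @F.udf(returnType=StringType())
    def usage_band(minutes):
        if minutes is None:
            return "unknown"
        return "heavy" if minutes > 120 else "light"

    # Transform and aggregate: join, derive a usage band, summarize per customer.
    usage = (events.join(profiles, "customer_id", "left")
             .withColumn("band", usage_band(F.col("session_minutes")))
             .groupBy("customer_id", "band")
             .agg(F.sum("session_minutes").alias("total_minutes"),
                  F.count("*").alias("sessions")))

    # Load: write the aggregate back to the curated zone.
    usage.write.mode("overwrite").parquet(
        "abfss://curated@account.dfs.core.windows.net/usage_by_customer/")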

Environment: Azure Data Factory, Databricks (PySpark), HDFS, Hive, Azure SQL Data Warehouse, Azure Storage (ADLS), Azure Cosmos DB, DB2, SQL Server, Scala, and Python for ETL pipelines, data storage, processing, and analytics.
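
A hedged sketch of defining a simple ADF copy pipeline with the azure-mgmt-datafactory Python SDK, as referenced in the pipeline bullets above; the subscription, resource group, factory, and dataset names are placeholders, and the linked services and datasets are assumed to already exist.

    # Illustrative ADF pipeline definition via the Python SDK; all names are
    # placeholders and the referenced datasets/linked services must already exist.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, CopyActivity, DatasetReference, BlobSource, AzureSqlSink)

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    copy_activity = CopyActivity(
        name="CopyBlobToAzureSql",
        inputs=[DatasetReference(type="DatasetReference", reference_name="BlobInputDataset")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="AzureSqlOutputDataset")],
        source=BlobSource(),
        sink=AzureSqlSink(),
    )

    adf.pipelines.create_or_update(
        "my-resource-group", "my-data-factory", "CopyBlobToSqlPipeline",
        PipelineResource(activities=[copy_activity]))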

Edward Jones, St. Louis, MO Aug 2020 – Jun 2021

Sr. Azure Data Engineer

Project Description: Edward Jones is the principal operating company of the Jones Financial Companies. Edward Jones is developing a next-generation analytics architecture to support its growing business; as part of the Tech Data Engineering team, the role involves finding innovative solutions daily to improve the speed and delivery of information, with a focus on the design of the Analytic Hub architecture and data pipelines based on business needs and technology efficiency.

Key Contributions: Responsible for building a scalable, distributed data solution in a Hadoop cluster environment on the Hortonworks distribution.

Responsibilities:

Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization; assessed the current production state of applications and determined the impact of new implementations on existing business processes.

Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics; ingested data to one or more Azure services and processed it in Azure Databricks.

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, uncovering insights into customer usage patterns.

Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and appropriate memory allocation (see the configuration sketch after this list).

Wrote UDFs in Scala and PySpark to meet specific business requirements.

Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.

Hands-on experience developing SQL scripts for automation purposes.

Created build and release pipelines for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
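
A minimal configuration sketch of the tuning knobs referenced above (parallelism, memory, adaptive execution); the specific values are assumptions and would be sized against the actual Databricks workload.

    # Illustrative tuning knobs; the values are assumptions to be sized per workload.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("tuned-batch-job")
        # Parallelism: shuffle partitions roughly 2-3x the total executor cores.
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.default.parallelism", "400")
        # Memory: executor heap plus overhead for wide aggregations.
        .config("spark.executor.memory", "8g")
        .config("spark.executor.memoryOverhead", "2g")
        # Adaptive execution coalesces small shuffle partitions at runtime.
        .config("spark.sql.adaptive.enabled", "true")
        .getOrCreate()
    )
    # For DStream jobs, the batch interval is set on the StreamingContext,
    # e.g. StreamingContext(spark.sparkContext, batchDuration=30).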

Environment: Python, Power BI, Logic Apps, Azure Functions, Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse, Spark SQL, Azure Data Factory.

AIG, New York, NY Sep 2018 – Jul 2020

Data Engineer

Project Description: American International Group, Inc. (AIG) is an American multinational finance and insurance corporation with operations in more than 80 countries and jurisdictions. Working with a very large customer database, the team accessed second-by-second viewership data to understand how consumers watch content and analyzed patterns to better inform future content creation and scheduling, driving a better consumer experience. Consumer insights were used to attribute real-life actions to advertising efforts, enabling real-time campaign optimization and a better understanding of the marketing effectiveness of all media channels.

Key Contributions: Used Azure Data Factory (V2) and PySpark on Databricks as ETL tools to increase the speed of information access; involved in data warehouse implementations using Azure SQL Data Warehouse, SQL Database, Azure Data Lake Storage (ADLS), and Azure Data Factory V2.

Responsibilities:

Involved in creating specifications for ETL processes, finalizing requirements, and preparing specification documents.

Migrated data from an on-premises SQL database to Azure Synapse Analytics using Azure Data Factory and designed an optimized database architecture.

Created Azure Data Factory pipelines for copying data from Azure Blob Storage to SQL Server.

Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight/Databricks.

Worked with Microsoft on-premises data platforms, specifically SQL Server, SSIS, SSRS, and SSAS.

Created reusable ADF pipelines to call REST APIs and consume Kafka events.

Used Control-M for scheduling DataStage jobs and Logic Apps for scheduling ADF pipelines.

Developed and configured build and release (CI/CD) processes using Azure DevOps, and managed application code in Azure Git with the required security standards for .NET and Java applications.

Migrated ETL logic running in SSIS and MS Access to Azure Data Factory pipelines without any change to the business logic.

Developed high-performance data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks.

Extensively worked on creating pipelines in the Azure cloud (ADF V2) using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.

Developed dynamic Data Factory pipelines using parameters and triggered them as desired via events (such as file availability on Blob Storage), on a schedule, or through Logic Apps.

Wrote SQL queries to help the ETL team with system migration, including DDL and DML code to map and migrate data from the source to the new destination server in Azure SQL Database.

Utilized PolyBase and T-SQL queries to import large volumes of data from Azure Data Lake Store to Azure SQL Data Warehouse (see the sketch after the Environment line below), and created Azure Runbooks to scale Azure Analysis Services and Azure SQL Data Warehouse up and down.

Upgraded Azure SQL Data Warehouse Gen1 to Azure SQL Data Warehouse Gen2.

Designed and developed Azure SQL Data Warehouse fact and dimension tables, using different distributions (hash, replicated, and round-robin) when creating them.

Developed Power BI and SSRS reports and created SSAS database cubes to facilitate self-service BI.

Created Azure Data Factory pipelines to load data from on-premises SQL Server to Azure Data Lake Store.

Utilized Azure's ETL service, Azure Data Factory (ADF), to ingest data from legacy, disparate data stores into Azure Data Lake Storage.

Used Spark Streaming APIs to perform on-the-fly transformations and actions for the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra and Redshift as part of large data lake pipelines (a streaming sketch follows this bullet).
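
A hedged sketch of the Kafka-to-Cassandra path using Spark Structured Streaming (a modern equivalent of the Spark Streaming APIs mentioned above); the topic, schema, keyspace, and connector settings are placeholders, and the Cassandra write assumes the spark-cassandra-connector package is attached to the cluster.

    # Hedged Structured Streaming sketch; topic, schema, and keyspace are placeholders.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, LongType

    spark = SparkSession.builder.appName("learner-stream").getOrCreate()

    schema = (StructType()
              .add("learner_id", StringType())
              .add("event_type", StringType())
              .add("event_ts", LongType()))

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "learner-events")
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    def write_batch(batch_df, batch_id):
        # Persist each micro-batch to Cassandra (requires spark-cassandra-connector).
        (batch_df.write.format("org.apache.spark.sql.cassandra")
         .options(keyspace="learner", table="events")
         .mode("append").save())

    (events.writeStream.foreachBatch(write_batch)
     .option("checkpointLocation", "/tmp/checkpoints/learner-events")
     .start())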

Environment: Azure Data Factory, Java, Python, ADLS, Oracle, Tableau, Hadoop, Spark, Scala, Hive, SQL Server, SSIS.
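
A hedged sketch of a PolyBase-style load and a hash-distributed fact table in Azure SQL Data Warehouse, driven from Python with pyodbc, as referenced in the bullets above; the server, credentials, external data source, and file format are assumptions and must already be defined in the pool.

    # Hedged sketch; server, credentials, external data source, and file format
    # are placeholders and must already exist in the dedicated SQL pool.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=mydw-server.database.windows.net;DATABASE=mydw;UID=loader;PWD=<secret>",
        autocommit=True)
    cur = conn.cursor()

    # External table over files in ADLS, then CTAS into a hash-distributed table.
    cur.execute("""
        CREATE EXTERNAL TABLE ext.SalesStaging (
            SaleId BIGINT, CustomerId BIGINT, Amount DECIMAL(18,2), SaleDate DATE)
        WITH (LOCATION = '/sales/2019/', DATA_SOURCE = AdlsSource, FILE_FORMAT = ParquetFormat);""")

    cur.execute("""
        CREATE TABLE dbo.FactSales
        WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
        AS SELECT * FROM ext.SalesStaging;""")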

Eli Lilly and Company, Indianapolis, IN Sep 2016 – Aug 2018

Data Engineer

Project Description: Eli Lilly and Company is an American pharmaceutical company. The role involved building high-performance, scalable data pipelines adhering to data lakehouse, data warehouse, and data mart standards for optimal storage, retrieval, and processing of data, and working with business owners to define key business requirements and convert them into technical specifications.

Key Contributions: Developed business logic using the Kafka direct stream in Spark Streaming and implemented business transformations.

Responsibilities:

Worked on building and implementing a real-time streaming ETL pipeline using the Kafka Streams API.

Worked on Hive to implement web interfacing and stored the data in Hive tables.

Migrated MapReduce programs into Spark transformations using Spark and Scala.

When working with the open-source Apache distribution, Hadoop admins have to manually set up all the configuration files (core-site, hdfs-site, yarn-site, and mapred-site); with popular Hadoop distributions such as Hortonworks, Cloudera, or MapR, the configuration files are set up on startup and the admin need not configure them manually.

Experienced with SparkContext, Spark SQL, and Spark on YARN.

Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing (a PySpark equivalent is sketched after the Environment line below).

Implemented data quality checks using Spark Streaming, flagging records as passable or bad.

Implemented Hive Partitioning and Bucketing on the collected data in HDFS.

Implemented Sqoop jobs for large data exchanges between RDBMS and Hive clusters.

Extensively used ZooKeeper as a coordination/backup service and for scheduling Spark jobs.

Deployed and monitored scalable infrastructure on Amazon Web Services (AWS), managed servers and configuration using Ansible, created AWS instances, and migrated data to AWS from the data center.

Built CI/CD pipelines using Jenkins for end-to-end automation of all builds and deployments.

Created job chains with Jenkins Job Builder, parameterized triggers, and target host deployments, utilizing many Jenkins plugins and the Jenkins API.

Backed up AWS PostgreSQL to S3 via a daily job run on EMR using DataFrames (see the sketch after this list).

Developed Spark scripts using Scala shell commands as per the business requirement.

Worked on Cloudera distribution and deployed on AWS EC2 Instances.

Experienced in loading real-time data into a NoSQL database such as Cassandra.

Well versed in data manipulation and compaction in Cassandra.

Experience in retrieving the data present in Cassandra cluster by running queries in CQL (Cassandra Query Language).

Worked on connecting the Cassandra database to the Amazon EMR file system to store the database in S3.

Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

Deployed the project on Amazon EMR with S3 connectivity as backup storage.
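
An illustrative sketch of the daily PostgreSQL-to-S3 backup on EMR using DataFrames; the JDBC URL, credentials, table, and bucket are placeholders, and the PostgreSQL JDBC driver is assumed to be on the cluster classpath.

    # Illustrative daily backup; URL, credentials, table, and bucket are placeholders.
    from datetime import date
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pg-daily-backup").getOrCreate()

    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db-host:5432/appdb")
              .option("dbtable", "public.orders")
              .option("user", "backup_user")
              .option("password", "<secret>")
              .option("driver", "org.postgresql.Driver")
              .load())

    # Partition the backup by run date so daily jobs do not overwrite each other.
    orders.write.mode("overwrite").parquet(
        f"s3://my-backup-bucket/orders/dt={date.today().isoformat()}/")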

Environment: Hadoop, MapReduce, Hive, Spark, Oracle, GitHub, Tableau, HBase, AWS, Amazon EC2, S3, Cassandra cluster.
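
The Hive-access work above was done in Scala; the following PySpark sketch shows the equivalent pattern of querying a Hive table and writing a partitioned, bucketed copy. The database, table, and column names are assumptions.

    # Minimal sketch; database, table, and column names are assumptions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("hive-access")
             .enableHiveSupport()
             .getOrCreate())

    # Query an existing Hive table through Spark SQL.
    clicks = spark.sql("SELECT user_id, page, event_date FROM analytics.clicks")

    # Write it back to Hive partitioned by date and bucketed by user,
    # mirroring the Hive partitioning/bucketing bullet above.
    (clicks.write.mode("overwrite")
     .partitionBy("event_date")
     .bucketBy(8, "user_id")
     .sortBy("user_id")
     .saveAsTable("analytics.clicks_bucketed"))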

Bharat Biotech, Hyderabad, India May 2012 – July 2015

Data Engineer / Hadoop Developer

Project Description: This project was an e-commerce web-based application that allowed customers to view all the products in the store and buy them online. The application mainly dealt with online payments and billing, using the Secure Good payment gateway solution for customers to make payments.

Key Contributions: Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.

Responsibilities:

Converted raw data into serialized and columnar formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency over the network (see the sketch after the Environment line below).

Worked on building end to end data pipelines on Hadoop Data Platforms.

Worked on normalization and denormalization techniques for optimum performance in relational and dimensional database environments.

Created files and tuned SQL queries in Hive using Hue; implemented MapReduce jobs in Hive by querying the available data.

Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.

Experience with PySpark, using Spark libraries through Python scripting for data analysis.

Involved in converting HiveQL into Spark transformations using Spark RDDs and Scala programming.

Developed an API to write XML documents from a database. Utilized XML and XSL Transformation for dynamic web-content and database connectivity.

Created user-defined functions (UDFs) and user-defined aggregate functions (UDAFs) in Pig and Hive.

Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.

Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions (a producer sketch follows this list).

Supported the cluster and topics through Kafka Manager; CloudFormation scripting for security and resource automation.
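
A hedged sketch of a producer with a custom encoder; the original encoders were written against the Kafka Java/Scala API, so this confluent-kafka Python version only illustrates the pattern, and the topic and record fields are assumptions.

    # Hedged sketch; topic and record fields are assumptions, and this Python
    # confluent-kafka version only illustrates the original encoder pattern.
    import json
    from confluent_kafka import Producer

    def encode_order(order: dict) -> bytes:
        # Custom encoding: stable field order, UTF-8 bytes.
        return json.dumps(order, sort_keys=True).encode("utf-8")

    producer = Producer({"bootstrap.servers": "broker:9092"})

    order = {"order_id": 42, "customer": "c-1001", "amount": 199.99}
    # Keying by customer routes all of that customer's orders to the same partition.
    producer.produce("orders", key=order["customer"], value=encode_order(order))
    producer.flush()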

Environment: Python, HDFS, MapReduce, Flume, Kafka, ZooKeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services.
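
An illustrative sketch of converting raw delimited files to Parquet and Avro, as described in the first bullet above; the paths are placeholders, and the Avro write assumes the spark-avro package is available on the cluster.

    # Illustrative conversion; paths are placeholders and the Avro write assumes
    # the spark-avro package is available on the cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-to-columnar").getOrCreate()

    raw = (spark.read.option("header", "true")
           .option("inferSchema", "true")
           .csv("hdfs:///raw/orders/"))

    # Columnar Parquet for analytics, compact Avro for downstream exchange.
    raw.write.mode("overwrite").parquet("hdfs:///curated/orders_parquet/")
    raw.write.mode("overwrite").format("avro").save("hdfs:///curated/orders_avro/")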


