Nikith Kumar Reddy Vangapally
Phone: 682-***-****
Email: *********@*****.***
LinkedIn: linkedin.com/in/nikith-reddy-196b55169
Professional Summary:
5+ years of experience in IT with strong expertise in the Big Data ecosystem – Data Engineering, Data Acquisition, Ingestion, Modelling, Storage, Analysis, Integration, and Data Processing.
Good experience working with data governance, data quality, data lineage, and data architects to design various models and processes.
Good experience with Azure services (PaaS & IaaS): Azure Synapse Analytics, Azure SQL, Azure Data Lake, Azure Data Factory, Azure Analysis Services, Azure Databricks, Azure Monitor, and Azure Key Vault.
Experience with Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingesting data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse) and processing it in Azure Databricks.
Hands-on experience with Amazon Web Services (AWS) cloud services such as EC2, S3, IAM, EBS, RDS, ELB, VPC, Auto Scaling, CloudFront, CloudTrail, CloudWatch, CloudFormation, Elastic Beanstalk, SNS, SQS, SES, SWF, and AWS Direct Connect.
Experience using AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
Well experienced in creating and monitoring Hadoop clusters on AWS EC2 and VMs with Hortonworks Data Platform, CDH3/CDH4, and Cloudera Manager on Linux.
Experienced in building ETL workflows in the Azure platform using Azure Databricks and Data Factory.
Experience using distributed computing architectures and building pipelines with AWS products (EC2, Redshift, EMR, Elasticsearch, EBS, S3), Hadoop, Python, and Spark, with effective use of MapReduce, SQL, and Cassandra to solve big data problems.
Hands-on experience with Hadoop cluster architecture and sources; involved in HDFS maintenance and loading of structured and unstructured data.
Good experience managing and reviewing Hadoop log files and working with NoSQL databases such as HBase and MongoDB.
Experience working on data migration, data conversion, and data extraction/transformation/loading (ETL).
Hands-on experience using Databricks to handle the full analytical process, from ETL to data modelling, by leveraging familiar tools, languages, and skills via interactive notebooks or APIs.
Expert in configuring Kafka brokers to meet the organization's big data needs.
Expertise in Apache Hadoop ecosystem and related components such as Spark, Hadoop Distributed File System (HDFS), Kafka, Hive, MapReduce, Sqoop, HBase, ZooKeeper, Exadata, Airflow, Snowflake, YARN, Flume, Pig, NiFi, and Scala.
Experience developing UDFs as needed for use in Pig and Hive queries (a brief sketch follows this summary).
Experience writing complex SQL, PL/SQL, and T-SQL queries, implementing procedures and functions, and improving database performance.
Excellent understanding of NoSQL databases such as HBase, Cassandra, and MongoDB.
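The following is an illustrative sketch only (not code from a specific project above), showing how a Python UDF can be registered for use in Hive-style Spark SQL queries; the function name and table are hypothetical.

    # Illustrative sketch: register a Python UDF for use in a Hive-style Spark SQL query.
    # The function name (normalize_zip) and table (customers) are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").enableHiveSupport().getOrCreate()

    def normalize_zip(zip_code):
        # Restore leading zeros lost when ZIP codes were ingested as integers.
        return None if zip_code is None else str(zip_code).zfill(5)

    spark.udf.register("normalize_zip", normalize_zip, StringType())
    spark.sql("SELECT customer_id, normalize_zip(zip) AS zip FROM customers").show()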
Technical Skills:
Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, ZooKeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake
Spark Components: Apache Spark, DataFrames, Spark SQL, YARN, Pair RDDs
Web Technologies / Other Components: Log4j, HTML, XML, CSS
Server-Side Scripting: UNIX Shell, PowerShell, Python scripting (Boto3)
Databases: Oracle, Microsoft SQL Server, MySQL, DB2, Teradata, Snowflake, Yellowbrick
Programming Languages: Java, Scala, Python
Web Servers: Apache Tomcat, WebLogic
IDE: Eclipse, Dreamweaver
OS/Platforms: Windows, Linux (all major distributions), UNIX, CentOS
Methodologies: Agile (Scrum), Waterfall, UML, Design Patterns, SDLC
AWS Services: S3, EC2, EMR, Redshift, RDS, Glue, Lambda, Kinesis, SNS, SQS, AMI, IAM, CloudFormation
Azure: Azure Data Factory (ETL/ELT), SSIS, Azure Data Lake Storage, Azure Databricks
Data Modelling Tools: Erwin Data Modeler 9.8, ER Studio v17, Power Designer 16.6
Education:
Master of Science in Computer Science Jan 2022 - May 2023
University of North Texas, Denton, TX.
Bachelor of Technology in Computer Science June 2014 – Aug 2018
Osmania University, India.
Professional Experience:
Client: Mastercard, Plano, TX July 2023 – Present
Role: Senior Data Engineer
Responsibilities:
Responsible for requirement gathering, understanding the business value, and providing analytical data solutions.
The project involved migrating data from on-premises systems to the cloud (AWS); sources included MySQL, Oracle, and MongoDB.
Used Kafka as the messaging service and created data pipelines from heterogeneous sources to Snowflake.
Developed Spark jobs with the Spark Core and Spark SQL libraries for processing the data.
Worked on IAM security policies to provide fine-grained access to AWS S3 for Lambda functions and DynamoDB.
Implemented AWS Step Functions to automate and orchestrate SageMaker-related tasks such as publishing data to S3.
Configured, scheduled, and triggered ETL jobs in the ETL orchestrator using JSON to load source data into the data lake.
Created a Snowflake warehouse strategy and set it up to use PUT scripts to migrate a terabyte of data from S3 into Snowflake.
Created a custom read/write Snowflake utility function in Python to transfer data from an AWS S3 bucket into Snowflake (an illustrative sketch follows this list of responsibilities).
Created S3 buckets, managed S3 bucket policies, and used S3 for storage and backup on AWS.
Developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into Snowflake.
Worked with GitHub to clone the repository and commit changes across code versions, and released to Bitbucket by creating pull requests for the merge process.
Worked extensively on performance tuning and query optimization, contributing to improvements in the deployed code.
Leveraged Apache Spark with Python for large-scale big data analytics and machine learning, implementing ML use cases with Spark ML and MLlib.
Created data pipelines involving various AWS services including S3, Kinesis Data Firehose, Kinesis Data Streams, SNS, SQS, and Athena, along with Snowflake.
Worked on end-to-end project deployment involving data analysis, data pipelining, data modelling, data reporting, and data documentation per business needs.
Worked on large volumes of historical data using parallel computation engines such as Snowflake, enabling data scientists to speed up their modelling process.
Created views and materialized views with the flexibility to slice and dice data easily, saving time.
Performed root cause analysis of internal and external data and processes to answer specific business questions.
Implemented various Kafka integrations, such as Elasticsearch and databases (RDBMS and NoSQL).
Worked with ETL (extract, transform, and load) tools to design, develop, and deliver requirements; improved the ETL process by shortening the end-to-end load.
Implemented data validation checks to ensure data flows through the channels as expected; for any undesired data values, set up a monitoring framework using AWS CloudWatch that alerts key individuals to debug, identify the cause, and resolve it with a one-time or long-term fix (an illustrative sketch follows the environment line below).
Performed performance analysis and fixed issues in Spark jobs to optimize execution time, thereby reducing resource costs.
Transformed batch data from several tables containing millions of records in SQL Server, MySQL, and PostgreSQL, as well as CSV file datasets, into data frames using PySpark.
Conducted data blending and data preparation using SQL for Tableau consumption and published data sources to Tableau Server.
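A minimal, hypothetical sketch of the kind of Snowflake load utility described above, using the snowflake-connector-python package; the connection parameters, stage, and table names are placeholders, and the actual project code may have differed.

    # Hypothetical sketch of a small Snowflake load utility (snowflake-connector-python).
    # Connection parameters, stage, and table names are placeholders, not project values.
    import snowflake.connector

    def copy_s3_stage_to_table(conn_params, stage_name, table_name,
                               file_format="(TYPE = PARQUET)"):
        """Run COPY INTO to load files from an external S3 stage into a Snowflake table."""
        conn = snowflake.connector.connect(**conn_params)
        try:
            cur = conn.cursor()
            cur.execute(
                f"COPY INTO {table_name} FROM @{stage_name} FILE_FORMAT = {file_format}"
            )
            return cur.fetchall()  # one load-status row per staged file
        finally:
            conn.close()

    results = copy_s3_stage_to_table(
        {"account": "example_account", "user": "etl_user", "password": "***",
         "warehouse": "LOAD_WH", "database": "ANALYTICS", "schema": "RAW"},
        stage_name="raw_s3_stage",
        table_name="CAMPAIGN_EVENTS",
    )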
Environment: Python, Scala, Spark, S3, Kinesis Data Firehose, Kinesis Data Streams, SNS, SQS, Athena, Snowflake, SQL, Tableau, Git, REST, Bitbucket, Jira.
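Illustrative of the CloudWatch-based validation alerting mentioned above, a hypothetical boto3 sketch; the namespace, metric, alarm, dimension, and SNS topic ARN are placeholder values.

    # Hypothetical sketch of CloudWatch-based data-validation alerting (boto3).
    # Namespace, metric, alarm, dimension, and SNS topic ARN are placeholder values.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def publish_validation_failures(pipeline_name, failed_count):
        """Emit a custom metric counting records that failed validation in a pipeline run."""
        cloudwatch.put_metric_data(
            Namespace="DataPipelines/Validation",
            MetricData=[{
                "MetricName": "FailedRecords",
                "Dimensions": [{"Name": "Pipeline", "Value": pipeline_name}],
                "Value": float(failed_count),
                "Unit": "Count",
            }],
        )

    # One-time alarm setup: notify an SNS topic when failures exceed the threshold.
    cloudwatch.put_metric_alarm(
        AlarmName="orders-pipeline-validation-failures",
        Namespace="DataPipelines/Validation",
        MetricName="FailedRecords",
        Dimensions=[{"Name": "Pipeline", "Value": "orders"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-quality-alerts"],
    )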
Client: Neptune Software Info, India February 2018 – Oct 2021
Role: Data Engineer
Responsibilities:
Analysed data from different sources using the Hadoop big data solution by implementing Azure Data Factory, Azure Data Lake, Azure Data Lake Analytics, and HDInsight.
Used Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight, big data technologies (Hadoop and Apache Spark), and Databricks.
Generated DDL statements for the creation of new ER/Studio objects such as tables, views, indexes, packages, and stored procedures.
Migrated SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlled and granted database access; and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
Experienced in installing, analyzing, and supporting SQL Server 2008, 2008 R2, 2012, 2014, 2016, and 2017.
Solid experience creating SQL Server jobs/alerts and implementing scheduled tasks.
Developed Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage, consumption patterns, and behaviour (a brief sketch follows the environment line for this role).
Developed Spark and Python solutions for regular expression (regex) projects within the Hadoop/Hive environment.
Developed ETL processes using Azure Data Factory to migrate data collected from external sources such as APIs, SQL Server, and SFTP servers into Azure Databricks.
Worked on data processing, transformations, and actions in Spark using Python (PySpark) and Scala.
Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
Accessed data in Azure Data Lake from Databricks and ingested data from heterogeneous sources into Databricks.
Created Azure HDInsight clusters using PowerShell scripts to automate the process.
Worked with Azure Data Lake Storage Gen2 to store Excel and Parquet files and retrieve user data using the Blob API.
Worked with Azure Databricks, PySpark, Scala, Spark SQL, Azure SQL Data Warehouse, and Hive to load and transform data.
Built data pipelines in Airflow on Azure for ETL-related jobs using different Airflow operators (a minimal DAG sketch follows this list of responsibilities).
Worked with Hive for data retrieval, wrote Hive queries to load and process data in the Hadoop file system, and developed Hive DDLs to create, drop, and alter tables.
Involved in migrating data from existing RDBMSs to Hadoop using Sqoop for processing, and evaluated the performance of various algorithms/models/strategies on real-world data sets.
Implemented data validation using MapReduce programs to remove unnecessary records before moving data into Hive tables.
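A minimal sketch of the kind of Airflow DAG referenced above for orchestrating ETL steps; the DAG id, schedule, and task callables are hypothetical placeholders rather than the original pipeline code.

    # Minimal sketch of an Airflow DAG orchestrating extract/transform/load steps.
    # DAG id, schedule, and callables are hypothetical placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        pass  # e.g., pull files from an SFTP/API source into the data lake

    def transform():
        pass  # e.g., trigger a Databricks/Spark job on the landed files

    def load():
        pass  # e.g., merge curated output into the warehouse

    with DAG(
        dag_id="daily_sales_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        t_extract >> t_transform >> t_load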
Environment: Python, Azure Data Factory, Azure Databricks, Azure Blob Storage, Azure SQL, SQL Server 2016, Hadoop, Erwin 9.8, Hive, ER/Studio, Apache Spark, Pig, Sqoop, YARN, Scala, MapReduce, PySpark, Teradata.
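A minimal PySpark sketch of the Databricks-style extract/transform/aggregate pattern described above; the storage account, container, paths, and column names are hypothetical examples, not project values.

    # Minimal sketch: read Parquet from ADLS Gen2, aggregate, and write a curated table.
    # Storage account, container, paths, and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    source_path = "abfss://raw@examplestorage.dfs.core.windows.net/usage/2021/"
    usage = spark.read.parquet(source_path)

    daily_usage = (
        usage
        .withColumn("event_date", F.to_date("event_timestamp"))
        .groupBy("customer_id", "event_date")
        .agg(F.sum("bytes_used").alias("total_bytes"),
             F.count("*").alias("event_count"))
    )

    daily_usage.write.mode("overwrite").saveAsTable("curated.daily_customer_usage")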