Data Engineer Azure

Location:
West Grove, NJ, 07753
Salary:
$65/hr on C2C
Posted:
March 25, 2024

Resume:

THARUN KUMAR

ad4kaj@r.postjobfree.com

732-***-****

(Azure Data Engineer)

Professional Summary:

To obtain a challenging Data Engineer position that leverages my 9+ years of experience in the software industry, including 5+ years of experience in Azure cloud services and Big Data technologies.

Hands-on experience working with Azure Cloud and its components, including Azure Data Factory, Azure Databricks, Logic Apps, Azure Function Apps, Snowflake, and Azure DevOps services.

Hands-on experience with the Azure stack, moving data from Azure Data Lake to Azure Blob Storage.

Strong background in data load/integration using Azure Data Factory (ADF).

Experience building ETL data pipelines in Azure Databricks leveraging PySpark and Spark SQL.

Experience developing pipelines in Spark using Scala and PySpark.

Implemented Azure Functions, Azure Storage, and Service Bus queues for a large enterprise-level ERP integration system.

Experience creating and managing continuous integration and deployment (CI/CD) pipelines with Azure DevOps.

Experience in data pipeline development and data modeling.

Developed data ingestion workflows to read data from various sources and write it to Avro, Parquet, Sequence, JSON, and ORC file formats for efficient storage and retrieval.
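
For illustration, a minimal PySpark sketch of this kind of multi-format ingestion write; the paths and source file are hypothetical, and the Avro write assumes the external spark-avro package is on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-formats").getOrCreate()

# Hypothetical source: raw CSV landed by an upstream process.
raw = spark.read.option("header", "true").csv("/data/raw/orders.csv")

# Persist the same data in several formats for efficient downstream access.
raw.write.mode("overwrite").parquet("/data/curated/orders_parquet")
raw.write.mode("overwrite").orc("/data/curated/orders_orc")
raw.write.mode("overwrite").json("/data/curated/orders_json")
# Avro requires the external spark-avro package (org.apache.spark:spark-avro).
raw.write.mode("overwrite").format("avro").save("/data/curated/orders_avro")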

Good working experience with Hadoop, Java, HDFS, MapReduce, Hive, Tez, Python, and PySpark.

Hands-on experience developing large-scale data pipelines using Spark and Hive.

Experience in using Apache Sqoop to import and export data to and from HDFS and Hive.

Hands-on experience in setting up workflow using Apache Oozie workflow engine for managing and scheduling Hadoop jobs.

Experience importing and exporting data between HDFS and relational database systems using Sqoop.

Experience in optimizing query performance in Hive using bucketing and partitioning techniques.
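
As a sketch of that technique (table and column names are hypothetical), a partitioned, bucketed Hive table created through Spark SQL might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-layout").enableHiveSupport().getOrCreate()

# Partition pruning cuts the data scanned per query; bucketing on the join key
# enables bucket-based joins. Table and column names are illustrative only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_bucketed (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")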

Optimized data ingestion, data modeling, data encryption, and performance by tuning ETL workflows.

Extensive hands-on experience tuning Spark jobs.

Experienced in real-time streaming with Kafka as the data pipeline using the Spark Streaming module.
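
A minimal Structured Streaming sketch of that pattern, assuming a hypothetical broker, topic, and event schema (the spark-sql-kafka package must be available):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Hypothetical event schema.
schema = StructType().add("device_id", StringType()).add("reading", DoubleType())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "telemetry")                   # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Write each micro-batch to a Parquet sink with checkpointing for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/stream/telemetry")
         .option("checkpointLocation", "/checkpoints/telemetry")
         .start())
query.awaitTermination()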

Experience with Apache Kafka and Azure Event Hubs for messaging and streaming applications using Scala.

Well versed in using ETL methodology for supporting corporate-wide solutions using Informatica.

Worked on data serialization formats (Parquet, ORC, Avro, JSON, and CSV) for converting complex objects into byte sequences.

Optimized Spark jobs and workflows by tuning Spark configurations, partitioning and memory allocation settings.
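
A sketch of typical tuning knobs; the values are illustrative only and depend entirely on cluster size and data volume:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-job")
         .config("spark.sql.shuffle.partitions", "400")   # match shuffle width to data size
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce small partitions
         .getOrCreate())

df = spark.read.parquet("/data/curated/orders_parquet")   # hypothetical input
df = df.repartition(200, "customer_id")                   # repartition on a frequent join key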

Extensive experience in developing, maintaining, and implementing EDWs, data marts, ODS, and data warehouses with star and snowflake schemas.

Hands-on experience pushing code to GitHub for version control.

Comprehensive knowledge of the Software Development Life Cycle; worked with Agile methodology.

Strong experience working with Python libraries such as Pandas, NumPy, and Django.

Technical Skills:

Big Data Technologies: MapReduce, Hive, Tez, Python, PySpark, Scala, Kafka, Spark Streaming, Oozie, Sqoop, ZooKeeper

Hadoop Distributions: Cloudera, Hortonworks

Azure Services: Azure Data Factory, Azure Databricks, Logic Apps, Function Apps, Snowflake, Azure DevOps

Languages: Java, SQL, PL/SQL, Python, HiveQL, Scala

Web Technologies: HTML, CSS, JavaScript, XML, JSP, RESTful, SOAP

Operating Systems: Windows (XP/7/8/10), UNIX, Linux

Build Automation Tools: Ant, Maven

Version Control: Git, GitHub

IDE & Build Tools, Design: Eclipse, Visual Studio, Power BI, Tableau

Databases: MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Excel, MS Access, Oracle 11g/12c, Cosmos DB

Work Experience:

Client: SEI Investments, Oaks, PA Mar 2021 – Present

Role: Azure Data Engineer

Responsibilities:

Performed all phases of software engineering including requirements analysis, application design, and code development & testing.

Developed and maintained end-to-end ETL data pipeline operations and worked with large data sets in Azure Data Factory.

Increased the efficiency of data fetching by optimizing queries and indexing.

Wrote SQL using DDL and DML statements, including indexes, triggers, views, stored procedures, functions, and packages.

Worked on Azure Data Factory to integrate data from both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load the data into Snowflake.

Deployed Data Factory pipelines to orchestrate data into the SQL database.

Developed custom activities using Azure Functions, Azure Databricks, and PowerShell scripts to perform data transformations, data cleaning, and data validation.

Worked on Snowflake modeling using data warehousing techniques: data cleansing, Slowly Changing Dimensions, surrogate key assignment, and change data capture.

Applied an analytical approach to problem-solving, using Azure Data Factory, Data Lake, and Azure Synapse to solve business problems.

Developed ELT/ETL pipelines to move data to and from the Snowflake data store using a combination of Python and Snowflake SnowSQL.
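
A minimal sketch of such a load step using the Snowflake Python connector; the account, credentials, stage, and table names are hypothetical:

import snowflake.connector

# Hypothetical connection parameters; secrets would normally come from a vault.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    # Load staged Parquet files into a staging table.
    cur.execute("""
        COPY INTO STAGING.ORDERS
        FROM @orders_stage
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
finally:
    conn.close()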

Developed ETL transformations and validations using Spark SQL/Spark DataFrames with Azure Databricks and Azure Data Factory.

Worked with Azure Logic Apps administrators to monitor and troubleshoot issues related to process automation and data processing pipelines.

Developed and optimized code for Azure Functions to extract, transform, and load data from various sources, such as databases, APIs, and file systems.

Designed, built, and maintained data integration programs in Hadoop and RDBMS environments.

Processed HDFS data and created external tables using Hive and developed scripts to ingest and repair tables that can be reused across the project.

Developed a CI/CD framework for data pipelines using Jenkins.

Collaborated with DevOps engineers to develop an automated CI/CD and test-driven development pipeline on Azure per client requirements.

Collaborated on ETL tasks, maintaining data integrity and verifying pipeline stability.

Hands-on experience using Kafka and Spark Streaming to process streaming data in specific use cases.

Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.

Worked with JIRA to report on projects and created subtasks for development, QA, and partner validation.

Experience with the full breadth of Agile ceremonies, from daily stand-ups to internationally coordinated PI Planning.

Environment: Azure Databricks, Data Factory, Logic Apps, Function App, Snowflake, MS SQL, Oracle, HDFS, MapReduce, YARN, Spark, Hive, SQL, Python, Scala, PySpark, Git, JIRA, Jenkins, Kafka, ADF Pipeline, Power BI.

Client: Stryker, New Jersey. Oct 2018 – Feb 2021

Role: Azure Data Engineer

Responsibilities:

Hands-on experience working with the Cloudera Hadoop distributed platform and Microsoft Azure.

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Worked on Microsoft Azure services such as HDInsight clusters, Blob Storage, Data Factory, and Logic Apps, and completed a POC on Azure Databricks.

Performed ETL using Azure Databricks; migrated on-premises Oracle ETL processes to Azure Synapse Analytics.

Worked on migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse.

Controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.

Deployed and optimized Python web applications through Azure DevOps CI/CD so the team could focus on development.

Developed enterprise-level solutions using batch processing and streaming frameworks (Spark Streaming, Apache Kafka).

Processed schema-oriented and non-schema-oriented data using Scala and Spark.

Created partitions and buckets based on state for further processing using bucket-based Hive joins.

Created Hive generic UDFs to process business logic that varies by policy.

Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.

Processed HDFS data and created external tables using Hive and developed scripts to ingest and repair tables that can be reused across the project.

Worked with Data Lakes and big data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).

Loaded and transformed large sets of structured, semi-structured, and unstructured data.

Wrote Hive queries for data analysis to meet business requirements.

Wrote Hive queries for data analysis to meet the specified business requirements by creating Hive tables and working on them using Hive QL to simulate MapReduce functionalities.

Created and maintained HiveQL scripts and jobs using tools such as Apache Oozie and Apache Airflow.
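
On the Airflow side, a minimal DAG of the kind referred to above might look like this; the DAG id and script path are hypothetical, and it shells out to the Hive CLI via the Bash operator:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_hive_repair",                 # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_hiveql = BashOperator(
        task_id="run_hiveql",
        bash_command="hive -f /opt/etl/repair_external_tables.hql",   # hypothetical script
    )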

Created automated scripts using Sqoop commands and shell scripts to schedule and run Sqoop jobs on a regular basis.
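
A sketch of the Sqoop invocation such a script would wrap (connection string, table, and paths are hypothetical), shown from Python for consistency with the other examples:

import subprocess

# Scheduled wrapper around a Sqoop import; in production this ran from a
# cron/Oozie-driven shell script rather than Python.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",   # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pwd",        # keeps the password off the command line
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
], check=True)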

Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.

Worked on RDDs and DataFrames (Spark SQL) using PySpark for analyzing and processing data.

Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
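
A minimal DStream sketch of that micro-batching model; the socket source and host are hypothetical stand-ins for the real feed:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="micro-batches")
ssc = StreamingContext(sc, batchDuration=10)   # slice the stream into 10-second batches

lines = ssc.socketTextStream("stream-host", 9999)   # hypothetical source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # each batch is handed to the regular Spark batch engine

ssc.start()
ssc.awaitTermination()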

Implemented CI/CD pipelines to build and deploy projects in the Hadoop environment.

Used JIRA to manage issues and project workflow.

Worked on Spark using Python (PySpark) and Spark SQL for faster testing and processing of data.

Used Git as the version control tool to maintain the code repository.

Environment: Azure Databricks, Data Factory, Logic Apps, Function App, Snowflake, MS SQL, Oracle, HDFS, MapReduce, YARN, Spark, Hive, SQL, Python, Scala, PySpark, Git, JIRA, Jenkins, Kafka, ADF Pipeline, Power BI.

Client: Bosch, Allenhurst, NJ. May 2016 – Sep 2018

Role: Big Data Engineer

Responsibilities:

Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.

Performed aggregations on large amounts of data using Apache Spark and Scala, landing the data in the Hive warehouse for further analysis.

Worked with Data Lakes and big data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).

Loaded and transformed large sets of structured, semi-structured, and unstructured data.

Wrote Hive queries for data analysis to meet business requirements.

Built HBase tables by leveraging HBase integration with Hive in the analytics zone.

Hands-on experience using Kafka and Spark Streaming to process streaming data in specific use cases.

Developed a data pipeline using Flume and Sqoop to ingest customer behavioral data histories into HDFS for analysis.

Worked on analyzing the Hadoop cluster using big data analytics tools, including Hive and MapReduce.

Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.

Wrote Hive queries for data analysis to meet the specified business requirements by creating Hive tables and working on them using Hive QL to simulate MapReduce functionalities.

Migrated existing data from the RDBMS (Oracle) to Hadoop using Sqoop for processing.

Implemented CI/CD pipelines to build and deploy projects in the Hadoop environment.

Used JIRA to manage issues and project workflow.

Worked on Spark using Python (PySpark) and Spark SQL for faster testing and processing of data.

Developed automated testing scripts using Informatica Data Validation Option to ensure data accuracy and consistency across different systems.

Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.

Used ZooKeeper to coordinate, synchronize, and serialize the servers within the clusters.

Worked on Oozie workflow engine for job scheduling.

Used Git as the version control tool to maintain the code repository.

Worked on Spark SQL using PySpark for analyzing and processing data.

Worked closely with the team in fixing the JVM related issues.

Environment: Sqoop, MySQL, HDFS, Apache Spark, Scala, Hive, Hadoop, Cloudera, HBase, Kafka, MapReduce, ZooKeeper, Oozie, Data Pipelines, RDBMS, AWS, EC2, Python, PySpark, Ambari, JIRA.

Client: Lancaster Technologies, India Aug 2014 – April 2016

Role: Hadoop Developer

Responsibilities:

Worked with Git to maintain source code in Git and GitHub repositories.

Prepared an ETL framework with Sqoop and Hive to frequently bring in data from the source and make it available for consumption.

Developed ETL jobs using Spark/Scala to migrate data from Oracle to new MySQL tables.

Rigorously used Spark/Scala (RDDs, DataFrames, Spark SQL) and Spark-Cassandra Connector APIs for various tasks (data migration, business report generation, etc.).
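
A sketch of a Spark-Cassandra Connector read for report generation, shown in PySpark for consistency with the other examples; the keyspace, table, host, and paths are hypothetical, and the connector package must be on the classpath:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-report")
         .config("spark.cassandra.connection.host", "cassandra-host")   # hypothetical host
         .getOrCreate())

orders = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="sales", table="orders")      # hypothetical keyspace/table
          .load())

orders.createOrReplaceTempView("orders")
report = spark.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
report.write.mode("overwrite").parquet("/data/reports/orders_by_region")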

Developed Spark Streaming application for real time sales analytics.

Analyzed the source data and handled it efficiently by modifying data types. Used Excel sheets, flat files, and CSV files to generate Power BI ad-hoc reports.

Analyzed the SQL scripts and designed the solution to implement using PySpark.

Extracted the data from other data sources into HDFS using Sqoop.

Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.

Developed custom scripts and tools using Oracle's PL/SQL language to automate data validation, cleansing, and transformation processes.

Extracted the data from MySQL into HDFS using Sqoop.

Implemented deployment automation using YAML scripts for large-scale builds and releases.

Worked with Apache Hive, Apache Pig, HBase, Apache Spark, ZooKeeper, Flume, Kafka, and Sqoop.

Implemented Data classification algorithms using MapReduce design patterns.

Extensively worked on creating combiners, partitioning, and distributed cache to improve the performance of MapReduce jobs.

Environment: Hadoop, Hive, Spark, PySpark, Sqoop, Spark SQL, Cassandra, YAML, ETL.


