
Data Engineer Azure

Location:
Lafayette, LA
Posted:
February 05, 2024


Resume:

PROFESSIONAL PROFILE:

●*+ years of IT industry experience as a Data Engineer with a solid understanding of Data Modeling, Data Validation, and Evaluating Data Sources, and a strong understanding of Data Warehouse / Data Mart Design, ETL, BI, OLAP, and Client/Server applications on AWS and Azure.

●Expert in designing and implementing end-to-end ETL pipelines, utilizing tools like Azure Data Factory, Informatica, SSIS, and Talend to efficiently extract, transform, and load data across diverse sources.

●Proficient in managing Azure Data Factory and ensuring adherence to data policies, with hands-on experience in utilizing Azure Blob Storage for storage and backup.

●Accomplished data engineer with a strong proficiency in developing ETL processes within Azure Databricks, specializing in seamless connectivity to relational databases through Kafka.

●Proactively uphold data quality and integrity within Azure SQL Database through meticulous oversight, while streamlining the deployment and operationalization of ETL processes to enhance overall data processing efficiency.

●Experienced in streaming pipelines, utilizing Azure Event Hubs and Stream Analytics for real-time analysis of data-driven workflows.

●Skilled in working with Azure Blob Storage, developing frameworks for handling vast volumes of data and system files.

●Accomplished data engineer with a proven track record in designing and automating pipelines using Databricks, streamlining ETL processes, and ensuring ongoing maintenance of workloads.

●Proficient in SSIS, adept at creating robust ETL packages to extract data from diverse sources such as Access databases, Excel spreadsheets, and flat files, while maintaining data integrity in SQL Server.

●Expert in data extraction and ingestion into Hadoop Data Lake, employing Sqoop, Hive, and Spark to manage data from various sources. Skilled in using PySpark to load high-volume files into PySpark DataFrames and processing them for reloading into Azure SQL DB tables.

●Specialized in optimizing Hive queries with best practices and parameters, utilizing Hadoop, YARN, Python, and PySpark for efficient data processing.

●Demonstrated expertise in automating data ingestion into Lakehouse, transforming data using Apache Spark, and storing in Delta Lake, ensuring ACID transactions for data consistency.

●In-depth experience identifying and implementing new SQL Server features, enhancing query performance through T-SQL language enhancements, in-memory optimization, and Extended Events.

●Recognized as a subject matter expert in performance tuning stored procedures, functions, T-SQL scripts, indexes, and SSIS packages.

●Proficient in Spark SQL and DataFrames API for building and optimizing Spark applications, along with extensive experience in writing complex SQL queries, including joins, sub-queries, and correlated sub-queries for cross-verification of data.

●Skilled in designing and implementing data warehouse solutions in Snowflake and Amazon Redshift for efficient storage and retrieval of large-scale datasets.

●Proven ability to optimize schema designs in Snowflake and Redshift for effective data organization, query performance, and resource utilization.

●Successful in developing and maintaining end-to-end ETL pipelines to ingest, transform, and load data into Power BI, ensuring data accuracy and integrity.

●Accomplished in creating visually compelling reports and dashboards in Power BI, effectively conveying insights and trends from diverse datasets.

●Proficient in customizing Jira workflows to align with data engineering processes, ensuring seamless collaboration and issue tracking.

●Good knowledge of implementing and managing version control using Git, tracking changes in the data engineering codebase and configurations.

TECHNICAL SKILLS:

Programming

Python, PowerShell, Scala, SQL, T-SQL

ETL Tools

Azure Data Factory, Informatica, SSIS, Talend

Big Data Technologies

Hadoop, Spark, Kafka, Hive, Sqoop

Database Technologies

SQL Server, MySQL, Oracle, MongoDB

Cloud Platforms

AWS (S3, Glacier, Lambda, Redshift), Microsoft Azure

Data Warehousing

Snowflake, Amazon Redshift

Data Processing

Spark SQL, PySpark, Spark Streaming, Kafka Streams API

Version Control

Git

Visualization Tools

Tableau, Power BI

Workflow Management

Apache Airflow, Oozie

Collaboration Tools

Jira

Code Repositories

GitHub

Web Technologies

HTML5, CSS3, Bootstrap, Ajax, JSON, XML, JavaScript, jQuery, Angular.js, React.js, Node.js, TypeScript

PROFESSIONAL EXPERIENCE:

Client: Jeppesen, Englewood, CO Sept 2020 – Present

Role: Sr. Data Engineer

Responsibilities:

Worked on creating Azure Data Factory and managing its policies, and utilized Blob Storage for storage and backup on Azure.

Developed the process to ingest data into the Azure cloud from web services and load it into Azure SQL DB.

Worked with ETL operations in Azure Databricks by connecting to different relational databases using Kafka and used Informatica for creating, executing, and monitoring sessions and workflows.

Ensured data quality and integrity of the data using Azure SQL Database and automated ETL deployment and operationalization.

Built streaming pipelines using Azure Event Hubs and Stream Analytics to analyze data from data-driven workflows.

Worked with Azure Blob Storage and developed a framework for handling huge volumes of data and system files.

Designed, developed, and automated pipelines using Databricks for the ETL processes and ongoing maintenance of the workloads.

Worked on creating ETL packages using SSIS to extract data from various data sources such as Access databases, Excel spreadsheets, and flat files, and maintained the data in SQL Server.

Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Sqoop, Hive and Spark.

Developed Spark applications in Python in a distributed environment to load high-volume files with different schemas into PySpark DataFrames and process them for reloading into Azure SQL DB tables.
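A minimal PySpark sketch of this load-and-reload pattern (the file path, column names, and the JDBC endpoint below are placeholders, not details from the engagement):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-to-azure-sql").getOrCreate()

# Read high-volume delimited files; the path and schema handling are illustrative.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/raw/incoming/*.csv"))

# Light cleanup before reloading; "id" is a placeholder key column.
clean = df.dropDuplicates().na.drop(subset=["id"])

# Placeholder Azure SQL DB endpoint and credentials.
(clean.write.format("jdbc")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
      .option("dbtable", "dbo.staging_table")
      .option("user", "<user>")
      .option("password", "<password>")
      .mode("append")
      .save())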

Optimized Hive queries using best practices and the right parameters, working with Hadoop, YARN, Python, and PySpark.

Automated data ingestion into the Lakehouse, transformed the data using Apache Spark, and stored it in Delta Lake.

Used Databricks, Scala, and Spark to create data workflows and capture data from Delta tables in Delta Lake.

Worked with Delta Lake for consistent unification of streaming data, processed the data, and worked with ACID transactions using Apache Spark.
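A minimal Delta Lake sketch of this kind of ACID upsert, assuming a Delta-enabled Spark environment such as Databricks (the paths and join key are placeholders):

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

# New batch of records landed by the ingestion step; the path is illustrative.
updates = spark.read.parquet("/mnt/lakehouse/staging/orders")

# MERGE into the Delta table; Delta Lake wraps the operation in an ACID transaction.
target = DeltaTable.forPath(spark, "/mnt/lakehouse/delta/orders")
(target.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())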

Responsible for identifying and implementing new SQL Server features designed to improve query performance; this included utilizing new T-SQL language enhancements, performing in-memory optimization, and performance tuning with Extended Events.

Served as the resident SME on performance tuning stored procedures, functions, T-SQL scripts, indexes and SSIS packages.

Extensively used Spark SQL and the DataFrames API in building Spark applications.

Wrote complex SQL queries using joins, sub-queries, and correlated sub-queries, with expertise in using such queries for cross-verification of data.
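A minimal Spark SQL sketch of this kind of cross-verification with a correlated sub-query (tables and columns are illustrative only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-verification").getOrCreate()

# Tiny illustrative datasets; in practice these come from the source and target loads.
orders = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["order_id", "amount"])
payments = spark.createDataFrame([(1, 100.0), (2, 200.0)], ["order_id", "amount"])
orders.createOrReplaceTempView("orders")
payments.createOrReplaceTempView("payments")

# Flag orders whose summed payments do not match the order amount.
spark.sql("""
    SELECT o.order_id, o.amount
    FROM orders o
    WHERE o.amount <> (SELECT SUM(p.amount)
                       FROM payments p
                       WHERE p.order_id = o.order_id)
""").show()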

Designed and implemented data warehouse solutions using Snowflake and Amazon Redshift to support efficient storage and retrieval of large-scale datasets.

Optimized schema designs in Snowflake and Redshift for effective data organization, query performance, and resource utilization.

Developed and maintained end-to-end ETL pipelines to ingest, transform, and load data into Power BI, ensuring data accuracy and integrity.

Created visually compelling reports and dashboards in Power BI to convey insights and trends from diverse datasets, facilitating effective data exploration.

Customized Jira workflows to align with data engineering processes, ensuring seamless collaboration and issue tracking.

Implemented and managed version control using Git to track changes in data engineering codebase and configurations.

Developed front end using JavaScript, HTML5, CSS3, Bootstrap, Ajax, JSON, XML, jQuery, Angular.js, React.js, Node.js, TypeScript.

Environment: Azure, Blob Storage, Kafka, Informatica, SSIS, Excel, Hadoop, Sqoop, Hive, Spark (PySpark), Python, YARN, Scala, T-SQL, Snowflake, Amazon Redshift, Power BI, Jira, Git, JavaScript, HTML5, CSS3, Bootstrap, Ajax, JSON, XML, jQuery, Angular.js, React.js, Node.js, TypeScript.

Client: Gilead Sciences, Foster City, CA July 2017 – Aug 2020

Role: Sr. Data Engineer

Responsibilities:

Worked on designing AWS EC2 instance architecture to meet high availability application architecture and security parameters.

Created AWS S3 buckets, managed bucket policies, and utilized S3 and Glacier for storage and backup. Worked on a Hadoop cluster and data querying tools to store and retrieve data from the stored databases.

Designed and deployed automated ETL workflows using AWS Lambda, organized and cleansed the data in S3 buckets using AWS Glue, and processed the data using Amazon Redshift.
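A minimal sketch of one such automated workflow, here as a Lambda handler that starts a Glue job when a new object lands in S3 (the bucket, Glue job name, and argument names are hypothetical):

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Standard S3 put-event fields: bucket name and object key.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # "raw-cleanse-to-redshift" is a placeholder Glue job name.
    response = glue.start_job_run(
        JobName="raw-cleanse-to-redshift",
        Arguments={"--source_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": response["JobRunId"]}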

Worked on AWS Services like AWS SNS to send out automated emails and messages using BOTO3 after the nightly run.
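A minimal BOTO3 sketch of such a post-run notification (the topic ARN, region, and message content are placeholders):

import boto3

sns = boto3.client("sns", region_name="us-east-1")

def notify_nightly_run(status, details):
    # Placeholder SNS topic ARN; subscribers receive the email/SMS notification.
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:nightly-etl-status",
        Subject=f"Nightly ETL run: {status}",
        Message=details,
    )

notify_nightly_run("SUCCEEDED", "All nightly loads completed without errors.")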

Worked on the development of tools that automate AWS server provisioning, automated application deployments, and implementation of basic failover among regions through AWS SDKs.

Worked within the ETL architecture enhancements to increase the performance using query optimizer.

Created ETL Mapping with Talend Integration Suite to pull data from Source, apply transformations, and load data into the target database.

Worked on processing and testing data using Spark SQL, and on real-time processing with Spark Streaming and Kafka using Python.
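A minimal Structured Streaming sketch of this real-time path in Python, assuming the Spark Kafka connector is available on the cluster (the broker, topic, and event schema are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Placeholder event schema.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("value", DoubleType()),
])

# Subscribe to a Kafka topic; the broker and topic names are illustrative.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Parse the JSON payload and stream the parsed records to a console sink for testing.
parsed = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()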

Scripted in Python and PowerShell for setting up baselines, branching, merging, and automation across the process using Git.

Processed data extracted using Spark and Hive, and handled large data sets using HDFS.

Migrated Map Reduce programs into Spark transformations using Spark and Scala.

Experienced with Spark Context, Spark-SQL, and Spark YARN.

Worked on streaming data transfer from different data sources into HDFS and NoSQL databases.

Performed performance tuning to optimize SQL queries using Query Analyzer.

Loaded the data from different relational databases like MySQL and Teradata using Sqoop jobs.

Utilized analytical, statistical, and programming skills to collect, analyze, and interpret large data sets and develop data-driven technical solutions to difficult business problems using tools such as SQL and Python.

Worked on designing and developing the SSIS Packages to import and export data from MS Excel, SQL Server, and Flat files.

Developed ETL pipelines to load and transform data into Snowflake and Redshift from various source systems, ensuring data integrity and accuracy.

Implemented and managed data security measures in Snowflake and Redshift, including access controls, encryption, and data masking.

Worked with Tableau for generating reports and created Tableau dashboards, pie charts, and heat maps according to the business requirements.

Designed and implemented interactive dashboards, reports, and visualizations in Tableau to communicate complex data patterns and trends effectively.

Utilized Jira for effective project management, including task tracking, sprint planning, and issue resolution in data engineering projects.

Fostered collaborative coding practices by facilitating code reviews, pull requests, and effective use of Git branching strategies.

Environment: AWS, ETL, Talend Integration Suite, Spark SQL, Kafka, Python, PowerShell, Git, Spark (Spark Context, Spark-SQL, Spark YARN), Hive, HDFS (Hadoop Distributed File System), MapReduce, Spark Transformations, Scala, NoSQL databases, Sqoop, MySQL, Teradata, Tableau, Snowflake, Redshift, Jira.

Client: Mayo Clinic, Haryana, India May 2015 – June 2017

Role: Data Engineer

Responsibilities:

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
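A minimal sketch of one such ingestion step from a Databricks notebook, where the spark session is predefined and storage access is already configured (the storage account, containers, and paths are placeholders):

# Read raw files landed in Azure Data Lake Storage Gen2.
df = (spark.read
      .option("header", "true")
      .csv("abfss://raw@<storageaccount>.dfs.core.windows.net/sales/"))

# Stage a curated copy for downstream loads into Azure SQL / Azure DW.
(df.write
   .mode("overwrite")
   .parquet("abfss://curated@<storageaccount>.dfs.core.windows.net/sales/"))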

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.

Worked on Microsoft Azure services like HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps, and did a POC on Azure Databricks.

Worked on Building and implementing a real-time streaming ETL pipeline using Kafka Streams API.

Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers after aggregations for other ETL operations.

Worked on Spark using Python as well as Scala and Spark SQL for faster testing and processing of data.

Designed and developed applications in PySpark using Python to compare the performance of Spark with Hive.

Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
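A minimal PySpark sketch of this multi-format extract-and-aggregate pattern (paths and column names are illustrative only):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Pull events from two different file formats; the paths are placeholders.
json_events = spark.read.json("/data/events/json/")
csv_events = spark.read.option("header", "true").csv("/data/events/csv/")

# Align on a common set of columns and combine the sources.
events = (json_events.select("customer_id", "event_type")
          .unionByName(csv_events.select("customer_id", "event_type")))

# Aggregate to surface usage patterns per customer.
usage = (events.groupBy("customer_id", "event_type")
         .agg(F.count("*").alias("event_count"))
         .orderBy(F.desc("event_count")))
usage.show()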

Created custom T-SQL procedures to read data from flat files to dump to SQL Server database using SQL Server import and export data wizard.

Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.

Conducted performance tuning and optimization of SQL queries and data processing in Snowflake and Redshift for enhanced query performance.

Collaborated with data architects and analysts to understand data requirements and implement effective solutions using Snowflake and Redshift.

Implemented several DAX functions for various fact calculations for efficient data visualization in Power BI.

Utilized Power BI Gateway to keep dashboards and reports up to date with on-premises data.

Maintained Jira reports and dashboards to provide insights into project progress, issues, and performance metrics for data engineering teams.

Maintained Git repositories, ensuring organization, structure, and documentation of data engineering codebase.

Responsible for writing Hive queries to analyze the data in the Hive warehouse using Hive Query Language (HQL). Involved in developing Hive DDLs to create, drop, and alter tables.
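A minimal sketch of such HQL, run here through a Hive-enabled Spark session (the database, table, and columns are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-hql")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS warehouse")

# DDL: create a partitioned, Parquet-backed table in the Hive warehouse.
spark.sql("""
    CREATE TABLE IF NOT EXISTS warehouse.patient_visits (
        visit_id STRING,
        visit_date DATE,
        department STRING
    )
    PARTITIONED BY (visit_year INT)
    STORED AS PARQUET
""")

# HQL analysis query over the partitioned table.
spark.sql("""
    SELECT department, COUNT(*) AS visit_count
    FROM warehouse.patient_visits
    WHERE visit_year = 2016
    GROUP BY department
""").show()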

Extracted data and loaded it into HDFS using Sqoop import from various sources like Oracle, Teradata, SQL Server, etc.

Environment: Azure Data Factory, T-SQL, Spark SQL, HDInsight Clusters, Kafka Streams API, Sqoop, HDFS (Hadoop Distributed File System), Spark (PySpark, Scala), Python, SQL Server, Snowflake, Redshift, Power BI, Jira, Git, Hive.

Client: UBS, Mumbai, India Mar 2014 – Apr 2015

Role: SQL Developer

Responsibilities:

Worked on AWS Services like AWS SNS to send out automated emails and messages using BOTO3 after the nightly run.

Worked on the development of tools that automate AWS server provisioning, automated application deployments, and implementation of basic failover among regions through AWS SDKs.

Performed ETL processes to get data ready for business analysis visuals that help the leadership team make the right business decisions.

Performed data extraction, transformation, and loading (ETL) between systems using SQL tools such as SSIS.

Worked on scripting with Python in Spark to transform data from various file types like text files, CSV, and JSON.

Created automated Python scripts to convert data from different sources and to generate ETL pipelines.

Explored Spark and improved performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark-SQL, DataFrames, and Pair RDDs.

Used Sqoop to import data from relational databases like MySQL and Oracle.

Migrated SQL Server 2008 to SQL Server 2012 on Microsoft Windows Server 2012 Enterprise Edition.

Extensively worked on SQL Server Integration Services (SSIS) and Tableau reporting.

Monitored and troubleshot issues related to data loading, processing, and querying in Snowflake and Redshift environments.

Provided expertise and guidance to team members on best practices, performance optimization, and troubleshooting techniques for Snowflake and Redshift.

Optimized Tableau workbooks and dashboards for performance, considering factors such as data volume and complexity.

Provided training sessions for team members on Tableau functionalities and best practices for efficient data visualization and reporting.

Integrated Jira with other tools and platforms used in the data engineering ecosystem to streamline workflows and enhance collaboration.

Integrated Git with continuous integration/continuous deployment (CI/CD) pipelines to automate and streamline data engineering workflows.

Environment: AWS SNS, ETL, SQL Tools (e.g., SSIS), Python Scripting, Spark (PySpark), Hadoop, Spark Context, Sqoop, MySQL, Oracle, SQL Server Integration Services (SSIS), Tableau Reporting, Snowflake, Amazon Redshift, Jira Integration, Git, CI/CD.

EDUCATION: JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, Hyderabad, TS, India

BTech in COMPUTER SCIENCE AND ENGINEERING June 2010 - March 2014

Major in Computer Science


