
Data Engineer (Azure SQL)

Location: Marion, IL
Posted: September 14, 2023


TARUN MANCHUKONDA

Senior AWS Data Engineer

602-***-**** adzpbd@r.postjobfree.com

PROFESSIONAL SUMMARY

7+ years of IT experience as a Data Engineer with a focus on Data Analysis and Data Engineering using Hadoop and Spark Frameworks.

Familiarity with Agile (Scrum) software development process and SDLC.

Expertise in Data Architecture, Data Analytics, and advanced Data processing.

Proficient in Data Manipulation and Cleansing using Python Scripts.

Skilled in Python packages such as Pandas, NumPy, scikit-learn, SciPy, and Matplotlib.

Experienced with Big Data components such as Hadoop (HDFS, MapReduce, Yarn), Pig, Hive, Spark (Streaming, SQL, RDD), Sqoop, Oozie, Kafka, and ML Algorithms.

Hands-on experience with Cloudera CDH and Hortonworks HDP for Hadoop cluster management.

Knowledgeable in Spark development using Scala for performance comparison.

Worked with Azure cloud services, including Azure Databricks, Data Factory, Machine Learning, Azure SQL, and Data Lake.

Expertise in managing Azure Data Lakes and integration with other Azure services.

Proficient in Amazon Web Services (AWS) Cloud Platform, handling EC2, S3, Redshift, DynamoDB, and more.

Migrated data using Sqoop between HDFS and Relational Database Systems.

Skilled in handling large datasets with Spark's in-memory capabilities, partitioning, broadcast joins, and transformations during ingestion.
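As an illustration of that pattern, the following is a minimal PySpark sketch of a broadcast join during ingestion; the paths, table names, and join key (events, countries, country_code) are illustrative assumptions rather than details from any specific project.

```python
# Minimal PySpark sketch of a broadcast join during ingestion.
# Paths, table names, and the join key are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

events = spark.read.parquet("/data/raw/events")        # large fact data
countries = spark.read.parquet("/data/dim/countries")  # small dimension table

# Repartition the large side on the join key, then broadcast the small side
# so every executor gets a local copy and the dimension is never shuffled.
enriched = (
    events.repartition(200, "country_code")
          .join(broadcast(countries), on="country_code", how="left")
)

enriched.write.mode("overwrite").parquet("/data/curated/events_enriched")
```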

Proficient in Hadoop ecosystem tools like MapReduce, HDFS, Pig, Hive, Kafka, Yarn, Sqoop, Storm, Oozie, HBase, and Zookeeper.

Solid understanding of Hadoop architecture and ecosystem components.

Familiar with Spark architecture, including Core, SQL, DataFrames, Streaming, MLlib, and RDDs.

Experience with Apache Airflow, including conditional tasks and trigger rules at join points in DAGs.
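A minimal Airflow sketch of that idea, assuming Airflow 2.x; the DAG id, task names, and callables are hypothetical.

```python
# Hedged sketch: two upstream extracts joined by a task whose trigger rule
# lets it run even if one branch was skipped by a conditional path.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def extract_a():
    print("extract source A")


def extract_b():
    print("extract source B")


def merge():
    print("merge both extracts")


with DAG(
    dag_id="trigger_rule_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    a = PythonOperator(task_id="extract_a", python_callable=extract_a)
    b = PythonOperator(task_id="extract_b", python_callable=extract_b)

    # Join task: run as long as no upstream task failed.
    join = PythonOperator(
        task_id="merge",
        python_callable=merge,
        trigger_rule=TriggerRule.NONE_FAILED,
    )

    [a, b] >> join
```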

Proficient in writing Python MapReduce Jobs for processing structured, semi-structured, and unstructured data.

Knowledgeable in NoSQL databases like HBase, Cassandra, and MongoDB for writing applications.

Skilled in working with file formats such as Text, SequenceFile, XML, Parquet, and Avro in complex MapReduce programs.

Experience in data pipeline development using Flume, Sqoop, and Pig for weblogs extraction and storage in HDFS.

Extensive experience in importing and exporting data using Flume and Kafka stream processing platforms.

Familiar with fact/dimension modeling (Star Schema, Snowflake Schema) and ETL processes from various sources.

Strong understanding of architecting, designing, and operationalization of large-scale data solutions on Snowflake Cloud Data Warehouse.

Proficient in Python, UNIX, and shell scripting.

Expertise in Data Warehousing ETL concepts using Informatica PowerCenter, OLAP, OLTP, and AutoSys.

Experience with MongoDB development, schema design, map-reduce functions, and migration of SQL relational databases to NoSQL databases.

TECHNICAL SKILLS:

Languages

Scala, Python

Cloud Technologies

MS Azure, Amazon Web Services (AWS)

MS Azure Services

Azure Databricks, Azure Data Factory (ADF), Azure Machine Learning, Azure SQL, Azure Data Lake

AWS Services

EC2, S3, EBS, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES

Big Data Technologies

HDFS, Map Reduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, Nifi, Airflow, Flume, Snowflake

Hadoop Frameworks

Cloudera CDH, Hortonworks HDP

Databases

Oracle, MySQL, MS SQL Server, DB2

NoSQL Databases

HBase, Cassandra, MongoDB

Modelling Schemas

Star Schema, Snowflake Schema

BI Tools

Tableau, Power BI

Operating System

Windows, Linux

Methodologies

Agile, Waterfall

PROFESSIONAL EXPERIENCE:

Client: Fifth Third Bank, Marion, Illinois February 2022 to Present

Senior AWS Data Engineer

Responsibilities:

Created logical and physical data models for new and existing Data solutions to facilitate understanding of data structure.

Participated in client meetings to gather migration requirements and other project specifications.

Utilized Spark's in-memory computing capabilities for advanced text analytics and processing.

Developed Python code to extract data from HBase and designed solutions using PySpark.

Implemented an automated data ingestion process within the AWS cloud environment.

Successfully loaded data from web services into AWS RDS (Relational Database Service) or Amazon Redshift on a daily basis, ensuring data accuracy and timeliness.

Designed and constructed data pipelines using AWS Glue and AWS Step Functions.

Effectively extracted, transformed, and loaded data from various sources, including AWS RDS, Amazon S3, and AWS Redshift, enhancing data integration capabilities.

Harnessed AWS Glue in combination with Amazon S3 to create complex data pipelines. Utilized Spark with Cloudera Hadoop YARN to perform analytics on data stored in Hive.

Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.

Utilized Spark Streaming APIs to perform real-time transformations and actions, integrating data from Kafka and persisting it into Cassandra.
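The following is a hedged sketch of that Kafka-to-Cassandra flow using PySpark Structured Streaming; it assumes the spark-sql-kafka and DataStax Spark Cassandra connector packages are on the classpath, and the broker address, topic, schema, keyspace, and table names are hypothetical.

```python
# Consume Kafka events and persist each micro-batch to Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "customer-events")
         .load()
         .select(from_json(col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Write each micro-batch via the Cassandra DataFrame source.
    (batch_df.write.format("org.apache.spark.sql.cassandra")
             .options(keyspace="analytics", table="customer_events")
             .mode("append")
             .save())

query = (events.writeStream.foreachBatch(write_to_cassandra)
               .option("checkpointLocation", "/tmp/chk/customer-events")
               .start())
query.awaitTermination()
```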

Proficient in designing compelling email content and implementing personalization strategies using dynamic content blocks, AMPscript, and other personalization methods offered by Salesforce Marketing Cloud.

Developed Scala-based Spark applications for data cleansing, aggregation, de-normalization, and preparation for machine learning and reporting.

Implemented Azure Key Vault and Azure Active Directory for data security throughout the pipeline architecture.

Wrote Python scripts to interact with HBase and performed SQL script analysis for implementation using PySpark.

Conducted Proof of Concepts (POCs) using Apache Spark and Scala for project integration.

Migrated existing MapReduce jobs to Spark transformations and actions using Spark RDDs, DataFrames, and Spark SQL APIs.
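A minimal sketch of such a migration, using a word-count style job as a stand-in; the input and output paths are hypothetical.

```python
# The original mapper/reducer pair becomes an RDD pipeline, or equivalently
# a DataFrame/Spark SQL aggregation.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("mr-to-spark").getOrCreate()

# RDD version: map and reduceByKey mirror the original mapper and reducer.
counts_rdd = (
    spark.sparkContext.textFile("/data/raw/logs")
         .flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# DataFrame / Spark SQL version of the same aggregation.
lines = spark.read.text("/data/raw/logs")
counts_df = (
    lines.select(explode(split(col("value"), r"\s+")).alias("word"))
         .groupBy("word")
         .count()
)
counts_df.write.mode("overwrite").parquet("/data/curated/word_counts")
```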

Designed ETL jobs using Informatica for data processing.

Developed data pipelines with Kafka, Spark, and Hive to ingest, transform, and analyze customer behavioral data.

Utilized Power BI for connecting with Hive and generating daily data reports.

Environment: Python (NumPy, SciPy, Matplotlib, Pandas), Spark, Scala, MS Azure, Azure SQL, Blob Storage, ADF, Azure Databricks, SparkContext, Spark SQL, Hive, Apache Airflow, Cassandra, MongoDB, MapReduce, Kafka, Spark Streaming, PySpark, Informatica

Client: Molina Healthcare, Irving, Texas January 2020 to February 2022

Data Engineer

Responsibilities:

Developed Python scripts for database content updates and file manipulation.

Designed and developed modules in Hadoop Big Data platform, utilizing MapReduce, Hive, Sqoop, Kafka, and Oozie.

Created real-time data processing applications with Scala and Python.

Analyzed Hadoop cluster using Pig, Hive, and MapReduce for big data analytics.

Implemented MapReduce programs to parse raw data, populate staging tables, and store refined data in the EDW.

Developed ETL processes in AWS Glue to migrate Campaign data from external sources into AWS Redshift.

Responsible for Data Extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
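Sketched below is a minimal AWS Glue job of the kind described above, reading a cataloged source and loading Redshift; the database, table, connection, field, and bucket names are hypothetical.

```python
# Hedged Glue ETL sketch: catalog source -> light transform -> Redshift.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw campaign data registered in the Glue Data Catalog.
campaigns = glue_context.create_dynamic_frame.from_catalog(
    database="marketing_raw", table_name="campaigns"
)

# Drop a scratch field and resolve an ambiguous column type (names are hypothetical).
cleaned = campaigns.drop_fields(["ingest_ts"]).resolveChoice(specs=[("spend", "cast:double")])

# Load into Redshift through a cataloged JDBC connection, staging via S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.campaigns", "database": "dw"},
    redshift_tmp_dir="s3://example-glue-temp/campaigns/",
)

job.commit()
```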

Built modern data solutions using AWS IA services for data visualization.

Utilized various AWS services like EC2, IAM, Elastic MapReduce (EMR), and EBS.

Handled data importing from AWS S3 to HDFS and performed transformations using Spark.

Implemented AWS Lambdas for real-time monitoring dashboards from system logs.
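One way such a Lambda can feed a monitoring dashboard is by turning a CloudWatch Logs subscription stream into a custom metric; the sketch below assumes that setup, and the metric namespace and filter logic are hypothetical.

```python
# Hedged sketch: decode a CloudWatch Logs subscription payload and publish
# an error-count metric that a dashboard can chart.
import base64
import gzip
import json

import boto3

cloudwatch = boto3.client("cloudwatch")


def handler(event, context):
    # CloudWatch Logs delivers base64-encoded, gzip-compressed JSON.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    error_count = sum(1 for e in payload["logEvents"] if "ERROR" in e["message"])

    if error_count:
        cloudwatch.put_metric_data(
            Namespace="App/SystemLogs",
            MetricData=[{
                "MetricName": "ErrorCount",
                "Value": error_count,
                "Unit": "Count",
            }],
        )
    return {"errors": error_count}
```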

Worked on Azure web applications, App services, Azure storage, Azure SQL Database, Virtual machines, and more.

Experienced in writing reusable Terraform modules and templates, streamlining deployment processes, and ensuring consistency and scalability in cloud infrastructure.

Created HBase tables and column families to store user event data.

Developed Hive scripts in HiveQL for data de-normalization and aggregation.

Installed and configured Zookeeper for cluster resource coordination and monitoring.

Utilized Apache Flume to collect and aggregate weblogs and unstructured data from various sources and stored it in HDFS for analysis.

Coded MapReduce programs in Java for data cleaning and processing.

Developed Python scripts for vulnerability assessment with SQL Queries, including SQL injection.

Configured Oozie workflow engine for automating Map/Reduce jobs.

Involved in DWH/BI project implementation using Azure Data Factory (ADF) and Data Bricks.

Designed, developed, and maintained data integration programs in Hadoop and RDBMS environments, incorporating traditional and non-traditional source systems as well as RDBMS and NoSQL data stores.

Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.

Worked with Hadoop data operation components such as Zookeeper and Oozie.

Collected and aggregated log data using Flume and staged it in HDFS for further analysis.

Wrote complex SQL Queries for data quality checks and verification.

Environment: Python, AWS, EC2, S3, IAM, Elastic MapReduce (EMR), EBS, AWS Glue, AWS Lambda, MS Azure, App Services, Azure Storage, Azure SQL Database, Virtual Machines, Fabric Controller, Azure AD, Azure Search, ADF, Databricks, Big Data, Hadoop, MapReduce, Hive, Sqoop, Kafka, Oozie, Scala, Pig, HBase, HiveQL, Zookeeper, Apache Flume, HDFS, SQL

Client: BestBuy, New York, NY November 2017 to December 2019

Data Engineer

Responsibilities:

●Successfully configured Azure Data Factory to automate the loading of data from Azure Blob Storage into Azure SQL Data Warehouse, streamlining data ingestion.

●Extracted, transformed, and loaded data from diverse heterogeneous sources into Azure SQL Data Warehouse, encompassing data migration from on-premises environments to Azure storage containers, ensuring data consistency and accessibility.

●Proficiently worked within the Apache Spark ecosystem using Scala and Python.

●Utilized various Spark components, including Spark, Spark Streaming, Spark RDD, and Spark SQL, to process and analyze data efficiently.

●Selected and exported data to CSV files, stored them in Azure Blob Storage via Azure Virtual Machines (VMs), and structured the data within Azure SQL Data Warehouse.

●Conducted Data Cleansing and Data Wrangling with Python Pandas and NumPy.

●Loaded data into Spark RDD and performed in-memory data computation for generating output responses.

●Implemented Spark using Scala with Data Frames and Spark SQL API for faster data processing.

●Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, writing results back into OLTP systems through Sqoop.

●Created Streaming pipelines using Azure Event Hubs and Stream Analytics for data analysis.

●Set up and maintained various Azure services, including Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.

●Developed scalable frameworks in Azure Databricks using metadata.

●Created Hive tables, loaded data, and performed data analysis using Hive queries.

●Utilized Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.

●Developed workflows using Azure Logic Apps for sending alerts/notifications on different jobs in Azure.

●Converted Hive/SQL queries into Spark transformations using Spark SQL and Scala.

●Implemented core API services using Python and PySpark.

●Worked with Spark and Kafka to consume data and convert it to a common format using Scala.

●Loaded data into Hive and Cassandra from CSV files using Spark/PySpark (a sketch follows this list).

●Developed Spark code using Scala and Spark SQL/Streaming for faster data testing and processing.

●Created External tables in Azure SQL Database for data visualization and reporting purposes.

●Used Spark SQL with Scala to create data frames and perform transformations.

●Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
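The sketch referenced above loads CSV files into both Hive and Cassandra with PySpark; it assumes Hive support is enabled and the Spark Cassandra connector is on the classpath, and the paths, database, keyspace, and table names are hypothetical.

```python
# Hedged sketch: CSV landing files -> Hive table and Cassandra table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("csv-to-hive-cassandra")
                .enableHiveSupport()
                .getOrCreate()
)

orders = spark.read.csv("/landing/orders/*.csv", header=True, inferSchema=True)

# Persist to a managed Hive table for batch reporting (database assumed to exist).
orders.write.mode("overwrite").saveAsTable("sales.orders")

# Persist the same records to Cassandra for low-latency lookups.
(orders.write.format("org.apache.spark.sql.cassandra")
       .options(keyspace="sales", table="orders")
       .mode("append")
       .save())
```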

Environment: Python (NumPy, SciPy, Matplotlib, Pandas), Apache Spark, Kafka, Spark SQL, AWS, Redshift, S3, EC2, AWS Glue, MS Azure, Azure SQL Database, Azure Analysis Service, Azure SQL Data warehouse, Azure Data Factory, Azure Logic Apps, Spark RDD, Spark Streaming, PySpark, HDFS, Tableau

Client: Prudential Financial, Lexington, Kentucky July 2016 to August 2017

Data Engineer

Responsibilities:

•Implemented Agile Methodology as the organization's standard for data model implementation.

•Designed and developed a Security Framework for fine-grained access to AWS S3 objects using AWS Lambda and DynamoDB.

•Conducted end-to-end Architecture & implementation assessments of AWS services like Amazon EMR, Redshift, and S3.

•Utilized AWS EMR to transform and move large data volumes between AWS data stores and databases, such as Amazon S3 and DynamoDB.

•Implemented Data Quality framework using AWS Athena, Snowflake, Airflow, and Python.

•Built scalable distributed data solutions using Hadoop, working on components like HDFS, Yarn, ResourceManager, NodeManager, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce. Utilized Impala for data operations in HDFS.

•Developed automated HBase test cases for data quality checks using HBase command line tools.

•Managed data from different sources and handled HDFS maintenance and loading of structured and unstructured data.

•Automated complex workflows using Apache Airflow workflow handler.

•Loaded bulk data into HBase using MapReduce by generating HFiles directly, and designed HBase tables to store various data formats from different portfolios.

•Utilized Pandas and NumPy in Python for developing data visualization charts.

•Imported data from various sources into HDFS using Sqoop and performed transformations using Hive and MapReduce before loading the results back into HDFS.

•Processed data by collecting, aggregating, and moving from various sources using Apache Flume and Kafka.

•Created Pig Latin scripts for sorting, grouping, joining, and filtering enterprise-wide data.

•Troubleshot MapReduce job execution issues by inspecting and reviewing log files.

•Designed automated workflows using Oozie for both time-driven and data-driven tasks, and utilized Zookeeper for cluster coordination.

•Developed multiple Python-based MapReduce jobs for data cleaning and preprocessing (a streaming-mapper sketch follows this list), along with data capacity planning and node forecasting.

•Imported data into HDFS from SQL databases and files using Sqoop, and ingested streaming data into the Big Data Lake using Storm.

•Generated Tableau dashboards with filters, parameters, and calculated fields for interactive data reporting.
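The sketch referenced above shows a map-only Hadoop Streaming mapper of the kind used for data cleaning; the pipe-delimited, five-field record layout is a hypothetical example, not the actual schema.

```python
# Hedged sketch of a Hadoop Streaming mapper for data cleaning
# (submitted via: hadoop jar hadoop-streaming.jar -mapper clean_mapper.py ...).
import sys


def clean(line):
    """Trim whitespace, drop malformed rows, and normalize the status field."""
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    if len(fields) != 5 or not fields[0]:
        return None  # discard malformed records
    fields[4] = fields[4].upper()
    return "|".join(fields)


if __name__ == "__main__":
    for raw in sys.stdin:
        cleaned = clean(raw)
        if cleaned is not None:
            print(cleaned)
```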

Environment: Agile, Scrum, Python (Pandas, Matplotlib, SciPy, NumPy), Big Data, Hadoop, HDFS, Map Reduce, AWS, S3, EC2, Redshift, AWS Lambda, AWS EMR, AWS Athena, SQL Database, SQL Data warehouse, HBase, Apache Airflow, Flume, Kafka, Pig, Big Data Lake, Tableau


