
Big Data Engineer

Location: Dallas, TX
Posted: October 26, 2022


Madhukar Enugurthi

Data Engineer

Available on C*C

Professional Summary:

8+ years of experience as a Data Engineer in Big Data, covering analysis, design, development, documentation, deployment, and integration using the Hadoop and Spark frameworks, SQL, and related Big Data technologies.

Well versed in configuring and administering Hadoop clusters using Cloudera and Hortonworks.

Experience in creating separate virtual data warehouses with different size classes in Snowflake on AWS.

Experience with data transformations in Snowflake using SnowSQL and Python.

Hands-on experience in bulk loading and unloading data into Snowflake tables using the COPY command.
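Purely as an illustration of that COPY-based bulk-load pattern, the sketch below uses the snowflake-connector-python client; the account, stage, table, and file-format settings are hypothetical placeholders rather than actual project objects.

```python
# Minimal sketch: bulk load staged files into a Snowflake table with COPY INTO.
# Connection parameters, stage and table names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder account identifier
    user="etl_user",             # placeholder credentials
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Load all CSV files currently sitting in the external stage into the target table.
    cur.execute("""
        COPY INTO customer_orders
        FROM @s3_landing_stage/orders/
        FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    for row in cur.fetchall():
        print(row)   # one load-status row per staged file
    # Unloading works the same way in reverse: COPY INTO @stage FROM table.
finally:
    conn.close()
```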

Experience in working with AWS S3 and the Snowflake cloud data warehouse.

Expertise in building Azure-native enterprise applications and migrating applications from on-premises to the Azure environment.

Implementation experience with data lakes and business intelligence tools in Azure.

Experience in creating real-time data streaming solutions using Apache Spark / Spark Streaming, Apache Storm, Kafka, and Flume.

Currently working extensively on Spark applications with Scala as the main programming language, processing streaming data using the Spark Streaming API.

Used Spark DataFrames, Spark SQL, and the RDD API for various data transformations and dataset building.

Developed RESTful web services to retrieve, transform, and aggregate data from different endpoints into Hadoop (HBase, Solr).

Created Jenkins Pipeline using Groovy scripts for CI/CD.

Exposure to data lake implementation; developed data pipelines and applied business logic using Apache Spark.

Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.

Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.

Hands-on experience with real-time workloads on NoSQL databases such as MongoDB, HBase, and Cassandra.

Experience in creating MongoDB clusters and hands-on experience with complex MongoDB aggregation pipelines and mapping.
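As a small illustration of the kind of MongoDB aggregation work mentioned above, the sketch below uses pymongo; the connection string, collection, and pipeline stages are hypothetical examples.

```python
# Minimal sketch of a MongoDB aggregation pipeline using pymongo.
# Host, database, collection and field names are hypothetical placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["retail"]["orders"]

# Group completed orders by customer, sum their totals, and keep the top spenders.
pipeline = [
    {"$match": {"status": "COMPLETED"}},
    {"$group": {"_id": "$customer_id", "total_spent": {"$sum": "$amount"}}},
    {"$sort": {"total_spent": -1}},
    {"$limit": 10},
]

for doc in orders.aggregate(pipeline):
    print(doc["_id"], doc["total_spent"])
```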

Experience in using Flume to load log files into HDFS and Oozie for data scrubbing and processing.

Experience in performance tuning of Hive queries and MapReduce programs for scalability and faster execution.

Experienced in handling real-time analytics using HBase on top of HDFS data.

Experience with transformations, grouping, aggregations, and joins using the Kafka Streams API.

Hands-on experience deploying Kafka Connect in standalone and distributed mode and creating Docker containers.

Created Kafka topics and wrote Kafka producers and consumers in Python as required; developed Kafka source/sink connectors to stream new data into topics and from topics into the required target databases as part of ETL tasks; also used the Akka toolkit with Scala for some builds.
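A minimal sketch of that Python producer/consumer pattern, using the kafka-python client, is shown below; the broker address, topic name, and payload fields are hypothetical placeholders.

```python
# Minimal sketch of a Kafka producer and consumer in Python (kafka-python client).
# Broker address, topic name and payload fields are hypothetical placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "orders_raw"

# Producer: keyed messages so records for the same key land on the same partition.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, key="customer-42", value={"order_id": 1001, "amount": 59.99})
producer.flush()

# Consumer: reads the same topic as part of a consumer group.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=["localhost:9092"],
    group_id="etl-loader",
    auto_offset_reset="earliest",
    key_deserializer=lambda k: k.decode("utf-8") if k else None,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.key, message.value)   # hand off to the downstream ETL step
    break                               # stop after one record in this sketch
```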

Experienced in collecting metrics for Hadoop clusters using Ambari & Cloudera Manager.

Knowledge of Storm architecture and experience with data modeling tools such as Erwin.

Excellent experience in using scheduling tools to automate batch jobs

Hands-on experience using Apache Solr/Lucene.

Expertise in SQL Server: SQL queries, stored procedures, and functions.

Hands-on experience in application development using Hadoop, RDBMS, and Linux shell scripting.

Strong experience in extending Hive and Pig core functionality by writing custom UDFs.

Experience in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across big volume of structured and unstructured data.

Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and building data visualizations using Python and R.

Ability to work in a team and individually on many cutting-edge technologies, with excellent management skills, business understanding, and strong communication skills.

Technical Skills:

Hadoop/Big Data: HDFS, MapReduce, YARN, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera

Programming Languages: SQL, Python, R, Scala, Spark, Linux shell scripts

Databases: RDBMS (MySQL, DB2, MS SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake virtual warehouse

OLAP & ETL Tools: Tableau, Spyder, Spark, SSIS, Informatica PowerCenter, Pentaho, Talend

Data Modeling Tools: Microsoft Visio, ER Studio, Erwin

Python and R Libraries: R – tidyr, tidyverse, dplyr, reshape, lubridate; Python – BeautifulSoup, NumPy, SciPy, matplotlib, python-twitter, pandas, scikit-learn, Keras

Machine Learning: Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boost & AdaBoost, Neural Networks, Time Series Analysis

Data Analysis Tools: Machine Learning, Deep Learning, Data Warehousing, Data Mining, Data Analysis, Big Data, Visualization, Data Munging, Data Modeling

Cloud Computing Tools: Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

Amazon Web Services: EMR, EC2, S3, RDS, CloudSearch, Redshift, Data Pipeline, Lambda

Reporting Tools: JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS

IDEs: PyCharm

Development Methodologies: Agile, Scrum, Waterfall

Professional Experience:

Client: Tory Burch, New York, NY May 2020 – Present

Role: Data Engineer

Responsibilities:

Developed Talend Big Data jobs to load heavy volumes of data into an S3 data lake and then into Snowflake.

Developed Snowpipes for continuous ingestion of data using S3 event notifications from AWS.
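A hedged sketch of how such an auto-ingest Snowpipe can be defined is shown below, issued here through the Python connector; the pipe, stage, and table names are hypothetical, and auto-ingest assumes the S3 bucket's event notifications are pointed at the pipe's SQS queue.

```python
# Minimal sketch: define a Snowpipe that auto-ingests new S3 files via event notifications.
# Object names and connection parameters are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# AUTO_INGEST = TRUE makes Snowflake listen for S3 event notifications delivered
# to the pipe's SQS queue, so each newly staged file is loaded automatically.
cur.execute("""
    CREATE OR REPLACE PIPE orders_pipe
    AUTO_INGEST = TRUE
    AS
    COPY INTO customer_orders
    FROM @s3_landing_stage/orders/
    FILE_FORMAT = (TYPE = 'JSON')
""")

# SHOW PIPES exposes the notification_channel (SQS ARN) to configure on the S3 bucket.
cur.execute("SHOW PIPES LIKE 'orders_pipe'")
print(cur.fetchall())
conn.close()
```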

Developed SnowSQL scripts to deploy new objects and update changes into Snowflake.

Developed a Python script to integrate DDL changes between the on-prem Talend warehouse and Snowflake.

Working with the AWS stack: S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, and Lambda.

Designing and implementing new Hive tables, views, and schemas, and storing data optimally.

Performing Sqoop jobs to land data on HDFS and running validations.

Configuring Oozie scheduler jobs to run the extract jobs and queries in an automated way.

Optimizing queries to improve query performance.

Designing and creating SQL Server tables, views, stored procedures, and functions.

Performing ETL operations using Apache Spark, running ad-hoc queries, and implementing machine learning techniques.

Worked on configuring CI/CD for CaaS deployments (Kubernetes).

Involved in migrating master-data from Hadoop to AWS.

Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.

Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files.
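The project's Spark jobs were written mainly in Scala; purely for illustration, here is a minimal PySpark sketch of the JSON-to-flat-file preprocessing idea, with hypothetical paths and column names.

```python
# Minimal PySpark sketch: flatten JSON documents into a delimited flat file.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-flattener").getOrCreate()

# Read the raw JSON documents (one object per line) from the landing area.
raw = spark.read.json("s3a://my-bucket/landing/orders/")

# Project nested fields into flat columns and normalize the timestamp.
flat = raw.select(
    F.col("order_id"),
    F.col("customer.id").alias("customer_id"),
    F.col("customer.region").alias("region"),
    F.to_date("created_at").alias("order_date"),
    F.col("amount").cast("double"),
)

# Write a pipe-delimited flat file for the downstream load.
(flat.write
     .mode("overwrite")
     .option("sep", "|")
     .option("header", True)
     .csv("s3a://my-bucket/processed/orders_flat/"))

spark.stop()
```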

Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.

Processed big data with Amazon EMR across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.

Worked on Big Data infrastructure for batch processing and real-time processing using Apache Spark.

Developed Apache Spark applications using Scala for data processing from various streaming sources.

Processed web server logs by developing multi-hop Flume agents using the Avro sink, loaded the data into Cassandra for further analysis, and extracted files from Cassandra through Flume.

Responsible for design and development of Spark SQL Scripts based on Functional Specifications

Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and Cassandra

Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.

Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.

Developed helper classes to abstract the Cassandra cluster connection and act as a core toolkit.

Involved in creating a data lake by extracting customer data from various data sources into HDFS, including Excel files, databases, and server log data.

Moved data from HDFS to Cassandra using MapReduce and the Bulk Output Format class.

Extracted files from Cassandra through Sqoop, placed them in HDFS, and processed them using Hive.

Wrote Hadoop MapReduce programs to convert text files into Avro and load them into Hive tables.

Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system

Extending Hive/Pig core functionality by writing custom User-Defined Functions (UDFs), User-Defined Table-Generating Functions (UDTFs), and User-Defined Aggregate Functions (UDAFs).

Involved in loading data from rest endpoints to Kafka producers and transferring the data to Kafka brokers

Used Apache Kafka features such as distribution, partitioning, and the replicated commit log service for messaging.

Partitioned data streams using Kafka; designed and configured the Kafka cluster to accommodate heavy throughput.

Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team

Used Apache Oozie for scheduling and managing multiple Hive Jobs. Knowledge of HCatalog for Hadoop based storage management

Migrated an existing on-premises application to Amazon Web Services (AWS) and used its services like EC2 and S3 for small data sets processing and storage, experienced in maintaining the Hadoop cluster on AWS EMR

Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats like Text, Avro, Sequence, XML, JSON, and Parquet

Generated various kinds of reports using Pentaho and Tableau based on Client specification

Gained exposure to tools such as Jenkins, Chef, and RabbitMQ.

Worked with SCRUM team in delivering agreed user stories on time for every Sprint

Environment: Snowflake, SnowSQL, Hadoop, MapReduce, HDFS, Yarn, Hive, Sqoop, Oozie, Spark, Scala, AWS, EC2, S3, EMR, Cassandra, Flume, Kafka, Pig, Linux, Shell Scripting

Client: USAA, San Antonio, TX Jul 2018 – Mar 2020

Role: Data Engineer

Responsibilities:

Worked on Snowflake Shared Technology Environment for providing stable infrastructure, secured environment, reusable generic frameworks, robust design architecture, technology expertise, best practices and automated SCBD (Secured Database Connections, Code Review, Build Process, and Deployment Process) utilities.

Designed ETL process using Pentaho Tool to load from Sources to Targets with Transformations.

Worked on Snowflake Schemas and Data Warehousing.

Developed Pentaho Big Data jobs to load heavy volumes of data into an S3 data lake and then into the Redshift data warehouse.

Migrated data from the Redshift data warehouse to the Snowflake database.

Built dimensional models and a data vault architecture on Snowflake.

Built a scalable, distributed Hadoop cluster running Hortonworks Data Platform (HDP 2.6).

Involved in developing Spark code using Scala and Spark SQL for faster testing and processing of data, and explored optimizations using Spark Context, Spark SQL, and Pair RDDs.

Serialized JSON data and stored it in tables using Spark SQL.

Spark Streaming collects data from Kafka in near real time, performs the necessary transformations and aggregations to build the common learner data model, and stores the results in a NoSQL store (HBase).
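For illustration only, a minimal PySpark Structured Streaming sketch of that Kafka-consumption and aggregation step is shown below (the actual pipeline wrote to HBase; a console sink stands in here). Topic, schema, and field names are hypothetical, and the spark-sql-kafka integration package is assumed to be available.

```python
# Minimal PySpark Structured Streaming sketch: consume Kafka, aggregate, and emit results.
# Topic, schema and field names are hypothetical; the console sink stands in for HBase.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("learner-stream").getOrCreate()

event_schema = (StructType()
                .add("learner_id", StringType())
                .add("course_id", StringType())
                .add("score", DoubleType())
                .add("event_time", TimestampType()))

# Read raw events from Kafka in near real time and parse the JSON payload.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "learner_events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Aggregate per learner over 5-minute windows to build the common learner model.
learner_model = (events
                 .withWatermark("event_time", "10 minutes")
                 .groupBy(F.window("event_time", "5 minutes"), "learner_id")
                 .agg(F.avg("score").alias("avg_score"),
                      F.count(F.lit(1)).alias("events")))

query = (learner_model.writeStream
         .outputMode("update")
         .format("console")      # placeholder for the HBase sink
         .start())
query.awaitTermination()
```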

Worked on Spark framework on both batch and real-time data processing

Hands-on experience with Spark MLlib, used for predictive intelligence, customer segmentation, and smooth maintenance in Spark Streaming.

Developed Spark Streaming programs that take data from Kafka and push it to different targets.

Loaded data from different sources (Teradata, DB2, Oracle, and flat files) into HDFS using Sqoop and loaded it into partitioned Hive tables.

Created various Pig scripts and wrapped them as shell commands to provide aliases for common operations in the project business flow.

Implemented Partitioning, Bucketing in Hive for better organization of the data.

Created a few Hive UDFs to hide or abstract complex, repetitive rules.

Developed Oozie workflows for daily incremental loads, which pull data from Teradata and import it into Hive tables.

Developed bash scripts to fetch log files from an FTP server and process them for loading into Hive tables.

All the bash scripts are scheduled using Resource Manager Scheduler.

Developed MapReduce programs for applying business rules to the data.

Developed a NiFi workflow to pick up data from the data lake as well as from servers and send it to the Kafka broker.

Involved in loading and transforming large sets of structured data from router locations to the EDW using an Apache NiFi data pipeline flow.

Implemented a Kafka event-log producer to publish logs produced by the Hadoop cluster to a Kafka topic, which is consumed by the ELK (Elasticsearch, Logstash, Kibana) stack for analysis.

Implemented Apache Kafka as a replacement for a more traditional message broker (JMS/Solace) to reduce licensing costs, decouple processing from data producers, and buffer unprocessed messages.

Implemented a receiver-based approach in Spark Streaming, linking with the StreamingContext using Python and handling proper closing and waiting stages.

Experience in implementing rack topology scripts for the Hadoop cluster.

Resolved issues related to the old Hazelcast EntryProcessor API.

Used the Akka toolkit with Scala to perform a few builds.

Excellent knowledge of the Talend Administration Console, Talend installation, and the use of context and global map variables in Talend.

Used dashboard tools such as Tableau.

Used the Talend Administration Console Job Conductor to schedule ETL jobs on a daily and weekly basis.

Environment: Hadoop HDP, Linux, MapReduce, HBase, HDFS, Hive, Pig, Tableau, NoSQL, Shell Scripting, Sqoop, Open-source technologies Apache Kafka, Apache Spark, Git, Talend.

Hadoop Engineer Jan 2016 – Mar 2017

IES Info Hub, Pvt Ltd, India

Responsibilities:

Developed highly optimized Spark applications to perform data cleansing, validation, transformation and summarization activities

The data pipeline consists of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.

Created Spark jobs and Hive Jobs to summarize and transform data.

Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.

Converted Hive/SQL queries into Spark transformations using Spark Data Frames and Scala.

Used different tools for data integration with different databases and Hadoop.

Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume the data.

Ingested syslog messages, parsed them, and streamed the data to Kafka.

Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.

Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.

Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis

Analyzed the data by performing Hive queries (Hive QL) to study customer behavior.

Helped DevOps engineers deploy code and debug issues.

Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

Developed Hive scripts in Hive QL to de-normalize and aggregate the data.

Scheduled and executed workflows in Oozie to run various jobs.

Implemented business logic in Hive and wrote UDFs to process the data for analysis.

Addressed issues arising from the huge volume of data and transitions.

Documented operational problems following standards and procedures using JIRA.

Environment: Spark, Scala, Hive, Apache NiFi, Kafka, HDFS, Oracle, HBase, MapReduce, Oozie, Sqoop

Data Engineer Aug 2014 – Nov 2015

Dell, India

Responsibilities:

Interacted with the business requirements and design teams and prepared low-level and high-level design documents.

Provide in-depth technical and business knowledge to ensure efficient design, programming, implementation and on-going support for the application.

Involved in identifying possible ways to improve the efficiency of the system

Logical implementation and interaction with HBase

Efficiently put and fetched data to/from HBase by writing MapReduce jobs.

Developed MapReduce jobs to automate the transfer of data to/from HBase.

Assisted with the addition of Hadoop processing to the IT infrastructure.

Implemented and executed MapReduce jobs to process log data from the ad servers.

Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

Worked on MongoDB and Cassandra.

Prepared multi-cluster test harness to exercise the system for better performance

Environment: Hadoop, HDFS, MapReduce, HBase, Hive, Cassandra, Hadoop distributions from Hortonworks and Cloudera, SQL*Plus, and Oracle 10g.

Education

Master’s in Business Administration, Data Analytics, University of New Haven, Connecticut, USA

Bachelor of Technology, Computer Science, Jawaharlal Nehru Technological University, Telangana, India


