Hadoop Engineer

Location:

Posted:

December 10, 2021

Resume:

High-Level Overview

*+ years serving the Hadoop/Big Data space in roles such as Hadoop Engineer, AWS Big Data Engineer, Cloud Engineer, and Big Data Engineer.

Database analysis experience.

Experienced with Big Data technologies such as Amazon Web Services (AWS), Microsoft Azure, Apache Kafka, Python, Apache Spark, Hive, and Hadoop.

Skilled in Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode.

Hands-on experience with Hive, Sqoop, Flume, and Oozie.

Import/export terabytes of data between HDFS and RDBMS using Sqoop.

Hands-on with NoSQL databases such as HBase and Cassandra.

Experience installing Hadoop ecosystems; installing and configuring nodes/clusters, administration, and de/commissioning.

Experience in XML technologies like Informatica XML parser and XML writer.

Hands-on performance tuning at the source, target, and data stage job levels using indexes, hints, and partitioning.

Performance tuning in the live systems for ETL/ELT jobs in Hive, Hadoop ecosystem.

Apply knowledge in PL/SQL and write SQL queries.

Work with Impala, Spark/Scala, Shark, and Storm.

Apply Python-based design and development programming.

Create PySpark DataFrames on multiple projects and tie them into Kafka.

Configure Hadoop and Apache Spark in Big Data environments.

Prepare and document test cases, and perform unit and integration testing.

In-depth understanding of Data Structures and Algorithms and Optimization.

Self-motivated, excellent team player with a positive attitude who adheres to strict deadlines.

Experience in Automation Testing, Software Development Life Cycle (SDLC) using the Waterfall Model, and a good understanding of Agile Methodology.

Write SQL queries for data validation of the reports and dashboards as necessary.

Worked with Big Data ecosystems (Hadoop, Spark, Cloudera).

Experienced analyzing Microsoft SQL Server data models and identifying and creating inputs to convert existing dashboards that use Excel as a data source.

Applied expertise designing custom reports using data extraction and reporting tools, and development of algorithms based on business cases.

Technical Skills

BIG DATA PLATFORMS: Cloudera CDH, Hortonworks HDP, Amazon Web Services (AWS)/Amazon Cloud

DATA PIPELINES/ETL: Flume, Spark, Kafka, Hive, Pig, Spark Streaming, Spark Structured Streaming, Spark SQL, DataFrames, Kinesis

DATA VISUALIZATION: Tableau, Power BI, Excel, Kibana

DATABASES AND DATA WAREHOUSES: Cassandra, HBase, Amazon Redshift, DynamoDB, MongoDB, Oracle, PostgreSQL, MySQL, Hive

DATA STORES (repositories): Data Lake, HDFS, Data Warehouse, S3

AWS PLATFORMS: AWS IAM, AWS CloudFormation, AWS Redshift, AWS RDS, AWS EMR, AWS S3, EC2, AWS Lambda, AWS Kinesis, AWS ELK, AWS Cloud

CLOUD SERVICES & DISTRIBUTIONS: AWS, Azure, Anaconda Cloud, Elasticsearch, Cloudera, Databricks, Hortonworks, Elastic, Elastic Cloud

SOFTWARE DEVELOPMENT: Spark, Scala, Hive, PySpark, Keras, TensorFlow, JavaScript, SQL, Shell Script, Python

DEVELOPMENT TOOLS, AUTOMATION, CI/CD: Git, GitHub, MVC, Jenkins, CI/CD, Jira, Agile, Scrum

SCHEDULER TOOL: Airflow

GCP PLATFORM: Dataproc, Pub/Sub

Work Experience

Nov 2019 - Present

Hadoop Engineer

Walgreen Company

Chicago, IL

Created Hive tables, loaded data, and wrote Hive queries, which run internally as MapReduce jobs.

Created a prototype for real-time analysis using Spark Streaming and Kafka.

Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.

Installed and configured Hive and wrote Hive UDFs.

Implemented Spark using Scala and utilized Data Frames and Spark SQL API for faster processing of data.

Implemented Kafka messaging consumer.

Applied Spark and Spark SQL for faster testing and processing.

Handled storage capacity management, performance tuning, and benchmarking of clusters.

Worked with Amazon Web Services (AWS) on ETL, data integration, and migration.

Produced ETL to Hadoop file system (HDFS) and wrote Hive UDFs.

Transferred data between a Hadoop ecosystem and structured data storage in RDBMS (MySQL) using Sqoop.

Loaded data from the UNIX file system to HDFS.

Wrote streaming applications with Spark Streaming and Kafka Streams.

Streamed processed data to DynamoDB using Spark to make it available to the BI team for visualization and report generation.

Wrote Spark SQL scripts to validate, cleanse, transform, and aggregate data.

Used Airflow to schedule Spark jobs on an EC2 Hadoop cluster.

Mar 2018 – Nov 2019

AWS Big Data Engineer

Skanska USA Building Inc.

Parsippany-Troy Hills, NJ

Launched Amazon EC2 cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications.

Implemented Spark on EMR for processing Big Data across the data lake in AWS.

Installed, configured, and managed tools such as ELK and AWS CloudWatch for resource monitoring.

Implemented AWS EMR to process big data across Hadoop clusters of virtual servers, with data stored in Amazon Simple Storage Service (S3).

Implemented security measures AWS provides, employing key concepts of AWS Identity and Access Management (IAM).

Implemented AWS IAM user roles and policies to authenticate and control access.

Developed AWS CloudFormation templates to create custom pipeline infrastructure.

Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.

Designed logical and physical data models for various data sources on AWS Redshift.

Created custom users and groups using the Amazon AWS IAM console.

Programmed and configured AWS Lambda functions for event-driven processing across various AWS resources.

Ingested data from various sources into S3 through AWS Kinesis.

Utilized Google Dataproc to manage Spark and Hadoop services and optimize batch processing, querying, streaming, and machine learning workloads.

Automated cluster creation and management using Google Dataproc.

Applied Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.

Worked with AWS CloudFormation, AWS IAM, and security groups in public and private subnets in a VPC.

Automated provisioning of AWS components such as EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.

Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.

Configured AWS Kinesis for processing huge amounts of real-time data.

Specified nodes and performed the data analysis queries on Amazon Redshift clusters on AWS.

Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket using Amazon API Gateway.

Utilized AWS Kinesis for real-time data processing.

Jun 2016 – Mar 2018

Cloud Engineer

Forrester Research

Cambridge, MA

Joined, manipulated, and discerned actionable insights from large data sources using Python and SQL.

Ingested large data streams from the REST API.

Implemented Kafka streaming to send messages from the API to Spark.

Developed consumer intelligence reports based on market research, data analytics, and social media.

Used AWS Redshift to store the information in the cloud.

Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark clusters.

Utilized data mining and web scraping techniques to deliver analytics on large research projects using Python.

Streamed data from Kafka brokers using Spark Streaming and processed the data using explode transformation.

Finalized the data pipeline using HBase as a NoSQL storage option.

Worked on AWS to create and manage EC2 instances and Hadoop clusters.

Deployed the Big Data Hadoop application using Talend on AWS.

Carried out social media sentiment analysis to enrich projects with advanced data offerings.

Wrote shell scripts to export log files to the Hadoop cluster through automated processes.

Successfully loaded files to HDFS from SQL using Spark.

Used Spark SQL and Hive Query Language (HQL) to obtain client insights.

Optimized Hive analytics SQL queries, created tables and views, and wrote custom queries and Hive-based exception processes.

Sep 2014 – Jun 2016

Big Data Engineer

HollyFrontier Corporation

Dallas, TX

Built large-scale and complex data processing pipelines.

Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.

Developed and managed all cluster-related testing activities.

Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.

Worked with Apache Spark data structures, critical features, and performance tuning.

Worked with cluster and configuration management systems Docker and Kubernetes.

Analyzed Hadoop clusters using Big Data analytic tools Hive and MapReduce.

Applied analytic tools MapReduce, Hive, HDFS, Spark, and Kafka.

Applied understanding of Hadoop Architecture and underlying Hadoop framework, including Storage Management.

Sep 2013 – Sep 2014

Big Data Developer

Westat

Rockville, MD

Partitioned and bucketed log file data.

Collected relational data using custom input adapters and Sqoop.

Wrote Oozie configuration scripts to export log files to the Hadoop cluster through automated processes.

Accessed Hadoop Cluster (CDM) and reviewed log files of all daemons.

Programmed the Oozie workflow engine to run multiple Hive queries.

Implemented workflows with Apache Oozie framework to automate tasks.

Utilized Sqoop to import structured data from MySQL, SQL Server, PostgreSQL, and a semi-structured CSV file dataset into the HDFS data lake.

Analyzed large sets of structured, semi-structured, and unstructured data by running Hive queries.

Developed Python scripts for data capture and delta record processing between newly arrived data and existing data in HDFS.

Consumed data from a Kafka queue using Storm.

Jul 2010 – Sep 2013

Database Administrator

Campbell Soup Company

Camden, NJ

Installed, configured, managed, and supported SQL Servers and related databases.

Reviewed vendor computer equipment, servers and software and made recommendations to management regarding procurement.

Automated database operations by programming scripts/macros.

Managed work task queue requests from multiple departments and worked in alignment with priority classifications (e.g., critical, high, moderate, not urgent) applied to work task requests.

Designed and established SQL applications.

Created tables and views in the SQL database.

Supported schema changes and maintained the database for optimal performance.

Created and managed tables, views, user permissions, and access control.

Education

Bachelor of Science - Mathematics - University of Maryland Baltimore County
