
Hadoop Engineer

Location:
Springfield, MO
Posted:
March 22, 2021


Resume:

Adeseye Agbabiaka

Phone: 417-***-**** Email: **************@*****.***

Professional Summary

I am an experienced data engineer proficient in building data pipelines with a wide variety of big data technologies, and a skilled professional who creates solutions to address data engineering problems. I have proven my dedication and productivity in my previous positions. I am experienced in data loading, processing, and wrangling on both on-premises and cloud platforms. I am a team player, and I am accustomed to working with agile methodologies.

Experience

June 2019 – Present

AWS Big Data Engineer O'Reilly Auto Parts Springfield, MO

•Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets (a minimal sketch of this kind of job appears after this section).

•Developed AWS CloudFormation templates to provision the custom infrastructure for our pipeline.

•Implemented AWS IAM user roles, instance profiles, and policies to authenticate and control access.

•Designed a data warehouse and ran data analysis queries on Amazon Redshift clusters in AWS.

•Deployed the ELK (Elasticsearch, Logstash, Kibana) stack in AWS to gather and investigate the logs created by the website.

•Wrote unit tests for all code using frameworks such as pytest.

•Architected a serverless design using AWS APIs, Lambda, S3, and DynamoDB, optimized with auto scaling for performance.

•Designed the schema, cleaned up the input data, processed the records, wrote queries, and generated the output data using Redshift.

•Populated database tables via AWS Kinesis Firehose and AWS Redshift.

•Worked on the data lake on AWS S3 to integrate it with different applications and development projects.

•Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with data in Amazon Simple Storage Service (S3).

•Created Spark jobs that run in EMR clusters using EMR Notebooks.

•Developed Spark programs in Python to run in the EMR clusters.

•Created User Defined Functions (UDFs) in Scala to automate some of the business logic in the applications.

•Used Python for developing Lambda functions in AWS.

•Designed AWS Glue pipelines to ingest, process, and store data interacting with different services in AWS.
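
Below is a minimal, illustrative PySpark sketch of the kind of EMR batch job described above: it reads records from an S3 bucket, applies a simple aggregation, and writes the result back to S3. The bucket names, paths, and column names are hypothetical placeholders, not details from an actual project; on EMR such a script would typically be submitted with spark-submit as a cluster step.

# Minimal PySpark sketch of an EMR-style batch job over S3 data.
# Bucket names, paths, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main():
    # On EMR, the SparkSession picks up the cluster's Hadoop/S3 configuration.
    spark = SparkSession.builder.appName("s3-batch-example").getOrCreate()

    # Read raw CSV records from an input bucket.
    orders = (spark.read
              .option("header", "true")
              .csv("s3://example-input-bucket/orders/"))

    # Simple transformation: aggregate order totals per store.
    totals = (orders
              .withColumn("amount", F.col("amount").cast("double"))
              .groupBy("store_id")
              .agg(F.sum("amount").alias("total_amount")))

    # Write the result back to S3 as Parquet for downstream consumers.
    totals.write.mode("overwrite").parquet("s3://example-output-bucket/order_totals/")

    spark.stop()

if __name__ == "__main__":
    main()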

June 2017 – June 2019

Big Data Engineer Mitchell International San Diego, CA

•Implemented a streaming job with Apache Kafka to ingest data from a REST API.

•Designed a Spark Streaming job in PySpark that streamed the data from Kafka into Spark DataFrames (see the streaming sketch after this section).

•Used Spark to filter and format the data before designing the sink to store the data in the Hive data warehouse.

•Used Hive to store the final data after the Spark applications ran, so that Data Scientists could fetch and process it.

•Designed HiveQL and SQL queries to retrieve data from the data warehouse and create views for the final users to consume.

•Developed Spark SQL script for handling different data sets and verified its performance over the jobs using the Spark UI.

•Created Hive tables to store data in different formats coming from different data sources.

•Maintained the Hive metastore tables to store the metadata of the tables in the data warehouse.

•Automated the ingestion pipeline using bash scripts and Cron jobs to perform the ETL pipeline daily.

•Imported data from the local file system and RDBMS into HDFS using Sqoop, and developed a workflow of shell scripts to automate the tasks of loading the data into HDFS.

•Evaluated various data processing techniques available in Hadoop from various perspectives to detect aberrations in data.

•Connected Business Intelligence tools such as Tableau and Power BI to the tables in the data warehouse.

•Cleaned up the input data, specified the schema, processed the records, wrote UDFs, and generated the output data using Hive.
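
As an illustration of the Kafka-to-Spark streaming work described above, here is a minimal PySpark Structured Streaming sketch (the original job may have used the older DStream API). The broker address, topic name, message schema, and output paths are hypothetical, and the job assumes the Spark Kafka connector package is available on the classpath.

# Minimal PySpark Structured Streaming sketch: Kafka -> DataFrame -> sink.
# Broker address, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-example").getOrCreate()

# Expected shape of the JSON messages on the topic (assumed).
schema = StructType([
    StructField("claim_id", StringType()),
    StructField("status", StringType()),
    StructField("amount", DoubleType()),
])

# Stream records from a Kafka topic into a DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "claims-topic")
       .load())

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON,
# then filter and flatten the records before writing them out.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("data"))
          .select("data.*")
          .filter(F.col("amount") > 0))

# Append the filtered records to a Parquet location that a Hive external
# table could point at; a checkpoint directory is required for streaming.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "/warehouse/claims_stream/")
         .option("checkpointLocation", "/tmp/claims_stream_checkpoint/")
         .outputMode("append")
         .start())

query.awaitTermination()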

August 2015 – June 2017

Big Data Engineer Traveler’s Insurance Hartford, CT

•Migrated complex MapReduce programs into Apache Spark RDD operations.

•Used shell scripts to migrate data between Hive, HDFS, and MySQL.

•Designed and tested Spark SQL clients with PySpark.

•Created software applications using Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.

•Coded PySpark applications as ETL processes.

•Developed Python and Scala applications on Linux and UNIX platforms.

•Wrote Hive Query Language (HQL) queries to analyze data in the Hive warehouse from external tables hosted on S3 buckets.

•Installed, configured, and managed tools such as the ELK stack for log processing and resource monitoring.

•Used data from text files in HDFS to create RDDs and manipulated the data using PySpark and SparkContext in Jupyter notebooks.

•Developed Apache Airflow DAGs to automate the pipeline and send reports, using operators such as the Python, Bash, and AWS operators (see the DAG sketch after this section).

•Created managed and external tables to store ORC and Parquet files using HQL.

•Utilized Avro-tools to build the Avro schema used to create an external Hive table with PySpark.
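
Below is a minimal sketch of an Airflow DAG in the style described above, written against the Airflow 2.x operator imports; the DAG id, schedule, paths, and task commands are hypothetical placeholders rather than details from the actual pipeline.

# Minimal Apache Airflow DAG sketch with Python and Bash operators.
# DAG id, schedule, paths, and task logic are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def build_report():
    # Placeholder for the report-generation logic.
    print("building daily report")

with DAG(
    dag_id="daily_ingestion_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Stage raw files into HDFS with a shell command.
    ingest = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="hdfs dfs -put /data/incoming/*.csv /landing/",
    )

    # Generate and send the report in Python.
    report = PythonOperator(
        task_id="build_report",
        python_callable=build_report,
    )

    ingest >> report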

January 2013 – August 2015

Data Engineer UPS Atlanta, GA

•Ingested RDBMS data into the Hadoop ecosystem (HDFS) by writing Sqoop jobs.

•Used Apache Hive to query and analyze the data.

•Analyzed the insurance claims data which consists of information about the client.

•Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.

•Configured and deployed production-ready, multi-node Hadoop clusters with services such as Hive, Sqoop, Flume, and Oozie.

•Wrote producer/consumer scripts in Python to process JSON responses.

•Evaluated various data processing techniques available in Hadoop from various perspectives to detect aberrations in data.

•Performed data profiling and transformation on the raw data using Python and Oracle.

•Developed scripts to automate the workflow processes and generate reports.

•Developed a POC using Scala, deployed it on a YARN cluster, and compared the performance of Spark with Hive and SQL.

•Created Hive external tables, loaded data into them, and queried the data using HQL (see the sketch after this section).

•Performed performance tuning for Spark Streaming, e.g., setting the right batch interval, the correct number of executors, and the right publishing and memory configuration.

•Assisted in exporting analyzed data to relational databases using Sqoop.
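
To illustrate the Hive external table work above, here is a minimal PySpark sketch that creates an external table over files landed in HDFS (e.g., by Sqoop or Flume) and queries it with HQL; the table name, columns, and HDFS location are hypothetical placeholders.

# Minimal PySpark sketch: create a Hive external table over HDFS data and
# query it with HQL. Table name, columns, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-table-example")
         .enableHiveSupport()
         .getOrCreate())

# External table: Hive manages only the metadata; the data files stay in
# place under the given HDFS location.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS claims_raw (
        claim_id STRING,
        client_id STRING,
        amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///landing/claims/'
""")

# Query the external table with HQL and inspect the result.
top_clients = spark.sql("""
    SELECT client_id, SUM(amount) AS total_amount
    FROM claims_raw
    GROUP BY client_id
    ORDER BY total_amount DESC
    LIMIT 10
""")
top_clients.show()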

Skills

DATABASE

SQL, NoSQL, Apache Cassandra, MongoDB, HBase, RDBMS, Hive

BIG DATA PLATFORMS

Amazon AWS, Microsoft Azure, Elasticsearch, Apache Solr, Lucene, Cloudera Hadoop, Cloudera Impala, Databricks, Hortonworks Hadoop

FILES

HDFS, Avro, Parquet, Snappy, Gzip, SQL, Ajax, JSON, GSON, ORC

DATA VISUALIZATION

Tableau, Microsoft Power BI

OPERATING SYSTEMS

Linux, macOS, Microsoft Windows

PROGRAMMING

Python, Scala, PHP, Bash, LISP, SQL, JavaScript, jQuery, C, C++, XML, HTML, CSS, Visual Basic, VBA, .NET, Spark, HiveQL, Spark API, REST API

CLOUD

AWS EMR, AWS S3, AWS Lambda, AWS Glue, AWS Redshift

HADOOP ECOSYSTEM COMPONENTS & TOOLS

Apache Cassandra, Apache Flume, Hadoop, Apache Hadoop YARN, Apache HBase, Apache Hive, Apache Kafka, Apache Spark, Spark Streaming, Spark MLlib, GraphX, SciPy, Pandas, RDDs, DataFrames, Datasets, Mesos, Apache Tez, Apache Airflow, Sqoop

Certifications

Spark Fundamentals, IBM

Education

Florida State University, Tallahassee, FL

Ph.D., Industrial Engineering

University of Lagos, Lagos, Nigeria

BS, Systems Engineering


