
Niyi Celestine - Big Data Engineer

Washington, DC 20009

adsgfs@r.postjobfree.com

+1-202-***-****

• 7 years' experience in Big Data development applying Apache Spark, HIVE, Apache Kafka, and Hadoop.

• 9 years of total IT, software, and database design, development, deployment, and support experience (7 years in Big Data and 2 years in software/IT data systems).

• Experienced with related Big Data technologies such as Amazon Web Services (AWS), Microsoft Azure, Apache Kafka, Python, Apache Spark, Hive, and Hadoop.

• Experienced analyzing Microsoft SQL Server data models and identifying and creating inputs to convert existing dashboards that use Excel as a data source.

• Applied Python-based design and development programming to multiple projects.

• Created PySpark DataFrames on multiple projects and integrated them with Kafka.

• Configured Hadoop and Apache Spark for Big Data workloads.

• Built AWS CloudFormation templates used alongside Terraform with existing plugins.

• Developed AWS CloudFormation templates to create custom pipeline infrastructure.

• Implemented AWS IAM user roles and policies to authenticate and control user access.

• Applied expertise in designing custom reports using data extraction and reporting tools and in developing algorithms based on business cases.

• Performance-tuned data-heavy dashboards and reports using extracts, context filters, efficient calculations, data source filters, and indexing and partitioning in the data source.

• Wrote SQL queries for data validation of reports and dashboards.

• Worked with Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).

• Proven success working on different Big Data technology teams operating within an Agile/Scrum project methodology.

Willing to relocate: Anywhere

Authorized to work in the US for any employer

Work Experience

BIG DATA ENGINEER

Amtrak - Washington, DC

June 2021 to Present

Worked on an eight-member development team responsible for migrating all historical and incremental data from several sources into an AWS data warehouse and for creating data marts for several departments and units to support customer data analysis. Utilized team tooling including GitHub, Databricks, AWS services, and Informatica.

Utilized Jira to track the tasks and progress of each team member. Programmed in Python, PySpark, and SQL as the primary coding tools. Wrote PySpark scripts in AWS Glue to process historical data into "raw folders" in S3. Wrote Python scripts for AWS Lambda to process real-time data into Amazon Aurora. Led the data migration team in moving over two years of historical data through the Development, Test, Stage, and Production environments. Produced daily data migration reports to capture the state of the migration. Performed root-cause analysis to understand issues with data in databases, data warehouses, and data marts, identified root causes, and performed the appropriate technical troubleshooting.
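For illustration, a Glue PySpark job along the lines described above might look like the following minimal sketch; the catalog database, table, and S3 bucket names are placeholders, not the actual project resources.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # Standard Glue job bootstrap.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a historical snapshot registered in the Glue Data Catalog
    # (database and table names are placeholders).
    historical = glue_context.create_dynamic_frame.from_catalog(
        database="legacy_db", table_name="trip_history"
    ).toDF()

    # Land the data as Parquet under a "raw" prefix in S3, partitioned
    # by load date for downstream processing (bucket name is a placeholder).
    (historical
        .withColumn("load_date", F.current_date())
        .write.mode("append")
        .partitionBy("load_date")
        .parquet("s3://example-raw-bucket/raw/trip_history/"))

    job.commit()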

Big Data Engineer

The Harbor Bank of Maryland - Baltimore, MD

December 2019 to June 2021

• Created a PySpark Streaming job to receive real-time data from Kafka (see the sketch at the end of this section).

• Defined Spark data schema and set up development environment inside the cluster.

• Designed Spark Python job to consume information from S3 Buckets using Boto3.

• Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.

• Created a pipeline to gather data using Pyspark, Kafka, and HBase.

• Used Spark streaming to receive real-time data using Kafka.

• Worked with unstructured data and parsed out the information using Python built-in functions.

• Configured a Python API Producer file to ingest data from the Slack API, using Kafka for real-time processing with Spark.

• Processed data with the Natural Language Toolkit (NLTK) to count important words and generate word clouds.

• Started and configured master and slave nodes for Spark.

• Set up cloud compute engine instances in managed and unmanaged modes along with SSH key management.

• Worked on virtual machines to run pipelines on a distributed system.

• Led presentations about the Hadoop ecosystem, best practices, and data architecture in Hadoop.

• Managed Hive connections with tables, databases, and external tables.

• Installed Hadoop using Terminal and set the configurations.
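As an illustration of the Kafka-to-Spark streaming work described in the bullets above, a minimal PySpark Structured Streaming sketch might look like this; the broker addresses, topic, message schema, and console sink are assumptions for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Expected shape of the incoming JSON messages (fields are placeholders).
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", StringType()),
    ])

    # Subscribe to a Kafka topic; broker and topic names are illustrative.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
           .option("subscribe", "transactions")
           .load())

    # Kafka delivers the payload as bytes in the `value` column; cast it
    # to a string and parse the JSON against the schema defined above.
    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("data"))
              .select("data.*"))

    # Write the parsed stream out (console sink used here only for illustration).
    query = parsed.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()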

AWS Data Engineer

Fifth Third Bancorp - Cincinnati, OH

May 2017 to December 2019

• Created and managed cloud VMs using AWS EC2 command-line clients and the AWS Management Console.

• Used Apache Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.

• Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.

• Used Ansible Python Script to generate inventory and push the deployment to AWS Instances.

• Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

• Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with storage in Amazon Simple Storage Service (S3) and AWS Redshift.

• Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (illustrated in the sketch at the end of this section).

• Populated database tables via AWS Kinesis Firehose and AWS Redshift.

• Automated the installation of the ELK agent (Filebeat) with an Ansible playbook; developed a Kafka queue system to collect log data without data loss and publish it to various sources.

• Applied AWS CloudFormation templates alongside Terraform with existing plugins.

• Developed AWS CloudFormation templates to create custom infrastructure for our pipeline.

• Implemented AWS IAM user roles and policies to authenticate and control access.

• Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.
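A minimal sketch of the kind of event-driven Lambda function described above, assuming an S3 ObjectCreated trigger and a Kinesis Data Firehose delivery stream as the downstream target; the stream name and event wiring are placeholders, not the actual production setup.

    import json
    import boto3

    s3 = boto3.client("s3")
    firehose = boto3.client("firehose")

    def lambda_handler(event, context):
        """Triggered by S3 ObjectCreated events; forwards each new object's
        lines to a Kinesis Data Firehose delivery stream (name is a placeholder)."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            # Fetch the newly created object and read it line by line.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            for line in body.splitlines():
                if not line.strip():
                    continue
                firehose.put_record(
                    DeliveryStreamName="example-delivery-stream",
                    Record={"Data": (line + "\n").encode("utf-8")},
                )
        return {"statusCode": 200, "body": json.dumps("processed")}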

Hadoop Engineer

Natural Resource Partners - Houston, TX

January 2016 to May 2017

• Integrated Kafka with Spark Streaming for real time data processing of logistic data.

• Used shell scripts to migrate the data between Hive, HDFS and MySQL.

• Installed and configured an HDFS cluster for big data extraction, transformation, and load.

• Utilized Zookeeper and Spark interface for monitoring proper execution of Spark Streaming.

• Configured Linux on multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration.

• Created a pipeline to gather data using Pyspark, Kafka and HBase.

• Sent requests to a source REST-based API from a Scala script via a Kafka producer (a comparable Python sketch appears at the end of this section).

• Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.

• Hands-on with Spark Core, SparkSession, Spark SQL, and the DataFrame/Dataset/RDD APIs to load structured data into Spark clusters.

• Created a Kafka broker that uses the schema to fetch structured data in structured streaming.

• Defined Spark data schema and set up development environment inside the cluster.

• Interacted with data residing in HDFS using PySpark to process the data.
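The REST-to-Kafka producer mentioned above was written in Scala; a comparable Python sketch using the kafka-python client is shown here for illustration, with the broker address, topic name, and API endpoint as placeholders.

    import json
    import time
    import requests
    from kafka import KafkaProducer  # kafka-python package

    # Broker address is a placeholder; values are serialized as JSON bytes.
    producer = KafkaProducer(
        bootstrap_servers="broker1:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    API_URL = "https://example.com/api/shipments"  # hypothetical REST endpoint

    while True:
        # Poll the REST API and publish each returned record to Kafka.
        response = requests.get(API_URL, timeout=10)
        response.raise_for_status()
        for record in response.json():
            producer.send("logistics-events", value=record)
        producer.flush()
        time.sleep(30)  # simple fixed polling interval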

Big Data Engineer

Kinetic - New York, NY

May 2014 to January 2016

• Connected and ingested data using different ingestion tools such as Kafka and Flume.

• Worked on importing the received data into Hive using Spark.

• Applied HQL for querying desirable data in Hive used for further analysis.

• Implemented partitioning, dynamic partitions, and buckets in Hive, which improved performance and logically organized the data (a simplified sketch appears at the end of this section).

• Decoded the raw data and loaded it into JSON before sending the batched streaming file through the Kafka producer.

• Received the JSON response in Kafka consumer written in Python.

• Established a connection between the HBase and Spark for the transfer of the newly populated data frame.

• Designed Spark Scala job to consume information from S3 Buckets.

• Monitored background operations in Hortonworks Ambari.

• Monitored HDFS job status and the health of the DataNodes according to the specs.
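A minimal PySpark-with-Hive sketch of the partitioning work mentioned in the Hive bullet above; the database, table, and column names are placeholders, and the bucketing step is omitted for brevity.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic partition inserts.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    # Partitioned Hive table stored as ORC (names are placeholders).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.events_part (
            event_id STRING,
            customer_id STRING,
            payload STRING
        )
        PARTITIONED BY (event_date STRING)
        STORED AS ORC
    """)

    # Load from a raw staging table, letting Hive route each row to its
    # partition based on the event_date value.
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.events_part PARTITION (event_date)
        SELECT event_id, customer_id, payload, event_date
        FROM analytics.events_raw
    """)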

Software/IT Data Systems Programmer

Hobby Lobby Stores - Oklahoma City, OK

February 2012 to May 2014

• Gathered requirements from the client, analyzed, and prepared a requirement specification document for the client.

• Identified data types and wrote and ran SQL data cleansing and analysis scripts.

• Formatted files to import and export data to a SQL Server repository.

• Applied Git to store and organize SQL queries.

• Improved user interface for the database by reducing user input with automated inputs.

• Re-designed forms for easier access.

• Applied code modifications and wrote new scripts in Python.

• Worked with software/IT technology team to improve data integration processing.

• Reported and resolved discrepancies in a timely manner through the appropriate channels.

Education

Master's degree in Mathematics

Western Illinois University - Macomb, IL

Bachelor of Applied Science in Mathematics

University of Lagos - Lagos, NG

Skills

• Data Processing (Compute) Engines: Apache Spark, Spark Streaming, Flink, Storm.

• Parquet, Avro & JSON, ORC, Text, CSV.

• Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala.

• Unix/Linux, Windows 10, Ubuntu, Apple OS.

• HiveQL, MapReduce, XML, FTP, Python, UNIX, Shell scripting, Linux.

• Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK).

• Pentaho, QlikView, Tableau, Power BI, Matplotlib.

• Database & Data Structures: Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB.


