
Data Engineer Hadoop

Location:
Tappahannock, VA, 22560
Posted:
September 15, 2022


Resume:

King Mayingani - Senior Data Engineer

Houston, TX 77077

adsd3y@r.postjobfree.com

+1-281-***-****

· 18+ years of combined experience in database/IT, Big Data, Cloud, and Hadoop.

· Past 9+ years concentrated on the Big Data space.

· Hands-on experience developing Teradata PL/SQL Procedures and Functions and SQL tuning of large databases.

· Import and export Terabytes of data between HDFS and Relational Database Systems using Sqoop.

· Hands-on with Extract Transform Load (ETL) from databases such as SQL Server and Oracle to Hadoop HDFS in Data Lake.

· Write SQL queries, Stored Procedures, Triggers, Cursors and Packages.

· Utilize Python libraries for analytical processing.

· Display analytical insights through Python data visualization libraries and tools such as Matplotlib and Tableau.

· Develop Spark code for Spark-SQL/Streaming in Scala and PySpark.

· Integrate Kafka and Spark using Avro for serializing and deserializing data, and for Kafka Producer and Consumer.

· Convert Hive/SQL queries into Spark transformations using Spark RDD and Data Frame.

· Use Spark SQL to perform data processing on data residing in Hive.

· Configure Spark Streaming to receive real-time data using Kafka.

· Use Spark Structured Streaming for high-performance, scalable, fault-tolerant real-time data processing (see the sketch after this summary).

· Write Hive/HiveQL scripts to extract, transform, and load data into databases.

· Configure Kafka clusters with ZooKeeper for real-time streaming.

· Build highly available, scalable, fault-tolerant systems using Amazon Web Services (AWS).

· Hands-on with Amazon Web Services (AWS) cloud services such as EMR, EC2, S3, and EBS, and with IAM entities, roles, and users.

· Experience using Hadoop clusters, HDFS, Hadoop tools, Spark, Kafka, and Hive for social and media data analytics across the Hadoop ecosystem.

· Highly knowledgeable in data concepts and technologies including AWS pipelines and cloud repositories (Amazon AWS, MapR, Cloudera).

· Hands-on experience using Cassandra, Hive, NoSQL databases (HBase, MongoDB), and SQL databases (Oracle, SQL Server, PostgreSQL, MySQL), as well as data lakes and cloud repositories, to pull data for analytics.

· Experience with Microsoft Azure.
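
As a concrete illustration of the Kafka and Spark Structured Streaming items above, here is a minimal sketch in PySpark. The broker address, topic name, and checkpoint path are placeholders, and it assumes the spark-sql-kafka connector is available on the cluster; it is not a copy of any specific production job.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming.
# Broker address, topic name, and checkpoint path are placeholders; the
# spark-sql-kafka connector must be on the cluster classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-stream-sketch")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
          .option("subscribe", "events-topic")                # placeholder
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

# Console sink for illustration only; a real pipeline would write to
# Hive, Delta, or another durable sink.
query = (events.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/chk/events")     # placeholder
         .outputMode("append")
         .start())

query.awaitTermination()
```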

Willing to relocate: Anywhere

Work Experience

Senior Data Engineer

KPMG - Houston, TX

June 2020 to Present

· Worked on an Agile technology team of 15 technology specialists responsible for building infrastructure for KPMG to collect data from customers around the world, process the data, submit API requests to Oracle GTM for classification based on US government requirements on trade agreements and customs, then process the responses and data returned from GTM and send them back to customers.

· Worked with the review team to assess the existing system and presented recommendations for improvement to management stakeholders in formal presentations.

· Cross-collaborated with other teams (e.g., App Dev Team, Oracle Dev Team, Oracle GTM Team).

· Planned and helped architect an MS Azure-based solution.

· Implemented CI/CD tool upgrades, backups, and restores.

· Handled versioning with Git.

· Performed manual and automated validation throughout testing to ensure data validity and quality.

· Created Kafka topics for Kafka brokers to consume from so that data transfer could function in a distributed system.

· Built data ingestion and processing pipelines with Spark on Databricks.

· Implemented security mechanisms with Azure IAM (Identity and Access Management), Azure Key Vaults, and Azure Active Directory.

· Implemented Delta Lake with ACID merge/upsert procedures (see the sketch below).

· Implemented Azure Synapse Analytics and Azure Databricks.

· Utilized Kafka, ADF, and Spark for data transfer operations.

· Used Azure Synapse and MS Power BI for Data Analysis and Reporting.

· Allocated cluster resources appropriately.

· Performed work in a MS Azure DevOps collaborative environment.

· Developed and tested entities to process and send API Requests to GTM.

· Developed and tested history and event processing.
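
The Delta Lake ACID work noted above could look, in broad strokes, like the following sketch. It assumes the delta-spark library (bundled with Databricks runtimes); the paths, table layout, and join key are illustrative placeholders rather than the actual project objects.

```python
# Minimal sketch of a Delta Lake ACID upsert (MERGE), assuming the
# delta-spark library is available (it is bundled with Databricks runtimes).
# Paths, table layout, and the join key are illustrative placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.read.json("/mnt/raw/customer_updates/")        # placeholder source
target = DeltaTable.forPath(spark, "/mnt/delta/customers")     # placeholder Delta path

(target.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")   # placeholder key
 .whenMatchedUpdateAll()      # update changed rows
 .whenNotMatchedInsertAll()   # insert new rows
 .execute())
```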

Senior Data Engineer

3Cloud Solutions / Pella Corporation - Pella, IA

January 2020 to June 2022

· Assigned to a dev team of 7, including a Project Manager, a Data Architect, a Product Owner, and 4 Data Engineers, mandated to identify and build Big Data solutions to support the company’s plan to triple annual revenue.

· Worked with dev team and company stakeholders to evaluate existing system and identify, map out, and present a go-forward plan for a Big Data solution to support the company’s growth strategy.

· Presented a plan that included keeping the existing on-prem infrastructure and building an additional infrastructure on MS Azure for real-time purchase analysis and machine learning.

· Performed technical work based on an Agile project design/development/delivery methodology that included regular Scrums, Sprints, and Retrospective meetings.

· Utilized iObeya online visual management tool based on the Obeya management method.

· Worked in a cross-functional production environment engaging with multiple teams (e.g., ML Team, App Dev Team, Oracle Dev Team).

· Implemented Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Web Apps, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.

· Worked on Azure Data Factory pipelines to schedule jobs in Azure Databricks in the Azure cloud.

· Used Qlik to automatically produce real-time transaction streams into Kafka, enabling streaming analytics.

· Handled security by implementing and configuring Azure IAM (Identity and Access Management) and Azure Key Vaults.

· Implemented Delta Lake with ACID procedures and CDC using Delta Lake tables.

· Implemented Azure Synapse Analytics and Azure Databricks.

· Created Kafka topics for Kafka brokers to consume from and transfer data in a distributed system.

· Moved transformed data to the Spark cluster, where it was published to the application via Kafka.

· Designed and developed data pipelines in an Azure environment using ADF, Azure Databricks, Azure SQL, and Azure Synapse for analytics, and MS Power BI for reporting (see the sketch below).

· Utilized Git for versioning.
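
The Databricks-to-Azure-SQL hand-off referenced above might look roughly like this sketch. The server, database, table, and secret-scope names are hypothetical, and dbutils is the Databricks notebook utility (not available in plain Python), used here to pull credentials from a Key-Vault-backed secret scope as described in the bullets.

```python
# Minimal sketch: write a curated DataFrame from Databricks to Azure SQL over
# JDBC. Server, database, table, and secret-scope names are hypothetical;
# dbutils is predefined in Databricks notebooks and reads a Key-Vault-backed
# secret scope.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("curated.orders")                             # placeholder source table

jdbc_url = (
    "jdbc:sqlserver://example-server.database.windows.net:1433;"  # placeholder server
    "database=analytics;encrypt=true"
)

(df.write
 .format("jdbc")
 .option("url", jdbc_url)
 .option("dbtable", "dbo.orders_curated")                      # placeholder target
 .option("user", dbutils.secrets.get("kv-scope", "sql-user"))       # Databricks-only helper
 .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
 .mode("append")
 .save())
```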

Cloud Engineer

Bloomberg L.P. - New York, NY

January 2020 to January 2022

· Hands-on with AWS data migration between database platforms, from local SQL Servers to Amazon RDS and EMR Hive.

· Optimized Hive analytics SQL queries, created tables/views, and wrote custom queries and Hive-based exception processes.

· Worked as part of the Big Data Engineering team to design and develop data pipelines in an Azure environment using ADL Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, Azure Synapse for analytics and MS Power BI for reporting.

· Worked on AWS to form and manage EC2 instances and Hadoop Clusters.

· Implemented a Cloudera Hadoop distribution cluster using AWS EC2.

· Deployed the Big Data Hadoop application using Talend on AWS Cloud.

· Utilized AWS Redshift to store Terabytes of data on the Cloud.

· Used Spark SQL and DataFrames API to load structured and semi-structured data into Spark Clusters.

· Wrote shell scripts to move log files to the Hadoop cluster through automated processes.

· Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

· Ingested large data streams from company REST APIs into the EMR cluster through AWS Kinesis.

· Implemented AWS Fully Managed Kafka streaming to send data streams from the company APIs to Spark cluster in AWS Databricks.

· Streamed data from AWS Fully Managed Kafka brokers using Spark Streaming and processed the data using explode transformations.

· Finalized the data pipeline using DynamoDB as a NoSQL storage option (see the sketch below).

· Developed consumer intelligence reports based on market research, data analytics, and social media.

· Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.
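
The DynamoDB persistence step called out above might be sketched as follows with boto3; the region, table name, key schema, and record fields are illustrative placeholders and are not taken from the original pipeline.

```python
# Minimal sketch: persist processed records to DynamoDB with boto3.
# Region, table name, key schema, and the record fields are illustrative
# placeholders.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")   # placeholder region
table = dynamodb.Table("processed_events")                        # placeholder table

record = {
    "event_id": "evt-0001",            # partition key (placeholder)
    "ingested_at": "2021-06-01T12:00:00Z",
    "symbol": "ABC",
    "price": "101.25",                 # string/Decimal, since DynamoDB rejects floats
}

table.put_item(Item=record)
```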

Big Data Engineer

Cincinnati Financial Corporation - Fairfield, OH

November 2017 to January 2020

· Created HBase tables, loaded data, and wrote HBase queries to process the data.

· Created a cluster of Kafka brokers to fetch structured data in structured streaming.

· Designed HBase queries to perform data analysis, data transfer, and table design.

· Developed Cloud-based Big Data Architecture using Hadoop, AWS EMR and AWS Databricks.

· Created Hive and SQL queries to spot emerging trends by comparing historical data to currently collected batch data.

· Set up the Ambari open-source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters.

· Developed PySpark applications as ETL processes.

· Designed, developed, and tested Spark SQL clients with PySpark.

· Completed Hadoop data ingestion and Hadoop cluster handling in real-time processing using Apache Kafka as the ingestion tool and Apache Spark Streaming as the processing engine.

· Collected data using REST APIs, built HTTPS connection with client server, sent GET requests, and collected responses in Kafka Producer.

· Imported data from web services into HDFS using Apache Flume, Apache Kafka and transformed data using Spark.

· Developed Spark jobs using Spark SQL, PySpark, and DataFrames API to process structured, and unstructured data into Spark clusters.

· Utilized Spark RDD to create Spark DataFrames from unstructured data stored in AWS S3 and applied Spark DataFrame transformations to give a predefined schema to data and store into Hive tables for further analysis by data scientists.

· Split JSON files into Spark DataFrames to process in parallel in a fully distributed Hadoop cluster for better performance and fault tolerance.

· Decoded raw data from JSON and streamed it using the Kafka Producer API.

· Integrated Kafka with Spark Streaming for real-time data processing using DStreams.

· Used the Spark SQL context to parse out the needed data, selected features with target information, and assigned column names.

· Conducted exploratory data analysis and managed a dashboard for weekly reporting.

· Stored data pulled from diverse APIs into HBase on AWS Elastic MapReduce.

· Scheduled pipelines using Airflow DAGs to ingest data into HDFS and trigger spark-submit jobs for processing (see the sketch below).
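
A minimal sketch of that kind of Airflow scheduling follows, assuming the apache-airflow-providers-apache-spark package; the DAG id, schedule, file paths, and connection id are placeholders rather than the actual project configuration.

```python
# Minimal sketch of an Airflow DAG that ingests files to HDFS and then
# triggers a spark-submit job. Assumes apache-airflow-providers-apache-spark;
# DAG id, schedule, paths, and connection id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="ingest_and_process",            # placeholder
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    ingest_to_hdfs = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="hdfs dfs -put -f /data/landing/*.json /raw/events/",  # placeholder
    )

    process_with_spark = SparkSubmitOperator(
        task_id="process_with_spark",
        application="/opt/jobs/process_events.py",   # placeholder PySpark script
        conn_id="spark_default",
    )

    ingest_to_hdfs >> process_with_spark
```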

Big Data Developer

OMD Worldwide - New York, NY

March 2016 to November 2017

· Created DTS package to schedule the jobs for batch processing.

· Created indexes, constraints, and rules on database objects for optimization.

· Installed SQL Server client-side utilities and tools for all the front-end developers/programmers.

· Designed and developed database using ER diagram, normalization, and relational database concepts.

· Developed SQL Server Stored Procedures and tuned SQL Queries (using Indexes and Execution Plan).

· Developed User Defined Functions and created Views.

· Created Triggers to maintain the Referential Integrity.

· Implemented Spark using Python/Scala and utilized Spark Core, and Spark SQL for faster processing of data instead of MapReduce in Java.

· Used joins and sub-queries to simplify complex queries involving multiple tables and optimized the procedures and triggers to be used in production.

· Provided disaster recovery procedures and policies for backup and recovery of databases.

· Completed performance tuning in SQL Server 2000 using SQL Profiler and data loading.

· Involved in performance tuning to optimize SQL queries using query analyzer.

· Applied hands-on troubleshooting to the Hadoop Cluster and fixed technical issues.

Big Data Developer, Prospection and Production

Schlumberger - Sugar Land, TX

August 2012 to March 2016

· Utilized Omega Geophysical data processing and wrangling platform to transform seismic and microseismic data into intelligence to increase success rate across the Exploration and Production life cycle of oil and gas fields.

· Utilized Petrel E&P exploration data to create oil and gas reservoir models, and production data to calibrate the models for more accurate and robust recovery forecasts, improved reservoir predictions, and comprehensive quantitative interpretation.

· Utilized WellBOOK well intervention and stimulation design and simulation software to prepare optimized designs for future well intervention and stimulation, boosting overall oil and gas well and field production and keeping reservoir specifications optimal.

· Worked with business and analytics teams in gathering the system requirements and recommended solutions to meet customer business requirements.

· Implemented and performed big data processing using Hadoop, Hive, Sqoop.

· Imported/exported data between RDBMS like MS SQL Server, MySQL, PostgreSQL, and Hadoop HDFS using Apache Sqoop.

· Ingested data into RDBMS (MySQL, PostgreSQL, and SQL Server) using import statements.

· Configured JDBC drivers for each database to use Apache Sqoop.

· Moved data from HDFS into Hive external tables, converting CSV files to the Avro file format.

· Moved data from Hive external to internal tables with HQL, converting from Avro to the ORC/Parquet file formats (see the sketch below).

· Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.

· Developed workflows using Oozie for running MapReduce jobs and Hive queries.
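
The external-Avro to managed-ORC flow in the bullets above could be sketched as follows; the original work used HQL in Hive directly, but the same statements can be issued through Spark SQL with Hive support. Table names, columns, and the HDFS path are placeholders, and the Hive Avro SerDe is assumed to be available.

```python
# Minimal sketch of the external-Avro to managed-ORC flow, expressed as HQL
# statements issued through Spark SQL with Hive support. Table names, columns,
# and the HDFS path are placeholders; the Hive Avro SerDe is assumed present.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-to-orc-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table over Avro files already landed in HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_avro_ext (
        event_id STRING,
        event_ts STRING,
        payload  STRING
    )
    STORED AS AVRO
    LOCATION 'hdfs:///data/staging/events_avro/'
""")

# Managed (internal) table in ORC, populated from the external table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_orc
    STORED AS ORC
    AS SELECT * FROM events_avro_ext
""")
```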

Prospection & Production Data Processing and Analytics Engineer

Schlumberger - Cape Town, SA

November 2004 to August 2012

· Utilized Omega Geophysical data processing and wrangling platform to transform seismic and microseismic data into intelligence to increase success rate across the Exploration and Production life cycle of oil and gas fields.

· Utilized Petrel E&P exploration data to create oil and gas reservoir models, and production data to calibrate the models for more accurate and robust recovery forecasts, improved reservoir predictions, and comprehensive quantitative interpretation.

· Utilized WellBOOK well intervention and stimulation design and simulation software to prepare optimized designs for future well intervention and stimulation, boosting overall oil and gas well and field production and keeping reservoir specifications optimal.

· Worked with business and analytics teams in gathering the system requirements and recommended solutions to meet customer business requirements.

· Installed, configured, maintained, upgraded, and supported MySQL and SQL databases.

· Applied performance tuning to improve issues with a large, high-volume, multi-server MySQL and SQL installation for clients' job-applicant sites.

· Set up replication for disaster and point-in-time recovery. Replication was used to segregate various types of queries and simplify backup procedures.

· Programmed SQL functions to select information from database and send it to the front end upon user request.

· Performed database maintenance on a regular schedule and performed troubleshooting activities on the database.

· Translated logical database designs into actual physical database implementations.

· Performed security audit of MySQL internal tables and user access. Revoked access for unauthorized users.

· Provided customer assistance and support pertaining to database system queries and complaints.

Education

Master's in Physics

University Marien Ngouabi

Skills

• Programming Languages: Python, Scala.

• Operating Systems: UNIX/Linux, Windows.

• Testing Tools: PyTest.

• MS SQL Server

• Oracle

• DB2

• MySQL

• PostgreSQL

• Cassandra

• MongoDB.

• HiveQL

• SQL

• MapReduce

• Python

• PySpark

• Shell.

• Cloudera

• MapR

• Databricks

• AWS

• MS Azure

• GCP.

• Hive

• Kafka

• Hadoop

• HDFS

• Spark

• Cloudera

• Azure Databricks

• HBase

• Cassandra

• MongoDB

• Zookeeper

• Sqoop

• Tableau

• Kibana

• MS Power BI

• QuickSight

• Hive Bucketing and Partitioning

• Spark performance Tuning

• Optimization

• Spark Streaming.

• Cassandra

• Hadoop

• YARN

• HBase

• HCatalog

• Hive

• Kafka

• NiFi

• Airflow

• Oozie

• Spark

• Zookeeper

• Cloudera Impala

• HDFS

• MapR

• MapReduce

• Spark.

• Apache Spark

• Spark Streaming

• Storm

• Pig

• MapReduce.

• AWS S3

• EMR

• Lambda Functions

• Step Functions

• Glue

• Athena

• Redshift Spectrum

• Quicksight

• DynamoDB

• Redshift

• CloudFormation

• MS ADF

• Azure Databricks

• Azure Data Lake

• Azure SQL

• Azure HDInsight

• GCP

• Cloudera

• Anaconda Cloud

• Elastic.


