

Clovis Mepon Kamgang

Big Data/ Cloud/ Hadoop Engineer

Email: advf0t@r.postjobfree.com Phone: 256-***-****

PROFESSIONAL PROFILE

•8+ years’ combined experience in database/IT and in Big Data, Cloud, and Hadoop engineering.

•Hands-on experience developing Teradata PL/SQL Procedures and Functions and SQL tuning of large databases.

•Import and export Terabytes of data between HDFS and Relational Database Systems using Sqoop.

•Hands-on with Extract Transform Load (ETL) from databases such as SQL Server and Oracle to Hadoop HDFS in Data Lake.

•Write SQL queries, Stored Procedures, Triggers, Cursors and Packages.

•Utilize Python libraries for analytical processing.

•Display analytical insights through Python data visualization libraries and tools such as Matplotlib and Tableau.

•Develop Spark code for Spark-SQL/Streaming in Scala and PySpark.

•Integrate Kafka and Spark using Avro for serializing and deserializing data, and for Kafka Producer and Consumer.

•Convert Hive/SQL queries into Spark transformations using Spark RDD and Data Frame.

•Use Spark SQL to perform data processing on data residing in Hive.

•Configure Spark Streaming to receive real-time data using Kafka.

•Use Spark Structured Streaming for high performance, scalable, fault-tolerant real-time data processing.

•Write Hive/HiveQL scripts to extract, transform, and load data into databases.

•Configure Kafka clusters with Zookeeper for real-time streaming.

•Build highly available, scalable, fault-tolerant systems using Amazon Web Services (AWS).

•Hands-on with Amazon Web Services (AWS) cloud services such as EMR, EC2, S3, and EBS, and with IAM entities, roles, and users.

•Experience using Hadoop clusters, HDFS, and Hadoop ecosystem tools such as Spark, Kafka, and Hive for social and media data analytics.

•Highly knowledgeable in data concepts and technologies, including AWS pipelines and cloud repositories (Amazon AWS, MapR, Cloudera).

•Hands-on experience using Cassandra, Hive, NoSQL databases (HBase, MongoDB), and SQL databases (Oracle, SQL Server, PostgreSQL, MySQL), as well as data lakes and cloud repositories, to pull data for analytics.

•Experience with Microsoft Azure.

TECHNICAL SKILLS

Programming Languages: Python, Scala.

Databases: MS SQL Server, Oracle, DB2, MySQL, PostgreSQL, Cassandra, MongoDB.

Scripting: HiveQL, SQL, MapReduce, Python, PySpark, Shell.

Distributions: Cloudera, MapR, Databricks, AWS, MS Azure, GCP.

Big Data Primary Skills: Hive, Kafka, Hadoop, HDFS, Spark, Cloudera, Azure Databricks, HBase, Cassandra, MongoDB, Zookeeper, Sqoop, Tableau, Kibana, MS Power BI, QuickSight, Hive Bucketing and Partitioning, Spark Performance Tuning and Optimization, Spark Streaming.

Apache Components: Cassandra, Hadoop, YARN, HBase, HCatalog, Hive, Kafka, NiFi, Airflow, Oozie, Spark, Zookeeper, Cloudera Impala, HDFS, MapR, MapReduce.

Data Processing: Apache Spark, Spark Streaming, Storm, Pig, MapReduce.

Operating Systems: UNIX/Linux, Windows.

Cloud Services: AWS S3, EMR, Lambda Functions, Step Functions, Glue, Athena, Redshift Spectrum, Quicksight, DynamoDB, Redshift, CloudFormation, MS ADF, Azure Databricks, Azure Data Lake, Azure SQL, Azure HDInsight, GCP, Cloudera, Anaconda Cloud, Elastic.

Testing Tools: PyTest.

PROFESSIONAL EXPERIENCE

Big Data Engineer Sentar Inc., Huntsville, AL May 2020 – Present

(Sentar is a cyber intelligence company, applying analytics and systems engineering expertise for national security.)

•Worked with a team of developers specializing in RDBMS, mainframe, Unix scripting, Sqoop, and PySpark.

•Worked in a Hadoop environment with coding in Java and Python.

•Created data ingestion framework for multiple source systems using PySpark.

•Performed Spark optimizations based on shuffle reduction.

•Created mock Spark data frames to test Cassandra processes.

•Provided connections from Business Intelligence tools such as Tableau and Power BI to tables in the data warehouse.

•Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

•Wrote Java code to support API’s transformation of consumer requests into Cassandra queries.

•Implemented Kafka common consumer for storing published payment lifecycle events in Apache Cassandra.

•Wrote scripts using PySpark and Unix Shell Scripting.

•Worked on the Airflow scheduler to schedule Spark jobs using the Spark operator.

•Developed PySpark scripts for ingestion of structured and unstructured data.

•Modeled data based on business requirements using PySpark.

•Worked with Hadoop Admins to fix encoding issues on data files.

•Used Spark Streaming as a Kafka consumer to process consumer data.

•Wrote Spark SQL to create and read Cassandra tables.

•Wrote streaming data into Cassandra tables with Spark Structured Streaming (an illustrative sketch appears after this section).

•Wrote Bash script to gather cluster information for Spark submits.

•Wrote partitioned data into Hive external tables using Parquet format.

•Worked on AWS Glue to read Parquet files from an S3 bucket, write them out as CSV files to S3, and load them into Redshift tables using the COPY command.

•Worked on AWS Lambda to parse XML files, write CSV files to an S3 bucket, and load them into Redshift tables using the COPY command.

•Worked with the AWS Lambda boto3 API to read JSON files from an S3 bucket and write the data into a DynamoDB table (an illustrative sketch appears after this section).

•Worked on an Azure Data Factory pipeline to schedule jobs in Azure Databricks in the Azure cloud.

•Enforced YARN resource pools to share cluster resources across YARN jobs submitted by users.

•Documented requirements, including the available code which should be implemented using Spark, Hive, HDFS and ElasticSearch.

•Served an advisory role regarding Spark best practices.
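
Illustrative sketch of the Kafka-to-Cassandra streaming pattern referenced above. This is not project code: the topic, keyspace, table, schema, broker address, and checkpoint path are hypothetical, and the Kafka and spark-cassandra-connector packages are assumed to be available on the cluster.

# Minimal sketch: consume JSON events from Kafka with Spark Structured Streaming
# and write each micro-batch into Cassandra. All names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
         .getOrCreate())

schema = StructType([
    StructField("event_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the Kafka topic as a streaming DataFrame and parse the JSON payload.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
          .option("subscribe", "payment-events")               # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Write each micro-batch into Cassandra using the DataStax connector format.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="payments", table="events")  # hypothetical keyspace/table
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/payments")  # hypothetical path
         .start())
query.awaitTermination()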
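
Illustrative sketch of the Lambda pattern referenced above for reading a JSON file from S3 and writing it to DynamoDB with boto3. The bucket, key, table name, and record layout are hypothetical.

# Minimal sketch: an S3-triggered Lambda that parses a JSON file (a list of
# records) and batch-writes the records into a DynamoDB table.
import json
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingest-table")  # hypothetical table name

def lambda_handler(event, context):
    # S3 put events carry the bucket and object key that triggered the function.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read and parse the JSON object from S3.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    items = json.loads(body)

    # Batch-write the parsed records into DynamoDB.
    # Note: numeric attributes must be Decimal, not float; records here are
    # assumed to contain only strings and integers.
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)

    return {"statusCode": 200, "count": len(items)}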

Big Data Developer Crafty Apes LLC, El Segundo, CA Aug 2019 – Jan 2020

(Crafty Apes is a full-service visual effects company.)

•Managed data coming from different sources.

•Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for small-data-set processing and storage, and maintained the Hadoop cluster on AWS EMR.

•Developed Spark jobs to clean data obtained from various feeds and make it suitable for ingestion into Hive tables for analysis.

•Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.

•Used the NoSQL database MongoDB, implementing and integrating with different collections and documents.

•Developed Spark applications in Java and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

•Authored queries in AWS Athena to query files in S3 for data profiling.

•Imported and exported data between environments such as MySQL and HDFS and deployed into production.

•Worked on partitioning and bucketing in Hive tables and set tuning parameters to improve performance.

•Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.

•Processed data from Kafka topics using Spark Structured Streaming.

•Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.

•Developed multiple Kafka producers and consumers from scratch as per the requirement specifications.

•Moved data between Snowflake and S3 in both directions.

•Wrote data to Snowflake from PySpark applications (an illustrative sketch appears after this section).

•Wrote stored procedures, functions, and triggers in Snowflake.

•Developed Spark scripts using Scala shell commands as per requirements and developed Spark applications in Python (PySpark).

•Used Apache Flume and Kafka for collecting, aggregating, and moving data from various sources.
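
Illustrative sketch of writing a DataFrame from PySpark to Snowflake, as referenced above. The account URL, credentials, warehouse, source path, and target table are hypothetical, and the Snowflake Spark connector and JDBC driver are assumed to be on the cluster classpath.

# Minimal sketch: write a DataFrame to a Snowflake table via the Snowflake
# Spark connector. Credentials would normally come from a secrets manager.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pyspark-to-snowflake")
         .getOrCreate())

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",  # hypothetical account URL
    "sfUser": "etl_user",
    "sfPassword": "***",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Illustrative source: cleaned feed data already landed in HDFS as Parquet.
df = spark.read.parquet("hdfs:///data/clean/feeds")

(df.write
 .format("net.snowflake.spark.snowflake")
 .options(**sf_options)
 .option("dbtable", "FEED_EVENTS")  # hypothetical target table
 .mode("append")
 .save())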

Big Data Engineer Ghost VFX (Client), New York, NY Aug 2018 – Aug 2019

•Participated in design, development, and system migration of high-performance metadata-driven data pipeline with Kafka and Hive.

•Created PySpark streaming job to receive real time data from Kafka.

•Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.

•Provided cluster coordination services through Kafka and Python.

•Applied version control tools Git (GitHub) and Subversion (SVN), and the software build tool Apache Maven.

•Designed a data warehouse and performed the data analysis queries on Amazon Redshift clusters on AWS and Snowflake.

•Designed AWS Glue pipelines to ingest, process, and store data interacting with different services in AWS.

•Created Spark jobs that run in EMR clusters using EMR Notebooks.

•Developed Spark code in Python to run in the EMR clusters and load data into the Snowflake data warehouse.

•Created User Defined Functions (UDF) using Scala to automate some business logic in the applications.

•Designed the schema, cleaned up the input data, processed the records, wrote queries, and generated the output data using Redshift.

•Data modeling to define fact tables and dimension tables for the Enterprise Data Warehouse in Snowflake.

•Populated database tables via AWS Kinesis Firehose and AWS Redshift.

•Designed extensive automated test suites utilizing Selenium in Python.

•Contributed to a serverless architecture design using AWS APIs, Lambda, S3, and DynamoDB, with an optimized design and Auto Scaling for performance.

•Implemented usage of Amazon EMR for processing Big Data across Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

•Executed Hadoop/Spark jobs on AWS EMR using program data stored in S3 Buckets.

•Produced AWS CloudFormation templates to create custom infrastructure for our pipeline.

•Wrote unit tests for all Python code using PyTest (an illustrative sketch appears after this section).

•Worked on the data lake on AWS S3 to integrate it with different applications and development projects.

•Migrated data from a Hortonworks cluster to an Amazon EMR cluster.
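
Illustrative sketch of the PyTest style referenced above. The normalize_record helper below is a hypothetical example written for the sketch, not the project's actual code.

# Minimal sketch: a small transformation helper and the PyTest cases that cover it.
import pytest

def normalize_record(record):
    """Lower-case keys and strip whitespace from string values."""
    if not isinstance(record, dict):
        raise TypeError("record must be a dict")
    return {
        k.strip().lower(): (v.strip() if isinstance(v, str) else v)
        for k, v in record.items()
    }

def test_normalize_record_strips_and_lowercases():
    raw = {"  Name ": "  Alice  ", "AGE": 30}
    assert normalize_record(raw) == {"name": "Alice", "age": 30}

def test_normalize_record_rejects_non_dict():
    with pytest.raises(TypeError):
        normalize_record(["not", "a", "dict"])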

Cloud Engineer Bloomberg L.P. – New York, NY Jan 2017 – Aug 2018

(Bloomberg L.P. is a privately held financial, software, data, and media company headquartered in Midtown Manhattan, New York City)

•Optimized Hive analytics SQL queries, created tables/views, and wrote custom queries and Hive-based exception processes.

•Hands-on with AWS data migration between database platforms, from local SQL Servers to Amazon RDS and EMR Hive.

•Worked as part of the Big Data Engineering team to design and develop data pipelines in an Azure environment using ADL Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, Azure Synapse for analytics and MS Power BI for reporting.

•Worked on AWS to create and manage EC2 instances and Hadoop clusters.

•Implemented a Cloudera Hadoop distribution cluster using AWS EC2.

•Deployed the Big Data Hadoop application using Talend on the AWS cloud.

•Utilized AWS Redshift to store Terabytes of data on the Cloud.

•Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark clusters (an illustrative sketch appears after this section).

•Wrote shell scripts to move log files to the Hadoop cluster through automated processes.

•Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

•Ingested large data streams from company REST APIs into an EMR cluster through AWS Kinesis.

•Implemented AWS Fully Managed Kafka streaming to send data streams from the company APIs to Spark cluster in AWS Databricks.

•Streamed data from AWS Fully Managed Kafka brokers using Spark Streaming and processed the data using explode transformations.

•Finalized the data pipeline using DynamoDB as a NoSQL storage option.

•Developed consumer intelligence reports based on market research, data analytics, and social media.

•Extracted data from RDBMS (Oracle, MySQL) to Hadoop Distributed File System (HDFS) using Sqoop.

•Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.
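
Illustrative sketch of loading structured and semi-structured data with the DataFrames API and querying it through Spark SQL, as referenced above. The paths, column names, and join are hypothetical.

# Minimal sketch: read CSV (structured) and JSON (semi-structured) inputs,
# register them as temp views, and run a Spark SQL query for a simple insight.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-sql-insights")
         .getOrCreate())

# Structured input: CSV with a header row, schema inferred for brevity.
trades = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs:///landing/trades/*.csv"))

# Semi-structured input: newline-delimited JSON events.
events = spark.read.json("hdfs:///landing/events/*.json")

trades.createOrReplaceTempView("trades")
events.createOrReplaceTempView("events")

# Spark SQL / HQL-style query joining the two sources.
daily_volume = spark.sql("""
    SELECT t.symbol,
           to_date(t.trade_ts) AS trade_date,
           SUM(t.quantity)     AS total_qty
    FROM trades t
    JOIN events e ON t.trade_id = e.trade_id
    GROUP BY t.symbol, to_date(t.trade_ts)
""")

daily_volume.show(20)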

Hadoop Engineer The Coca-Cola Company, Atlanta, GA Jan 2015 – Dec 2016

(The Coca-Cola Company is an American multinational beverage corporation founded in 1892, best known as the producer of Coca-Cola)


•Worked in the Data & Analytics Technologies organization responsible for building cloud-based analytics products for APAC, EMEA, the Americas, and Corporate that directly impact Coca-Cola's business growth globally.

•Configured Kafka Producer with API endpoints using JDBC Autonomous REST Connectors.

•Configured a multi-node cluster of 10 Nodes and 30 brokers for consuming high volume, high-velocity data.

•Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational KQL queries, including joins.

•Developed distributed query agents for performing distributed queries against shards.

•Wrote producer/consumer scripts in Python to process JSON responses (an illustrative sketch appears after this section).

•Developed JDBC/ODBC connectors between Hive/Snowflake and Spark for the transfer of the newly populated data frame.

•Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume; staging data in HDFS for further analysis.

•Wrote complex queries over the API data in Apache Hive on the Hortonworks Sandbox.

•Utilized HiveQL to query the data to discover trends from week to week.

•Configured and deployed production-ready, multi-node Hadoop services (Hive, Sqoop, Flume, Airflow) on the Hadoop cluster with the latest patches.

•Created Hive queries to summarize and aggregate business queries by comparing Hadoop data with historical metrics.
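
Illustrative sketch of the JSON producer/consumer scripts referenced above, using the kafka-python client. The broker address, topic, and payload are hypothetical.

# Minimal sketch: publish a JSON payload to a Kafka topic and consume it back.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["broker1:9092"]   # hypothetical broker list
TOPIC = "api-responses"      # hypothetical topic

def produce(payload: dict) -> None:
    """Serialize a dict to JSON and publish it to the topic."""
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, value=payload)
    producer.flush()

def consume() -> None:
    """Read JSON messages from the topic and print a selected field."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        auto_offset_reset="earliest",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    for message in consumer:
        print(message.value.get("status"))

if __name__ == "__main__":
    produce({"status": "ok", "source": "rest-api"})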

EDUCATION

Pursuing Bachelor of Science: Network Engineering and Security

Western Governors University, Salt Lake City, UT

Associate of Applied Science: Networking Specialist (Windows & Unix/Linux)

Gwinnett Technical College, Lawrenceville, GA

Bachelor of Science: Information Technology

University of Ngaoundere, Cameroon

CERTIFICATIONS

•Certified: TestOut Linux Pro - 2022

•Certified: PR21 - PC Repair & Network - 2019

•Certified: Data Center Specialist Certificate - 2021

•Certified: NA21 - Network Administrator Certificate - 2021


