Big Data Engineer

Location: Pleasanton, CA
Posted: March 15, 2023

KOLADE ADELAJA

Contact: 925-***-**** (M); Email: ads9nu@r.postjobfree.com

Attuned to the latest trends and advancements in the field, I consistently deliver strong results through dedication to handling multiple functions and activities in high-pressure environments with tight deadlines.

BIG DATA ENGINEER

EXECUTIVE SNAPSHOT

IT professional with 6+ years of experience in Big Data development and 8+ years of overall experience including data administration; passionate about Big Data technologies and the delivery of effective solutions through creative problem solving, with a track record of building large-scale systems using Big Data technologies.

Excellent academic credentials with a B.Sc. in Computer Information Systems from Prairie View A&M University, along with working proficiency in Hadoop ecosystem components and related tools such as Apache Kafka, Spark, Elasticsearch, YARN, HDFS, Hive, Sqoop, Flume, Oozie, HBase, and ZooKeeper.

Experienced with Big Data technologies such as Amazon Web Services (AWS), Microsoft Azure, Apache Kafka, Apache Spark, Hive, Python, and Hadoop.

Well versed in applying

o Python-based design and development across multiple projects

o Designing custom reports using data extraction and reporting tools, and developing algorithms based on business cases.

Familiar with the Software Development Life Cycle, with experience in preparing and executing Unit Test Plans and Unit Test Cases after software development.

Skilled in creating PySpark DataFrames across multiple projects and integrating them with Kafka; experienced in building Kafka, Spark, and ELK clusters and integrating them with Hadoop clusters.

Demonstrated excellence in:

o Data migrations involving relational databases such as Oracle, MySQL, and PostgreSQL

o Search tools such as the ELK Stack (Elasticsearch, Logstash, and Kibana)

o Analyzing Microsoft SQL Server data models and identifying and creating inputs to convert existing dashboards that use Excel as a data source.

o Extracting real-time feeds using Kafka and Spark Streaming and processing the data as DataFrames to perform transformations and aggregations (see the sketch at the end of this section).

o Developing Spark code using Scala/Python and Spark SQL/Streaming for faster data processing and data enrichment.

Knowledgeable in performance-tuning data-heavy dashboards and reports using options such as extracts, context filters, efficient calculations, data source filters, indexing, and partitioning in the data source.

Adept at providing clear and effective testing procedures and systems to ensure an efficient process; clearly documents big data systems, procedures, governance, and policies.
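
To illustrate the Kafka and Spark Streaming work referenced above, the following is a minimal PySpark Structured Streaming sketch. The broker address, topic name, event schema, and HDFS paths are hypothetical placeholders, and it assumes the spark-sql-kafka package is available on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, window, sum as sum_
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    # Hypothetical event schema; real projects derive schemas from the source systems.
    schema = StructType([
        StructField("store_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

    # Read the raw Kafka feed as a streaming DataFrame (broker and topic are placeholders).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "transactions")
           .load())

    # Parse the JSON payload, then aggregate per store over 5-minute windows.
    events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")
    agg = (events
           .withWatermark("event_time", "10 minutes")
           .groupBy(window(col("event_time"), "5 minutes"), col("store_id"))
           .agg(sum_("amount").alias("total_amount")))

    # Persist the aggregated stream to HDFS as Parquet with checkpointing.
    query = (agg.writeStream
             .outputMode("append")
             .format("parquet")
             .option("path", "hdfs:///data/tx_agg")
             .option("checkpointLocation", "hdfs:///checkpoints/tx_agg")
             .start())
    query.awaitTermination()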

PROFESSIONAL EXPERIENCE

Mar 2022 - Nov 2022 with Safeway & Albertsons, Pleasanton, California

As a Tech Lead / Data Engineer

Created data pipelines to ingest data from local REST APIs using Kafka

Developed and maintained continuous integration systems in a Cloud computing environment (GCP).

Wrote SQL queries for analyzing data in GCP using BigQuery

Built SQL views on top of source data tables in BQ

Wrote Validation scripts for SQL queries

Used Kafka as a messaging system to implement real-time streaming solutions

Utilized a cluster of Kafka brokers to handle replication needs and allow for fault tolerance.

Worked with different data formats such as Avro, CSV, Parquet and JSON.

Led the team in QA process improvement to clearly understand and communicate all aspects of ETL/ELT testing involving multiple source systems.

Configured, deployed, and supported Cloud services (GCP)

Initiated Data Transfer jobs from GCS buckets to BQ

Created Table Views in BQ from data sources

Used Stonebranch to run ETL jobs on GCP

Wrote and documented technical test strategies, test plans, test cases, etc. to guarantee quality assurance in a data warehouse environment on GCP.

Handled schema changes in data stream using Kafka.

Put schema checks in place; error messages were directed to the error topic

Created partitioned tables in BQ to support efficient queries (see the sketch at the end of this section)

Analyzed and balanced the impact of Kafka producer output against consumer (topic) consumption.

Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.

Coordinated Kafka operation and monitoring with DevOps personnel

Designed Architecture Diagrams, Sequence diagrams and Flow Charts for the project

Set up calls with Directors, Solutions Architects, and Managers to get approvals and recommendations as needed.

Performed gradual cleansing and modeling of datasets.

Worked cross-platform with other teams to achieve project goals and deliverables.
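
A minimal sketch of the BigQuery side of this work (a date-partitioned table loaded from a GCS bucket, plus a SQL view), using the google-cloud-bigquery Python client. The project, dataset, table, and bucket names are placeholders, not the actual Safeway/Albertsons resources.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")  # placeholder project

    # Load Parquet files from a GCS bucket into a date-partitioned BigQuery table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        time_partitioning=bigquery.TimePartitioning(field="order_date"),
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/orders/*.parquet",      # placeholder bucket/path
        "my-gcp-project.retail.orders",         # placeholder destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish

    # Build a SQL view on top of the source table for downstream reporting.
    view = bigquery.Table("my-gcp-project.retail.daily_order_totals")
    view.view_query = """
        SELECT order_date, store_id, SUM(total) AS daily_total
        FROM `my-gcp-project.retail.orders`
        GROUP BY order_date, store_id
    """
    client.create_table(view, exists_ok=True)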

Feb 2020 - Feb 2022 with Edward Jones Investments, St. Louis, MO

As a Big Data Engineer

Designed Spark Streaming jobs to receive real-time data from Kafka and store the streamed data in HDFS.

Created Hive tables, loaded them with data, and wrote Hive queries.

Created Hive external tables and designed information models in Hive.

Utilized Spark Streaming with Kafka and HDFS to create an ETL pipeline.

Optimized data storage in Hive using partitioning and bucketing mechanisms on the managed and external tables.

Used Spark SQL and Data Frames API to load structured and semi-structured information into Spark Clusters.

Used Spark API over Hadoop YARN to perform analytics on information in Hive.

Migrated ETL jobs to Python scripts that performed transformations, joins, and aggregations before loading data into HDFS.

Wrote and deployed a Python script to find all Redshift tables not used in production code and send a Slack notification reminding the Engineering team to drop these tables.

Used the Airflow scheduler to run Spark jobs via the Spark operator (see the sketch at the end of this section).

Integrated Hadoop with Active Directory and enabled Kerberos for Authentication.

Responsible for coordinating data analysis and data warehousing testing: test case writing, execution, environment planning, business QA support, and automation testing.

Worked on disaster recovery planning for the Hadoop cluster.

Collaborated on a Hadoop cluster (CDH) and reviewed log files of all daemons.

Configured, deployed, and supported Cloud services, including Amazon Web Services (AWS), EC2, and Azure.

Worked with leadership and the engineering team to address challenges and continuously improve automation capabilities.

Built tools to support automated testing of products.

Designed an automated test system and tested APIs in QA using Spark/PySpark and Kafka to consume events from disparate internal APIs.

Hands-on with Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Web Apps, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO), and SQL Azure.

Performed performance tuning for Spark Streaming (e.g., setting the right batch interval, the correct number of executors, and appropriate parallelism and memory settings).

Collected, aggregated, and moved information from servers to HDFS using Apache Spark and Spark Streaming.

Performed storage capability management, performance calibration and benchmarking of clusters.

Worked on importing and exporting claims information between the Hadoop Distributed File System (HDFS) and relational database management systems (RDBMS).

Imported and exported dataset transfers between legacy databases and HDFS.

Delivered quality through automated testing by taking ownership of the design, development, and execution of test automation in both QA and Production environments.

Implemented workflows using the Apache Airflow framework to automate tasks and configured YARN resource pools to share cluster resources among YARN jobs submitted by users.
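
As a rough illustration of the Airflow-driven Spark scheduling mentioned above, here is a minimal DAG sketch. It assumes Airflow 2.x with the Apache Spark provider installed; the DAG id, schedule, application path, connection id, and YARN queue are all hypothetical.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    # Daily DAG that submits a PySpark ETL job to the cluster (all names are placeholders).
    with DAG(
        dag_id="daily_spark_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_etl = SparkSubmitOperator(
            task_id="run_spark_etl",
            application="/opt/jobs/etl_job.py",   # hypothetical PySpark script
            conn_id="spark_default",              # Spark connection configured in Airflow
            conf={"spark.yarn.queue": "etl"},     # example YARN resource pool setting
        )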

Apr 2018 - Feb 2020 with The Hanover Insurance Group, Inc., Worcester, MA

As a Big Data Engineer

Designed and developed ETL workflows using Python and Scala for processing data in HDFS.

Serialized and deserialized data using Python lambda functions.

Used Scala for concurrency support, which is key to parallelizing the processing of large datasets.

Created Hive external tables, loaded data into them, and queried the data using HQL.

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker (see the sketch at the end of this section).

Automated, configured, and deployed instances on AWS, Azure environments and Data centers.

Configured, deployed, and supported Cloud services, including Amazon Web Services (AWS), EC2, and GCP.

Built and analyzed Regression model on Google Cloud Platform using PySpark.

Developed and maintained ETL jobs.

Created and maintained a data warehouse in AWS Redshift.

Created and managed external tables to store ORC and Parquet files using HQL.

Created a NoSQL HBase database to store the processed data from Apache Spark.

Developed Apache Airflow DAGs to automate the pipeline.

Developed Spark Streaming application to pull data from cloud to Hive and HBase.

Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming and Hive.

Collected, aggregated, and shuffled data from servers to HDFS using Apache Spark and Spark Streaming.

Streamed prepared information to HBase utilizing Spark.

Used HBase connector for Spark.

Handled schema changes in data stream using Kafka.

Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.

Coordinated Kafka operation and monitoring with DevOps personnel.

Analyzed and balanced the impact of Kafka producer output against consumer (topic) consumption.

Performed gradual cleansing and modeling of datasets.

Utilized Avro-tools to build Avro schemas and create external Hive tables using PySpark.
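
A minimal sketch of a Kafka producer like the one referenced above, written with the kafka-python client. The API endpoint, topic name, and broker address are placeholders rather than actual Hanover systems.

    import json
    import requests
    from kafka import KafkaProducer

    # Producer that serializes records as JSON before sending them to the broker.
    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],                        # placeholder broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Pull records from an external source (hypothetical REST endpoint) and publish them.
    response = requests.get("https://api.example.com/policies", timeout=30)
    for record in response.json():
        producer.send("policy-events", value=record)               # placeholder topic

    producer.flush()  # make sure all buffered messages reach the broker
    producer.close()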

Nov 2015 - Apr 2018 with Tesla Inc., Palo Alto, CA

As a Big Data Developer

Configured a Flume agent for ingestion of data from source APIs and storage in HDFS.

Created data pipelines to ingest data from local REST APIs using Kafka and Python.

Developed and maintained continuous integration systems in a Cloud computing environment (Azure).

Constructed Hive views on top of the source data tables.

Installed and configured Hive, Sqoop, Flume, Oozie on the Hadoop cluster.

Developed multiple Spark jobs using SQLContext and connected them to Hive (see the sketch at the end of this section).

Wrote Hive queries for analyzing data in HDFS.

Built Hive views on top of source data tables.

Loaded ingested data into Hive Managed and External tables.

Utilized HiveQL to query the data to discover transaction trends.

Used Cloudera Manager for installation and management of single-node and multi-node Hadoop cluster.

Transformed data from unstructured to structured data frames for data analysis.

Loaded and transformed large sets of structured, semi-structured, and unstructured data.

Used Kafka as a messaging system to implement real-time streaming solutions using Spark Streaming.

Used Spark to load batches of Data Frames.

Utilized a cluster of Kafka brokers to handle replication needs and allow for fault tolerance.

Extracted metadata from Hive tables with Presto.

Worked with different data formats such as Avro, CSV, Parquet, JSON, and ORC.
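
To illustrate the Spark-to-Hive work in this role, below is a minimal PySpark sketch that loads ingested data into a Hive table and queries it for transaction trends. It uses the modern SparkSession API rather than the older SQLContext, and the database, table, column, and path names are hypothetical.

    from pyspark.sql import SparkSession

    # Spark session with Hive support so DataFrames can read and write Hive tables.
    spark = (SparkSession.builder
             .appName("hive-transactions-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Load a batch of ingested Parquet files (placeholder HDFS path) into a DataFrame.
    df = spark.read.parquet("hdfs:///landing/transactions/")

    # Write the batch into a managed Hive table, then query it with HiveQL.
    df.write.mode("append").saveAsTable("retail.transactions")

    trends = spark.sql("""
        SELECT vehicle_model, COUNT(*) AS orders
        FROM retail.transactions
        GROUP BY vehicle_model
        ORDER BY orders DESC
    """)
    trends.show()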

Oct 2013 - Nov 2015 with Babcock United Technologies Corporation, Farmington, CT

As a Software/Database Programmer/Analyst

Assigned to a development team that designed and programmed a variety of customized software programs.

Programmed a range of functions (e.g., automate logistical tracking & analysis, automate schematic measurements)

Developed client-side testing/validation using JavaScript.

Worked hands-on with technologies such as XML, Java, and JavaScript.

Installed MySQL servers, configured tables, and supported database operations.

Implemented mail alert mechanism for alerting users when their selection criteria were met.

Conducted unit and systems integration tests to ensure system functioned to specification.

Established communications interfacing between the software program and the database backend.

ACADEMIC CREDENTIALS

M.Sc. in Computer Information Systems

Prairie View A&M University

TECHNICAL SKILLS

Languages: Spark-Scala, Spark-Python (PySpark), SQL, R

Scripting Languages: Unix shell scripting, HiveQL, Spark SQL, SQL Azure

Data Warehousing: Hive, Redshift

Hadoop Distributions: Hadoop, Cloudera Hadoop, Hortonworks Hadoop, YARN

IDEs/Development Tools: Jupyter Notebook, Eclipse, PyCharm, IntelliJ

Visualization: Tableau

Version Control: SVN, GitHub

Databases: SQL Server, Oracle, MySQL, HBase, Presto

Cloud Platforms/Services: AWS, Azure, Cloudera

Operating Systems: Windows, Linux, Unix

Data Formats: Avro, CSV, Parquet, JSON

Tools: Sqoop, Oozie, Flume, Kafka, Airflow, Spark Streaming
