
AWS BIG DATA ENGINEER

Location: San Antonio, TX
Posted: November 25, 2022

BILL MUTABAZI

Big Data Engineer

Email: ************@*****.*** Phone: 804-***-****

Professional Summary

I am a Hadoop Big Data Engineer and Developer with skills across legacy Hadoop ecosystems, Cloudera Hadoop, Hortonworks Hadoop, Amazon Web Services, Azure, and IBM Cloud. I have 6+ years of in-depth experience with Hadoop architecture and its components, such as YARN and HDFS, and I am skilled in the use of Spark, Spark Streaming, Spark SQL, Kafka, and programming languages including Python and Scala.

Skills & Abilities

IDE:

Jupyter Notebooks (formerly IPython Notebooks), Eclipse, IntelliJ, PyCharm

PROJECT METHODS:

Agile, Scrum, Test-Driven Development, Unit Testing

BIG DATA PROCESSING:

Spark, Spark Streaming, Kafka and Hadoop.

HADOOP DISTRIBUTIONS:

Hadoop, Cloudera Hadoop, Hortonworks Hadoop

CLOUD PLATFORMS:

Amazon AWS, IBM Cloud, Microsoft Azure

CLOUD DATABASE & TOOLS:

Apache HBase, SQL, MongoDB, DynamoDB, Oracle

PROGRAMMING LANGUAGES:

Java, Python, Scala, PySpark.

SCRIPTING:

Hive, MapReduce, SQL, Spark SQL, Shell Scripting

CONTINUOUS INTEGRATION (CI-CD):

Jenkins, GitLab

VERSIONING:

Git, GitHub, Gitlab

PROGRAMMING METHODOLOGIES:

Object-Oriented Programming, Functional Programming

FILE FORMAT AND COMPRESSION:

CSV, JSON, Avro, Parquet, ORC

FILE SYSTEMS:

HDFS, Data Lake

ETL TOOLS:

Apache Camel, Flume, Kafka, Sqoop, Spark, AWS Glue, Apache NiFi

DATA VISUALIZATION TOOLS:

Tableau, Power BI

SEARCH TOOLS:

Apache Lucene, Elasticsearch

SECURITY:

Kerberos, Apache Ranger

AWS:

Glue, DynamoDB, Redshift, Lambda, S3, EC2, Kinesis, EMR, QuickSight, Databricks

Data Query:

Spark SQL, Data Frames

Experience

AWS BIG DATA ENGINEER CAPITAL ONE

RICHMOND, VA JUN 2021 - PRESENT

•Developed multiple Spark Streaming and batch jobs using Python on AWS EMR (a minimal streaming sketch appears after this role's bullets).

•Utilized programming languages such as Java, Scala, and Python, along with open-source RDBMS, NoSQL databases, and cloud-based data warehousing services such as Redshift and Snowflake.

•Defined and implemented data modeling techniques for the corporate data warehouse in AWS Redshift.

•Helped the business estimate realistic time frames for epics, stories, and feature implementation using Agile and Jira.

•Wrote numerous Python scripts using pytest to test code in a QA environment, covering all foreseen test cases.

•Worked with AWS Lambda functions for event-driven processing using the AWS boto3 module in Python (a minimal handler sketch appears after this role's bullets).

•Used a variety of tools to create jobs in AWS, including Airflow, Python, Scala, Jenkins, and Amazon Web Services (S3, OneLake, EMR, and EC2).

•Built well-managed data solutions, tools, and capabilities to enable self-service frameworks for data consumers.

•Used Agile methodology with daily stand-ups, sprint planning, and retrospective meetings within two-week sprints.

•Documented processes using Confluence for reference articles, GitHub for source code repository management, Google Drive for video and large-file storage, Jira for backlog tracking, and encrypted internal tools for confidential documents.

•Created data models for RDBMS, NoSQL databases, and cloud-based data warehousing services such as Redshift and Snowflake.

•Monitored performance while enhancing security by resolving security vulnerabilities, incrementing feature versions to track what had been addressed.

•Ensured existing pipelines and their source code delivered the intended data through a continuous integration and continuous delivery (CI/CD) foundation.
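
Illustrative sketch for the Spark Streaming work above: a minimal PySpark Structured Streaming job, assuming a Kafka source and S3 output; the broker address, topic name, schema fields, and paths are hypothetical placeholders, not details from the role.

```python
# Minimal PySpark Structured Streaming sketch (illustrative only).
# Broker, topic, schema, and S3 paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("example-stream").getOrCreate()

schema = (StructType()
          .add("account_id", StringType())
          .add("amount", DoubleType()))

# Read a Kafka topic as a streaming DataFrame and parse the JSON payload.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Write parsed events to S3 as Parquet with checkpointing.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://example-bucket/streams/transactions/")
         .option("checkpointLocation", "s3://example-bucket/checkpoints/transactions/")
         .start())

query.awaitTermination()
```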
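
Illustrative sketch for the Lambda/boto3 work above: a minimal event-driven handler that logs newly created S3 objects to DynamoDB; the table and bucket details are hypothetical.

```python
# Minimal AWS Lambda handler sketch using boto3 (illustrative only).
# The S3 event shape is the standard notification format; the DynamoDB
# table name is hypothetical.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("example-ingest-log")  # hypothetical table

def lambda_handler(event, context):
    # Triggered by an S3 "ObjectCreated" event notification.
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        # Log the new object's metadata for downstream processing.
        table.put_item(Item={"object_key": key, "bucket": bucket, "size": size})
    return {"processed": len(records)}
```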

AWS BIG DATA ENGINEER EXPEDIA INC.

SEATTLE, WA MAY 2019 – JUN 2021

•Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS

•Responsible for designing logical and physical data models for various data sources on AWS Redshift

•Defined and implemented a schema for a custom HBase table

•Created Apache Airflow DAGs using Python (a minimal DAG sketch appears after this role's bullets)

•Wrote numerous Spark programs in Scala for data extraction, transformation, and aggregation across multiple file formats.

•Worked with AWS Lambda functions for event-driven processing using the AWS boto3 module in Python.

•Validated models using both classification reports and ROC curves

•Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

•Performed exploratory data analysis using Pandas.

•Configured inbound and outbound traffic access for RDS database services, DynamoDB tables, and EBS volumes, and set alarms for notifications or automated actions on AWS

•Developed AWS CloudFormation templates to create the custom infrastructure for our pipeline

•Implemented AWS IAM user roles and policies to authenticate and control access

•Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS

•Developed Spark programs using PySpark.

•Created user-defined functions (UDFs) in Spark using Python (a minimal UDF sketch appears after this role's bullets).

•Worked on AWS Kinesis for processing large volumes of real-time data

•Developed scripts for collecting high-frequency log data from various sources and integrating it into AWS using Kinesis, staging the data in the data lake for further analysis.

•Worked with different data science teams and provided respective data as required on an ad-hoc request basis

•Hands-on experience in Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.

•Responsible for ensuring Systems & Network Security, maintaining performance and setting up monitoring using CloudWatch and Nagios.

•Experience working with version control tools such as Git/GitHub and Subversion (SVN), and software build tools such as Apache Maven and Apache Ant.

•Designed, developed, and tested Spark SQL jobs with Scala and Python Spark consumers

•Worked extensively on the CI/CD pipeline for code deployment, engaging different tools (Git, Jenkins, CodePipeline) from developer code check-in through production deployment

•Created and maintained ETL pipelines in AWS using Glue, Lambda and EMR
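
Illustrative sketch for the Airflow work above: a minimal Python DAG with three chained tasks; the DAG id, schedule, and task logic are assumptions, using the Airflow 2.x import style.

```python
# Minimal Apache Airflow DAG sketch (illustrative only).
# DAG id, schedule, and task callables are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def transform():
    print("clean and reshape the extracted data")

def load():
    print("write the transformed data to the warehouse")

with DAG(
    dag_id="example_daily_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Run extract, then transform, then load.
    t1 >> t2 >> t3
```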
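
Illustrative sketch for the Spark UDF work above: a minimal PySpark user-defined function; the column names and normalization rule are assumptions.

```python
# Minimal PySpark user-defined function (UDF) sketch (illustrative only).
# Column names and the normalization rule are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

def normalize_city(name):
    # Trim whitespace and title-case the city name; pass nulls through.
    return name.strip().title() if name else None

normalize_city_udf = udf(normalize_city, StringType())

df = spark.createDataFrame([("  seattle ",), ("NEW YORK",)], ["city"])
df.withColumn("city_clean", normalize_city_udf("city")).show()
```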

BIG DATA ENGINEER NET APPS, INC.

SUNNYVALE, CA DEC 2017 – MAY 2019

•Implemented solutions for ingesting data from various sources and processing data at rest utilizing big data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.

•Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, Spark paired RDDs, and Spark on YARN.

•Created ETL pipelines using different processors in Apache NiFi.

•Experienced developing and maintaining ETL jobs.

•Performed data profiling and transformation on raw data using Pig, Python, and Oracle

•Experienced with batch processing of data sources using Apache Spark.

•Developed predictive analytics using Apache Spark Scala APIs.

•Created Hive external tables, loaded data into them, and queried the data using HQL (a minimal example appears after this role's bullets).

•Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.

•Developed Spark code using Scala and Spark-SQL for faster testing and data processing.

•Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in CSV format.
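
Illustrative sketch for the Hive external table work above: the DDL and query are plain HQL, issued here through a Hive-enabled SparkSession; the table name, columns, and HDFS location are assumptions.

```python
# Minimal sketch: create and query a Hive external table via HQL,
# issued through a Hive-enabled SparkSession (illustrative only).
# Table name, columns, and HDFS path are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-table-example")
         .enableHiveSupport()
         .getOrCreate())

# External table pointing at raw CSV files already in HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
        order_id STRING,
        amount DOUBLE,
        order_date STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs:///data/raw/sales/'
""")

# Query the external table with HQL.
spark.sql("""
    SELECT order_date, SUM(amount) AS daily_total
    FROM sales_raw
    GROUP BY order_date
""").show()
```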

HADOOP DEVELOPER METLIFE INSURANCE COMPANY

NEW YORK, NY MAR 2014 – DEC 2015

•Worked with technology and business groups for Hadoop migration strategy.

•Created MapReduce jobs for data processing using Java as the programming language.

•Researched and recommended a suitable technology stack for Hadoop migration considering the current enterprise architecture.

•Validated and advised on Hadoop infrastructure and data center planning considering data growth.

•Transferred data to and from clusters, using Sqoop and various storage media such as flat files.

•Developed MapReduce programs and Hive queries to analyze sales patterns and customer satisfaction indexes over data in various relational database tables (a minimal mapper/reducer sketch appears after this role's bullets).

•Orchestrated hundreds of Sqoop scripts, Pig scripts, and Hive queries using Oozie workflows and sub-workflows

•Worked extensively on performance optimization by adopting appropriate design patterns for MapReduce jobs, analyzing I/O latency, map time, combiner time, reduce time, etc.

•Followed Agile methodology for the entire project.
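
Illustrative sketch for the MapReduce analysis above: the actual jobs were written in Java, but the same map/reduce pattern is shown here as a minimal Hadoop Streaming script in Python; the CSV input format (region, sale amount) is an assumption.

```python
# Minimal Hadoop Streaming mapper/reducer sketch in Python (illustrative
# only; the jobs described above were written in Java). Assumes CSV input
# lines of the form "region,amount" and sums sales per region.
import sys

def mapper():
    # Emit "region<TAB>amount" for each valid input line.
    for line in sys.stdin:
        parts = line.strip().split(",")
        if len(parts) == 2:
            region, amount = parts
            print(f"{region}\t{amount}")

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so totals
    # can be accumulated per contiguous run of the same region.
    current_region, total = None, 0.0
    for line in sys.stdin:
        region, amount = line.strip().split("\t")
        if region != current_region:
            if current_region is not None:
                print(f"{current_region}\t{total}")
            current_region, total = region, 0.0
        total += float(amount)
    if current_region is not None:
        print(f"{current_region}\t{total}")

if __name__ == "__main__":
    # Run as "script.py map" or "script.py reduce" via hadoop-streaming.
    mapper() if sys.argv[1] == "map" else reducer()
```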

Education

MASTER’S DEGREE IN COMPUTER SCIENCE

University of Nebraska-Lincoln, Lincoln - NE

-Graduated with Distinction: 3.81/4 GPA

BACHELOR’S DEGREE IN INFORMATION TECHNOLOGY & BUSINESS MANAGEMENT

William Penn University, Oskaloosa - IA.

-Graduated with Distinction: 3.73/4 GPA

Certifications

●IBM Cloud V3 Certified

●AZ-900 Microsoft Azure Fundamentals Certified

●Hadoop 101 - Cognitive Class

●Big Data 101 - Cognitive Class

●Alpha Lambda Delta, and Sigma Beta National Academic Honor Society


