
AWS Big Data Engineer



WORK EXPERIENCE

AWS Big Data Engineer with Bitwise Inc. – Schaumburg, IL, Jun’21-Present

Client: Mercola.com

Bitwise leverages its deep technological expertise to develop solutions for its clients' most complex business problems. More than 20 years of consistently delivering high-performance solutions to market-leading clients in diverse industries across the world ensures that our tools, processes, and people are above par and offer singular value to every engagement. Partnering with Bitwise yields long-term business gains, as clients deliver more and deliver faster because they stay ahead of the curve in a dynamic business landscape.

Designed a data warehouse and ran data analysis queries on Amazon Redshift clusters on AWS

Worked on the data lake on AWS S3, integrating it with different applications and development projects

Developed AWS Lambda functions in Python

Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)

Created Spark jobs that run in EMR clusters using EMR Notebooks

Developed Spark programs in Python to run in the EMR clusters

Deployed the ELK (Elasticsearch, Logstash, Kibana) stack in AWS to gather and investigate the logs created by the website

Wrote unit tests for all code using frameworks such as PyTest

Architected a serverless design using AWS APIs, Lambda, S3, and DynamoDB, optimized with auto scaling for performance (a minimal sketch of this pattern appears at the end of this section)

Designed the schema, cleaned up the input data, processed the records, wrote queries, and generated the output data using Redshift

Populated database tables via AWS Kinesis Firehose, Athena, and AWS Redshift

Created user-defined functions (UDFs) in Scala to automate some of the business logic in the applications.

Designed AWS Glue pipelines to ingest, process, and store data interacting with different services in AWS

Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets

Developed AWS CloudFormation templates to create a custom infrastructure for our pipeline

Implemented AWS IAM user roles, instance profiles, and policies to authenticate and control access
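
A minimal sketch of the serverless pattern referenced above: a Python Lambda handler behind an API that persists a payload to DynamoDB and archives the raw JSON to S3. The table and bucket names are hypothetical placeholders for illustration, not the actual project resources.

import json
import boto3

# Placeholder resource names, for illustration only
TABLE_NAME = "events"        # hypothetical DynamoDB table
BUCKET_NAME = "raw-events"   # hypothetical S3 bucket

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def handler(event, context):
    # Parse the API payload, store a compact item in DynamoDB, and archive the raw JSON to S3
    record = json.loads(event.get("body") or "{}")
    record_id = str(record.get("id", "unknown"))
    dynamodb.Table(TABLE_NAME).put_item(Item={"id": record_id, "payload": json.dumps(record)})
    s3.put_object(Bucket=BUCKET_NAME,
                  Key=f"events/{record_id}.json",
                  Body=json.dumps(record).encode("utf-8"))
    return {"statusCode": 200, "body": json.dumps({"stored": record_id})}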

Big Data Engineer with Bitwise Inc. – Schaumburg, IL, Nov’19-Jun’21

Client: Homeserve

Worked as part of the Big Data Engineering team to design and develop data pipelines in an Azure environment using ADLS Gen2, Blob Storage, ADF (Data Factory), Azure Databricks, Azure SQL, and Azure Synapse for analytics, with MS Power BI for reporting.

Designed HiveQL and SQL queries to retrieve data from the data warehouse and created views for end users to consume

Developed Spark SQL scripts to manage various data sets and verified their performance using the Spark UI

Used Spark to filter and format the data before designing the sink that stores it in the Hive data warehouse (a minimal sketch follows at the end of this section).

Created Hive tables to store data in different formats coming from different data sources.

Maintained the Hive metastore tables to store the metadata of the tables in the data warehouse.

Automated the ingestion pipeline using bash scripts and cron jobs to run the ETL pipeline daily.

Imported data from the local file system and RDBMS into HDFS using Sqoop, and developed a workflow using shell scripts to automate the tasks of loading the data into HDFS

Evaluated various data processing techniques available in Hadoop from various perspectives to detect aberrations in data.

Connected Business Intelligence tools such as Tableau and Power BI to the tables in the data warehouse.

Cleaned up the input data, specified the schema, processed the records, wrote UDFs, and generated the output data using Hive

Implemented a streaming job with Apache Kafka to ingest data from a REST API.

Utilized gradient-boosted trees and random forests to create a benchmark for potential accuracy.
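
A minimal PySpark sketch of the filter-and-load step described above, assuming Hive support is enabled on the cluster; the source path, column names, and target table are hypothetical placeholders, not the actual project objects.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive support lets saveAsTable register the result in the Hive metastore
spark = (SparkSession.builder
         .appName("filter-and-load")      # illustrative app name
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical landing path and columns, for illustration only
raw = spark.read.json("/landing/claims/")
clean = (raw
         .filter(F.col("status").isNotNull())
         .withColumn("event_date", F.to_date("event_ts"))
         .select("claim_id", "status", "event_date"))

# Write into the Hive warehouse, partitioned by date
(clean.write
      .mode("append")
      .partitionBy("event_date")
      .format("parquet")
      .saveAsTable("analytics.claims_clean"))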

Data Engineer with Credit Suisse, Raleigh, NC, May’16-Jun’18

Remote Project

Involved in building ETL data warehousing/pipeline solutions for our customers, pulling data from various sources and file formats.

The project consisted of a data migration, mainly from Oracle Exadata servers to Snowflake (staging area, Snowpipe) in the cloud.

Developed a PySpark application as an ETL job to read data from various file system sources, apply transformations, and write to NoSQL databases (MongoDB/Cassandra)

Built a series of Glue jobs in AWS to move data from disparate sources and write to multiple target systems such as RDS (MySQL), DynamoDB, MongoDB, on-prem Oracle databases, and S3 buckets, using an EMR cluster with PySpark as the computing engine.

All Glue job scripts were written in PySpark; they read the records in PostgreSQL/MySQL tables and write the data in JSON and Parquet formats into S3 so that it can be referenced in Snowflake by storage integration (a minimal sketch follows at the end of this section).

Extracted data from different databases and scheduled Apache workflows to execute the tasks daily

Architected a lightweight Kafka broker and integrated Kafka with Spark for the real-time data processing

Collected data using REST APIs, built HTTPS connections with the client server, sent GET requests, and collected responses in a Kafka producer

Imported data from web services into HDFS and transformed data using Spark

Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets and created complex queries over the S3 data with AWS Athena.

Used Spark SQL for creating and populating the HBase warehouse

Installed and configured Kafka cluster and monitored the cluster

Used the Pandas library and Spark in Python for data cleansing, validation, processing, and analysis

Created Hive external tables and designed data models in Apache Hive

Developed multiple Spark Streaming and batch Spark jobs using Python on AWS

Implemented advanced feature-engineering procedures for the data science team using the in-memory computing capabilities of Apache Spark, written in Python

Implemented rack awareness in the production environment

Worked with SparkContext, Spark SQL, DataFrames, and pair RDDs

Ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3

Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration

Documented the requirements, including the available code to be implemented using Spark, Amazon DynamoDB, Redshift, and Elasticsearch

Imported and exported data into HDFS and Hive using Sqoop
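
A minimal PySpark sketch of the pattern described above for the Glue jobs: read from a relational source over JDBC and land the data in S3 in both JSON and Parquet so Snowflake can reference it through a storage integration. The connection details, table, and bucket are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-to-s3").getOrCreate()

# Hypothetical JDBC connection details; real jobs pulled credentials from configuration
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://example-host:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .option("driver", "org.postgresql.Driver")
          .load())

# Land the data in S3 in both formats so Snowflake can reference it via a storage integration
orders.write.mode("overwrite").json("s3://example-bucket/landing/orders/json/")
orders.write.mode("overwrite").parquet("s3://example-bucket/landing/orders/parquet/")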

Big Data Developer (Project) with Big Lots, Inc, Columbus, OH, May’14-Apr’16

Transferred data from AWS S3 using the Informatica tool

Used AWS Redshift for storing the data in the cloud

Worked on Impala to compare its processing time with Apache Hive for batch applications and implemented the former in the project

Created a POC that involved loading data from the Linux file system to AWS S3 and HDFS.

Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (a minimal streaming sketch follows at the end of this section)

Designed and implemented test environment on AWS

Ran Hadoop jobs to process millions of records of data that was updated daily/weekly.

Integrated Kafka with Spark Streaming for high-speed data processing

Configured Spark Streaming to receive real-time data and store the streamed data in HDFS

Handled architecture and DevOps for AWS and Google Cloud services, including an in-house data center for middleware systems and web services; also managed security reviews and web compliance.

Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.

Proposed solutions and strategies to tackle business challenges

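A minimal sketch of the Kafka-to-HDFS pattern described above. The resume describes the RDD/DStream approach; this sketch uses Spark Structured Streaming instead as an illustration, assumes the Spark Kafka connector package is available on the cluster, and uses placeholder brokers, topic, and paths.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Placeholder brokers and topic, for illustration only
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "sales-events")
          .load())

# Kafka delivers key/value as binary; cast the value to string before downstream parsing
events = stream.select(F.col("value").cast("string").alias("payload"),
                       F.col("timestamp"))

# Persist the stream to HDFS as Parquet, with a checkpoint for reliable sink progress tracking
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/sales_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/sales_events/")
         .outputMode("append")
         .start())

query.awaitTermination()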

PREVIOUS EXPERIENCE

Software Developer, King Khalid University, Kingdom of Saudi Arabia, May’07–Apr’14

Elmak Saad

Mobile: +224-***-****

Email: advcea@r.postjobfree.com

CORE COMPETENCIES

Big Data

Spark/Cloud

Predictive Modeling

R/Python/Tableau

Data Analytics/Visualization

Scala, Python, and SQL queries

Leadership Skills

ACADEMIC DETAILS

M.Sc. in Business Analytics, Graduate School of Business, University of Illinois at Chicago, June 2022

M.Sc. in Computer Science, Faculty of Engineering and Technology, University of Gezira, Sudan, 2003

B.Sc. in Computer Science, Institute of Computer Science, International University of Africa, Sudan, 1998

CERTIFICATIONS

SAS Certified Predictive Modeler Using SAS Enterprise Miner 14, 3/2019

SAS Certified Advanced Programmer for SAS 9, 11/2018

SAS Certified Base Programmer for SAS 9, 10/2018

PROFILE SUMMARY

Results-oriented Data Engineer with nearly 9 years of experience in Hadoop, Spark, and Big Data

Skilled in HDFS, Spark, Hive, Sqoop, HBase, Flume, Oozie, and Zookeeper

Experienced in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions for various business problems, and generating data visualizations using R, Python, and Tableau

Create Spark Core ETL processes and automate them using a workflow scheduler.

Use Apache Hadoop to work with Big Data and analyze large data sets efficiently

Experience handling XML files as well as Avro and Parquet SerDes

Performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle

Experience in ecosystems like Hive, Sqoop, MapReduce, Flume, and Oozie.

Gained exposure to Hive's analytical functions; extended Hive functionality by writing custom UDFs.

Track record of results in an Agile methodology using data-driven analytics.

Experience importing and exporting terabytes of data between HDFS and Relational Database Systems using Sqoop.

Load and transform large sets of structured, semi-structured, and unstructured data, working with data on Amazon Redshift, Apache Cassandra, and HDFS in a Hadoop Data Lake

Skilled with BI tools like Tableau and PowerBI, data interpretation, modeling, data analysis, and reporting with the ability to assist in directing planning based on insights

Experience working in production environments, migrations, installations, and development

Effective mentor & problem-solver with strong analytical, interpersonal, negotiation & troubleshooting skills

TECHNICAL SKILLS

Languages and Scripting: Spark, PySpark, Python, Java, Scala, Hive, Kafka, SQL, Shell scripts, HiveQL

Python packages: Numpy, TensorFlow, Pandas, Scikit-Learn, SciPy, Matplotlib, Seaborn

Databases: Cassandra, Snowflake, HBase, Redshift, DynamoDB, MongoDB, MS Access, MySQL, Oracle, PL/SQL, SQL, RDBMS

Data Lake/Data Warehouse: SQL Database, RDBMS, NoSQL Database, Amazon Redshift, Apache Cassandra, MongoDB, Spark-SQL, MySQL, Oracle, S3, Athena, Databricks

Stream Processing: Apache Hadoop, Spark Streaming, Kafka

Analysis: Advanced Data Modeling, Statistical, Sentiment, Exploratory, Stochastic Forecasting, Regression, Predictive

Communication: Reporting, Documentation, Presentation, Collaboration. Clear and effective with a wide variety of colleagues and audiences.

Cloud: Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure

ETL: Flume, Airflow, Apache Spark, NiFi, Apache Kafka

Proficient in various distributions and platforms such as the Apache Hadoop ecosystem, Microsoft Azure, and Databricks Spark.


