WORK EXPERIENCE
AWS Big Data Engineer with Bitwise Inc. – Schaumburg, IL, Jun’21-Present
Client: Mercola.com
Bitwise leverages its deep technological expertise to solve its clients' most complex business problems. With 20+ years of consistently delivering high-performance solutions to market-leading clients in diverse industries across the world, its tools, processes, and people offer singular value to every engagement. Partnerships with Bitwise yield long-term business gains, as clients deliver more and deliver faster, staying ahead of the curve in a dynamic business landscape.
Designed a data warehouse and performed data analysis queries on Amazon Redshift clusters on AWS
Worked on a data lake on AWS S3, integrating it with different applications and development projects
Developed Lambda functions in AWS using Python
Used Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)
Created Spark jobs that run in EMR clusters using EMR Notebooks
Developed Spark programs in Python to run in the EMR clusters
Deployed the ELK (Elasticsearch, Logstash, Kibana) stack in AWS to gather and investigate the logs generated by the website
Wrote unit tests for all code using frameworks such as pytest
Architected a serverless design using Amazon API Gateway, Lambda, S3, and DynamoDB, optimized for auto-scaling performance
Designed the schema, cleaned up the input data, processed the records, wrote queries, and generated the output data using Redshift
Populated database tables via Amazon Kinesis Data Firehose, Athena, and Amazon Redshift
Created User Defined Functions (UDFs) in Scala to automate some of the business logic in the applications
Designed AWS Glue pipelines to ingest, process, and store data, interacting with different services in AWS (see the sketch after this role's bullets)
Executed Hadoop/Spark programs on AWS EMR using data stored in S3 buckets
Developed AWS CloudFormation templates to create a custom infrastructure for our pipeline
Implemented AWS IAM user roles, instance profiles, and policies to authenticate and control access
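Below is a minimal sketch of a Glue pipeline of the kind referenced above, assuming a PySpark Glue job that reads from the Glue Data Catalog and lands Parquet in S3; the database, table, column, and bucket names are hypothetical placeholders rather than the actual project configuration.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap: resolve arguments and create contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Ingest from the Glue Data Catalog (database and table names are placeholders)
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)

# Light processing: drop a junk column (if present) and deduplicate on a
# hypothetical key column before storing
df = source.toDF().drop("_corrupt_record").dropDuplicates(["event_id"])

# Store the result as Parquet in S3 (bucket path is a placeholder)
df.write.mode("overwrite").parquet("s3://example-bucket/processed/events/")

job.commit()
```

Converting the DynamicFrame to a Spark DataFrame keeps the transformation logic in plain PySpark, which makes it easier to test outside of Glue.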
Big Data Engineer with Bitwise Inc. – Schaumburg, IL, Nov’19-Jun’21
Client: Homeserve
Worked as part of the Big Data Engineering team to design and develop data pipelines in an Azure environment using ADLS Gen2, Blob Storage, ADF (Data Factory), Azure Databricks, Azure SQL, and Azure Synapse for analytics, with MS Power BI for reporting.
Designed HiveQL and SQL queries to retrieve data from the data warehouse and create views for end users to consume
Developed Spark SQL scripts to manage various data sets and verified their performance across the jobs using the Spark UI
Used Spark to filter and format the data before writing it to the sink in the Hive data warehouse.
Created Hive tables to store data in different formats coming from different data sources.
Maintained the Hive metastore tables to store the metadata of the tables in the data warehouse.
Automated the ingestion pipeline using bash scripts and cron jobs to run the ETL pipeline daily.
Imported data from the local file system and RDBMS into HDFS using Sqoop, and developed a workflow using shell scripts to automate the tasks of loading the data into HDFS
Evaluated the data processing techniques available in Hadoop from multiple perspectives to detect aberrations in data.
Connected Business Intelligence tools such as Tableau and Power BI to the tables in the data warehouse.
Cleaned up the input data, specified the schema, processed the records, wrote UDFs, and generated the output data using Hive
Implemented a streaming job with Apache Kafka to ingest data from a REST API (see the sketch after this role's bullets).
Utilized gradient-boosted trees and random forests to create a benchmark for potential accuracy.
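Below is a minimal sketch of the REST-to-Kafka ingestion pattern mentioned above, assuming the kafka-python client; the endpoint URL, topic name, broker address, and polling interval are hypothetical placeholders rather than the client's actual configuration.

```python
import json
import time

import requests
from kafka import KafkaProducer

# Hypothetical endpoint and topic names
API_URL = "https://api.example.com/v1/events"
TOPIC = "events-raw"

# Serialize each record as JSON bytes before sending to Kafka
producer = KafkaProducer(
    bootstrap_servers=["broker:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Pull the latest batch of records from the REST API over HTTPS
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()

    # Publish each record to the Kafka topic
    for record in response.json():
        producer.send(TOPIC, value=record)

    producer.flush()
    time.sleep(60)  # poll once a minute (placeholder interval)
```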
Data Engineer with Credit Suisse, Raleigh, NC, May’16-Jun’18
Remote Project
Involved in building ETL data warehousing/pipeline solutions for our customers, pulling data from various sources and file formats.
The project consisted of a data migration, mainly from Oracle Exadata servers to Snowflake in the cloud (staging area, Snowpipe).
Developed a PySpark application as an ETL job to read data from various file system sources, apply transformations, and write to a NoSQL database (MongoDB/Cassandra)
Built a series of Glue jobs in AWS to move data from disparate sources and write to multiple target systems such as RDS (MySQL), DynamoDB, MongoDB, on-prem Oracle databases, and S3 buckets, using an EMR cluster with PySpark as the compute engine.
All Glue job scripts were written in PySpark; they read records from PostgreSQL/MySQL tables and write the data in JSON and Parquet formats to S3 so that it can be referenced in Snowflake via a storage integration (see the sketch after this role's bullets).
Extracted data from different databases and scheduled Apache workflows to execute the tasks daily
Architected a lightweight Kafka broker and integrated Kafka with Spark for real-time data processing
Collected data via a REST API: built an HTTPS connection to the client server, sent GET requests, and collected the responses in a Kafka producer
Imported data from web services into HDFS and transformed data using Spark
Executed Hadoop/Spark programs on AWS EMR using data stored in S3 buckets, and created complex queries against those buckets with AWS Athena.
Used Spark SQL to create and populate the HBase warehouse
Installed, configured, and monitored the Kafka cluster
Used the Pandas library and Spark in Python for data cleansing, validation, processing, and analysis
Created Hive external tables and designed data models in Apache Hive
Developed multiple Spark Streaming and batch Spark jobs using Python on AWS
Implemented advanced feature engineering procedures for the data science team using the in-memory computing capabilities of Apache Spark in Python
Implemented rack awareness in the production environment
Worked with SparkContext, Spark SQL, DataFrames, and pair RDDs
Ingested data from various sources to S3 through Amazon Kinesis Data Streams and Firehose
Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration
Documented the requirements, including the available code to be implemented using Spark, Amazon DynamoDB, Redshift, and Elasticsearch
Imported and exported data into HDFS and Hive using Sqoop
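Below is a minimal sketch of the pattern described above for landing relational data in S3 so Snowflake can reference it: a PySpark job reads a table over JDBC and writes Parquet to S3. The host, database, table, credentials, partition column, and bucket names are hypothetical placeholders, and the job assumes the PostgreSQL JDBC driver is on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("postgres-to-s3-parquet")
         .getOrCreate())

# Read the source table over JDBC (connection details are placeholders)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .option("fetchsize", "10000")
          .load())

# Land the data as Parquet in S3, partitioned by a date column, so that
# Snowflake can read it through an external stage backed by a storage
# integration
(orders.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3://example-bucket/staging/orders/"))
```

On the Snowflake side, an external stage created over the same S3 prefix (using a storage integration) makes the Parquet files available for querying or for loading with COPY INTO.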
Big Data Developer (Project) with Big Lots, Inc, Columbus, OH, May’14-Apr’16
Transferred data from AWS S3 using the Informatica tool
Used Amazon Redshift for storing the data in the cloud
Worked on Impala, comparing its processing time with Apache Hive for batch applications in order to adopt the former in the project
Created a POC that involved loading data from the Linux file system to AWS S3 and HDFS.
Extracted a real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (see the sketch after this role's bullets)
Designed and implemented test environment on AWS
Involved in running Hadoop jobs to process millions of records in data that was updated daily/weekly.
Integrated Kafka with Spark Streaming for high-speed data processing
Configured Spark Streaming to receive real-time data and store the streamed data in HDFS
Handled architecture and DevOps for AWS and Google Cloud services, including an in-house data center for middleware systems and web services; also managed security reviews and web compliance.
Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
Proposed solutions and strategies to tackle business challenges
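Below is a minimal sketch of the Kafka-to-Parquet flow described above. It uses Spark Structured Streaming rather than the DStream/RDD approach mentioned in the bullet, expressing the same flow more concisely; the broker address, topic, schema, and HDFS paths are hypothetical placeholders, and the spark-sql-kafka connector package must be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-parquet")
         .getOrCreate())

# Hypothetical schema for the incoming JSON feed
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Read the real-time feed from Kafka (broker and topic are placeholders)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Kafka delivers bytes; parse the value column as JSON into typed columns
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Persist the stream as Parquet files in HDFS, with a checkpoint for
# fault tolerance
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```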
PREVIOUS EXPERIENCE
Software Developer, King Khalid University, Kingdom of Saudi Arabia, May’07–Apr’14
Elmak Saad
Mobile: + 224-***-****
Email: *****************@*****.***
CORE COMPETENCIES
Big Data
Spark/Cloud
Predictive Modeling
R/Python/Tableau
Data Analytics/Visualization
Scala, Python, and SQL queries
Leadership Skills
ACADEMIC DETAILS
M.Sc. in Business Analytics, Graduate School of Business, University of Illinois at Chicago, June 2022
M.Sc. in Computer Science, Faculty of Engineering and Technology, University of Gezira, Sudan, 2003
B.Sc. in Computer Science, Institute of Computer Science, International University of Africa, Sudan, 1998
CERTIFICATIONS
SAS Certified Predictive Modeler Using SAS Enterprise Miner 14, 3/2019
SAS Certified Advanced Programmer for SAS 9, 11/2018
SAS Certified Base Programmer for SAS 9, 10/2018
PROFILE SUMMARY
Results-oriented Data Engineer with nearly 9 years of experience in Hadoop, Spark, and Big Data
Skilled in HDFS, Spark, Hive, Sqoop, HBase, Flume, Oozie, and Zookeeper
Experienced in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions for various business problems, and generating data visualizations using R, Python, and Tableau
Create Spark Core ETL processes and automate them using a workflow scheduler (see the sketch after this summary).
Use Apache Hadoop to work with Big Data and analyze large data sets efficiently
Experience handling XML files as well as Avro and Parquet SerDes
Performance tuning at source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle
Experience in ecosystems like Hive, Sqoop, MapReduce, Flume, and Oozie.
Gained exposure to Hive's analytical functions; extended Hive functionality by writing custom UDFs.
Track record of results in an Agile methodology using data-driven analytics.
Experience importing and exporting terabytes of data between HDFS and Relational Database Systems using Sqoop.
Load and transform large sets of structured, semi-structured, and unstructured data, working with data on Amazon Redshift, Apache Cassandra, and HDFS in a Hadoop data lake
Skilled with BI tools like Tableau and PowerBI, data interpretation, modeling, data analysis, and reporting with the ability to assist in directing planning based on insights
Experience of working in production environments, migrations, installations, and development
Effective mentor & problem-solver with strong analytical, interpersonal, negotiation & troubleshooting skills
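Below is a minimal sketch of automating a Spark Core ETL process with a workflow scheduler, assuming Apache Airflow 2.4+ with the Spark provider installed; the DAG id, schedule, script path, and connection id are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Daily DAG that submits a PySpark ETL job (all names/paths are placeholders)
with DAG(
    dag_id="daily_spark_etl",
    start_date=datetime(2022, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_spark_etl",
        application="/opt/jobs/etl_job.py",   # the Spark Core ETL script
        conn_id="spark_default",              # Spark connection defined in Airflow
        application_args=["--run-date", "{{ ds }}"],
    )
```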
TECHNICAL SKILLS
Languages and Scripting: Spark, PySpark, Python, Java, Scala, Hive, Kafka, SQL, Shell scripts, HiveQL
Python packages: NumPy, TensorFlow, Pandas, scikit-learn, SciPy, Matplotlib, Seaborn
Databases: Cassandra, Snowflake, HBase, Redshift, DynamoDB, MongoDB, MS Access, MySQL, Oracle, PL/SQL, SQL, RDBMS
Data Lake/Warehouse: SQL Database, RDBMS, NoSQL Database, Amazon Redshift, Apache Cassandra, MongoDB, Spark SQL, MySQL, Oracle, S3, Athena, Databricks
Stream Processing: Apache Hadoop, Spark Streaming, Kafka
Analysis: Advanced Data Modeling, Statistical, Sentiment, Exploratory, Stochastic Forecasting, Regression, Predictive
Communication: Reporting, Documentation, Presentation, Collaboration. Clear and effective with a wide variety of colleagues and audiences.
Cloud: Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure
ETL: Flume, Airflow, Apache Spark, NiFi, Apache Kafka
Proficient in various distributions such as the Apache Hadoop ecosystem, Microsoft Azure, and Databricks Spark.