
Cloud Data Engineer

San Antonio, TX
February 06, 2023



Professional Summary

Overall experience of over *0 years in the Information Technology field.

Develop and fine-tune code in SQL, Python, Scala, and Hive, working with Spark RDDs and DataFrames.

Work with large, complex data sets, real-time/near real-time analytics, and distributed Big Data platforms.

Demonstrated skill with Hadoop distributions such as Cloudera, Hortonworks, and MapR.

In-depth knowledge of incremental imports, partitioning, and bucketing in Hive and Spark SQL for query optimization.

Hands-on application with AWS Cloud and Microsoft Azure Cloud Services (PaaS & IaaS).

Develop Spark programs using PySpark APIs to compare the performance of Spark with Hive and SQL.

Use Scala libraries to process XML data.

Build continuous Spark streaming ETL pipelines with Spark, Kafka, Scala, HDFS, and MongoDB.

Program AWS Lambda functions to run Python scripts in response to events in S3.

Create AWS CloudFormation templates to provision infrastructure in the cloud.

Adept at gathering and aggregating data from various sources and integrating it into HDFS.

Collect real-time log data from multiple sources, including social media (Facebook, Twitter, Google, LinkedIn), webserver logs, and databases, using Flume.

Deploy large multi-node Hadoop and Spark clusters.

Develop custom large-scale enterprise applications using Spark for data processing.

Schedule workflows and ETL processes with Apache Oozie.

Design Big Data solutions for traditional enterprise businesses.

Define job flows in Hadoop environment using tools such as Oozie for data scrubbing and processing.

Configure Zookeeper to provide Cluster coordination services.

Load logs from multiple sources directly into HDFS using tools such as Flume.

Commission and decommission nodes on Hadoop Cluster.

Configure NameNode High Availability and perform Hadoop Cluster Disaster Management.

Skilled in SQL Server development (query optimization, stored procedures, etc.) and administration (failover using high availability, mirroring techniques, jobs, database backups, etc.).

Convert business requirements into tangible deliverables that optimize business operations.

Technical Skills List

Big Data - Apache Hadoop, Apache Hive, Apache Kafka, Apache Oozie, Apache Spark, Apache Zookeeper, HDFS, Hortonworks, MapReduce, HiveQL, Spark Streaming, HBase, Sqoop, Airflow

Cloud - AWS: EC2, EMR, Redshift, CloudWatch, CloudFormation, Lambda, S3, DynamoDB

Programming Languages - Python, Shell Scripting, Scala, SparkSQL, PySpark, Kibana Query Language (KQL)

Databases - MySQL, DynamoDB, PostgreSQL, Oracle

Operating Systems - Windows (including Windows 10), Unix/Linux

Professional Experience

Role: Cloud Data Engineer

Company: Comcast

Location: Philadelphia, PA

Date Start/End: March 2021 to Present

Comcast Cable Communications is a telecommunications company.

Developed and fine-tuned dataset workflows with AWS Step Functions to update tables with newly imported data weekly.

Developed PySpark and Scala jobs on AWS Glue to help troubleshoot dataset workflows.

Used Scala libraries to update jobs on AWS Glue.

Programmed AWS Lambda functions to run Python scripts in response to events in S3.

Integrated AWS Secrets Manager into Glue jobs to help protect account numbers and other private client information through hashing and encryption.

Hands-on application with AWS Cloud (PaaS & IaaS).

Developed large-scale enterprise applications using Spark for data processing.

Scheduled workflows and ETL processes for constant monitoring and support with AWS Step Functions.

Designed email notifications with AWS SNS for updates on several Lambda functions, Glue jobs, and tables.

Explored and updated table items in AWS DynamoDB.

Implemented SQL queries in AWS Athena to view table contents and data variables in multiple datasets.

Launched AWS EC2 instances for multiple datasets using shell scripts and the AWS CLI.

Monitored AWS CloudFormation events across several dataset workflow stacks.

Implemented AWS SQS to queue up the flow of events in certain datasets on the pipeline.

Role: Cloud Data Engineer

Company: DXC Technology

Location: Tysons, VA

Date Start/End: October 2019 to March 2021

DXC Technology is a Fortune 500 global IT services leader specializing in Insurance BPaaS and BPO, Analytics and Engineering, Applications, Security, Cloud, IT Outsourcing, and Modern Workplace digitalization.

Implemented serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB.

Designed logical and physical data models for multiple data sources on Amazon Redshift.

Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Amazon Redshift.

Implemented Spark on EMR to process big data across the AWS data lake.

Created an AWS Lambda function to extract data from Kinesis Firehose and post it to an AWS S3 bucket on a scheduled basis using AWS CloudWatch Events.

Automated, configured, and deployed instances on AWS and Azure environments and in data centers.

Worked with the AWS IAM console to create custom users and groups.

Programmed Flume and HiveQL scripts to extract, transform, and load data into the database.

Configured ETL into the Hadoop Distributed File System (HDFS) and wrote Hive UDFs.

Used Hive for queries and incremental imports with Spark jobs.

Completed AWS data migrations between database platforms, such as local SQL Server to Amazon RDS and EMR Hive.

Established AWS S3 bucket integration for application and development projects.

Managed and reviewed Hadoop log files in AWS S3.

Led a critical on-prem data migration to the AWS cloud, assisting with performance tuning and providing a successful path toward Redshift clusters and AWS RDS DB engines.

Collected log information using custom-engineered input adapters and Kafka.

Created a custom producer to ingest data into Kafka topics for consumption by custom Kafka consumers.

Role: Hadoop Engineer

Company: Citigroup

Location: New York, NY

Date Start/End: March 2018 to October 2019

Citigroup is an American multinational investment bank and financial services corporation.

Built the infrastructure required for extraction, transformation, and loading of data from a variety of data sources using AWS technologies.

Worked with stakeholders (e.g., Executives, Product, Data and Design teams) to assist with data-related technical issues.

Worked with Airflow to schedule Spark applications.

Created multiple Airflow DAGs to manage parallel execution of activities and workflows.

Designed multiple applications to consume and transport data from S3 to EMR and Redshift.

Developed a PySpark application to process consecutive datasets.

Created EMR clusters using CloudFormation.

Created Lambda applications triggered by events on S3 buckets.

Created Spark programs using Scala for better performance.

Tuned the shuffle partition settings of Spark applications to maximize parallelism.

Used Elasticsearch to monitor application logs.

Performed incremental appends of datasets.

Optimized Spark jobs using map-side joins to reduce shuffling.

Applied the Kafka Streams library.

Developed Kafka producer and consumer programs using Scala, and created UDFs in Scala.

Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.

Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes supporting full relational Kibana Query Language (KQL) queries, including joins.

Created a Kafka broker which uses schema to fetch structured data in structured streaming.

Wrote producer/consumer scripts in Python to process JSON responses.

Wrote shell scripts to automate workflows to pull data from various databases into Hadoop framework for users to access data through Hive-based views.

Wrote Hive queries for analyzing data in Hive warehouse using Hive Query Language.

Utilized HiveQL to query data to discover trends from week to week.

Configured and deployed production-ready multi-node Hadoop services (Hive, Sqoop, Flume, and Oozie) on the Hadoop cluster with the latest patches.

Created Hive queries to summarize and aggregate business queries by comparing Hadoop data with historical metrics.

Loaded ingested data into Hive managed and external tables.

Performed upgrades, patches, and bug fixes in a Hadoop cluster environment.

Used Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters.

Role: Big Data Engineer

Company: Kelly Services, Inc.

Location: Troy, MI

Date Start/End: August 2016 to March 2018

Kelly Services, Inc. is an American office staffing company that operates globally.

Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.

Used Spark with Kafka for real-time streaming of data.

Imported unstructured data into the HDFS using Spark Streaming and Kafka.

Wrote scripts using Spark SQL to check, validate, cleanse, transform, and aggregate data.

Migrated MapReduce jobs to Spark, using Spark SQL and the DataFrames API to load structured data into Spark clusters.

Used Spark SQL to create and populate an HBase warehouse.

Implemented a Kafka messaging consumer.

Utilized Spark and Spark SQL for faster testing and processing.

Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.

Managed and reviewed Hadoop log files.

Applied Google Dataproc to streamline data processing between clusters and Google Cloud Storage.

Applied Zookeeper for cluster coordination services.

Downloaded data through Sqoop and Hive into the HDFS platform.

Defined data security standards and procedures in Hadoop using Apache Ranger and Kerberos.

Performed upgrades, patches, and bug fixes in HDP in a cluster environment.

Used Jenkins for CI/CD and Git for version control.

Role: Big Data Engineer

Company: Oracle

Location: Redwood Shores, CA

Date Start/End: December 2014 to August 2016

Oracle is an American multinational computer technology corporation (now headquartered in Austin, Texas; formerly headquartered in Redwood Shores, California). Oracle sells database software and technology, cloud engineered systems, and enterprise software products. It also develops and builds tools for database development and systems of middle-tier software, Enterprise Resource Planning (ERP) software, Human Capital Management (HCM) software, Customer Relationship Management (CRM) software (also known as customer experience), Enterprise Performance Management (EPM) software, and Supply Chain Management (SCM) software.

Tuned and optimized SQL query performance using a query analyzer.

Created model software to interact with HDFS and MapReduce.

Created indexes, constraints, and rules on database objects for optimization.

Designed and developed Hadoop applications.

Installed SQL Server client-side utilities and tools for front-end developers and programmers.

Loaded ingested data into Hive managed and external tables.

Wrote custom user-defined functions (UDFs) for complex Hive queries (HiveQL).

Wrote shell scripts to automate workflows to pull data from various databases into the Hadoop framework for users to access the data through Hive-based views.

Wrote Hive queries for analyzing data in the Hive warehouse using Hive Query Language.

Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume.

Utilized HiveQL to query the data to discover trends from week to week.

Configured and deployed production-ready Hadoop cluster with services such as Hive, Sqoop, Flume, and Oozie.

Created Hive queries to summarize and aggregate business queries by comparing Hadoop data with historical metrics.

Imported millions of structured records from relational databases into HDFS in CSV format using Sqoop import.

Role: Linux Systems Administrator

Company: Ericsson

Location: Lagos, Nigeria

Date Start/End: October 2012 to December 2014

Ericsson is one of the leading providers of Information and Communication Technology (ICT) to service providers. The company enables the full value of connectivity by creating game-changing technology and services that are easy to use, adopt, and scale, making its customers successful in a fully connected world.

Served as Administrator for technical monitoring and maintenance, upgrades, and support of Linux-based systems.

Provided technical support for hundreds of users.

Built, installed, and configured servers running Red Hat Linux.

Installed and configured Apache, WebLogic, and WebSphere applications.

Installed Red Hat Linux 4.x/5.x using Kickstart, performed kernel tuning and memory upgrades, and provided long-term technical maintenance and upgrade support.

Applied OS patches and upgrades on a regular basis, upgraded administrative tools and utilities, and configured or added new services.

Configured DNS and DHCP on clients’ networks.

Created database tables with various constraints for clients accessing FTP.

Provided remote system administration.


Bachelor of Science - Computer Science - Babcock University (Ilishan-Remo, Nigeria)
