Senior Big Data Engineer

Location:
Manhattan, NY, 10023
Posted:
September 14, 2023


DIVINE EPIE NGOL ESUEH

AWS Cloud Senior Data Engineer

with 15 years of overall IT experience, including 9+ years in Big Data (Hadoop) and AWS/Azure cloud; targeting assignments with an organization of repute.

Email: adzo15@r.postjobfree.com Phone: 646-***-****

Profile Summary

•Experience working on Big Data systems, ETL pipelines, and real-time analytics systems, including slicing and dicing OLAP cubes and drilling into tabular models.

•Achievement-driven professional skilled in databases, data management, analytics, data processing, data cleansing, data modeling, and data-driven projects.

•Skillfully developed technical plans and roadmaps and wrote scalable CloudFormation scripts.

•Diligently developed and deployed a CloudFormation script to provision a secure Virtual Private Cloud (VPC) which housed our multi-tier infrastructure.

•Skilled at building and deploying CloudFormation scripts to provision a secure and reliable multi-tier infrastructure, including a Network Address Translation (NAT) gateway, Internet Gateway (IGW), Elastic Load Balancer (ELB), Auto Scaling Group (ASG), Security Groups (SG), DynamoDB, and Elastic Compute Cloud (EC2) instances (a brief deployment sketch appears at the end of this section).

•Proven experience as a data engineer, with a focus on GCP technologies and Terraform (IaC).

•Strong expertise in GCP services such as BigQuery, Dataflow, Pub/Sub, and Cloud Storage.

•Expertise in troubleshooting to fix issues with scripts or processes.

•Extensive experience with Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.

•Experienced with Amazon Web Services (AWS) and GCP services, including Elastic Compute Cloud (EC2) instances, Identity and Access Management (IAM), DynamoDB, Lambda, Load Balancers, Auto Scaling Groups, CloudWatch, Systems Manager (SSM), and Virtual Private Cloud (VPC) with all of its components: subnets, route tables, Internet Gateway (IGW), Network Address Translation (NAT) gateway, Network Access Control Lists (NACLs), VPC flow logs, Security Groups (SG), VPC endpoints, and Application Load Balancer (ALB); plus BigQuery, Pub/Sub, Google Cloud Storage, Cloud Composer, and Cloud Functions.

•Worked with various file formats (Parquet, Avro & JSON) and compression codecs (Snappy & Gzip).

•Skillfully deployed large, multi-node Hadoop and Spark clusters.

•Led project initiation and planning: defined project scope, goals, and objectives, identified stakeholders and their roles while migrating from an on-premises Teradata environment to the cloud, and created a detailed migration plan with timelines and milestones.

•Widely used cloud platforms and services such as AWS, GCP, S3, BigQuery, Cloud Storage, EMR, Redshift, Glue jobs, Cloud Dataprep, Cloud Composer, Pub/Sub, Bigtable, Azure Databricks, and Terraform.
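For illustration only, below is a minimal Python/boto3 sketch of how a CloudFormation stack like the multi-tier VPC described above could be deployed; the template file, stack name, and parameter keys are hypothetical placeholders rather than artifacts from these projects.

# Minimal sketch: deploying a multi-tier VPC CloudFormation stack with boto3.
# The template path, stack name, and parameter keys are hypothetical examples.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("vpc_multi_tier.yaml") as f:   # hypothetical template file
    template_body = f.read()

cfn.create_stack(
    StackName="multi-tier-vpc-demo",     # hypothetical stack name
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "VpcCidr", "ParameterValue": "10.0.0.0/16"},
        {"ParameterKey": "Environment", "ParameterValue": "dev"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],   # required when the template creates IAM resources
)

# Wait for the stack (VPC, subnets, IGW, NAT, ELB, ASG, SGs, EC2, DynamoDB) to finish creating.
cfn.get_waiter("stack_create_complete").wait(StackName="multi-tier-vpc-demo")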

Education

•AWS Certified Cloud Practitioner.

•Bachelor of Science in Business Administration from the University of the People.

Key Skills

AWS Cloud, GCP, Big Data Systems, CloudFormation Scripts, Terraform Scripts, Project Management, Data Visualization, SQL Server Database, Project Execution, Documentation

Soft Skills

Adaptable, Communicator, Strategic Thinker, Collaborative, Team Player, Problem-Solving

Technical Skills

APACHE: Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache MAVEN, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapReduce.

SCRIPTING: HiveQL, MapReduce, XML, FTP, Python, UNIX, Shell scripting, LINUX.

OPERATING SYSTEMS: Unix/Linux, Windows 10, Ubuntu, Apple OS.

FILE FORMATS: Parquet, Avro & JSON, ORC, Text, CSV.

DISTRIBUTIONS: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK).

DATA PROCESSING (COMPUTE) ENGINES: Apache Spark, Spark Streaming, Flink.

DATA VISUALIZATION TOOLS: Pentaho, QlikView, Tableau, PowerBI, matplotlib.

COMPUTE ENGINES: Apache Spark, Spark Streaming, Storm, AWS EC2.

DATABASE: Microsoft SQL Server Database (2005, 2008R2, 2012), Database & Data Structures, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB, Snowflake, BigQuery.

SOFTWARE: Microsoft Project, Primavera P6, VMWare, Microsoft Word, Excel, Outlook, PowerPoint; Technical Documentation Skills, Terraform

BIG DATA: Hadoop, Hive, Flume, Sqoop, Airflow, Nifi, Spark, Spark Streaming, HBase, Pig, Yarn, Kafka, Zookeeper.

Work Experience

Sr. Big Data Engineer

NFL, New York, NY, USA, April 2022 to Present

•As a dedicated member of the ingestion team, I played a pivotal role as a Python Engineer and Developer.

•Design, build, and maintain data pipelines on GCP, utilizing tools such as Dataflow, Dataprep, and BigQuery for batch and real-time data processing, with Terraform for infrastructure provisioning.

•Implementing data pipelines leveraging Google Cloud products such as BigQuery, Cloud Storage (GCS), Dataflow, Pub/Sub, and Bigtable.

•Perform data transformation, cleansing, and enrichment to prepare data for analysis and reporting, leveraging tools like Dataflow and Cloud Dataprep, with Terraform as the IaC tool for infrastructure deployment.

•Spearheading the development of end-to-end data ingestion pipelines from scratch, demonstrating expertise in Python coding and engineering practices.

•Design and manage data warehousing solutions using Google BigQuery or other appropriate GCP services.

•Optimize data storage and processing costs on GCP services (BigQuery, Cloud Composer, etc.) by managing resource allocation efficiently and by integrating and deploying infrastructure with Terraform/YAML scripts.

•Rectifying issues encountered in existing pipelines and constructing new pipelines from scratch, employing Python as our primary programming language.

•Managing seamless transfer of files from the Vendor Drop zone to the raw zone, where the original files were deposited without any transformations.

•Orchestrating the movement of the data files to the refined zone after implementing necessary transformations and manipulations, rendering them ready for utilization by Data Analysts who would leverage the data for diverse analyses and report generation.

•Gaining invaluable experience in problem-solving, Python development, and the intricate process of handling data pipelines; by collaborating closely with team members and actively contributing to the efficient movement and transformation of data, I worked toward enhancing our team's performance and overall project success.

•Leveraging Python scripts to orchestrate the transfer of vendor files across various ingestion points and storage locations, facilitating accessibility for Data Analysts (a brief sketch appears at the end of this section).

•Implementing data transformations and manipulation techniques within Python scripts to ensure data cleanliness and suitability for storage in a data warehouse.

•Using Git for robust source control, creating dedicated branches for tasks, and merging changes to the main branch.

•Facilitating the production deployment process through pull requests.

•Employing AWS Glue and crawler functionalities to create tables that seamlessly integrated with the data pipelines.

•Developed multiple programs using Scala and deployed them on the Yarn cluster, comparing the performance of Spark with Hive and SQL.

•Validating data accuracy and compliance with requirements by leveraging AWS Athena for querying and result verification.

•Creating data process workflows using Spark, Hive, and Pig, ensuring efficient and accurate data analysis.

•Automating, configuring, and deploying instances on AWS and in on-premises data centers.

•Implementing secure storage of AWS credentials by utilizing AWS Systems Manager Parameter Store, eliminating hardcoded values in Python scripts and enhancing security practices.

•Collaborating within an agile environment, utilizing Jira and Confluence to manage tasks.

•Actively participating in two-week sprints, meticulously tracking progress, and providing comprehensive reports during daily stand-up meetings.

•Facilitating communication and coordination with project stakeholders, attending meetings to gather requirements and align project details.

•Involved in splitting JSON files into DataFrames to be processed in parallel for better performance and fault tolerance.

•Developing Spark programs using PySpark and maintaining the data warehouse in AWS Redshift.

•Engaging with Senior Managers and top-level executives to provide progress updates and ensure alignment with organizational goals and expectations.
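As a rough illustration of the raw-zone to refined-zone file movement described above, the sketch below uses boto3 with secrets pulled from Systems Manager Parameter Store instead of hardcoded values; the bucket names, parameter name, and transformation are hypothetical placeholders, not the project's actual code.

# Minimal sketch: move a vendor file from the raw zone to the refined zone after a light cleanup.
# Bucket names, the parameter name, and the transformation are hypothetical examples.
import csv
import io

import boto3

s3 = boto3.client("s3")
ssm = boto3.client("ssm")

# Secrets come from SSM Parameter Store rather than being hardcoded in the script
# (shown here only to illustrate the pattern; the key would be used when calling the vendor API).
api_key = ssm.get_parameter(
    Name="/ingestion/vendor-api-key",   # hypothetical parameter name
    WithDecryption=True,
)["Parameter"]["Value"]

RAW_BUCKET = "vendor-raw-zone"          # hypothetical bucket names
REFINED_BUCKET = "vendor-refined-zone"

def refine(key):
    """Read one raw CSV object, apply a light cleanup, and write it to the refined zone."""
    body = s3.get_object(Bucket=RAW_BUCKET, Key=key)["Body"].read().decode("utf-8")
    rows = [r for r in csv.DictReader(io.StringIO(body)) if any(r.values())]
    if not rows:
        return
    # Example transformation: normalize column names and trim whitespace.
    cleaned = [{k.strip().lower(): (v or "").strip() for k, v in r.items()} for r in rows]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(cleaned[0].keys()))
    writer.writeheader()
    writer.writerows(cleaned)
    s3.put_object(Bucket=REFINED_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))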

Sr. Big Data Engineer

Mediaocean, New York, NY, USA, February 2020 to March 2022

•I was a pivotal member of our team, responsible for architecting, implementing, and managing complex data pipelines and analytics solutions on Google Cloud Platform (GCP).

•Used BigQuery and GCP Dataflow to create complex ETL pipelines and SQL queries, with Kafka for publishing and subscribing to data and transforming it according to business rules.

•Designed and developed data pipelines using Dataproc Serverless and Spark, with Cloud Composer for workflow management.

•Leveraged BigQuery and GCS integration with data security, compliance, and performance optimization in mind.

•Migrated to the new infrastructure generated from the updated Terraform code across four environments.

•Evaluated the existing Teradata environment, including databases, tables, schemas, and data pipelines.

•Analyzed data volumes, types, and dependencies.

•Mapped existing Teradata structures to their equivalents in the chosen cloud platforms: S3 buckets and Google Cloud Storage.

•Addressed compatibility issues between Teradata and cloud technologies.

•Replicated existing ETL (Extract, Transform, Load) pipelines from Teradata to the cloud environment.

•Performed rigorous testing to confirm that data in the cloud matched the source data in Teradata.

•Managed complete Big Data flow from data ingestion to its destination in Redshift.

•Designed and implemented a scalable, secure, and cost-effective cloud-based solution to process and analyze massive volumes of data on AWS using AWS Kinesis, AWS Glue, EMR, Spark, Redshift, QuickSight, and Terraform.

•Implemented end-to-end solutions using DevOps practices such as CI/CD to deploy Big Data pipelines.

•Deployed EC2 instances and performed auto scaling, snapshot backups, and template management.

•Developed ETLs to pull data from various sources and transform it for reporting applications using PL/SQL.

•Developed AWS CloudFormation templates to create a custom infrastructure for our pipeline.

•Performed Spark performance tuning for different source-system domains and loaded the results into a harmonized layer.

•Performed Data Engineering with AWS cloud services including AWS Cloud services planning, designing, and DevOps support like IAM user, group, roles & policy management and Terraform infrastructure.

•Created modules for Apache Airflow to call different services in the cloud, including EMR, S3, Athena, crawlers, Lambda functions, and Glue jobs (a brief sketch appears at the end of this section).

•Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.

•Contributed to loading data from the UNIX file system to HDFS.

•Involved in running Hadoop jobs for processing millions of records and data which were updated daily/weekly.

•Monitored resources, such as Amazon DB Services, CPU, Memory, and EBS volumes.

•Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), and AWS Redshift.

•Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3.

•Decoded raw data and loaded it into JSON before sending the batched streaming file over to the Kafka producer.

•Specified nodes and performed data analysis queries on Amazon Redshift Clusters on AWS.

•Replicated live stream data from DB2 to HBase tables using Spark Streaming and Apache Kafka.

•Optimized data transfer over the network using compression codecs such as Snappy.

•Used ETL to transfer the data from the target database to HDFS to send it to the MicroStrategy reporting tool.
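A minimal sketch, under stated assumptions, of the Airflow pattern mentioned above for calling AWS services from a DAG; the Glue job name, Athena query, and S3 result location are hypothetical, and a production DAG would also poll the Glue run for completion.

# Minimal Airflow sketch: trigger a Glue job, then run a validation query in Athena.
# Job name, query, and S3 output location are hypothetical examples.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_glue_job():
    # Kick off a (hypothetical) Glue job that curates the day's data.
    boto3.client("glue").start_job_run(JobName="curate-daily-data")


def validate_with_athena():
    # Run a sanity-check query against the curated table via Athena.
    boto3.client("athena").start_query_execution(
        QueryString="SELECT COUNT(*) FROM analytics.curated_daily",
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )


with DAG(
    dag_id="daily_curation",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    glue = PythonOperator(task_id="run_glue_job", python_callable=run_glue_job)
    athena = PythonOperator(task_id="validate_with_athena", python_callable=validate_with_athena)

    glue >> athena   # validation runs after the Glue job has been started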

Sr Cloud/Sr Data Engineer

Alcon, Fort Worth, TX, June 2018 to January 2020

•Worked extensively with AWS including doing Docker deployments on ECS.

•Utilized AWS Lambda functions to trigger and manage real-time data processing, ensuring efficient and scalable execution of data transformations.

•Executed Hadoop/Spark jobs on AWS EMR using programs, data stored in S3 Buckets, and ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3.

•Transformed and analyzed the data using PySpark and Hive, based on ETL mappings.

•Worked with Amazon Web Services (AWS) and was involved in ETL, Data Integration, and Migration.

•Documented the requirements including the available code which should be implemented using Spark, Amazon DynamoDB, Redshift, and Elastic Search.

•Defined Airflow DAGs (Directed Acyclic Graphs) to schedule and manage ETL tasks.

•Handled retries, error handling, and task dependencies within the DAGs.

•Designed a cloud-native architecture on Azure for scalability and reliability.

•Used Airflow as the orchestration tool to schedule and monitor data pipeline tasks.

•Integrated multiple data sources, including on-premises databases, external APIs, and cloud storage.

•Leveraged Azure services such as Azure Data Factory to efficiently extract data.

•Implemented Spark using Python and Spark SQL for faster testing and processing of data.

•Imported and exported data into HDFS and Hive using Sqoop.

•Used SparkSQL for creating and populating the HBase warehouse.

•Worked with Spark Context, Spark SQL, DataFrames, and Pair RDDs.

•Extracted data from different databases and scheduled Apache Airflow workflows to execute the task daily.

•Developed PySpark application to read data from various file system sources, apply transformations and write to NoSQL database.

•Worked on Hadoop, Spark, and similar frameworks, NoSQL, and RDBMS databases including Redis and MongoDB.

•Attended meetings with managers to determine the company's Big Data needs and developed Hadoop systems.

•Finalized the scope of the system and delivered Big Data solutions.

•Collaborated with the software research and development teams and built cloud platforms for the development of company applications.

•Trained staff in data resource management.

•Collected data using REST APIs: built HTTPS connections with the client server, sent GET requests, and collected responses in a Kafka producer (a brief sketch appears at the end of this section).

•Imported data from web services into HDFS and transformed data using Spark.

•Worked on AWS Kinesis for processing huge amounts of real-time data.
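A minimal sketch of the REST-to-Kafka collection flow described above, using the requests library and kafka-python; the endpoint URL, broker address, and topic name are hypothetical placeholders.

# Minimal sketch: pull records from a REST endpoint over HTTPS and publish them to Kafka.
# The endpoint, broker, and topic are hypothetical examples.
import json

import requests
from kafka import KafkaProducer   # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Fetch one page of records and publish each one to the topic.
resp = requests.get("https://api.example.com/v1/measurements", timeout=30)
resp.raise_for_status()

for record in resp.json():
    producer.send("measurements-raw", value=record)          # hypothetical topic

producer.flush()   # ensure everything reaches the broker before exiting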

Sr Big Data Engineer

J.D. Power, Westlake Village, California, December 2016 to May 2018

•Designed and developed a robust data pipeline using Apache Airflow, enabling orchestration, scheduling, and monitoring of various data processing tasks.

•Integrated Snowflake as the cloud data warehouse, providing a centralized and scalable repository for structured and semi-structured data.

•Leveraged Azure Databricks for advanced data processing, analytics, and machine learning tasks, utilizing its powerful Spark-based platform.

•Demonstrated expertise with Azure services, showcasing several years of experience in building and managing data solutions within the Azure ecosystem.

•Successfully delivered two previous data engineering projects within Azure, showcasing a consistent track record of meeting project objectives and deadlines.

•Integrated Power BI with Azure services, enabling interactive and visually appealing dashboards and reports for business stakeholders.

•Used Azure Databricks for scalable and parallel data processing.

•Implemented ETL workflows using PySpark for data transformation and enrichment (a brief sketch appears at the end of this section).

•Handled schema evolution and data quality checks within the pipeline.

•Created interactive and visually appealing dashboards using Power BI.

•Connected Power BI to Snowflake to establish a live connection for real-time analytics.

•Created Spark Scala Datasets in Azure Databricks, defining the schema with Scala case classes.

•Executed long-running jobs for pre-processing product and warehouse data in Azure Synapse Analytics to cleanse and prepare the data before consumption.

•Used Azure Stream Analytics to divide streaming data into batches as input to Azure Databricks engine for batch processing.

•Data required extensive cleaning and preparation for ML Modelling, as some observations were censored without any clear notification.

•Set up Azure Logic Apps for pipeline automation.

•Repartitioned datasets after loading Gzip files into Data Frames and improved the processing time.

•Used Azure Event Hubs to process data as a Kafka Consumer.

•Implemented Azure Databricks using Python and utilized Data Frames and Azure Spark SQL API for faster processing of data.

•Involved in the implementation of analytics solutions through Agile/Scrum processes for development and quality assurance.

•Interacted with data residing in Azure Data Lake Storage using Azure Databricks to process the data.

•Populated data frames inside Azure Databricks jobs, Azure Spark SQL, and Data Frames API to load structured data into Azure Databricks clusters.

•Monitored background operations in Azure HDInsight.

•Captured and transformed real-time data into a suitable format for Scalable analytics using Azure Stream Analytics.

•Forwarded requests to source REST Based API from a Scala script via Azure Event Hubs Producer.
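A minimal PySpark sketch of the Databricks ETL pattern referenced above (read raw data from Azure Data Lake Storage, cleanse and enrich it, write a refined copy back); the storage account, container, and column names are hypothetical assumptions, not the actual project schema.

# Minimal PySpark sketch: ADLS -> cleanse/enrich -> refined ADLS output.
# Storage account, containers, and columns are hypothetical; ADLS authentication is
# assumed to be configured on the Databricks cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-etl-sketch").getOrCreate()

raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/products/"
refined_path = "abfss://refined@examplestorage.dfs.core.windows.net/products/"

df = spark.read.parquet(raw_path)

# Basic cleansing and enrichment before downstream reporting (e.g., in Power BI).
refined = (
    df.dropDuplicates(["product_id"])
      .filter(F.col("price") > 0)
      .withColumn("ingest_date", F.current_date())
)

refined.write.mode("overwrite").partitionBy("ingest_date").parquet(refined_path)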

Sr Big Data Engineer

Merrill Edge, Atlanta, GA, July 2014 to November 2016

•Developed a robust ETL (Extract, Transform, Load) pipeline to efficiently process and analyze large volumes of data from various sources using GCP services.

•Used GCS (Google Cloud Storage) as the storage repository for raw and intermediate data, serving as both the source for initial data extraction and the destination for processed data.

•Leveraged BigQuery as the data warehouse for storing structured and semi-structured data to support ad-hoc queries and fast analytical processing.

•Built ETL pipelines with Dataproc, which manages Apache Spark and Hadoop clusters, for large-scale data transformation tasks requiring distributed computing.

•Optimized Google Dataflow for building and executing data pipelines with serverless scalability.

•Implemented Cloud Functions as a serverless compute platform, executing event-driven functions that trigger pipeline runs in response to events from Pub/Sub or other sources.

•Transformed and loaded data into BigQuery tables for analysis, implementing schema evolution and partitioning strategies for performance optimization.

•Implemented Spark streaming for real-time data processing with Kafka and handled large amounts of data with Spark.

•Wrote streaming applications with Spark Streaming/Kafka.

•Used SQL to perform transformations and actions on data residing in HDFS.

•Configured various Big Data ecosystems tools such as Elastic Search, Logstash, Kibana, Kafka, and Cassandra.

•Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming, and Cassandra.

•Participated in various phases of data processing (collecting, aggregating, and moving from various sources) using Apache Spark.

•Managed structured data via Spark SQL and then stored it in Hive tables for downstream consumption.

•Analyzed and tuned the Cassandra data model for multiple internal projects/customers.

•Interacted with data residing in HDFS using Spark to process the data.

•Coordinated Kafka operation and monitoring with DevOps personnel.

•Worked on Elasticsearch and Logstash (ELK) performance and configuration tuning; imported data from web services into HDFS using Apache Flume and Apache Kafka, and transformed the data using Spark.

•Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.

•Developed an ETL pipeline to process log data from Kafka/HDFS sequence files and output it to Hive tables in ORC format (a brief sketch appears at the end of this section).

•Responsible for designing and deploying new ELK clusters.

•Defined the Spark/Python (PySpark) ETL framework and best practices for development and wrote Python code that tracks Kafka message delivery.

•Used Cloudera Manager for installation and management of a multi-node Hadoop cluster.
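A minimal Spark Structured Streaming sketch of the Kafka-to-Hive/ORC log pipeline described above; the broker, topic, schema, and output paths are hypothetical, and the job assumes the spark-sql-kafka package is available on the cluster.

# Minimal sketch: read a JSON log stream from Kafka, parse it, and write ORC files
# that a Hive external table can sit on top of. Requires the spark-sql-kafka-0-10 package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-logs-to-orc").getOrCreate()

log_schema = StructType([
    StructField("level", StringType()),
    StructField("message", StringType()),
    StructField("ts", TimestampType()),
])

# Read the raw log stream from Kafka and parse the JSON payload.
logs = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
         .option("subscribe", "app-logs")                     # hypothetical topic
         .load()
         .select(F.from_json(F.col("value").cast("string"), log_schema).alias("log"))
         .select("log.*")
)

# Write micro-batches as ORC files; checkpointing tracks delivery progress.
query = (
    logs.writeStream.format("orc")
        .option("path", "/warehouse/logs_orc")                # hypothetical output location
        .option("checkpointLocation", "/checkpoints/logs_orc")
        .start()
)
query.awaitTermination()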

Big Data Engineer

Publix Super Market Inc., Lakeland, FL, September 2012 to June 2014

•Sourced data using APIs, with data arriving as JSON converted to Parquet and Avro formats (a brief sketch appears at the end of this section).

•Used Kafka to ingest Data and create topics for data streaming.

•Utilized Spark Streaming for data processing and creating DStreams from data received from Kafka.

•Stored results of processed data in Hive.

•Worked on large data warehouse Analysis Services servers and developed various analytical reports from those servers.

•Wrote Hive scripts to process HDFS data and wrote shell scripts to automate workflows to pull data from various databases into the Hadoop framework for users to access the data through Hive views.

•Implemented Kafka streaming to send data streams from the company APIs to HDFS, processing the data with Spark Structured Streaming.

•Developed DAG-based data pipelines for onboarding and change management of datasets.

•Scheduled jobs using Apache Airflow.

•Used Kafka for publish-subscribe messaging as a distributed commit log.

•Collected log data from various sources and integrated it into HDFS using Flume and staged data in HDFS for further analysis.

•Developed SQL queries to Insert, update and delete data in a data warehouse.
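A minimal PySpark sketch of the JSON-to-Parquet/Avro conversion described above; the HDFS paths are hypothetical, and writing Avro assumes the external spark-avro package is on the cluster.

# Minimal sketch: convert API responses landed as JSON in HDFS into columnar formats.
# Paths are hypothetical examples; the Avro writer requires the spark-avro package.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-columnar").getOrCreate()

# API responses landed as JSON files in HDFS (hypothetical landing path).
events = spark.read.json("hdfs:///landing/api_events/")

# Columnar copies for downstream Hive/Spark consumers.
events.write.mode("overwrite").parquet("hdfs:///refined/api_events_parquet/")
events.write.mode("overwrite").format("avro").save("hdfs:///refined/api_events_avro/")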

Previous Experience

Technical Consultant

System One, Fremont, CA, October 2008 to August 2012

•Led a 24x5 operations team for incidents, major incidents, service requests, and change management, and managed the team's performance, ensuring high availability.

•Managed bridge calls with internal and external teams to resolve Incidents by being the first point of contact.

•Identified Problems through evaluation of reports and problem control using RCA methods and workaround documentation.

•Spearheaded governance calls with all stakeholders, reporting metrics to measure project status and reviewing process documents and updates.

•Contributed to resolving customers' technical product queries through chat and e-mail.


