Senior Big Data Engineer

Location: Atlanta, GA, 30324
Posted: April 26, 2024

Resume:

NELLY-ANNE NDIKUM

Phone: 737-***-****

Email: ad41nu@r.postjobfree.com

Professional Summary

•Big Data Developer with nearly 10 years of well-rounded experience at top technology firms, spanning ETL pipeline development, cloud services, and legacy-system data engineering with tools such as Spark, Kafka, Hadoop, Hive, AWS, and Azure.

•Proficient in crafting resilient data infrastructure and reporting solutions, with extensive experience in distributed platforms and systems and a deep understanding of databases, data storage, and data movement.

•Developed Python scripts to automate the extraction, transformation, and loading (ETL) of multi-terabyte datasets into a Hadoop ecosystem, enhancing data availability and analysis speed (a minimal illustrative sketch follows this list).

•Collaborated with DevOps teams to identify business requirements and implement CI/CD pipelines, showcasing expertise in Big Data Technologies such as AWS, Azure, GCP, Databricks, Kafka, Spark, Hive, Sqoop, and Hadoop

•Engineered and optimized Apache Spark jobs in Scala for batch processing of multi-terabyte datasets, achieving a 30% reduction in processing times and resource consumption.

•Designed and implemented a data cleansing and validation process using SQL and PL/SQL, enhancing data quality for a multinational marketing campaign management system.

•Implemented data governance practices in data modeling, including metadata management and data quality metrics, to improve the reliability of business intelligence reports.

•Specialized in architecting and deploying fault-tolerant data infrastructure on Google Cloud Platform (GCP), harnessing services like Google Compute Engine, Google Cloud Storage, and Google Kubernetes Engine (GKE)

•Demonstrated proficiency in designing and executing data pipelines using GCP data services such as BigQuery, Cloud Dataflow, Cloud Dataproc, and Cloud Composer, while ensuring robust data security and governance practices

•Ingested and queried data from multiple NoSQL databases, including MongoDB, Cassandra, HBase, and AWS DynamoDB

•Leveraged containerization technologies like Docker and Kubernetes to optimize performance and scalability on GCP

•Extensive hands-on experience with AWS services including EC2, S3, RDS, and EMR, leveraging them effectively in various data projects

•Expertise in implementing CI/CD pipelines on AWS, utilizing AWS Lambda for serverless computing solutions, and employing event-driven architectures

•Proficient in AWS data services such as Athena and Glue for data analytics and ETL processes, with a proven track record of creating interactive dashboards and reports using AWS QuickSight

•Applied knowledge in extracting data from AWS services and utilizing reporting tools like Tableau, PowerBI, and AWS QuickSight to generate tailored reports.
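
The sketch below illustrates the kind of Python/PySpark ETL flow referenced in this summary: extract raw files from HDFS, apply light transformations, and load partitioned Parquet for downstream analysis. It is a minimal sketch only; the paths, column names, and dataset are hypothetical placeholders, not details from any specific engagement.

```python
# Minimal PySpark ETL sketch: extract raw CSV from HDFS, transform, and load
# partitioned Parquet for downstream Hive/Spark use. All names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("etl-sketch")
         .getOrCreate())

# Extract: raw files landed on HDFS by an upstream ingestion job.
raw = (spark.read
       .option("header", "true")
       .csv("hdfs:///landing/events/*.csv"))      # hypothetical path

# Transform: type casting, de-duplication, and a derived partition column.
clean = (raw
         .withColumn("event_ts", F.to_timestamp("event_ts"))
         .withColumn("amount", F.col("amount").cast("double"))
         .dropDuplicates(["event_id"])
         .withColumn("event_date", F.to_date("event_ts")))

# Load: write partitioned Parquet that Hive external tables can sit on top of.
(clean.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("hdfs:///warehouse/events_clean"))      # hypothetical path
```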

Technical Skills

PROGRAMMING LANGUAGES

•Python, Scala, Java, PySpark, Spark, SQL, Shell Scripting, MapReduce, SparkSQL, HiveQL.

CLOUD PLATFORMS

•AWS, Azure, GCP, Databricks.

BIG DATA TOOLS

•Kafka, Spark, Hive, Apache Nifi, HBase, Cloudera, Flume, Sqoop, Hadoop, HDFS, Spark streaming, YARN, SQL, Oracle, MongoDB, Cassandra, Oozie.

CLOUD TOOLS

•AWS Glue, EMR, S3, Lambda, Step Functions, SNS, SQS, IAM, DynamoDB, Redshift, RDS, Azure Data Factory, Data Lake Gen2, Power BI, Synapse.

PROJECT METHODS

•ETL, CI/CD, Unit Testing, Debugging, Agile, Scrum, Test-Driven Development, Version Control (Git).

Experience

BIG DATA ENGINEER APPLE US AUSTIN, TEXAS JUNE 2021 – PRESENT

•Worked on the data pipeline team, focusing on optimizing pipeline efficiency through monitoring, troubleshooting, and bug fixes.

•Leveraged technologies such as Hadoop, Kafka, and Hive to ensure seamless data flow, and used visualization tools such as Datadog and AWS CloudWatch for trend analysis.

•Conducted bug fixes and code cleanup to enhance pipeline performance.

•Gathered and analyzed data, ensuring its integrity through unit testing.

•Utilized Airflow for job orchestration, with Python and PySpark for development.

•Employed SQL, Hive, Presto, Oracle, and Flume for data querying and processing.

•Monitored and troubleshot jobs using Airflow, AWS CloudWatch, Oozie, Spark UI, Cloudera Manager, and Kubernetes.

•Created comprehensive data pipelines using PySpark to process streaming data, integrating with Apache Kafka and storing results in HBase for real-time analytics (see the sketch at the end of this section).

•Architected and coded Python applications for real-time data processing using Apache Kafka and Apache Storm, resulting in a 40% increase in processing efficiency.

•Facilitated data movement across environments, ensuring smooth ingestion from various sources.

•Engineered CI/CD pipelines using Jenkins integrated with Ansible to automate deployment and orchestration of big data workflows.

•Developed Python scripts for code cleanup, job monitoring, and data migration.

•Collaborated with cross-functional teams to address business needs and enhance data systems.

•Designed and implemented real-time data analytics solutions using Scala and Spark Structured Streaming, processing live streams from social media platforms to generate instant insights.

•Analyzed data patterns using Grafana, CloudWatch metrics, and dashboards.

•Automated workflows with shell scripts to streamline data extraction into the Hadoop framework.

•Designed and implemented a Terraform solution to automate the provisioning of cloud infrastructure for big data applications, reducing setup time by over 40%.

•Overcame timezone challenges while collaborating with offshore team members to ensure timely task completion and issue resolution.
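
As referenced above, the following is a minimal PySpark Structured Streaming sketch in the spirit of the Kafka pipelines described in this role. The broker address, topic, schema, and output paths are illustrative assumptions; the original pipelines landed results in HBase, while this sketch writes Parquet purely to stay self-contained.

```python
# Minimal Structured Streaming sketch: consume JSON events from Kafka, compute
# 1-minute aggregates, and write them out. Requires the spark-sql-kafka package
# on the classpath; broker, topic, and paths below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-stream-sketch")
         .getOrCreate())

schema = StructType([
    StructField("device_id", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
       .option("subscribe", "events")                      # hypothetical topic
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# 1-minute tumbling-window aggregates; in the original pipeline results went to
# HBase -- here they are written to Parquet purely for illustration.
agg = (parsed
       .withWatermark("ts", "5 minutes")
       .groupBy(F.window("ts", "1 minute"), "device_id")
       .agg(F.avg("value").alias("avg_value")))

query = (agg.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/data/streams/device_agg")            # illustrative path
         .option("checkpointLocation", "/checkpoints/device_agg")
         .start())

query.awaitTermination()
```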

CLOUD DATA ENGINEER DRISCOLL’S WATSONVILLE CA OCTOBER 2020 – JUNE 2021

•Loaded and transformed large sets of structured and semi-structured data using AWS Glue.

•Implemented AWS Datasync tasks to automate data transfer between on-premises storage and AWS S3, streamlining backup and disaster recovery operations.

•Designed and implemented a scalable data lake on AWS S3, optimizing data storage and retrieval for a multi-petabyte dataset across a distributed analytics platform.

•Developed Kafka producer/consumer scripts in Python to process JSON responses (see the sketch at the end of this section).

•Implemented different instance profiles and roles in IAM to connect tools in AWS.

•Evaluated and proposed new tools and technologies to meet the needs of the organization.

•Architected and deployed a Snowflake environment to support a data lake solution, integrating data from multiple sources at a scale of petabytes per day.

•Leveraged Terraform to enforce cloud security best practices across big data deployments, aligning infrastructure with compliance requirements.

•Applied in-depth knowledge of AWS tools such as Glue, EMR, S3, Lambda, Redshift, and Athena.

•Engineered real-time data ingestion pipelines using AWS Kinesis Streams, facilitating high-throughput data flow for immediate processing and analytics in a financial trading application.

•Created and maintained a data warehouse in AWS Redshift.

•Applied object-oriented programming to build, integrate, and test software implementations, collecting business specifications and user requirements, confirming designs, and documenting the entire software development life cycle and QA.

•Configured and managed AWS EKS clusters to orchestrate containerized big data applications, ensuring robust scalability and high availability for microservices-driven architectures.

•Architected a high-performance data warehousing solution using AWS Redshift, integrating data from various sources via AWS Glue and optimizing query performance for BI tools.

•Designed a NoSQL solution using AWS DynamoDB to support web-scale applications, implementing best practices for data modeling to ensure low-latency access and high throughput.

•Utilized Scala to develop a custom library for geospatial analysis on top of Spark, enabling complex spatial queries and visualizations for urban planning data.

•Learned and adapted quickly to emerging cloud technologies, tools, and programming languages.

•Created Spark jobs in Python using Databricks on AWS infrastructure.

•Monitored and tuned AWS EMR cluster performance using CloudWatch and custom metrics, reducing job execution times and cost per job by optimizing resource allocation.

•Developed and deployed big data processing jobs on AWS EMR, utilizing Apache Spark and Hadoop to perform complex transformations and aggregations on large datasets.

•Developed Kafka producer and consumer programs and ingested data into AWS S3 buckets.

•Leveraged AWS Lambda and S3 triggers to automate image processing workflows, enabling real-time image analysis and metadata tagging for a media library.

•Implemented a parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting fully relational KSQL on Confluent Kafka.
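
A minimal sketch of the Kafka producer/consumer pattern referenced above, landing a batch of JSON messages in S3 with boto3. The broker, topic, bucket, and object key are hypothetical, and the libraries assumed are kafka-python and boto3.

```python
# Minimal producer/consumer sketch: publish JSON records to Kafka, then read a
# batch back and land it in S3. Broker, topic, and bucket names are illustrative.
import json
import boto3
from kafka import KafkaProducer, KafkaConsumer

BROKER = "broker:9092"          # hypothetical broker
TOPIC = "orders"                # hypothetical topic
BUCKET = "example-raw-events"   # hypothetical S3 bucket

def produce(records):
    """Publish JSON records to Kafka."""
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for rec in records:
        producer.send(TOPIC, rec)
    producer.flush()

def consume_to_s3(max_records=100):
    """Read JSON records from Kafka and land them in S3 as one object."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        consumer_timeout_ms=10_000,
    )
    batch = []
    for msg in consumer:
        batch.append(msg.value)
        if len(batch) >= max_records:
            break
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=BUCKET,
        Key="kafka/orders_batch.json",     # hypothetical object key
        Body=json.dumps(batch).encode("utf-8"),
    )

if __name__ == "__main__":
    produce([{"order_id": 1, "amount": 42.5}])
    consume_to_s3()
```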

BIG DATA DEVELOPER HOME DEPOT ATLANTA GA JUNE 2019 – OCTOBER 2020

•Attended sprint planning, daily scrums, and retrospective meetings.

•Evaluated requirements and helped in the creation of user stories.

•Developed the code and modules for different ETL pipelines.

•Designed and implemented an Azure Data Lake storage solution, enabling scalable data storage and analytics across a distributed architecture for a healthcare data management platform.

•Processed large sets of structured and unstructured data using Spark with Scala.

•Imported data from DB2 to HDFS using Apache Nifi in the Azure cloud.

•Worked as part of the Big Data Engineering team and worked on pipeline creation activities in the Azure environment.

•Led the design and implementation of a dimensional data modeling project for a retail analytics platform, which included the development of star schemas to facilitate faster OLAP queries.

•Implemented data transformation pipelines within Snowflake using Snowflake SQL and SnowPipe to ingest real-time data streams for immediate analytics.

•Leveraged Azure Databricks for real-time data processing and machine learning, developing predictive models that improved demand forecasting for a retail chain.

•Designed and implemented a microservices architecture using MongoDB as the backend database, which supported asynchronous data processing and improved scalability for a financial services application.

•Created a real-time analytics dashboard by streaming data from Azure Event Hubs into Azure Databricks and subsequently visualizing in Power BI, providing actionable insights for operational teams.

•Implemented a log analytics solution using Azure Databricks to process and analyze multi-terabyte log files, identifying critical performance bottlenecks and security threats.

•Developed a secure and compliant data ingestion pipeline into Azure Data Lake (ADLS) using Azure Data Factory, ensuring data integrity and privacy for financial services data.

•Configured and managed Azure HDInsight clusters running Apache Hadoop and Spark, optimizing resource allocation and performance for cost-effective big data processing.

•Architected and deployed a high-availability cluster using PostgreSQL, achieving 99.99% uptime and enhancing data reliability for a critical healthcare data management system.

•Created Hive tables, loaded data, and wrote Hive queries for data analysis using external tables.

•Performed incremental data loads into the Apache Hive data warehouse as part of the daily data ingestion workflow (see the sketch at the end of this section).

•Integrated Azure Event Hubs with Azure Stream Analytics to capture and analyze millions of live events for a smart city traffic management system.

•Involved in performance tuning of Hive queries to improve data processing and retrieval.

•Developed and maintained a data warehouse using Kimball methodology, focusing on the integration of disparate data sources into a cohesive analytical environment.

•Utilized Azure Data Lake Analytics to run U-SQL jobs on unstructured data, extracting meaningful insights for a digital marketing campaign analysis.

•Engineered a scalable IoT data platform on Azure using HDInsight and Azure Blob Storage, processing sensor data from thousands of devices to optimize manufacturing processes.

•Worked with a variety of file formats, including Parquet, Avro, ORC, CSV, and JSON.

•Performed data enrichment, cleansing, and data aggregations through RDD transformations using Apache Spark in Scala.
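
As referenced in the incremental-load bullet above, the following is a minimal PySpark/HiveQL sketch of a daily partitioned load into Hive. The database, table, and landing-path names are illustrative assumptions rather than details from the actual project.

```python
# Minimal sketch of a daily incremental load into a date-partitioned Hive table.
# Database, table, and landing-path names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-incremental-load")
         .enableHiveSupport()
         .getOrCreate())

load_date = "2020-06-01"   # normally injected by the scheduler

# Read the day's landed files (e.g. dropped by an upstream ingestion job)
# and expose them to HiveQL as a temporary view.
daily = (spark.read
         .option("header", "true")
         .csv(f"/landing/sales/dt={load_date}"))
daily.createOrReplaceTempView("sales_landing")

# Target warehouse table, partitioned by load date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales (
        order_id STRING,
        store_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")

# Append only the new day's rows into its partition.
spark.sql(f"""
    INSERT INTO analytics.sales PARTITION (dt = '{load_date}')
    SELECT order_id, store_id, CAST(amount AS DOUBLE)
    FROM sales_landing
""")
```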

BIG DATA ENGINEER ASICS IRVINE CA MARCH 2018 – JUNE 2019

•Developed shell scripts to automate workflows for pulling data from various databases into Google Cloud Storage, using Google Cloud Functions for orchestrating data access via BigQuery views.

•Crafted SQL queries for data analysis in BigQuery, utilizing its standard SQL capabilities to manage and analyze data stored in BigQuery datasets.

•Ingested data into relational databases (Cloud SQL for MySQL and PostgreSQL) from Google Cloud Storage using Dataflow, which efficiently handled data transformation and loading tasks.

•Conducted extensive data profiling and modeling for a data warehouse project, ensuring compliance with HIPAA regulations while supporting complex queries.

•Developed complex data transformation pipelines in Scala integrating with Apache Kafka for efficient data ingestion and stream processing in a financial services environment.

•Developed Python modules to encrypt and decrypt data using advanced cryptographic standards, ensuring secure data storage and transmission in distributed systems.

•Configured Cloud SQL instances and integrated them with Google Cloud Data Fusion for seamless data ingestion and transformation within the GCP ecosystem.

•Utilized Dataflow to extract data from Cloud SQL and load it into BigQuery, leveraging Apache Beam pipelines written in Python for data processing tasks (a related sketch follows this section).

•Implemented partitioning and indexing strategies on a multi-terabyte PostgreSQL database, significantly improving performance and manageability for large-scale analytics applications.

•Implemented a MongoDB-based document store for a content management system, enabling dynamic content updates and personalized user experiences through schema-less data structures.

•Migrated data from Google Cloud Storage to BigQuery using the Parquet file format, taking advantage of BigQuery's external table functionalities for efficient query performance.

•Created a monitoring solution using Docker, Grafana, and Prometheus to monitor big data processing jobs and optimize resource allocation.

•Created high-performance algorithms in Scala to support machine learning workflows on large datasets, significantly improving model training efficiency and accuracy.

•Processed and synchronized data between BigQuery and Cloud SQL (MySQL) using federated queries and Dataflow to maintain consistency and optimize access.

•Implemented data processing pipelines in Apache Beam, executed on Dataflow, to preprocess datasets before loading them into Cloud SQL databases.

•Automated and scheduled Dataflow pipelines using Cloud Scheduler and Pub/Sub to trigger batch processing jobs based on time or event-driven criteria.
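
The following is a related sketch of a Python Apache Beam pipeline of the kind described above, reading JSON from Cloud Storage and writing to BigQuery. The project ID, bucket, dataset, and schema are hypothetical; in practice the pipeline would run on the DataflowRunner.

```python
# Minimal Apache Beam sketch: read newline-delimited JSON from GCS, parse it,
# and append rows to a BigQuery table. All resource names are illustrative.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    """Turn one JSON line into a BigQuery-ready dict."""
    rec = json.loads(line)
    return {"user_id": rec["user_id"], "amount": float(rec["amount"])}

def run():
    options = PipelineOptions(
        runner="DirectRunner",              # swap for DataflowRunner on GCP
        project="example-project",          # hypothetical project id
        temp_location="gs://example-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/*.json")
         | "Parse" >> beam.Map(parse_line)
         | "Write" >> beam.io.WriteToBigQuery(
               "example-project:analytics.purchases",
               schema="user_id:STRING,amount:FLOAT",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

if __name__ == "__main__":
    run()
```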

DATA ENGINEER CVS WOONSOCKET RI JUNE 2016 – MARCH 2018

•Involved in requirements analysis, design, development, implementation, and documentation of ETL pipelines.

•Proposed big data solutions for the business requirements.

•Worked with different teams in the organization in gathering the system requirements.

•Performed big data extraction, processing, and storage using Hadoop, Sqoop, SQL, and HiveQL.

•Engineered time-series data storage in Cassandra, utilizing its fast write capabilities to support high-throughput logging and monitoring of industrial automation systems.

•Engineered a Scala application to perform text mining and sentiment analysis on large volumes of customer feedback data, helping to enhance customer service strategies.

•Wrote Python scripts to automate cloud infrastructure deployments, including data storage and compute resources, using the AWS Boto3 SDK, reducing manual setup time by 70% (see the sketch at the end of this section).

•Developed and maintained robust ETL pipelines using SQL and Python to automate data transfer between MySQL and Hadoop ecosystems, enhancing data availability for analytics.

•Developed a temporal data model to capture the history and changes over time in a legal document management system, facilitating advanced version control and audit capabilities.

•Developed map-reduce jobs in Java for data transformation and processing.

•Scheduled the pipelines using Oozie to orchestrate map-reduce workflows in Unix environments.

•Employed Python’s Matplotlib and Seaborn libraries to create visualization dashboards that track data trends and performance metrics, shared across the organization via a Flask-based web application.

•Imported and exported data between MySQL and HDFS using Sqoop.
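
A minimal Boto3 sketch of the infrastructure-automation scripts mentioned above: create an S3 staging bucket and launch a small EMR cluster. The bucket name, region, release label, instance types, and IAM role names are illustrative assumptions.

```python
# Minimal Boto3 sketch: provision an S3 staging bucket and launch an EMR cluster
# for batch ETL. Resource names and settings below are illustrative only.
import boto3

REGION = "us-east-1"
BUCKET = "example-etl-staging"   # hypothetical bucket

def create_staging_bucket():
    s3 = boto3.client("s3", region_name=REGION)
    s3.create_bucket(Bucket=BUCKET)

def launch_emr_cluster():
    emr = boto3.client("emr", region_name=REGION)
    resp = emr.run_job_flow(
        Name="etl-cluster",
        ReleaseLabel="emr-5.30.0",
        Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",    # default role names, assumed
        ServiceRole="EMR_DefaultRole",
        LogUri=f"s3://{BUCKET}/emr-logs/",
    )
    return resp["JobFlowId"]

if __name__ == "__main__":
    create_staging_bucket()
    print("Launched cluster:", launch_emr_cluster())
```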

DATA ENGINEER DICK'S SPORTING GOODS CORAOPOLIS, PENNSYLVANIA SEPTEMBER 2014 – JUNE 2016

•Commissioned and decommissioned data nodes and performed NameNode maintenance.

•Designed and built a scalable web scraping system using Python, BeautifulSoup, and Scrapy to collect and parse structured data from over 1,000 websites daily (see the sketch at the end of this section).

•Backed up and cleared logs from HDFS space to enable optimal utilization of nodes.

•Wrote shell scripts for time-bound command execution.

•Architected a fault-tolerant data lake using Scala and Apache Hadoop, facilitating scalable storage and efficient querying of structured and unstructured data.

•Edited and configured HDFS and tracker parameters.

•Designed and implemented a scalable data warehousing solution using Microsoft SQL Server, integrating data from multiple ERP systems, resulting in a unified view for business intelligence reporting.

•Led the migration of an Oracle database to Microsoft SQL Server, including schema conversion and data integrity validation, which streamlined operational processes and reduced costs.

•Automated configuration and maintenance of a 100-node Hadoop cluster using Ansible playbooks, enhancing system reliability and operational efficiency.

•Scripted requirements using BigSQL and provided runtime statistics for running jobs.

•Assisted with code reviews of simple to complex MapReduce jobs using Hive and Pig.

•Leveraged Scala to build and maintain scalable and robust ETL frameworks that support the ingestion and normalization of data from diverse sources into a centralized Hive warehouse.

•Leveraged Python’s Pandas library to perform data cleansing, preparation, and exploratory data analysis, significantly reducing data preprocessing time for downstream analytics.

•Performed cluster monitoring using the IBM BigInsights Ionosphere tool.

•Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.

•Installed Oozie workflow engine to run multiple Hive and Pig jobs.
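
A minimal sketch of the web-scraping approach referenced above, using requests and BeautifulSoup to pull product listings into CSV. The URL and CSS selectors are hypothetical; a production system (as with the Scrapy-based crawler described) would add retries, rate limiting, and robots.txt handling.

```python
# Minimal scraping sketch: fetch one listing page, parse product cards, and
# write the results to CSV. URL and selectors below are illustrative only.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"   # hypothetical listing page

def scrape_products(url):
    """Fetch one listing page and return rows of (name, price)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for card in soup.select("div.product"):            # hypothetical markup
        name = card.select_one("h2.name")
        price = card.select_one("span.price")
        if name and price:
            rows.append((name.get_text(strip=True), price.get_text(strip=True)))
    return rows

def write_csv(rows, path="products.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])
        writer.writerows(rows)

if __name__ == "__main__":
    write_csv(scrape_products(URL))
```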

Education

KENNESAW STATE UNIVERSITY

INFORMATION TECHNOLOGY


