
Sr. Big Data Developer

Location: Washington, VA 22747
Posted: April 26, 2024


NAZIR ZUBERU

Senior Big Data Engineer

Email: ad4srk@r.postjobfree.com; Phone: 901-***-****

PROFILE SUMMARY

Dynamic and motivated IT professional with 12+ years of experience in Big Data, focused on customer-facing products, with strong expertise in data transformation and ELT on large data sets.

Proficient in architecting, designing, and developing fault-tolerant data infrastructure and reporting solutions on distributed platforms and systems, with a deep understanding of databases, data storage, and data movement.

Collaborated with DevOps teams to identify business requirements and implemented CI/CD pipelines; experienced with Big Data technologies such as Amazon Web Services (AWS), Microsoft Azure, GCP, Databricks, Kafka, Spark, Hive, Sqoop, and Hadoop.

Proficient in architecting and implementing fault-tolerant data infrastructure on Google Cloud Platform (GCP), leveraging services like Google Compute Engine, Google Cloud Storage, and Google Kubernetes Engine (GKE).

Experienced in designing and developing data pipelines using GCP data services such as BigQuery, Cloud Dataflow, Cloud Dataproc, and Cloud Composer.

Proficient in implementing data security and governance best practices on GCP, including encryption at rest and in transit, IAM policies, and compliance controls.

Expertise in designing efficient data ingestion processes using GCP's low-latency systems.

Hands-on experience with GCP data services such as BigQuery, Cloud Dataflow, and Cloud Dataproc.

Knowledgeable in GCP's data storage solutions for handling large datasets securely.

Skilled in optimizing performance and scalability on GCP by leveraging containerization technologies like Docker and Kubernetes.

Demonstrated proficiency with AWS services including EC2, S3, RDS, and EMR, utilizing them extensively in multiple data projects.

Skilled in implementing Continuous Integration/Continuous Deployment (CI/CD) pipelines on AWS, automating deployment and testing processes for increased efficiency.

Experienced in leveraging AWS Lambda to develop serverless computing solutions and implement event-driven architectures; insightful knowledge of AWS data services such as Athena and Glue for performing data analytics and ETL processes effectively.

Proven track record of working on AWS QuickSight for creating interactive dashboards and reports, facilitating data visualization and analysis.

Applied expertise in creating tailored reports by extracting data from AWS services and utilizing reporting tools like Tableau, PowerBI, and AWS QuickSight.

Skillfully used Azure data services such as Azure SQL Database and Azure Data Factory for efficient storage, management, and processing of large datasets.

Proficient in implementing Azure Databricks for building and managing scalable Spark-based data analytics solutions, enabling real-time insights and decision-making.

Extensive experience in managing and optimizing on-premises data storage systems, including traditional relational databases and distributed file systems, to meet the needs of Big Data workloads.

Skilled in deploying and configuring on-premises data processing systems such as Apache Hadoop and Apache Spark, optimizing resource utilization and ensuring efficient data processing.

Proficiently designed efficient data ingestion processes with low-latency systems and data warehouses.

Insightful knowledge of Hadoop architecture and its components, including HDFS, YARN, and MapReduce.

Possess an understanding of Delta Lake architecture, including Delta Lake tables, transactions, schema enforcement, and time travel capabilities in Databricks and Snowflake.
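
For illustration, a minimal PySpark sketch of Delta Lake time travel; the table path, version, and timestamp are hypothetical, and a Delta-enabled Spark session (e.g., on Databricks) is assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel-sketch").getOrCreate()

# Current state of a Delta table (path is illustrative).
current_df = spark.read.format("delta").load("/mnt/data/events")

# Earlier snapshot by version number (time travel).
v0_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/data/events")

# Or the table as it looked at a given timestamp.
ts_df = spark.read.format("delta").option("timestampAsOf", "2021-01-01").load("/mnt/data/events")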

Expertise with NoSQL databases such as HBase and Cassandra to provide low-latency data access.

Well-versed in Spark performance tuning at the source, target, and data-stage job levels using indexes, hints, and partitioning; worked on data governance and data quality.
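
As a brief, hedged sketch of the kind of Spark-side tuning referred to above (table paths, column names, and the partition count are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table
customers = spark.read.parquet("/data/customers")  # small dimension table

# Repartition the large table on the join key to reduce shuffle skew.
orders = orders.repartition(200, "customer_id")

# Broadcast hint: ship the small table to executors and avoid a full shuffle join.
joined = orders.join(broadcast(customers), "customer_id")

# Write partitioned by a commonly filtered column to help downstream queries.
joined.write.mode("overwrite").partitionBy("order_date").parquet("/data/orders_enriched")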

Used PL/SQL and SQL to create queries, and developed Python-based designs and programs.

TECHNICAL SKILLS

Programming Languages: Python, Scala, Java, C, C++, SQL

Databases/Data warehouses: MS SQL Server, Oracle, DB2, MySQL, PostgreSQL, Snowflake, Hive, Redshift, BigQuery

NoSQL Databases: HBase, Cassandra, MongoDB.

Cloud Platforms: AWS, MS Azure, GCP

Big Data Primary Skills: MapReduce, Sqoop, Hive, Kafka, Spark, Cloudera, Databricks, Zookeeper

CI/CD: GitHub, Docker, Jenkins, Terraform, and Kubernetes.

Orchestration Tools: Airflow, Oozie, Cloud Composer, AWS MWAA

Data Analytics: Tableau, MS PowerBI, Amazon QuickSight.

Operating Systems: UNIX/Linux, Windows

Cloud Services:

AWS: S3, EMR, Lambda, Step Functions, Redshift Spectrum, Redshift, RDS, QuickSight, DynamoDB, CloudFormation, CloudWatch, SNS, SES, SQS

Azure: Data Factory, Databricks, Data Lake Storage Gen2, Azure SQL, HDInsight

GCP: Dataproc, BigQuery, Dataflow, Cloud Composer

Testing Tools: PyTest, Selenium, ScalaTest

PROFESSIONAL EXPERIENCE

Sr Data Engineer

FedEx, Memphis, TN Oct 2021 – Present

Spearheading projects within the GCP platform to govern data assets, including ingestion, transformation, and pipelining processes.

Overseeing the complete lifecycle of data assets on GCP, ensuring seamless flow from ingestion through transformation to delivery for FedEx stakeholders.

Developing Spark applications using Scala, leveraging its concise syntax and functional programming capabilities for efficient data processing.

Collaborating on initiatives to develop data products and services utilizing GCP platform assets for external stakeholders.

Leading cross-functional teams in designing, developing, and implementing data projects on GCP, fostering collaboration across departments.

Identifying opportunities to optimize the GCP platform for improved performance, scalability, and reliability, implementing necessary solutions.

Implementing robust quality assurance measures to validate data accuracy, completeness, and consistency across the GCP data pipeline.

Implemented Kafka for real-time data streaming and messaging within big data projects, ensuring reliable and scalable data ingestion from various sources.

Ensuring adherence to data governance policies and security standards on GCP, implementing measures to protect sensitive data.

Providing expertise in GCP Big Data technologies and best practices, guiding efficient design and implementation of data solutions.

Effectively communicating with stakeholders to gather requirements, provide project updates, and address concerns related to data projects.

Documenting technical specifications and procedures for GCP data governance and platform usage, delivering training to team members as required.

Experienced in utilizing Google Cloud Storage for scalable and durable storage of large datasets.

Proficient in designing and orchestrating data processing pipelines using Google Cloud Dataflow to ingest, transform, and analyze data in real-time and batch modes.
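
As an illustrative sketch of such a pipeline, here is a minimal Apache Beam job in Python that could be submitted to Cloud Dataflow; the project, region, bucket, and file paths are hypothetical, and running on Dataflow assumes apache-beam[gcp] plus GCP credentials:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # use "DirectRunner" to test locally
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadCSV" >> beam.io.ReadFromText("gs://my-bucket/input/events.csv")
     | "ParseFields" >> beam.Map(lambda line: line.split(","))
     | "FilterValid" >> beam.Filter(lambda fields: len(fields) == 5)
     | "FormatOutput" >> beam.Map(lambda fields: ",".join(fields))
     | "WriteResult" >> beam.io.WriteToText("gs://my-bucket/output/cleaned"))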

Integrated Kafka with Spark Streaming for real-time data processing, enabling immediate insights and actions on streaming data.

Experienced in leveraging Google Cloud Functions for serverless event-driven computing, enabling seamless integration and execution of data processing tasks.

Configured and managed Kafka clusters for real-time data streaming and messaging, enabling seamless communication between distributed systems.

Proficient in utilizing Google Cloud Dataproc for managing and scaling Apache Spark and Hadoop clusters, facilitating large-scale data processing and analytics.

Experienced in leveraging Google Cloud Pub/Sub for reliable, asynchronous messaging between independent applications, supporting data streaming and event-driven architectures.

Skilled in utilizing Google BigQuery for high-performance, serverless data warehousing, and analytics, enabling interactive querying and real-time insights on large datasets.
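
A minimal sketch of an interactive BigQuery query using the google-cloud-bigquery client; the project, dataset, and table names are hypothetical, and application-default credentials are assumed:

from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT route, COUNT(*) AS shipments
    FROM `my-project.logistics.shipments`
    WHERE ship_date >= '2022-01-01'
    GROUP BY route
    ORDER BY shipments DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.route, row.shipments)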

Proficient in leveraging Google Bigtable for scalable, NoSQL storage of structured and semi-structured data, supporting high-volume, low-latency applications.

Utilized Kafka Connect for integrating external data sources and sinks with Kafka clusters, enabling seamless data movement and synchronization.

Integrated Scala with Apache Spark to perform distributed data processing tasks efficiently, optimizing performance and resource utilization.

Experienced in implementing data engineering solutions on GCP, ensuring scalability, reliability, and cost-effectiveness while adhering to best practices and security standards.

Adept at collaborating with cross-functional teams to design, develop, and deploy data-driven applications and solutions on the Google Cloud Platform.

Passionate about staying updated with the latest advancements in GCP's data engineering tools and technologies, continuously exploring new ways to optimize and enhance data processing workflows.

Facilitating the integration of GCP data assets into external-facing data products and services, ensuring compatibility and functionality.

Utilized GitHub for version control and collaboration on code repositories within big data projects, enabling seamless code management, review, and integration.

Leveraged Docker for containerization of big data applications and services, facilitating consistent and portable deployment across different environments.

Implemented multi-tenancy in Kafka clusters to support isolation and resource management for different user groups or applications.

Integrated PySpark with external data sources and storage systems for data ingestion and extraction, ensuring seamless data integration.

Implemented Jenkins for continuous integration and continuous deployment (CI/CD) pipelines within big data workflows, enabling automated testing, builds, and deployments.

Utilized Scala's type safety and pattern matching capabilities to ensure robust error handling and data validation in data processing pipelines.

Utilized Terraform for infrastructure as code (IaC) provisioning within big data environments, enabling declarative configuration and automated provisioning of infrastructure resources.

Implemented data transformation and cleansing operations using PySpark and Scala, ensuring data quality and integrity throughout the pipeline.
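
A short, hedged PySpark sketch of the kind of transformation and cleansing step described above; the schema, column names, and paths are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse-sketch").getOrCreate()

raw = spark.read.json("/landing/shipments/")

clean = (raw
         .dropDuplicates(["shipment_id"])
         .filter(F.col("shipment_id").isNotNull())
         .withColumn("weight_kg", F.col("weight_kg").cast("double"))
         .fillna({"carrier": "UNKNOWN"})
         .withColumn("load_date", F.current_date()))

clean.write.mode("append").partitionBy("load_date").parquet("/curated/shipments/")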

Implemented Kubernetes for container orchestration and management within big data clusters, enabling scalable, resilient, and self-healing deployment of containerized applications.

Lead Data Engineer

AbbVie Inc., North Chicago, IL Mar 2019 – Oct 2021

Developed and deployed Big Data analytics solutions on AWS to monitor and analyze adverse event reports, safety signals, and other pharmacovigilance data sources.

Leveraged advanced analytics techniques, including Natural Language Processing (NLP) and machine learning, on AWS to extract insights from unstructured data sources such as medical records, clinical trial data, and social media.

Implemented Scala-based test suites using frameworks like ScalaTest and Specs2 to ensure code quality and reliability, facilitating automated testing and continuous integration practices.

Designed and implemented data processing pipelines on AWS to ingest, cleanse, and transform large volumes of pharmacovigilance data for analysis and reporting purposes.

Collaborated with cross-functional teams, including data scientists, pharmacovigilance experts, and regulatory affairs professionals, on AWS to define requirements and deliver actionable insights.

Proficient in utilizing AWS S3 for scalable, durable, and secure object storage, facilitating storage and retrieval of large volumes of data.

Utilized Spark's Structured Streaming API to process and analyze continuous streams of data from Kafka topics, enabling complex event processing and pattern recognition.
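
A minimal sketch of this pattern with Spark Structured Streaming; the broker address and topic name are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "adverse-events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

# Windowed count of incoming events as a simple stand-in for richer pattern detection.
counts = events.groupBy(F.window("timestamp", "5 minutes")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()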

Experienced in leveraging AWS Glue for extract, transform, and load (ETL) tasks, enabling automated data preparation and integration for analytics and machine learning.

Skilled in configuring and managing AWS EMR (Elastic MapReduce) clusters for processing large-scale data using Apache Hadoop, Spark, and other frameworks.

Proficient in querying data directly from AWS S3 and other data sources using AWS Athena, enabling interactive querying and analysis of data stored in various formats.
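
For illustration, a minimal boto3 sketch of querying S3-backed data with Athena; the region, database, table, and result bucket are hypothetical:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT report_id, drug_name FROM safety_reports LIMIT 10",
    QueryExecutionContext={"Database": "pharmacovigilance"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])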

Experienced in implementing serverless computing solutions using AWS Lambda, enabling event-driven processing and execution of code without provisioning or managing servers.
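
A minimal sketch of an event-driven Lambda handler reacting to an S3 upload; the event shape follows the standard S3 notification format, and the processing shown is illustrative:

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each record describes one object that landed in S3.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        print(f"Received {key} from {bucket} ({obj['ContentLength']} bytes)")
    return {"statusCode": 200, "body": json.dumps("processed")}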

Implemented Kafka MirrorMaker for replicating data across Kafka clusters in different data centers or regions, ensuring data durability and disaster recovery capabilities.

Proficient in designing and orchestrating workflows using AWS Step Functions, facilitating coordination and sequencing of distributed components in serverless applications.

Utilized Apache Airflow for orchestrating and automating complex data workflows within big data environments.

Applied principles of reactive programming using Scala and Play Framework to develop responsive and scalable web services for data visualization and analytics purposes.

Designed and implemented DAGs (Directed Acyclic Graphs) in Apache Airflow to define and manage data processing pipelines, enabling clear visualization and monitoring of workflow dependencies.
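
A minimal Airflow 2-style sketch of such a DAG, wiring extract, transform, and load tasks in sequence; the DAG id, schedule, and task bodies are illustrative placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract")

def transform():
    print("transform")

def load():
    print("load")

with DAG(
    dag_id="safety_reports_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load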

Leveraged Airflow's scheduling capabilities to automate the execution of data processing tasks at defined intervals or in response to triggering events within big data workflows.

Skilled in monitoring and managing AWS resources using AWS CloudWatch, enabling real-time monitoring, logging, and alerting for AWS services and applications.

Experienced in automating software release pipelines using AWS CodePipeline, enabling continuous integration and delivery (CI/CD) of code changes across development, testing, and production environments.

Proficient in provisioning and managing AWS infrastructure using AWS CloudFormation, enabling infrastructure as code (IaC) for consistent and repeatable deployment of resources.

Skilled in implementing messaging and queuing solutions using AWS SQS, facilitating asynchronous communication between distributed components and microservices.

Experienced in building scalable, event-driven architectures using AWS SNS, enabling messaging and push notifications across distributed systems.

Applied statistical methods and predictive modeling techniques on AWS to identify potential safety risks, drug interactions, and adverse reactions in pharmaceutical products.

Collaborated with IT infrastructure teams to optimize data storage, processing, and retrieval capabilities for pharmacovigilance data sets.

Participated in the evaluation and selection of Big Data technologies, tools, and platforms on AWS to support pharmacovigilance initiatives and enhance data analysis capabilities.

Sr Data Engineer

iHerb, Pasadena, CA Nov 2017 – Mar 2019

Utilized Big Data analytics on Azure to assess market dynamics, payer preferences, and healthcare economics, facilitating informed decision-making in market access and pricing strategies.

Leveraged advanced analytics techniques on Azure to analyze large datasets and derive actionable insights for pharmaceutical companies.

Utilized Azure services such as Azure Synapse Analytics and Azure Machine Learning for specific data processing and machine learning tasks, complementing the AWS ecosystem for comprehensive Big Data solutions.

Collaborated with cross-functional teams to gather requirements and develop data-driven solutions to optimize pricing and reimbursement strategies.

Designed and implemented scalable data pipelines to process, cleanse, and transform structured and unstructured healthcare data.

Utilized Azure Data Factory for orchestrating and automating data workflows, including data movement and transformation tasks.

Integrated Apache Flink with Kafka for stream processing tasks requiring low-latency and high-throughput data processing, complementing Spark's batch processing capabilities in big data projects.

Leveraged Azure Databricks for scalable data processing using Apache Spark, enabling advanced ETL operations.
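
As a brief, hedged PySpark sketch of a Databricks-style ETL step against ADLS Gen2; the storage account, containers, paths, and columns are hypothetical, and cluster-level authentication to the storage account is assumed:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

source = "abfss://raw@mystorageacct.dfs.core.windows.net/orders/"
target = "abfss://curated@mystorageacct.dfs.core.windows.net/orders_daily/"

orders = spark.read.parquet(source)

# Aggregate to a daily summary before writing to the curated zone.
daily = (orders
         .groupBy(F.to_date("order_ts").alias("order_date"))
         .agg(F.sum("amount").alias("total_amount"),
              F.countDistinct("customer_id").alias("customers")))

daily.write.mode("overwrite").format("delta").save(target)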

Implemented Azure Logic Apps for automating workflows and integrating various Azure services and external systems.

Utilized Azure Functions for event-driven, serverless computing, enabling automation of data processing tasks.

Implemented data validation rules and checks within Azure Data Factory pipelines to ensure data quality during ETL processes.

Utilized Azure Data Lake Analytics for running ad-hoc queries and performing data validation against large datasets.

Utilized Azure Synapse Analytics for building and managing enterprise data warehouses at scale.

Leveraged Azure Analysis Services for online analytical processing (OLAP) and built multidimensional data models for reporting and analytics.

Utilized Azure Stream Analytics for real-time data ingestion, processing, and analytics from various streaming sources.

Implemented Azure Event Hubs for scalable event ingestion and processing of high-volume data streams in real time.

Utilized Azure Cosmos DB for globally distributed, multi-model NoSQL data storage, supporting document, key-value, graph, and column-family data models.
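
For illustration, a minimal azure-cosmos sketch of upserting and querying documents; the endpoint, key, database, container, and fields are hypothetical:

from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("catalog").get_container_client("products")

# Upsert a document; 'id' and the container's partition key field are required.
container.upsert_item({"id": "sku-123", "category": "vitamins", "price": 12.99})

# Query documents with Cosmos DB's SQL-like syntax.
items = container.query_items(
    query="SELECT c.id, c.price FROM c WHERE c.category = @cat",
    parameters=[{"name": "@cat", "value": "vitamins"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item)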

Implemented Azure Table Storage for semi-structured data storage and fast access to key-value data.

Utilized Azure Monitor for monitoring and analyzing the performance and health of Azure services and resources.

Implemented Azure Application Insights for application performance monitoring (APM) and diagnostics, enabling proactive issue detection and troubleshooting.

Utilized Azure Data Explorer for real-time log and telemetry data analysis, enabling rapid debugging and troubleshooting of data pipelines and applications.

Implemented Azure DevOps for end-to-end application lifecycle management (ALM), including code version control, continuous integration, and continuous delivery (CI/CD), facilitating streamlined debugging and issue resolution workflows.

Leveraged Docker Compose for defining and managing multi-container Docker applications within development and testing environments, enabling simplified configuration and deployment.

Utilized Jenkins Pipeline for defining and executing CI/CD workflows as code, enabling version-controlled and reusable pipeline definitions within big data projects.

Implemented Terraform modules for modular and reusable infrastructure provisioning within big data environments, enabling consistent and scalable infrastructure deployments.

Data Engineer

State Farm Insurance, Bloomington, IL Sep 2015 – Nov 2017

Leveraged Cloudera distribution of Apache Hadoop and HDFS for distributed storage and processing of large-scale data.

Utilized Apache Hive for data warehousing and querying, enabling SQL-like queries and analytics on Hadoop distributed file system (HDFS) data.

Employed Apache Pig for data processing and transformation tasks, facilitating ETL (Extract, Transform, Load) operations on large datasets.

Utilized Apache Spark for in-memory data processing and analytics, enabling high-speed processing and advanced analytics on distributed datasets.

Orchestrated workflow scheduling and coordination using Apache Oozie, enabling automation and management of data processing workflows.

Leveraged Apache Kafka for real-time data streaming and messaging, enabling high-throughput, fault-tolerant data ingestion and processing.

Utilized Apache HBase for real-time, NoSQL database capabilities, enabling random read/write access to large volumes of structured and semi-structured data.

Implemented Apache Impala for interactive querying and analysis of data stored in HDFS and HBase, providing low-latency SQL queries for analytics.

Applied machine learning algorithms using libraries such as Apache Mahout and scikit-learn for predictive analytics and machine learning tasks.
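
A short, hedged scikit-learn sketch of the kind of predictive model described above; the features and labels here are synthetic placeholders rather than actual project data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))            # stand-in for claim/customer features
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))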

Developed data analysis and algorithmic solutions using programming languages like Python and R.

Created dashboards and visualization tools using Tableau to present insights derived from data analysis and segmentation results.

Generated ad-hoc reports and analyses to support decision-making processes and business insights.

Optimized data processing pipelines for improved efficiency, scalability, and performance.

Implemented caching mechanisms and parallel processing techniques to handle large datasets effectively and optimize processing workflows.

Collaborated with cross-functional teams, including data scientists, business analysts, and software engineers, to drive data-driven decision-making and innovation.

Documented data processing workflows, algorithms, and data schemas for future reference, reproducibility, and knowledge sharing.

Ensured compliance with data privacy regulations and company policies, implementing data governance practices for maintaining data integrity and security.

Monitored data access and usage to prevent unauthorized access and ensure data security.

Participated in project meetings and discussions, providing technical insights, recommendations, and expertise.

Big Data Engineer

Raytheon Technologies, Arlington, VA Feb 2012 – Sep 2015

Designed and architected big data platforms tailored for cybersecurity threat detection.

Collaborated closely with cybersecurity experts to comprehend requirements and translate them into scalable and efficient data architectures.

Implemented robust algorithms and data processing pipelines to detect and mitigate cybersecurity threats effectively.

Developed and optimized machine learning models to analyze network traffic patterns and identify anomalies indicative of potential security breaches.

Leveraged a variety of machine learning algorithms, including supervised and unsupervised techniques, to analyze large volumes of network data.

Applied anomaly detection algorithms to identify unusual patterns and behaviors signaling cyber-attacks or malicious activities.
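
As an illustrative, hedged sketch of anomaly detection over network-flow features using scikit-learn; the feature matrix is synthetic and stands in for parsed flow records:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Columns might represent bytes sent, packet count, duration, distinct ports, etc.
normal_traffic = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))
suspicious = rng.normal(loc=6.0, scale=1.0, size=(20, 4))
flows = np.vstack([normal_traffic, suspicious])

detector = IsolationForest(contamination=0.005, random_state=0)
detector.fit(flows)

labels = detector.predict(flows)   # -1 marks likely anomalies
print("flagged flows:", int((labels == -1).sum()))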

Integrated threat intelligence feeds from external sources into big data platforms to enhance cybersecurity analytics capabilities.

Developed mechanisms to ingest, parse, and analyze threat intelligence data in real time, enriching cybersecurity analytics capabilities.

Implemented real-time monitoring and alerting systems to promptly notify security teams of potential security incidents.

Utilized streaming data processing frameworks to analyze incoming network traffic and generate alerts for suspicious activities.

Leveraged Hadoop for distributed storage and processing of large datasets.

Employed Apache Kafka as a distributed streaming platform for ingesting and processing real-time data feeds.

Utilized various machine learning libraries and frameworks, such as Scikit-learn and Apache Mahout, for developing predictive models tailored for cybersecurity threat detection.

Utilized SQL for querying and analyzing data within big data environments.

Utilized Python's Pandas library for data manipulation and analysis, enabling efficient handling of structured data within big data environments.

Leveraged Pandas DataFrames for performing data cleaning, transformation, and aggregation tasks on large datasets, enhancing data preprocessing workflows.

Utilized Pandas' powerful querying and filtering capabilities to extract relevant information from complex datasets within big data platforms.

Employed NumPy for numerical computing tasks, including array manipulation, mathematical operations, and linear algebra computations within Python-based big data workflows.

Utilized NumPy arrays for efficient storage and manipulation of large-scale numerical data, enhancing performance and memory efficiency in data processing tasks.
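
A minimal Pandas/NumPy sketch of the cleaning and aggregation steps described above; the columns and values are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", None],
    "bytes": [1200, 560, np.nan, 300],
    "duration_s": [1.2, 0.4, 3.1, 0.9],
})

# Drop rows missing a key field, fill gaps, and derive a throughput column.
clean = (df
         .dropna(subset=["src_ip"])
         .assign(bytes=lambda d: d["bytes"].fillna(0),
                 throughput=lambda d: d["bytes"] / d["duration_s"]))

summary = clean.groupby("src_ip").agg(
    total_bytes=("bytes", "sum"),
    mean_throughput=("throughput", "mean"),
)
print(summary)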

Leveraged Apache Hive for data warehousing and querying capabilities.

Utilized Apache HBase for real-time, NoSQL database capabilities within big data environments.

EDUCATION

MA Actuarial Science, Ball State University

MSc Mathematics with Big Data, African Institute for Mathematical Sciences

B.Sc. Actuarial Science, University for Development Studies

CERTIFICATIONS

Machine Learning, Coursera

Python and Machine Learning for Financial Analysis, Udemy

Tableau for data science, Udemy

Data Analysis with Python, Udemy

Python, IBM & Coursera

Introduction to Data Science and Analytics, IBM


