
Senior Big Data Engineer

Boydton, VA
September 29, 2023



Derrick McGlone

Email: | Contact: (925) 215-5446

Profile Summary

•A seasoned professional with over 7 years of experience developing Big Data solutions, including extensive work as a Hadoop Developer leveraging cloud platforms to design and implement scalable, high-performance data processing pipelines

•Proficient in data ingestion, processing, and manipulation on both on-premises and cloud platforms (AWS and Azure) across sectors such as healthcare, information technology, and finance

•Experienced in designing scalable and efficient data architectures on Azure, leveraging services such as Azure Data Lake, Azure Synapse Analytics, and Azure SQL Data Warehouse.

•Built and managed data pipelines using Azure Data Factory, ensuring efficient and reliable data processing and analysis workflows.

•Utilized Azure Storage to enable fast and efficient data storage and retrieval, leveraging its powerful storage capabilities.

•Implemented an AWS EMR cluster, enabling the team to process large amounts of data in parallel and reducing processing times by 50%.

•Utilized AWS EMR and AWS Lambda to process and analyze large volumes of data in the cloud.

•Designed and implemented a data warehousing solution using AWS Redshift and AWS Athena to enable efficient querying and analysis of data.

•Utilized AWS Cloudformation to manage and provision AWS resources for the data pipeline.

•Created workflows using AWS Step Functions to orchestrate and manage complex data processing tasks.

•Used AWS Glue to perform ETL operations and integrate with various data sources and destinations.

•Added value to Agile/Scrum processes such as Sprint Planning, Backlog grooming, Sprint Retrospectives, and Requirements Gathering, and provided planning and documentation for projects

•Proficient in various distributions and platforms, including the Apache Hadoop ecosystem, Microsoft Azure, and Databricks Spark

•Implemented data governance frameworks, resulting in a 15% reduction in data errors and inconsistencies in Financial data

•Actively contributed to the migration of on-premises data solutions to the cloud, resulting in significant cost savings and increased scalability.

•Formulated comprehensive technical strategies and devised scalable CloudFormation Scripts, ensuring streamlined project execution.

•Expertise in Spark performance optimization across multiple platforms, including Databricks, Glue, EMR, and on-premises clusters

•Optimized data ingestion and transformation processes for big data workloads and achieved a 40% reduction in data processing time

•Orchestrated the deployment of a secure Virtual Private Cloud (VPC) through meticulous CloudFormation scripting, housing a robust multi-tier infrastructure.

•Proficiently managed diverse file formats (Parquet, Avro, JSON) and compression techniques (Snappy, Gzip) to optimize data processing.

•Implemented automated monitoring and alerting systems on cloud platforms to identify and resolve data processing issues, reducing downtime by 25%

•Deepened Python programming skills and understanding of Spark's architecture through hands-on Spark development

•Hands-on experience with Hive and HBase integration, with a clear understanding of how the two systems differ

•Practical experience in importing and exporting terabytes of data between HDFS and Relational Database Systems using Sqoop

•Skilled in data ingestion, extraction, and transformation using ETL processes with Hive, Sqoop, Kafka, Firehose, Flume, and Kinesis

•Successfully streamlined data pipelines, leading to a 20% increase in data processing efficiency

•Successfully leveraged Terraform to implement an end-to-end Virtual Private Cloud (VPC) for hosting a web application, showcasing adaptability across Infrastructure as Code (IaC) tools
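The file-format and compression choices mentioned above (Parquet/Avro/JSON with Snappy or Gzip) come down to trading CPU for storage and I/O. A minimal, self-contained sketch of that trade-off using Python's standard gzip module; the records and sizes here are illustrative, not drawn from any of the projects described:

```python
import gzip
import json

# Illustrative repetitive records, similar in shape to pipeline event data.
records = [{"event_id": i, "status": "OK", "region": "us-east-1"} for i in range(1000)]

raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2f}")
```

In Spark pipelines the choice is typically made per file format: Parquet with Snappy favors splittable files and fast decompression for hot data, while Gzip maximizes compression for cold data at higher CPU cost.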

Technical Skills

•AWS: S3, EMR, DynamoDB, Redshift, RDS, Athena, Elasticsearch, Kinesis, CloudFormation, Lambda, CloudWatch

•Azure: Data Factory, Databricks, Data Lake Gen2, Azure SQL, HDInsight, Synapse, Stream Analytics

•Programming Languages: Scala, Python, Java, Bash

•Hadoop Components: Hive, Pig, Zookeeper, Sqoop, Oozie, Yarn, Maven, Flume, HDFS, Airflow

•Hadoop Administration: Zookeeper, Oozie, Cloudera Manager, Ambari, Yarn

•ETL Data Pipeline Architecture: Apache Airflow, Hive, Sqoop, Flume, Scala, Python, Apache Kafka, Logstash

•Scripting: HiveQL, SQL, Shell scripting

•Big Data Frameworks: Spark and Kafka

•Spark Framework: Spark API, Spark Streaming, Spark SQL, Spark Structured Streaming

•Visualization: Tableau, QlikView, PowerBI

•Software Development IDE: Jupyter Notebooks, PyCharm, IntelliJ

•Continuous Integration (CI/CD): Jenkins

•Versioning: Git, GitHub, Bitbucket

•Methodologies: Agile Scrum, Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing


Senior Data Engineer

Chevron, San Ramon, CA, May’21-Present

•Collaborating with data scientists and analysts to develop machine learning models using Amazon SageMaker for tasks such as fraud detection, risk assessment, and customer segmentation.

•Containerizing a Confluent Kafka application and configuring a subnet for communication between containers.

•Utilizing AWS S3 for efficient data collection and storage, enabling easy access and processing of large datasets.

•Orchestrating data pipelines with AWS Step Functions and leveraging Amazon Kinesis for event messaging.

•Managing data cleaning and preprocessing using AWS Glue, and writing transformation scripts in Python.

•Performing real-time data analysis using Amazon Kinesis Data Analytics and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

•Implementing data transformations with Amazon Athena for SQL processing and AWS Glue for Python processing, encompassing cleaning, normalization, and data standardization.

•Applying Amazon EC2, Amazon CloudWatch, and AWS CloudFormation in various AWS projects.

•Leveraging Amazon S3 and Amazon DynamoDB to load data into Spark data frames, performing in-memory data computation to generate output responses.

•Monitoring Amazon RDS and CPU Memory using Amazon CloudWatch.

•Developing automated Python scripts to convert data from diverse sources and generate ETL pipelines.

•Creating a robust data lake using Amazon S3 for efficient storage and processing of vast amounts of data.

•Transforming SQL queries into Spark transformations using Spark RDDs, Python, and Scala.

•Creating and optimizing data processing workflows using AWS services such as Amazon EMR and Amazon Kinesis to process and analyze large amounts of data in a timely and efficient manner.

•Designing and monitoring scalable and high-performance computing clusters using AWS Lambda, S3, Amazon Redshift, Databricks, and Amazon CloudWatch.

•Employing Amazon Athena for faster data analysis compared to Spark.

•Writing streaming applications with Apache Spark Streaming and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

•Utilizing Amazon EMR for processing Big Data across the Hadoop Cluster of virtual servers on Amazon EC2 and Amazon S3, along with Amazon Redshift for data warehousing.

•Collaborating with the DevOps team to deploy pipelines in higher environments using AWS CodePipeline and AWS CodeDeploy.

•Developing and implementing recovery plans and procedures.

•Executing Hadoop/Spark jobs on Amazon EMR with data and programs stored in Amazon S3 buckets.

•Effectively managing tasks and tracking progress using Jira Kanban boards and ServiceNow, following Agile methodology principles.
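A data lake on Amazon S3 like the one described above typically relies on a deterministic, Hive-style partitioned key layout so that Athena and Glue can prune partitions at query time. A hedged sketch of such a layout in plain Python; the bucket-level dataset name, date scheme, and filename are illustrative assumptions, not the project's actual conventions:

```python
from datetime import date

def s3_partition_key(dataset, event_date, filename):
    """Build a Hive-style partitioned S3 key: dataset/year=YYYY/month=MM/day=DD/file."""
    return (
        f"{dataset}/year={event_date.year:04d}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}/{filename}"
    )

key = s3_partition_key("trades", date(2023, 9, 29), "part-0001.parquet")
print(key)  # trades/year=2023/month=09/day=29/part-0001.parquet
```

Writing `year=`/`month=`/`day=` prefixes (rather than bare dates) lets Athena register them as partition columns, so a query filtered on a date range only scans the matching prefixes.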

Hadoop Data Engineer

First American Financial, Santa Ana, California, Jan’20-May’21

•Participated in AWS data migration from local SQL Server databases to Amazon RDS and EMR Hive.

•Optimized Hive analytics SQL queries, created tables and views, and wrote custom queries and Hive-based exception processes.

•Designed and developed data pipelines in an Azure environment using ADL Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, Azure Synapse for analytics and MS PowerBI for reporting.

•Developed consumer intelligence reports based on market research, data analytics, and social media.

•Deployed the Big Data Hadoop application using Talend on AWS Cloud.

•Utilized AWS Redshift to store Terabytes of data on the Cloud.

•Used Spark SQL and DataFrames API to load structured and semi-structured data into Spark Clusters.

•Wrote shell scripts to move log files to the Hadoop cluster through automated processes.

•Implemented AWS Fully Managed Kafka streaming to send data streams from the company APIs to Spark cluster in AWS Databricks.

•Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.

•Worked on AWS to create and manage EC2 instances and Hadoop clusters.

•Implemented a Cloudera Hadoop distribution cluster using AWS EC2.

•Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

•Ingested large data streams from company REST APIs into the EMR cluster through AWS Kinesis.

•Streamed data from AWS Fully Managed Kafka brokers using Spark Streaming and processed the data using explode transformations.

•Finalized the data pipeline using DynamoDB as a NoSQL storage option.

•Collaborated with cross-functional teams, including data scientists and analysts, to deliver data-driven solutions that address business needs.
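The `explode` transformation used on the Kafka streams above turns one record with a list-valued field into one record per list element. A minimal pure-Python illustration of the semantics; Spark's actual implementation is distributed, and the field names here are hypothetical:

```python
def explode(records, field):
    """Yield one flat record per element of the list-valued `field`,
    mirroring the semantics of Spark's explode()."""
    for rec in records:
        for value in rec.get(field, []):
            yield {**rec, field: value}

rows = [{"order_id": 1, "items": ["a", "b"]}, {"order_id": 2, "items": ["c"]}]
flat = list(explode(rows, "items"))
print(flat)
# [{'order_id': 1, 'items': 'a'}, {'order_id': 1, 'items': 'b'}, {'order_id': 2, 'items': 'c'}]
```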

Big Data Engineer

Burlington Stores Inc., North Burlington, NJ, Jul’18-Dec’19

•Ingested data using Apache Flume, with Kafka as the data source and HDFS as the data sink, ensuring smooth and efficient data flow within the Hadoop ecosystem.

•Collected and aggregated large volumes of log data using Apache Flume, staging the data in HDFS for further analysis and insights.

•Transformed ETL processes to leverage Hadoop's distributed file system (HDFS), enhancing data processing capabilities and taking advantage of Hadoop's scalability.

•Managed storage capacity, fine-tuned performance, and conducted benchmarking of Hadoop clusters to optimize data processing and enhance overall system efficiency.

•Transferred data between the Hadoop ecosystem and structured storage in MySQL RDBMS using Sqoop, ensuring smooth data integration and synchronization.

•Loaded data from the UNIX file system to HDFS, facilitating efficient data management within the Hadoop environment.

•Streamlined log data ingestion into HDFS using Flume, simplifying the process and enabling real-time data analysis.

•Installed and configured Hive, a powerful data warehousing and SQL-like querying tool, and developed custom Hive User-Defined Functions (UDFs) to extend Hive's functionality and meet specific business needs.

•Administered cluster coordination through Zookeeper, ensuring proper coordination and synchronization among the nodes in the Hadoop cluster.

•Utilized Sqoop to export data from DB2 to HDFS, ensuring smooth data transfer and integration across different data sources.

•Assisted the team in exporting analyzed data to relational databases using Sqoop, facilitating data reporting and analysis across multiple platforms.

•Developed workflow using Oozie to automate the execution of MapReduce jobs and Hive queries, streamlining data processing tasks and enhancing workflow efficiency.

•Gained insightful updates on available technologies, industry trends, and cutting-edge applications.
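Log-aggregation flows like the Flume-to-HDFS pipeline above generally begin by parsing raw log lines into structured records before staging them for analysis. A hedged sketch of that parsing step in plain Python; the log format here is an invented example, not the actual format used:

```python
import re

# Hypothetical log format: "2019-07-01 12:00:00 INFO checkout latency=123ms"
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) (?P<component>\w+) latency=(?P<latency_ms>\d+)ms"
)

def parse_line(line):
    """Return a structured record for one log line, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["latency_ms"] = int(rec["latency_ms"])  # numeric field for aggregation
    return rec

rec = parse_line("2019-07-01 12:00:00 INFO checkout latency=123ms")
print(rec)
```

In a Flume deployment this kind of parsing would typically live in an interceptor or in a downstream Hive/MapReduce job over the staged files.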

Hadoop Administrator

Old Republic International, Chicago, Illinois, Nov’17 – Jun’18

•Worked with highly unstructured and structured data.

•Optimized and integrated Hive, Sqoop, and Flume into existing ETL processes, accelerating the extraction, transformation, and loading of massive structured and unstructured data.

•Used Hive to simulate data warehouse for performing client-based transit system analytics.

•Worked on Hive partitioning and bucketing and handled different file formats, including JSON, XML, CSV, and ORC.

•Installed clusters, commissioned and decommissioned data nodes, configured slots and NameNode high availability, and performed capacity planning.

•Executed tasks for upgrading clusters on the staging platform before doing it on the production cluster.

•Used Cassandra to work on JSON-documented data.

•Used HBase to store the majority of data that needed to be divided based on region.

•Managed Zookeeper configurations and ZNodes to ensure High Availability on the Hadoop Cluster.

•Performed Hadoop system administration using Hortonworks/Ambari and Linux system administration (RHEL 7, CentOS).

•Conducted HDFS balancing and fine-tuning for optimal performance of MapReduce applications, improving data processing efficiency.

•Developed a comprehensive data migration plan to seamlessly integrate other data sources into the Hadoop system, facilitating unified data management.

•Utilized open-source configuration management and deployment tools such as Puppet and Python, streamlining the setup and management of Hadoop clusters.

•Configured Kerberized authentication for the cluster, ensuring secure user access and authentication within the Hadoop environment.

•Tailored YARN Capacity and Fair schedulers to align with organizational requirements, effectively managing resource allocation and job prioritization.

•Conducted cluster capacity and growth planning, providing valuable insights for node configuration and resource allocation to accommodate future needs.

•Optimized MapReduce counters to expedite data processing and achieve optimal performance for data-intensive operations.

•Played a key role in designing backup and disaster recovery methodologies for Hadoop clusters and related databases, ensuring data resiliency and business continuity.

•Expertly performed upgrades, patches, and fixes on the Hadoop cluster, utilizing either rolling or express methods to minimize downtime and maintain system stability.
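Tailoring the YARN Capacity scheduler, as described above, is done through capacity-scheduler.xml. A hedged fragment showing the general shape of such a configuration; the queue names and percentages are illustrative, not the actual production settings:

```xml
<configuration>
  <!-- Two hypothetical queues splitting cluster capacity 70/30. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,adhoc</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <property>
    <!-- Let the etl queue burst above its share when the cluster is idle. -->
    <name>yarn.scheduler.capacity.root.etl.maximum-capacity</name>
    <value>100</value>
  </property>
</configuration>
```

The `capacity` values under a parent queue must sum to 100, while `maximum-capacity` controls how far a queue may elastically grow beyond its guaranteed share.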

Data Engineer

Virginia Commonwealth University, Richmond, VA, Aug’16-Nov’17

•Developed Python-based notebooks using Alteryx for automated weekly, monthly, and quarterly reporting ETL, streamlining data processing and analysis.

•Orchestrated the creation of Spark Scala Datasets within Azure Databricks, seamlessly defining schema structures through the implementation of Scala case classes.

•Assisted in managing the execution of intricate, long-running jobs within Azure Synapse Analytics, ensuring meticulous pre-processing of product and warehouse data. The result was data that was cleansed and optimally prepared for downstream consumption.

•Leveraged the power of Azure Stream Analytics to efficiently segment streaming data into batches, strategically feeding the Azure Databricks engine for comprehensive batch processing.

•Prepared data for ML Modelling, showcasing adept data handling skills particularly in scenarios where certain observations were censored without clear notifications.

•Elevated processing efficiency by diligently repartitioning datasets following the ingestion of Gzip files into Data Frames, effectively reducing processing time.

•Harnessed Azure Event Hubs for data processing akin to a Kafka consumer, ensuring an uninterrupted and streamlined data flow.

•Utilized the versatility of Azure Databricks to implement solutions using Python, while harnessing the prowess of Data Frames and the Azure Spark SQL API. This strategic approach led to accelerated data processing.

•Seamlessly interacted with Azure Data Lake Storage through Azure Databricks, effectively harnessing and processing data residing within the storage.

•Demonstrated proficiency in populating data frames within Azure Databricks jobs, alongside Azure Spark SQL and Data Frames API. These techniques facilitated the structured loading of data into Azure Databricks clusters.

•Monitored background operations within Azure HDInsight, ensuring optimal platform performance and operational integrity.

•Showcased adaptability by transforming real-time data into a compatible format for scalable analytics, harnessing the capabilities of Azure Stream Analytics.

•Forwarded requests to source REST-based APIs from a Scala script via an Azure Event Hubs producer, contributing to streamlined data sourcing.
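Segmenting streaming data into batches for downstream batch processing, as in the Stream Analytics-to-Databricks hand-off above, reduces to chunking an event stream. A minimal pure-Python sketch of the idea; the batch size and data are illustrative only:

```python
def batches(stream, size):
    """Group an iterable of events into fixed-size batches,
    flushing a final partial batch if one remains."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

out = list(batches(range(7), 3))
print(out)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Production stream processors usually batch by time window rather than count, but the flush-on-boundary structure is the same.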


Education

•Engineer of Science from Virginia Commonwealth University
