Big Data Engineer

Location:

Thousand Oaks, CA

Posted:

July 20, 2021


Resume:

EZRA DUDDEN

BIG DATA DEVELOPER

Email: ************@*****.*** Phone: 805-***-****

Professional Summary

Big Data developer with 6 years of experience managing data analytics, data processing, database, and data-driven projects. Skilled in DataFrames and analytics systems for diverse end users and industries, including heavy equipment, feature film and television entertainment, and industrial technology integration. Skilled in database systems and administration. Proficient in writing technical reports and documentation. Adept with distributions and platforms such as Cloudera Hadoop, Hortonworks, Elastic Cloud, and Elasticsearch. Skilled with Spark and Kafka.

Technical Skills

Apache

Kafka, Maven, Oozie, Pig, Hue, Sqoop, Flume, Hadoop, HBase, Cassandra, Airflow, Tez, ZooKeeper

BI Visualization

Kibana, Tableau, PowerBI

Programming

Python, Scala, Java, Shell Scripting

Development

Drupal, UX Design

File Types

XML, CSV, JSON, Avro, Parquet, ORC

APIs

REST API

Development Methodologies

Agile, Kanban, Scrum, Continuous Integration, TDD, Unit Testing, Functional Testing, Design Thinking, Lean, Six Sigma

Soft Skills

Communication, Collaboration, Customer Service, Help Desk, Mentoring, Reviewing

Big Data

RDDs, UDFs, Data Frames, Datasets, Pipelines, Data Lakes, Data Warehouse, Data Analysis

Hadoop

Hadoop, HDFS, Hadoop YARN, Hortonworks, Cloudera, Impala

Spark & Hive

Apache Spark, Spark Streaming, Apache Hive, Hive QL

Database

Redshift, ArangoDB, Cassandra, HBase, MongoDB, SQL, NoSQL, MySQL, RDBMS, Access, Oracle

File Management

HDFS, Snappy, Gzip, DAS, NAS, SAN

Cloud Services & Distributions

AWS, Azure, Anaconda Cloud, Elasticsearch, Solr, Lucene, Cloudera, Databricks, Hortonworks, Elastic, Cloud Foundry, Elastic Cloud

Software

Microsoft Office Suite, PeopleSoft, WEKA, PandaDoc

Operating Systems

Linux/UNIX, Windows

Virtualization & Network

SQL Injection, Data FTK imager, WAN/LAN, TCP/IP, Routing, VMware, VirtualBox, OSI Model

Experience

02/20 – Current

Big Data Engineer

MGM Studio, Thousand Oaks, CA

Developed Spark jobs using Spark SQL to process structured data on Spark clusters.

Implemented Spark jobs in Python, using the Data Frames and Spark SQL APIs for faster data processing.

Ingested image responses through a Kafka producer written in Python.

Defined a schema for a custom HBase table.

Analyzed the performance of digital strategies to yield business recommendations.

Worked on AWS Kinesis for processing real-time data.

Handled schema changes in the data stream using Spark.

Connected Tableau and databases to HBase and configured the integration for optimized data extraction.

Used Tableau to develop multiple SQL queries to join tables and create dashboards.

10/18 – 02/20

Big Data Developer

John Deere, Moline, IL

Liaised between business and development teams to translate requirements into technically feasible specifications.

Used Spark jobs, Spark SQL, and the Data Frames API to load structured data into Spark clusters.

Created a Kafka broker that uses a schema to fetch structured data for Structured Streaming.

Interacted with data residing in HDFS using Spark to process the data.

Handled structured data via Spark SQL, stored into Hive tables for consumption.

Accessed Hadoop file system (HDFS) using Spark and managed data in Hadoop data lakes with Spark.

Handled structured data with Spark SQL to process in real time from Spark Structured Streaming.

Converted Hive/SQL queries into Spark transformations using Spark RDDs in Python.

Supported clusters and topics in Kafka Manager.

Configured Spark Streaming to receive real-time data to store in HDFS.

Ingested production line data from IoT sources through internal RESTful APIs.

Created Python scripts to download data from the APIs and perform pre-cleaning steps.

Performed ETL operations on IoT data using PySpark.

Wrote user-defined functions (UDFs) to apply custom business logic to datasets using PySpark.

Configured AWS S3 to receive and store data from the resulting PySpark job.

Wrote DAGs for Airflow to allow scheduling and automatic execution of the pipeline.

Performed cluster-level and code-level Spark optimizations.

Created AWS Redshift Spectrum external tables to query data in S3 directly for data analysts.

Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.

Hands-on with Spark Core, Spark SQL and Data Frames/Data Sets/RDD API.

10/17 – 10/18

Big Data Developer

BlackRock, New York, NY

Used Spark to process data on top of YARN and performed analytics on data in Hive.

Used Cloudera Manager to measure performance for optimization.

Collected real-time log data from diverse sources such as RESTful APIs and social media using web crawlers written in Python, ingested it through Kafka, and filtered it before loading into HDFS for further analysis.

Imported data from various data sources and performed transformations and analysis using Spark.

Installed, configured, and deployed Hadoop distributions in cloud environments.

Used the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.

Wrote shell scripts for exporting log files to Hadoop cluster through automated processes.

Migrated streaming and static data from RDBMS into the Hadoop cluster using Hive, Flume, and Sqoop.

Transferred streaming log data from different data sources into HDFS and HBase using Apache Flume.

Worked on the application's complete Big Data flow, from ingesting upstream data into HDFS to processing the data in HDFS using Spark.

Developed a NiFi pipeline using ELK for quicker testing and handling of information.

Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming app to process event-driven data.

Collected real-time data from Kafka using Spark Streaming and performed transformations.

Designed Hive queries to perform data analysis, data transfer and table design.

Expertise in Hive queries, with extensive knowledge of joins.

Developed end-to-end Hive queries to parse raw data, populate external and internal tables, and store the refined data in partitioned external tables.

08/16 – 10/17

Big Data Engineer

SAIC, Huntsville, AL

Implemented Spark on EMR to process Big Data across our data lake in AWS.

Worked with the AWS IAM console to create custom users and groups.

Created an AWS Lambda function to extract data from Kinesis Firehose and post it to an AWS S3 bucket on a schedule (every 4 hours) using an AWS CloudWatch event.

Implemented serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB.

Expertise in AWS data migration between database platforms, such as local SQL Server to Amazon RDS and EMR Hive.

Led many critical on-prem data migrations to the AWS cloud, assisting with performance tuning and providing a successful path to Redshift clusters and AWS RDS DB engines.

Worked on AWS S3 bucket integration for application and development projects.

Managed and reviewed Hadoop log files in AWS S3.

Designed logical and physical data models for various data sources on Amazon Redshift.

Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Amazon Redshift.

Education

Bachelor of Science in Mathematics (major concentration: Theoretical Mathematics), Metropolitan State University of Denver
