Tristan Pacba
Big Data Engineer
Phone: 917-***-****
Email: *************@*****.***

Professional Introduction
Well-rounded Big Data Engineer and Developer with 5+ years of experience in Big Data and hands-on involvement in all phases of Big Data environments, including design, implementation, development, customization, performance tuning, data cleaning, and database management.
Summary of Qualifications
Good knowledge of the Spark framework for both batch and real-time data processing.
Hands-on experience processing data using the Spark Streaming API.
Skilled in AWS, Redshift, Cassandra, DynamoDB, and various cloud tools; experienced with the AWS, Microsoft Azure, and Google Cloud platforms.
Have worked with over 100 terabytes of data warehouse data and over 1 petabyte of data in Hadoop clusters.
Have handled over 70 billion messages a day funneled through Kafka topics.
Provide end-to-end data analytics solutions and support using Hadoop systems and tools on cloud services as well as on-premises nodes.
Expert in the big data ecosystem, using Hadoop, Spark, and Kafka alongside column-oriented big data systems such as Cassandra and HBase.
Capable of building data tools to optimize utilization of data and configure end-to-end systems. Use Spark SQL to perform transformations and actions on data residing in Hive (a minimal sketch appears after this list).
Use Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch processing.
Responsible for building high-quality data transfer and transformation pipelines using Flume, Spark, Spark Streaming, and Hadoop.
Able to architect and build new data models that provide intuitive analytics to customers.
Able to design and develop new systems and tools that enable clients to optimize and track their data using Spark.
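A minimal sketch, assuming a Hive metastore is available, of using Spark SQL to run transformations and actions on data residing in Hive; the application name, database, table, and column names are illustrative only.

```python
# Illustrative PySpark sketch: Spark SQL transformations and actions on a Hive table.
# The table sales.orders and its columns are assumed for the example.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-transformations")
         .enableHiveSupport()   # lets Spark SQL see tables in the Hive metastore
         .getOrCreate())

# Transformation: aggregate a Hive table with Spark SQL (lazy until an action runs)
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM sales.orders
    GROUP BY order_date
""")

# Actions: materialize the result and inspect a sample
daily_totals.cache()
print(daily_totals.count())
daily_totals.show(10)
```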
Education
Bachelor’s Degree in Physics
Azusa Pacific University – Azusa, CA
Certification
Spark Fundamentals, IBM
Technical Skills
Databases
Apache Cassandra, Amazon Redshift, Amazon RDS, SQL, Apache HBase, Hive, MongoDB
Data Storage
HDFS, Data Lake, Data Warehouse, Database, PostgreSQL
Amazon Stack
AWS, EMR, EC2, SQS, S3, DynamoDB, Redshift, CloudFormation
Programming Languages
Scala, Python (PySpark), Java, Shell scripting (Bash)
Virtualization
VMware, vSphere, Virtual Machines
Data Pipelines/ETL
Flume, Apache Kafka, Logstash
Development Environment
IDE: Jupyter Notebooks, PyCharm, IntelliJ, Spyder, Anaconda
Continuous Integration (CI/CD): Jenkins; Versioning: Git, GitHub
Cluster Security & Authentication
Kerberos and Ranger
Query Languages
SQL, Spark SQL, HiveQL, CQL
Distributions
Hadoop, Cloudera, Hortonworks
Hadoop Ecosystem
Hadoop, HDFS, Hive, Spark, Maven, Ant, Kafka, HBase, YARN, Flume, Zookeeper, Impala, Pig, Mesos, Oozie, Tez, Apache Airflow
Frameworks
Spark, Kafka
Search Tools
Apache Solr/Lucene, Elasticsearch
File Formats
Parquet, Avro, ORC
File Compression
Snappy, Gzip
Methodologies
Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing, Regression Testing
Streaming Data
Kinesis, Spark, Spark Streaming, Spark Structured Streaming
Professional Experience
CLOUD DATA ENGINEER
MSNBC, New York, NY December 2019 – Present
Optimized data collection, flow, and delivery for cross-functional teams.
Consumed data through sources such as RESTful APIs, S3 buckets, and SFTP servers.
Transmitted collected data using AWS Kinesis to consumers.
Wrote an AWS Lambda function to perform validation on data and connected it to Kinesis Data Firehose (a sketch of this kind of function appears after this section).
Defined configurations for AWS EMR, Kinesis, and S3 policies.
Worked with transient EMRs to run Spark jobs and perform ETL.
Loaded transformed data into several AWS database and data warehouse services using Spark connectors.
Monitored pipelines using AWS SNS to receive alerts regarding pipeline failures.
Performed root cause analyses (RCAs) whenever issues arose and developed solutions to prevent possible future issues.
Documented pipeline functions and drafted SOPs in regard to previous RCAs.
Orchestrated the pipeline using AWS Step Functions.
Set up a test environment in AWS VPC to create an EMR cluster.
Used EMR to create Spark applications that filter user data.
Corrected Avro schema tables to support changed user data.
Developed SQL queries to aggregate user subscription data.
Set up data pipeline using Docker image container with AWS CLI and Maven to be deployed on AWS EMR.
Wrote a Bash script, run during cluster launch, to set up HDFS.
Appended EMR cluster steps in JSON format to execute tasks that prepare the cluster during launch.
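The validation Lambda itself is not reproduced here; the following is a hedged sketch of a Kinesis Data Firehose transformation Lambda of the kind described above. The required fields and validation rule are assumptions for illustration.

```python
# Sketch of a Firehose transformation Lambda: decode each record, validate it,
# and return it with a result of "Ok" or "ProcessingFailed".
import base64
import json

REQUIRED_FIELDS = {"event_id", "timestamp", "payload"}  # assumed message schema


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        try:
            doc = json.loads(base64.b64decode(record["data"]))
            if not REQUIRED_FIELDS.issubset(doc):
                raise ValueError("missing required fields")
            result = "Ok"
            data = base64.b64encode(json.dumps(doc).encode()).decode()
        except ValueError:
            # Firehose delivers failed records to its error output prefix
            result = "ProcessingFailed"
            data = record["data"]
        output.append({"recordId": record["recordId"], "result": result, "data": data})
    return {"records": output}
```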
DATA ENGINEER
John Deere, Moline, IL March 2018 – December 2019
Ingested production line data from IoT sources through internal RESTful APIs.
Created Python scripts to download data from the APIs and perform pre-cleaning steps.
Performed ETL operations on IoT data using PySpark.
Wrote user-defined functions (UDFs) to apply custom business logic to datasets using PySpark.
Configured AWS S3 to receive and store data from the resulting PySpark job.
Wrote DAGs for Airflow to allow scheduling and automatic execution of the pipeline.
Performed cluster-level and code-level Spark optimizations.
Created AWS Redshift Spectrum external tables to query data in S3 directly for data analysts.
Liaised between the business and development teams to translate requirements into technically feasible solutions.
Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.
Managed a small offshore team and performed code reviews to ensure code quality before committing.
Worked hands-on with Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.
Used Spark jobs, Spark SQL, and the DataFrame API to load structured data into Spark clusters.
Configured a Kafka broker and applied a schema to consume structured data via Spark Structured Streaming.
Used Spark to process data residing in HDFS.
Handled structured data via Spark SQL and stored it in Hive tables for downstream consumption.
Accessed the Hadoop file system (HDFS) and managed data in Hadoop data lakes using Spark.
Processed structured data in real time with Spark SQL on top of Spark Structured Streaming.
Converted Hive/SQL queries into Spark transformations using Spark RDDs in Python.
Supported clusters and topics via Kafka Manager.
Configured Spark Streaming to receive real-time data and store it in HDFS (see the sketch after this section).
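A hedged sketch of the Kafka-to-HDFS flow described in the last few bullets, using Spark Structured Streaming; the broker address, topic name, message schema, and HDFS paths are illustrative assumptions.

```python
# Read JSON messages from a Kafka topic with Structured Streaming, parse them against
# a schema, and append the result to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("iot-kafka-stream").getOrCreate()

schema = StructType([                       # assumed IoT message layout
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
       .option("subscribe", "production-line-telemetry")    # assumed topic
       .load())

parsed = (raw
          .select(from_json(col("value").cast("string"), schema).alias("msg"))
          .select("msg.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/iot/clean")               # assumed output path
         .option("checkpointLocation", "hdfs:///checkpoints/iot") # required for recovery
         .outputMode("append")
         .start())

query.awaitTermination()
```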
BIG DATA DEVELOPER
Mattel, El Segundo, CA June 2017 – March 2018
Transferred unstructured and structured data between data lakes.
Wrote Java code for Kafka producers and consumers to establish a pub-sub model for incoming data.
Maintained health of Kafka clusters as well as Hadoop services using Zookeeper.
Utilized HDFS to store data from other data lakes and SequenceFiles to efficiently store smaller files.
Created Spark scripts in Scala to perform ETL on HDFS data faster than MapReduce.
Optimized Spark code and tuned cluster settings for optimal performance.
Converted data in various formats to Parquet for querying in Postgres and Hive (a PySpark sketch of this conversion appears after this section).
Met with data analysts to determine best practices for modeling data for analytics.
Mapped data to HBase tables and implemented SQL queries to retrieve it.
Streamed events from HBase to Solr using the HBase Indexer.
Assisted in upgrading, configuring, and maintaining various Hadoop ecosystem components such as Pig, Hive, and HBase.
Imported data from sources such as HDFS and HBase into Spark RDDs.
Validated data against known-good sources to ensure data quality.
Visualized data using Kibana and the ELK stack.
Participated in scrum calls to signal progress to stakeholders.
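The production ETL scripts were written in Scala; the following PySpark sketch shows the same kind of conversion to Parquet for downstream Hive querying. Paths, table names, and the partition column are illustrative assumptions.

```python
# Convert raw JSON landed in the data lake to Snappy-compressed Parquet and expose
# it as a Hive table for analysts.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-to-parquet")
         .enableHiveSupport()           # required to write to the Hive metastore
         .getOrCreate())

raw = spark.read.json("hdfs:///lake/raw/events/")   # assumed landing path

# Light cleanup before persisting in a columnar format
clean = (raw
         .dropDuplicates(["event_id"])
         .filter("event_id IS NOT NULL"))

(clean.write
 .mode("overwrite")
 .partitionBy("event_date")             # assumed partition column
 .format("parquet")
 .option("compression", "snappy")
 .saveAsTable("analytics.events_parquet"))
```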
BIG DATA DEVELOPER
Frito-Lay, Plano, TX March 2016 – June 2017
Installed and configured Hadoop on clusters, including HDFS and MapReduce.
Utilized Sqoop to migrate data from deprecated warehouses onto HDFS.
Wrote MapReduce scripts in Java to process advertising data and store results onto Hive.
Created shell scripts in *nix environment to facilitate migration process.
Partitioned and bucketed external Hive tables to lower execution costs.
Maintained health of Hadoop services and nodes using Zookeeper.
Orchestrated workflow of pipeline using Oozie.
Researched potential technologies to add to current tech stack.
Wrote crontab entries to schedule pipeline execution.
Documented pipeline structure, source to target mappings, and functionality of technologies.
Implemented several highly distributed, scalable, large-scale applications using Cloudera Hadoop.
Migrated streaming or static RDBMS data into the Hadoop cluster from dynamically generated files using Flume and Sqoop.
Worked regularly with Linux systems and RDBMS databases to ingest data using Sqoop.
Captured data and imported it into HDFS, using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
Identified and ingested source data from different systems into HDFS using Sqoop and Flume, creating HBase tables to store variable data formats for analytics (a Sqoop import sketch appears below).
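The original ingestion jobs were run as Sqoop commands; this is a hedged Python sketch that wraps one such import, with the connection string, credentials path, table, and target directory as illustrative assumptions.

```python
# Wrap a Sqoop import of one relational table into HDFS; assumes the sqoop CLI is
# installed on the edge node and a password file already exists in HDFS.
import subprocess


def sqoop_import(table: str, target_dir: str) -> None:
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://warehouse-db:3306/ads",   # assumed source database
        "--username", "etl_user",                            # assumed service account
        "--password-file", "/user/etl/.db_password",         # avoids plaintext passwords
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", "4",                                # modest parallelism
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    sqoop_import("ad_impressions", "/data/raw/ad_impressions")
```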