Tristan Pacba
Big Data Engineer
Phone: 917-***-****
Email: *************@*****.***

Professional Introduction
Well-rounded Big Data Engineer and Developer with 5+ years of experience in Big Data and hands-on involvement in all phases of Big Data environments, including design, implementation, development, customization, performance tuning, data cleaning, and database management.
Summary of Qualifications
Good knowledge of the Spark framework for both batch and real-time data processing.
Hands-on experience processing data using the Spark Streaming API.
Skilled in AWS, Redshift, Cassandra, DynamoDB, and various cloud tools; experienced with the AWS, Microsoft Azure, and Google Cloud platforms.
Have worked with over 100 terabytes of data warehouse data and over 1 petabyte of data in Hadoop clusters.
Have handled over 70 billion messages a day funneled through Kafka topics.
Provide end-to-end data analytics solutions and support using Hadoop systems and tools on cloud services as well as on-premises nodes.
Expert in the big data ecosystem, using Hadoop, Spark, and Kafka alongside column-oriented big data systems such as Cassandra and HBase.
Capable of building data tools to optimize utilization of data and configure end-to-end systems. Use Spark SQL to perform transformations and actions on data residing in Hive (a minimal sketch appears after this list).
Use Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch processing.
Responsible for building high-quality data transfer and transformation pipelines using Flume, Spark, Spark Streaming, and Hadoop.
Able to architect and build new data models that provide intuitive analytics to customers.
Able to design and develop new systems and tools that enable clients to optimize and track their data using Spark.
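A minimal sketch, assuming a Hive metastore is available, of using Spark SQL to run transformations and actions on data residing in Hive; the application name, database, table, and column names are illustrative only.

```python
# Illustrative PySpark sketch: Spark SQL transformations and actions on a Hive table.
# The table sales.orders and its columns are assumed for the example.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-transformations")
         .enableHiveSupport()   # lets Spark SQL see tables in the Hive metastore
         .getOrCreate())

# Transformation: aggregate a Hive table with Spark SQL (lazy until an action runs)
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM sales.orders
    GROUP BY order_date
""")

# Actions: materialize the result and inspect a sample
daily_totals.cache()
print(daily_totals.count())
daily_totals.show(10)
```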
Education
Bachelor’s Degree in Physics
Azusa Pacific University – Azusa, CA
Certification
Spark Fundamentals, IBM
Technical Skills
Databases
Apache Cassandra, Amazon Redshift, Amazon RDS, SQL, Apache HBase, Hive, MongoDB
Data Storage
HDFS, Data Lake, Data Warehouse, Database, PostgreSQL
Amazon Stack
AWS, EMR, EC2, SQS, S3, DynamoDB, Redshift, CloudFormation
Programming Languages
Scala, Python (PySpark), Java, Shell scripting (Bash)
Virtualization
VMware, vSphere, Virtual Machines
Data Pipelines/ETL
Flume, Apache Kafka, Logstash
Development Environment
IDE: Jupyter Notebooks, PyCharm, IntelliJ, Spyder, Anaconda
Continuous Integration (CI/CD): Jenkins; Versioning: Git, GitHub
Cluster Security & Authentication
Kerberos and Ranger
Query Languages
SQL, Spark SQL, HiveQL, CQL
Distributions
Hadoop, Cloudera, Hortonworks
Hadoop Ecosystem
Hadoop, HDFS, Hive, Spark, Maven, Ant, Kafka, HBase, YARN, Flume, Zookeeper, Impala, Pig, Mesos, Oozie, Tez, Apache Airflow
Frameworks
Spark, Kafka
Search Tools
Apache Solr/Lucene, Elasticsearch
File Formats
Parquet, Avro, ORC
File Compression
Snappy, Gzip
Methodologies
Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing, Regression Testing
Streaming Data
Kinesis, Spark, Spark Streaming, Spark Structured Streaming
Professional Experience
CLOUD DATA ENGINEER
MSNBC, New York, NY December 2019 – Present
Optimized data collection, flow, and delivery for cross-functional teams.
Consumed data through sources such as RESTful APIs, S3 buckets, and SFTP servers.
Transmitted collected data using AWS Kinesis to consumers.
Wrote an AWS Lambda function to perform validation on data and connected it to Kinesis Data Firehose (a sketch of this kind of function appears after this section).
Defined configurations for AWS EMR, Kinesis, and S3 policies.
Worked with transient EMRs to run Spark jobs and perform ETL.
Loaded transformed data into several AWS database and data warehouse services using Spark connectors.
Monitored pipelines using AWS SNS to receive alerts regarding pipeline failures.
Performed root cause analyses (RCAs) whenever issues arose and developed solutions to prevent possible future issues.
Documented pipeline functions and drafted SOPs in regard to previous RCAs.
Orchestrated the pipeline using AWS Step Functions.
Set up a test environment in AWS VPC to create an EMR cluster.
Used EMR to create Spark applications that filter user data.
Corrected Avro schema tables to support changed user data.
Developed SQL queries to aggregate user subscription data.
Set up data pipeline using Docker image container with AWS CLI and Maven to be deployed on AWS EMR.
Wrote a Bash script, run during cluster launch, to set up HDFS.
Appended EMR cluster steps in JSON format to execute tasks that prepare the cluster during launch.
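The validation Lambda itself is not reproduced here; the following is a hedged sketch of a Kinesis Data Firehose transformation Lambda of the kind described above. The required fields and validation rule are assumptions for illustration.

```python
# Sketch of a Firehose transformation Lambda: decode each record, validate it,
# and return it with a result of "Ok" or "ProcessingFailed".
import base64
import json

REQUIRED_FIELDS = {"event_id", "timestamp", "payload"}  # assumed message schema


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        try:
            doc = json.loads(base64.b64decode(record["data"]))
            if not REQUIRED_FIELDS.issubset(doc):
                raise ValueError("missing required fields")
            result = "Ok"
            data = base64.b64encode(json.dumps(doc).encode()).decode()
        except ValueError:
            # Firehose delivers failed records to its error output prefix
            result = "ProcessingFailed"
            data = record["data"]
        output.append({"recordId": record["recordId"], "result": result, "data": data})
    return {"records": output}
```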
DATA ENGINEER
John Deere, Moline, IL March 2018 – December 2019
Ingested production line data from IoT sources through internal RESTful APIs.
Created Python scripts to download data from the APIs and perform pre-cleaning steps.
Performed ETL operations on IoT data using PySpark.
Wrote user-defined functions (UDFs) to apply custom business logic to datasets using PySpark.
Configured AWS S3 to receive and store data from the resulting PySpark job.
Wrote DAGs for Airflow to allow scheduling and automatic execution of the pipeline.
Performed cluster-level and code-level Spark optimizations.
Created AWS Redshift Spectrum external tables to query data in S3 directly for data analysts.
Liaised between the business and development teams to translate requirements into technically feasible solutions.
Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.
Managed a small offshore team and performed code reviews to ensure code quality before committing.
Worked hands-on with Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.
Used Spark jobs, Spark SQL, and the DataFrame API to load structured data into Spark clusters.
Configured a Kafka broker and applied a schema to consume structured data via Spark Structured Streaming.
Used Spark to process data residing in HDFS.
Handled structured data via Spark SQL and stored it in Hive tables for downstream consumption.
Accessed the Hadoop file system (HDFS) and managed data in Hadoop data lakes using Spark.
Processed structured data in real time with Spark SQL on top of Spark Structured Streaming.
Converted Hive/SQL queries into Spark transformations using Spark RDDs in Python.
Supported clusters and topics via Kafka Manager.
Configured Spark Streaming to receive real-time data and store it in HDFS (see the sketch after this section).
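A hedged sketch of the Kafka-to-HDFS flow described in the last few bullets, using Spark Structured Streaming; the broker address, topic name, message schema, and HDFS paths are illustrative assumptions.

```python
# Read JSON messages from a Kafka topic with Structured Streaming, parse them against
# a schema, and append the result to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("iot-kafka-stream").getOrCreate()

schema = StructType([                       # assumed IoT message layout
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
       .option("subscribe", "production-line-telemetry")    # assumed topic
       .load())

parsed = (raw
          .select(from_json(col("value").cast("string"), schema).alias("msg"))
          .select("msg.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/iot/clean")               # assumed output path
         .option("checkpointLocation", "hdfs:///checkpoints/iot") # required for recovery
         .outputMode("append")
         .start())

query.awaitTermination()
```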
BIG DATA DEVELOPER
Mattel, El Segundo, CA June 2017 – March 2018
Transferred unstructured and structured data between data lakes.
Wrote Java code for Kafka producers and consumers to establish a pub-sub model for incoming data.
Maintained health of Kafka clusters as well as Hadoop services using Zookeeper.
Utilized HDFS to store data from other data lakes and SequenceFiles to efficiently store smaller files.
Created Spark scripts in Scala to perform ETL on HDFS data faster than MapReduce.
Optimized Spark code and tuned cluster settings for optimal performance.
Converted data in various formats to Parquet for querying in Postgres and Hive (a PySpark sketch of this conversion appears after this section).
Met with data analysts to determine best practices for modeling data for analytics.
Mapped data to HBase tables and implemented SQL queries to retrieve it.
Streamed events from HBase to Solr using the HBase Indexer.
Assisted in upgrading, configuring, and maintaining various Hadoop ecosystem components such as Pig, Hive, and HBase.
Imported data from sources such as HDFS and HBase into Spark RDDs.
Validated data against known-good sources to ensure data quality.
Visualized data using Kibana and the ELK stack.
Participated in scrum calls to signal progress to stakeholders.
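The production ETL scripts were written in Scala; the following PySpark sketch shows the same kind of conversion to Parquet for downstream Hive querying. Paths, table names, and the partition column are illustrative assumptions.

```python
# Convert raw JSON landed in the data lake to Snappy-compressed Parquet and expose
# it as a Hive table for analysts.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-to-parquet")
         .enableHiveSupport()           # required to write to the Hive metastore
         .getOrCreate())

raw = spark.read.json("hdfs:///lake/raw/events/")   # assumed landing path

# Light cleanup before persisting in a columnar format
clean = (raw
         .dropDuplicates(["event_id"])
         .filter("event_id IS NOT NULL"))

(clean.write
 .mode("overwrite")
 .partitionBy("event_date")             # assumed partition column
 .format("parquet")
 .option("compression", "snappy")
 .saveAsTable("analytics.events_parquet"))
```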
BIG DATA DEVELOPER
Frito-Lay, Plano, TX March 2016 – June 2017
Installed and configured Hadoop on clusters, including HDFS and MapReduce.
Utilized Sqoop to migrate data from deprecated warehouses onto HDFS.
Wrote MapReduce scripts in Java to process advertising data and store results onto Hive.
Created shell scripts in *nix environment to facilitate migration process.
Partitioned and bucketed external Hive tables to lower execution costs.
Maintained health of Hadoop services and nodes using Zookeeper.
Orchestrated workflow of pipeline using Oozie.
Researched potential technologies to add to current tech stack.
Wrote crontab entries to schedule pipeline execution.
Documented pipeline structure, source to target mappings, and functionality of technologies.
Implemented several highly distributed, scalable, large-scale applications using Cloudera Hadoop.
Migrated streaming or static RDBMS data into the Hadoop cluster from dynamically generated files using Flume and Sqoop.
Worked regularly with Linux systems and RDBMS databases to ingest data using Sqoop.
Captured data and imported it into HDFS, using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
Identified and ingested source data from different systems into HDFS using Sqoop and Flume, creating HBase tables to store variable data formats for analytics (a Sqoop import sketch appears below).
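The original ingestion jobs were run as Sqoop commands; this is a hedged Python sketch that wraps one such import, with the connection string, credentials path, table, and target directory as illustrative assumptions.

```python
# Wrap a Sqoop import of one relational table into HDFS; assumes the sqoop CLI is
# installed on the edge node and a password file already exists in HDFS.
import subprocess


def sqoop_import(table: str, target_dir: str) -> None:
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://warehouse-db:3306/ads",   # assumed source database
        "--username", "etl_user",                            # assumed service account
        "--password-file", "/user/etl/.db_password",         # avoids plaintext passwords
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", "4",                                # modest parallelism
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    sqoop_import("ad_impressions", "/data/raw/ad_impressions")
```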