
BIG DATA ENGINEER

Location:
Westlake Village, CA

Overview

Data/Big Data Engineering experience spans 11+ years.

Project roles include MDM Data Engineer, Lead AWS Data Engineer, Hadoop Data Engineer, AWS Cloud Data Engineer, Big Data Engineer, and Data Engineer.

Skilled in managing data analytics, data processing, database, and data-driven projects

Skilled in architecting big data systems, ETL pipelines, and analytics systems for diverse end users

Skilled in Database systems and administration

Proficient in writing technical reports and documentation

Adept with various distributions and platforms such as Cloudera Hadoop, Hortonworks, Elastic Cloud, and Elasticsearch

Expert in bucketing and partitioning

Expert in Performance Optimization

Technical Skills

APACHE

Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache ZooKeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce

SCRIPTING

HiveQL, MapReduce, XML, FTP, Python, UNIX, Shell scripting, Linux

OPERATING SYSTEMS

Unix/Linux, Windows 10, Ubuntu, macOS

FILE FORMATS

Parquet, Avro, JSON, ORC, text, CSV

DISTRIBUTIONS

Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK)

DATA PROCESSING (COMPUTE) ENGINES

Apache Spark, Spark Streaming, Apache Flink, Apache Storm

DATA VISUALIZATION TOOLS

Pentaho, QlikView, Tableau, Power BI, Matplotlib, Plotly, Dash

DATABASE

Microsoft SQL Server, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB

SOFTWARE

Microsoft Project, Primavera P6, VMware, Microsoft Word, Excel, Outlook, PowerPoint; technical documentation skills

Work Experience

PennyMac Westlake Village, California

MDM Data Engineer March 2021 – Present

Met with stakeholders and product owners to gather development requirements.

Worked with the dev team to propose an architecture and create the corresponding architecture artifacts.

Integrated different services in AWS with Snowflake (e.g., SNS, AWS Lambda, API Gateway and S3).

Developed and maintained Python AWS Lambda functions.

Integrated AWS Lambda functions with AWS API Gateway.
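A minimal sketch of the kind of Python Lambda handler placed behind API Gateway for this work; the payload fields and response shape follow the standard proxy-integration contract, and the field names are hypothetical.

```python
# Hypothetical sketch of a Python Lambda handler exposed through API Gateway
# (proxy integration). Resource names and payload fields are illustrative only.
import json


def lambda_handler(event, context):
    """Return a JSON response in the shape API Gateway proxy integration expects."""
    # Query-string parameters arrive under this key for proxy integrations.
    params = event.get("queryStringParameters") or {}
    record_id = params.get("id", "unknown")

    body = {"id": record_id, "status": "ok"}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```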

Developed and maintained the MDM API Gateway platform using Python as the programming language.

Created on-demand solutions to consume data from the MDM Platform using different AWS services.

Used Reltio UI and APIs to troubleshoot data issues.

Monitored AWS resources and created dashboards using New Relic.

Implemented different integration pipelines to send data to internal and external consumers.

Integrated different AWS services such as S3, Glue, SNS, SQS, Lambda, and EC2 to create ETL pipelines.

Worked closely with the DevOps team to deploy infrastructure in AWS.

Created Snowflake Snowpipe pipelines to integrate AWS S3 and Snowflake.

Performed technical troubleshooting of Python code and AWS infrastructure.

Amgen Thousand Oaks, California

Lead AWS DATA ENGINEER October 2019 – March 2021

Developed ETL pipelines for daily and weekly data ingestion.

Wrote Python code to integrate the pipelines and automate the processes.

Created Spark scripts to ingest data from different data sources.

Implemented AWS S3 as a Data Lake to store the incoming data.

Processed the incoming data using Spark in a Databricks environment.

Performed data value and data type standardization to create tables on top of the Data Lake.
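A minimal PySpark sketch of this kind of type and value standardization step; the S3 path, column names, and target table are assumptions for illustration.

```python
# Hypothetical PySpark sketch: standardize types and values on raw Data Lake files
# before exposing them as tables. Paths and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("standardize_raw").getOrCreate()

raw = spark.read.parquet("s3://example-data-lake/raw/customers/")

standardized = (
    raw
    .withColumn("customer_id", F.col("customer_id").cast("long"))
    .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd"))
    .withColumn("state", F.upper(F.trim(F.col("state"))))
    .dropDuplicates(["customer_id"])
)

# Register the cleaned data as a table on top of the Data Lake for downstream queries.
standardized.write.mode("overwrite").saveAsTable("curated.customers")
```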

Created tables and views on AWS Redshift as a Data Warehouse.

Built applications in Spark using multiple programming languages.

Made heavy use of Scala for building these Spark applications.

Used Spark batch processing with Scala as the programming language.

Used native Scala data types, such as Datasets created from case classes, in a wide variety of data transformations.

Automated the ETL jobs to run on a daily and weekly basis using Apache Airflow.
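A minimal Apache Airflow sketch of this kind of daily scheduling; the DAG id, task callables, and schedule interval are assumptions, not the actual pipeline.

```python
# Hypothetical Airflow DAG sketch for daily ETL scheduling. The DAG id,
# callables, and schedule interval are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull source data")


def transform():
    print("run Spark transformation job")


def load():
    print("load results into the warehouse")


with DAG(
    dag_id="daily_etl_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the steps in order: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```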

Used and created JSON configuration files to store information about each of the pipelines.

Created and performed transactions in tables in Databricks Delta Lake.

Created and made entries in log tables to keep track of the ETL process status.

Implemented Data Quality Management techniques on the data to ensure consistency.

Optimized Spark jobs to perform faster.

Collaborated closely with the offshore team and stakeholders to define the project scope and deliver on time.

Integrated different tools in the AWS ecosystem for the ETL jobs to run.

Created and populated log tables to track the outcome of the ETL steps.

Wrote complex SQL queries to create, join and aggregate tables for their analysis and visualization.

Delivered results weekly to stakeholders and drove daily meetings with the offshore team to track progress.

Deployed Databricks clusters for development, test, and production in an automated way.

Ran Spark jobs using PySpark and Scala for data processing.

Travelers Insurance Hartford, CT

HADOOP DATA ENGINEER May 2018 – October 2019

Configured Linux on multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration.

Created a pipeline to gather data using PySpark, Kafka and HBase.

Sent requests to the source REST-based API from a Scala script via a Kafka producer.

Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.

Hands-on experience with Spark Core, SparkSession, Spark SQL, and the DataFrame/Dataset/RDD APIs.

Utilized Spark jobs, Spark SQL, and the DataFrame API to load structured data into Spark clusters.

Created a Kafka broker that uses the schema to feed structured data into Structured Streaming.

Created ETL pipelines using Spark as the processing tool for different data sources.

Defined Spark data schema and set up development environment inside the cluster.

Handled spark-submit jobs across all environments.

Interacted with data residing in HDFS using PySpark to process the data.
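A brief PySpark sketch of interacting with data resident in HDFS; the HDFS paths and column names are hypothetical.

```python
# Hypothetical PySpark sketch: process data resident in HDFS. Paths and fields
# are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hdfs_processing").getOrCreate()

claims = spark.read.json("hdfs:///data/raw/claims/")

# Keep only open claims and the columns downstream jobs need.
open_claims = (
    claims
    .filter(F.col("status") == "OPEN")
    .select("claim_id", "policy_id", "amount")
)

open_claims.write.mode("overwrite").parquet("hdfs:///data/processed/open_claims/")
```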

Decoded the raw data and loaded it into JSON before sending the batched streaming file over the Kafka producer.

Received the JSON response in a Kafka consumer written in Python.

Used Scala for data processing applications of various kinds.

Formatted the response into a data frame, parsing the JSON with a schema containing News Type, Article Type, Word Count, and News Snippet.
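A small Python sketch of the consumer-to-data-frame step, using kafka-python and pandas; the topic name, broker address, and batch size are assumptions.

```python
# Hypothetical sketch of a Python Kafka consumer: receive JSON messages and
# shape them into a data frame. Topic and broker address are illustrative.
import json

import pandas as pd
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "news-articles",                     # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

columns = ["News Type", "Article Type", "Word Count", "News Snippet"]
records = []

for message in consumer:
    payload = message.value
    records.append({col: payload.get(col) for col in columns})
    if len(records) >= 100:  # stop after a small batch for this sketch
        break

frame = pd.DataFrame(records, columns=columns)
print(frame.head())
```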

Established a connection between HBase and Spark to transfer the newly populated data frame.

Designed Spark Scala job to consume information from S3 Buckets.

Monitored background operations in Hortonworks Ambari.

Monitored HDFS job status and the health of the DataNodes according to the specs.

Managed Zookeeper configurations and ZNodes to ensure High Availability on the Hadoop Cluster.

Managed Hive connections to tables, databases, and external tables.

Set up ELK collections in all environments and configured shard replication.

Created standardized documents for company-wide usage.

Worked one-on-one with clients to resolve issues regarding Spark job submissions.

Worked on AWS Kinesis for processing huge amounts of real-time data.

Developed multiple Spark Streaming and batch Spark jobs using Java, Scala, and Python on AWS.

Implemented Hortonworks medium- and low-priority recommendations across all environments.

Spotify Chicago, IL

HADOOP DATA ENGINEER November 2016 – April 2018

Involved in the creation of Hive tables, loading data from different sources, and performing Hive queries.

Implemented data queries and transformations on Hive/SQL tables by using the Spark SQL and Spark DataFrames APIs.

Worked across different teams, reviewing Scala code and collaborating on creating Spark Scala applications.

Created Spark Streaming applications as well as Spark Batch processing jobs with Scala.

Worked on importing and exporting data between the HDFS and relational databases.

Created UDFs using Scala and used them in Spark programs.

Created a pipeline to gather new music releases of a country for a given week using PySpark, Kafka and Hive.
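A hypothetical PySpark sketch of a pipeline of this shape: read JSON records from Kafka, apply a schema, and store them in a Hive table. The broker, topic, field names, and table name are assumptions.

```python
# Hypothetical PySpark sketch of a new-release pipeline: read JSON records from
# Kafka, apply a schema, and store them in a Hive table. Names are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = (
    SparkSession.builder
    .appName("new_releases")
    .enableHiveSupport()
    .getOrCreate()
)

schema = StructType([
    StructField("country_code", StringType()),
    StructField("artist_name", StringType()),
    StructField("number_of_plays", IntegerType()),
    StructField("genre", StringType()),
])

releases = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "new-releases")                   # assumed topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("release"))
    .select("release.*")
)

# Append the parsed releases to a Hive table for weekly trend queries.
releases.write.mode("append").saveAsTable("music.new_releases")
```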

Sent requests to the Spotify REST-based API from a Python script via a Kafka producer.

Collected log data from various sources and integrated it into HDFS using Flume, staging the data in HDFS for further analysis.

Received the JSON response in a Kafka consumer Python file and formatted the response into a data frame, parsing the JSON with a schema containing country code, artist name, number of plays, and genre.

Established a connection between Hive and Spark for the transfer of the newly populated data frame.

Extracted metadata from Hive tables with HiveQL.

Utilized a cluster of three Kafka brokers to handle replication needs and allow for fault tolerance.

Stored the data pulled from the API in Apache Hive on the Hortonworks Sandbox.

Utilized HiveQL to query the data to discover week-over-week music release trends.

Assisted in installing and configuring Hive, Sqoop, Flume, and Oozie on the Hadoop cluster with the latest patches.

Loaded ingested data into Hive managed and external tables.

Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL).

Performed upgrades, patches, and bug fixes in Hadoop in a cluster environment.

Wrote shell scripts to automate workflows that pull data from various databases into the Hadoop framework so users can access the data through Hive-based views.

Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language.

Built Hive views on top of the source data tables and implemented secured provisioning.

Used Cloudera Manager for installation and management of single-node and multi-node Hadoop cluster.

Wrote shell scripts for automating the process of data loading.

KeyCorp Cleveland, OH

AWS Cloud DATA ENGINEER September 2014 – October 2016

Created and managed cloud VMs with AWS EC2 command-line clients and the AWS Management Console.

Used Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.

Added support for Amazon S3 and RDS to host static/media files and the database in the Amazon cloud.

Used Ansible Python scripts to generate inventory and push deployments to AWS instances.

Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3), together with AWS Redshift.

Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3.

Populated database tables via AWS Kinesis Firehose and AWS Redshift.
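A short boto3 sketch of pushing records onto a Kinesis Data Firehose delivery stream that loads into Redshift; the stream name, region, and record payload are assumptions.

```python
# Hypothetical boto3 sketch: put a record onto a Kinesis Data Firehose delivery
# stream that loads into Redshift. Stream name and payload are illustrative.
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

record = {"account_id": 12345, "balance": 250.75, "event": "deposit"}

firehose.put_record(
    DeliveryStreamName="transactions-to-redshift",  # assumed stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```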

Automated the installation of the ELK agent (Filebeat) with an Ansible playbook.

Developed a Kafka queue system to collect log data without data loss and publish it to various sources.

Applied AWS CloudFormation templates alongside Terraform with existing plugins.

Developed AWS CloudFormation templates to create custom infrastructure for our pipeline.

Implemented AWS IAM user roles and policies to authenticate and control access.

Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.

DXC Technology Tysons, VA

BIG DATA ENGINEER April 2012 – August 2014

Wrote shell scripts to automate data ingestion tasks.

Used Cron jobs to schedule the execution of data processing scripts.

Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.

Used AWS EMR to process big data across Hadoop clusters of virtual servers backed by Amazon Simple Storage Service (S3).

Worked with Spark using Scala and Spark SQL for faster processing of the data.

Built Spark Streaming pipelines with ingestion tools such as Kafka and Flume.

Automated AWS components such as EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.

Implemented AWS-provided security measures, employing key concepts of AWS Identity and Access Management (IAM).

Migrated complex MapReduce scripts to Apache Spark RDD code.
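A small PySpark RDD sketch of the kind of MapReduce-to-Spark migration this involves, shown here as a classic key-count aggregation; the input path and record layout are assumptions.

```python
# Hypothetical sketch of a MapReduce-style aggregation rewritten as Spark RDD code.
# Input path and record layout (tab-separated user_id, url) are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mr_to_rdd").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/logs/access/")

# map: emit (url, 1); reduce: sum counts per url -- the classic MapReduce pattern.
url_counts = (
    lines
    .map(lambda line: line.split("\t"))
    .filter(lambda fields: len(fields) >= 2)
    .map(lambda fields: (fields[1], 1))
    .reduceByKey(lambda a, b: a + b)
)

url_counts.saveAsTextFile("hdfs:///data/curated/url_counts/")
```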

Designed and developed ETL workflows using Scala and Python for processing structured and unstructured data from the HDFS.

Developed data transformation pipelines using Spark RDDs and Spark SQL.

Created multiple batch Spark jobs using Java.

Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications.

Developed metrics, attributes, filters, reports, dashboards and created advanced chart types, visualizations, and complex calculations to manipulate the data.

Implemented a Hadoop cluster and different processing tools, including Spark and MapReduce.

Pushed containers into AWS ECS.

Used Scala to connect to EC2 and push files to AWS S3.

Installed, configured, and managed monitoring tools such as ELK and CloudWatch for resource monitoring.

Carnival Cruise Lines Miami, FL

DATA ENGINEER February 2011 – April 2012

Developed time series forecasting models in Python using statsmodels and TensorFlow to forecast demand for cruise ships.
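A minimal statsmodels sketch of a seasonal demand forecast of this kind; the series is synthetic and the model orders are illustrative assumptions.

```python
# Hypothetical statsmodels sketch of a seasonal demand forecast. The series is
# synthetic and the SARIMAX orders are illustrative only.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly demand series with trend and yearly seasonality.
index = pd.date_range("2008-01-01", periods=36, freq="MS")
demand = pd.Series(
    1000 + 10 * np.arange(36) + 100 * np.sin(2 * np.pi * np.arange(36) / 12),
    index=index,
)

model = SARIMAX(demand, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fitted = model.fit(disp=False)

# Forecast demand for the next 12 months.
forecast = fitted.forecast(steps=12)
print(forecast)
```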

Helped balance the load of ports by forecasting demand.

Incorporated data mined and scraped from outside sources.

Worked with the DevOps team to integrate solutions into their software.

Enhanced data collection procedures to include information that is relevant for building analytic systems.

Performed ad-hoc analysis and presented results in a clear manner.

Created machine learning algorithms using Scikit-learn and Pandas.

Built predictive models to forecast demand.
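A scikit-learn sketch of a demand-prediction model built on Pandas features; the feature names and synthetic data are assumptions for illustration only.

```python
# Hypothetical scikit-learn sketch of a demand-prediction model built with
# Pandas features. Feature names and the synthetic data are illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "month": rng.integers(1, 13, size=500),
    "port_capacity": rng.integers(1000, 5000, size=500),
    "prior_bookings": rng.integers(0, 3000, size=500),
})
# Synthetic target: demand driven by bookings, capacity, and seasonality.
data["demand"] = (
    0.6 * data["prior_bookings"] + 0.1 * data["port_capacity"]
    + 50 * np.sin(2 * np.pi * data["month"] / 12) + rng.normal(0, 25, size=500)
)

X_train, X_test, y_train, y_test = train_test_split(
    data[["month", "port_capacity", "prior_bookings"]], data["demand"], random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```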

Hands-on use of commercial data mining tools built in R and Python.

Processed, cleansed, and verified the integrity of data used for analysis.

Solved analytical problems and effectively communicated methodologies and results.

Generalized feature extraction in the machine learning pipeline, which improved efficiency throughout the system.

Education

Bachelor of Science in Biomedical Engineering – National Polytechnic Institute of Mexico

Master’s Degree in Computer Science – National Polytechnic Institute of Mexico

Certifications

6.431x Probability: The Science of Uncertainty and Data – MITx edX

IBM: Hadoop and Big Data

IBM: Apache Spark


