Hadoop Big Data Engineer

Location: Houston, TX
Posted: June 26, 2023

JOHN SHIA

Phone: 832-***-****/ Email: adxxlb@r.postjobfree.com

CLOUD SR. DATA ENGINEER

TECHNICAL SKILLS, STRENGTHS

Apache: Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce

Scripting: HiveQL, Python, UNIX/Linux shell scripting, XML, FTP, MapReduce

Operating Systems: Unix/Linux, Windows 10, Ubuntu, macOS

File Formats: Parquet, Avro, JSON, ORC, text, CSV, flat files

Distributions: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK)

Data Processing/Compute Engines: Apache Spark, Spark Streaming, Flink, Storm

Data Visualization Tools: Pentaho, QlikView, Tableau, Power BI, Matplotlib

Databases/Data Warehouses: Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, DynamoDB, Snowflake, Apache HBase, Apache Hive, MongoDB

Cloud: AWS, Azure, GCP

•Data Engineer with 10 years of IT experience, including 8+ consecutive years focused on Big Data/Hadoop. Experienced in building custom data pipelines in both R&D and production environments.

•Accustomed to working on production migrations, installations, and development.

•Experience working with large-scale distributed systems.

•Expertise across three cloud providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), leveraging their distinct capabilities and services.

•Expert in consulting on appropriate naming conventions for data warehouses, data marts, tables, schemas, etc.

•Proficient in optimizing costs across multiple cloud providers and Snowflake pricing models by selecting the most cost-effective services, utilizing reserved instances, and dynamically scaling resources based on demand.

•Skilled in HDFS, Spark, Hive, Sqoop, HBase, Flume, Oozie, and Zookeeper.

•Add value to Agile/Scrum processes such as Sprint Planning, Backlog Grooming, Sprint Retrospectives, and Requirements Gathering, and provide planning and documentation for projects.

•Ensure projects are on track with stakeholder specifications.

•Write SQL queries, stored procedures, triggers, cursors, and packages.

•Apply in-depth understanding/knowledge of Hadoop architectures and various components such as HDFS, MapReduce, Spark, and Hive.

•Work with Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).

•Create Spark Core ETL processes to automate using a workflow scheduler.

•Use Apache Hadoop to work with Big Data and analyze large data sets efficiently.

•Hands-on experience with ecosystem tools such as Hive, Sqoop, MapReduce, Flume, and Oozie.

•Strong knowledge about Hive's analytical functions; extend Hive functionality by writing custom UDFs.

•Write SQL, PL/SQL for creating tables, views, indexes, stored procedures, and functions.

•Hands-on experience developing PL/SQL procedures and functions and tuning SQL for large databases.

•Track record of results in an Agile methodology using data-driven analytics.

•Experience importing and exporting terabytes of data between HDFS and Relational Database Systems using Sqoop.

•Load and transform large sets of structured, semi-structured, and unstructured data, working with data on Amazon Redshift, Apache Cassandra, and HDFS in a Hadoop Data Lake.

•Experience handling XML files as well as Avro and Parquet SerDes.

•Performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle.

•Skilled with BI tools like Tableau and Power BI; strong in data interpretation, modeling, data analysis, and reporting, with the ability to assist in directing planning based on insights.

IT background in signal processing.

Areas of expertise include:

•Hadoop

•Spark

•Python

•Scala

•PySpark

•Data Lake

•Data Warehousing (Redshift, Snowflake, BigQuery)

•Hive

•HBase

•SQL/NoSQL

•Cloud (AWS, Azure, GCP)

•MySQL

Strengths: Skilled in managing data analytics, data processing, database, and data-driven projects; skilled in the architecture of Big Data systems, ETL pipelines, and analytics systems for diverse end users; skilled in database systems and administration; proficient in writing technical reports and documentation; adept with various distributions such as Cloudera Hadoop, Hortonworks, Cloud (AWS, Azure, GCP), and Elastic Cloud/Elasticsearch; expert in bucketing and partitioning and in performance optimization (a brief partitioning/bucketing sketch follows).
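
Illustrative sketch (not production code) of the partitioning and bucketing approach referenced above, written in PySpark; the input path, database, table, and column names are hypothetical placeholders, and a Hive metastore is assumed to be available.

```python
# Minimal PySpark sketch: write a Hive table partitioned by a low-cardinality
# column and bucketed by a join/filter key. All names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partitioning-bucketing-sketch")
    .enableHiveSupport()          # assumes a Hive metastore is configured
    .getOrCreate()
)

orders = spark.read.parquet("hdfs:///staging/sales_orders")  # placeholder input path

(
    orders.write
    .partitionBy("order_date")        # partition pruning on date filters
    .bucketBy(32, "customer_id")      # bucketing on a join key reduces shuffles
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("analytics.sales_orders")  # placeholder database.table
)
```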

PROFESSIONAL EXPERIENCE

WESTLAKE CORPORATION, HOUSTON, TEXAS

Sr. Data Engineer

September 2022 – Current

Project Summary: Involved in a wide variety of projects under the Digital Office, whose purpose was to build Big Data infrastructure for Westlake and improve profitability through analytics.

•Designed an end-to-end data engineering solution for processing and analyzing IoT sensor data in real-time using Apache Kafka and Spark Streaming.

•Implemented Kafka producers to ingest the generated sensor data streams into the Kafka topics for real-time processing and analysis.

•Set up an AWS infrastructure using Terraform to provision the necessary resources, such as Amazon Kinesis Data Streams, Amazon EMR (Elastic MapReduce), and Amazon S3 (Simple Storage Service), for ingesting, processing, and storing the IoT sensor data.

•Utilized Spark Streaming to consume the sensor data from Kafka, perform data transformations, filtering, aggregations, and enrichment operations to derive meaningful insights.

•Integrated machine learning models or advanced analytics algorithms with Spark to detect anomalies, predict future sensor readings, or perform pattern recognition on the IoT data.

•Designed historical and streaming pipeline architecture for IP21 data for sensors installed on chemical reactor for predictive analytics involving unwanted chemical adhesion on reactor walls: IP21, Aspentech Cloud Connect, Aspentech SQL+, S3, Snowflake SnowPipe, AWS Lambda, Snowflake connectors.

•Used dbt (Data Build Tool) to remap various existing table names (prefixes) used in the visualization of rail cars for chlorine shipment transfer.

•Ported Snowflake SQL linear interpolate routines and documented the internals, including data flow.

•Validated/cross-validated data and data attributes from various sources using Aspentech SQL+, Aspentech Cloud Connect, Snowflake, SeeQ, etc. and bash tools/scripts.

•Tested feasibility of DBT install on various work and development laptops, including connecting to Snowflake.

•Wrote a custom AWS Lambda in Python, including supporting Lambda layers, to filter OSI PI IoT data for parametrized tags, which were output to a custom Power BI URL in real time; validated using SeeQ, AWS Lambda logs, and Power BI visualization.

•Deployed and monitored the entire data engineering solution using Terraform, enabling infrastructure as code and facilitating reproducibility and scalability of the project.

•Performed feasibility studies and documentation for the SQL+ install, PLQ historical data pull via SQL+, data engineering project management for the Plaquemine plant, early parallelization of data engineering tasks, IP21 historical load, SAP PM historical load, structure for IP21 streaming, structure for SAP PM streaming, structure for Waites Wireless streaming, etc.

•Investigated, documented, and proposed various testing and data engineering pipeline schemes for Waites Wireless rugged sensors for possible integration into current and future Westlake Snowflake-based data warehouses and AWS data lakes.

•At the request of the data science group, wrote a fast custom CSV upload routine in Python using a deque, replacing a previous implementation based on a pandas DataFrame, and benchmarked it against Vaex and Dask DataFrames (a brief sketch appears at the end of this section).

•Researched various streaming technologies and possible implementations involving Aspentech Cloud Connect, SQL+, MSK Kafka, AWS Kinesis, etc.

•Analyzed the pros and cons of Aspentech Cloud Connect, including its lack of AWS Kinesis support, and the architectural possibility of using a Glue/Spark job to pull IP21 data via SeeQ.

•Designed an alternative data engineering architectural scheme using the SQL+ driver and a Python connector to load 4 years of historical data.

•Troubleshot an off-by-one bug in the chlorine dashboard involving chlorine rail car transport data, going through dbt SQL and Snowflake code.

•Provided input and guidance to group responsible for implementing REST API interface for Power Steering project using Python and Django.
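
Illustrative sketch (not the actual Westlake routine) of the deque-based CSV batch loader described above; the file path, batch size, and the flush callback are hypothetical placeholders.

```python
# Minimal sketch of a deque-based CSV batch loader. Rows are streamed with
# csv.reader into a collections.deque and flushed in fixed-size batches,
# avoiding the memory overhead of materializing a full pandas DataFrame.
import csv
from collections import deque
from typing import Callable

def upload_csv(path: str, flush: Callable[[list], None], batch_size: int = 50_000) -> int:
    """Stream a CSV file and hand off rows in batches via the flush callback."""
    buffer: deque = deque()
    total = 0
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)                # skip header row if present
        for row in reader:
            buffer.append(row)
            if len(buffer) >= batch_size:
                flush(list(buffer))       # e.g. cursor.executemany(...) or a staging upload
                total += len(buffer)
                buffer.clear()
    if buffer:
        flush(list(buffer))
        total += len(buffer)
    return total

# Usage example (hypothetical): count rows without uploading anywhere.
# rows = upload_csv("sensors.csv", flush=lambda batch: None)
```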

CERBERUS CAPITAL MANAGEMENT, L.P., NEW YORK, NY

Sr. Big Data Engineer

February 2022 – September 2022

Worked with a Big Data team on a near-real-time and batch analytics global sensor pipeline for PQ Chemical Corp., with plants in the US and Europe. Applied the AspenTech IP21 protocol, Windows Server virtual machines, AWS Boto3, Python, pandas, AWS Kinesis, AWS Lambda, AWS EC2, Azure services (Azure Data Factory, Databricks, Data Lake Gen 2, Azure Functions, Azure SQL, Synapse Analytics), Snowflake, and TimescaleDB.

•Created a personal TimescaleDB account to troubleshoot PQ Chemical Corp. TimescaleDB interface errors, where Python TimescaleDB connection/query code behaved differently from direct access through DBeaver.

•Performed technical troubleshooting on:

•AWS connection and configuration issues, by creating a duplicate personal AWS account to test AWS Boto3 applications before running them on the PQ Chemical AWS account.

•AWS Kinesis connection and configuration issues, by creating a standalone AWS Boto3 consumer program that polled an AWS Kinesis Data Stream (KDS) for new data and wrote it to the TimescaleDB database (see the sketch at the end of this section).

•Created an IP21/AWS Kinesis producer that automatically scales from a single plant to a central server serving an entire country, region, or continent.

•Wrote a smart aggregation and batching option for the PySpark producer code.

•Wrote payload decomposition code for the producer in case the sensor data exceeds the size limit of the AWS Kinesis data buffer.

•Wrote a custom IP21/Python/Boto3/AWS Kinesis Data Stream producer and a custom Python/Boto3/TimescaleDB consumer that reads data from IP21 sensors and writes the time-series data to TimescaleDB.

•Created custom AWS Lambda Consumer for AWS Kinesis Data Stream.

•Configured and installed UAT (User Acceptance Testing) pipeline for PQ Chemical plants from US central server.

•Troubleshot various UAT (User Acceptance Testing) installation and configuration issues such as heap memory corruption causing stack parameter passing issues.

•Created custom AWS Lambda Layers for Python packages that are not commonly used.

•Wrote various PostgreSQL (psql) scripts to load historical US and EU central server data from EC2 into TimescaleDB using the Postgres COPY command, after batch decomposition from GZIP compression and merging (from thousands to tens of thousands of sensor tags).

•Created and maintained the Azure DevOps code repository.

•Designed and implemented a data pipeline using Apache Spark for batch processing of large volumes of data.

•Utilized Azure services such as Azure Data Lake Storage, Synapse, and Azure Databricks for scalable storage and distributed processing capabilities.

•Integrated Kafka as a message queue to handle real-time data ingestion and streaming.

•Developed Spark jobs to read data from Kafka topics, transform it, and load it into the desired target system.

•Utilized Terraform to provision and manage the necessary infrastructure components, such as virtual machines, storage accounts, and networking resources in Azure (Synapse, Databricks, Data Lake, Data factory, Azure SQL).

•Designed and implemented fault-tolerant and scalable architectures using Spark clusters and Azure services.

•Implemented data quality checks and validation mechanisms within the Spark jobs to ensure data integrity and consistency.

•Optimized Spark jobs for performance by tuning Spark configurations, leveraging data partitioning techniques, and utilizing cluster resources efficiently.

•Implemented monitoring and logging mechanisms to track the progress of Spark jobs, identify performance bottlenecks, and troubleshoot issues.
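
Illustrative sketch (not the PQ Chemical production consumer) of a standalone Boto3 Kinesis consumer writing to TimescaleDB, as referenced above; the stream name, connection string, table, and JSON payload layout are assumptions for illustration only.

```python
# Minimal sketch: poll a Kinesis Data Stream and insert readings into a
# TimescaleDB (PostgreSQL) table. Names and payload shape are placeholders.
import json
import time

import boto3
import psycopg2

STREAM = "iot-sensor-stream"                   # hypothetical stream name
DSN = "postgresql://user:pass@host:5432/iot"   # hypothetical TimescaleDB DSN

def run() -> None:
    kinesis = boto3.client("kinesis")
    conn = psycopg2.connect(DSN)
    shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]

    while True:
        out = kinesis.get_records(ShardIterator=iterator, Limit=500)
        iterator = out["NextShardIterator"]
        rows = []
        for record in out["Records"]:
            payload = json.loads(record["Data"])  # assumed JSON: {"ts": ..., "tag": ..., "value": ...}
            rows.append((payload["ts"], payload["tag"], payload["value"]))
        if rows:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO sensor_readings (time, tag, value) VALUES (%s, %s, %s)",
                    rows,
                )
            conn.commit()
        time.sleep(1)  # simple poll interval; production code would checkpoint and handle reshards

if __name__ == "__main__":
    run()
```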

ANTHEM INC., Atlanta, GA

Sr. Big Data Engineer

October 2020 – February 2022

Worked on productionization of existing and new healthcare data models in the areas of Medicare, Medicaid, and commercial healthcare data.

•Designed and implemented a data ingestion pipeline to capture diverse data sources.

•Utilized Google Cloud Platform (GCP) services such as Pub/Sub and Cloud Storage for reliable and scalable data ingestion.

•Leveraged Spark for data preprocessing and transformation tasks.

•Developed Spark-based data processing workflows to clean, validate, and transform raw data.

•Utilized Spark Streaming for real-time data processing requirements.

•Implemented custom data transformations and aggregations using Python.

•Built batch processing pipelines to handle large volumes of data efficiently.

•Utilized Apache Spark's batch processing capabilities for parallel processing and optimized resource utilization.

•Employed Spark's DataFrame and SQL APIs for data querying and manipulation.

•Utilized Snowflake, a cloud-based data warehousing solution, for storing and querying large datasets efficiently.

•Designed and implemented Snowflake schemas and tables to support the analytical needs of Anthem Inc.

•Ensured data integrity, security, and compliance within the Snowflake environment.

•Used Terraform to define and provision the infrastructure required for the data engineering project.

•Created and managed GCP resources such as virtual machines, storage buckets, and networking components using Terraform's declarative syntax.

•Automated the infrastructure provisioning process for easy replication and scalability.

•Implemented workflow orchestration using Apache Airflow and Google Cloud Composer.

•Documented the project architecture, design decisions, and Google data flow diagrams.

•Fine-tuned Spark configurations, parallelism settings, and resource allocation to achieve optimal processing speeds.

•Implemented caching and partitioning strategies to improve query performance in Snowflake.

•Utilized GCP services (Dataproc, Dataflow, BigQuery, Cloud Storage) auto-scaling capabilities and cluster management features to dynamically allocate resources based on the workload.

•Converted SAS code into HIVE SQL for Medicaid healthcare model.

•Operationalized SAS/TERADATA/Python NICU Medicaid healthcare model.

•Wrote a SQL generator in a Bash script that takes a text list of features to analyze as input and generates the SQL queries required for analysis.

•Analyzed and compared healthcare model output in development vs production node using Spark, Bash, HIVE, and Tableau.

•Debugged, maintained, and enhanced Disenrollment model written in Teradata, HIVE SQL, PySpark, Python, and Bash script.

•Wrote and debugged Airflow scripts for scheduling various models (a DAG sketch appears at the end of this section).

•Wrote a Bash script to generate Hive SQL queries on the fly to pivot and perform comparative analysis of key columns in source tables for the Disenrollment model.

•Analyzed over 10,000 features using PySpark in PySpark shell.

•Regularly worked with healthcare tables that had millions to billions of rows in Teradata and HIVE.

•Created and managed various system and bash Environmental Variables to simplify code development and management in SQL, Bash, and Airflow scripts.

•Analyzed SAS CSV output data file comprising millions of rows and thousands of columns using Spark, HDFS, bash, PySpark, and HIVE SQL.
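
Illustrative sketch (not Anthem's actual DAGs) of an Airflow 2-style DAG for scheduling a model run, as referenced above; the DAG id, schedule, and shell commands are placeholders.

```python
# Minimal Airflow DAG sketch for scheduling a daily model run.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="disenrollment_model_daily",      # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",           # run daily at 06:00
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_features",
        bash_command="bash /opt/models/disenroll/extract_features.sh",  # placeholder script
    )
    score = BashOperator(
        task_id="score_model",
        bash_command="spark-submit /opt/models/disenroll/score.py",     # placeholder PySpark job
    )
    publish = BashOperator(
        task_id="publish_results",
        bash_command="bash /opt/models/disenroll/publish.sh",           # placeholder script
    )

    extract >> score >> publish
```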

BLOCKSTREAM, San Francisco, CA

Hadoop Data Engineer

January 2019 – October 2020

•Wrote scripts to automate workflow execution and the loading and parsing of data into different database systems

•Deployed Hadoop to create data pipelines involving HDFS

•Managed multiple Hadoop clusters to support data warehousing

•Created and maintained Hadoop HDFS, Spark, HBase pipelines

•Supported data science team with data wrangling, data modeling tasks

•Built a continuous Spark Streaming ETL pipeline with Kafka, Spark, Scala, HDFS, and MongoDB (a PySpark sketch of a similar pipeline appears at the end of this section).

•Scheduled and executed workflows in Oozie to run Hive

•Worked on importing unstructured data into HDFS using Spark Streaming and Kafka.

•Worked on installing clusters, commissioning and decommissioning data nodes, configuring slots, and ensuring NameNode availability with regard to capacity planning and management

•Implemented Spark using Scala and Spark SQL for optimized analysis and processing of data.

•Configured Fair Scheduler to allocate resources to all the applications across the cluster.

•Configured Spark Streaming to process real-time data via Kafka and store it in HDFS using Scala.

•Loaded data from multiple servers to AWS S3 bucket and configured bucket permissions.

•Used Apache Kafka to transform live streaming with batch processing to generate reports

•Implemented Partitioning, Dynamic Partitions, and Buckets in HIVE for performance optimization and organized data in logical grouping.

•Maintained and performed Hadoop log file analysis of error and access statistics for fine-tuning

•Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.

•Used Impala for faster data analysis on HDFS data over Hive when fault tolerance was not required, and data was in text format
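
Illustrative sketch of a Kafka-to-HDFS streaming leg similar to the pipeline referenced above, written here in PySpark Structured Streaming rather than the original Scala; the broker, topic, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Minimal sketch: read a Kafka topic as a stream and persist it to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "sensor-events")               # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to string before persisting.
events = raw.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    col("timestamp"),
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/sensor_events")            # placeholder sink path
    .option("checkpointLocation", "hdfs:///checkpoints/sensor_events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```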

UNITED AIRLINES, Chicago, IL

Data Engineer

February 2017 – January 2019

•Used Amazon Kinesis to stream weblogs

•Implemented Kinesis, EMR pipeline for customer engagement database

•Assisted in designing, building, and maintaining a database to analyze the life cycle of checking transactions

•Analyzed Hadoop clusters in Hadoop and AWS ecosystems

•Used AWS Kinesis to process data and load into AWS RDS MySQL database and AWS S3.

•Used AWS Redshift clusters as a data warehouse solution synced to the AWS pipeline, and used AWS RDS to store the data for retrieval by dashboards for visualization.

•Designed and developed ETL jobs to extract data from AWS S3 and load it into a data mart in Amazon Redshift (a COPY-based sketch appears at the end of this section).

•Experienced in working with multiple Terabytes of data stored in AWS S3 data lake

•Set up and configured multi-node Kafka clusters for data ingestion from APIs

•Implemented and maintained EMR, Redshift pipeline for data warehousing

•Partitioned topics from various sources across Kafka clusters and used Spark engine to process data

•Captured transactional data for early fraud detection by analyzing consumer habits

•Generated early warnings for irregular transactions on the data platform

•Used Spark engine, Spark SQL for data analysis and provided intermediate results to data scientists for transactional analytics and predictive analytics

•Supported data analysts with possible online payment data corruption detection using Tableau visualization

•Used Spark SQL to generate customer insights for decision analysis by business analysts

•Worked closely with the data science team in building Spark MLlib applications for various predictive models

•Utilized AWS CloudFormation to create templates for database development.

•Migrated data from HortonWorks cluster to AWS EMR cluster.

•Configured AWS IAM and Security Group as required and distributed them as groups into various availability zones of the VPC.
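
Illustrative sketch (not the production ETL job) of an S3-to-Redshift data mart load using the COPY command, as referenced above; the cluster endpoint, credentials, IAM role, S3 prefix, and table name are placeholders.

```python
# Minimal sketch: run a Redshift COPY from S3 through psycopg2, which works
# against Redshift because it speaks the PostgreSQL wire protocol.
import psycopg2

REDSHIFT_DSN = (
    "dbname=analytics host=example-cluster.abc123.us-east-1.redshift.amazonaws.com "
    "port=5439 user=etl password=***"
)

COPY_SQL = """
    COPY mart.customer_engagement
    FROM 's3://example-bucket/engagement/2019/01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

def load() -> None:
    # The connection context manager commits the transaction on successful exit.
    with psycopg2.connect(REDSHIFT_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(COPY_SQL)

if __name__ == "__main__":
    load()
```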

KROGER, Cincinnati, OH

Hadoop Data Engineer

June 2015 – February 2017

•Created and maintained IoT sensor pipeline using Apache Kafka, Spark, Hive

•Developed a data pipeline for extracting historic flood information from online sources

•Used Python for web scraping of relevant articles of most recent floods

•Used Python to make requests to news source Web APIs as well as social media APIs such as Facebook and Twitter

•Worked with Apache Spark for large data processing integrated with functional programming language Scala for fast, distributed processing

•Worked with both batch and real-time processing with Spark frameworks.

•Created a Kafka producer to connect to different external sources and push data to a Kafka broker.

•Stored unprocessed JSON and HTML files in HDFS data lake

•Processed IoT data lakes on Amazon using EMR

•Worked with analytics team to provide querying insights and helped develop methods to map informative text more efficiently

•Created and maintained Python PySpark scripts for Hive interface

•Optimized tables and schemas for use in machine learning models to add more value to the analytic models

•Supported analytics team with Hive data format and Spark internals

•Installed entire Hadoop ecosystem on new servers including Hadoop, Spark, PySpark, Kafka, HortonWorks, Hive, Cassandra, etc.

•Retrieved structured and unstructured data from HDFS and MySQL to Spark to perform MapReduce jobs

•Implemented advanced procedures such as text analytics and processing using in-memory computing capabilities through Scala scripts running on Apache Spark

•Used SparkContext and SparkSession to process text files by flat-mapping, mapping to RDDs, and reducing RDDs by key to identify text containing key metrics, as sketched below
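
Illustrative sketch of the SparkContext flatMap/map/reduceByKey pattern referenced in the last bullet; the input path and the metric keywords are hypothetical.

```python
# Minimal sketch: flat-map lines to tokens, keep key-metric terms, and count per key.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("key-metric-counts").getOrCreate()
sc = spark.sparkContext

KEY_METRICS = {"flood", "outage", "recall"}   # hypothetical terms of interest

counts = (
    sc.textFile("hdfs:///data/articles/*.txt")           # placeholder input path
    .flatMap(lambda line: line.lower().split())          # lines -> tokens
    .filter(lambda word: word in KEY_METRICS)             # keep only key-metric terms
    .map(lambda word: (word, 1))                          # token -> (key, 1)
    .reduceByKey(lambda a, b: a + b)                      # sum per key
)

for word, count in counts.collect():
    print(word, count)
```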

STATE FARM INSURANCE, BLOOMINGTON, IL

Hadoop Data Engineer

October 2013 – June 2015

•Optimized Spark programs for improved execution speed

•Created and maintained PySpark scripts to connect to MySQL via the JDBC driver (see the sketch at the end of this section)

•Installed and configured Tableau Desktop on one of the three nodes to connect to the HortonWorks Hive Framework (database) through HortonWorks ODBC connector for analytics on cluster.

•Implemented Twitter/Flume/Spark/MySQL pipeline for market analysis

•Maintained MySQL data warehouse by pushing data through Spark pipeline

•Enabled security on the cluster using Kerberos and integrated clusters with LDAP at the enterprise level.

•Implemented user access with Kerberos and cluster security using Ranger.

•Assisted in installing Hive, Pig, Sqoop, Flume, Oozie, and HBase on the Hadoop cluster with the latest patches.

•Configured Hive, Sqoop, Flume, Oozie and HBase on the Hadoop cluster with latest patches.

•Created and maintained Flume configuration files

•Supported internal analytics team with data cleansing for purpose of sentiment analysis

•Used the Ambari stack to manage Big Data clusters, and performed upgrades for the Ambari stack, Elasticsearch, etc.

•Implemented cluster co-ordination services through Zookeeper.

•Implemented the Capacity Scheduler on the YARN Resource Manager to share cluster resources among user jobs

•Created Oozie workflow for various tasks such as Similarity matching and consolidation
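
Illustrative sketch of a PySpark-to-MySQL read over JDBC, as referenced above; the host, database, table, and credentials are placeholders, and the MySQL Connector/J JAR is assumed to be available via --jars or the cluster classpath.

```python
# Minimal sketch: read a MySQL table into a Spark DataFrame over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-jdbc-sketch").getOrCreate()

policies = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/warehouse")   # placeholder endpoint
    .option("dbtable", "policy_snapshots")                  # placeholder table
    .option("user", "etl_user")
    .option("password", "***")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

policies.printSchema()
print(policies.count())
```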

EDUCATION & CERTIFICATIONS

EDUCATION

B.A. Computer & Information Sciences

University of California, Santa Cruz

MBA in Finance

University of North Texas

CERTIFICATIONS

•Scala Programming for Data Science, Level I (IBM)

•Introduction to Cloud (IBM Developer Skills Network)

•Capstone: Retrieving, Processing, and Visualizing Data with Python (Coursera)

•Using Databases with Python (Coursera)

•IBM Hadoop

•IBM Big Data

•Certified as AWS Cloud Practitioner in 2020

•Certified as Associate AWS Solutions Architect in 2020

•Certified as Associate AWS Developer in 2021


