Data Engineer

Location: United States

Posted: February 03, 2021

Resume:

HUAZHI FANG

DATA ENGINEER

***********@*****.***

PHONE: 669-***-****

CERTIFICATIONS

Data Analytics & Data Science, Digi-Safari & Tredence Inc.

Hadoop, IBM

Big Data, IBM

Spark Fundamentals I, IBM

WORK HISTORY

YAHOO Data Engineer

Sunnyvale, CA September 2019 – Current

Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.

Worked on AWS Kinesis Consumer/Producer library for processing real-time data

Configured StreamSets to store converted data in SQL Server using JDBC drivers.

Created Hive external tables and designed data models in Apache Hive.

Optimized storage in Hive using partitioning and bucketing on managed and external tables.

Used Spark SQL and the DataFrames API to process structured and semi-structured data on Spark clusters.

Worked extensively on CI/CD pipelines for code deployment using Git, Jenkins, and CodePipeline.

Used Cloudera Manager for installation and management of a multi-node Hadoop cluster

Worked on Cassandra design and data modeling.

Created, configured, and monitored shard sets.

Analyzed data governance and distribution, choosing a shard key to distribute data evenly.

Worked with Spark Core, Spark SQL, DataFrames, and pair RDDs.

Enforced YARN resource pools to share cluster resources among YARN jobs submitted by users.

Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Used Spark to export transformed streaming datasets into Redshift on AWS cloud.

Created Lambda to process information from S3 to Spark for organized

Installed and configured Kafka cluster

Monitored the Kafka cluster and architected a lightweight Kafka broker

Worked on integration of Kafka with Spark for real-time data processing

Extracted the needed data from the server into Hadoop file system (HDFS) and bulk loaded the cleaned data into HBase using Spark.

Accessed Hadoop file system (HDFS) using Spark and managed data in Hadoop data lakes.

Worked with the Spark-SQL context to create data frames to filter input data for model execution.

Used the Spark DataFrame and Dataset APIs from Spark SQL extensively for data processing.

Used Spark SQL to process large volumes of structured data.

SPOTIFY DATA ENGINEER

New York, NY December 2017 – September 2019

Installed and configured Kafka producer to ingest data from REST API

Installed and configured Spark consumer to stream data from Kafka Producer

Used Spark to migrate the data to HBase.

Wrote queries, stored procedures, functions, and triggers in SQL.

Worked on support, development, testing, and coordination of teams during new system deployments.

Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL)

Configured and deployed production-ready multi-node Hadoop services (Hive, Sqoop, Flume, Oozie) on the Hadoop cluster with the latest patches

Created dynamic web interfaces using JavaScript, jQuery, HTML5, and CSS.

Created metadata and tested databases.

Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume; staging data in HDFS for further analysis.

Used Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters

Configured a multi-node cluster of 10 nodes and 30 brokers to consume high-volume, high-velocity data

Used Spark SQL to perform transformations and actions on data residing in Hive.

Used ZooKeeper for several kinds of centralized configuration, as well as for Kafka offset management

Created Hive tables, loaded data, and wrote Hive queries.

Imported and exported data into HDFS and Hive using Sqoop and Kafka.

Created partitions and buckets based on state for further processing using bucket-based Hive joins.

Built a prototype for real-time analysis using Spark Streaming and Kafka.

Implemented a Kafka messaging consumer

Used Flume and HiveQL scripts to extract, transform, and load the data into the database.

Loaded ingested data into Hive managed and external tables.

Used HiveQL to query data and discover week-to-week trends with SCD

Involved in creating Hive tables, loading data, and writing Hive queries

Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.

EBAY BIG DATA ENGINEER

Washington, DC June 2015 – December 2017

Installed and configured Spark consumer to stream data from Kafka Producer

Installed and configured Hive for data warehousing and HQL ETL.

Used Spark to migrate the data to Hive

Worked on creation of AWS EC2 instances, EMR, and Lambda applications.

Deployed a large knowledge base to handle Hadoop applications mistreatment

Used Amazon Redshift to store data warehouse (DWH) data in the cloud.

Performed maintenance, monitoring, deployments, and upgrades across infrastructure that supports all Hadoop clusters.

Used ZooKeeper and Oozie to coordinate the cluster and schedule workflows.

Wrote DML and DDL statements against tables stored in HDFS as external Hive tables.

Transformed log data into data models using Kibana

Wrote UDFs to format logs for ingestion.

Used HBase to store the majority of transactional data that required custom partitioning.

Used Spark to process ingested data from varied sources.

Created HBase tables to store data in variable formats coming from different portfolios.

Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark clusters.

Wrote shell scripts to load log files into the Hadoop cluster through automated processes.

Successfully loaded files to HDFS from MySQL using Spark.

AHOLD Data Engineer

Salisbury, NC October 2013 – June 2015

Installed and configured Hadoop cluster including HDFS, YARN and MapReduce.

Used Sqoop to migrate data from MySQL to Hive

Installed and configured Hive and wrote Hive UDFs.

Worked with different file formats and compression techniques to meet company standards.

Involved in loading data from the UNIX file system to HDFS.

Installed and configured MySQL server to allow remote user access on Ubuntu

Loaded large datasets from RDBMS sources into the big data platform using Sqoop

Accessed Hadoop cluster (CDM) and reviewed log files of all daemons.

Analyzed datasets using Hive, MapReduce, and Sqoop to recommend business improvements

Maintained and troubleshot network connectivity using Wireshark

Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.

Installed and configured Flume agent to ingest data from REST API

SKILLS

APACHE - Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Oozie, Apache Spark, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce

SCRIPTING - HiveQL, MapReduce, XML, Python, Pandas, R, JavaScript, HTML, CSS, PHP, UNIX, Shell scripting, LINUX

DATA PROCESSING (COMPUTE) ENGINES - Apache Spark, Spark Streaming, SparkSQL

DATA VISUALIZATION - Excel, Tableau, Spark GraphX

DATABASE - Microsoft SQL Server, Oracle, MySQL, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB

EDUCATION

BACHELOR Materials Engineering

University of Science and Technology Beijing

PH.D. Materials Science

University of Science and Technology Beijing
