HUAZHI FANG
DATA ENGINEER
***********@*****.***
PHONE: 669-***-****
CERTIFICATIONS
Data Analytics & Data Science, Digi-Safari & Tredence Inc.
Hadoop, IBM
Big Data, IBM
Spark Fundamentals I, IBM
WORK HISTORY
YAHOO Data Engineer
Sunnyvale, CA September 2019 – Current
Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce, HBase, and Hive.
Worked on AWS Kinesis Consumer/Producer library for processing real-time data
Configured StreamSets to store converted data in SQL Server using JDBC drivers.
Created Hive external tables and designed data models in Apache Hive.
Expertise in optimizing the storage in Hive using partitioning and bucketing mechanisms on managed and external tables.
Used Spark SQL and the DataFrames API to process structured and semi-structured data on Spark clusters.
Worked extensively on CI/CD pipelines for code deployment using Git, Jenkins, and CodePipeline.
Used Cloudera Manager for installation and management of a multi-node Hadoop cluster
Worked on Cassandra design and data modeling.
Created, configured, and monitored shard sets.
Performed analysis of data governance and distribution, choosing a shard key to distribute data evenly.
Worked with Spark core, SparkSQL, Data Frames, and Pair RDDs.
Enforced YARN resource pools to share cluster resources among YARN jobs submitted by users.
Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Used Spark to export transformed streaming datasets into Redshift on AWS cloud.
Created Lambda to process information from S3 to Spark for organized
Installed and configured Kafka cluster
Monitored the Kafka cluster; architected a lightweight Kafka broker
Worked on integration of Kafka with Spark for real-time data processing
Extracted the needed data from the server into Hadoop file system (HDFS) and bulk loaded the cleaned data into HBase using Spark.
Accessed Hadoop file system (HDFS) using Spark and managed data in Hadoop data lakes.
Worked with the Spark-SQL context to create data frames to filter input data for model execution.
Used the Spark DataFrame and Dataset APIs from Spark SQL extensively for data processing.
Used Spark SQL to process large volumes of structured data.
SPOTIFY Data Engineer
New York, NY December 2017 – September 2019
Installed and configured a Kafka producer to ingest data from a REST API
Installed and configured Spark consumer to stream data from Kafka Producer
Used Spark to migrate the data to HBase.
Proficient in writing queries, stored procedures, functions, and triggers in SQL.
Worked on support, development, testing, and coordination of teams during new system deployments.
Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL)
Configured and deployed production-ready multi-node Hadoop services (Hive, Sqoop, Flume, Oozie) on the Hadoop cluster with the latest patches
Experience creating dynamic web interfaces using JavaScript, jQuery, HTML5, and CSS.
Experience creating metadata and testing databases.
Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume; staging data in HDFS for further analysis.
Used Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters
Configured a multi-node cluster of 10 nodes and 30 brokers for consuming high-volume, high-velocity data
Used Spark SQL to perform transformations and actions on data residing in Hive.
Used ZooKeeper for various kinds of centralized configuration, as well as for Kafka offset management
Assigned to create Hive tables, load data, and write Hive queries.
Imported and exported data into HDFS and Hive using Sqoop and Kafka.
Created partitions and buckets based on state for further processing using bucket-based Hive joins.
Built a prototype for real-time analysis using Spark Streaming and Kafka.
Implemented Kafka messaging consumer
Used Flume and HiveQL scripts to extract, transform, and load the data into the database.
Loaded ingested data into Hive managed and external tables.
Utilized HiveQL to query data to discover trends from week to week with SCD
Involved in creating Hive tables, loading data, and writing Hive queries
Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
EBAY Big Data Engineer
Washington, DC June 2015 – December 2017
Installed and configured Spark consumer to stream data from Kafka Producer
Installed and configured Hive for data warehousing and HQL ETL.
Used Spark to migrate the data to Hive
Worked on creation of AWS EC2 instances, EMR clusters, and Lambda applications.
Deployed a large knowledge base to handle Hadoop applications using
Used AWS Redshift to store data warehouse (DWH) data in the cloud.
Performed maintenance, monitoring, deployments, and upgrades across infrastructure that supports all Hadoop clusters.
Used ZooKeeper and Oozie for coordinating the cluster and scheduling workflows.
Worked on DML and DDL operations on tables stored in HDFS as external Hive tables.
Transformed log data into data models using Kibana
Wrote UDFs to format logs for ingestion.
Used HBase to store the majority of transactional data that required custom partitioning.
Used Spark to process ingested data from varied sources.
Created HBase tables to store data in variable formats coming from different portfolios.
Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark clusters.
Wrote shell scripts to move log files to the Hadoop cluster through automated processes.
Successfully loaded files to HDFS from MySQL using Spark.
AHOLD Data Engineer
Salisbury, NC October 2013 – June 2015
Installed and configured a Hadoop cluster including HDFS, YARN, and MapReduce.
Used Sqoop to migrate data from MySQL to Hive
Installed and configured Hive and wrote Hive UDFs.
Worked with different file formats and compression techniques to meet company standards.
Involved in loading data from the UNIX file system to HDFS.
Installed and configured MySQL server to allow remote user access on Ubuntu
Loaded large datasets from RDBMS sources into the big data platform using Sqoop
Accessed Hadoop cluster (CDM) and reviewed log files of all daemons.
Analyzed datasets using Hive, MapReduce, and Sqoop to recommend business improvements
Maintained and troubleshot network connectivity using Wireshark
Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
Installed and configured a Flume agent to ingest data from a REST API
SKILLS
APACHE - Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Oozie, Apache Spark, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce
SCRIPTING - HiveQL, MapReduce, XML, Python, Pandas, R, JavaScript, HTML, CSS, PHP, UNIX, Shell scripting, LINUX
DATA PROCESSING (COMPUTE) ENGINES - Apache Spark, Spark Streaming, SparkSQL
DATA VISUALIZATION - Excel, Tableau, Spark GraphX
DATABASE - Microsoft SQL Server, Oracle, MySQL, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB
EDUCATION
BACHELOR Materials Engineering
University of Science and Technology Beijing
PH.D. Materials Science
University of Science and Technology Beijing