Xing Li
BIG DATA ENGINEER
Phone: 813-***-**** *********@*****.***
SUMMARY
** ***** ** ********** ** the development of custom Hadoop Big Data solutions, platforms, pipelines, data migration, and data visualizations.
Ability to troubleshoot and tune code in SQL, Java, Python, and Scala, and across Hive, Spark RDDs, DataFrames & MapReduce. Able to design elegant solutions from well-defined problem statements.
In-depth understanding of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce, with experience writing MapReduce programs on Apache Hadoop to analyze large datasets efficiently.
Created classes that model real-life objects and wrote loops to perform actions on data.
AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda)
Good experience working with cloud environments such as Amazon Web Services (AWS) EC2 and S3.
Experience working closely with operational data to provide insights and value to inform company strategy.
Experience with multiple terabytes of data stored in AWS using Elastic Map Reduce (EMR) and Redshift
Developed data queries using HiveQL and optimized the Hive queries
Built a customer behavior classification model using the Random Forest algorithm in PySpark (see the sketch following this summary).
Designed, developed, and migrated a high-performance, metadata-driven data pipeline with Kafka and Hive, providing data export capability through an API and UI.
Worked with multiple terabytes of mobile ad data stored in AWS using Elastic MapReduce and Redshift (PostgreSQL).
Experience working on Cloudera distributions (CDH 4/CDH 5), with knowledge of Hortonworks and Amazon EMR Hadoop distributions.
Created Hive managed and external tables with partitioning and bucketing and loaded data into Hive.
Built and analyzed a regression model on Google Cloud using PySpark.
Excellent knowledge of Big Data infrastructure: distributed file systems (HDFS), parallel processing (the MapReduce framework), and the complete Hadoop ecosystem, including Hive, Hue, HBase, Zookeeper, Sqoop, Kafka, Storm, Spark, Flume, and Oozie.
In-depth knowledge of real-time ETL/Spark analytics using Spark SQL with visualization.
Hands-on experience with YARN (MapReduce 2.0) architecture and components such as the Resource Manager, Node Manager, Container, and Application Master, and with the execution of a MapReduce job.
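The following is a minimal PySpark sketch of the Random Forest classification approach mentioned above; the feature columns, label column, and input path are hypothetical placeholders, not the actual project code.

    # Illustrative PySpark sketch: Random Forest customer-behavior classifier.
    # Feature columns, label column, and input path are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("customer-behavior-rf").getOrCreate()
    df = spark.read.parquet("s3://example-bucket/customer_features/")  # hypothetical path

    # Assemble the (hypothetical) numeric features into a single vector column.
    assembler = VectorAssembler(
        inputCols=["visits_30d", "avg_basket_value", "days_since_last_order"],
        outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, rf]).fit(train)
    model.transform(test).select("label", "prediction", "probability").show(5)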
WORK HISTORY
October 2019 – Present
Big Data Developer, Verizon, Tampa, FL
Developed Spark streaming application to pull data from cloud servers to the Hive table.
Used Spark SQL to process large amounts of structured data.
Worked with the Spark-SQL context to create data frames to filter input data for model execution.
Converted existing Hive scripts to Spark applications using RDDs for transforming data and loading it into HDFS.
Developed Hive queries used daily to retrieve datasets.
Ingested data from JDBC databases into Hive.
Used Spark to transform and prepare DataFrames.
Removed and filtered unnecessary records from raw data using Spark.
Masked highly confidential information across different tables and databases.
Created Spark 2.2.0 jobs to manipulate datasets, optimizing transformations.
Developed and managed an environment based on DBeaver, pgAdmin, Jupyter, and Hue on a Cloudera distribution.
Configured StreamSets to store the converted data in Hive using JDBC drivers.
Wrote UDFs to find locations within 500 ft of multiple relevant points.
Wrote Spark code to remove certain fields from Hive tables.
Joined multiple tables to find the correct information for given addresses.
Wrote code to standardize string fields and IP addresses across datasets.
Used Hadoop as a data lake to store large amounts of data.
Removed empty strings from fields to verify data cleanliness.
Developed Oozie workflows and ran Oozie jobs to execute tasks in parallel.
Used Oozie to update Hive tables automatically.
Wrote unit tests for all code using frameworks such as PyTest.
Wrote code to accommodate schema changes in source tables based on data warehouse dimensional modeling.
Created a geospatial index, performed geospatial searches, and populated a result column indicating Y/N based on the distance found (see the sketch following this role).
Developed test cases for deployed Spark code and UDFs.
Optimized JDBC connections with bulk upload for Hive and Spark imports.
Handled defects from internal testing tools to increase code coverage of Spark code.
Wrote a UDF to find the maximum downstream speed for locations and populate it into the table.
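Below is a minimal PySpark sketch of the geospatial proximity check described in this role (flagging locations within 500 ft); the haversine-based UDF, table, and column names are illustrative assumptions rather than the production implementation.

    # Illustrative PySpark UDF: flag locations within 500 ft of a point of interest.
    # Table and column names are hypothetical.
    import math
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def within_500_ft(lat1, lon1, lat2, lon2):
        # Return 'Y' if the two coordinates are within 500 feet (haversine distance), else 'N'.
        if None in (lat1, lon1, lat2, lon2):
            return "N"
        r_ft = 20902231.0  # Earth's mean radius in feet
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return "Y" if 2 * r_ft * math.asin(math.sqrt(a)) <= 500 else "N"

    spark = SparkSession.builder.enableHiveSupport().appName("proximity-flag").getOrCreate()
    proximity_udf = udf(within_500_ft, StringType())

    locations = spark.table("analytics.locations")  # hypothetical Hive table
    flagged = locations.withColumn(
        "near_poi", proximity_udf("lat", "lon", "poi_lat", "poi_lon"))  # Y/N result column
    flagged.write.mode("overwrite").saveAsTable("analytics.locations_flagged")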
January 2018 – October 2019
Data Engineer, The Home Depot, Atlanta, GA
Installed and configured Hive and wrote Hive UDFs.
Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Amazon Redshift.
Configured a Kafka producer with API endpoints using JDBC Autonomous REST Connectors.
Performed upgrades, patches, and bug fixes in Hadoop in a cluster environment
Worked on Spark SQL to check the data; Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
Loaded ingested data into Hive Managed and External tables.
Learned many technologies on the job as required by the project; brought strong communication, interpersonal, and analytical skills as a highly motivated team player able to work independently.
Wrote a Python script to send requests to the Thames Water REST-based API using the requests library and serve the responses to a Kafka producer (see the sketch following this role).
Created Hive queries to summarize and aggregate business data by comparing Hadoop data with historical metrics.
Worked on streaming the processed data to DynamoDB using Spark for making it available for visualization and report generation by the BI team.
Evaluated and proposed new tools and technologies to meet the needs of the organization; excellent understanding of Hadoop architecture and components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
Built Hive views on top of the source data tables and built a secured provisioning layer.
Set up the environment on AWS EMR and implemented Hadoop on EC2 with Amazon Redshift and AWS S3.
Provided status to stakeholders and facilitated periodic review meetings.
Wrote shell scripts for automating the process of data loading
Wrote shell scripts to automate workflows that pull data from various databases into the Hadoop framework for users to access the data through Hive-based views.
Ability to learn and adapt quickly to emerging new technologies and paradigms.
Implemented Spark in EMR for processing Big Data across our Data Lake in AWS System
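Below is a minimal Python sketch of the REST-to-Kafka producer pattern described in this role; the endpoint URL, topic name, and broker address are hypothetical placeholders, and kafka-python is assumed as the client library.

    # Illustrative sketch: poll a REST API and publish the responses to a Kafka topic.
    # Endpoint, topic, and broker address are hypothetical placeholders.
    import json
    import time

    import requests
    from kafka import KafkaProducer  # kafka-python client

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    API_URL = "https://api.example.com/v1/readings"  # hypothetical endpoint

    while True:
        resp = requests.get(API_URL, timeout=30)
        resp.raise_for_status()
        for record in resp.json():
            producer.send("meter-readings", value=record)  # hypothetical topic
        producer.flush()
        time.sleep(60)  # poll once per minute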
May 2016 – January 2018
Data Engineer, Alignment Healthcare, Orange, CA
Wrote shell scripts for automating the process of data loading
Performed streaming data ingestion to the Spark distribution environment, using Kafka.
Proficient in writing complex queries against the Apache Hive API on the Hortonworks Sandbox.
Parsed the JSON response into a DataFrame using a schema containing country code, artist name, number of plays, and genre (see the sketch following this role).
Wrote producer/consumer scripts in Python to process JSON responses.
Developed distributed query agents for performing distributed queries against shards
Experience with hands-on data analysis and performing under pressure.
Proficient in writing queries, stored procedures, functions, and triggers using SQL; supported development, testing, and operations teams during new system deployments.
Implemented Spark using Scala and utilized Data Frames and Spark SQL API for faster processing of data.
Created Hive queries to summarize and aggregate business data by comparing Hadoop data with historical metrics.
Worked closely with stakeholders and data scientists/data analysts to gather requirements and create an engineering project plan.
Performed ETL into the Hadoop file system (HDFS) and wrote Hive UDFs.
Used Hive for queries and performed incremental imports with Spark jobs.
Developed JDBC/ODBC connectors between Hive and Spark to transfer the newly populated DataFrames.
Wrote Hive queries to analyze data in the Hive warehouse using Hive Query Language.
Managed jobs using the Fair Scheduler to allocate processing resources.
Used the Spark engine and Spark SQL for data analysis and provided results to data scientists for further analysis.
Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational KQL queries, including joins
Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration.
Created materialized views, partitions, tables, views, and indexes.
Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL).
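Below is a minimal PySpark sketch of parsing a JSON API response into a DataFrame with an explicit schema, as described in this role; the sample rows are hypothetical, and the field names follow the bullet above.

    # Illustrative sketch: build a Spark DataFrame from parsed JSON using an explicit schema.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("json-to-dataframe").getOrCreate()

    schema = StructType([
        StructField("country_code", StringType(), True),
        StructField("artist_name", StringType(), True),
        StructField("number_of_plays", IntegerType(), True),
        StructField("genre", StringType(), True),
    ])

    # In practice these rows would come from the parsed API response (e.g. response.json()).
    rows = [("US", "Example Artist", 42, "rock"),
            ("GB", "Another Artist", 17, "jazz")]

    df = spark.createDataFrame(rows, schema=schema)
    df.show()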
February 2015 – May 2016
Data Engineer, Cloudflare, San Francisco, CA
Designed and implemented test environments on AWS.
Periodically imported logs into HDFS using Flume.
Designed appropriate partitioning/bucketing schemas to enable faster data retrieval during analysis with Spark.
Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
Used SparkSQL module to store data into HDFS
Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
Used Spark DataFrame API over Hortonworks platform to perform analytics on data.
Proficient in writing complex queries to the Apache Hive API on Hortonworks Sandbox
Collaborated with various teams & management to understand a set of requirements & design a complete system
Implemented a complete Big data pipeline with real-time processing
Developed Spark code in Scala using Spark SQL & Data Frames for aggregation
Worked with Sqoop to ingest & retrieve data from various RDBMS such as MySQL
Created schemas in Hive with performance optimization using bucketing & partitioning (see the sketch following this role).
Worked rigorously with Impala for executing ad-hoc queries
Wrote Hive queries to transform data for further downstream processing.
Experience with Kafka brokers, Zookeeper & Kafka Control Center.
Created Hive External tables and loaded the data into tables and query data using HQL.
Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
Leverage Hadoop ecosystem/Cloud platform to design and develop solutions using Kafka, Spark, Hive, and relevant Big Data technologies
Worked with the Spark-SQL context to create data frames to filter input data for model execution.
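Below is a minimal PySpark sketch of the partitioning and bucketing pattern referenced in this role, using Spark's DataFrame writer to register a partitioned, bucketed table in the Hive metastore; the database, table, and column names are illustrative assumptions.

    # Illustrative sketch: write a partitioned, bucketed table into the Hive metastore.
    # Database, table, and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partition-bucket")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.table("analytics.events_staging")  # hypothetical staging table

    # Partition by date for partition pruning; bucket by user_id to speed up joins and sampling.
    (events.write
        .mode("overwrite")
        .partitionBy("event_date")
        .bucketBy(32, "user_id")
        .sortBy("user_id")
        .format("orc")
        .saveAsTable("analytics.events_optimized"))

    # Downstream queries can then prune partitions, e.g.:
    spark.sql("SELECT count(*) FROM analytics.events_optimized WHERE event_date = '2019-01-01'").show()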
November 2013 – February 2015
Hadoop Developer, 3M, Saint Paul, MN
Developed new application and added functionality to existing applications using Java/J2EE technologies
Wrote SQL queries to retrieve data from the database using JDBC
Deep understanding and implementation of various methods to load MySQL tables from the server and the local file system (see the sketch following this role).
Implemented and supported SQL Server Database Systems for transactional environments.
Monitored and tuned SQL Server Databases across all environments including Development, QA, Production
Maintained performance and scaling of the SQL Server database system for the enterprise data warehouse.
Implemented Capacity Schedulers on the Yarn Resource Manager to share the resources of the cluster for the jobs given by the users.
Optimized ETL jobs to reduce memory and storage consumption.
Communicated and presented findings, orally and visually, in a way that can be easily understood by business counterparts.
Created & maintained Java programs to connect to MySQL via the JDBC driver.
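Below is a minimal Python sketch (using pymysql rather than the Java/JDBC code used in this role) of the two ways to load a MySQL table from a file; connection details, table name, and file paths are hypothetical placeholders.

    # Illustrative sketch: load a MySQL table from a server-side file and from the local file system.
    # Connection details, table name, and file paths are hypothetical.
    import pymysql

    conn = pymysql.connect(
        host="localhost", user="etl_user", password="secret",
        database="warehouse", local_infile=True)

    with conn.cursor() as cur:
        # Server-side load: the file lives on the MySQL server host.
        cur.execute("""
            LOAD DATA INFILE '/var/lib/mysql-files/orders.csv'
            INTO TABLE orders
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
        # Client-side load: the file lives on the local file system of the client.
        cur.execute("""
            LOAD DATA LOCAL INFILE '/tmp/orders_local.csv'
            INTO TABLE orders
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
    conn.commit()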
September 2011 – November 2013
Hadoop Developer, Whirlpool, Benton Harbor, MI
Used Sqoop to efficiently transfer data between relational databases and HDFS and used Flume to stream the log data from servers.
Implemented partitioning and bucketing in Hive for better organization of the data.
Worked with different file formats and compression techniques to meet standards (see the sketch following this role).
Loaded data from a UNIX file system into HDFS.
Used UNIX shell scripts to automate the build process and to perform routine jobs such as file transfers between different hosts.
Assigned to production support, which involved monitoring server and error logs, anticipating and preventing potential problems, and escalating issues when necessary.
Documented technical specs, data flows, data models, and class models.
Documented requirements gathered from stakeholders.
Successfully loaded files into HDFS from Teradata and loaded them from HDFS into Hive.
Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.
Involved in researching various available technologies, industry trends, and cutting-edge applications.
Performed data ingestion using Flume with a Kafka source and an HDFS sink.
Performed storage capacity management, performance tuning, and benchmarking of clusters.
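Below is a minimal PySpark sketch of writing the same dataset in different file formats with different compression codecs, as referenced in this role; the source table and output paths are hypothetical.

    # Illustrative sketch: write one dataset as Parquet, ORC, and JSON with different compression codecs.
    # Source table and output paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().appName("formats-and-compression").getOrCreate()
    df = spark.table("warehouse.shipments")  # hypothetical source table

    # Columnar Parquet with Snappy compression (fast and splittable).
    df.write.mode("overwrite").option("compression", "snappy").parquet("hdfs:///data/shipments_parquet")

    # ORC with ZLIB compression (higher compression ratio, works well with Hive).
    df.write.mode("overwrite").option("compression", "zlib").orc("hdfs:///data/shipments_orc")

    # Plain JSON with Gzip for interchange with downstream consumers.
    df.write.mode("overwrite").option("compression", "gzip").json("hdfs:///data/shipments_json")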
TECHNICAL SKILLS
PROGRAMMING
Spark API, Scala, Java, Python, C, Assembly Language, UNIX Shell Scripting
DATA REPOSITORIES
HDFS, Data Warehouse, Data Lake, S3
DATABASE
Apache Cassandra & Hbase, AWS Redshift
HADOOP DISTRIBUTIONS
Cloudera, Hortonworks, Hadoop
QUERY/SEARCH
SQL, HiveQL, Spark SQL, Elasticsearch
FRAMEWORKS
Spark, Kafka, Spark Streaming
VISUALIZATION
Kibana, Tableau, PowerBI, Excel
FILE FORMATS
Parquet, Avro, Orc, JSON
HADOOP
Hive, MapReduce, Zookeeper, Yarn
QUERY LANGUAGE
SQL, HiveQL, Spark SQL, CQL
SOFTWARE DEVELOPMENT
Agile, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Gradle, Git, SVN, Jenkins, Travis, Jira, Maven
DEVELOPMENT ENVIRONMENTS
Jupyter Notebooks, PyCharm, IntelliJ, Eclipse, Netbeans, vscode
AMAZON CLOUD
Amazon AWS (EMR, EC2, S3, SQL, DynamoDB, Cassandra, Redshift, CloudFormation)
DATA PIPELINE TOOLS
Apache Airflow, Apache Oozie, Nifi
ADMIN TOOLS
Cloudera Manager, Ambari, Zookeeper
EDUCATION
BACHELOR OF SCIENCE IN COMPUTER ENGINEERING
Michigan Technological University
Houghton, Michigan
CERTIFICATIONS
Hadoop Fundamentals, IBM
Big Data Fundamentals, IBM