Xing Li
BIG DATA ENGINEER
Phone: 813-***-**** *********@*****.***
SUMMARY
** ***** ** ********** ** the development of custom Hadoop Big Data solutions, platforms, pipelines, data migration, and data visualizations.
Ability to troubleshoot and tune code in SQL, Java, Python, and Scala, and across Hive, Spark RDDs, DataFrames & MapReduce. Able to design elegant solutions from well-defined problem statements.
In-depth understanding of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce, with experience writing MapReduce programs on Apache Hadoop to analyze large datasets efficiently.
Created classes that model real-life objects and wrote loops to perform actions on data.
AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda)
Good experience working with cloud environments such as Amazon Web Services (AWS) EC2 and S3.
Experience working closely with operational data to provide insights and value to inform company strategy.
Experience with multiple terabytes of data stored in AWS using Elastic Map Reduce (EMR) and Redshift
Developed data queries using HiveQL and optimized the Hive queries
Built a customer behavior classification model using the Random Forest algorithm in PySpark (see the sketch following this summary).
Designed, developed, and migrated a high-performance, metadata-driven data pipeline with Kafka and Hive, providing data export capability through an API and UI.
Worked with multiple terabytes of mobile ad data stored in AWS using Elastic MapReduce and Redshift (PostgreSQL).
Experience working on Cloudera distributions (CDH 4/CDH 5), with knowledge of Hortonworks and Amazon EMR Hadoop distributions.
Created Hive managed and external tables with partitioning and bucketing and loaded data into Hive.
Built and analyzed a regression model on Google Cloud using PySpark.
Excellent knowledge of Big Data infrastructure: distributed file systems (HDFS), parallel processing (the MapReduce framework), and the complete Hadoop ecosystem, including Hive, Hue, HBase, Zookeeper, Sqoop, Kafka, Storm, Spark, Flume, and Oozie.
In-depth knowledge of real-time ETL/Spark analytics using Spark SQL with visualization.
Hands-on experience with YARN (MapReduce 2.0) architecture and components such as the Resource Manager, Node Manager, Container, and Application Master, and with the execution of a MapReduce job.
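The following is a minimal PySpark sketch of the Random Forest classification approach mentioned above; the feature columns, label column, and input path are hypothetical placeholders, not the actual project code.

    # Illustrative PySpark sketch: Random Forest customer-behavior classifier.
    # Feature columns, label column, and input path are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("customer-behavior-rf").getOrCreate()
    df = spark.read.parquet("s3://example-bucket/customer_features/")  # hypothetical path

    # Assemble the (hypothetical) numeric features into a single vector column.
    assembler = VectorAssembler(
        inputCols=["visits_30d", "avg_basket_value", "days_since_last_order"],
        outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, rf]).fit(train)
    model.transform(test).select("label", "prediction", "probability").show(5)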
WORK HISTORY
October 2019 – Present
Big Data Developer, Verizon, Tampa, FL
Developed Spark streaming application to pull data from cloud servers to the Hive table.
Used Spark SQL to process large amounts of structured data.
Worked with the Spark-SQL context to create data frames to filter input data for model execution.
Converted existing Hive scripts to Spark applications using RDDs for transforming data and loading it into HDFS.
Developed Hive queries used daily to retrieve datasets.
Ingested data from JDBC databases into Hive.
Used Spark to transform and prepare DataFrames.
Removed and filtered unnecessary records from raw data using Spark.
Masked highly confidential information across different tables and databases.
Created Spark 2.2.0 jobs to manipulate datasets, optimizing transformations.
Developed and managed an environment based on DBeaver, pgAdmin, Jupyter, and Hue on a Cloudera distribution.
Configured StreamSets to store the converted data in Hive using JDBC drivers.
Wrote UDFs to find locations within 500 ft of multiple relevant points.
Wrote Spark code to remove certain fields from Hive tables.
Joined multiple tables to find the correct information for given addresses.
Wrote code to standardize string fields and IP addresses across datasets.
Used Hadoop as a data lake to store large amounts of data.
Removed empty strings from fields to verify data cleanliness.
Developed Oozie workflows and ran Oozie jobs to execute tasks in parallel.
Used Oozie to update Hive tables automatically.
Wrote unit tests for all code using frameworks such as PyTest.
Wrote code to accommodate schema changes in source tables based on data warehouse dimensional modeling.
Created a geospatial index, performed geospatial searches, and populated a result column indicating Y/N based on the distance found (see the sketch following this role).
Developed test cases for deployed Spark code and UDFs.
Optimized JDBC connections with bulk upload for Hive and Spark imports.
Handled defects from internal testing tools to increase code coverage of Spark code.
Wrote a UDF to find the maximum downstream speed for locations and populate it into the table.
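Below is a minimal PySpark sketch of the geospatial proximity check described in this role (flagging locations within 500 ft); the haversine-based UDF, table, and column names are illustrative assumptions rather than the production implementation.

    # Illustrative PySpark UDF: flag locations within 500 ft of a point of interest.
    # Table and column names are hypothetical.
    import math
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def within_500_ft(lat1, lon1, lat2, lon2):
        # Return 'Y' if the two coordinates are within 500 feet (haversine distance), else 'N'.
        if None in (lat1, lon1, lat2, lon2):
            return "N"
        r_ft = 20902231.0  # Earth's mean radius in feet
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return "Y" if 2 * r_ft * math.asin(math.sqrt(a)) <= 500 else "N"

    spark = SparkSession.builder.enableHiveSupport().appName("proximity-flag").getOrCreate()
    proximity_udf = udf(within_500_ft, StringType())

    locations = spark.table("analytics.locations")  # hypothetical Hive table
    flagged = locations.withColumn(
        "near_poi", proximity_udf("lat", "lon", "poi_lat", "poi_lon"))  # Y/N result column
    flagged.write.mode("overwrite").saveAsTable("analytics.locations_flagged")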
January 2018 – October 2019
Data Engineer, The Home Depot, Atlanta, GA
Installed and configured Hive and wrote Hive UDFs.
Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Amazon Redshift.
Configured a Kafka producer with API endpoints using JDBC Autonomous REST Connectors.
Performed upgrades, patches, and bug fixes in Hadoop in a cluster environment
Worked on Spark SQL to check the data; Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
Loaded ingested data into Hive Managed and External tables.
Learned many technologies on the job as required by the project; brought strong communication, interpersonal, and analytical skills as a highly motivated team player able to work independently.
Wrote a Python script to send requests to the Thames Water REST-based API using the requests library and serve the responses to a Kafka producer (see the sketch following this role).
Created Hive queries to summarize and aggregate business data by comparing Hadoop data with historical metrics.
Worked on streaming the processed data to DynamoDB using Spark for making it available for visualization and report generation by the BI team.
Evaluated and proposed new tools and technologies to meet the needs of the organization; excellent understanding of Hadoop architecture and components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
Built Hive views on top of the source data tables and built a secured provisioning layer.
Set up the environment on AWS EMR and implemented Hadoop on EC2 with Amazon Redshift and AWS S3.
Provided status to stakeholders and facilitated periodic review meetings.
Wrote shell scripts for automating the process of data loading
Wrote shell scripts to automate workflows that pull data from various databases into the Hadoop framework for users to access the data through Hive-based views.
Ability to learn and adapt quickly to emerging new technologies and paradigms.
Implemented Spark in EMR for processing Big Data across our Data Lake in AWS System
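Below is a minimal Python sketch of the REST-to-Kafka producer pattern described in this role; the endpoint URL, topic name, and broker address are hypothetical placeholders, and kafka-python is assumed as the client library.

    # Illustrative sketch: poll a REST API and publish the responses to a Kafka topic.
    # Endpoint, topic, and broker address are hypothetical placeholders.
    import json
    import time

    import requests
    from kafka import KafkaProducer  # kafka-python client

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    API_URL = "https://api.example.com/v1/readings"  # hypothetical endpoint

    while True:
        resp = requests.get(API_URL, timeout=30)
        resp.raise_for_status()
        for record in resp.json():
            producer.send("meter-readings", value=record)  # hypothetical topic
        producer.flush()
        time.sleep(60)  # poll once per minute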
May 2016 – January 2018
Data Engineer, Alignment Healthcare, Orange, CA
Wrote shell scripts for automating the process of data loading
Performed streaming data ingestion to the Spark distribution environment, using Kafka.
Proficient in writing complex queries against the Apache Hive API on the Hortonworks Sandbox.
Parsed the JSON response into a DataFrame using a schema containing country code, artist name, number of plays, and genre (see the sketch following this role).
Wrote producer/consumer scripts in Python to process JSON responses.
Developed distributed query agents for performing distributed queries against shards
Experience with hands-on data analysis and performing under pressure.
Proficient in writing queries, stored procedures, functions, and triggers using SQL; supported development, testing, and operations teams during new system deployments.
Implemented Spark using Scala and utilized Data Frames and Spark SQL API for faster processing of data.
Created Hive queries to summarize and aggregate business data by comparing Hadoop data with historical metrics.
Worked closely with stakeholders and data scientists/data analysts to gather requirements and create an engineering project plan.
Performed ETL into the Hadoop file system (HDFS) and wrote Hive UDFs.
Used Hive for queries and performed incremental imports with Spark jobs.
Developed JDBC/ODBC connectors between Hive and Spark to transfer the newly populated DataFrames.
Wrote Hive queries to analyze data in the Hive warehouse using Hive Query Language.
Managed jobs using the Fair Scheduler to allocate processing resources.
Used the Spark engine and Spark SQL for data analysis and provided results to data scientists for further analysis.
Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational KQL queries, including joins
Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration.
Created materialized views, partitions, tables, views, and indexes.
Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL).
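Below is a minimal PySpark sketch of parsing a JSON API response into a DataFrame with an explicit schema, as described in this role; the sample rows are hypothetical, and the field names follow the bullet above.

    # Illustrative sketch: build a Spark DataFrame from parsed JSON using an explicit schema.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("json-to-dataframe").getOrCreate()

    schema = StructType([
        StructField("country_code", StringType(), True),
        StructField("artist_name", StringType(), True),
        StructField("number_of_plays", IntegerType(), True),
        StructField("genre", StringType(), True),
    ])

    # In practice these rows would come from the parsed API response (e.g. response.json()).
    rows = [("US", "Example Artist", 42, "rock"),
            ("GB", "Another Artist", 17, "jazz")]

    df = spark.createDataFrame(rows, schema=schema)
    df.show()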
February 2015 – May 2016
Data Engineer, Cloudflare, San Francisco, CA
Designed and implemented test environments on AWS.
Periodically imported logs into HDFS using Flume.
Designed appropriate partitioning/bucketing schemas to enable faster data retrieval during analysis with Spark.
Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
Used SparkSQL module to store data into HDFS
Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
Used Spark DataFrame API over Hortonworks platform to perform analytics on data.
Proficient in writing complex queries to the Apache Hive API on Hortonworks Sandbox
Collaborated with various teams & management to understand a set of requirements & design a complete system
Implemented a complete Big data pipeline with real-time processing
Developed Spark code in Scala using Spark SQL & Data Frames for aggregation
Worked with Sqoop to ingest & retrieve data from various RDBMS such as MySQL
Created schemas in Hive with performance optimization using bucketing & partitioning (see the sketch following this role).
Worked rigorously with Impala for executing ad-hoc queries
Wrote Hive queries to transform data for further downstream processing.
Experience with Kafka brokers, Zookeeper & Kafka Control Center.
Created Hive External tables and loaded the data into tables and query data using HQL.
Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
Leverage Hadoop ecosystem/Cloud platform to design and develop solutions using Kafka, Spark, Hive, and relevant Big Data technologies
Worked with the Spark-SQL context to create data frames to filter input data for model execution.
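Below is a minimal PySpark sketch of the partitioning and bucketing pattern referenced in this role, using Spark's DataFrame writer to register a partitioned, bucketed table in the Hive metastore; the database, table, and column names are illustrative assumptions.

    # Illustrative sketch: write a partitioned, bucketed table into the Hive metastore.
    # Database, table, and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partition-bucket")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.table("analytics.events_staging")  # hypothetical staging table

    # Partition by date for partition pruning; bucket by user_id to speed up joins and sampling.
    (events.write
        .mode("overwrite")
        .partitionBy("event_date")
        .bucketBy(32, "user_id")
        .sortBy("user_id")
        .format("orc")
        .saveAsTable("analytics.events_optimized"))

    # Downstream queries can then prune partitions, e.g.:
    spark.sql("SELECT count(*) FROM analytics.events_optimized WHERE event_date = '2019-01-01'").show()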
November 2013 – February 2015
Hadoop Developer, 3M, Saint Paul, MN
Developed new application and added functionality to existing applications using Java/J2EE technologies
Wrote SQL queries to retrieve data from the database using JDBC
Deep understanding and implementation of various methods to load MySQL tables from the server and the local file system (see the sketch following this role).
Implemented and supported SQL Server Database Systems for transactional environments.
Monitored and tuned SQL Server Databases across all environments including Development, QA, Production
Maintained performance and scaling of the SQL Server database system for the enterprise data warehouse.
Implemented Capacity Schedulers on the Yarn Resource Manager to share the resources of the cluster for the jobs given by the users.
Optimized ETL jobs to reduce memory and storage consumption.
Communicated and presented findings, orally and visually, in a way that can be easily understood by business counterparts.
Created & maintained Java programs to connect to MySQL via the JDBC driver.
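Below is a minimal Python sketch (using pymysql rather than the Java/JDBC code used in this role) of the two ways to load a MySQL table from a file; connection details, table name, and file paths are hypothetical placeholders.

    # Illustrative sketch: load a MySQL table from a server-side file and from the local file system.
    # Connection details, table name, and file paths are hypothetical.
    import pymysql

    conn = pymysql.connect(
        host="localhost", user="etl_user", password="secret",
        database="warehouse", local_infile=True)

    with conn.cursor() as cur:
        # Server-side load: the file lives on the MySQL server host.
        cur.execute("""
            LOAD DATA INFILE '/var/lib/mysql-files/orders.csv'
            INTO TABLE orders
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
        # Client-side load: the file lives on the local file system of the client.
        cur.execute("""
            LOAD DATA LOCAL INFILE '/tmp/orders_local.csv'
            INTO TABLE orders
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
    conn.commit()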
September 2011 – November 2013
Hadoop Developer, Whirlpool, Benton Harbor, MI
Used Sqoop to efficiently transfer data between relational databases and HDFS and used Flume to stream the log data from servers.
Implemented partitioning and bucketing in Hive for better organization of the data.
Worked with different file formats and compression techniques to meet standards (see the sketch following this role).
Loaded data from a UNIX file system into HDFS.
Used UNIX shell scripts to automate the build process and to perform routine jobs such as file transfers between different hosts.
Assigned to production support, which involved monitoring server and error logs, anticipating and preventing potential problems, and escalating issues when necessary.
Documented technical specs, data flows, data models, and class models.
Documented requirements gathered from stakeholders.
Successfully loaded files into HDFS from Teradata and loaded them from HDFS into Hive.
Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.
Involved in researching various available technologies, industry trends, and cutting-edge applications.
Performed data ingestion using Flume with a Kafka source and an HDFS sink.
Performed storage capacity management, performance tuning, and benchmarking of clusters.
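Below is a minimal PySpark sketch of writing the same dataset in different file formats with different compression codecs, as referenced in this role; the source table and output paths are hypothetical.

    # Illustrative sketch: write one dataset as Parquet, ORC, and JSON with different compression codecs.
    # Source table and output paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().appName("formats-and-compression").getOrCreate()
    df = spark.table("warehouse.shipments")  # hypothetical source table

    # Columnar Parquet with Snappy compression (fast and splittable).
    df.write.mode("overwrite").option("compression", "snappy").parquet("hdfs:///data/shipments_parquet")

    # ORC with ZLIB compression (higher compression ratio, works well with Hive).
    df.write.mode("overwrite").option("compression", "zlib").orc("hdfs:///data/shipments_orc")

    # Plain JSON with Gzip for interchange with downstream consumers.
    df.write.mode("overwrite").option("compression", "gzip").json("hdfs:///data/shipments_json")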
TECHNICAL SKILLS
PROGRAMMING
Spark API, Scala, Java, Python, C, Assembly Language, UNIX Shell Scripting
DATA REPOSITORIES
HDFS, Data Warehouse, Data Lake, S3
DATABASE
Apache Cassandra & Hbase, AWS Redshift
HADOOP DISTRIBUTIONS
Cloudera, Hortonworks, Hadoop
QUERY/SEARCH
SQL, HiveQL, Spark SQL, Elasticsearch
FRAMEWORKS
Spark, Kafka, Spark Streaming
VISUALIZATION
Kibana, Tableau, PowerBI, Excel
FILE FORMATS
Parquet, Avro, Orc, JSON
HADOOP
Hive, MapReduce, Zookeeper, Yarn
QUERY LANGUAGE
SQL, HiveQL, Spark SQL, CQL
SOFTWARE DEVELOPMENT
Agile, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Gradle, Git, SVN, Jenkins, Travis, Jira, Maven
DEVELOPMENT ENVIRONMENTS
Jupyter Notebooks, PyCharm, IntelliJ, Eclipse, Netbeans, vscode
AMAZON CLOUD
Amazon AWS (EMR, EC2, S3, SQL, DynamoDB, Cassandra, Redshift, CloudFormation)
DATA PIPELINE TOOLS
Apache Airflow, Apache Oozie, Nifi
ADMIN TOOLS
Cloudera Manager, Ambari, Zookeeper
EDUCATION
BACHELOR OF SCIENCE IN COMPUTER ENGINEERING
Michigan Technological University
Houghton, Michigan
CERTIFICATIONS
Hadoop Fundamentals, IBM
Big Data Fundamentals, IBM