data engineer spark application

Location:

Torrance, CA

Posted:

December 12, 2018


Resume:

Jieli Chen

************@*****.***; (***) - *** - 0561; https://github.com/JieliChen268

Address: Los Angeles, Torrance, CA 90502, USA; https://www.linkedin.com/in/jielichen/

SUMMARY:

Computer Science graduate student with a strong math and programming background, pursuing a career in data engineering.

EDUCATION

California State University, Dominguez Hills Carson, Los Angeles, CA

M.S. in Computer Science, GPA: 3.9/4.0 Graduation Date: Dec. 2018

Awards & Honors: CSU Foundation_Edison Scholarship

SKILLS

Programming Languages: Python, Java, Scala, SQL, JavaScript

Database: MySQL, Postgres, MongoDB, HDFS, Cassandra, Elasticsearch

Big Data: Spark DataFrame/Dataset API, Spark SQL, Spark Structured Streaming, Airflow, Hadoop, MapReduce, Kafka, Hive

Cloud Computing: Azure (Azure Function App, Event Hub, Databricks), AWS (EC2, EMR, Kinesis, Redshift, S3)

Statistics: Hypothesis Testing, Linear Regression, Naïve Bayes, Logistic Regression

Machine Learning Models: Decision Trees, K Nearest Neighbors, Random Forest, Gradient Boosting

Machine Learning Tools: TensorFlow, scikit-learn, NumPy, SciPy, Pandas

Data Visualization: Jupyter Notebook, matplotlib, seaborn, Tableau

Operating Systems: Linux (Ubuntu, Red Hat, CentOS), Windows, macOS

Tools: Docker, Vagrant, Git

WORK EXPERIENCE

Symantec, Business Intelligence & Telemetry Culver City, CA May 2018 - Aug. 2018

Software Engineer Intern

Project 1: Data Pipeline Visualization and Management System Design and Development (Python)

Used the Apache Airflow framework to design a data pipeline workflow management system

Built and configured Airflow clusters to manage the ETL jobs and their dependencies

Developed DAG tasks to schedule and trigger ETL jobs on multiple remote servers

Visualized running ETL jobs and reported failures promptly

Reduced the failure rate of ETL jobs by around 30%, delivering timely data for reporting and data analysis

Project 2: BIT Data Pipeline Migration to Azure Cloud (C#, Python, Scala)

Used .NET Core to implement an Azure Function App that triggers data ingestion into the Azure Event Hub message bus

Used Spark SQL, the Spark DataFrame/Dataset API, and Structured Streaming to implement streaming Spark applications and ETL jobs

Deployed Spark applications on Azure Databricks clusters

Used Spark UI and Catalyst to optimize Spark job performance

YanSet Los Angeles, CA Oct. 2017 - Jan. 2018

Software Engineer Intern – Data Analytics

Developed and built end-to-end data processing pipelines covering data collection, filtering, analysis, aggregation, loading, and reporting

Used analytical and statistical methods (logistic regression and gradient boosting classification models) to identify the top 10 features driving user retention, and reported them to the product feature team to guide feature additions that improve the user retention rate

Measured views, times, clicks, and other user engagement indicators, and provided solutions that improved user engagement by doubling user click counts

PROJECTS

Real Time Twitter Streaming Analysis System (Scala) Apr. 2018 - May 2018

●Received a stream of tweets by listening to a Twitter app

●Used Spark Streaming to extract words from the streaming tweets and keep only the words that start with a hashtag

●Mapped the filtered words to key-value pairs and counted them over a 30-minute window

●Stored the structured data in a Cassandra database

●Deployed the real-time Twitter streaming analysis application on EMR
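
The hashtag-counting steps above can be illustrated with a minimal, plain-Python sketch. It is not the original Scala/Spark Streaming code: the class name WindowedHashtagCounter, the lowercasing of hashtags, and the assumption that tweets arrive in timestamp order are all hypothetical choices made for illustration.

```python
from collections import Counter, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # window length taken from the project description


def extract_hashtags(text):
    """Split a tweet into words and keep those that start with '#'."""
    # Lowercasing is an assumption so that '#Spark' and '#spark' count together.
    return [w.lower() for w in text.split() if w.startswith("#") and len(w) > 1]


class WindowedHashtagCounter:
    """Keeps hashtag counts for the most recent window (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.events = deque()    # (timestamp, hashtag) pairs in arrival order
        self.counts = Counter()  # hashtag -> count inside the current window

    def add(self, timestamp, text):
        """Record one tweet; timestamps are assumed to be non-decreasing."""
        for tag in extract_hashtags(text):
            self.events.append((timestamp, tag))
            self.counts[tag] += 1
        self._evict(timestamp)

    def _evict(self, now):
        """Drop events that have fallen out of the window ending at 'now'."""
        while self.events and self.events[0][0] <= now - self.window:
            _, old_tag = self.events.popleft()
            self.counts[old_tag] -= 1
            if self.counts[old_tag] == 0:
                del self.counts[old_tag]

    def top(self, n=10):
        """Return the n most frequent hashtags in the current window."""
        return self.counts.most_common(n)
```

In a Spark deployment the same logic would typically be expressed with the framework's built-in windowed aggregation rather than a hand-rolled queue; the sketch only shows the map, count, and window-eviction idea on a single machine.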

Movie Recommender System (Hadoop - MapReduce) Jul. 2017 - Aug. 2017

●Used Netflix data as input to recommend movies according to the movies that users watched and rated.

●Used a Collaborative Filtering algorithm for this recommender system because the number of users is larger than the number of products

●Built a user rating matrix to represent how each user rated the movies they watched

●Created a co-occurrence matrix to represent the relationship between different movies

●Multiplied user rating matrix and co-occurrence matrix to get a merged recommender list for users.

●Implemented MapReduce jobs in IntelliJ IDEA to do the matrix multiplication
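
The matrix steps above can be sketched in a few lines of single-process Python. This is an illustration under stated assumptions, not the original MapReduce implementation: the function names, the dictionary layout of the ratings, and the choice to leave the co-occurrence counts unnormalized are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_cooccurrence(ratings):
    """Count, for every ordered movie pair, how many users rated both movies."""
    co = defaultdict(int)
    for user_ratings in ratings.values():
        for a, b in product(user_ratings, repeat=2):
            co[(a, b)] += 1
    return co


def recommend(ratings, user, top_n=3):
    """Score unseen movies as co-occurrence matrix times the user's rating vector."""
    co = build_cooccurrence(ratings)
    seen = ratings[user]
    scores = defaultdict(float)
    for (candidate, rated), count in co.items():
        # Only movies the user has rated contribute; skip movies already seen.
        if rated in seen and candidate not in seen:
            scores[candidate] += count * seen[rated]
    # Highest score first; ties broken by movie name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical sample data: user -> {movie: rating}
sample_ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 4, "C": 5},
    "u3": {"B": 2, "C": 4},
    "u4": {"A": 5},
}
```

With the sample data, recommend(sample_ratings, "u1") scores movie C as 1 * 5 + 2 * 3 = 11, because C co-occurs once with A and twice with B. In a MapReduce setting each of these steps would usually be a separate job over key-value pairs.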
