Data Analyst Python

Location:

Washington, DC

Posted:

July 03, 2020


Resume:

Charles Li

443-***-**** *******.**@***.*** LinkedIn GitHub

EDUCATION

JOHNS HOPKINS UNIVERSITY Baltimore, MD

MS Information Systems (STEM) (Beta Gamma Sigma, Current GPA: 3.72/4, Top 15%) Aug 2019 – May 2020

TIANJIN UNIVERSITY OF FINANCE & ECONOMICS Tianjin, China

Bachelor of Accounting Information Systems (GPA: 3.84/4, Top 10%) Sep 2014 – Jun 2018

SKILLS

Programming: Python (Sklearn, TensorFlow, Pandas, NumPy), R, KDB, SQL, Matlab, Java, HTML/CSS, PHP

Distributed Computing: AWS (EC2, S3, EMR), Spark (Spark SQL, Spark ML), Kafka, Flink, Hadoop, Hive, Pig, Azure

Data Visualization: Tableau, Looker, Power BI, Shiny, Matplotlib, Seaborn, Excel (Pivot Table, VBA), Google Analytics

Statistics and Machine Learning: Classification, Regression, Clustering, A/B Testing

WORK EXPERIENCE

Global AI Data Engineering Intern Feb 2020 – Present

Set up and used a Google Cloud Platform environment for data processing and ETL; components included a virtual server, GPU driver, Hadoop and Spark cluster, and databases (MongoDB and SQL)

Built a data pipeline with Python, Spark, and Looker for data scraping, cleaning, feature engineering, backtesting, visualization, and ETL; processed data from multiple sources (Quandl, Bloomberg, Yahoo)

Created ML strategies based on Premium, Fund Flow, VIX, and Smart Beta, along with a dashboard in Tableau

Tencent America Data Analyst Intern Jun 2019 – Feb 2020

Wrote queries in Oracle to pull data; detected 300 fraudulent transactions by analyzing features; saved the company from a potential $200,000 loss

Built a pipeline in Python to apply machine learning algorithms, including Random Forest, Logistic Regression, SVM, and XGBoost, to classify fraudulent transactions

Created a dashboard in Tableau and Looker for visualization and analyzed the significant factors involved

FinTech4Good Data Analyst Intern Oct 2018 – Mar 2019

Designed and implemented text analytics and NLP methods to extract information from sources such as legal documents, helping evaluate D-App companies in the FinTech industry for incubation

Collected data with various structures using tools such as Databricks, Kafka, Apache Sqoop, and Scrapy

Implemented analytics and visualization methods and created a Power BI dashboard for alternative data on D-Apps, such as user activity, volume, and ratings

PROFESSIONAL PROJECTS

Fashion MNIST Data Classification Using TensorFlow April 2020

Designed a Convolutional Neural Network model in TensorFlow for image recognition and classification

Applied dropout regularization and data augmentation to reduce overfitting; increased model accuracy from 87% to 95%

Tag Prediction in Stack Overflow March 2020

Extracted, cleaned, and transformed data from Stack Overflow's HTML into a Spark DataFrame using BeautifulSoup and pandas

Created a pipeline in Spark that removed stop words, digits, and symbols and converted the text to features using TF-IDF

Applied classification algorithms (Logistic Regression, Naive Bayes, and Random Forest) to predict the tag of a posted question

Achieved the best test accuracy of 88% with Random Forest

Return Customer Prediction (Data Mining and Deep Learning) March 2020

Led a four-member team to predict whether customers will return to the shop, using a shopping mall dataset

Performed data cleaning and built predictive models (Logistic Regression, Random Forest, and Neural Network)

Evaluated models on validation accuracy and achieved the highest accuracy, 93%, with Random Forest

PBS Kids: Uncover Factors to Help Measure How Young Children Learn Oct 2019 – Dec 2019

Cleaned, in Python, a 12 GB raw dataset of user and access information from PBS Kids, a learning app for young children

Applied feature selection methods: Variance Selection, Chi-Square Test, and Recursive Feature Elimination

Built machine learning models to predict examination pass rates using Logistic Regression, Linear Discriminant Analysis (LDA), Random Forest, Bagging Trees, AdaBoost, XGBoost, Gradient Boosting Machine, and several ensemble methods (Stacking Classifier, Majority Voting, and Neural Network)

Achieved the best prediction accuracy of 90.3% with Bagging Trees

Sentiment Analysis of Amazon Alexa Aug 2019 – Sep 2019

Scraped 5 GB of Twitter posts about Amazon Alexa into well-organized CSV files

Removed irrelevant data and stop words; extracted keywords from the remaining text and scored their frequency and significance using NLTK and Jieba; visualized the keywords

Built a machine learning model using Logistic Regression and Naïve Bayes to perform sentiment analysis

Visualized positive and negative feedback in Tableau to help Amazon understand user experiences.
