Data Manager

Location:

Glendale, CA

Posted:

April 03, 2020

Resume:

Zhangqian (Eric) Ouyang

Tel: 571-***-**** Email: *********@******.***

Address: **** ******* ***, ********, **. 91026

Summary

Seeking a full-time position in data science, machine learning, or deep learning.

2 years of hands-on experience developing classical machine learning, data science, CNN, and RNN models.

Master of Science with a strong analytical and theoretical background.

Solid programming and algorithm skills.

A passionate and enthusiastic learner, eager to conquer the unknown.

EDUCATION

University of Southern California, Jan. 2018 – Dec. 2019

Master of Science in Electrical Engineering, GPA: 3.33/4.0

South China Agricultural University, Sep. 2013 – June 2017

Bachelor of Science in Optoelectronic Information Science and Engineering, GPA: 3.44/4.0

Related Courses: Mathematical Pattern Recognition, Foundation of Artificial Intelligence, Machine Learning from Signal, Algorithm Analysis, Natural Language Processing, Parallel and Distributed Computation, Probability in Engineering, C Programming, Digital Image Processing

PROFESSIONAL SKILLS

Programming Languages: Python (proficient), SQL, MATLAB, C/C++, SparkSQL, CUDA/GPU Programming

Platforms: Linux, Docker, command line, Google Colab, Jupyter Notebook, HPC

Databases: MongoDB, MySQL, Elasticsearch, PostgreSQL

Frameworks & Tools: Spark, Hive, Hadoop, TensorFlow, MXNet, PyTorch, PaddlePaddle, Git, Kibana, Azkaban, Tableau, AWS

Python Libraries: PySpark, Scikit-learn, pandas, NumPy, SciPy, Scrapy, OpenCV, matplotlib, seaborn, OpenMP, MPI

Deep Learning Models: RNN, Transformer, LSTM, CNN, RCNN, Attention model, Residual model, Q-Learning

Machine Learning Models: Classification, Regression, Clustering, Bagging and Boosting methods, Hidden Markov Model, Hypothesis Testing

WORK & INTERN

Image Data Technology Co., Ltd, June 2019 – Aug. 2019

Algorithm Engineer Intern, Guangzhou, China

Image Data Technology is an AI business company focusing on retail and FMCG industries.

As a member of the AI lab, researched the Residual Attention Network model family using MXNet. Worked on Linux and Docker platforms.

Conducted bad-case feature analysis across the loss function, attention mechanism, and residual learning.

Improved the network on large .rec data (150 GB); fine-tuned and deployed a pre-trained DenseNet-201 using Caffe on Kubernetes.

Provided shelf data analysis, including distribution rate and OOR rate, for selling strategies, revenue prediction, and KPI calculation.

PROJECTS

Spark Big Data Analysis and Dispatch: PySpark Air Quality Analysis, Dec. 2019 – Jan. 2020

Performed data ETL. Developed a UDF to modify the DataFrame and wrote the graded data into SparkSQL.

Used SparkSQL to write the modified statistical results into Elasticsearch. Deployed Kibana and completed the data visualization.

Packaged the job as a .zip file and uploaded it to Azkaban. Implemented automatic job dispatching with Azkaban and completed the big data analysis.

Designed A/B tests on medium for significant differences among medium grades. Observed a right-skewed distribution.

Analyzed air quality with respect to seasonality, hour of day, and other features using SQL.

LSTM POS Tagging using TensorFlow, Mar. 2019 – May 2020

Performed sentiment analysis using a multilayer perceptron with batch normalization. Completed word2vec embeddings and visualized them with t-SNE.

Connected the above embedding module to a POS-tagging RNN. Fine-tuned the embedding module as a pre-trained shallow net.

Applied a Bi-LSTM with a self-attention mechanism and compared it with multi-layer LSTM and multi-layer RNN models.

Rewrote the dropout function and achieved a 10 times effect with sparse dropout. Selected seq2seq loss with the Adam optimizer.

The Bi-LSTM achieved 93% precision on Japanese, 95% on Italian, and 89% on Arabic.

Classification on Cardiotocography, Oct. 2018 – Dec. 2018

Biological data: 2126 instances, 21 medical features, 2 labels. Conducted data snooping and found an extreme class imbalance, with a ratio of up to 20.

Conducted data cleansing and preprocessing. Used RFECV and chi2 for feature selection and maintained linearity among features.

Performed k-means clustering to analyze data distribution and feature selection effects

Trained Random Forest and AdaBoost, using L1 regularization for model selection.

Adopted SMOTE for class imbalance, realizing a 25% improvement on the rarest class and 94% precision compared to the original 89%.

Developed linear SVM and logistic regression hierarchical classifiers for multi-label classification. Chose SVM and achieved 96% precision.

Multi-class and Multi-Label Classification: Anuran Call, June 2018 – July 2018

Biological data: 7195 instances, 22 Mel-frequency cepstral features, 4 families, 8 genera, and 10 species. Used cepstral data to categorize frogs.

Calculated p-values for feature selection and created an SVM for each label using RBF and linear kernels.

Tuned Support Vector Machine with GridSearchCV and trained with Elastic-Net.

Developed a classifier chain method with logistic regression and achieved 96% precision. Applied A/B testing for correlation between labels.

Time Series Classification: Activity Recognition, May 2018 – June 2018

Sensor data: 42240 instances, 6 body position signal features, 7 activity classes. Collected sensor data and predicted activity.

Completed data snooping using a scatter plot matrix, along with data cleansing and preprocessing. Applied chi2 for time-domain feature selection.

Performed 5-fold RFECV to select a splitting number. Picked 6 for feature augmentation to balance the trade-off between training and testing precision.

Compared the confusion matrix and ROC-AUC between the original data and the augmented data. The model achieved 94% precision on activity recognition.

Python Web Crawling on Movie, Feb. 2018 – March 2018

Deployed MongoDB on Linux. Wrote a spider file and scraped data from https://movie.douban.com/top250

Wrote the spider results to a .json file and saved them into MongoDB. Used NoSQL Manager to manage MongoDB.
