Shreyas Kulkarni
+1-716-***-**** *** Englewood Ave, Buffalo, NY 14214
*****@*******.*** https://www.linkedin.com/in/shreyaskulkarni20/ https://github.com/Shreyas20
EDUCATION
University at Buffalo, State University of New York Aug 2017-Feb 2019
Master of Science (M.S.) in Data Science, GPA: 3.45
Pune Institute of Computer Technology, Pune July 2013-May 2017
Bachelor of Engineering (B.E.) in Computer Science, graduated with first class
RELEVANT WORK EXPERIENCE
1) Froot Research, Pune, India Data Science Research Intern June 2018-Aug 2018
- Technologies and languages: Python, Apache Spark, association rule mining, MySQL, Pandas, NumPy, WARMR, ILP
- Developed an algorithm to bucketize integer data based on each column's distribution, for use in generating frequent patterns.
- Researched various methods and efficiently generated frequent patterns for a multi-relational database with virtual joins using PySpark.
- Detected anomalies in the generated association rules to check whether timestamps have any effect on the rules.
2) GS Lab, Pune, India Project Intern July 2016-June 2017
- Technologies and languages: RabbitMQ, MongoDB, Spark, Elasticsearch, Kibana, J48, Java, Python, JS, HTML, Weka
- Developed a generic system that can be integrated into the backend of any platform for analysis and recommendations.
- Analyzed user data using various visualizations to help vendors customize their systems.
- Enhanced user experience by predicting results from the attributes users requested, with 80% accuracy.
SKILLS
Competences: Python (Pandas, NumPy, Matplotlib, TensorFlow, scikit-learn, Keras, MLlib), R (caret, data.table, ggplot2, dplyr, xgboost, apriori, glmnet, bootstrap, twitteR, fiftystater), MySQL, Java, MATLAB, Statistics, Probability, Scala, C, MongoDB, Oracle SQL, C++, HTML
Tools: Apache Spark, Hadoop, RStudio, Elasticsearch, Kibana, Weka, Android, RabbitMQ, Selenium, Eclipse
SELECTED PROJECTS
1) Consensus-based distributed neural network Java, Weka, PeerSim
- Working on an independent research project with Prof. Dutta in which a consensus-based neural network model is built in Java using Weka.
- Reduced computation time to almost 1% of that of a centralized perceptron, with comparable accuracy, using a gossip-based distributed model built with the PeerSim package. The model performed consistently better on all types of datasets across a variety of network topologies.
2) Santander product recommendation R, Association Rules, SVD, Hierarchical clustering, K-means, ggplot2
- Analyzed user data by gender, nationality, age, and profession, and drew conclusions to inform recommendations.
- Performed low-rank approximation using SVD on training data of 30M users and recommended account types for 3M rows of test data based on users' personal details. Also tried clustering the data with different algorithms, which failed due to disparity in the data.
- Generated association rules between user details and account details using apriori in R to produce recommendations.
3) Article classification using Spark Python, NYTimesArticleAPI, RDD, Apache Spark, MLlib, TF-IDF
- Scraped NYT articles published within one month on various topics, using the NYTimes developer API with the BeautifulSoup library in Python.
- Pre-processed the data using PySpark MLlib modules and tokenized the articles to get scaled TF-IDF feature vectors for all of them.
- Applied classification algorithms such as logistic regression, random forest, and Naive Bayes using PySpark, reaching almost 80% accuracy with each method.
4) Word count and word co-occurrence using MapReduce Python, ArticleAPI, Tweepy, Pandas, D3.js, MRJob, Hadoop
- For the topic 'shooting', collected Twitter data and NYT articles using the respective APIs along with the BeautifulSoup library in Python.
- Removed stop words, symbols, and digits in the mapper and computed word counts for both datasets in the reducer, using the MRJob library.
- Computed co-occurrences of the top 10 words in a separate reducer and visualized the results for both datasets as word clouds in D3.js.
5) Data Scientists Analysis R, regsubsets, glmnet, rf, ggplot2
- Separated the survey respondents into groups such as workers, students, and career switchers based on various parameters.
- Assessed supervised methods such as OLS, subset selection, shrinkage methods, and random forest in R to predict salaries, using workers' responses as the training set. Achieved 85% accuracy with random forest, concluding that it is the best method for the categorical variables.
- Analyzed and highlighted trends in the field of data science from these responses using ggplot2 visualizations.
6) Twitter Data analysis R, twitteR, fiftystater, ggplot2
- Collected flu-related tweets by hashtag in R and plotted them by location on a state-level heat map.
- Compared it with a heat map of actual flu levels in the states and drew conclusions about the relationship between tweets and actual flu.
7) NYC Parking Violation Analysis Python, NumPy, Pandas, Matplotlib, MySQL, PyMySQL
- Analyzed 4M rows of NYC parking data stored in a MySQL database and drew conclusions from the observed patterns using visuals created with the Matplotlib library in Python. Storing the entire dataset in MySQL gave higher operational speed than Pandas.
- Used PyMySQL to operate on the database from Python, with Pandas data frames and NumPy for handling data efficiently.
ADDITIONAL ACTIVITIES
Created a pip-installable Python package for the bucketing algorithm under GNU GPL v3.0, used to convert an integer array into categorical buckets.
Passed the Oracle Database Certified Associate exam with a score of 77%.
Published the paper ‘Generic User Event Activity Analysis and Prediction’ in the IRJET journal.