PRERAK SHAH
857-***-**** ****.****@*****.***.*** https://www.linkedin.com/in/sprerak48 Boston
EDUCATION
Northeastern University Boston, MA
Masters in Analytics with concentration in Statistical Modeling (GPA: 3.89/4) Sep 2019-Present
Dwarkadas J. Sanghvi College of Engineering, Mumbai University Mumbai, India
B.E. in Computer Engineering (GPA: 9.00/10) July 2016-May 2019
Shri Bhagubhai Mafatlal Polytechnic, Mumbai University Mumbai, India
Diploma in Computer Engineering (GPA: 8.89/10) July 2013-May 2016
PROFESSIONAL EXPERIENCE
April Innovations Mumbai, IN
Machine Learning Developer Nov 2018 – July 2019
• Participated in meetings, workflow design, and the design of an optimal data pipeline architecture
• Performed data wrangling on 10k images for training and testing using the pandas and NumPy libraries
• Applied a pre-trained YOLOv3 model to classify and detect nude images on the company website
• Created a CNN model to categorize and block nude images on the company website using Python libraries, hosted on an AWS EC2 server
• The model achieved 84-90% accuracy in predicting nude images
• Developed a dashboard using the Flask and Matplotlib libraries to track performance metrics and generate insights from NudeNet predictions
• This project helped increase the website's visibility in the SEO process
Money Control Mumbai, IN
Data Analyst Dec 2017 – Oct 2018
• Took part in client meetings and assembled large data sets that meet functional business requirements
• Scraped about 5k stockholder comments from websites and blogs using the Beautiful Soup library
• Applied data munging using pandas and NumPy, and structured the raw data in a SQLite database
• Performed sentiment analysis with NLTK and other NLP libraries using SVM and RNN models, increasing accuracy by 15% and supporting development of the Money Control chatbot
• The analysis was implemented in Dialogflow, and the sentiment analysis helped design the chatbot to answer end-users' basic stock-related questions based on their behavior
ACADEMIC PROJECTS
Title: Detecting 3 types of Spoofing in Trading with Citi Bank (MIT Fintech) Boston, MA
Technology: Python, Tableau Feb 2020-Present
• Performed data munging and feature normalization using the pandas and NumPy libraries
• Applied PCA for feature reduction and statistical insight
• Developed a logistic regression model using the scikit-learn library to predict spoofing in trading data of more than 1.4 million (14 lakh) entries
• Improved accuracy by 20% using XGBoost for classification and prediction
• Visualized spoofing by each end-user over a given time frame using a Tableau dashboard
Title: Coronavirus Data Visualization Boston, MA
Technology: Excel, Tableau Jan 2020-Feb 2020
• Gathered publicly available data from WHO and CDC using Tableau Server for 31 Jan’20 to 9 Feb’20
• Performed inferential statistics, using correlation, for the top 10 countries and top 10 provinces in China affected by coronavirus
• Calculated the fatality ratio and the recovered-to-confirmed ratio for mainland China
• Plotted the infected areas on a heat map
• Compared deaths from previous coronavirus outbreaks with those from the novel coronavirus and visualized the comparison using pie charts
Title: Churn Modelling Boston, MA
Technology: R, R Shiny Nov 2019-Dec 2019
• Performed data cleaning on churn-modelling datasets (10k records) using the dplyr and BSDA packages
• Created linear and logistic regression models to determine trends in customer behavior with credit-card services
• Analyzed the factors affecting customer exit from the credit service using clustering and regularization techniques in RStudio, with party, ggthemes, caret, corrplot, ggplot2, and mlr for visualization and inferential statistics
• Resolved overfitting using an SVM model, which helped achieve 83-88% accuracy
• Predicted customer exit from the credit-card service using randomForest, XGBoost, and gbm
Title: Autonomous Tagging of Text Data using Machine Learning Mumbai, IN
Technology: Python, Google Colab and AWS July 2017-May 2019
• Scraped the Stack Overflow discussion board using urllib and stored the data in buckets on an AWS EC2 server
• Cleaned and loaded data from a variety of sources (Quora, Stack Overflow) using SQL queries
• Removed stop words and lemmatized the text data using the NLTK library
• Assigned weights to each word in the text using a TF-IDF vectorizer for feature extraction
• Developed an LSTM model using Keras to determine the most-used feature tags and group similar discussions
• The model achieved 82-88% accuracy in categorizing similar discussions based on tags and feature vectors