Rahul Dhakecha ******@****.*****.***
**** ****** **., ***, ************, PA 19139 551-***-****
EDUCATION
University of Pennsylvania, Philadelphia May, 2018
MS - Data Science (Dept of Computer & Information Science); CGPA: 3.56/4.0
Spring 2018 courses: Big Data Analytics
Fall 2017 courses: Software Systems, Elements of Probability Theory
Spring 2017 courses: Mathematical Statistics, Database and Information Systems, Convex Optimization
Fall 2016 courses: Machine Learning, Modern Data Mining, Engineering Economics
Sardar Vallabhbhai National Institute of Technology, India 2015
Bachelor of Technology in Electrical Engineering; CGPA: 8.41/10.00
SKILLS
Languages: Python (NumPy, SciPy, scikit-learn), R, C++, SQL, NoSQL, MATLAB, CVX, LaTeX
Databases: MySQL, MongoDB
Web Languages: HTML, NodeJS
Statistics: Exploratory Data Analysis, Hypothesis Testing, Parameter Estimation, ANOVA, Parametric & Non-Parametric Tests
Data Technologies: Hadoop, Spark (PySpark), TensorFlow, MapReduce
Miscellaneous: Amazon Web Services, Docker
WORK EXPERIENCE
Data Scientist Intern, Sprint, Kansas Summer 2017
● Worked on text analytics in the Customer Experience department; developed SQL queries to fetch text from Teradata
● Cleaned, analyzed and processed data from various surveys; applied topic modelling using Latent Dirichlet Allocation in R
● Developed a Naive Bayes model for 2-level classification of text reviews in Python
● Successfully integrated new algorithm with existing rule-based classification; updated Medallia keywords
Research Assistant, Dept of Computer Science, University of Pennsylvania (MySQL, Python, Spark) Aug 2017-Present
● Multi-threaded scraping of the Goodreads website to fetch book reviews, ratings, etc. using Selenium and BeautifulSoup
● Built a data pipeline (ETL and data wrangling), modelled the Goodreads database & performed preliminary EDA
● Successfully framed and tested hypotheses; applied PCA to extract important groups of books preferred by users
DATA SCIENCE/MACHINE LEARNING PROJECTS
Influence Maximization in Social Networks (Independent Study with Prof Hamed Hassani; Python, Spark) Fall, 2017
● Successfully deployed BFS and PageRank algorithms in a distributed setting on social graph data of Stack Overflow
● Developed independent cascade model for modelling social network; tested network model on synthetic Kronecker graph
● Deployed Python code for modelling & learning temporal large-scale networks with the NETINF algorithm on MemeTracker data
Music Recommendation System (team of 4; SQL, Python, NoSQL) March-April, 2017
● Developed relational and nonrelational database instances on AWS using MySQL & MongoDB from multiple datasets
● Developed NodeJS framework to integrate database with front end application, developed using HTML and AngularJS
● Built efficient SQL queries to recommend music to a user based on overlapping music of other users
Predicting readmission probability for diabetes inpatients (R) October, 2016
● Performed EDA and cleaning of dataset; mitigated nonlinearity and heteroscedasticity by concave function transformations
● Used LASSO and Elastic Net to determine important predictors; developed multiple linear regression model
SOFTWARE DEVELOPMENT PROJECTS
Penn Cloud - distributed mailing system (C++, team of 4) September, 2017
● Developed a cloud system with scalable and fault-tolerant key-value store with efficient replication
● System supports mail services and storage service with features like uploading, downloading and sharing large files
● Robust communication achieved via gRPC between central controller, front end and storage nodes
Distributed Chat System (C++) September, 2017
● Developed fully distributed, scalable client & server systems using UDP; multicast incorporated to deliver messages
● System supports various chat rooms along with unordered, FIFO and totally ordered multicast
DATA SCIENCE CASE STUDIES
Billion Dollar Billy Beane: Developed simple linear regression model to predict average salary of baseball player; performed model selection using forward, backward and all-subsets methods with Cp and BIC criteria
Fuel Efficiency in Automobiles: Created and bootstrapped multiple linear regression model; accounted for categorical variables
Framingham Heart Study: Classified positive and negative patients for heart disease using logistic regression and Random Forests; results compared with Linear Discriminant Analysis and Naive Bayes