
Data Engineer

Location:
Chicago, IL
Posted:
October 20, 2017


Resume:

Divya Vasireddy

Email: **********@****.***.*** | Mobile: 732-***-**** | https://linkedin.com/in/divyavasireddy/

EDUCATION

ILLINOIS INSTITUTE of TECHNOLOGY Chicago, IL

M.S., Data Science (GPA: 3.85/4) Dec 2017 (Expected)

Focus areas: Statistical Analysis, Time Series Analysis & Forecasting, Statistical Learning/Machine Learning, Advanced Data Mining, Advanced Database Organizations, Monte Carlo Methods, Online Social Network Analysis

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY Hyderabad, India

B.S., Computer Science (GPA: 3.66/4) May 2010

PROFESSIONAL EXPERIENCE

DISCOVERY HEALTH PARTNERS Chicago, IL

Data Science Intern May 2017- Aug 2017

Worked with a 3-person Data Science team to predict the number and amount of claims for each patient service by applying machine learning classification techniques

Developed a data preparation algorithm using Python and SQL to retrieve, aggregate, and vectorize data from 15 MySQL data warehouse tables (~250 GB of data)

Leveraged the out-of-core Stochastic Gradient Descent (SGD) classification technique in Python's scikit-learn package to handle the large data volume and iteratively aggregate results (see the sketch after this section)

Developed visualizations using the Matplotlib and seaborn packages in Python to present results to business teams

Tools used: PyCharm, GitHub, Python, MySQL
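The following is a minimal illustrative sketch (not the production code) of the out-of-core SGD approach described above: streaming a large MySQL table in chunks and training scikit-learn's SGDClassifier incrementally with partial_fit. The connection string, table name, and label column are hypothetical placeholders, and the features are assumed to already be numeric.

```python
# Illustrative sketch of out-of-core training with scikit-learn's SGDClassifier.
# The DSN, table name (claims_features), and label column (claim_class) are
# hypothetical placeholders; features are assumed to be numeric vectors.
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@host/warehouse")  # placeholder DSN
clf = SGDClassifier(random_state=42)
classes = [0, 1]  # partial_fit requires every class label declared up front

# Stream the table in chunks so the full ~250 GB never has to fit in memory.
for chunk in pd.read_sql("SELECT * FROM claims_features", engine, chunksize=100_000):
    y = chunk.pop("claim_class")
    clf.partial_fit(chunk.values, y.values, classes=classes)
```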

ADP (Fortune 500 cloud provider of HR, benefits, and tax solutions for 650,000 clients globally) Hyderabad, India

Data Engineer Aug 2014 - Aug 2015

Part of the DataCloud Innovation team responsible for developing applications that enable analytics on top of ADP payroll data; served clients in the Financial Services, Real Estate, and Insurance industries

Created Python scripts to process JSON files as Spark DataFrames, cleansing and preparing data into Hive tables (see the sketch after this section)

Developed custom-built Pig scripts implementing tooling for the data science team

Leveraged DevOps methodologies to package and migrate code from Dev to Test and Prod environments

Optimized Hive data transfer scripts for parallel processing and worked on building an efficient data pipeline for data delivery
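As a hedged illustration of the Spark-to-Hive preparation described above (not the actual ADP code), the sketch below reads JSON files into a Spark DataFrame, applies simple cleansing, and writes the result to a Hive table. Paths, database/table names, and column names are hypothetical placeholders.

```python
# Illustrative sketch: parse JSON with Spark DataFrames and persist cleansed
# records to a Hive table. All paths and names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("payroll-json-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Read raw JSON files into a DataFrame; Spark infers the schema.
raw = spark.read.json("/data/raw/payroll/*.json")

# Basic cleansing: drop duplicates, filter rows missing key fields,
# and normalize a date column.
clean = (raw.dropDuplicates(["employee_id", "pay_period"])
            .filter(F.col("gross_pay").isNotNull())
            .withColumn("pay_date", F.to_date("pay_date", "yyyy-MM-dd")))

# Persist the prepared data as a Hive table for downstream analytics.
clean.write.mode("overwrite").saveAsTable("analytics.payroll_clean")
```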

ADP Hyderabad, India

Big Data Developer Jun 2012 - Aug 2014

Developed Sqoop scripts to perform daily imports and exports of payroll data between an Oracle database and HDFS

Developed Hive jobs to transform payroll data loaded from the Oracle database into HDFS

Advised the project lead on key design decisions, including implementing dynamic partitioning and bucketing for data processing in Hive (see the sketch after this section)

Developed ETL jobs over the course of multiple product releases
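The sketch below shows, under assumed database, table, and column names, how a dynamic-partition insert can be issued through PySpark with Hive support; it is not the original ADP job. Bucketing is noted only in a comment, since in plain HiveQL it would be declared in the table DDL with CLUSTERED BY ... INTO N BUCKETS.

```python
# Illustrative sketch: Hive dynamic partitioning driven from PySpark.
# Database, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("payroll-dynamic-partitioning")
         .enableHiveSupport()
         .getOrCreate())

# Allow Hive to create partitions on the fly from the data being inserted.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Target table partitioned by pay period; in plain HiveQL the DDL could also
# add bucketing with: CLUSTERED BY (employee_id) INTO 32 BUCKETS.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.payroll_by_period (
        employee_id STRING,
        gross_pay   DOUBLE
    )
    PARTITIONED BY (pay_period STRING)
    STORED AS ORC
""")

# Dynamic-partition insert: the trailing SELECT column drives the partition value.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.payroll_by_period PARTITION (pay_period)
    SELECT employee_id, gross_pay, pay_period
    FROM staging.payroll_raw
""")
```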

ADP Hyderabad, India

Software Engineer Jun 2011 – Jun 2012

Developed Java applications for the ADP portal product used by clients across 110 countries; primarily contributed to business workflow management, messaging, and Identity & Access Management (IAM) applications

Developed Java applications using tools including Eclipse, MyEclipse, and Maven, across 3 consecutive Agile-based product releases

Developed complex SQL queries to retrieve data from Oracle and MySQL databases

Received ‘Certificate of Appreciation’ and ‘You Made a Difference’ awards for two consecutive years at ADP for training and onboarding new hires across US and India teams

ACADEMIC PROJECTS

Github link for projects: https://github.com/Vasireddydivya

Santander Bank Product Recommendation (Course: Advanced Data Mining)

Objective: Build a better recommendation system for targeted advertising to customers

Contribution: Imputed missing values by finding the distribution/frequency of each feature using Python. Blended train and test data sets based on customer ID and applied the XGBoost algorithm to predict product trends (see the sketch after this project)

Impact: Recommendation model achieved a 0.3+ MAP@7 score (mean average precision over the top 7 products each customer is predicted to choose)
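A minimal sketch of the ranking idea behind a MAP@7-oriented recommender: train a multiclass XGBoost model and take the 7 highest-probability products per customer. The synthetic data and hyperparameters are placeholders, not the actual competition pipeline.

```python
# Illustrative sketch: rank products by XGBoost class probabilities and
# recommend the top 7 per customer. Data here is synthetic placeholder data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_products = 10
X_train = rng.normal(size=(500, 20))         # stand-in customer features
y_train = rng.integers(0, n_products, 500)   # id of the product each customer added
X_test = rng.normal(size=(50, 20))

model = xgb.XGBClassifier(
    objective="multi:softprob",  # emit a probability per product class
    n_estimators=50,
    max_depth=4,
    learning_rate=0.1,
)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)               # shape: (n_customers, n_products)
top7 = np.argsort(proba, axis=1)[:, ::-1][:, :7]  # 7 highest-probability products each
```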

HR Resources Analytics (Course: Data Preparation and Analysis)

Objective: Identify the most valuable employees, predict churn among them, and make recommendations to retain them

Contribution: Used different feature selection techniques to select the best features and identified Random Forest as the best-fitting model in terms of accuracy, precision, recall, and ROC curve (see the sketch after this project). Created Tableau visualizations to present the analysis and results

Impact: Identified the top 33% most valuable employees among 15,000 employees with 99.7% accuracy
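As a rough sketch of the evaluation described above (using synthetic stand-in data rather than the HR dataset), a Random Forest churn classifier can be scored on accuracy, precision, recall, and ROC AUC as follows.

```python
# Illustrative sketch: Random Forest churn model evaluated on accuracy,
# precision, recall, and ROC AUC. The data is a synthetic placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # stand-in for engineered HR features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # churn flag

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("roc auc  :", roc_auc_score(y_te, proba))
```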

Performed Clustering on the 20 Newsgroups and Yelp Datasets (Course: Statistical Learning)

Objective: Perform K-means, LDA, and LSA clustering on the 20 Newsgroups and Yelp data sets (JSON files)

Contribution: Identified the best number of clusters using the NbClust function in R. Implemented LSA via singular value decomposition and LDA for each data set, then compared LSA and LDA performance on an accuracy metric (see the sketch after this project)

Impact: Identified 3 major groups for each of the data sets with 88% accuracy
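The project used R (NbClust) for choosing the cluster count; the sketch below shows an analogous Python pipeline on the 20 Newsgroups corpus: TF-IDF features, a truncated-SVD (LSA) projection, and K-means. The dimensionality and k=3 are placeholders echoing the result above, not tuned values.

```python
# Illustrative sketch: TF-IDF + truncated SVD (an LSA projection) + K-means
# on the 20 Newsgroups corpus. The number of components and clusters are
# placeholder choices, not tuned values.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

# TF-IDF term-document matrix, reduced to 100 latent dimensions via SVD (LSA).
tfidf = TfidfVectorizer(max_features=20000, stop_words="english")
lsa = make_pipeline(TruncatedSVD(n_components=100, random_state=42), Normalizer(copy=False))
X = lsa.fit_transform(tfidf.fit_transform(docs))

# K-means on the LSA space; k=3 mirrors the three major groups reported above.
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
```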

SKILLS AND CERTIFICATIONS

‘Machine Learning by Stanford University’ through Coursera (April 15th, 2017)

‘Neural Networks for Machine Learning by University of Toronto’ through Coursera (July 1st, 2017)

Software and programming languages: Core Java, Python (Keras with TensorFlow, scikit-learn, NumPy, SciPy, Matplotlib), Tableau, Microsoft Excel, SQL, MATLAB, PuTTY, GitHub, SVN

Statistical Tools: R, ggplot2, dplyr, reshape


