Data Scientist

Location:

Hyderabad, Telangana, India

Salary:

$100000/year

Posted:

September 01, 2020

Resume:

RIYAZ SYED

Email: *****.***********@*****.*** Phone:+1-672-***-****

Summary:

Data Scientist with 5 years of professional experience in Statistical Modelling, Data Mining, Data Exploration and Data Visualization of structured and unstructured datasets, and in implementing Machine Learning and Deep Learning models grounded in business understanding to deliver insights that drive key business decisions.

Experience working in various domains such as E-commerce, Retail and Finance, covering the entire lifecycle of a data science project: Data Extraction, Data Pre-Processing, Feature Engineering, Dimensionality Reduction, Algorithm Implementation, Back-Testing and Validation.

Experience in data wrangling and loading on Big Data platforms such as Apache Spark, and in working efficiently with SQL Server: extracting data from several sources, transforming it in transit and loading it into the relevant platform for analysis.

Expertise in using Machine Learning techniques in Python and R (RStudio): regression models such as Linear, Polynomial and Support Vector; classification models such as Logistic Regression, Decision Trees, Support Vector Machine and K-NN (K Nearest Neighbor); and clustering methods such as K-means.

Experienced in building Recommendation engines using Association Rules, Collaborative Filtering and Segmentation.

Expertise in advanced Ensemble Techniques such as Stacking, Blending, Bagging and Boosting, and in models such as Random Forest, Gradient Boosting and Extreme Gradient Boosting.

Knowledge of ML frameworks such as TensorFlow, Caffe, PyTorch and Theano, and expertise in coding environments such as Spyder, Jupyter Notebook and RStudio (available through Anaconda Navigator) as well as Google Colab.

Experience in text mining and topic modelling using NLP and Neural Networks, including tokenizing, stemming, lemmatizing and part-of-speech tagging with TextBlob, Natural Language Toolkit (NLTK) and spaCy while building Sentiment Analysis.

Knowledge of AI and Deep Learning techniques such as Convolutional Neural Network (CNN) for Computer Vision, Recurrent Neural Network (RNN) and Deep Neural Network, with applications of Backpropagation, Stochastic Gradient Descent (SGD), Long Short-Term Memory (LSTM), Continuous Bag of Words (CBOW) and Text Analytics.

Proficient in using PostgreSQL, Microsoft SQL Server and MySQL to extract data using multiple types of SQL queries including Create Table, Join, Conditionals, Drop and Case.

Hands-on experience with Apache Hive and Apache Spark using Python for Big Data.

Skilled in creating executive Tableau dashboards for data visualization and deploying them to servers.

Proficient in Data Visualization tools such as Tableau and PowerBI, Big Data tools such as Hadoop HDFS, Spark (PySpark), MapReduce.

Experience using Matplotlib and Seaborn in Python for visualization and Pandas in Python for performing exploratory data analysis.

Experience in Web Data Mining using Python's NLTK, Scrapy and BeautifulSoup packages and REST APIs, along with working knowledge of Natural Language Processing (NLP) to analyze text patterns.

Experience with Python libraries including NumPy, Pandas, SciPy, Scikit-learn, NLTK and spaCy.

Knowledge of Time-Series Analysis using ARIMA, Multiplicative and Additive Decomposition, STUMPY and Exponential Smoothing to find seasonal or cyclical patterns in the data.

Experience deploying machine learning models into production for teams.

Excellent communication skills and experience in daily scrum meetings with cross-functional teams.

Analytical Techniques:

Hypothesis testing: one-way and two-way factorial ANOVA, t-tests, Chi-Square goodness-of-fit test.

Regression Methods: Linear, Polynomial, Decision Trees, Support Vector.

Classification: Logistic Regression, K-NN, Decision Trees, Naïve Bayes, Support Vector Machines (SVM); Clustering: K-means clustering, Hierarchical clustering.

Deep Learning: Artificial Neural Networks, Computer Vision (Convolutional Neural Networks); PyTorch, TensorFlow.

Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)

Recommendation engine: Association Rule learning Market Basket Analysis, Collaborative filtering, Segmentations.

Text Analytics/Natural Language Processing: Stemming, NLTK, spaCy, TF-IDF, Word2Vec, Doc2Vec, Topic Modelling.

Data Visualization Tools: Tableau, ggplot, plotly, PowerBI, Python matplotlib, seaborn, IBM Analytics.

Languages: Python (Jupyter Notebook, Spyder, Google Colab), R (Shiny, Statistical Analysis) on RStudio, PostgreSQL, MySQL.

Database Systems: SQL Server, MySQL, Teradata, NoSQL (MongoDB, HBase, Cassandra), AWS (DynamoDB, ElastiCache).

Big Data Analysis: Apache Spark (using PySpark), AWS (S3, EC2).

Cloud Services: Google Cloud, AWS (S3, EC2, SageMaker).

Data Scientist:

Freshworks Solutions: January 2019 - Present

Vancouver, BC

Project Description: The objective of the project was to develop sentiment analysis to understand customers' sentiment towards the products and services that the company provides. The company wanted to understand how effectively its products meet diverse customer requirements and the complexity of customer choices. Based on user reactions, the company would decide whether to move further with production.

Responsibilities:

Developed Sentiment Analysis using Machine Learning and NLP by training on historical data provided by the organization to understand the sentiment of end users.

Performed Data Collection, Data Cleaning, Data Visualization and Text Feature Extraction, and derived key statistical findings to develop business strategies.

Employed NLP to classify text within the dataset. Categorization involved labeling natural language texts with relevant categories from a predefined set.

Used text to understand user sentiment over time. Data was gathered from various sources such as the company's official website, Twitter, Facebook and Quora.

Carried out various text pre-processing phases such as Tokenizing, Stemming and Lemmatization, Stop Word removal, Vocabulary Phrase Matching and POS Tagging using the NLTK and spaCy libraries in Python, converting the raw text to structured data.

Created a sparse dataset using Count Vectorizer, Document Term Matrix and TF-IDF Vectorization, assigning IDs to each word and checking the frequency of words in the corpus.

Constructed a new vocabulary to encode the variables as numerical arrays in a machine-readable format using Bag of Words and TF-IDF.

Trained a model in Python to predict sentiment from Word2Vec word embeddings of the reviews.

Built classification models based on Logistic Regression, Decision Trees and Random Forest to classify the texts by label.

Applied VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool available in NLTK, to classify the sentiment of unlabeled text.

Used the polarity scores produced by VADER to classify texts as negative, neutral or positive based on the normalized compound score.

Closely monitored the performance of the model using the confusion matrix, classification report, Accuracy, Recall, Precision and F1-Score. Additionally performed A/B testing to see which of the proposed solutions worked better.

The NLP text analysis monitored, tracked and classified online user discussion about products and services. (Scrapy)
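
The compound-score step described above can be sketched roughly as follows. This is an illustration only, not the project's code: the normalization constant (alpha = 15) and the cut-offs of +0.05 and -0.05 are the values commonly documented for VADER and are assumptions here, and the function names are hypothetical. In an NLTK-based pipeline the compound score itself would typically come from SentimentIntensityAnalyzer().polarity_scores(text)["compound"].

```python
import math


def normalize_compound(raw_score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded summed valence score into the open range (-1, 1).

    alpha=15 is an assumed default taken from common VADER documentation.
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a normalized compound score to negative / neutral / positive.

    The +/-0.05 threshold is an assumption, not a value from the project.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical raw valence sums, for illustration only.
    for raw in (-4.0, 0.1, 3.0):
        compound = normalize_compound(raw)
        print(f"{raw:+.1f} -> {compound:+.3f} -> {label_sentiment(compound)}")
```

Keeping the threshold as a parameter makes it easy to widen or narrow the neutral band when validating against labeled reviews.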

Data Scientist:

Pkc Solutions: October 2017 - January 2019

Toronto, ON

Project Description:

The project was to build a machine learning model to predict customers' credit defaults. The aim was to have a proper fraud detection system in place to identify errors, waste and inefficiencies, thereby increasing profitability and reducing losses.

Responsibilities:

Participated in all phases of project life cycle including data mining, data cleaning, Data Exploration and developing models, validation and creating reports.

Performed data cleaning on a huge dataset which had missing data and extreme outliers, and explored the data to draw relationships and correlations between variables.

Performed data-preprocessing on messy data including imputation, normalization, scaling, and feature engineering using Pandas.

Utilized random under-sampling to create a training dataset with a balanced class distribution.

Conducted exploratory data analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlations between features.

Used Linear Discriminant Analysis (LDA) as a dimensionality reduction technique in the pre-processing step for pattern classification and machine learning models.

Used t-SNE to project the high-dimensional distributions into lower-dimensional visualizations.

Built classification models based on KNN, Random Forest and XGBoost Classifier to predict loan default.

Used a Neural Network as a classification model, built with Keras on a TensorFlow backend and trained on Google Colab GPUs.

Used various metrics such as F-Score, ROC and AUC to evaluate the performance of each model and Cross Validation to test the models with different batches of data to optimize the models.

Applied comprehensive tuning of the Random Forest algorithm and found the specific factors that are most important for detecting fraudulent transactions.

Implemented and tested the model and collaborated with development team to get the best algorithms and parameters.
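
As an illustration of the random under-sampling step listed above, here is a minimal standard-library sketch. It assumes every class is cut down to the size of the smallest class; the function name random_undersample and the toy data are hypothetical and do not come from the project.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Return a class-balanced subset by randomly dropping majority-class rows.

    Assumes at least one row; every class is reduced to the minority size.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # The smallest class sets the per-class sample size.
    n_min = min(len(group) for group in by_class.values())

    sampled = []
    for label, group in by_class.items():
        for row in rng.sample(group, n_min):
            sampled.append((row, label))

    # Shuffle so the classes are not stored in contiguous blocks.
    rng.shuffle(sampled)
    return [row for row, _ in sampled], [label for _, label in sampled]


if __name__ == "__main__":
    # Hypothetical toy data: 2 defaults (1) against 8 non-defaults (0).
    toy_rows = list(range(10))
    toy_labels = [1, 1] + [0] * 8
    x_bal, y_bal = random_undersample(toy_rows, toy_labels)
    print(len(x_bal), y_bal.count(0), y_bal.count(1))
```

Under-sampling should be applied to the training split only, so that validation metrics such as ROC AUC still reflect the original class distribution.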

Data Scientist:

National Bank: July 2016 - September 2017

Montreal, Quebec

Project Description: The project involved segmenting customers into Low Value, Mid Value and High Value groups in order to boost customer retention rates and target specific audiences with a cyclic promotions strategy.

Low Value: Customers who are less active than others, are not very frequent buyers/visitors and generate very low, zero or possibly negative revenue.

Mid Value: In the middle of everything. They use our platform often (but not as much as our High Value customers), are fairly frequent and generate moderate revenue.

High Value: The group we don't want to lose. High revenue and frequency, and low inactivity.

Responsibilities:

Performed Data Collection, Data Cleaning, Data Visualization and Feature Engineering using Python libraries such as Pandas, NumPy, Matplotlib and Seaborn.

Used Recency, Frequency and Monetary Value (RFM) as the segmentation methodology.

Applied the Elbow method to select the optimum number of clusters for the K-means algorithm.

Calculated recency from the most recent purchase date of each customer and the number of days they had been inactive, then applied K-means clustering to assign each customer a recency score.

Calculated frequency using the total number of orders for each customer, then applied K-means clustering to assign a frequency score.

Calculated revenue using the total number of orders for each customer, then applied K-means clustering to assign a revenue score.

Assigned an overall score using the scores (cluster numbers) for recency, frequency and revenue.

Named these scores: 0 to 2: Low Value, 3 to 4: Mid Value, 5+: High Value.

Optimized SQL queries for transforming raw data into MySQL with Informatica to prepare structured data for machine learning.

Used Tableau for data visualization and interactive statistical analysis. The same was used for presenting preliminary reports to the stakeholders.

Worked with Business Analysts to understand the user requirements, layout, and look of the interactive dashboard.

Studied the behavior of new customers and existing customers by performing EDA. Customers were segmented based on their RFM scores by k-means clustering.

The lifetime values were classified based on the RFM model using an XGBoost classifier.
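
The scoring scheme above can be sketched as follows. Two assumptions are made that the description does not state: the overall score is taken to be the plain sum of the three cluster scores, and cluster numbers are assumed to be ordered so that a higher number is better. The function names and dates are hypothetical.

```python
from datetime import date


def days_inactive(last_purchase: date, as_of: date) -> int:
    """Recency input: days elapsed since the customer's most recent purchase."""
    return (as_of - last_purchase).days


def overall_score(recency: int, frequency: int, revenue: int) -> int:
    """Combine the three cluster scores (assumed here to be a simple sum)."""
    return recency + frequency + revenue


def segment(score: int) -> str:
    """Map the overall score to a segment: 0-2 Low, 3-4 Mid, 5+ High."""
    if score <= 2:
        return "Low Value"
    if score <= 4:
        return "Mid Value"
    return "High Value"


if __name__ == "__main__":
    # Hypothetical customer: last order 40 days before the snapshot date.
    print(days_inactive(date(2017, 1, 1), date(2017, 2, 10)))
    print(segment(overall_score(recency=2, frequency=1, revenue=1)))
```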

Data Analyst:

Tcs Solutions (India): January 2014 - June 2014

Project Description: The objective of the project was to develop a machine learning model to predict the sales of fashion products to be sold in the next season. The model combined information about future trends found on web pages with data from the previous year's sales history. To create this model we used text mining and data mining techniques.

Responsibilities:

The implementation of the project went through several phases namely: data set analysis, preprocessing data set, user-generated data extraction and modeling.

A user-generated data extraction program was developed to extract potentially useful information from web pages.

Performed Data Cleaning, Data Visualization, Information retrieval, Feature Engineering using Python libraries such as Pandas, NumPy, Sklearn, Matplotlib and Seaborn.

Engineered features from raw data by applying imputation, normalization and scaling as required on the data frame. Converted categorical variables to numerical values using Label Encoder for EDA and for readability by the machine learning models.

Performed univariate, bivariate and multivariate analysis to check how the features were related to each other and to the risk factor.

Applied PCA to reduce the correlation between features and high dimensionality of the standardized data so that maximum variance is preserved along with relevant features.

Built regression models based on Decision Trees, Support Vector Machine and Random Forest to predict the different risk levels of applicants, and used Grid Search to improve accuracy on the cleaned data.

Evaluated the model's performance using metrics such as the coefficient of determination (R²) and RMSE, and used Cross Validation to test the models with different batches of data to optimize them.
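
For reference, the two regression metrics named above can be computed as in the sketch below. It uses the textbook definitions in plain Python; the function names and sample values are hypothetical, and a library such as scikit-learn would normally be used in practice.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all observed values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical observed and predicted values, for illustration only.
    observed = [120.0, 80.0, 95.0, 150.0]
    predicted = [110.0, 90.0, 100.0, 140.0]
    print(rmse(observed, predicted), r2_score(observed, predicted))
```

A model that always predicts the mean of the observed values scores an R² of 0, which gives a useful baseline when comparing models across cross-validation folds.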

Education

●Bachelor of Science In Computers (JNTUH INDIA) …2013

●Masters in Computers 2016
