Data Python

Location:

Edison, NJ

Posted:

August 22, 2020

Resume:

SAI CHARAN REDDY OBILIACHIGARI

551-***-**** *****************@*****.***

LinkedIn Profile, GitHub Repository

Summary:

I have strong analytical skills and 5 years of professional experience in Data Science and Data Engineering, including manager-level experience, spanning research and the building and deployment of AI models in Python, Groovy, PySpark, Hive, AWS services, SQL and NoSQL databases, data mining, machine learning, neural networks (deep learning), customer analytics, financial modeling, data visualization, A/B testing, and statistics, with rich domain knowledge in the finance and healthcare sectors.

** National Level Award Winner (Best Research Paper) ** ABR, New Orleans, LA, USA.

Customer Enhancement using Topic Modeling (NLP)

Python, Web Scraping, Text Cleaning, Text Extraction from Web and Images (OCR), Feature Engineering

Customizing the Data using Topic Modelling as a Feature Generator, H2O AutoML for Computations.

Text Extraction by building the customized NLP models from scratch.

o POS Extraction, Chunking – Chinking Extractions

o Stanford CoreNLP models for text Extractions and Regex Extractions

Technical Skills:

• Programming Languages: R, Python, SQL, PySpark.

• Big Data Technologies: PySpark, Hadoop, Google Cloud, Informatica, Azure.

• Tools: Tableau, Advanced Excel, MATLAB, BI Reporting.

• Databases: MySQL, HDFS, Neo4j, AWS DynamoDB, AWS RDS.

• AWS Services: EC2, S3, DynamoDB, RDS, Auto Scaling, Kafka, CloudWatch, ELB, SageMaker, Lambda.

• Other Tools: Docker for containerizing code.

Technology Experience:

Python, R, Informatica: 6 years

SQL, Oracle, Excel, Tableau, Power BI, Hadoop, PySpark: 6 years

Machine Learning, Deep learning, NLP: 5 Years

AWS (EC2, S3 Buckets, Dynamo DB, RDS, Lambda), SAS, Groovy: 4 years

Machine Learning Techniques: (NumPy, Pandas, Scikit-learn, SVM, Linear & Logistic Regression, Random Forests, Decision Trees, Nearest Neighbors, Apriori, K-Means, DBSCAN clustering, XGBoost, LGBMRegressor)

NLP Techniques: (Tokenization, Bag of Words, TF-IDF, Word Embeddings, Word2Vec, Regular Expressions, Stemming, Lemmatization, LDA, NMF, Naïve Bayes, Latent Analysis, NLTK, Gensim), OCR, RPA.

Deep Learning Techniques: (CNN, R-CNN, LSTMs, GRUs, Stacked AutoEncoders, GANs), OpenCV, OCR, OpenAI Gym for Reinforcement Learning (TensorFlow + Keras, PyTorch, Theano backend)

Professional Work Experience:

Lead Data Scientist SimplySpeak (Hospital & Healthcare) New York Jan 2020 – July 2020

Designed and built an AI algorithm that reduces transcription work for doctors and improves accuracy on doctor-patient conversations through accurate automated categorization and entity extraction from natural conversation.

Extracted text using entity recognition techniques, built our own NER model, used Stanford CoreNLP for information retrieval, and performed text classification, topic modeling, and document similarity using cosine similarity.

•Using Python and PySpark, analyzed large datasets and extracted insights.

•Ran all computation in a PySpark environment, with Colab GPUs for quick analysis of the large data and for model building.

•Extracted data from EHRs and performed in-depth analysis of EHRs for insights.

•Created a Secured Automated ETL on Medical data using Informatica PowerCenter.

•Stored all the data in a graph database for semantic queries over nodes and edges.

•Customized entity extraction from patient-doctor conversations, using IBM Watson generated data.

•Integrated ML and Deep Learning (DL) with Natural Language Processing (NLP), including:

•Topic Clustering (K-Means, Hierarchical): finding the optimal number of topics and clusters with optimization algorithms, and building classification models using the techniques below.

•Recurrent Neural Networks (single and Bi-Directional LSTMs, Deep Bi-Directional LSTMs (DBLSTMs), GRUs) built into Auto-Encoders, Transformers (BERT, ALBERT, ULMFiT), Word Vectors, Word Embeddings.

Techniques: Coding, cleaning, transformation, model building, and REST API development in R and Python; SQL, AWS Services, PySpark, Hive, NLP, ML, NLTK, SpaCy, Stanford CoreNLP, Inception Annotation, Gensim, T5 Transformers, PyTorch, SageMaker, S3, Tableau.

Health Outcomes Lead Researcher/ Data Engineer USF Health Florida March 2019 – Feb 2020

As a Lead, worked with 5 Data Science Researchers and 2 Senior Surgeons, reporting directly to the Research Director of Data Science at USF Health. I worked on multiple aspects, from data engineering to model building.

•70% of my work was data engineering on 140 datasets with differing feature names and data types and a large number of null values represented in many ways (NaN, Null, ?, No Answer, Hesitated, …); used SQL, Informatica PowerCenter, PySpark, and Python to create an analysis-ready dataset.

•Used PySpark for exploratory data analysis and built end-to-end ETL pipelines.

•Integrated OCR in the pipeline for extracting information from image datasets.

•Performed exploratory data analysis on billions of patient EHRs, using AWS SageMaker for computation and Scala in a Spark environment for big data analysis; used OCR for information extraction from images and diagnosed image reports.

•Performed feature engineering on 180 features to find the most influential features for predicting the target.

•Shortlisted the ICD-9 and ICD-10 codes for risky surgeries to help reduce mortality.

•Built and trained a model that predicts ICD-9 and ICD-10 codes based on the patient's EHR.

•Computational: SageMaker and S3 buckets for computation.

•ML Model: Random Forest, Ensembling Techniques, LightGBM, XGBoost, Python, OCR.

•Target: Built a highly dynamic model for predicting the exact principal diagnosis ICD code with 95.78% accuracy.

Techniques: SQL, PySpark, R, Python, Pandas, NumPy, NLTK, Scikit-Learn, Keras, TensorFlow, OCR, SageMaker, AWS Lambda, S3.

Lead Data Scientist St. Peter’s University NJ, USA Aug 2018 – Sep 2019

•Led two projects. The first was a healthcare analytics project on improving patient care quality by predicting the optimal length of stay (LOS) and readmissions, to improve the efficiency of operational workload.

o Techniques: Text data engineering (data cleaning, text preprocessing, data extraction) from Electronic Health Records (EHR), EDA, visualization (Tableau, PowerBI, Excel), OCR for extracting information from images.

o Model Building: LightGBMClassifier, Random Forests, SVMs, boosting algorithms, ensembling models.

Achieved: Predicted the optimal LOS and readmissions for different cases; working on reducing mortality for risky surgeries.

•The second project aimed to improve business and customer satisfaction, using NLP techniques and topic modeling to find the most discussed negative topics in 400K reviews. Built efficient models.

o Techniques: OCR for information extraction from images, text preprocessing, text classification, document similarity using cosine similarity and Euclidean distance, LDA (Latent Dirichlet Allocation), NMF (Non-Negative Matrix Factorization), topic modeling on nouns, entity recognition, Latent Semantic Analysis, N-Grams, TF-IDF, Naïve Bayes (text classification), visualizations (word phrases, word nets, and word clouds).

Achieved: Ranked the topics and reasons with a negative influence across different seasons and recommended solutions to improve the business.

Techniques: Python, R, NLP, NLTK, Pandas, NumPy, Gensim, SpaCy, Transformers, Groovy, OCR, Tableau, Excel, RPA, PyTorch, SQL, PowerBI.

Lead Data Scientist/Data Engineer MMPC New York, USA April 2019 – Aug 2019

As Project Lead, coordinated a Data Science team of 2 Data Scientists and 2 IT Specialists.

End-to-end project building with a clear and efficient ETL on the 8 million customer data:

•Used Informatica and SQL to extract, transform, and load data from different sources such as MySQL Server and Oracle DB, and performed various transformations in PowerCenter.

•Data flow design, flow management, and workflow monitoring at different stages of ETL.

•PySpark and Informatica for ETL pipelines and quick exploration of millions of records.

•Integrated different sets of customer data, purchasing data, service order data, and more.

•Built the final datasets, ready for data analytics.

Built an ML Model to Predict & Forecast the future sales:

•Built and deployed a forecasting ML model using past years' sales, reviews, and star ratings.

o Insights: Discovered the positive and negative opinions in the text data using NLP techniques and predicted future sales from the sales data using predictive analytics.

Recommended Cost-efficient Marketing Strategy:

•Built an efficient machine learning model that classifies and predicts a solution accurately by comparing all domain competitors on cost-efficient marketing strategy, drawing on numerical, categorical, and time-series analysis to improve the business.

•Used A/B testing to find the best model and applied classification metrics for model validation. Reported results to the CEO using Tableau.

Techniques: Python, Web Scraping, Data Engineering, Feature Engineering, Scikit-learn, NLTK, Pandas, OCR, NumPy, SciPy and Seaborn.

Research Data Science Intern Intelligent Rabbit NJ, USA Feb 2019 – April 2019

Built and Deployed NN & NLP Models by integrating NLP & CNN:

Performed all phases of text data acquisition, text cleaning, model development, validation, and visualization to deliver data science solutions using Tableau. Image classification and detection using CNN and R-CNN with PyTorch in Python.

•RNNs (LSTMs) for text modeling. Integrated NLP and CNN to build a model that classifies images and categorizes them using their text. VGG-16 and VGG-19 for feature extraction.

•Developed a new model using transfer learning by fine-tuning the significant hyperparameters of Convolutional Neural Networks (CNN). Performed sentiment analysis on text data using NLP.

Techniques: Python, Convolutional Neural Networks, OpenCV, PyTorch, OCR, Recurrent Neural Networks, Auto Encoders, Reinforcement Learning with OpenAI Gym, Generative Adversarial Networks (GANs).

Data Engineer SLN Technologies Jan 2017 – Aug 2018

•Analyzed large real-world data on the company's products and sales and delivered meaningful insights using Exploratory Data Analysis (EDA) and Predictive Analytics.

•H2O computational GPU for building the model and for quick analysis of the large datasets.

•Informatica PowerCenter for data transformations such as manipulation, integration, and aggregation.

•Built and deployed a recommendation engine to improve sales, using machine learning algorithms such as linear and logistic regression, SVC, SVR, decision trees, and ensembling techniques.

•MySQL, PySpark and AWS servers for the big data processing, MATLAB for Data Analytics.

•Collected, aggregated, and stored web log data from web servers into HDFS.

•Stored all the manipulated datasets in a graph database.

Techniques: Python, Scikit-Learn, NumPy, Pandas, Hadoop, Hive, Impala, Spark, MapReduce, PySpark, Informatica, SQL, NoSQL.

Financial Research Analyst APU Malaysia Feb 2016 – Dec 2016

This banking-sector project required building predictive models to detect fraudulent transactions by applying machine learning methods. Our new model considered more features and involved more models, and thereby increased prediction accuracy by more than 5%.

A major concern was data security, for which I worked with SAP Lumira and Informatica on secured ETL pipelines.

•Participated in all phases of data acquisition, data cleaning, model development, validation, and visualization to deliver data science solutions that predict fraudulent transactions.

•Worked on fraud detection analysis of payment transactions, using transaction history with supervised learning methods.

•Used ensemble methods, with different bagging and boosting methods, to increase the accuracy of the trained model.

•Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.

Techniques: R, Python, Numpy, Pandas, Machine Learning, SAP, Informatica

DATA SCIENCE PROJECTS – “All projects are in the Python programming language”

NLP Building a Topic Modeling system for improving the business on Amazon Reviews

NLP From scratch to Model Building and Implementing in the real World

Amazon fashion Recommendation Systems using NLP and ML

NLP Movie Recommendation Engine

Artificial Neural Networks for Predicting customer existence

GANs Generative Adversarial Networks for Image Data

Google stock price prediction using RNNs (LSTMs)

CNN for image Classification

TensorFlow for Image Classification using Softmax Classifier

Human Activity Recognition

Projects using Stacked Auto Encoders (Advanced Deep Learning)

Building Financial Model (report on GitHub)

RESEARCH WORK ON MACHINE LEARNING (NLP & DEEP LEARNING):

1) Drug Discovery using LSTMs, GRUs, GANs, Transformers (BERT, ALBERT)

2) AI for Diabetes medicine prediction (Model predicts 21 Medicines with Boolean answers)

3) Improving the customer satisfaction and business based on seasonality and NLP Solution Recommendation

4) Finance and AI Integration that forecasts future trends of stocks for asset allocations.

5) Predicting the Length of Stay and Readmissions in hospitals using Growing Neural Gas model.

6) “Analysis of Carcinogens present in drinking water of New York and Prediction of Cancer deaths till 2025”

7) “Deep Learning Model (LSTMs, Deep RNN) that ranks the research papers based on content relevance”.

8) “Transfer Learning (CNN) and performing Sentiment Analysis using NLP on Yelp reviews”

COURSES:

Clear understanding of market concepts and usage of the Terminal; certified by “Bloomberg”.

Deep Learning (CNN, RNN, LSTMs) by Andrew Ng (Stanford University)

Machine Learning Course by Andrew Ng (Stanford University).

Conceptual and Background Math for Machine Learning in Applied AI.

EDUCATION

MS - Data Science & Business Analytics, Saint Peter’s University - GPA: 3.85

Bachelor’s in Computer Science & Engineering, Satyabhama University-Chennai, India - GPA: 3.7
