Subhabrata Das, Ph.D.
******@********.*** ● 212-***-**** ● 921 US Highway 202/206, Apt #2, Bridgewater, NJ 08807, USA
SUMMARY
• Senior Data Scientist with a Ph.D. from Columbia University and a Master's degree from Cornell University, with 8+ years of hands-on experience interpreting and analyzing data through ML, NLP, PHP, Databricks, enterprise Datalog, and statistical techniques, and deriving meaningful insights to implement solutions in a fast-paced environment.
• Skilled in Power BI, R, machine learning, Azure Data Factory, Python, C#, SQL, PHP, Shiny, Tableau, and statistical modeling.
• Extensive experience in machine learning, statistical modeling, Databricks, predictive modeling, data analytics, data modeling, data architecture, data analysis, data mining, natural language processing, and artificial intelligence algorithms.
• Experienced in using analytical tools such as R and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions, and translate analytical findings into risk management and marketing strategies that drive value; applied text clustering, topic modeling, and information extraction.
• Developed statistical machine learning and data mining solutions to various business problems, generated data visualizations using R and Python, and created dashboards using tools such as Tableau and Power BI.
• Experienced in machine learning techniques such as clustering analysis, market basket analysis, association rules, Naïve Bayes, recommendation systems, dimension reduction, principal component analysis (PCA), decision trees, and neural networks.
• Working knowledge of building prediction models using linear regression, logistic regression, correlation coefficients, and coefficient-of-determination techniques.
• Good knowledge of statistical analysis techniques such as confidence intervals, hypothesis testing, ANOVA, and conjoint analysis.
• Working understanding of AWS, PHP, and Python for data analysis.
EDUCATION
Columbia University, School of Engineering New York, NY
Ph.D., M. Phil in Quantitative Modeling, GPA: 3.7/4.0 Jan 2015 – Feb 2019
Cornell University Ithaca, NY
M.E. in Chemical Engineering, GPA: 3.5/4.0 Aug 2013 – May 2014
National Institute of Technology Trichy, India
B.Tech in Chemical Engineering, GPA: 8.05/10 Sep 2009 – Aug 2013
TECHNICAL SKILLS
Languages: Python, PHP, R, Scala, Pig, MATLAB, Java, C, C++, HTML, UNIX Shell, JavaScript, SQL, UML
Big Data Tools: Hadoop stack, Apache Spark, Storm
Testing Tools: HP Quality Center ALM, Jira, Rally
Relational Databases: MySQL, MS SQL, PostgreSQL
Databases: Oracle, DB2, MySQL, MS SQL Server, MS Access, Teradata
Reporting & Visualization: Tableau, Matplotlib, Seaborn, ggplot, Crystal Reports, Cognos, Shiny
Machine Learning: Linear Regression, Logistic Regression, Gradient Boosting, Random Forests, Maximum Likelihood Estimation, Clustering, Classification & Association Rules, K-Nearest Neighbors (KNN), K-Means Clustering, Decision Trees (CART & CHAID), Neural Networks, Principal Component Analysis, Weight of Evidence (WOE) and Information Value (IV), Factor Analysis, Sampling Design, Time Series Analysis (ARIMA, ARMA, GARCH), Market Basket Analysis, Text Mining
OS: UNIX, Linux, Windows, macOS
WORK EXPERIENCE
MICROSOFT, DATA SCIENTIST Bridgewater, NJ, May 2020-June 2020
• Implemented ε-greedy contextual bandit algorithms in reinforcement learning using Python for the News and Feeds project to build recommendation strategies and support decision making (sketched below); converted existing logs to match the Azure ML format
• Debugged scripts written in COSMOS, SQL, and C# for document and user embedding as part of the Core Ranker team
• Conducted A/B experiments for scorecard analysis and quality control of the recommender system
• Developed REST APIs and deployed microservices in Azure; used Data Factory for end-to-end process mining
• Deployed and managed containerized applications with streaming capabilities using Azure Kubernetes Service
• Worked with Azure Machine Learning services and MLflow; wrote object-oriented Python for the model
Environment: Databricks, machine learning, AWS, MS Azure, Python (scikit-learn/SciPy/NumPy/pandas), PyTorch, R
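Below is a minimal sketch of the ε-greedy contextual bandit approach referenced in this role; the class name, the per-arm linear reward model, the feature dimensions, and the reward values are illustrative assumptions, not the production Azure ML pipeline.

```python
import numpy as np

class EpsilonGreedyContextualBandit:
    """Illustrative epsilon-greedy contextual bandit with one linear model per arm."""

    def __init__(self, n_arms, n_features, epsilon=0.1, lr=0.01):
        self.epsilon = epsilon                      # exploration rate (assumed value)
        self.lr = lr                                # learning rate for the SGD updates
        self.weights = np.zeros((n_arms, n_features))

    def select_arm(self, context):
        # Explore with probability epsilon, otherwise exploit the best estimated arm.
        if np.random.rand() < self.epsilon:
            return np.random.randint(len(self.weights))
        return int(np.argmax(self.weights @ context))

    def update(self, arm, context, reward):
        # One SGD step toward the observed reward (e.g., click = 1, skip = 0).
        error = reward - self.weights[arm] @ context
        self.weights[arm] += self.lr * error * context

# Toy usage: 5 candidate news items, 8-dimensional user/context features (assumed).
bandit = EpsilonGreedyContextualBandit(n_arms=5, n_features=8)
context = np.random.rand(8)
arm = bandit.select_arm(context)
bandit.update(arm, context, reward=1.0)
```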
Avantor, Biopharma Technology Center, Entrepreneurial QA Scientist Bridgewater, NJ, May 2019-Apr 2020
• Designed, modeled, validated, and tested statistical algorithms against various biopharma data sets, including behavioral data, and deployed predictive models using Jupyter notebooks, Python, Spark, and Databricks.
• Performed data transformations to rescale and normalize variables.
• Extracted application requirements and functionality from specifications using NLP.
• Designed, developed, tested, and maintained Tableau functional reports based on user requirements.
• Designed and deployed rich graphic visualizations with drill-down and drop-down menu options and parameters using Tableau, AWS SageMaker, Comprehend, Elasticsearch, XGBoost, and API Gateway.
• Worked closely with business users; interacted with ETL developers, project managers, and members of the QA teams.
• Developed Tableau data visualizations using cross tabs, heat maps, box-and-whisker charts, scatter plots, geographic maps, pie charts, bar charts, and density charts; used Azure Data Factory for model deployment.
• Created PySpark scripts for ETL and applied machine learning and deep learning algorithms for price prediction, image classification, and object detection using Databricks, PyTorch, and AWS SageMaker.
• Created and maintained reports to display the status and performance of deployed models and algorithms.
• Performed database testing; wrote complex SQL queries to verify transactions and business goals.
• Worked on data profiling and data validation to ensure the accuracy of the data between the warehouse and source systems.
• Developed time series forecasting models for various business databases using ARIMA and ARIMAX (see the sketch below).
• Partnered with the sales and marketing teams and collaborated with cross-functional teams to frame and answer important data questions, prototyping and experimenting with ML/DL algorithms and integrating them into production systems for different business needs.
Environment: Databricks, machine learning, AWS, MS Azure, Cassandra, Spark, HDFS, Hive, PHP, Linux, Python (scikit-learn/SciPy/NumPy/pandas), PyTorch, R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau, Datalog.
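A minimal sketch of the ARIMA/ARIMAX-style forecasting mentioned above, using statsmodels; the synthetic series, the single exogenous regressor, and the (1, 1, 1) order are illustrative assumptions, not the business data or tuned model.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series standing in for a business metric (assumption).
rng = pd.date_range("2019-01-01", periods=36, freq="MS")
y = pd.Series(np.cumsum(np.random.normal(1.0, 0.5, 36)), index=rng)
exog = pd.Series(np.random.normal(0, 1, 36), index=rng)  # exogenous driver (ARIMAX)

# Fit ARIMA(1, 1, 1) with an exogenous regressor; the order is illustrative.
model = ARIMA(y, exog=exog, order=(1, 1, 1))
fitted = model.fit()

# Forecast the next 6 months, supplying future values of the exogenous variable.
future_index = pd.date_range(rng[-1] + pd.offsets.MonthBegin(), periods=6, freq="MS")
future_exog = pd.Series(np.zeros(6), index=future_index)
forecast = fitted.forecast(steps=6, exog=future_exog)
print(forecast)
```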
United Nations Secretariat, Economic and Social Affairs, Econometrics Data Scientist NY, Mar 2019-May 2019
Performed data analysis for multiple economics departments in the United Nations, which helped in decision making.
Developed advanced data science models in Python to predict correlations among the Sustainable Development Goals.
Worked with stakeholders at all levels to develop reports used for quality reviews and other desired outcomes.
Used best practice methods to clean, manipulate, transform, and merge datasets in Python.
Developed advanced data science models in Python to predict the metrics affecting the economy of Tanzania.
Built key business metrics, visualizations, dashboards, and reports with Tableau.
Performed regression analysis on data sets using linear models to extract insights and forecast trends (see the sketch below).
Environment: Python 2.7, Databricks, CDH5, HDFS, Hadoop 2.3, Hive, Impala, Linux, Tableau Desktop, SQL Server 2012, Microsoft Excel.
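A minimal sketch of the linear-model regression analysis described above, using statsmodels OLS; the indicator names and the synthetic data are illustrative assumptions, not the UN datasets.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic indicator data standing in for economic metrics (assumption).
df = pd.DataFrame({
    "gdp_growth": np.random.normal(4.0, 1.0, 50),
    "investment": np.random.normal(20.0, 3.0, 50),
    "exports": np.random.normal(15.0, 2.0, 50),
})

# Ordinary least squares: regress the target on candidate drivers.
X = sm.add_constant(df[["investment", "exports"]])
model = sm.OLS(df["gdp_growth"], X).fit()

print(model.summary())            # coefficient estimates, p-values, R-squared
print(model.predict(X.iloc[:5]))  # in-sample fitted values for the first rows
```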
Columbia University, Research Team Leader/ Data Scientist New York, NY, Jan 2015-Feb 2019
Involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions for the personal care industry using Jupyter notebooks, Spark, and Databricks.
Worked on data cleaning and ensured data quality, consistency, and integrity using R, Python, and Shiny.
Used PCA and other feature engineering techniques to reduce high-dimensional data, along with feature normalization and label encoding, using the scikit-learn library in Python.
Used pandas, NumPy, Seaborn, Matplotlib, and scikit-learn in Python to develop machine learning models such as logistic regression, gradient-boosted decision trees, and neural networks.
Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
Experimented with ensemble methods, including different bagging approaches, to increase the accuracy of the trained models.
Implemented a Python-based distributed random forest via PySpark and MLlib (sketched below).
Created and maintained reports to display the status and performance of deployed models and algorithms.
Performed database testing; wrote complex SQL queries to verify transactions and business goals.
Worked on data profiling and data validation to ensure the accuracy of the data between the warehouse and source systems using NLP, AWS SageMaker, Comprehend, Elasticsearch, XGBoost, and API Gateway.
Developed time series forecasting models for various business databases using ARIMA and ARIMAX.
Developed SQL scripts to create tables, sequences, and data views.
Involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions for object detection and image classification using PyTorch and AWS.
Hands-on experience building groups, hierarchies, and sets to create detail-level summary reports and dashboards using KPIs.
Used groups, bins, hierarchies, and filters to create focused and effective visualizations.
Experienced in forecasting and producing various trend reports.
Used parameters and input controls to give users control over certain values.
Utilized Tableau Server to publish and share reports with business and academic users.
Environment: Databricks, Python, MS SQL, Hive, PostgreSQL, MySQL, pandas, NumPy, Matplotlib, Seaborn, Tableau, Power BI, SVN, PyTorch, Amazon Web Services
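A minimal sketch of the PySpark/MLlib distributed random forest mentioned above; the file path, column names, and hyperparameters are illustrative assumptions, not the actual project data or configuration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("rf-sketch").getOrCreate()

# Hypothetical training data with numeric features and a binary "label" column.
df = spark.read.csv("training_data.csv", header=True, inferSchema=True)

# Assemble feature columns into the single vector column expected by MLlib.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# Distributed random forest; tree count and depth are illustrative.
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=100, maxDepth=8)
model = rf.fit(train)

# Evaluate with area under the ROC curve on the held-out split.
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")
```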
Binoptics/Macom Corporation, Data Analyst and Statistical modeling Ithaca, NY, June 2014-Sept 2014
Collaborated in creating different stages of the big data architecture.
Performed data analysis for multiple marketing departments, which helped the team in decision making.
Developed advanced data science models in Python to forecast the semiconductor business.
Worked with stakeholders at all levels to develop reports used for quality reviews and other desired outcomes.
Used best practice methods to clean, manipulate, transform, and merge datasets in Python.
Built key business metrics, visualizations, dashboards, and reports with Tableau.
Performed regression analysis on data sets using linear models to extract insights and forecast trends.
Transformed visualizations into meaningful insights for reporting.
Implemented NLP to quickly route customers and customer service agents to the information they need.
Environment: SQL Server 2008, Teradata 13, Erwin 8, Oracle 9i, PL/SQL, SQL*Loader, ODS, OLAP, SSAS, Informatica Power Center, OLTP
Cornell University, Data Analyst Ithaca, NY, Oct 2013-May 2014
Interpreted and created storyboards of business discoveries using business intelligence tools and communicated them to end clients, resulting in actionable outcomes.
Performed data visualization and designed dashboards with Tableau; provided complex reports, including charts, summaries, and graphs, to interpret the findings for the team and stakeholders.
Performed data collection, data preparation, feature engineering, hypothesis testing, data reduction, and data mining to help data scientists develop mathematical and statistical models.
Data preprocessing: missing value treatment, outlier detection, data exclusion, feature engineering.
Modeling techniques used: linear regression, logistic regression, count data regression, K-nearest neighbors, K-means clustering, Naïve Bayes, Bayesian data analysis, support vector machines, random forests, dimensionality reduction (PCA, stepAIC), regularization, and cross-validation (see the sketch below).
Environment: MS SQL Server, R/RStudio, SQL Enterprise Manager, Redshift, MS Excel, Power BI, Tableau, T-SQL, ETL, MS Access, XML, MS Office 2007, Outlook, SAS E-Miner.
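A minimal sketch of the preprocessing-plus-modeling pattern listed above (scaling, PCA for dimensionality reduction, and a regularized logistic regression evaluated with cross-validation), written in scikit-learn; the toy dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset (assumption)

# Scale -> reduce dimensionality with PCA -> regularized logistic regression.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])

# 5-fold cross-validation guards against overfitting to a single split.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```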
Swiss Federal Institute of Technology, Summer Intern Zurich, May 2012-July 2012
Modified existing code to correct errors, upgrade interfaces, and improve performance.
Planned and coordinated analysis and extracted data from multiple source systems into the data warehouse RDBMS while ensuring data integrity.
Assisted in the architecture of a new data warehouse using star-schema dimensional modeling.
Performed data analysis on customer financial transaction records using big data tools such as MapReduce and Hive.
Researched and explored suitable machine learning techniques based on data sets to predict potential business.
Ran long-time simulations of fluid flow and performed flow visualization to increase productivity and cut unnecessary costs.
Environment: Python, R, Hive, MS SQL, PostgreSQL, SSIS, SSRS, SSAS, Tableau, Power BI, Qlik Sense, pandas, NumPy, Matplotlib, Seaborn, Jupyter, Hadoop