Sai Sree B
518-***-**** **********@*****.***
PROFESSIONAL SUMMARY
Over 3 years of work experience with a wide range of skill sets. Experience with Statistics, Data Analysis, Supervised and Unsupervised Machine Learning and Data Visualization using Python and Tableau.
Proficient in managing the entire data science project life cycle and actively involved in all of its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, machine learning
(Regression, Decision trees, Random Forest, SVM, clustering, K-NN, Dimensionality reduction, Gradient descent, GBM, XGBoost, Time series Analysis, Neural networks, CNN and NLP), model validation and making recommendations to the business.
Experienced with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting using tools such as Power BI, Tableau, and Python.
Worked on complex KPI scorecards, heat maps, tree views, circle views, histogram visualizations and interactive dashboards to analyze trends in data using tools such as Power BI and Tableau.
Strong experience working with relational databases such as Oracle SQL and NoSQL databases such as MongoDB.
Proficient in using Hadoop, HBase, Spark, and Hive for basic analysis and extraction of data in the infrastructure to provide data summarization.
Familiar with Amazon Web Services cloud environments.
Worked with various file formats (delimited text files, click stream log files, Apache log files, Avro files, JSON files, XML Files).
Used Flume, Kafka, NiFi, and HiveQL scripts to extract, transform, and load data into databases. Able to perform cluster and system performance tuning.
Experienced in developing Machine Learning models in development environments such as Azure Databricks and Jupyter notebooks.
Experienced in developing Data Pipelines and scheduling tasks using Python and Azure Databricks.
Willing to relocate: Anywhere
SKILLS
Programming: Python (NumPy, Pandas, Scikit-Learn), MATLAB, SQL, HiveQL, PySpark.
Analytics and Visualization Tools: Tableau, Ggplot (R), Plotly, Seaborn, Matplotlib, MATLAB
Statistical Methods: ARIMA, Regression Analysis, Hypothesis Testing, Survival Models.
Other Tools: Git Version Control, Jupyter Notebook, IPython Notebook, Unix Shell, Atom, PyCharm.
Machine Learning Algorithms: Logistic Regression, Linear Regression, Decision Tree, Random Forests, Gradient Boosting, Voting Estimators, SMOTE, Lasso and Ridge Regression, Nearest Neighbor Classifier, K-means clustering, Gaussian Mixture, DBSCAN, Principal Component Analysis, Auto Encoder, Singular Value Decomposition, Support Vector Machines, Auto Regression & Moving Averages.
Big Data: HDFS, MapReduce, Hive, HBase, Storm, Kafka, Elasticsearch, Redis, Flume, Flink, Sqoop, Spark, Hadoop, Azure Data Factory, Azure HDInsight, Azure SQL Data Warehouse.
Statistical Methods: Parametric Time Series, regression models, principal component analysis and Dimensionality Reduction.
Databases: Apache Cassandra, Amazon Redshift, Amazon RDS, SQL, Apache HBase, Hive, MongoDB, NoSQL Database, MySQL, MS Access.
Data Storage: HDFS, Data Lake, Data Warehouse, Database, PostgreSQL.
Data Pipelines/ETL: Flume, Apache Storm, Apache Spark, NiFi, Apache Kafka.
Amazon Stack: AWS, EMR, EC2, SQS, S3, DynamoDB, Redshift, CloudFormation
Operating Systems: Linux/UNIX, Windows
WORK EXPERIENCE
Department Of Labour, ITS Albany, NY
Big Data Analyst - Apr 19 - Present
Role & Responsibilities:
Worked with the Cloud product team to support daily data requests using Python, SQL and Tableau.
Analyzed millions of rows to produce meaningful, business-driven insights that helped the team better understand the product's users.
Managed data coming from different sources.
Transformed large sets of structured, semi-structured and unstructured data by applying ETL processes using Hive.
Exported data from HDFS to Teradata using Sqoop on a regular basis.
Wrote Hive queries for data analysis to meet business requirements.
Created Hive tables and worked with them using HiveQL.
Created Hive tables, loaded them with data and wrote Hive queries that run internally as MapReduce jobs.
Designed and implemented a PySpark-based large-scale parallel relation-learning system.
Built DataFrames using PySpark on an AWS EC2 cluster for big data requests within the organization.
Scraped data from third-party APIs using Python and aggregated multiple data sources for analysis and product insights.
Developed a clustering algorithm for customer segmentation by demographics using Python and Scikit-Learn.
Developed classification models for customer retention analysis using Python and Jupyter Notebooks.
Owned dashboards in Tableau for weekly and monthly client business metrics reviews and updates.
Built large-scale and complex data processing pipelines.
Solid experience with PySpark data structures, critical features and performance tuning.
Worked with cluster and configuration management systems such as Docker and Kubernetes.
Worked with AWS services (EC2, S3, RDS, etc.).
Analyzed Hadoop clusters using big data analytics tools including Hive and MapReduce.
Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
Used Spark SQL to load JSON data, create a SchemaRDD and load it into Hive tables; handled structured data using Spark SQL.
Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
Responsible for development and management of all cluster-related testing activities.
Excellent understanding of Hadoop architecture and the underlying Hadoop framework, including storage management.
Environment: Python, SQL, Tableau, PySpark, Jupyter Notebooks, AWS EC2, HDFS, Hive, MySQL Workbench, Sqoop, Teradata.
Trans Global Geomatics Pvt Ltd. Hyderabad, India
Data Analyst - Dec 17 - Dec 18
Role & Responsibilities:
Designed and maintained data systems and databases, including fixing coding errors and other data-related problems.
Developed and implemented databases, data collection systems, data analytics and other strategies that optimize statistical efficiency and quality.
Mined data from primary and secondary sources, then reorganized it into a format that can be easily read by either human or machine.
Manipulated data and ran pivot tables, VLOOKUPs and formulas in Excel to create reports for managers.
Used Pandas, NumPy, Scikit-Learn, FeatureTools, Missingno, and SciPy packages to clean, explore and manipulate data and to perform feature engineering.
Used Matplotlib, Seaborn, and Plotly libraries to visualize findings during exploratory data analysis (EDA).
Used Scikit-Learn and XGBoost libraries to build different models and evaluate their performance.
Leveraged Scikit-Learn’s model selection tools to perform hyperparameter tuning and k-fold cross-validation.
Evaluated performance with a confusion matrix and tuned the model based on accuracy, precision, recall, and F1 score.
Plotted an ROC Curve to determine the proper threshold and measure model performance.
Leveraged unsupervised techniques such as K-Means and Gaussian Mixture Models to detect anomalies and improve our ensemble.
PROJECTS
Personalized Cancer Diagnosis
Built a classifier that helps molecular pathologists classify genetic variations/mutations based on evidence from text-based clinical literature.
Performed univariate analysis on text features and gene features, then implemented various machine learning models and stacked models.
Achieved an average log loss of 1.0578 using a stacking classifier after the necessary hyperparameter tuning.
Tools: Python, Numpy, Pandas, Scikit-learn, Sqlite3, Naive Bayes, KNN, Logistic Regression, Stacking classifier.
Quora Question Pair Similarity
Developed a model to predict whether a pair of questions is a duplicate, which is useful for instantly providing answers to questions that have already been answered.
Performed text feature engineering and implemented random forest, logistic regression, linear SVM and XGBoost models with hyperparameter tuning, achieving a log loss of 0.313.
Tools: Python, Numpy, Pandas, Scikit-learn, Plotly, NLTK, BeautifulSoup, XGBoost
Taxi Demand Prediction in NYC
Cleaned New York City taxi data and performed exploratory data analysis and feature engineering using Python and Dask.
Segmented the data using K-means clustering and applied baseline models, Linear Regression, Random Forest Regressor and XGBoost Regressor to predict the number of taxi pickups in a given location at a particular time interval, achieving a mean absolute percentage error of 11.57.
Tools: Python, Pandas, Numpy, Matplotlib, Sqlite3, Dask, Folium, Gpxpy, Scikit-Learn, XGBoost
Stack Overflow Tag Prediction
Collaborated with a team of 3 to design a model that suggests tags for questions posted on Stack Overflow.
Interpreted data, performed statistical analysis of question content, framed a multi-class classification problem and built logistic regression and support vector machine models to predict tags from question content.
Achieved an average F1 score of 0.51 after the necessary hyperparameter tuning.
Tools: Python, pandas, Numpy, sqlite3, matplotlib, seaborn, scipy, NLTK, Scikit-learn.
EDUCATION
State University of New York Albany, New York
MASTER’S IN DATA SCIENCE Jan. 2019 - Dec. 2020
Theory of Statistics, Optimization Methods, Data Mining, Applied Statistics, Machine Learning, Database Systems and Data Analysis, Business Analytics and Text Mining, Topological Data Analysis.
Jawaharlal Nehru Technological University, Kakinada Guntur, India
BACHELOR’S IN ELECTRONICS AND COMMUNICATION ENGINEERING Jun. 2014 - Apr. 2018
Signals and Systems, Computer Networks, Network Analysis, Digital Communication Systems, Analog Communication Systems.