AMOL B
Phone: +1-980-***-****
Email:*********@*****.***
Professional Summary:
7+ years of IT experience with multinational clients performing Data Modeling, Statistical Modeling, Data Mining, Data Exploration, and Data Visualization of structured and unstructured datasets.
Experience in data wrangling and loading on Big Data platforms such as Apache Spark and Kafka, working efficiently through SQL Server: extracting data from several sources, transforming it in transit, and loading it into the relevant platform for downstream processing.
Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Expertise in using Machine Learning and Deep Learning techniques in Python and R (RStudio), including Ranking, Linear, Polynomial, and Support Vector models, LSTM networks, Random Forest, Clustering, and classification models such as Logistic Regression, Decision Trees, Support Vector Machines, and KNN.
Experienced in building Recommendation engines using Association rules, collaborative filtering, and segmentation.
Expertise in advanced ensemble techniques such as Stacking, Blending, Bagging, and Boosting, and in ensemble models such as Random Forest, Gradient Boosting, and Extreme Gradient Boosting (XGBoost).
Expertise in ML frameworks such as TensorFlow, Caffe, PyTorch, and Theano, and in coding platforms such as Spyder, Jupyter Notebook, and RStudio (offered by Anaconda Navigator), as well as Google Colab.
Experience in text mining and topic modeling using NLP and Neural Networks: tokenizing, stemming, lemmatizing, and part-of-speech tagging using TextBlob, the Natural Language Toolkit (NLTK), and spaCy while building Sentiment Analysis models.
Experience in AI and Deep Learning techniques such as Convolutional Neural Networks (CNN) for Computer Vision, Recurrent Neural Networks (RNN), and Deep Neural Networks, with applications of Backpropagation, Stochastic Gradient Descent (SGD), Long Short-Term Memory (LSTM), Continuous Bag of Words, and Text Analytics.
Proficient in using PostgreSQL, Snowflake, Microsoft SQL Server, and MySQL to extract data with multiple types of SQL queries, including CREATE TABLE, JOIN, conditionals, DROP, and CASE.
Extensive knowledge of MLOps, which aims to deploy and maintain ML systems in production reliably and efficiently.
Skilled in creating executive Tableau dashboards for data visualization and deploying them to servers.
Proficient in Data Visualization tools such as Tableau and Power BI, Big Data tools such as Hadoop HDFS, Spark (PySpark), MapReduce.
Experience using Matplotlib and Seaborn in Python for visualization and Pandas in Python for performing exploratory data analysis.
Experience in Web Data Mining using Python's NLTK, Scrapy, and Beautiful Soup packages and REST APIs, along with working knowledge of Natural Language Processing (NLP) to analyze text patterns.
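As an illustration of the exploratory data analysis workflow described above, a minimal Pandas sketch; the data frame, column names, and values are purely illustrative, not taken from any client dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a raw extract; names and values are illustrative.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "spend": [120.0, 80.0, 80.0, np.nan, 45.0],
    "segment": ["A", "B", "B", "A", None],
})

df = df.drop_duplicates()                               # remove duplicate rows
df["spend"] = df["spend"].fillna(df["spend"].median())  # impute missing values
print(df.describe(include="all"))                       # descriptive statistics
print(df.isna().sum())                                  # remaining missing values
```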
Areas of Expertise:
Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Cassandra
Programming Languages: Python, Java8, Scala, UNIX, LINUX, Shell scripting
Framework: Apache Hadoop, Apache Spark
Databases: Oracle, SQL, PostgreSQL, MySQL, DynamoDB
Tools: Eclipse, JDeveloper, MS Visual Studio, Docker, Airflow, JIRA
Cloud: AWS, GCP
Version Control: GIT, SVN
Professional Experience:
Client: Bank of America – Charlotte, NC May 2019 – Present
Role: Data Scientist
Responsibilities:
Extracted the data from hive tables by writing efficient Hive queries.
Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing duplicates and imputing missing values.
Applied various machine learning algorithms and statistical models (decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, neural networks, deep learning, SVM, and clustering) to identify volume, using the scikit-learn package in Python and MATLAB.
Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
Developed Spark/Scala, Python, and R code for regular-expression processing in the Hadoop/Hive environment on Linux/Windows big data resources. Used the K-Means clustering technique to identify outliers and to classify unlabeled data.
Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using Python, R, Mahout, Hadoop and MongoDB.
Worked with Machine Learning algorithms such as decision trees and random forest.
Used Spark DataFrames, Spark SQL, and Spark MLlib extensively; developed and designed POCs using Scala, Spark SQL, and the MLlib libraries.
Developed various Tableau data models by extracting and using data from various source files, DB2, MongoDB, Excel, flat files, and big data sources.
Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau, Power BI
Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization, and performed Gap Analysis.
Implemented classification using supervised algorithms such as Logistic Regression, Decision Trees, KNN, and Naive Bayes.
Designed both 3NF data models for ODS, OLTP systems and Dimensional Data Models using Star and Snowflake Schemas.
Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous or heterogeneous data sources, and built various graphs for business decision-making using Python's Matplotlib library.
Implemented the ETL process; wrote and optimized SQL queries to perform data extraction and merging from a SQL Server database.
Built data pipelines from multiple data sources by performing necessary ETL tasks. Performed Exploratory Data Analysis using R and Apache Spark.
Designed the data flow for consolidating four legacy data warehouses into an AWS Data Lake.
Built a data lake as a cloud-based solution in AWS using Apache Spark and provided visualization of the ETL orchestration using the CDAP tool.
Created SQL tables with referential integrity and developed queries using SQL, SQL*Plus, and PL/SQL.
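The K-Means outlier-identification step mentioned above can be sketched as follows; the data is synthetic and the distance-percentile threshold is an illustrative choice, not a production setting:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: two tight clusters plus two obvious outliers.
X = np.vstack([
    rng.normal(0, 0.5, (50, 2)),
    rng.normal(10, 0.5, (50, 2)),
    [[30.0, 30.0], [-20.0, 25.0]],   # indices 100 and 101 are outliers
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of each point to its assigned cluster centroid.
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
# Flag points beyond the 98th percentile of centroid distances as outliers.
outliers = np.where(dists > np.quantile(dists, 0.98))[0]
print(outliers)
```

In practice the cluster count and the cutoff would be tuned to the data rather than fixed at these illustrative values.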
Client: ADP, Pasadena, CA Sept 2018 – May 2019
Role: Data Scientist
Responsibilities:
●Worked on a combination of unstructured and structured data from multiple sources and automated the data cleaning using Python scripts. Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
●Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data and extracted data from HDFS and prepared data for Exploratory Data Analysis using data munging.
●Developed data analyses using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
●Created multiple custom SQL queries in Teradata SQL Workbench on the AWS and Cloudera platforms to prepare the right datasets for Tableau dashboards. Queries retrieved data from multiple tables using various join conditions, enabling efficiently optimized data extracts for Tableau workbooks.
●Used Spark and H2O together with the Flow UI to perform various deep learning tasks implementing classification and regression algorithms; used Amazon SageMaker to deploy models, and Git to track changes when deploying them.
●Used the Elasticsearch engine to store, search, and analyze big volumes of data, powered applications with complex search features and requirements, and used the AWS platform to deploy, operate, and scale Elasticsearch clusters.
●Performed pre-processing of data, such as handling missing values and dealing with outliers and extreme values, and made the necessary transformations of the variables so the data follows a normal distribution.
●Implemented a Singular Value Decomposition (SVD) collaborative filtering algorithm to recommend products and services to users.
●Leveraged the Python package SciPy to perform Singular Value Decomposition (SVD) on User-Item matrices to make recommendations.
●In addition to SVD, Spearman correlation and Cosine similarity were utilized to compare performance.
●Explored a K-Nearest Neighbors approach for building the recommender engine.
●Enhanced Key Performance Indicators (KPIs) such as Click-Through Rate (CTR) to build ratings.
●Built custom algorithms and ensembles to make recommendations.
●Processed clickstream data to build ratings.
●Used Power BI for evaluating and improving existing BI systems; developed and executed database queries and conducted analyses.
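The SVD-based recommendation approach described above can be sketched with SciPy on a toy user-item matrix; the ratings and the latent-factor count k are illustrative assumptions, not the production values:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy user-item rating matrix (0.0 = not rated); values are illustrative.
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Truncated SVD with k latent factors (k must be < min(R.shape)).
U, s, Vt = svds(R, k=2)
# Reconstruct a dense score matrix; higher score = stronger predicted preference.
scores = U @ np.diag(s) @ Vt

def recommend(user, n=1):
    """Return the indices of the top-n unrated items for a user."""
    unrated = np.where(R[user] == 0.0)[0]
    return unrated[np.argsort(scores[user, unrated])[::-1][:n]]

print(recommend(0, n=2))  # ranked unrated items for user 0
```

A KNN or cosine-similarity variant, as mentioned above, would swap the scoring step while keeping the same ranking logic.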
Lavie (Bagzone Lifestyles), India Jan 2014 – Dec 2015
Data Analyst
Description: Lavie established itself as one of the best handbag brands in India with its first bag collection, showcased in 2010; a stylish footwear collection under the brand name Fé Lavie launched shortly thereafter. I worked as a junior data analyst developing a machine learning model to predict sales of fashion products for the coming season. The model combined information about future trends found on web pages with the previous years' sales history; to build it, we used text mining and data mining techniques.
Responsibilities:
●Created and updated SQL tables, databases, stored procedures, and queries to modify and/or create reports for respective business units, and used MongoDB to create queries.
●Performed data visualization and designed dashboards with Kibana, and generated complex reports, including charts, summaries, and graphs, to communicate findings to the team and stakeholders.
●Developed a program to extract potentially useful information from user-generated data on web pages.
●Performed Data Cleaning, Data Visualization, Information retrieval, Feature Engineering using Python libraries such as Pandas, NumPy, Scikit-learn, Matplotlib and Seaborn.
●Feature-engineered raw data by performing imputation, normalization, and scaling as required on the data frame. Converted categorical variables to numerical values using LabelEncoder for EDA and readability by the machine learning models.
●Performed univariate, bivariate, and multivariate analysis to check how the features related to one another and to the risk factor.
●Applied PCA to reduce the correlation between features and high dimensionality of the standardized data so that maximum variance is preserved along with relevant features.
●Built regression models based on Decision Trees, Support Vector Machines, and Random Forests to predict the different risk levels of applicants, and used Grid Search on the cleaned data to improve accuracy.
●Proactively identified opportunities to automate time and resource intensive procedures associated with data validation and transformation using Python, Azure Data Factory.
●Evaluated the models' performance using metrics such as the coefficient of determination (R2) and RMSE, and used Cross-Validation to test the models with different batches of data to optimize performance.
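The PCA, Grid Search, and evaluation steps above can be combined into one sketch; the dataset is synthetic since the original sales data is not available, and the hyperparameter grid is an illustrative assumption:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sales data.
X, y = make_regression(n_samples=300, n_features=12, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # standardize before PCA
    ("pca", PCA(n_components=0.95)),  # keep components covering 95% of variance
    ("rf", RandomForestRegressor(random_state=0)),
])

# Grid Search over a small illustrative grid with 5-fold cross-validation.
grid = GridSearchCV(pipe, {"rf__n_estimators": [50, 100]}, cv=5, scoring="r2")
grid.fit(X_tr, y_tr)

pred = grid.predict(X_te)
r2 = r2_score(y_te, pred)
rmse = np.sqrt(mean_squared_error(y_te, pred))
print(grid.best_params_, round(r2, 3), round(rmse, 1))
```

Cross-validation inside the grid search scores each candidate on different batches of the data, matching the evaluation approach described above.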
Education:
B.Sc. in IT
Mulund College of Commerce, Mulund (W), Mumbai, 2012