
Data Scientist

Location:
Russian Federation
Posted:
August 17, 2018

Resume:

Parth Kumar

Data Scientist

Email id: *********@*****.***

PROFESSIONAL SUMMARY:

8+ years of experience in Data Analysis, Machine Learning, and Data Mining with large sets of structured and unstructured data, including Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.

Experienced in the latest BI tools, including Tableau, Power BI, Qlik Sense, and QlikView.

Extensive experience in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions to various business problems and generating Data Visualizations using R, Python and Tableau.

Expertise in transforming business requirements into analytical models, designing algorithms, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.

Proficient in Machine Learning techniques (Decision Trees, Linear and Logistic Regression, Random Forest, SVM, Bayesian methods, XGBoost, K-Nearest Neighbors) and Statistical Modeling in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor analysis/PCA, and Ensembles.

Experience in designing visualizations using Tableau and Storyline on web and desktop platforms, and in publishing and presenting dashboards.

Experience in advanced SAS programming techniques, such as PROC APPEND, PROC DATASETS, and PROC TRANSPOSE.

Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies.

Expertise in Python programming with various packages including NumPy, Pandas, SciPy, and scikit-learn.

Proficient in data visualization tools such as Tableau, Plotly, Python Matplotlib, and Seaborn.

Familiar with Hadoop Ecosystem such as HDFS, HBase, Hive, Pig and Oozie.

Experienced in building models by using Spark (PySpark, SparkSQL, Spark MLLib, and Spark ML).
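
Below is a minimal, illustrative PySpark ML sketch of such a model-building pipeline; the data path, column names, and parameters are hypothetical placeholders, not project code.

    # Minimal PySpark ML pipeline sketch (illustrative only).
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("model-sketch").getOrCreate()
    df = spark.read.parquet("s3://bucket/training_data")  # hypothetical path

    # Assemble raw columns into a single feature vector, then fit.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(df)
    predictions = model.transform(df)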

Experienced in cloud services such as AWS EC2, EMR, RDS, and S3 to support big data tooling, solve data storage issues, and work on deployment solutions.

Worked as a data scientist and mentored the team in preparing new POCs and models for healthcare customers.

Worked on deployment tools such as Azure Machine Learning Studio, Oozie, and AWS Lambda.

Proficient in Java, Python, R, C/C++, SQL, Tableau.

Worked and extracted data from various database sources like Oracle, SQL Server and Teradata.

Experience in foundational machine learning models and concepts (regression, boosting, GBM, NNs, HMMs, CRFs, MRFs, deep learning).

Skilled in System Analysis, Dimensional Data Modeling, Database Design and implementing RDBMS specific features.

Developed a generic model for predicting repayment of debt owed in the healthcare, large commercial, and government sectors.

Facilitated and helped translate complex quantitative methods into simplified solutions for users.

Knowledge of Proofs of Concept and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging.

Additional tools: Git, Java, MySQL, MongoDB, Neo4j, AngularJS, SPSS, Tableau.

Excellent knowledge of the Hadoop ecosystem and Big Data tools such as Pig, Hive, and Spark.

Worked with different data formats such as JSON and XML, and implemented machine learning algorithms in Python.

EDUCATION:

Bachelor of Engineering.

TOOLS AND TECHNOLOGIES:

Statistics/ML

Exploratory Data Analysis: Univariate/Multivariate Outlier detection, Missing value imputation, Histograms/Density estimation, EDA in Tableau

Supervised Learning: Linear/Logistic Regression, Lasso, Ridge, Elastic Nets, Decision Trees, Ensemble Methods, Random Forests, Support Vector Machines, Gradient Boosting, XGB, Deep Neural Networks, Bayesian Learning

Unsupervised Learning: Principal Component Analysis, Association Rules, Factor Analysis, K-Means, Hierarchical Clustering, Gaussian Mixture Models, Market Basket Analysis, Collaborative Filtering and Low Rank Matrix Factorization

Sampling Methods: Bootstrap sampling methods and Stratified sampling

Model Tuning/Selection: Cross Validation, Walk Forward Estimation, AIC/BIC Criterions, Grid Search and Regularization

Time Series: ARIMA, Holt winters, Exponential smoothing, Bayesian structural time series

Machine Learning / Deep Learning

Python: Keras, PyTorch, Theano, scikit-learn, xgboost. R: caret, glmnet, forecast.

SAS: Forecast server, SAS Procedures and Data Steps.

Spark: MLlib, GraphX.

SQL: Subqueries, joins, DDL/DML statements.

Databases/ETL/Query

Teradata, SQL Server, Redshift, Postgres and Hadoop (MapReduce); SQL, Hive, Pig and Alteryx, Talend Open Studio, SSIS

DW-BI tools

Tableau, ggplot2, R Shiny, Microsoft Power BI

PROFESSIONAL EXPERIENCE:

Client: Millipore Sigma - Burlington, MA

Oct 2017 – Present

Role: Data Scientist

Description: Millipore Corporation provides solutions and services for the research, development, and production of biotechnology and pharmaceutical drug therapies. The company offers analytics and sample preparation products such as Chromolith products, test strips, reference materials, and disinfection controls.

Responsibilities:

Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib.

Served as solutions architect for transforming business problems into Big Data and Data Science solutions, and defined the Big Data strategy and roadmap.

Identified areas of improvement in the existing business by unearthing insights from vast amounts of data using machine learning techniques with TensorFlow, Scala, Spark, MLlib, Python, and other tools and languages as needed.

Created and validated machine learning models with Azure Machine Learning.

Designed a machine learning pipeline using Microsoft Azure Machine Learning for predictive and prescriptive analytics, and implemented machine learning scenarios for given data problems.

Used Scala for coding the components in Play and Akka.

Worked on different machine learning models such as Logistic Regression, Multilayer Perceptron classifiers, and K-Means clustering by creating Scala SBT packages and running them in the Spark shell (Scala), and built an autoencoder model using R.

Applied rich domain knowledge and experience in the Healthcare, Banking, and Travel industries to data mining, data modeling, data visualization, and machine learning work.

Set up and configured AWS EMR clusters, and used Amazon IAM to grant users fine-grained access to AWS resources.

Created detailed AWS Security Groups, which behaved as virtual firewalls controlling the traffic allowed to reach one or more AWS EC2 instances.
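
As an illustration of this kind of setup, a hedged boto3 sketch follows; the group name, VPC ID, and CIDR range are placeholders.

    # boto3 sketch: create a security group and allow SSH from one CIDR.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    sg = ec2.create_security_group(
        GroupName="app-sg",                 # hypothetical name
        Description="Allow SSH from office network only",
        VpcId="vpc-0123456789abcdef0",      # placeholder VPC ID
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],  # example CIDR
        }],
    )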

Wrote scripts and an indexing strategy for a migration to Redshift from Postgres 9.2 and MySQL databases.

Wrote Kinesis agents to pipe data from a streaming app into S3.

Good knowledge of Azure cloud services, Azure Storage, Azure Active Directory, and Azure Service Bus. Created and managed Azure AD tenants, configured application integration with Azure AD, and integrated on-premises Windows AD identity with Azure Active Directory.

Working knowledge of Azure Fabric, microservices, IoT, and Docker containers in Azure. Azure infrastructure management and PaaS solution architecture (Azure AD, licensing, Office 365, cloud DR using Azure Recovery Vault, Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).

Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction; the resulting engine increased user lifetime by 45% and tripled user conversions for target categories.

Designed and developed NLP models for sentiment analysis.

Led discussions with users to gather business process and data requirements to develop a variety of Conceptual, Logical, and Physical data models. Expert in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.

Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for a new route, and performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
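
A minimal scikit-learn sketch of this classification setup appears below; the features and labels are synthetic placeholders rather than the actual route data.

    # Illustrative on-time-delivery classifier on synthetic data.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))        # stand-ins for route/package features
    y = rng.integers(0, 2, size=1000)     # 1 = delivered on time

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))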

Worked on machine learning over large data sets using Spark and MapReduce.

Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-Means clustering, Naive Bayes, and other approaches.

Power user of machine learning algorithms and libraries such as NumPy, Pandas, scikit-learn, Shiny, and ggplot2.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.

Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.

Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
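
A small, self-contained sketch of that extract-transform-load pattern is shown below; the in-memory SQLite database and the orders table stand in for the real source systems.

    # ETL sketch: SQL extract, light transform, CSV load (illustrative).
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect(":memory:")     # stand-in for Oracle/Teradata source
    conn.execute("CREATE TABLE orders (id INT, amount REAL, region TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 10.5, "east"), (2, 7.25, "west")])

    df = pd.read_sql("SELECT region, SUM(amount) AS total "
                     "FROM orders GROUP BY region", conn)
    df["total"] = df["total"].round(2)     # simple transform step
    df.to_csv("orders_summary.csv", index=False)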

Stored and retrieved data from data-warehouses using Amazon Redshift.

Worked on Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.

Applied various machine learning algorithms and statistical models such as decision trees, regression models, neural networks, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.

Used Data Warehousing concepts such as Ralph Kimball Methodology, Bill Inmon Methodology, OLAP, OLTP, Star Schema, Snowflake Schema, Fact Tables, and Dimension Tables.

Refined time-series data and validated mathematical models using analytical tools like R and SPSS to reduce forecasting errors.

Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
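
For illustration, a minimal imputation sketch in Python follows, assuming a small synthetic DataFrame; median imputation is one of several techniques that could apply here.

    # Missing-value imputation sketch on placeholder data.
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"age": [25, np.nan, 40],
                       "income": [50000.0, 62000.0, np.nan]})
    imputer = SimpleImputer(strategy="median")   # median is robust to outliers
    df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
    print(df)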

Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.

Environment: Hortonworks (Hadoop MapReduce), PySpark, Spark, R, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Erwin, SAS, AWS Redshift, ScalaNLP, Cassandra, Oracle, MongoDB, Cognos, SQL Server 2012, Teradata, DB2, SPSS, T-SQL, PL/SQL, Flat Files, XML.

Client: Massachusetts League of Community Health Centers - Boston, MA

Jan 2016 – Sep 2017

Role: Data Scientist

Description: The Massachusetts League of Community Health Centers' annual Community Health Institute, Staying Power: Resilience Through Innovation, will kick off with a pre-session focused on the timely issue of sexual harassment and bullying in the workplace. Services provided by the company include facilities management services to occupiers of commercial real estate as well as property management, leasing, capital markets, appraisal, and brokerage services to owners of commercial real estate.

Responsibilities:

Worked closely with business, data governance, SMEs, and vendors to define data requirements.

Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases under Cloud infrastructure, AWS, EMR, and S3.

Selected statistical algorithms (Two-Class Logistic Regression, Boosted Decision Trees, Decision Forest classifiers, etc.).

Used MLlib, Spark's Machine learning library to build and evaluate different models.
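
A toy sketch of building and evaluating one such MLlib model is given below; the four-row DataFrame is a synthetic placeholder so the example stays self-contained.

    # Spark ML sketch: fit a model and score it with AUC (toy data).
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("eval-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(Vectors.dense([0.0, 1.1]), 0.0), (Vectors.dense([2.0, 1.0]), 1.0),
         (Vectors.dense([0.1, 1.2]), 0.0), (Vectors.dense([2.2, 0.9]), 1.0)],
        ["features", "label"])

    model = LogisticRegression().fit(df)
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    # In practice, evaluate on a held-out split rather than the training data.
    print("AUC:", evaluator.evaluate(model.transform(df)))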

Worked with Teradata 14 tools such as FastLoad, MultiLoad, TPump, FastExport, Teradata Parallel Transporter (TPT), and BTEQ.

Involved in creating a Data Lake by extracting the customer's Big Data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.

Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and the MLlib libraries.

Created high-level ETL design documents and assisted ETL developers in the detailed design and development of ETL maps using Informatica.

Used R and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees, and Support Vector Machines for estimating the risks of welfare dependency.

Helped in migration and conversion of data from the Sybase database into Oracle database, preparing mapping documents and developing partial SQL scripts as required.

Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy Oracle and SQL Server database systems

Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.

Worked on predictive and what-if analysis using R on data from HDFS; successfully loaded files to HDFS from Teradata and loaded data from HDFS into Hive.

Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.

Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib.

Performed data mining using complex SQL queries to discover patterns, and used extensive SQL for data profiling and analysis to guide building the data model.

Environment: R, Machine Learning, Teradata 14, Hadoop MapReduce, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, AWS Redshift, ScalaNLP, Cassandra, Oracle, MongoDB, Informatica MDM, Cognos, SQL Server 2012, DB2, SPSS, T-SQL, PL/SQL, Flat Files, XML.

Client: Southwest Airlines, Dallas, TX May 2014 - Dec 2015

Role: Data scientist/Machine learning

Description:The mission of Southwest Airlines is dedication to the highest quality of customer service delivered with a sense of warmth, friendliness, individual pride, and company spirit.

Responsibilities:

Responsible for applying machine learning techniques (regression/classification) to predict outcomes.

Performed ad-hoc reporting, customer profiling, and segmentation using R/Python.

Tracked various campaigns, generating customer profiling analysis and data manipulation.

Provided R/SQL programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.

Utilized label encoders in Python to convert non-numerical significant variables to numerical ones, and identified their impact on pre- and post-acquisition metrics using a two-sample paired t-test.
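
The sketch below illustrates the pattern with synthetic numbers: label-encode a categorical variable, then run a paired t-test on pre/post measurements; the column names are invented for the example.

    # Label encoding plus a paired t-test (illustrative data).
    import pandas as pd
    from scipy import stats
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({
        "segment": ["gold", "silver", "gold", "bronze"],
        "pre_acquisition": [10.2, 8.1, 11.5, 6.3],   # metric before
        "post_acquisition": [12.0, 8.4, 13.1, 6.9],  # same units, after
    })
    df["segment_code"] = LabelEncoder().fit_transform(df["segment"])

    # Paired test: the same units measured before and after.
    t_stat, p_value = stats.ttest_rel(df["pre_acquisition"],
                                      df["post_acquisition"])
    print(t_stat, p_value)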

Worked with ETL in SQL Server Integration Services (SSIS) for data investigation and mapping to extract data; applied fast parsing and enhanced efficiency by 17%.

Developed Data Science content involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, and ETL for data extraction.

Designed a suite of interactive dashboards, which made it possible for the first time to scale and measure HR department statistics, and scheduled and published reports.

Created data presentations that reduced bias and told the true story of the people behind the data, pulling millions of rows using SQL and performing Exploratory Data Analysis.

Applied a breadth of knowledge in programming (Python, R); descriptive, inferential, and experimental design statistics; advanced mathematics; and database functionality (SQL, Hadoop).

Migrated data from heterogeneous data sources and legacy systems (DB2, Access, Excel) to centralized SQL Server databases using SQL Server Integration Services (SSIS).

Involved in defining the source-to-target business rules, data mappings, and data definitions.

Successfully interpreted and analyzed data and performed predictive modeling using Python with the NumPy and Pandas packages.

Performed data validation and data reconciliation between disparate source and target systems for various projects.

Utilized a diverse array of technologies and tools as needed to deliver insights, such as R, SAS, MATLAB, Tableau, and more.

Built Regression model to understand order fulfillment time lag issue using Scikit-learn in Python.
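
A hedged scikit-learn regression sketch of this kind of time-lag model follows; the three features are invented stand-ins for order attributes.

    # Regression sketch for a fulfillment time-lag target (synthetic data).
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))        # e.g. order size, distance, backlog
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    reg = LinearRegression().fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, reg.predict(X_test)))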

Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.

Used T-SQL queries to pull the data from disparate systems and Data warehouse in different environments.

Worked closely with the Data Governance Office team in assessing the source systems for project deliverables.

Extracted data from different databases per business requirements using SQL Server Management Studio.

Interacted with the ETL and BI teams to understand and support various ongoing projects.

Extensively used MS Excel for data validation.

Involved in data analysis using different analytic and modeling techniques.

Environment: Data Governance, SQL Server, Python, ETL, MS Office Suite - Excel (Pivot, VLOOKUP), DB2, R, Visio, HP ALM, Agile, Azure, MDM, SharePoint, Data Quality, Tableau, and Reference Data Management.

Client: Coventry Health Care, Bethesda, Maryland Aug 2013 – Apr 2014

Role: Machine Learning Engineer

Description: Coventry Health Care Management utilizes multiple software systems to support the intake and processing of authorization requests, the exchange of data between the payer and vendors contracted to perform services on its behalf, manage case and disease programs, provide robust reporting and decision support, and generally automate and facilitate its business processes.

Responsibilities:

Coded R functions to interface with the Caffe Deep Learning Framework.

Worked in the Amazon Web Services cloud computing environment.

Worked with several Python packages including NumPy, pandas, PySpark, CausalInfer, and spacetime.

Implemented end-to-end systems for Data Analytics and Data Automation, and integrated them with custom visualization tools using R, Mahout, Hadoop, and MongoDB.

Gathered all the data required from multiple data sources and created datasets to be used in analysis.

Performed Exploratory Data Analysis and data visualizations using R and Tableau.

Performed proper EDA and univariate and bivariate analysis to understand intrinsic and combined effects.

Worked with Data Governance, Data Quality, data lineage, and Data Architects to design various models and processes.

Independently coded new programs and designed tables to load and test the programs effectively for the given POCs using Big Data/Hadoop.

Designed data models and data flow diagrams using Erwin and MS Visio.

As an architect, implemented an MDM hub to provide clean, consistent data for an SOA implementation.

Developed, implemented, and maintained Conceptual, Logical, and Physical data models using Erwin for forward/reverse engineered databases.

Established Data Architecture strategy, best practices, standards, and roadmaps.

Led the development and presentation of a data analytics data-hub prototype with the help of the other members of the emerging solutions team.

Performed data cleaning and imputation of missing values using R.

Worked with the Hadoop ecosystem covering HDFS, HBase, YARN, and MapReduce.

Took up ad-hoc requests from different departments and locations.

Used Hive to store the data and performed data cleaning steps for huge datasets.

Environment: Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, Rational Rose, RequisitePro, Hadoop, PL/SQL, etc.

Client: People Tech Group - Hyderabad, India Sep 2011 – Jul 2013

Role: Data Analyst/Data Modeler

Description: Founded in 2006, People Tech is an emerging leader in the Enterprise Applications and IT Services marketplace. People Tech draws its expertise from strategic partnerships with technology leaders such as Microsoft, Oracle, and SAP, and combines that with the deep understanding of its employees.

Responsibilities:

Analyzed data sources and requirements and business rules to perform logical and physical data modeling.

Analyzed and designed best fit logical and physical data models and relational database definitions using DB2. Generated reports of data definitions.

Involved in Normalization/De-normalization, Normal Form and database design methodology.

Maintained existing ETL procedures, fixed bugs and restored software to production environment.

Developed the code as per the client's requirements using SQL, PL/SQL and Data Warehousing concepts.

Involved in Dimensional modeling (Star Schema) of the Data warehouse and used Erwin to design the business process, dimensions and measured facts.

Worked with Data Warehouse Extract and load developers to design mappings for Data Capture, Staging, Cleansing, Loading, and Auditing.

Developed enterprise data model management process to manage multiple data models developed by different groups

Designed and created Data Marts as part of a data warehouse.

Wrote complex SQL queries for validating the data against different kinds of reports generated by Business Objects XIR2.

Used the Erwin modeling tool to publish a data dictionary, review the model and dictionary with subject matter experts, and generate data definition language.

Coordinated with DBAs in implementing database changes and updating data models with changes implemented in development, QA, and production. Worked extensively with the DBA and reporting teams to improve report performance through appropriate indexes and partitioning.

Developed Data Mapping, Transformation, and Cleansing rules for the Master Data Management architecture involving OLTP, ODS, and OLAP.

Tuned and optimized code using techniques such as dynamic SQL, dynamic cursors, SQL query tuning, and writing generic procedures, functions, and packages.

Experienced in GUI and Relational Database Management Systems (RDBMS), design of OLAP system environments, and report development.

Extensively used SQL, T-SQL and PL/SQL to write stored procedures, functions, packages and triggers.

Prepared data analysis reports weekly, biweekly, and monthly using MS Excel, SQL, and UNIX.

Environment: ER Studio, Informatica PowerCenter 8.1/9.1, PowerConnect/PowerExchange, Oracle 11g, Mainframes, DB2, MS SQL Server 2008, SQL, PL/SQL, XML, Windows NT 4.0, Tableau, Workday, SPSS, SAS, Business Objects, Unix Shell Scripting, Teradata, Netezza, Aginity.

Client: InnovaInfotech - Hyderabad, India Mar 2009 – Aug 2011

Role: Data Analyst

Description: INNOVA InfoTech is an offshore software services and IT consulting company based in Bangalore, India. As a committed outsourcing partner and IT vendor, its goal is to ensure cost-effectiveness, technical excellence, and on-time deliveries. While it takes care of end-to-end programming and consulting needs, its clients focus on core business activities that correlate directly to their revenues and profitability. Strategic partnership gives clients access to the latest technology, skilled manpower, and a scalable team, which ultimately results in lower risk and higher ROI.

Responsibilities:

Worked with internal architects, assisting in the development of current and target state data architectures.

Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.

Involved in defining the business/transformation rules applied for sales and service data.

Implemented the metadata repository, transformations, data quality maintenance, data standards, the data governance program, scripts, stored procedures, triggers, and execution of test plans.

Defined the list codes and code conversions between the source systems and the data mart.

Involved in defining the source-to-target business rules, data mappings, and data definitions.

Responsible for defining the key identifiers for each mapping/interface.

Remained knowledgeable in all areas of business operations in order to identify systems needs and requirements.

Performed data quality checks in Talend Open Studio.

Updated the Enterprise Metadata Library with any changes or updates.

Documented data quality and traceability for each source interface.

Established standard operating procedures.

Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor system.

Environment: Windows Enterprise Server 2000, SSRS, SSIS, Crystal Reports, DTS, SQL Profiler, and Query Analyzer.


