PROFESSIONAL SUMMARY:
Highly efficient Data Scientist/Data Analyst with 8+ years of experience in Data Analysis, Machine Learning, Data Mining with large sets of Structured and Unstructured data, Data Acquisition, Data Validation, Predictive Modeling, Data Visualization, and Web Scraping. Adept in statistical programming languages like R and Python, and in Big Data technologies like Hadoop and Hive.
Proficient in managing the entire data science project life cycle and actively involved in all its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization.
Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
Able to build advanced statistical and predictive models, such as generalized linear models, decision trees, neural networks, ensemble models, Support Vector Machines (SVM), and Random Forest.
Experience with a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction, including developing, deploying, and maintaining scalable production NLP models. Think creatively and propose innovative ways to look at problems by applying data mining approaches to the available information.
Experience in working with relational databases (Teradata, Oracle) with advanced SQL programming skills.
Experience in Big Data platforms such as Hadoop distributions (MapR, Hortonworks and others), Aster, and graph databases.
Identify or create the appropriate algorithm to discover patterns, and validate findings using an experimental and iterative approach.
Worked closely with product managers, service development managers, and the product development team to productize the algorithms developed.
Experience in designing visualizations using Tableau software and publishing and presenting dashboards.
Experience in operations research / optimization.
Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis and design.
Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
Skilled in data parsing, data manipulation, and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
Experience in using various packages in R and Python, such as ggplot2, caret, dplyr, RWeka, gmodels, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and Beautiful Soup.
Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau.
Experience with tools in the distributed computing, GPU, cloud platform, and Big Data domains (e.g. HPC, GCP, AWS, MS Azure, Hadoop, Spark).
Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, PySpark, and Spark SQL.
Hands-on experience in implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis.
Good knowledge of Proofs of Concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared it for data exploration using data munging.
Good industry knowledge, analytical and problem-solving skills, and ability to work well both within a team and individually.
Experience in designing visualizations using Tableau software and in publishing and presenting dashboards and storylines on web and desktop platforms.
Experience and technical proficiency in designing and data modeling online applications, and as Solution Lead for architecting Data Warehouse/Business Intelligence applications.
Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
Highly skilled in using visualization tools like Tableau, ggplot2, Dash, and Flask for creating dashboards.
Extracted and worked with data from various database sources such as Oracle, SQL Server, and DB2, and regularly used JIRA and other internal issue trackers for project development.
Highly creative, innovative, committed, intellectually curious, business savvy with good communication and interpersonal skills.
Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
Experience implementing deep learning algorithms such as Artificial Neural Networks (ANN) and Recurrent Neural Networks (RNN), tuning hyperparameters, and improving models with the Python package TensorFlow.
Extensive experience in operating Big Data pipelines (Spark, Hive, Presto, SQL engines) in both batch and streaming modes.
Extracted data from HDFS and prepared data for exploratory analysis using data munging.
Implemented, tuned and tested the model on AWS EC2 with the best algorithm and parameters.
EDUCATION
Master's (MS) in Computer Technology. School of Technology, Eastern Illinois University. Charleston, IL
Master's (MA) in Economics. Department of Economics, Eastern Illinois University. Charleston, IL
Master's (MA) in Economics. Department of Economics, University of Karachi. Karachi, Pakistan
Bachelor's in Commerce. Department of Economics, University of Karachi. Karachi, Pakistan
TECHNICAL SKILLS
Languages
C, C++, XML, R/RStudio, SAS, SAS Enterprise Guide, Python 2.x/3.x, Java, SQL, Shell Scripting, Maven, Scala, Spark 2/2.3, Spark SQL, Spark Streaming, Hadoop, MapReduce, R (Packages: Stats, Zoo, Matrix, data.table, OpenSSL), HDFS, Eclipse, Anaconda, Jupyter Notebook
NoSQL Databases
Cassandra, HBase, MongoDB, MariaDB
Statistics
Hypothesis Testing, ANOVA, Confidence Intervals, Bayes' Law, MLE, Fisher Information, Principal Component Analysis (PCA), Cross-Validation, Correlation.
BI Tools
Tableau, Tableau Server, Tableau Reader, Splunk, SAP Business Objects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse
Algorithms
Logistic Regression, Random Forest, XGBoost, KNN, SVM, Neural Networks, Linear Regression, Lasso Regression, Generalized Linear Models, Boxplots, K-Means Clustering, SVN, PuTTY, WinSCP, Redmine (Bug Tracking, Documentation, Scrum), AI, Teradata, Tableau, H2O Flow, Splunk, GitHub.
Data Analysis and Data Science
Deep Neural Networks, Logistic Regression, Decision Trees, Random Forests, KNN, XGBoost, Ensembles (Bagging, Boosting), Support Vector Machines, Neural Networks, graph/network analysis, time series analysis (ARIMA model), and NLP.
Big Data
Hadoop, HDFS, Hive, PuTTY, Spark, Scala, Sqoop
Reporting Tools
MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS
Database Design Tools and Data Modeling
MS Visio, ERWIN 4.5/4.0, Star Schema/Snowflake Schema modeling, Fact & Dimension tables, physical & logical data modeling, Normalization and De-normalization techniques, Kimball & Inmon Methodologies
PROFESSIONAL EXPERIENCE:
Mariner's Bank – Edgewater, NJ Sep 2018 – Present
Data Scientist
Mariner's Bank is a community-focused bank founded by a conscientious group of community leaders and local investors determined to provide a better alternative to big city banking. The Bank provides traditional deposit services such as checking, savings, money market, certificates of deposit, and individual retirement accounts. The Bank also offers residential, commercial, and construction loans, mortgages, home equity, and small business loans.
Responsibilities:
Help clients effectively use data to make better decisions by applying various statistical and machine-learning techniques to a wide variety of challenging questions.
Lead cross-functional advertising and communications teams to evaluate and integrate new tools.
Devised infrastructure for advertising and organic social media monitoring using sources such as social media APIs.
Managed behavior score risk modeling for consumer and business credit card products.
Created systems to identify notable donors and consolidate sparse donor-provided information like occupation using web scraping, WordNet/other NLP strategies, and various APIs.
Used various machine learning algorithms such as Linear Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, KNN, Decision Tree Regressor, SVM, Bagging Decision Trees, Random Forest, AdaBoost, and XGBoost.
Worked with ETL of large data sets using Python, Teradata, SSIS, or Talend to source, load, and verify data of various formats.
Work closely with model risk governance, credit bureaus, external consultants, and compliance and regulatory response teams to ensure proper development and installation of performance reporting and model tracking.
Leverage a broad stack of technologies (Python, Docker, AWS, Airflow, and Spark) to reveal the insights hidden within huge volumes of numeric and textual data.
Solved business problems including segmenting customers by purchasing behavior, modeling customer profitability and lifetime, forecasting financial metrics on the scale of months or years, and predicting win/loss rates in contract negotiations.
Leverage multiple modeling techniques from regression to classification to segmentation.
Develop and execute statistical and mathematical solutions to business problems: frame the problem, develop a roadmap, and communicate the intended approach and quantitative methods for developing the solution.
Improve products and services or solve problems using best practices and knowledge of internal and/or external business issues.
Analyze internal processes to identify opportunities for improvement, and devise and implement innovative workflow solutions that improve time to market, quality, and efficiency.
Use analytical rigor and statistical methods to analyze large amounts of data, extracting actionable insights with techniques such as data mining, optimization, and machine learning.
Work with data users to determine, create, and populate optimal data architectures, structures, and systems. Plan, design, and optimize data throughput and query performance.
Craft and implement data pipelines using AWS Glue, Lambda, Spark, and Python.
Work with Data Engineers to determine how to best source data, including identification of potential proxy data sources, and design business analytics solutions, considering current and future needs, infrastructure and security requirements, load frequencies, etc.
Federal Home Loan Bank of Boston - Boston, MA Jun 2017 - Sep 2018
Data Scientist
Federal Home Loan Bank of Boston is a bank for banks, credit unions, community development financial institutions, and insurance companies. Cooperatively owned by more than 440 New England financial institutions, the Bank provides reliable access to wholesale credit for these members and other qualified borrowers. The mission of the Federal Home Loan Bank of Boston is to provide highly reliable wholesale funding, liquidity, and a competitive return on investment to its member financial institutions in New England. We strive to consistently develop and deliver the best financial products, services, and expertise that support housing finance, community development, and economic growth, including programs targeted to lower-income households.
Responsibilities:
Worked with the team doing research in the fields of macroeconomics, financial econometrics, and housing prices across the US by using Stata and R to analyze data.
Worked with policymakers and senior economists to develop predictive models for key economic indicators.
Our team worked on housing market price regression for the suburban Boston area. Using the regression model with training and test data sets, we successfully predicted house prices for Boston's suburbs.
Used a variety of Machine Learning models including linear regression, logistic regression, time series modeling, decision trees (including random forests/bootstrapping/boosting), clustering algorithms such as Gaussian Mixture Models/K-Means.
Wrote up reports to deliver at meetings translating the technical results of our modeling into clear, straightforward language understood by non-experts.
Developed machine-learning models to explain patterns in data associated with leverage-based asset bubbles.
Used web scraping algorithms to programmatically obtain housing data in key cities across the US.
Parsed scraped data with Pandas, BeautifulSoup, and regex, and worked with various machine learning and statistical models to analyze housing prices nationwide and their connection to leverage in the financial industry.
Created several asset pricing models with KNN and Random Forests to analyze house prices in various regions as a function of both micro-level features as well as macroeconomic indicators.
Utilized Gaussian Mixture Models to cluster loans in the repo market based on risk indicators.
Implemented Monte-Carlo simulations and simulated reinforcement learning models in response to sovereign debt crises.
DoITT - City of New York - New York, NY Aug 2016 - Jun 2017
Data Analyst/ Data Scientist
The New York City Department of Information Technology & Telecommunications (DoITT) provides for the sustained, efficient, and effective delivery of IT services, infrastructure, and telecommunications to enhance service delivery to the City's residents, businesses, employees, and visitors.
Fire Risk Predictability and Analysis
A machine-learned model (using R) was built from recent historical data from Inspectional Services Department violations, Assessing Department building/property data, and incident reports from the New York Fire Department to provide measures of fire risk for addresses across the City. The model's performance was encouraging and indicated that it could significantly improve the City's efforts to avert fire incidents before they happen. Fire risk patterns were also analyzed according to housing type, neighborhood, prior citizen complaints, and other parcel-level variables.
Human Resources Analytics
This project determines why the most talented and experienced employees are leaving prematurely. It also predicts which valuable employees will leave next. The dataset includes factors such as satisfaction level, last evaluation, number of projects, time spent at the company, promotion frequency, workplace accidents, department, and salary.
Responsibilities:
Worked with data scientists and the research team to gain valuable insights.
Worked on data cleaning and ensured data quality, consistency, and integrity using Pandas and NumPy.
Performed data imputation using Scikit-learn package in Python.
Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python and dashboards in Tableau.
Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
Worked with Amazon EC2-based cloud-hosted architecture systems to provide solutions for the client.
Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
Worked with different data formats such as JSON and XML, and applied machine learning algorithms in Python.
Used the AWS environment for loading data files from the cloud servers.
Collaborated with business leaders to analyze problems, optimize processes, and build presentation dashboards.
Merged data into the AWS environment so that several teams could access it from different locations, saving time and increasing security.
Programmed R and Python scripts and modules for data collection, cleaning, analysis, and visualization.
Updated legacy data systems to convert hard copies to searchable online database format.
Environment: Apache Hadoop, Linux, SQL, Tableau, Python (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), AWS.
Euclid Insurance - New York, NY (Offshore) May 2011 - Jan 2016
Data Analyst/ Python Developer
Euclid Managers has been serving the employee benefits brokerage community since 1976. They work with over 2,000 independent brokers and agents. Their group benefits division proudly represents UnitedHealthcare, MetLife, and Delta Dental of IL. Their team assists brokers in finding the right product and most competitive rates for their group clients. Their mission is to deliver to brokers the best tools, superior customer service, and an excellent carrier portfolio in the marketplace.
Involved in developing statistical models and algorithms to predict, classify, quantify, and/or forecast business metrics. Partnered with business units and Analytics colleagues outside the workgroup to simulate and validate processes to maximize success metrics.
Responsibilities:
Used Python libraries like Beautiful Soup and NumPy.
Created various types of data visualizations using Python and Tableau.
Monitored and tracked process performance using analytics tools like Tableau dashboards and R.
Utilized standard Python modules such as csv, robotparser, itertools, and pickle for development.
Developed MapReduce programs to parse the raw data and create intermediate data, which was then loaded into Hive partitioned tables.
Involved in creating Hive ORC tables, loading data into them, and writing Hive queries to analyze the data.
Experience in performance analysis and capacity planning for growing MongoDB and Hadoop clusters.
Created views in Tableau Desktop that were published to the internal team for review, further data analysis, and customization using filters and actions.
Worked on Python OpenStack APIs and used NumPy for Numerical analysis.
Used Python scripts to update content in the database and manipulate files.
Used Python for creating graphics, data exchange, and business logic implementation.
Troubleshot, fixed, and deployed many Python bug fixes for the applications, and was involved in fine-tuning existing processes following advanced patterns and methodologies.
Skilled in using collections in Python for manipulating and looping through different user defined objects.
Used DDL and DML for writing triggers, stored procedures, and data manipulation.
Worked with the team to analyze, design, and develop the database using ER diagrams; involved in design, development, and testing of the system.
Developed SQL Server Stored Procedures, Tuned SQL Queries (using Indexes).
Installed numerous Python packages using pip and easy_install.
Environment: Python 2.7, Tableau, R, Windows XP, UNIX, HTML, SQL Server 2005.