Rajkumar Conjeevaram Mohan
**** ******* **** **, ****** Elm, TX - 75068 | +1-202-***-**** | **********.****@*****.***

SUMMARY
Data Scientist with strong experience in developing and implementing machine learning models, specializing in time-series forecasting and statistical modeling. Adept at researching, experimenting, and deploying AI/ML solutions across digital platforms to drive business insights and automation. Skilled in the full AI/ML development life-cycle, including model evaluation, optimization, and visualization for high-performance outcomes. Passionate about leveraging cutting-edge AI research to solve complex business challenges. Strong collaborator, working closely with IT and Data Science teams to integrate AI-driven solutions effectively.

EDUCATION
The George Washington University, Washington, DC  Dec 2023
Master of Science, Data Science (GPA: 3.91)
Relevant courses: Machine Learning, Time Series, Natural Language Processing, Deep Learning

Imperial College London, Greater London, UK  Nov 2017
Master of Science in Computing (Specialism in Artificial Intelligence)
Relevant courses: Advanced Statistical Machine Learning, Intelligent Data Analysis

University of Liverpool, Liverpool, UK  July 2013
Bachelor of Science with Honors in Computer Information Systems (GPA: 4.0)
Degree Classification: First Class Honors
TECHNICAL SKILLS
Coding (Python, R, PySpark, Apache Spark with Scala), Data Analysis, Machine Learning Modeling, Software Development (Python, Java), Mathematical Modeling, Web Development (HTML, CSS, JavaScript), Database Development (RDBMS, NoSQL, PL/SQL), Database Modeling (MySQL, Oracle), Project Management, Innovation, Interpersonal Skills, Critical Thinking, Data Entry, Coordination, Business Requirements, Business Processes, Budgeting, and Attention to Detail.

WORK EXPERIENCE
Software Engineer May 2024 - Jan 2025
Inten IT Solutions
• Loaded large volumes of warranty-claim text data using Python and pre-processed them on PySpark, a distributed computing platform, for accelerated processing.
• Used the SentencePiece tokenizer to reduce the vocabulary size and make the model memory-efficient.
• Trained the embedding vector space with a BiLSTM-based RNN, which captures contextual meaning better than shallow/static embedding models.
• Trained a BERT model on large volumes of warranty-claim text to identify components that fail frequently.
• Froze the trained embedding model while fine-tuning BERT, surfacing both the frequently failing components and, most importantly, the failure reasons cited in claim forms, which helped the quality-control team revise manufacturing policies.
• Reduced data retrieval time by almost 40% by migrating from an on-premises relational database to Google BigQuery, and tuned performance by enabling clustering, partitioning, and caching on frequently used queries.
• Trained deep learning models with PyTorch and used Python extensively across the firm's workloads.
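The vocabulary-reduction idea behind the subword tokenization above can be sketched in plain Python. This is illustrative only: the actual work used the SentencePiece library, which learns its subword inventory from data, whereas the inventory and sample words below are made up.

```python
# Illustrative only: greedy longest-match subword tokenization over a toy,
# hand-picked inventory. Real SentencePiece learns these pieces from data.
SUBWORDS = {"engine", "failure", "gasket", "fracture", "s"}  # hypothetical inventory

def subword_tokenize(word: str) -> list[str]:
    """Split a word into the longest matching subwords, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest piece first
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])               # unknown char falls back to itself
            i += 1
    return pieces

words = ["engine", "engines", "failure", "failures",
         "gasket", "gaskets", "fracture", "fractures"]
word_vocab = set(words)
subword_vocab = {p for w in words for p in subword_tokenize(w)}
print(len(word_vocab), len(subword_vocab))       # 8 word types vs 5 reusable subwords
```

Because inflected forms reuse the same pieces (e.g. "engines" becomes "engine" + "s"), the subword vocabulary grows much more slowly than the word vocabulary, which is what keeps the embedding table memory-efficient.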
• Converted the client's user-manual PDF documents into text embeddings and built a search tool on GCP AlloyDB, enabling the client (a manufacturer) to look up relevant information instantly.
Tools used: PyCharm for deep learning and machine learning workloads, MySQL and BigQuery for storing documents, and Google Cloud Platform infrastructure throughout.

Data Science Consultant  Aug 2023 – Jan 2024
World Bank Group
• Downloaded remote-sensing data from PlanetScope and used QGIS to generate training images, rasterizing vector annotations into bounding boxes covering vehicles on highway lanes.
• Generated training images were analyzed for patterns representing stationary and non-stationary vehicles using Python and Matplotlib to determine relevant augmentations.
• Applied a road-segmentation deep learning model to isolate highways in aerial images, then extracted patches from the masked images to train an object-detection model for vehicles on highway lanes.
• Combined the predicted bounding boxes with additional metadata to differentiate between logistics and personal vehicles.
• Developed and implemented a reliable and scalable framework to estimate trade volume exchanged at the South African border to enhance operational transparency and improve trust among trading parties.
• Applied deep learning to satellite images to count logistics assets, then combined the counts with external data to approximate trade volume.
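The segment-then-patch step above can be sketched with numpy. This is a minimal illustration, not the project code: the array shapes, patch size, and 25% road-coverage threshold are all assumptions.

```python
# Sketch: after a segmentation model masks out everything but the highway,
# cut the scene into fixed-size patches and keep only patches that actually
# contain enough road pixels to be useful for the object detector.
import numpy as np

def road_patches(image, road_mask, size=64, min_road_frac=0.25):
    """Yield (row, col, patch) for patches whose road coverage meets the threshold."""
    h, w = road_mask.shape
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            window = road_mask[r:r + size, c:c + size]
            if window.mean() >= min_road_frac:          # fraction of road pixels
                yield r, c, image[r:r + size, c:c + size]

# Toy 128x128 scene whose "highway" fills only the top-left quadrant.
img = np.random.rand(128, 128, 3)
mask = np.zeros((128, 128))
mask[:64, :64] = 1.0
patches = list(road_patches(img, mask))
print(len(patches))                                     # 1 of 4 patches contains road
```

Filtering out empty patches keeps the detector's training set focused on regions where vehicles can actually appear.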
Research Assistant June 2022 – Jan 2023
The George Washington University – Department of Political Science
• Used R programming language to load large data with complex relationships between vertices and pre-processed them accordingly for different visualizations.
• Removed duplicate edges carrying conflicting descriptions, years, or relationship types to build a spatial graph displaying connections between militant organizations across the globe.
• Leveraged graph-theory metrics such as betweenness and degree centrality to size nodes by influence and colored them by traffic, letting researchers spot extremist hubs in the large network with ease.
• Created an interactive web-based application using R with Shiny framework to display a dashboard of different visualizations to help researchers understand how militant organizations function and evolve.
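The centrality-based node sizing above can be sketched in a few lines. This is plain Python rather than the original R/visNetwork code, and the edge list is hypothetical.

```python
# Sketch: size nodes by degree centrality so influential organizations stand out.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]  # hypothetical

# Degree centrality: fraction of other nodes each node is connected to.
nodes = {n for e in edges for n in e}
degree = {n: 0 for n in nodes}
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
centrality = {n: d / (len(nodes) - 1) for n, d in degree.items()}

# Map centrality onto a drawing size, as done for the network dashboard.
node_size = {n: 10 + 40 * c for n, c in centrality.items()}
print(max(centrality, key=centrality.get))   # "A": the best-connected hub
```

Betweenness centrality (used alongside degree in the dashboard) follows the same pattern but counts shortest paths passing through each node.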
• Created a timeline-based hierarchical plot showing how a group evolved over time, i.e., patterns of splintering and merging at particular points, letting researchers better understand group behavior by associating it with private/undisclosed information.

Data Scientist  Feb 2022 – June 2022
Briggs & Stratton
• Identified patterns in customer purchasing habits and product preferences, and turned data into actionable recommendations that increased sales by 15%.
• Employed advanced analytics, including a Bayesian causal-inference model, to determine the cause of premature piston-ring failures in the company's engines and helped the business fix the problem.
• Analyzed historical sales data to identify trends and patterns, built a quarterly forecast that accurately predicted demand, and recommended inventory-level optimizations based on the results.
• Analyzed data from various sources (e.g., sensors, logs) to detect anomalies or trends that could indicate quality issues in production processes or vehicle performance.
• Built predictive models using various machine learning tools to predict the possibility of equipment failure.
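The equipment-failure modeling above can be sketched with scikit-learn. The actual features, data, and model choice at Briggs & Stratton are not public, so this trains a decision tree on synthetic sensor readings where high temperature and vibration indicate imminent failure; the feature names and scales are assumptions.

```python
# Hedged sketch: predict equipment failure from synthetic sensor data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
temperature = rng.normal(70, 10, n)          # assumed scale, degrees C
vibration = rng.normal(2.0, 0.5, n)          # assumed scale, mm/s RMS
# Synthetic labeling rule: units running hot AND vibrating hard tend to fail.
failure = ((temperature > 75) & (vibration > 2.2)).astype(int)

X = np.column_stack([temperature, vibration])
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, failure)
print(round(model.score(X, failure), 2))     # training accuracy on the synthetic rule
```

An axis-aligned rule like this is exactly what a shallow tree can represent, which is why the sketch fits it essentially perfectly; real sensor data would need held-out evaluation.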
Data Engineer / Scientist  Jan 2018 – Oct 2021
SF Technology Solutions
• Developed code to handle large data and performed visualization in R using the `ggplot` library.
• Identified and rectified data errors and inconsistent features, and imputed missing values using clustering and Gaussian methods.
• Performed relevant preprocessing, ensuring consistent units across features, before fitting statistical machine learning models and visualizing their results.
• Used PySpark to leverage the power of distributed computing for preprocessing large volumes of data.
• Communicated technical findings and insights clearly to non-technical stakeholders.
• Developed and presented analytical insights on medical and other data.
• Used AWS cloud services for computing resources and deploying the application.
• Created several types of data visualizations using Python and Tableau.
• Collected data needs and requirements by interacting with other departments.
• Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing duplicates and imputing missing values.
• Implemented various visualizations using Matplotlib to explore the data and communicate results clearly.
• Developed classification, regression, and deep learning models in Python and optimized them to deliver the best performance the available data allowed.
• Used GitHub to maintain versions and collaborate with the team.
• Created Docker containers to ensure compatibility across different environments.
• Developed PySpark code to process the data on Amazon EMR to perform the necessary transformations based on the STMs developed.
• Worked on different data formats such as JSON and XML and performed machine learning algorithms in Python.
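The Gaussian imputation mentioned in the bullets above can be sketched with numpy: fit a normal distribution to a column's observed values and fill the gaps with draws from it, preserving the column's mean and spread. The data here is synthetic, and the clustering-based variant is not shown.

```python
# Sketch: Gaussian imputation of missing values in one numeric column.
import numpy as np

rng = np.random.default_rng(42)
col = rng.normal(100.0, 15.0, size=500)                  # synthetic column
col[rng.choice(500, size=50, replace=False)] = np.nan    # knock out 10% of it

observed = col[~np.isnan(col)]
mu, sigma = observed.mean(), observed.std()              # fit the Gaussian
filled = col.copy()
missing = np.isnan(filled)
filled[missing] = rng.normal(mu, sigma, size=missing.sum())

print(np.isnan(filled).any())                            # False: no gaps remain
```

Unlike constant mean-filling, sampling from the fitted distribution avoids collapsing the column's variance, which matters when the downstream model is sensitive to feature spread.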
Database Developer  Aug 2013 – Nov 2017
AXS Technologies
• Responsible for designing databases that meet the application requirements and creating Entity Relationship diagrams.
• Optimized databases using appropriate techniques such as table normalization, indexing, and setting appropriate cache size.
• Wrote an additional PL/SQL security layer on top of the application's back-end logic to prevent SQL injection by verifying that input data complies with the expected format and standards.
• Team player with strong experience in solving complex problems.
• Created MySQL PL/SQL routines triggered by events such as attempting to insert duplicate records, deletion of records, and memory-related issues.
• Involved in writing optimal SQL queries for efficient retrieval of data.
• Wrote shell scripts to automate tasks such as triggering Java code that leverages Selenium to scrape information off the web.
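The injection-prevention idea above can be illustrated with a Python/sqlite3 analogue (the original layer was PL/SQL): parameterized queries keep user input out of the SQL text, so a malicious value cannot change the statement's structure. The table and payload below are made up.

```python
# Sketch: parameter binding treats hostile input as a literal value, not SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'staff')")

malicious = "alice' OR '1'='1"                 # classic injection payload
# The ? placeholder binds the payload as a string literal.
rows = conn.execute("SELECT role FROM users WHERE name = ?", (malicious,)).fetchall()
print(rows)                                    # [] : no user has that literal name
```

Input-format validation, as in the PL/SQL layer described above, complements binding by rejecting malformed values before they reach the database at all.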
• Developed a webpage that enables managers to visualize sales performance.

TECHNICAL PROJECTS
Neural Machine Translation (English – Czech) Deep Learning Project 2024
• Translated text from English to Czech, achieving a METEOR score of 0.856.
• SentencePiece, a language-agnostic tokenizer, was employed to minimize vocabulary size and efficiently train the LLM.
• Implemented GPT-3, a large language model (LLM), from scratch using PyTorch.

Brain Tumor 3D Segmentation  Deep Learning Project  2023
• U-Net, a semantic segmentation model was trained on BraTS 2017-2020 challenge dataset using GPU.
• Employed skip connections to ensure continued gradient flow during the backpropagation to train a deep network.
• The model precisely localized tumorous tissue in brain scans with a Jaccard (IoU) score of 80%, aiding doctors with earlier diagnoses of cancer.
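The reported Jaccard (IoU) score is intersection over union of the predicted and ground-truth tumor voxels. The sketch below uses toy 2-D masks for brevity; the project used 3-D BraTS volumes, but the formula is identical.

```python
# Sketch: Jaccard (IoU) score between a predicted mask and the ground truth.
import numpy as np

def jaccard(pred: np.ndarray, truth: np.ndarray) -> float:
    """IoU between two boolean masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / union if union else 1.0    # two empty masks agree perfectly

truth = np.zeros((8, 8), dtype=bool)
truth[2:6, 2:6] = True                               # 16-pixel toy "tumor"
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True                                # prediction shifted by one pixel
print(jaccard(pred, truth))                          # 9 / 23 ≈ 0.391
```

IoU penalizes both missed tumor voxels and false positives, which is why it is a stricter segmentation metric than plain pixel accuracy.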
US Air Pollution Prediction and Forecast Time Series Forecasting 2022
• Forecasted the CO Air Quality Index with an R²-score of 0.6811.
• Converted the non-stationary signal into a stationary one by performing relevant transformations such as logarithmic scaling and differencing.
• Used Generalized Partial Autocorrelation to uncover the order of the Autoregressive and Moving Average processes that generated the data.
• Fed the identified orders to the Levenberg–Marquardt algorithm to estimate the process coefficients for forecasting.
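The stationarizing transforms above can be sketched with numpy: a log transform turns multiplicative growth into a linear trend, and first differencing removes that trend, leaving a roughly constant series. The exponential-growth series here is synthetic.

```python
# Sketch: log transform + first differencing to stationarize a series.
import numpy as np

t = np.arange(1, 101)
series = 100 * np.exp(0.05 * t)            # non-stationary: exponential trend

logged = np.log(series)                    # a straight line: log(100) + 0.05 * t
differenced = np.diff(logged)              # constant: the growth rate 0.05

print(round(differenced.std(), 10))        # 0.0 (up to float error): trend removed
```

On real AQI data the result is only approximately stationary, which is why order identification (e.g. via GPAC) follows the transforms rather than preceding them.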
Credit Card Default Data Analysis 2022
• Classified bank clients likely to default on their next credit card bill with an F1-score of 81.3%.
• Trained a decision tree in R with the `caret` package; its interpretable rules supported straightforward decision making.
• Mitigated the risk of financial insolvency by carefully balancing the model's sensitivity between recalling high-risk and low-risk clients, so the institution does not lose potential clients while flagging high-risk ones.
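The F1-score trade-off described above can be made concrete in plain Python (the project itself used R with `caret`). The confusion-matrix counts below are hypothetical, chosen to contrast an aggressive threshold with a conservative one.

```python
# Sketch: F1 balances precision (not losing good clients to false alarms)
# against recall (catching clients who will actually default).
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # sensitivity to high-risk clients
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for two classification thresholds.
aggressive = f1_score(tp=90, fp=40, fn=10)   # catches defaults, loses good clients
conservative = f1_score(tp=60, fp=5, fn=40)  # keeps clients, misses defaults
print(round(aggressive, 3), round(conservative, 3))   # 0.783 0.727
```

Because F1 is the harmonic mean of precision and recall, it drops sharply whenever either error type dominates, which makes it a natural single number for tuning this balance.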
Skills/Topics: Virtual Private Cloud, Data & ML Pipelines, Server Administration (Ubuntu), Project Management, Natural Language Processing (NLP), Computer Vision, Generative AI, Large Language Models (LLM), Machine Learning, Data Science, Data Analytics, Probabilistic Graphical Models, Big Data & Analytics, Distributed Training, Data Wrangling, NoSQL, Data Management, Data Structures, Data Visualization, Artificial Neural Networks (ANN), Deep Learning, Transfer Learning, Reinforcement Learning, Evolutionary Algorithms, Time Series Analysis, Decision Trees, Ensemble Models, Boosting Algorithms, Geographic Information Systems, Rasterization, DNA Transcription, Gene Clustering, Statistical Analysis

Platforms/Tools/IDEs: Docker, Databricks, Google Cloud Platform (GCP), Amazon Web Services (AWS), GCP BigQuery, Vertex AI, GCP Compute Engine, AWS EC2, AWS S3, VS Code, PyCharm Professional, Jupyter Notebook, GitHub & GitLab, TensorFlow, Keras, PyTorch, Tableau, RStudio, Eclipse IDE for Java Development, QGIS, MongoDB, Neo4j, Apache Hive, MLOps, CI/CD Pipelines, Google Cloud Storage, Google Bigtable

Languages/Libraries/SDKs: Python (Advanced), R (Advanced), PySpark (Advanced), JavaScript (Intermediate), JSON (Intermediate), PHP (Basic), Java (Basic), .NET (Basic), HTML (Intermediate), CSS (Intermediate), Spark (Intermediate), Hadoop, Ubuntu, Scikit-Learn, Pandas, NumPy, Matplotlib, Plotly, Seaborn, NLTK, spaCy, HuggingFace, visNetwork, ggplot, caret, dplyr, ggraph, NiBabel, BeautifulSoup, Selenium