Data Scientist / Data Engineer / Data Analyst
SUAVIS GIRAMATA
Phone: 336-***-**** Email: *********@*****.*** GitHub: https://github.com/G-Suavis
Education
MASTER OF SCIENCE IN DATA SCIENCE – East Carolina University – Greenville Aug 2023 - Dec 2024
BACHELOR OF SCIENCE IN BIOLOGY (Cell and Molecular Concentration) – East Carolina University – Greenville Jan 2020 - May 2023
Skills
Programming & Scripting
Python: pandas, NumPy, Matplotlib
R: dplyr, ggplot2, tidyr
SQL: SQLite, MySQL, PostgreSQL
Databases
Relational: PostgreSQL, MySQL, SQLite
NoSQL: MongoDB
Data Engineering & Big Data Tools
Hadoop, Apache Spark, Prefect, DBeaver, DHIS2 API, ETL Pipeline Development, Data Cleaning & Transformation
Data Analysis & Visualization
Exploratory Data Analysis (EDA), Statistical Analysis, Data Mining
Visualization Tools: Apache Superset, Power BI
Machine Learning & AI
Algorithms: Decision Trees, Random Forest, Regression, Clustering
Model Evaluation & Optimisation: Cross-validation, confusion matrix, ROC-AUC
Frameworks: scikit-learn, TensorFlow, PyTorch
Natural Language Processing (NLP): Cosine Similarity, Sentiment Analysis, Semantic Similarity
Cloud & Platforms
Azure (Virtual Machines, Azure SQL Database, Azure Storage), Google Colab, Git, GitHub
Other Tools & Concepts
Version Control: Git, GitHub
Privacy & De-identification Techniques
Data Quality Assessment (Completeness, accuracy, consistency, etc.)
Work Experience
PRIMARY HEALTHCARE DATA SCIENTIST INTERN
Resolve to Save Lives / Health Intelligence Center Jan 2025 – Present
● Selected and funded by Resolve to Save Lives to support Rwanda's Health Intelligence Center, with technical supervision provided by SAND Technologies, contributing to the development of the country’s first integrated national health data ecosystem.
● Built and orchestrated Python-based ETL pipelines using DHIS2 APIs, Prefect, and DBeaver to ingest patient-level data from HMIS and 4+ EMR systems, integrating over 5.6 TB of health data (2022-2025) into a centralized warehouse. Optimized the pipeline for monthly ingestion, progressively syncing historical records without overloading the system.
● Transformed and de-identified sensitive patient-level data across 50+ datasets, masking direct identifiers such as patient names using rule-based logic and statistical techniques to ensure compliance with internal privacy protocols and enable safe use of data for analytics and reporting.
● Led data quality assessments across 10+ ingested datasets using a weighted scoring framework: scored each dataset on a 1-5 scale for completeness, accuracy, validity, consistency, timeliness, and uniqueness, then calculated weighted averages to derive an overall quality percentage.
● Ensured all sources met a minimum quality threshold of 85% and proactively provided feedback to data owners when scores fell below it, helping correct root issues and maintain high standards for analysis and reporting.
● Created 30+ dashboards in Superset to monitor maternal and infectious disease indicators, uncovering underreporting in HIV testing, high rates of antenatal care gaps among adolescents, and year-over-year improvements in maternal mortality.
● Trained and tested machine learning models, including random forest, logistic regression, and gradient boosting, to predict adverse pregnancy outcomes using detailed maternal audit data from admission to delivery. Random forest achieved the highest accuracy at 90.4%, outperforming gradient boosting (86%) and logistic regression (84%).
GRADUATE RESEARCH ASSISTANT (M.S.), METAMORPHIC TESTING PRIORITIZATION FOR FAIRNESS EVALUATION ON LARGE LANGUAGE MODELS
East Carolina University, Greenville, NC
Aug 2023 - Dec 2024
● Designed and implemented an AI fairness evaluation framework for ChatGPT using metamorphic testing to detect biased behavior across sensitive attributes.
● Created two large-scale datasets: Source Test Cases (baseline queries with sensitive characteristics) and Follow-up Test Cases (modified versions with controlled changes), simulating real-world bias scenarios.
● Developed a Python-based pipeline that used the OpenAI API to submit test cases to ChatGPT and collect responses, then applied NLP techniques such as sentiment analysis, cosine similarity, and semantic similarity (using TensorFlow and scikit-learn) to quantify discrepancies between source and follow-up responses.
● Prioritized metamorphic relations (MRs) using diversity-based, fault-based, distance-based, and random strategies; demonstrated that the proposed Diversity Score-based prioritization achieved a 91.6% fault detection rate at the first MR and outperformed baseline methods by up to 130%, validating its effectiveness for uncovering fairness-related faults in LLMs.
● Visualized trends using Matplotlib and ggplot2, and presented findings in a thesis defense emphasizing the need for responsible AI in critical domains like healthcare, finance, and education.
GRADUATE TEACHING ASSISTANT
East Carolina University, Greenville, NC
Aug 2023 – Dec 2024
● Guided students in Python programming, debugging, and algorithm design to strengthen their understanding of computational problem-solving.
CLINICAL LABORATORY ASSISTANT
East Carolina University, Greenville, NC
Aug 2020 – May 202
● Optimized inventory systems through data-driven tracking and analysis, ensuring timely availability of research materials.
Projects
HEART DISEASE PREDICTIVE MODELING AND MACHINE LEARNING Sept 2024
● Cleaned and preprocessed 303 patient records, ensuring data readiness by addressing missing values and encoding categorical variables using Python.
● Conducted feature analysis, identifying critical predictors of heart disease, such as cholesterol levels, maximum heart rate, and exercise-induced angina.
● Built logistic regression, random forest, and support vector machine models, achieving an accuracy of 83% (logistic regression) and F1-scores of 87% (class 0) and 75% (class 1) for balanced predictive analysis.
● Generated actionable insights to support early intervention strategies in healthcare, showcasing relationships between critical features and outcomes through visualizations in Matplotlib and Seaborn.
NATURAL DISASTER DATA ANALYSIS AND FORECAST (NOAA) DASHBOARD (GROUP PROJECT) July 2024
● Developed a comprehensive Power BI dashboard to analyze and visualize NOAA storm event data across multiple metrics, including property and crop damage, death counts, and event frequency by state and event type.
● Cleaned and processed over 1.66 million records in RStudio, covering storm events from 50 U.S. states and multiple event types over 34 years.
● Presented findings on the cost of damaged properties and crops, identifying Texas as the state with the highest property damage and Nebraska and Iowa as the leading states for crop damage.
● Analyzed the number of direct and indirect deaths and injuries per state and event type, highlighting heat waves and tornadoes as leading causes of fatalities.
● Created state-wise comparisons of storm damage costs and frequency and visualized trends over time, helping to inform stakeholders of the economic and human impacts of severe weather events.
STUDENTS PERFORMANCE FROM KAGGLE April 2024
● Analyzed a dataset of 1,000 student records with 40 variables to uncover factors influencing academic success using Python.
● Performed data cleaning and preprocessing to structure demographic and performance data for analysis.
● Created detailed visualizations using Pandas, Matplotlib, and Seaborn to identify trends, correlations, and performance drivers.
ANOMALY DETECTION AND STATISTICAL ANALYSIS Feb 2024
● Applied advanced statistical analysis and outlier detection methods to identify anomalies in experimental velocity-of-light data (Michelson-Morley experiments).
● Compared variations across experimental conditions to highlight discrepancies and deepen understanding of physical measurements.
DATA TRANSFORMATION AND RELATIONAL MODELLING Oct 2023
● Converted hierarchical key-value datasets into normalized relational tables using SQLite.
● Designed schemas with primary and foreign keys to improve data integrity, accessibility, and scalability.
● Wrote optimized SQL queries for complex data extraction and relationship handling.