lukahere007

Luke Wamalwa

Recently Published

Phishing Email Detection Analysis with Random Forest & GBM
An end-to-end comparison of Random Forest and Gradient Boosting classifiers on an imbalanced phishing-email dataset. We split the data 60/20/20, applied down-sampling, up-sampling, ROSE, and SMOTE, tuned probability thresholds to maximize F₁, and then evaluated the champion model on a held-out test set.
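The threshold-tuning step mentioned above can be illustrated with a minimal sketch. It is written in plain Python rather than the R used in the report, and everything in it is an assumption made for illustration: the function names, the 0.01-step threshold grid, and the ten hypothetical validation labels and probabilities are not taken from the project.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 for the positive class (1), predicting 1 when prob >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, grid=None):
    """Return (threshold, f1) maximizing F1 over a grid; ties keep the lowest threshold."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    best_t, best_f1 = None, -1.0
    for t in grid:
        f1 = f1_at_threshold(y_true, probs, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical validation-set labels (1 = phishing) and predicted probabilities
val_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
val_probs = [0.05, 0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80]

threshold, f1 = best_threshold(val_labels, val_probs)
default_f1 = f1_at_threshold(val_labels, val_probs, 0.5)
print(f"tuned threshold={threshold:.2f}, F1={f1:.3f}; F1 at 0.50={default_f1:.3f}")
```

On imbalanced data the F₁-optimal cutoff often sits below 0.5, which is why the threshold is chosen on validation data and only then applied to the test set.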
A Validation-Based Model Selection Strategy for Breast Cancer Diagnosis Using Logistic Regression
This analysis uses logistic regression to predict breast cancer malignancy from tumor characteristics derived from digitized medical images. Using the Breast Cancer Wisconsin (Diagnostic) dataset, we conducted exploratory data analysis to visualize key variables; performed feature selection using both statistical significance and multicollinearity checks; split the dataset into 60% training, 20% validation, and 20% test sets to evaluate generalization; fitted multiple logistic regression models, refining them iteratively based on AIC, deviance, accuracy, and ROC/AUC; and identified a final model with three key predictors: texture_mean, concavity_mean, and radius_mean. Validated on the test set, the final model achieved strong predictive performance, with high sensitivity and specificity and an AUC of 0.974. The effect of each predictor on malignancy probability is visualized using ggplot2, pROC, and ggpmisc. The project demonstrates the importance of model validation, feature interpretability, and visualization in clinical predictive modeling, and offers a reproducible pipeline for diagnostic model development using logistic regression.
Power Analysis for A/B Testing: Impact of Sample Size in R
This project demonstrates how small sample sizes in A/B testing can lead to inconclusive results, and how choosing the sample size through power analysis allows a real effect to be detected as statistically significant. Simulated data are used to compare p-values, conversion rates, and statistical power before and after the sample size adjustment. Visualizations use ggstatsplot, ggpubr, ggsignif, and ggpmisc.
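As a rough illustration of the sample-size reasoning described above, the sketch below applies the textbook normal-approximation formula for comparing two proportions, n = (z₁₋α/₂ + z_power)² · [p₁(1 − p₁) + p₂(1 − p₂)] / (p₁ − p₂)² per group. It is written in plain Python rather than R, is not the project's own code, and the function name and the 10% versus 12% conversion rates are hypothetical. Results from R's power functions may differ slightly because they use a different calculation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided test of two proportions.

    Uses the unpooled normal approximation; a sketch, not an exact method.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ to define an effect size")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = std_normal.inv_cdf(power)          # quantile matching the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical conversion rates: 10% for variant A, 12% for variant B
n_per_group = sample_size_two_proportions(0.10, 0.12)
print(f"Approximate users needed per group: {n_per_group}")
```

Because the required n grows with the inverse square of the difference in rates, a test sized for a large lift is badly underpowered for a small one, which is the effect the project illustrates.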
Simulated Vaccine Analysis
This report explores a simulated clinical dataset evaluating vaccine-induced immune responses over time. It models antibody titer changes across vaccine doses and patient characteristics using ANOVA, logistic regression, and visualizations from the tidyverse, ggpubr, ggpmisc, and gghighlight packages.