karim6T

Abdoul Toure

Recently Published

Titanic Analysis - EDA
In this analysis, we will be evaluating the Titanic dataset. The data dictionary is as follows:
- Passengerid: Passenger ID
- Age: Age in years
- Fare: Passenger fare
- Sex: Sex (female, male)
- Sibsp: Number of siblings/spouses aboard the Titanic (Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife)
- Parch: Number of parents/children aboard the Titanic (Parent = mother, father; Child = daughter, son, stepdaughter, stepson)
- Pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- Survived: Survival (0 = No, 1 = Yes)
- Cabin: Cabin number
- Ticket: Ticket number
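As a small illustration of how the coded columns in this data dictionary could be turned into readable labels during exploration, here is a minimal Python sketch. The `decode` helper and the example row are hypothetical and do not come from the published analysis. Only the code-to-label mappings are taken from the dictionary.

```python
# Code-to-label mappings copied from the data dictionary.
PCLASS = {1: "1st class", 2: "2nd class", 3: "3rd class"}
EMBARKED = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
SURVIVED = {0: "No", 1: "Yes"}


def decode(row):
    """Return a copy of `row` with coded fields replaced by their labels.

    Unknown or missing codes become None so they are easy to spot later.
    (Hypothetical helper, for illustration only.)
    """
    out = dict(row)
    out["Pclass"] = PCLASS.get(row.get("Pclass"))
    out["Embarked"] = EMBARKED.get(row.get("Embarked"))
    out["Survived"] = SURVIVED.get(row.get("Survived"))
    return out


# Made-up row, used only to exercise the helper.
example = {"Passengerid": 1, "Sex": "male", "Pclass": 3, "Embarked": "S", "Survived": 0}
decoded = decode(example)
print(decoded["Pclass"], decoded["Embarked"], decoded["Survived"])
```

Mapping unknown codes to `None` rather than raising an error is a design choice that suits exploratory work, where unexpected values are something to count and inspect.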
US Company Bankruptcy Prediction - Implement EDA/Clustering Analysis and Interpret
This analysis uses the American Companies Bankruptcy Prediction dataset to build a two-cluster k-means model and evaluate its ability to separate surviving firms from failed firms. The workflow follows the standard exploratory data analysis pipeline: data ingestion, type verification, missing value assessment, outlier handling, transformation, splitting, feature engineering, modeling, and evaluation. The dataset contains 18 financial variables along with a company identifier, year, and survival status label. The table below documents the original column codes, the renamed variables used throughout this analysis, and a short description of each measure.
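The core idea of fitting two clusters and checking how well they line up with the survival label can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not the code used in the analysis: the points and labels are made up, the features are assumed to be already scaled, and the deterministic starting centers and the `agreement` measure are choices made here for reproducibility.

```python
import math


def kmeans_two(points, iters=100):
    """Lloyd's algorithm with k = 2 on a list of equal-length numeric tuples."""
    # Deterministic start (an assumption made for this sketch):
    # the first point, and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    centers = [c0, c1]
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:  # no change means the algorithm has converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def agreement(labels, truth):
    """Share of rows where cluster and status label match, taking the
    better of the two possible cluster-to-label mappings."""
    same = sum(lab == t for lab, t in zip(labels, truth)) / len(truth)
    return max(same, 1 - same)


# Made-up, well-separated toy data; 1 and 0 are hypothetical status codes.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
truth = [1, 1, 1, 0, 0, 0]

labels, centers = kmeans_two(points)
print(labels, agreement(labels, truth))
```

Because cluster numbers are arbitrary, `agreement` scores both possible mappings of clusters to labels and keeps the higher one. On real, imbalanced bankruptcy data a confusion matrix would be more informative than this single number.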
Predictive_Data_AT: A Structured Workflow for Data Cleaning, Imputation, and Descriptive Analytics in R
Predictive_Data_AT is a comprehensive R‑based workflow that guides users through the full lifecycle of preparing a raw dataset for predictive modeling. The script automates essential preprocessing tasks, including directory setup, data import, sampling, missing‑value detection, blank‑to‑NA conversion, factor re‑encoding, and multi‑stage imputation using mean, mode, and interpolation strategies. It also generates detailed exploratory summaries and descriptive statistics such as central tendency, dispersion, skewness, and kurtosis to help users evaluate the effects of cleaning and imputation on data structure and distribution. This framework provides a reproducible, pedagogically structured template for developing data literacy, ensuring data integrity, and preparing high‑quality inputs for downstream predictive analytics.
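The three imputation strategies named above (mean, mode, and interpolation) can be illustrated with a minimal sketch. The workflow itself is written in R, so the Python below is only an illustration of the ideas. The function names and sample values are hypothetical, and treating blank strings as missing mirrors the blank-to-NA step only loosely.

```python
from statistics import mean, mode


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None or blank strings with the most frequent observed category."""
    observed = [v for v in values if v not in (None, "")]
    fill = mode(observed)
    return [fill if v in (None, "") else v for v in values]


def impute_linear(values):
    """Fill interior None gaps by linear interpolation between the nearest
    observed neighbours; leading and trailing gaps are left as None."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            out[i] = out[left] + (out[right] - out[left]) * (i - left) / gap
    return out


# Made-up sample values, used only to exercise the helpers.
ages = impute_mean([22.0, None, 26.0, 30.0])
ports = impute_mode(["S", "", "C", "S", None])
series = impute_linear([10.0, None, None, 16.0, None])
print(ages, ports, series)
```

Mean imputation suits roughly symmetric numeric columns, mode imputation suits categorical ones, and interpolation only makes sense when the row order is meaningful, for example in a time series. Comparing dispersion and skewness before and after filling, as the workflow does, shows how much each strategy distorts the distribution.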