Recently Published
Spam vs Ham Classification
This project uses the SpamAssassin corpus to build a Naive Bayes classifier that predicts whether emails are spam or ham. After preprocessing and vectorizing the text, the model is trained and evaluated on a test set. The results highlight the challenges of imbalanced text data while demonstrating the full workflow for building a spam detection model in R.
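A minimal sketch of the modeling step, assuming the tm and e1071 packages and an `emails` data frame with `text` and `label` columns; none of this is the report's actual code:

```r
library(tm)     # corpus handling and document-term matrix
library(e1071)  # naiveBayes()

# Assumed input: `emails` data frame with `text` and `label` (spam/ham) columns
corpus <- VCorpus(VectorSource(emails$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))

dtm <- removeSparseTerms(DocumentTermMatrix(corpus), 0.99)

# Naive Bayes tends to work better here on presence/absence than raw counts
X <- apply(as.matrix(dtm), 2, function(x) ifelse(x > 0, "yes", "no"))
y <- factor(emails$label)

train <- sample(nrow(X), floor(0.8 * nrow(X)))   # simple 80/20 split
model <- naiveBayes(X[train, ], y[train])
pred  <- predict(model, X[-train, ])
table(predicted = pred, actual = y[-train])      # confusion matrix
```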
Chess Elo Calculations
This publication applies the Elo expected-score formula to a chess tournament dataset. By estimating each player’s predicted score from their rating and average opponent strength, I calculate expected performance for all seven rounds. I then compare these expectations to actual results, identify overperformers and underperformers, and demonstrate how Elo metrics can be used to evaluate tournament performance.
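For reference, the expected-score formula is E = 1 / (1 + 10^((R_opp - R) / 400)); a small R helper makes the seven-round calculation concrete (the ratings below are made up):

```r
# Elo expected score for a player rated `r` against an opponent rated `r_opp`,
# using the standard logistic curve with the 400-point scale
elo_expected <- function(r, r_opp) 1 / (1 + 10^((r_opp - r) / 400))

elo_expected(1600, 1500)       # ~0.64 expected points per game
7 * elo_expected(1600, 1500)   # ~4.5 expected points over seven rounds
```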
Tidying and Transforming Data (Corrected)
In this assignment, I tidied a flight delay dataset, reshaped it into long format, and compared on-time vs delayed flights. The analysis showed how overall results differed from city-level patterns.
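A minimal sketch of the reshape, assuming tidyr/dplyr and illustrative column names rather than the assignment's actual ones:

```r
library(dplyr)
library(tidyr)

# Assumed wide shape: one row per airline/status, one count column per city
flights_long <- flights_wide %>%
  pivot_longer(-c(airline, status), names_to = "city", values_to = "count")

# City-level delay rates, to contrast with the overall rates
flights_long %>%
  group_by(airline, city) %>%
  summarise(delay_rate = count[status == "delayed"] / sum(count),
            .groups = "drop")
```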
Analyzing Nobel Prize Data Using the Nobel Prize API
This report explores Nobel Prize data retrieved directly from the official Nobel Prize API. Using R, the analysis extracts and transforms JSON data to answer key questions about laureate distribution, average age at award, and global trends across countries and continents.
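A minimal sketch of the retrieval step with jsonlite; the endpoint and flattened field names are written from memory of the public API and should be checked against its documentation:

```r
library(jsonlite)
library(dplyr)

# Nobel Prize API v2.1 laureates endpoint (verify against the current docs)
resp <- fromJSON("https://api.nobelprize.org/2.1/laureates?limit=100",
                 flatten = TRUE)
laureates <- as_tibble(resp$laureates)

# Example question: laureates per birth country (flattened column name assumed)
laureates %>%
  count(`birth.place.country.en`, sort = TRUE) %>%
  head(10)
```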
Scenario Design Analysis: Netflix Recommender System
This report explores Netflix’s recommendation system through the scenario design framework. It examines both user and organizational perspectives to understand how personalization supports engagement and satisfaction. The analysis also includes a brief reverse engineering of Netflix’s hybrid model and recommendations to improve transparency, diversity, and user experience.
Sentiment Analysis with Tidy Data
Using Jane Austen’s novels, this analysis extends examples from Text Mining with R by applying the NRC, Bing, and AFINN lexicons to study sentiment and emotion.
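The core join is short enough to show; this follows the Text Mining with R pattern using the Bing lexicon bundled with tidytext (NRC and AFINN additionally require the textdata package):

```r
library(janeaustenr)  # the novels, one row per line of text
library(tidytext)     # unnest_tokens(), get_sentiments()
library(dplyr)
library(tidyr)

tidy_books <- austen_books() %>%
  unnest_tokens(word, text)           # one word per row

# Net Bing sentiment (positive minus negative) per book
tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(net = positive - negative)
```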
New York Times API — Most Emailed Articles Analysis
This report uses the New York Times Most Popular API to retrieve and analyze data on the most emailed articles from the past seven days. The JSON response was parsed and transformed into a clean R data frame, allowing exploration of the most discussed sections and topics trending among readers.
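A minimal sketch of the request, assuming httr/jsonlite and an API key stored in an environment variable; the URL follows the documented Most Popular pattern but is worth double-checking:

```r
library(httr)
library(jsonlite)

url  <- "https://api.nytimes.com/svc/mostpopular/v2/emailed/7.json"
resp <- GET(url, query = list(`api-key` = Sys.getenv("NYT_KEY")))

articles <- fromJSON(content(resp, as = "text"), flatten = TRUE)$results

table(articles$section)   # which sections dominate the most-emailed list
```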
Most Valued Data Science Skills: A Relational Database and R Analysis
This project analyzes Kaggle’s Data Science Job Postings & Skills (2024) dataset using PostgreSQL and RStudio to identify the most in-demand skills in data science. After cleaning and normalizing the data, the results show that Python, SQL, and Machine Learning are the top skills sought by employers in today’s data-driven market.
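A minimal sketch of the database side with DBI/RPostgres; the connection details and the `job_skills` table schema are placeholders, not the project's actual setup:

```r
library(DBI)
library(RPostgres)

con <- dbConnect(Postgres(), dbname = "jobs", host = "localhost",
                 user = "analyst", password = Sys.getenv("PGPASSWORD"))

# Top skills by number of postings, from an assumed normalized job_skills table
top_skills <- dbGetQuery(con, "
  SELECT skill, COUNT(*) AS n_postings
  FROM job_skills
  GROUP BY skill
  ORDER BY n_postings DESC
  LIMIT 10;")

dbDisconnect(con)
```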
Most Valued Data Science Skills
This project explores the most valued data science skills using a Kaggle dataset of over 12,000 LinkedIn job postings. The analysis was completed in R through data cleaning, transformation, and visualization to identify which technical and analytical skills are most in demand. The results show that Python, SQL, Machine Learning, and Communication are among the top abilities sought by employers in 2024.
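A minimal sketch of the skill-counting step in the tidyverse, assuming a `postings` data frame with a comma-separated `job_skills` column:

```r
library(dplyr)
library(tidyr)
library(stringr)

top_skills <- postings %>%
  separate_rows(job_skills, sep = ",\\s*") %>%            # one skill per row
  mutate(job_skills = str_to_title(str_trim(job_skills))) %>%
  count(job_skills, sort = TRUE)

head(top_skills, 10)
```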
Favorite Books: HTML, XML, and JSON Data Representation
This assignment demonstrates how the same dataset can be represented in different formats (HTML, XML, JSON) and how each can be read back into R for comparison. The example uses three personal books to show how structure and purpose vary across formats while maintaining consistent information.
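A minimal sketch of reading each format back into R, with hypothetical file names and the usual parsers (rvest, xml2, jsonlite):

```r
library(rvest)     # read_html(), html_table()
library(xml2)      # read_xml(), xml_find_all()
library(jsonlite)  # fromJSON()

books_html <- read_html("books.html") %>%
  html_element("table") %>%
  html_table()                           # tibble from the HTML table

books_xml  <- read_xml("books.xml")
xml_titles <- xml_text(xml_find_all(books_xml, "//book/title"))

books_json <- fromJSON("books.json")     # often parses straight to a data frame

str(books_html); str(xml_titles); str(books_json)
```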
Tidying and Transforming Travel Agency Price Data
This final part of the project focuses on tidying a travel pricing dataset. Converting wide-format seasonal data and separating combined fields made it easier to analyze and visualize price trends across agencies and service types. The section concludes the full data tidying workflow by emphasizing organization, transformation, and clarity in analysis.
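A minimal sketch of the two tidying moves, with assumed column names such as a combined `agency_service` field and one price column per season:

```r
library(dplyr)
library(tidyr)

prices_tidy <- prices_wide %>%
  separate(agency_service, into = c("agency", "service"), sep = "\\.") %>%
  pivot_longer(c(winter, spring, summer, fall),
               names_to = "season", values_to = "price")
```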
Tidying and Transforming Country GDP and Population Data
In this continuation of the project, I tidied and transformed a wide-format dataset containing population and GDP data for the USA, China, and India from 2000 to 2010. Converting the dataset into long format made it easier to analyze growth patterns and visualize economic trends across countries and years.
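A minimal sketch assuming columns named like `pop_2000` through `gdp_2010`; pivot_longer can split the metric and the year in one pass:

```r
library(dplyr)
library(tidyr)

econ_long <- econ_wide %>%
  pivot_longer(-country,
               names_to = c("metric", "year"), names_sep = "_",
               values_to = "value") %>%
  mutate(year = as.integer(year))
```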
Tidying and Transforming Travel Expense Report Data
In the first part of this project, I worked with an untidy Travel Expense Report dataset to demonstrate how data can be cleaned, structured, and transformed using R. The analysis compared total and average spending across two cities, San Jose and Seattle, and the final visualization highlighted how data tidying enables clear and reliable insights.
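After tidying, the comparison reduces to a grouped summary; a sketch assuming one row per expense with `city` and `amount` columns:

```r
library(dplyr)

expenses %>%
  group_by(city) %>%
  summarise(total_spend = sum(amount),
            avg_spend   = mean(amount),
            .groups = "drop")
```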
Tidying and Transforming Data
In this assignment, I tidied a flight delay dataset, reshaped it into long format, and compared on-time vs delayed flights. The analysis showed how overall results differed from city-level patterns.
Chess Tournament Project
This project takes raw chess tournament data from a text file and transforms it into a clean dataset. I extracted player information, calculated average opponent ratings, and exported the results to a CSV file for further analysis.
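A minimal sketch of the parsing approach in base R; the delimiter and field positions are illustrative guesses at the file's layout, not the project's actual patterns:

```r
lines <- readLines("tournamentinfo.txt")

# Player rows are assumed to start with an ID followed by a pipe delimiter
player_rows <- grep("^\\s*\\d+\\s*\\|", lines, value = TRUE)

fields <- strsplit(player_rows, "\\|")
name   <- trimws(sapply(fields, `[`, 2))
points <- as.numeric(sapply(fields, `[`, 3))

results <- data.frame(name = name, total_points = points)
write.csv(results, "tournament_results.csv", row.names = FALSE)
```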