Recently Published
Netflix and Amazon: A Scenario Design Look at Recommender Systems
This short analysis explores how Netflix and Amazon use recommender systems to personalize the user experience. It applies Bruce Temkin’s Scenario Design framework to examine goals from both the user’s and organization’s perspectives, with practical recommendations for improving transparency, personalization, and fairness.
Sentiment Analysis – Week 10 Assignment 10A
Applying Bing, AFINN, and Loughran lexicons to analyze sentiment in technical text. Demonstrates how lexicon choice affects results and shows limitations of emotion-based scoring for structured or business language.
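A minimal sketch of that comparison, assuming a tidytext workflow and two made-up sentences (the afinn and loughran lexicons are fetched via the textdata package):

```r
library(tidytext)
library(dplyr)

# Hypothetical snippets of "technical" text to score
docs <- tibble(
  id = 1:2,
  text = c("The server failed to respond and the error rate increased.",
           "The upgrade improved throughput and reduced latency.")
)

tokens <- docs %>% unnest_tokens(word, text)

# Bing: binary positive/negative labels
tokens %>% inner_join(get_sentiments("bing"), by = "word") %>% count(id, sentiment)

# AFINN: integer scores from -5 to +5 (requires the textdata package)
tokens %>% inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(id) %>% summarise(afinn_total = sum(value))

# Loughran: finance-oriented categories such as negative, positive, uncertainty
tokens %>% inner_join(get_sentiments("loughran"), by = "word") %>% count(id, sentiment)
```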
Tidyverse CREATE — dplyr on mpg
A short vignette demonstrating a simple dplyr workflow on the built-in mpg dataset, including data exploration, a mutate transformation, grouping, summarizing, and identifying the top 5 manufacturers by efficiency.
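A minimal sketch of that workflow; the avg_mpg definition below is an assumption, not necessarily the vignette's exact transformation:

```r
library(dplyr)
library(ggplot2)  # the mpg dataset ships with ggplot2

mpg %>%
  mutate(avg_mpg = (cty + hwy) / 2) %>%         # blended city/highway mileage
  group_by(manufacturer) %>%
  summarise(mean_efficiency = mean(avg_mpg), models = n()) %>%
  arrange(desc(mean_efficiency)) %>%
  slice_head(n = 5)                             # top 5 manufacturers by efficiency
```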
Inference for Numerical Data – YRBSS Analysis
This lab explores numerical data from the Youth Risk Behavior Surveillance System (YRBSS). It includes descriptive statistics, visualizations, hypothesis testing, and confidence intervals to examine relationships between student weight, physical activity, height, and sleep.
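As a rough sketch of the hypothesis-testing piece, here is a permutation test for a difference in mean weight by activity level using the infer package; the variable names come from openintro::yrbss, and the "active 3+ days" cutoff is an assumption about the lab's coding:

```r
library(dplyr)
library(infer)
library(openintro)  # yrbss survey data

data(yrbss)

# Does mean weight differ between students active 3+ days a week and the rest?
yrbss_clean <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no")) %>%
  filter(!is.na(physical_3plus), !is.na(weight))

obs_diff <- yrbss_clean %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist <- yrbss_clean %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")
```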
Lab 6 — Inference for Categorical Data (DATA 606)
I used the YRBSS high school survey to practice inference with categorical data. I summarized texting-while-driving and helmet use, built bootstrap confidence intervals (including a clean visualization), and tested whether sleeping 10+ hours is linked to strength training every day. I also explored how the margin of error changes with the true proportion and sample size. This helped me connect the theory (SE, CI, Type I error) to real-world survey data.
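A minimal sketch of the bootstrap-CI step, assuming the openintro::yrbss variable names and coding daily texting as the "30" response:

```r
library(dplyr)
library(infer)
library(openintro)  # yrbss survey data

data(yrbss)

# Bootstrap 95% CI for the proportion who texted while driving every day
# during the past 30 days
yrbss %>%
  filter(!is.na(text_while_driving_30d)) %>%
  mutate(text_daily = ifelse(text_while_driving_30d == "30", "yes", "no")) %>%
  specify(response = text_daily, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
```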
DATA 607 Project 3 — Code Validation, Data Transformation, and SQLite Integration (Kevin Martin)
This notebook validates and extends the data engineering workflow for our team’s DATA 607 Project 3. It processes the original Google Trends dataset (data_science_skills_gt.csv) into a clean, long-format version (trends_long.csv) and stores it in a structured SQLite database (warehouse.db).
The project demonstrates reproducible data cleaning, transformation, and database loading in R, ensuring team members can reliably access the processed data for analysis and visualization. The process highlights best practices for collaborative version control and R-to-SQL integration in data science workflows.
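A condensed sketch of the pipeline, assuming the first column of data_science_skills_gt.csv is the date and the remaining columns are skills (adjust to the file's real layout):

```r
library(dplyr)
library(readr)
library(tidyr)
library(DBI)
library(RSQLite)

# Read the wide Google Trends export and pivot everything except the
# first (date) column into skill/interest pairs
trends_wide <- read_csv("data_science_skills_gt.csv")

trends_long <- trends_wide %>%
  pivot_longer(cols = -1, names_to = "skill", values_to = "interest")

write_csv(trends_long, "trends_long.csv")

# Load the tidy table into the shared SQLite warehouse
con <- dbConnect(RSQLite::SQLite(), "warehouse.db")
dbWriteTable(con, "trends_long", trends_long, overwrite = TRUE)
dbDisconnect(con)
```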
DATA 606 Lab 5B – Exploring Confidence Intervals with Bootstrap Sampling
In this lab, I explored how sample size and confidence level affect the width and coverage of confidence intervals using simulated data on U.S. adults’ beliefs about climate change. Through the infer package in R, I learned how to construct bootstrap confidence intervals, interpret their meaning, and visualize how interval coverage changes across 90%, 95%, and 99% confidence levels. This report demonstrates how higher confidence leads to wider intervals and how sampling variability impacts real-world inference.
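A minimal sketch of the idea, using a placeholder population proportion rather than the lab's exact setup:

```r
library(dplyr)
library(infer)

set.seed(606)

# Placeholder population: 62% of adults say climate change affects them
us_adults <- tibble(
  climate_change_affects = sample(c("Yes", "No"), size = 100000,
                                  replace = TRUE, prob = c(0.62, 0.38))
)

samp <- us_adults %>% slice_sample(n = 60)

boot_dist <- samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

get_ci(boot_dist, level = 0.90)
get_ci(boot_dist, level = 0.95)
get_ci(boot_dist, level = 0.99)  # higher confidence level, wider interval
```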
Foundations for Statistical Inference – Sampling Distributions
This lab explores how random samples can be used to estimate population proportions and the variability of those estimates. Using simulated data from a 2019 Gallup report on global attitudes toward science, the analysis demonstrates how sample size and repeated sampling affect the shape, spread, and center of the sampling distribution. Concepts like unbiased estimators and standard error are visualized through simulated proportions of people who believe science benefits or does not benefit them.
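A small simulation in the same spirit, with a placeholder 80/20 split standing in for the Gallup figure:

```r
library(dplyr)
library(ggplot2)

set.seed(607)

# Placeholder population: 80% believe the work scientists do benefits them
global_monitor <- tibble(
  scientist_work = sample(c("Benefits", "Doesn't benefit"), size = 100000,
                          replace = TRUE, prob = c(0.80, 0.20))
)

# Draw 1,500 samples of size 50 and record each sample proportion of "Benefits"
p_hats <- replicate(1500, {
  samp <- sample(global_monitor$scientist_work, size = 50)
  mean(samp == "Benefits")
})

# The sampling distribution is roughly symmetric and centered near 0.80,
# with spread governed by the standard error
ggplot(data.frame(p_hat = p_hats), aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(x = "Sample proportion (n = 50)", y = "Count")
```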
Food Access and Pricing Inequality Across NYC Neighborhoods
This proposal explores how grocery pricing and store availability differ between low-income and high-income neighborhoods in New York City. Focusing on Brownsville (Brooklyn) and Lower Manhattan, the project investigates whether essential items like eggs and milk are more expensive or less accessible in areas with higher poverty rates. The analysis uses a placeholder dataset for now, with plans to incorporate real data from NYC Open Data, USDA Food Access Research Atlas, and the U.S. Census for the final project.
Week 7 – Working with HTML, XML, and JSON in R
This project explores how different data formats—HTML, XML, and JSON—can represent the same information and be read into R for analysis. Each file was created manually to better understand structural differences and how R packages like rvest, xml2, and jsonlite handle them. The comparison confirmed that all formats matched perfectly after being normalized. This assignment helped me connect classroom learning to real-world data handling, especially how formats are chosen based on whether data is meant for humans or systems to read.
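A minimal sketch of reading the three formats, with hypothetical file names and a two-column books table standing in for the real files:

```r
library(dplyr)
library(rvest)     # HTML
library(xml2)      # XML
library(jsonlite)  # JSON

# Hypothetical files, each holding the same small two-column books table
books_html <- read_html("books.html") %>%
  html_element("table") %>%
  html_table()

book_nodes <- read_xml("books.xml") %>% xml_find_all(".//book")
books_xml <- tibble(
  title  = xml_text(xml_find_first(book_nodes, "./title")),
  author = xml_text(xml_find_first(book_nodes, "./author"))
)

books_json <- fromJSON("books.json") %>% as_tibble()

# After normalizing column names and types, the three should be identical
all.equal(books_html, books_xml)
all.equal(books_xml, books_json)
```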
Exploring the Normal Distribution in Fast Food Nutrition Data
This lab explores the concept of the normal distribution using nutritional data from fast food restaurants.
Through visualization, simulation, and probability analysis in R, we examine how well real-world data (like calories from fat, sodium, and carbohydrates) align with a theoretical normal distribution.
Using the tidyverse and openintro packages, I compared McDonald’s and Dairy Queen menu items, generated Q-Q plots, and calculated both theoretical and empirical probabilities.
This lab demonstrates how statistical concepts can be applied to everyday datasets — providing practical experience in data visualization, distribution analysis, and probability modeling.
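A short sketch of the Q-Q plot and probability comparison, assuming the openintro fastfood column names and an arbitrary 600-calorie cutoff:

```r
library(dplyr)
library(ggplot2)
library(openintro)  # fastfood nutrition data

data(fastfood)
dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")

dq_mean <- mean(dairy_queen$cal_fat)
dq_sd   <- sd(dairy_queen$cal_fat)

# Q-Q plot: how closely do calories from fat track a normal distribution?
ggplot(dairy_queen, aes(sample = cal_fat)) +
  stat_qq() +
  stat_qq_line()

# Theoretical vs. empirical probability of an item exceeding 600 calories from fat
pnorm(600, mean = dq_mean, sd = dq_sd, lower.tail = FALSE)  # normal model
mean(dairy_queen$cal_fat > 600)                             # observed data
```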
Project 2 – Data Transformation: Converting Wide Data into Tidy Formats
This project demonstrates how to transform wide datasets into tidy formats using R. Three datasets—Sales, Scores, and Vaccinations—were cleaned, reshaped, and summarized to prepare them for analysis and visualization. The project highlights the use of pivot_longer(), mutate(), and group_by() for data tidying, and includes visual summaries created with ggplot2. The completed outputs were exported to CSV files and packaged into a single zip file for easy sharing.
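A minimal sketch of the reshaping pattern on a made-up Sales table (the project's real column names and summaries differ):

```r
library(dplyr)
library(tidyr)

# Made-up wide Sales table: one row per region, one column per month
sales_wide <- tibble(
  region = c("North", "South"),
  Jan = c(120, 90),
  Feb = c(150, 110),
  Mar = c(130, 95)
)

sales_tidy <- sales_wide %>%
  pivot_longer(cols = Jan:Mar, names_to = "month", values_to = "sales") %>%
  mutate(month = factor(month, levels = c("Jan", "Feb", "Mar")))

# Once tidy, grouped summaries (and ggplot2 visuals) come naturally
sales_tidy %>%
  group_by(region) %>%
  summarise(total_sales = sum(sales), avg_sales = mean(sales))
```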
Lab 3 — Probability (Hot Hand)
This lab investigates the “hot hand” idea using Kobe Bryant’s 2009 NBA Finals shot data. I compute streak lengths from the real data and compare them to a simulation of an independent shooter with the same make rate (45%). Using histograms and summary statistics of streak lengths, I assess whether Kobe’s patterns look meaningfully different from randomness. The results suggest most streaks are short, and the longer ones we do see are consistent with what independence would produce.
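A simplified sketch of the simulation side; the streak definition below (lengths of consecutive-hit runs) is looser than the lab's calc_streak() helper but illustrates the comparison:

```r
set.seed(8)

# 133 independent shots at a 45% make rate (the attempt count in the
# OpenIntro kobe_basket data)
sim_shots <- sample(c("H", "M"), size = 133, replace = TRUE,
                    prob = c(0.45, 0.55))

# Streaks as lengths of consecutive-hit runs
runs <- rle(sim_shots == "H")
sim_streaks <- runs$lengths[runs$values]

table(sim_streaks)  # most streaks are short
max(sim_streaks)    # occasional longer streaks appear by chance alone
```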
Assignment 5B – Elo Calculations and Performance Analysis
This report analyzes chess tournament results using Elo calculations to compare actual player performance against expected outcomes based on pre-tournament ratings. It identifies the top overperformers and underperformers, explains patterns using statistical modeling, and includes visualizations, tables, and a CSV export of results. The analysis demonstrates how data transformation and tidy data principles can be applied to real-world competitive data.
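For context, the expected-score side of the calculation follows the standard Elo formula, sketched here with made-up ratings:

```r
# Standard Elo expected score: a logistic curve on the rating difference,
# scaled so that a 400-point edge implies roughly 10-to-1 odds
elo_expected <- function(r_player, r_opp) {
  1 / (1 + 10 ^ ((r_opp - r_player) / 400))
}

# A player's expected tournament score is the sum over their opponents;
# performance is then actual points minus expected points
elo_expected(1600, c(1500, 1650, 1800))
#> approximately 0.64, 0.43, 0.24
```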
Assignment 5A – Airline Delays Analysis
This assignment analyzes airline delay data for two airlines across five cities.
The dataset was initially provided in a wide format and transformed into a tidy long format using R.
The analysis includes:
1. Cleaning and handling missing data.
2. Calculating the overall share of delays for each airline.
3. Comparing the percentage of delays within each city, visualized through a stacked bar plot.
4. Identifying discrepancies between overall totals and city-by-city breakdowns, illustrating Simpson’s Paradox.
This work demonstrates how data transformation and visualization in R can reveal patterns, like Simpson’s Paradox, that aggregate summaries alone would hide.
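A minimal sketch of the reshaping and rate comparison, using placeholder counts rather than the assignment's actual figures:

```r
library(dplyr)
library(tidyr)

# Placeholder counts for two airlines in two cities (wide layout:
# airline, status, then one column per city)
flights_wide <- tribble(
  ~airline,   ~status,   ~CityA, ~CityB,
  "Airline1", "on time",  500,    200,
  "Airline1", "delayed",   60,     10,
  "Airline2", "on time",  700,   4800,
  "Airline2", "delayed",  120,    400
)

flights_long <- flights_wide %>%
  pivot_longer(cols = -c(airline, status), names_to = "city", values_to = "flights")

# Delay rate within each city...
flights_long %>%
  group_by(airline, city) %>%
  summarise(delay_rate = flights[status == "delayed"] / sum(flights), .groups = "drop")

# ...versus overall: the rankings can flip (Simpson's Paradox)
flights_long %>%
  group_by(airline) %>%
  summarise(delay_rate = sum(flights[status == "delayed"]) / sum(flights))
```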
Project 1 - Chess Tournament Data
Analysis of chess tournament player data using R. Includes average opponent ratings and player statistics.
Week 3B: Window Functions — Moving Averages
This report analyzes stock price data for Apple and Microsoft using YTD averages and 6-day moving averages to highlight short-term vs. long-term trends.
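A minimal sketch of the two window calculations on simulated prices (the report itself uses actual AAPL and MSFT data):

```r
library(dplyr)
library(zoo)  # rollmean for the moving average

set.seed(3)

# Simulated daily closes standing in for the real price series
prices <- tibble(
  date  = seq(as.Date("2024-01-01"), by = "day", length.out = 30),
  close = 180 + cumsum(rnorm(30, mean = 0, sd = 2))
)

prices %>%
  arrange(date) %>%
  mutate(
    ytd_avg = cummean(close),                                      # running average since the start
    ma_6day = rollmean(close, k = 6, fill = NA, align = "right")   # trailing 6-day moving average
  )
```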
Week 3A: Global Baseline Estimates (Movie Ratings)
Global Baseline recommender using μ + (user_avg − μ) + (movie_avg − μ). Includes cleaned data, baseline tables, predictions, recommendations, and visuals.
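A small worked sketch of that baseline on toy ratings (placeholder data, not the assignment's):

```r
library(dplyr)
library(tidyr)

# Toy ratings in long form
ratings <- tribble(
  ~user,  ~movie,   ~rating,
  "Ann",  "MovieA", 5,
  "Ann",  "MovieB", 3,
  "Ben",  "MovieA", 4,
  "Ben",  "MovieC", 2,
  "Cara", "MovieB", 4,
  "Cara", "MovieC", 3
)

mu <- mean(ratings$rating)  # global mean rating

user_avg  <- ratings %>% group_by(user)  %>% summarise(user_avg  = mean(rating))
movie_avg <- ratings %>% group_by(movie) %>% summarise(movie_avg = mean(rating))

# Baseline prediction: mu + (user_avg - mu) + (movie_avg - mu)
crossing(user = unique(ratings$user), movie = unique(ratings$movie)) %>%
  left_join(user_avg,  by = "user") %>%
  left_join(movie_avg, by = "movie") %>%
  mutate(pred = mu + (user_avg - mu) + (movie_avg - mu))
```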
Week 2B: Evaluating Classification Model Performance
Null error rate, confusion matrices at thresholds 0.2/0.5/0.8, and accuracy/precision/recall/F1 for penguin predictions.
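A minimal sketch of the threshold sweep on hypothetical scored predictions (the penguin model's actual outputs differ):

```r
library(dplyr)

# Hypothetical scored predictions: true class and predicted probability
scored <- tibble(
  actual = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1),
  prob   = c(0.91, 0.35, 0.62, 0.48, 0.12, 0.55, 0.80, 0.22, 0.95, 0.70)
)

# Null error rate: the error from always predicting the majority class
1 - max(mean(scored$actual), 1 - mean(scored$actual))

metrics_at <- function(df, threshold) {
  pred <- as.integer(df$prob >= threshold)
  tp <- sum(pred == 1 & df$actual == 1); fp <- sum(pred == 1 & df$actual == 0)
  fn <- sum(pred == 0 & df$actual == 1); tn <- sum(pred == 0 & df$actual == 0)
  tibble(threshold = threshold,
         accuracy  = (tp + tn) / nrow(df),
         precision = tp / (tp + fp),
         recall    = tp / (tp + fn)) %>%
    mutate(f1 = 2 * precision * recall / (precision + recall))
}

bind_rows(lapply(c(0.2, 0.5, 0.8), metrics_at, df = scored))
```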
Week 2A: SQL and R — Movie Ratings
Connecting to MySQL from RStudio, importing data, exporting to CSV, and generating movie ratings summaries.
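A minimal sketch of the round trip, with placeholder connection details and table/column names:

```r
library(DBI)
library(RMySQL)   # RMariaDB works the same way
library(readr)

# Placeholder host, credentials, and schema
con <- dbConnect(RMySQL::MySQL(),
                 host = "localhost", dbname = "movies",
                 username = "data607", password = Sys.getenv("MYSQL_PWD"))

ratings <- dbGetQuery(con, "SELECT movie, rater, rating FROM movie_ratings;")
dbDisconnect(con)

write_csv(ratings, "movie_ratings.csv")

# Quick summary: average rating per movie
aggregate(rating ~ movie, data = ratings, FUN = mean)
```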
DATA607 Week 1: Pima Indians Diabetes Analysis
Assignment for DATA607 showing how to load and clean the Pima Indians Diabetes dataset.
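A minimal sketch of the load-and-clean step, assuming the common UCI/Kaggle column layout and treating implausible zeros as missing (the assignment's exact cleaning may differ):

```r
library(readr)
library(dplyr)

# Common column layout for the Pima dataset; file path is a placeholder
pima <- read_csv("pima-indians-diabetes.csv",
                 col_names = c("pregnancies", "glucose", "blood_pressure",
                               "skin_thickness", "insulin", "bmi",
                               "pedigree", "age", "outcome"))

# Zeros in these physiological measures are implausible and are typically
# recoded as missing before analysis
pima_clean <- pima %>%
  mutate(across(c(glucose, blood_pressure, skin_thickness, insulin, bmi),
                ~ na_if(.x, 0)))

summary(pima_clean)
```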