Recently Published
Netflix and Amazon: A Scenario Design Look at Recommender Systems
This short analysis explores how Netflix and Amazon use recommender systems to personalize the user experience. It applies Bruce Temkin’s Scenario Design framework to examine goals from both the user’s and organization’s perspectives, with practical recommendations for improving transparency, personalization, and fairness.
Sentiment Analysis – Week 10 Assignment 10A
Applying Bing, AFINN, and Loughran lexicons to analyze sentiment in technical text. Demonstrates how lexicon choice affects results and shows limitations of emotion-based scoring for structured or business language.
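A minimal sketch of that comparison, assuming a tidytext workflow and two made-up sentences (the afinn and loughran lexicons are fetched via the textdata package):

```r
library(tidytext)
library(dplyr)

# Hypothetical snippets of "technical" text to score
docs <- tibble(
  id = 1:2,
  text = c("The server failed to respond and the error rate increased.",
           "The upgrade improved throughput and reduced latency.")
)

tokens <- docs %>% unnest_tokens(word, text)

# Bing: binary positive/negative labels
tokens %>% inner_join(get_sentiments("bing"), by = "word") %>% count(id, sentiment)

# AFINN: integer scores from -5 to +5 (requires the textdata package)
tokens %>% inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(id) %>% summarise(afinn_total = sum(value))

# Loughran: finance-oriented categories such as negative, positive, uncertainty
tokens %>% inner_join(get_sentiments("loughran"), by = "word") %>% count(id, sentiment)
```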
Tidyverse CREATE — dplyr on mpg
A short vignette demonstrating a simple dplyr workflow on the built-in mpg dataset, including data exploration, a mutate transformation, grouping, summarizing, and identifying the top 5 manufacturers by efficiency.
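A minimal sketch of that workflow; the avg_mpg definition below is an assumption, not necessarily the vignette's exact transformation:

```r
library(dplyr)
library(ggplot2)  # the mpg dataset ships with ggplot2

mpg %>%
  mutate(avg_mpg = (cty + hwy) / 2) %>%         # blended city/highway mileage
  group_by(manufacturer) %>%
  summarise(mean_efficiency = mean(avg_mpg), models = n()) %>%
  arrange(desc(mean_efficiency)) %>%
  slice_head(n = 5)                             # top 5 manufacturers by efficiency
```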
Inference for Numerical Data – YRBSS Analysis
This lab explores numerical data from the Youth Risk Behavior Surveillance System (YRBSS). It includes descriptive statistics, visualizations, hypothesis testing, and confidence intervals to examine relationships between student weight, physical activity, height, and sleep.
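As a rough sketch of the hypothesis-testing piece, here is a permutation test for a difference in mean weight by activity level using the infer package; the variable names come from openintro::yrbss, and the "active 3+ days" cutoff is an assumption about the lab's coding:

```r
library(dplyr)
library(infer)
library(openintro)  # yrbss survey data

data(yrbss)

# Does mean weight differ between students active 3+ days a week and the rest?
yrbss_clean <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no")) %>%
  filter(!is.na(physical_3plus), !is.na(weight))

obs_diff <- yrbss_clean %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist <- yrbss_clean %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")
```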
Lab 6 — Inference for Categorical Data (DATA 606)
I used the YRBSS high school survey to practice inference with categorical data. I summarized texting-while-driving and helmet use, built bootstrap confidence intervals (including a clean visualization), and tested whether sleeping 10+ hours is linked to strength training every day. I also explored how the margin of error changes with the true proportion and sample size. This helped me connect the theory (SE, CI, Type I error) to real-world survey data.
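A minimal sketch of the bootstrap-CI step, assuming the openintro::yrbss variable names and coding daily texting as the "30" response:

```r
library(dplyr)
library(infer)
library(openintro)  # yrbss survey data

data(yrbss)

# Bootstrap 95% CI for the proportion who texted while driving every day
# during the past 30 days
yrbss %>%
  filter(!is.na(text_while_driving_30d)) %>%
  mutate(text_daily = ifelse(text_while_driving_30d == "30", "yes", "no")) %>%
  specify(response = text_daily, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
```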
DATA 607 Project 3 — Code Validation, Data Transformation, and SQLite Integration (Kevin Martin)
This notebook validates and extends the data engineering workflow for our team’s DATA 607 Project 3. It processes the original Google Trends dataset (data_science_skills_gt.csv) into a clean, long-format version (trends_long.csv) and stores it in a structured SQLite database (warehouse.db).
The project demonstrates reproducible data cleaning, transformation, and database loading in R, ensuring team members can reliably access the processed data for analysis and visualization. The process highlights best practices for collaborative version control and R-to-SQL integration in data science workflows.
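A condensed sketch of the pipeline, assuming the first column of data_science_skills_gt.csv is the date and the remaining columns are skills (adjust to the file's real layout):

```r
library(dplyr)
library(readr)
library(tidyr)
library(DBI)
library(RSQLite)

# Read the wide Google Trends export and pivot everything except the
# first (date) column into skill/interest pairs
trends_wide <- read_csv("data_science_skills_gt.csv")

trends_long <- trends_wide %>%
  pivot_longer(cols = -1, names_to = "skill", values_to = "interest")

write_csv(trends_long, "trends_long.csv")

# Load the tidy table into the shared SQLite warehouse
con <- dbConnect(RSQLite::SQLite(), "warehouse.db")
dbWriteTable(con, "trends_long", trends_long, overwrite = TRUE)
dbDisconnect(con)
```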
DATA 606 Lab 5B – Exploring Confidence Intervals with Bootstrap Sampling
In this lab, I explored how sample size and confidence level affect the width and coverage of confidence intervals using simulated data on U.S. adults’ beliefs about climate change. Through the infer package in R, I learned how to construct bootstrap confidence intervals, interpret their meaning, and visualize how interval coverage changes across 90%, 95%, and 99% confidence levels. This report demonstrates how higher confidence leads to wider intervals and how sampling variability impacts real-world inference.
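A minimal sketch of the idea, using a placeholder population proportion rather than the lab's exact setup:

```r
library(dplyr)
library(infer)

set.seed(606)

# Placeholder population: 62% of adults say climate change affects them
us_adults <- tibble(
  climate_change_affects = sample(c("Yes", "No"), size = 100000,
                                  replace = TRUE, prob = c(0.62, 0.38))
)

samp <- us_adults %>% slice_sample(n = 60)

boot_dist <- samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

get_ci(boot_dist, level = 0.90)
get_ci(boot_dist, level = 0.95)
get_ci(boot_dist, level = 0.99)  # higher confidence level, wider interval
```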
Foundations for Statistical Inference – Sampling Distributions
This lab explores how random samples can be used to estimate population proportions and the variability of those estimates. Using simulated data from a 2019 Gallup report on global attitudes toward science, the analysis demonstrates how sample size and repeated sampling affect the shape, spread, and center of the sampling distribution. Concepts like unbiased estimators and standard error are visualized through simulated proportions of people who believe science benefits or does not benefit them.
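A small simulation in the same spirit, with a placeholder 80/20 split standing in for the Gallup figure:

```r
library(dplyr)
library(ggplot2)

set.seed(607)

# Placeholder population: 80% believe the work scientists do benefits them
global_monitor <- tibble(
  scientist_work = sample(c("Benefits", "Doesn't benefit"), size = 100000,
                          replace = TRUE, prob = c(0.80, 0.20))
)

# Draw 1,500 samples of size 50 and record each sample proportion of "Benefits"
p_hats <- replicate(1500, {
  samp <- sample(global_monitor$scientist_work, size = 50)
  mean(samp == "Benefits")
})

# The sampling distribution is roughly symmetric and centered near 0.80,
# with spread governed by the standard error
ggplot(data.frame(p_hat = p_hats), aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(x = "Sample proportion (n = 50)", y = "Count")
```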
Food Access and Pricing Inequality Across NYC Neighborhoods
This proposal explores how grocery pricing and store availability differ between low-income and high-income neighborhoods in New York City. Focusing on Brownsville (Brooklyn) and Lower Manhattan, the project investigates whether essential items like eggs and milk are more expensive or less accessible in areas with higher poverty rates. The analysis uses a placeholder dataset for now, with plans to incorporate real data from NYC Open Data, USDA Food Access Research Atlas, and the U.S. Census for the final project.
Week 7 – Working with HTML, XML, and JSON in R
This project explores how different data formats—HTML, XML, and JSON—can represent the same information and be read into R for analysis. Each file was created manually to better understand structural differences and how R packages like rvest, xml2, and jsonlite handle them. The comparison confirmed that all formats matched perfectly after being normalized. This assignment helped me connect classroom learning to real-world data handling, especially how formats are chosen based on whether data is meant for humans or systems to read.
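A minimal sketch of reading the three formats, with hypothetical file names and a two-column books table standing in for the real files:

```r
library(dplyr)
library(rvest)     # HTML
library(xml2)      # XML
library(jsonlite)  # JSON

# Hypothetical files, each holding the same small two-column books table
books_html <- read_html("books.html") %>%
  html_element("table") %>%
  html_table()

book_nodes <- read_xml("books.xml") %>% xml_find_all(".//book")
books_xml <- tibble(
  title  = xml_text(xml_find_first(book_nodes, "./title")),
  author = xml_text(xml_find_first(book_nodes, "./author"))
)

books_json <- fromJSON("books.json") %>% as_tibble()

# After normalizing column names and types, the three should be identical
all.equal(books_html, books_xml)
all.equal(books_xml, books_json)
```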
Exploring the Normal Distribution in Fast Food Nutrition Data
This lab explores the concept of the normal distribution using nutritional data from fast food restaurants.
Through visualization, simulation, and probability analysis in R, we examine how well real-world data (like calories from fat, sodium, and carbohydrates) align with a theoretical normal distribution.
Using the tidyverse and openintro packages, I compared McDonald’s and Dairy Queen menu items, generated Q-Q plots, and calculated both theoretical and empirical probabilities.
This lab demonstrates how statistical concepts can be applied to everyday datasets — providing practical experience in data visualization, distribution analysis, and probability modeling.
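A short sketch of the Q-Q plot and probability comparison, assuming the openintro fastfood column names and an arbitrary 600-calorie cutoff:

```r
library(dplyr)
library(ggplot2)
library(openintro)  # fastfood nutrition data

data(fastfood)
dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")

dq_mean <- mean(dairy_queen$cal_fat)
dq_sd   <- sd(dairy_queen$cal_fat)

# Q-Q plot: how closely do calories from fat track a normal distribution?
ggplot(dairy_queen, aes(sample = cal_fat)) +
  stat_qq() +
  stat_qq_line()

# Theoretical vs. empirical probability of an item exceeding 600 calories from fat
pnorm(600, mean = dq_mean, sd = dq_sd, lower.tail = FALSE)  # normal model
mean(dairy_queen$cal_fat > 600)                             # observed data
```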
Project 2 – Data Transformation: Converting Wide Data into Tidy Formats
This project demonstrates how to transform wide datasets into tidy formats using R. Three datasets—Sales, Scores, and Vaccinations—were cleaned, reshaped, and summarized to prepare them for analysis and visualization. The project highlights the use of pivot_longer(), mutate(), and group_by() for data tidying, and includes visual summaries created with ggplot2. The completed outputs were exported to CSV files and packaged into a single zip file for easy sharing.
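A minimal sketch of the reshaping pattern on a made-up Sales table (the project's real column names and summaries differ):

```r
library(dplyr)
library(tidyr)

# Made-up wide Sales table: one row per region, one column per month
sales_wide <- tibble(
  region = c("North", "South"),
  Jan = c(120, 90),
  Feb = c(150, 110),
  Mar = c(130, 95)
)

sales_tidy <- sales_wide %>%
  pivot_longer(cols = Jan:Mar, names_to = "month", values_to = "sales") %>%
  mutate(month = factor(month, levels = c("Jan", "Feb", "Mar")))

# Once tidy, grouped summaries (and ggplot2 visuals) come naturally
sales_tidy %>%
  group_by(region) %>%
  summarise(total_sales = sum(sales), avg_sales = mean(sales))
```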
Lab 3 — Probability (Hot Hand)
This lab investigates the “hot hand” idea using Kobe Bryant’s 2009 NBA Finals shot data. I compute streak lengths from the real data and compare them to a simulation of an independent shooter with the same make rate (45%). Using histograms and summary statistics of streak lengths, I assess whether Kobe’s patterns look meaningfully different from randomness. The results suggest most streaks are short, and the longer ones we do see are consistent with what independence would produce.
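A simplified sketch of the simulation side; the streak definition below (lengths of consecutive-hit runs) is looser than the lab's calc_streak() helper but illustrates the comparison:

```r
set.seed(8)

# 133 independent shots at a 45% make rate (the attempt count in the
# OpenIntro kobe_basket data)
sim_shots <- sample(c("H", "M"), size = 133, replace = TRUE,
                    prob = c(0.45, 0.55))

# Streaks as lengths of consecutive-hit runs
runs <- rle(sim_shots == "H")
sim_streaks <- runs$lengths[runs$values]

table(sim_streaks)  # most streaks are short
max(sim_streaks)    # occasional longer streaks appear by chance alone
```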
Assignment 5B – Elo Calculations and Performance Analysis
This report analyzes chess tournament results using Elo calculations to compare actual player performance against expected outcomes based on pre-tournament ratings. It identifies the top overperformers and underperformers, explains patterns using statistical modeling, and includes visualizations, tables, and a CSV export of results. The analysis demonstrates how data transformation and tidy data principles can be applied to real-world competitive data.
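For context, the expected-score side of the calculation follows the standard Elo formula, sketched here with made-up ratings:

```r
# Standard Elo expected score: a logistic curve on the rating difference,
# scaled so that a 400-point edge implies roughly 10-to-1 odds
elo_expected <- function(r_player, r_opp) {
  1 / (1 + 10 ^ ((r_opp - r_player) / 400))
}

# A player's expected tournament score is the sum over their opponents;
# performance is then actual points minus expected points
elo_expected(1600, c(1500, 1650, 1800))
#> approximately 0.64, 0.43, 0.24
```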
Assignment 5A – Airline Delays Analysis
This assignment analyzes airline delay data for two airlines across five cities.
The dataset was initially provided in a wide format and transformed into a tidy long format using R.
The analysis includes:
1. Cleaning and handling missing data.
2. Calculating the overall share of delays for each airline.
3. Comparing the percentage of delays within each city, visualized through a stacked bar plot.
4. Identifying discrepancies between overall totals and city-by-city breakdowns, illustrating Simpson’s Paradox.
This work demonstrates how data transformation and visualization in R can reveal patterns, like Simpson’s Paradox, that aggregate summaries alone would hide.
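A minimal sketch of the reshaping and rate comparison, using placeholder counts rather than the assignment's actual figures:

```r
library(dplyr)
library(tidyr)

# Placeholder counts for two airlines in two cities (wide layout:
# airline, status, then one column per city)
flights_wide <- tribble(
  ~airline,   ~status,   ~CityA, ~CityB,
  "Airline1", "on time",  500,    200,
  "Airline1", "delayed",   60,     10,
  "Airline2", "on time",  700,   4800,
  "Airline2", "delayed",  120,    400
)

flights_long <- flights_wide %>%
  pivot_longer(cols = -c(airline, status), names_to = "city", values_to = "flights")

# Delay rate within each city...
flights_long %>%
  group_by(airline, city) %>%
  summarise(delay_rate = flights[status == "delayed"] / sum(flights), .groups = "drop")

# ...versus overall: the rankings can flip (Simpson's Paradox)
flights_long %>%
  group_by(airline) %>%
  summarise(delay_rate = sum(flights[status == "delayed"]) / sum(flights))
```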
Project 1 - Chess Tournament Data
Analysis of chess tournament player data using R. Includes average opponent ratings and player statistics.
Week 3B: Window Functions — Moving Averages
This report analyzes stock price data for Apple and Microsoft using YTD averages and 6-day moving averages to highlight short-term vs. long-term trends.
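A minimal sketch of the two window calculations on simulated prices (the report itself uses actual AAPL and MSFT data):

```r
library(dplyr)
library(zoo)  # rollmean for the moving average

set.seed(3)

# Simulated daily closes standing in for the real price series
prices <- tibble(
  date  = seq(as.Date("2024-01-01"), by = "day", length.out = 30),
  close = 180 + cumsum(rnorm(30, mean = 0, sd = 2))
)

prices %>%
  arrange(date) %>%
  mutate(
    ytd_avg = cummean(close),                                      # running average since the start
    ma_6day = rollmean(close, k = 6, fill = NA, align = "right")   # trailing 6-day moving average
  )
```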
Week 3A: Global Baseline Estimates (Movie Ratings)
Global Baseline recommender using μ + (user_avg − μ) + (movie_avg − μ). Includes cleaned data, baseline tables, predictions, recommendations, and visuals.
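A small worked sketch of that baseline on toy ratings (placeholder data, not the assignment's):

```r
library(dplyr)
library(tidyr)

# Toy ratings in long form
ratings <- tribble(
  ~user,  ~movie,   ~rating,
  "Ann",  "MovieA", 5,
  "Ann",  "MovieB", 3,
  "Ben",  "MovieA", 4,
  "Ben",  "MovieC", 2,
  "Cara", "MovieB", 4,
  "Cara", "MovieC", 3
)

mu <- mean(ratings$rating)  # global mean rating

user_avg  <- ratings %>% group_by(user)  %>% summarise(user_avg  = mean(rating))
movie_avg <- ratings %>% group_by(movie) %>% summarise(movie_avg = mean(rating))

# Baseline prediction: mu + (user_avg - mu) + (movie_avg - mu)
crossing(user = unique(ratings$user), movie = unique(ratings$movie)) %>%
  left_join(user_avg,  by = "user") %>%
  left_join(movie_avg, by = "movie") %>%
  mutate(pred = mu + (user_avg - mu) + (movie_avg - mu))
```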
Week 2B: Evaluating Classification Model Performance
Null error rate, confusion matrices at thresholds 0.2/0.5/0.8, and accuracy/precision/recall/F1 for penguin predictions.
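A minimal sketch of the threshold sweep on hypothetical scored predictions (the penguin model's actual outputs differ):

```r
library(dplyr)

# Hypothetical scored predictions: true class and predicted probability
scored <- tibble(
  actual = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1),
  prob   = c(0.91, 0.35, 0.62, 0.48, 0.12, 0.55, 0.80, 0.22, 0.95, 0.70)
)

# Null error rate: the error from always predicting the majority class
1 - max(mean(scored$actual), 1 - mean(scored$actual))

metrics_at <- function(df, threshold) {
  pred <- as.integer(df$prob >= threshold)
  tp <- sum(pred == 1 & df$actual == 1); fp <- sum(pred == 1 & df$actual == 0)
  fn <- sum(pred == 0 & df$actual == 1); tn <- sum(pred == 0 & df$actual == 0)
  tibble(threshold = threshold,
         accuracy  = (tp + tn) / nrow(df),
         precision = tp / (tp + fp),
         recall    = tp / (tp + fn)) %>%
    mutate(f1 = 2 * precision * recall / (precision + recall))
}

bind_rows(lapply(c(0.2, 0.5, 0.8), metrics_at, df = scored))
```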
Week 2A: SQL and R — Movie Ratings
Connecting to MySQL from RStudio, importing data, exporting to CSV, and generating movie ratings summaries.
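A minimal sketch of the round trip, with placeholder connection details and table/column names:

```r
library(DBI)
library(RMySQL)   # RMariaDB works the same way
library(readr)

# Placeholder host, credentials, and schema
con <- dbConnect(RMySQL::MySQL(),
                 host = "localhost", dbname = "movies",
                 username = "data607", password = Sys.getenv("MYSQL_PWD"))

ratings <- dbGetQuery(con, "SELECT movie, rater, rating FROM movie_ratings;")
dbDisconnect(con)

write_csv(ratings, "movie_ratings.csv")

# Quick summary: average rating per movie
aggregate(rating ~ movie, data = ratings, FUN = mean)
```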
DATA607 Week 1: Pima Indians Diabetes Analysis
Assignment for DATA607 showing how to load and clean the Pima Indians Diabetes dataset.
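A minimal sketch of the load-and-clean step, assuming the common UCI/Kaggle column layout and treating implausible zeros as missing (the assignment's exact cleaning may differ):

```r
library(readr)
library(dplyr)

# Common column layout for the Pima dataset; file path is a placeholder
pima <- read_csv("pima-indians-diabetes.csv",
                 col_names = c("pregnancies", "glucose", "blood_pressure",
                               "skin_thickness", "insulin", "bmi",
                               "pedigree", "age", "outcome"))

# Zeros in these physiological measures are implausible and are typically
# recoded as missing before analysis
pima_clean <- pima %>%
  mutate(across(c(glucose, blood_pressure, skin_thickness, insulin, bmi),
                ~ na_if(.x, 0)))

summary(pima_clean)
```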