Recently Published
NYPD Arrest Rate vs Weather Variables
The NYPD Arrests dataset provides a detailed record of arrests in New York City. It serves as a valuable resource for understanding crime patterns and trends within the city. The dataset encompasses a wide range of information, including the demographic details of individuals arrested, the types of crimes committed, and the locations where arrests occurred.
Investigating the correlation between arrest rates and weather variables can yield significant insights that inform critical decisions across various domains. This analysis has the potential to influence public policy, policing strategies, and even how public resources and tax dollars are allocated.
Multiple Linear Regression
Grading the professor
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity" by Hamermesh and Parker found that instructors who are viewed as better looking receive higher instructional ratings.
ggplot2 Vignette
This project showcases an example of an R Vignette, centered around the powerful visualization capabilities of the ggplot2 package. In this R Markdown document, a dataset sourced from FiveThirtyEight is used, specifically the age distribution within the U.S. Congress (https://fivethirtyeight.com/features/aging-congress-boomers/). The goal is to demonstrate how to effectively use ggplot2, part of the tidyverse ecosystem, to create insightful and visually appealing plots from this dataset.
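To give a flavor of the workflow, here is a minimal sketch of the kind of plot the vignette builds; the file name and the column names (congress, chamber, age_years) are assumptions about the FiveThirtyEight data, not its confirmed schema.

```r
library(tidyverse)

# Placeholder path for the FiveThirtyEight congress-age file
congress_age <- read_csv("data_aging_congress.csv")

congress_age %>%
  group_by(congress, chamber) %>%
  summarise(median_age = median(age_years), .groups = "drop") %>%
  ggplot(aes(x = congress, y = median_age, color = chamber)) +
  geom_line() +
  labs(title = "Median age of the U.S. Congress over time",
       x = "Congress number", y = "Median age (years)")
```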
Document Classification - Logistic Regression
Using the Miller Center API, we analyze the speeches of the Presidents of the United States, working from the speech transcripts to classify each speech into one of two categories: speeches by Barack Obama and speeches by other Presidents. We compare the CountVectorizer and TfidfVectorizer approaches for converting the text into numerical features, and use a logistic regression model to classify the speeches.
Introduction to linear regression
The Human Freedom Index is a report that attempts to summarize the idea of "freedom" through a number of different variables for many countries around the globe. It serves as a rough objective measure of the relationships between the different types of freedom - whether political, religious, economic, or personal - and other social and economic circumstances. The Human Freedom Index is co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
In this lab, you'll be analyzing data from Human Freedom Index reports from 2008-2016. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.
Netflix Recommender System
Netflix, Inc., an American media company founded in 1997, is the world's preeminent subscription video on-demand (SVOD) service. Its streaming service, launched in 2007, offers a vast library of original and acquired films and television shows across various genres. As of January 2024, Netflix reported more than 260 million paid memberships in over 190 countries, solidifying its position as the industry leader in streaming media.
Personalized Recommendations Drive User Engagement:
Central to Netflix's success is its sophisticated recommendation system. This system, powered by advanced machine learning algorithms, analyzes a multitude of user data points, including viewing history, search queries, and user ratings. By leveraging these insights, Netflix curates a personalized selection of movies and TV shows, significantly enhancing user engagement and satisfaction. This data-driven approach ensures that subscribers discover content tailored to their individual preferences, fostering a more enjoyable and immersive entertainment experience.
Inference for numerical data
In this lab, we will explore and visualize the data using the **tidyverse** suite of packages, and perform statistical inference using **infer**. The data can be found in the companion package for OpenIntro resources, **openintro**.
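As a taste of the workflow, here is a minimal infer sketch for a bootstrap confidence interval; it uses the built-in mtcars data purely as a stand-in for the lab's dataset.

```r
library(tidyverse)
library(infer)

# Bootstrap 95% confidence interval for a mean, using mtcars only as a stand-in
set.seed(607)
boot_dist <- mtcars %>%
  specify(response = mpg) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

get_ci(boot_dist, level = 0.95)
```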
Vignette for Purrr
The `purrr` package provides functions that eliminate the need for many common for loops. They are more consistent and thus easier to learn than the alternative functions in base R. `purrr` lets you generalize a solution to every element in a list, and it lets you break a problem into lots of small pieces and compose them together with the pipe.
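A small illustration of both ideas, using made-up data: mapping a function over every list element, and composing steps with the pipe.

```r
library(purrr)

scores <- list(math = c(90, 75, 86), reading = c(88, 92), science = c(70, 65, 80))

# One call generalizes "take the mean" to every element of the list
map_dbl(scores, mean)

# Small pieces composed with the pipe: keep the longer vectors, then summarise each
scores |>
  keep(~ length(.x) >= 3) |>
  map_dbl(mean)
```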
Sentiment Analysis
In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
Work with a different corpus of your choosing, and
Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.
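As a starting point, here is a hedged sketch of the chapter-2 pattern applied to a tiny invented corpus; the text is made up purely for illustration, and the same join step works for whichever additional lexicon you bring in.

```r
library(tidyverse)
library(tidytext)

# Any corpus works here; this tiny invented one just stands in for your choice
corpus <- tibble(
  doc  = c(1, 2),
  text = c("I love this bright, wonderful morning",
           "The gloomy afternoon was a terrible bore")
)

corpus %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # first lexicon
  count(doc, sentiment)

# A second lexicon (e.g. get_sentiments("loughran"), fetched via the textdata
# package) can be joined in exactly the same way for the comparison
```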
NYT Web API
This project aims to construct a data frame containing the current New York Times Best Sellers List for the ‘Combined Print & E-Book Fiction’ category. The data will be retrieved by leveraging the New York Times Books API.
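A hedged sketch of the retrieval step is below; the endpoint path follows the public Books API documentation but should be verified, and the API key is read from an environment variable rather than hard-coded.

```r
library(httr)
library(jsonlite)
library(dplyr)

# API key read from the environment so it never appears in the published code
api_key <- Sys.getenv("NYT_API_KEY")

# Current 'Combined Print & E-Book Fiction' list (endpoint path is a placeholder
# to confirm against the Books API docs)
url <- paste0(
  "https://api.nytimes.com/svc/books/v3/lists/current/",
  "combined-print-and-e-book-fiction.json?api-key=", api_key
)

resp   <- GET(url)
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# The list of books typically sits under results$books in the parsed response
best_sellers <- as_tibble(parsed$results$books)
```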
Super Bowl Data
Super Bowl data scraped and cleaned from a website.
Inference for categorical data
You will be analyzing the same dataset as in the previous lab, where you delved into a sample from the Youth Risk Behavior Surveillance System (YRBSS) survey, which uses data from high schoolers to help discover health patterns. The dataset is called yrbss.
Project 3 - Data Science Skills
This project aims to establish a quantitative assessment of the relative value of specific skills for data science professionals. We will achieve this by analyzing data extracted from job postings on relevant job boards. The analysis will focus on two key aspects of data scientist job postings: advertised salary and the frequency of specific skills mentioned in the job descriptions. By correlating these factors, we can develop a proxy measure to compare the relative value of various skills sought after in the data science job market.
Working with XML and JSON in R
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
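One possible way to read the three files into R is sketched below; the package choices and the assumed XML element names (book, title, authors) are illustrative, not required.

```r
library(jsonlite)
library(xml2)
library(rvest)
library(dplyr)

# JSON: fromJSON() usually returns a data frame directly for an array of records
books_json <- as.data.frame(fromJSON("books.json"))

# HTML: pull the first <table> out of the page
books_html <- read_html("books.html") %>%
  html_element("table") %>%
  html_table()

# XML: assumes one <book> node per book with <title> and <authors> children
book_nodes <- read_xml("books.xml") %>% xml_find_all(".//book")
books_xml <- tibble(
  title   = xml_text(xml_find_first(book_nodes, ".//title")),
  authors = xml_text(xml_find_first(book_nodes, ".//authors"))
)
```

Even with the same underlying information, the three data frames often differ in column types and in how multiple authors are represented, which is the point of the comparison.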
Foundations for statistical inference - Confidence intervals
In this lab, we will explore and visualize the data using the **tidyverse** suite of packages, and perform statistical inference using **infer**.
Foundations for statistical inference - Sampling distributions
In this lab, you will investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.
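A minimal base-R sketch of the idea, using a simulated population purely for illustration:

```r
# Simulated population, purely for illustration
set.seed(607)
population <- rnorm(100000, mean = 50, sd = 10)

# Draw 1,000 random samples of size 60; each sample mean is a point estimate
sample_means <- replicate(1000, mean(sample(population, size = 60)))

hist(sample_means, main = "Sampling distribution of the mean", xlab = "Sample mean")
sd(sample_means)   # close to the theoretical standard error, 10 / sqrt(60)
```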
Fast Food Distribution
In this lab, you'll investigate the probability distribution that is most central to statistics: the normal distribution. If you are confident that your data are nearly normal, that opens the door to many powerful statistical methods. Here we'll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
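A short base-R sketch of both tasks, using simulated data in place of the fast-food measurements:

```r
# Simulated stand-in for the fast-food variable of interest
set.seed(42)
calories <- rnorm(200, mean = 520, sd = 110)

# Graphical normality checks: histogram with a normal curve, then a Q-Q plot
hist(calories, probability = TRUE)
curve(dnorm(x, mean = mean(calories), sd = sd(calories)), add = TRUE)
qqnorm(calories)
qqline(calories)

# Generating random numbers from a normal distribution is the same rnorm() call
sim_calories <- rnorm(200, mean = mean(calories), sd = sd(calories))
```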
Language Diversity Dataset
This data will be analyzed to determine which countries have the highest number of languages per capita. Some countries contain a vast amount of cultural and linguistic diversity; to quantify this, we will create a new column giving the number of languages per capita. Using this measure, we aim to identify the countries within this dataset with the highest number of languages spoken per resident.
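A hedged sketch of the new column; the toy rows and the column names langs and population are stand-ins for the actual dataset.

```r
library(dplyr)

# Toy stand-in for the real dataset; the column names langs and population
# are assumptions about the actual data
language_diversity <- tibble(
  country    = c("Country A", "Country B", "Country C"),
  langs      = c(820, 12, 445),
  population = c(9.0e6, 4.0e5, 1.4e9)
)

language_diversity %>%
  mutate(langs_per_capita = langs / population) %>%
  arrange(desc(langs_per_capita))
```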
FIFA Player Data
Source: 2021 FIFA player data. We begin tidying this data set by loading the necessary libraries and reading the raw CSV file into a data frame, which we will call `fifaplayer_data`.
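A short sketch of that loading step, with a placeholder file name standing in for the raw CSV:

```r
library(tidyverse)

# Placeholder file name for the raw 2021 FIFA player CSV
fifaplayer_data <- read_csv("fifa21_raw_data.csv")
glimpse(fifaplayer_data)
```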
FDA-Approved A.I.-based Algorithms
Content: This dataset contains information on medical devices and algorithms approved by the FDA from 1995 to 2021.
Selection Reason: This dataset was chosen as an illustrative example of an untidy dataset due to the presence of the following data quality issues:
Duplicate variables: The dataset contained redundant variables named "Medical specialty" and "Secondary medical specialty" with identical purposes.
Ambiguous variable names: The dataset included variable names that were unclear or lacked proper definition.
Missing or incomplete data: Some data points were either missing entirely or incomplete.
Inconsistent missing value representation: Missing data was represented inconsistently.
Tidying and Transforming Data
(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
(3) Perform analysis to compare the arrival delays for the two airlines.
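A hedged sketch of steps (2) and (3) follows; the file name, column names, and status labels are assumptions about how the wide CSV from step (1) might be laid out.

```r
library(tidyverse)

# Step (2): read the wide file from step (1) and reshape it
# (file name, id columns, and the "on time"/"delayed" labels are assumed)
flights_wide <- read_csv("flights.csv")   # placeholder path

flights_long <- flights_wide %>%
  pivot_longer(cols = -c(airline, status),
               names_to = "city", values_to = "count")

# Step (3): compare delay rates for the two airlines
flights_long %>%
  group_by(airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = status, values_from = total) %>%
  mutate(delay_rate = delayed / (delayed + `on time`))
```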
Hot Hand Theory
Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events. This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.
Manipulation and Data Processing
#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either "DATA" or "STATISTICS"
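One hedged way to approach #1 with the tidyverse; the raw-file URL and the Major column name follow the fivethirtyeight GitHub data repository but should be verified before use.

```r
library(tidyverse)

# Path follows the fivethirtyeight GitHub data repository (verify before use)
majors <- read_csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
)

majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))
```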
#2 Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
#3 Describe, in words, what these expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
#4 Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
R and SQL
Data 607 Week Two - R and SQL
Part 1: Build Table
• Choose six recent popular movies.
• Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5.
Part 2: Store data in SQL database
• Take the results (observations) and store them in the class MySQL database:
- Server name: cunydata607sql.mysql.database.azure.com
- Username / password: will be given to you in an email
Note: it is good practice to change your password. To do so, use this SQL command: SET PASSWORD = '<your new password here>';
Part 3: Transfer data from SQL database to R dataframe
• Load the information from the SQL database into an R dataframe.
Part 4: Missing data strategy
• Implement an approach to missing data.
• Explain why you decided to take the chosen approach.
Note: consider that later in the course you will revisit this information you have collected and will use it to implement a Recommender.
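A hedged sketch of Parts 3 and 4 is below; the table name movie_ratings and its movie and rating columns are assumptions about how you stored the survey results, and the credentials are pulled from environment variables rather than typed into the script.

```r
library(DBI)
library(RMariaDB)
library(dplyr)

# Connect to the class MySQL server; credentials live in environment variables
con <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "cunydata607sql.mysql.database.azure.com",
  username = Sys.getenv("DATA607_USER"),
  password = Sys.getenv("DATA607_PASS"),
  dbname   = Sys.getenv("DATA607_DB")
)

ratings <- dbReadTable(con, "movie_ratings")   # assumed table name
dbDisconnect(con)

# One possible missing-data strategy: treat NA as "not seen" and average only
# over the ratings that exist for each movie
ratings %>%
  group_by(movie) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE),
            n_raters   = sum(!is.na(rating)))
```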
Bonus Challenge Questions:
You’re encouraged to optionally find other ways to make your solution better. For example, consider incorporating one or more of the following suggestions into your solution:
• Use survey software to gather the information.
• Are you able to use a password without having to share the password with people who are viewing your code?
• There are a lot of interesting approaches that you can uncover with a little bit of research.
• While it's acceptable to create a single SQL table, can you create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated?
• Is there any benefit in standardizing ratings? How might you approach this?
Loading Data into a Data Frame
You should first study the data and any other information on the GitHub site, and read the associated fivethirtyeight.com article.
To receive full credit, you should:
1. Take the data, and create one or more code blocks. You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka response or dependent) variable, you should include this in your set of columns. You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected. For example, if you had instead been tasked with working with the UCI mushroom dataset, you would include the target column for edible or poisonous, and transform "e" values to "edible." Your deliverable is the R code to perform these transformation tasks (a minimal sketch appears after this list).
2. Make sure that the original data file is accessible through your code, for example, stored in a GitHub repository or AWS S3 bucket and referenced in your code. If the code references data on your local machine, then your work is not reproducible!
3. Start your R Markdown document with a two to three sentence "Overview" or "Introduction" description of what the article that you chose is about, and include a link to the article.
4. Finish with a "Conclusions" or "Findings and Recommendations" text block that includes what you might do to extend, verify, or update the work from the selected article.
5. Each of your text blocks should minimally include at least one header and additional non-header text.
6. You're of course welcome, but not required, to include additional information, such as exploratory data analysis graphics (which we will cover later in the course).
7. Place your solution into a single R Markdown (.Rmd) file and publish your solution out to rpubs.com.
8. Post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link.
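For step 1, here is a hedged sketch using the mushroom example from the prompt; the UCI file URL, the column positions, and the letter codes are assumptions to verify against the UCI documentation, and should be adapted to the dataset you actually choose.

```r
library(tidyverse)

# Read the raw UCI mushroom data (no header row in the source file)
mushrooms_raw <- read_csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data",
  col_names = FALSE
)

# Keep a subset of columns, give them meaningful names, and expand the codes
# (column positions X1/X2/X6 and the e/p codes are assumptions to verify)
mushrooms <- mushrooms_raw %>%
  select(class = X1, cap_shape = X2, odor = X6) %>%
  mutate(class = recode(class, "e" = "edible", "p" = "poisonous"))

glimpse(mushrooms)
```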
Lesson 1: Intro to R and RStudio
The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To clarify which is which: R is the name of the programming language itself and RStudio is a convenient interface.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.