Recently Published
NYPD Arrest Rate vs Weather Variables
The NYPD Arrests dataset provides a detailed record of arrests in New York City. It serves as a valuable resource for understanding crime patterns and trends within the city. The dataset encompasses a wide range of information, including the demographic details of individuals arrested, the types of crimes committed, and the locations where arrests occurred.
Investigating the correlation between arrest rates and weather variables can yield significant insights that inform critical decisions across various domains. This analysis has the potential to influence public policy, policing strategies, and even how public resources and tax dollars are allocated.
Multiple Linear Regression
Grading the professor
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity" by Hamermesh and Parker found that instructors who are viewed as better looking receive higher instructional ratings.
ggplot2 Vignette
This project showcases an example of an R Vignette, centered around the powerful visualization capabilities of the ggplot2 package. In this R Markdown document, a dataset sourced from FiveThirtyEight is used, specifically the age distribution within the U.S. Congress (https://fivethirtyeight.com/features/aging-congress-boomers/). The goal is to demonstrate how to effectively use ggplot2, part of the tidyverse ecosystem, to create insightful and visually appealing plots from this dataset.
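To give a flavor of the workflow, here is a minimal sketch of the kind of plot the vignette builds; the file name and the column names (congress, chamber, age_years) are assumptions about the FiveThirtyEight data, not its confirmed schema.

```r
library(tidyverse)

# Placeholder path for the FiveThirtyEight congress-age file
congress_age <- read_csv("data_aging_congress.csv")

congress_age %>%
  group_by(congress, chamber) %>%
  summarise(median_age = median(age_years), .groups = "drop") %>%
  ggplot(aes(x = congress, y = median_age, color = chamber)) +
  geom_line() +
  labs(title = "Median age of the U.S. Congress over time",
       x = "Congress number", y = "Median age (years)")
```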
Document Classification - Logistic Regression
Using the Miller Center API, we analyze the speeches of the Presidents of the United States, working from the speech transcripts to classify each speech into one of two categories: speeches by Barack Obama and speeches by other Presidents. We compare the CountVectorizer and TfidfVectorizer approaches for converting the text into numerical features, and use a logistic regression model to classify the speeches.
Introduction to linear regression
The Human Freedom Index is a report that attempts to summarize the idea of "freedom" through a number of different variables for many countries around the globe. It serves as a rough objective measure of the relationships between the different types of freedom - whether political, religious, economic, or personal - and other social and economic circumstances. The Human Freedom Index is co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
In this lab, you'll be analyzing data from Human Freedom Index reports from 2008-2016. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.
Netflix Recommender System
Netflix, Inc., an American media company founded in 1997, is the world's preeminent subscription video on-demand (SVOD) service. Its streaming service, launched in 2007, offers a vast library of original and acquired films and television shows across various genres. As of January 2024, Netflix reported more than 260 million paid memberships in over 190 countries, solidifying its position as the industry leader in streaming media.
Personalized Recommendations Drive User Engagement:
Central to Netflix's success is its sophisticated recommendation system. This system, powered by advanced machine learning algorithms, analyzes a multitude of user data points, including viewing history, search queries, and user ratings. By leveraging these insights, Netflix curates a personalized selection of movies and TV shows, significantly enhancing user engagement and satisfaction. This data-driven approach ensures that subscribers discover content tailored to their individual preferences, fostering a more enjoyable and immersive entertainment experience.
Inference for numerical data
In this lab, we will explore and visualize the data using the **tidyverse** suite of packages, and perform statistical inference using **infer**. The data can be found in the companion package for OpenIntro resources, **openintro**.
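As a taste of the workflow, here is a minimal infer sketch for a bootstrap confidence interval; it uses the built-in mtcars data purely as a stand-in for the lab's dataset.

```r
library(tidyverse)
library(infer)

# Bootstrap 95% confidence interval for a mean, using mtcars only as a stand-in
set.seed(607)
boot_dist <- mtcars %>%
  specify(response = mpg) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

get_ci(boot_dist, level = 0.95)
```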
Vignette for Purrr
The `purrr` package provides functions that eliminate the need for many common for loops. They are more consistent and thus easier to learn than the alternative functions in base R. `purrr` lets you generalize a solution to every element in a list, and it lets you break a problem into lots of small pieces and compose them together with the pipe.
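A small illustration of both ideas, using made-up data: mapping a function over every list element, and composing steps with the pipe.

```r
library(purrr)

scores <- list(math = c(90, 75, 86), reading = c(88, 92), science = c(70, 65, 80))

# One call generalizes "take the mean" to every element of the list
map_dbl(scores, mean)

# Small pieces composed with the pipe: keep the longer vectors, then summarise each
scores |>
  keep(~ length(.x) >= 3) |>
  map_dbl(mean)
```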
Sentiment Analysis
In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
Work with a different corpus of your choosing, and
Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.
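As a starting point, here is a hedged sketch of the chapter-2 pattern applied to a tiny invented corpus; the text is made up purely for illustration, and the same join step works for whichever additional lexicon you bring in.

```r
library(tidyverse)
library(tidytext)

# Any corpus works here; this tiny invented one just stands in for your choice
corpus <- tibble(
  doc  = c(1, 2),
  text = c("I love this bright, wonderful morning",
           "The gloomy afternoon was a terrible bore")
)

corpus %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # first lexicon
  count(doc, sentiment)

# A second lexicon (e.g. get_sentiments("loughran"), fetched via the textdata
# package) can be joined in exactly the same way for the comparison
```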
NYT Web API
This project aims to construct a data frame containing the current New York Times Best Sellers List for the ‘Combined Print & E-Book Fiction’ category. The data will be retrieved by leveraging the New York Times Books API.
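A hedged sketch of the retrieval step is below; the endpoint path follows the public Books API documentation but should be verified, and the API key is read from an environment variable rather than hard-coded.

```r
library(httr)
library(jsonlite)
library(dplyr)

# API key read from the environment so it never appears in the published code
api_key <- Sys.getenv("NYT_API_KEY")

# Current 'Combined Print & E-Book Fiction' list (endpoint path is a placeholder
# to confirm against the Books API docs)
url <- paste0(
  "https://api.nytimes.com/svc/books/v3/lists/current/",
  "combined-print-and-e-book-fiction.json?api-key=", api_key
)

resp   <- GET(url)
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# The list of books typically sits under results$books in the parsed response
best_sellers <- as_tibble(parsed$results$books)
```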
Super Bowl Data
Super Bowl data scraped and cleaned from a website.
Inference for categorical data
You will be analyzing the same dataset as in the previous lab, where you delved into a sample from the Youth Risk Behavior Surveillance System (YRBSS) survey, which uses data from high schoolers to help discover health patterns. The dataset is called yrbss.
Project 3 - Data Science Skills
This project aims to establish a quantitative assessment of the relative value of specific skills for data science professionals. We will achieve this by analyzing data extracted from job postings on relevant job boards. The analysis will focus on two key aspects of data scientist job postings: advertised salary and the frequency of specific skills mentioned in the job descriptions. By correlating these factors, we can develop a proxy measure to compare the relative value of various skills sought after in the data science job market.
Working with XML and JSON in R
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
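One possible way to read the three files into R is sketched below; the package choices and the assumed XML element names (book, title, authors) are illustrative, not required.

```r
library(jsonlite)
library(xml2)
library(rvest)
library(dplyr)

# JSON: fromJSON() usually returns a data frame directly for an array of records
books_json <- as.data.frame(fromJSON("books.json"))

# HTML: pull the first <table> out of the page
books_html <- read_html("books.html") %>%
  html_element("table") %>%
  html_table()

# XML: assumes one <book> node per book with <title> and <authors> children
book_nodes <- read_xml("books.xml") %>% xml_find_all(".//book")
books_xml <- tibble(
  title   = xml_text(xml_find_first(book_nodes, ".//title")),
  authors = xml_text(xml_find_first(book_nodes, ".//authors"))
)
```

Even with the same underlying information, the three data frames often differ in column types and in how multiple authors are represented, which is the point of the comparison.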
Foundations for statistical inference - Confidence intervals
In this lab, we will explore and visualize the data using the **tidyverse** suite of packages, and perform statistical inference using **infer**.
Foundations for statistical inference - Sampling distributions
In this lab, you will investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.
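A minimal base-R sketch of the idea, using a simulated population purely for illustration:

```r
# Simulated population, purely for illustration
set.seed(607)
population <- rnorm(100000, mean = 50, sd = 10)

# Draw 1,000 random samples of size 60; each sample mean is a point estimate
sample_means <- replicate(1000, mean(sample(population, size = 60)))

hist(sample_means, main = "Sampling distribution of the mean", xlab = "Sample mean")
sd(sample_means)   # close to the theoretical standard error, 10 / sqrt(60)
```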
Fast Food Distribution
In this lab, you'll investigate the probability distribution that is most central to statistics: the normal distribution. If you are confident that your data are nearly normal, that opens the door to many powerful statistical methods. Here we'll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
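A short base-R sketch of both tasks, using simulated data in place of the fast-food measurements:

```r
# Simulated stand-in for the fast-food variable of interest
set.seed(42)
calories <- rnorm(200, mean = 520, sd = 110)

# Graphical normality checks: histogram with a normal curve, then a Q-Q plot
hist(calories, probability = TRUE)
curve(dnorm(x, mean = mean(calories), sd = sd(calories)), add = TRUE)
qqnorm(calories)
qqline(calories)

# Generating random numbers from a normal distribution is the same rnorm() call
sim_calories <- rnorm(200, mean = mean(calories), sd = sd(calories))
```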
Language Diversity Dataset
This data will be analyzed to determine which countries have the highest number of languages per capita. Some countries contain a vast amount of cultural and linguistic diversity; to quantify this, we will create a new column giving the number of languages per capita. Using this measure, we aim to identify the countries within this dataset with the highest number of languages spoken per resident.
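A hedged sketch of the new column; the toy rows and the column names langs and population are stand-ins for the actual dataset.

```r
library(dplyr)

# Toy stand-in for the real dataset; the column names langs and population
# are assumptions about the actual data
language_diversity <- tibble(
  country    = c("Country A", "Country B", "Country C"),
  langs      = c(820, 12, 445),
  population = c(9.0e6, 4.0e5, 1.4e9)
)

language_diversity %>%
  mutate(langs_per_capita = langs / population) %>%
  arrange(desc(langs_per_capita))
```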
FIFA Player Data
Source: 2021 FIFA player data. We begin tidying this data set by loading the necessary libraries and reading the raw CSV file into a data frame, which we will call `fifaplayer_data`.
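A short sketch of that loading step, with a placeholder file name standing in for the raw CSV:

```r
library(tidyverse)

# Placeholder file name for the raw 2021 FIFA player CSV
fifaplayer_data <- read_csv("fifa21_raw_data.csv")
glimpse(fifaplayer_data)
```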
FDA-Approved A.I.-based Algorithms
Content: This dataset contains information on medical devices and algorithms approved by the FDA from 1995 to 2021.
Selection Reason: This dataset was chosen as an illustrative example of an untidy dataset due to the presence of the following data quality issues:
Duplicate variables: The dataset contained redundant variables named "Medical specialty" and "Secondary medical specialty" with identical purposes.
Ambiguous variable names: The dataset included variable names that were unclear or lacked proper definition.
Missing or incomplete data: Some data points were either missing entirely or incomplete.
Inconsistent missing value representation: Missing data was represented inconsistently.
Tidying and Transforming Data
(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
(3) Perform analysis to compare the arrival delays for the two airlines.
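A hedged sketch of steps (2) and (3) follows; the file name, column names, and status labels are assumptions about how the wide CSV from step (1) might be laid out.

```r
library(tidyverse)

# Step (2): read the wide file from step (1) and reshape it
# (file name, id columns, and the "on time"/"delayed" labels are assumed)
flights_wide <- read_csv("flights.csv")   # placeholder path

flights_long <- flights_wide %>%
  pivot_longer(cols = -c(airline, status),
               names_to = "city", values_to = "count")

# Step (3): compare delay rates for the two airlines
flights_long %>%
  group_by(airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = status, values_from = total) %>%
  mutate(delay_rate = delayed / (delayed + `on time`))
```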
Hot Hand Theory
Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events. This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.
Manipulation and Data Processing
#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either "DATA" or "STATISTICS"
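One hedged way to approach #1 with the tidyverse; the raw-file URL and the Major column name follow the fivethirtyeight GitHub data repository but should be verified before use.

```r
library(tidyverse)

# Path follows the fivethirtyeight GitHub data repository (verify before use)
majors <- read_csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
)

majors %>%
  filter(str_detect(Major, "DATA|STATISTICS"))
```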
#2 Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
#3 Describe, in words, what these expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
#4 Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
R and SQL
Data 607 Week Two - R and SQL
Part 1: Build Table
• Choose six recent popular movies.
• Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5.
Part 2: Store data in SQL database
• Take the results (observations) and store them in the class MySQL database:
- Server name: cunydata607sql.mysql.database.azure.com
- Username / password: will be given to you in an email
Note: it is good practice to change your password. To do so, use this SQL command: SET PASSWORD = '<your new password here>';
Part 3: Transfer data from SQL database to R dataframe
• Load the information from the SQL database into an R dataframe.
Part 4: Missing data strategy
• Implement an approach to missing data.
• Explain why you decided to take the chosen approach.
Note: consider that later in the course you will revisit this information you have collected and will use it to implement a Recommender.
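A hedged sketch of Parts 3 and 4 is below; the table name movie_ratings and its movie and rating columns are assumptions about how you stored the survey results, and the credentials are pulled from environment variables rather than typed into the script.

```r
library(DBI)
library(RMariaDB)
library(dplyr)

# Connect to the class MySQL server; credentials live in environment variables
con <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "cunydata607sql.mysql.database.azure.com",
  username = Sys.getenv("DATA607_USER"),
  password = Sys.getenv("DATA607_PASS"),
  dbname   = Sys.getenv("DATA607_DB")
)

ratings <- dbReadTable(con, "movie_ratings")   # assumed table name
dbDisconnect(con)

# One possible missing-data strategy: treat NA as "not seen" and average only
# over the ratings that exist for each movie
ratings %>%
  group_by(movie) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE),
            n_raters   = sum(!is.na(rating)))
```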
Bonus Challenge Questions:
You’re encouraged to optionally find other ways to make your solution better. For example, consider incorporating one or more of the following suggestions into your solution:
• Use survey software to gather the information.
• Are you able to use a password without having to share the password with people who are viewing your code?
• There are a lot of interesting approaches that you can uncover with a little bit of research.
• While it's acceptable to create a single SQL table, can you create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated?
• Is there any benefit in standardizing ratings? How might you approach this?
Loading Data into a Data Frame
You should first study the data and any other information on the GitHub site, and read the associated fivethirtyeight.com article.
To receive full credit, you should:
1. Take the data, and create one or more code blocks. You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka response or dependent) variable, you should include this in your set of columns. You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected. For example, if you had instead been tasked with working with the UCI mushroom dataset, you would include the target column for edible or poisonous, and transform "e" values to "edible." Your deliverable is the R code to perform these transformation tasks (a minimal sketch appears after this list).
2. Make sure that the original data file is accessible through your code, for example, stored in a GitHub repository or AWS S3 bucket and referenced in your code. If the code references data on your local machine, then your work is not reproducible!
3. Start your R Markdown document with a two to three sentence "Overview" or "Introduction" description of what the article that you chose is about, and include a link to the article.
4. Finish with a "Conclusions" or "Findings and Recommendations" text block that includes what you might do to extend, verify, or update the work from the selected article.
5. Each of your text blocks should minimally include at least one header and additional non-header text.
6. You're of course welcome, but not required, to include additional information, such as exploratory data analysis graphics (which we will cover later in the course).
7. Place your solution into a single R Markdown (.Rmd) file and publish your solution out to rpubs.com.
8. Post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link.
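For step 1, here is a hedged sketch using the mushroom example from the prompt; the UCI file URL, the column positions, and the letter codes are assumptions to verify against the UCI documentation, and should be adapted to the dataset you actually choose.

```r
library(tidyverse)

# Read the raw UCI mushroom data (no header row in the source file)
mushrooms_raw <- read_csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data",
  col_names = FALSE
)

# Keep a subset of columns, give them meaningful names, and expand the codes
# (column positions X1/X2/X6 and the e/p codes are assumptions to verify)
mushrooms <- mushrooms_raw %>%
  select(class = X1, cap_shape = X2, odor = X6) %>%
  mutate(class = recode(class, "e" = "edible", "p" = "poisonous"))

glimpse(mushrooms)
```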
Lesson 1: Intro to R and RStudio
The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To clarify which is which: R is the name of the programming language itself and RStudio is a convenient interface.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.