Recently Published
Project-Data Science Presentations
Complete 5-Slide R Presentation for SwiftKey Capstone
Project-Data Science Capstone
This report presents an exploratory analysis of the three text data sets provided for the SwiftKey Capstone Project: blogs, news, and Twitter. The goal is to understand the basic characteristics of these data sets before building a next-word prediction algorithm. Key findings include:
The Twitter data set has the most lines (over 2 million) but the smallest file size
The blogs data set contains the longest individual lines (over 40,000 characters)
The word "love" appears about four times as often as "hate" in the Twitter data
All three data sets show similar patterns in word frequency distributions
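
The findings above rest on a few simple per-file statistics: line counts, the longest line, and relative word frequencies. As a minimal sketch of how such numbers could be computed, the Python snippet below works on a list of text lines. It is not the report's own code (the report was written in R). The function names `summarize` and `word_ratio`, the lowercase tokenization rule, and the `sample` lines are all assumptions made for illustration; `sample` is invented data, not drawn from the SwiftKey files.

```python
import re
from collections import Counter

def summarize(lines):
    """Return basic statistics for a list of text lines."""
    return {
        "lines": len(lines),                                  # number of lines
        "max_chars": max((len(s) for s in lines), default=0), # longest line, in characters
        "words": sum(len(s.split()) for s in lines),          # whitespace-separated tokens
    }

def word_ratio(lines, word_a, word_b):
    """Ratio of occurrences of word_a to word_b (case-insensitive, whole words).

    Assumption: a "word" is a run of letters or apostrophes after lowercasing.
    """
    counts = Counter()
    for s in lines:
        counts.update(re.findall(r"[a-z']+", s.lower()))
    if counts[word_b] == 0:
        return float("nan")  # ratio is undefined when word_b never occurs
    return counts[word_a] / counts[word_b]

# Hypothetical toy input, for illustration only.
sample = [
    "I love this",
    "love love",
    "I hate Mondays",
    "a much longer line of text here",
]

stats = summarize(sample)
ratio = word_ratio(sample, "love", "hate")
```

With real data, each file would be read into a list of lines first (for example with `open(path, encoding="utf-8")`) and the same two functions applied per data set, which makes the three sources directly comparable.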