Exploratory Data Analysis for Next-Word Prediction Using the SwiftKey Text Dataset
This report presents the exploratory analysis completed for the Coursera Data Science Capstone project, which involves building a next-word prediction model using the SwiftKey text corpus (Blogs, News, and Twitter data).
The analysis includes:
Loading and sampling the raw dataset
Text cleaning and preprocessing
Summary statistics such as line and word counts per source
Tokenization and creation of unigram, bigram, and trigram frequency tables
Visualizations of the most frequent words and n-grams
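The cleaning, tokenization, and frequency-table steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the report's own code, which is not shown here. The function names clean_text and ngram_counts, the cleaning rules, and the three sample lines are all hypothetical stand-ins.

```python
import re
from collections import Counter

def clean_text(line):
    """Lowercase the line, replace anything that is not a letter,
    apostrophe, or space with a space, and split into tokens.
    (These cleaning rules are an assumption, not the report's.)"""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)
    return line.split()

def ngram_counts(lines, n):
    """Count n-grams as tuples of tokens. N-grams are built within
    each line, so they never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = clean_text(line)
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Hypothetical stand-in for a sample drawn from the blogs, news,
# and Twitter files.
sample = [
    "Thanks for the follow!",
    "Thanks for the great post.",
    "I read the news today.",
]

unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

# Frequency tables, most frequent first
print(unigrams.most_common(3))
print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

Keeping n-grams as tuples in a Counter makes the three tables share one structure, so the same most_common call can feed any of the frequency plots.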
The report also outlines the planned predictive model, an n-gram model with a backoff strategy, and the development of an interactive Shiny application that will offer next-word suggestions to users.
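One simple way to read "n-grams with a backoff strategy" is sketched below in Python: look up the longest context seen in the data, and fall back to shorter contexts when it is missing. This is only an assumption about the general idea. It uses raw counts with no discounting, so it is not Katz backoff, and it is not necessarily the strategy the report plans to use. The names build_model and predict_next and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 preceding tokens)
    to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_n):      # k = context length
                if i - k < 0:
                    break               # not enough preceding tokens
                context = tuple(tokens[i - k:i])
                model[context][word] += 1
    return model

def predict_next(model, tokens, max_n=3, top=3):
    """Try the longest context first, then back off to shorter
    contexts until one has been seen in the training data."""
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        if context in model:
            return [w for w, _ in model[context].most_common(top)]
    return []

# Hypothetical, already-tokenized training sentences.
sentences = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "great", "post"],
    ["i", "read", "the", "news", "today"],
]
model = build_model(sentences)

print(predict_next(model, ["read", "the"]))   # trigram context was seen
print(predict_next(model, ["love", "the"]))   # backs off to the bigram context
print(predict_next(model, ["unseen"]))        # backs off to unigram counts
```

A Shiny application, or any other front end, would only need to call something like predict_next on the text typed so far and display the returned suggestions.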