Recently Published
Multiple Linear Regression
This analysis examines factors influencing professor evaluation scores at the University of Texas at Austin, with particular focus on whether physical attractiveness predicts higher ratings. Using multiple linear regression on 463 course evaluations, I found that while beauty rating is a statistically significant predictor (p < 0.001), it explains only 3.5% of the variance in scores—suggesting minimal practical impact. The final model identifies seven significant predictors: gender, language of degree institution, age, class evaluation participation rate, course credits, beauty rating, and picture color. Key limitations include the single-institution sample and non-independent observations due to professors teaching multiple courses. This project demonstrates skills in exploratory data analysis, regression diagnostics, model selection via backward elimination, and communicating statistical findings to diverse audiences.
The broader goal of the analysis is to learn what goes into a positive professor evaluation.
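As a flavor of the modeling step, here is a minimal R sketch of fitting a full linear model and reducing it by backward elimination; the data frame `evals` and the column names are placeholders that mirror the write-up, not the project's actual code.

```r
# Hypothetical data frame `evals` with an evaluation `score` column and
# candidate predictors named to mirror the summary above.
full_model <- lm(score ~ gender + language + age + cls_perc_eval +
                   credits + bty_avg + pic_color, data = evals)

# Backward elimination: step() drops one term at a time while AIC improves.
reduced_model <- step(full_model, direction = "backward", trace = FALSE)
summary(reduced_model)   # final set of significant predictors
```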
Spam|Ham Email Document Classifier
Binary email classifier using **Random Forest** achieves **97.2% accuracy** on SpamAssassin corpus (1,000 emails).
Key Results:
- Precision: 96.8% | Recall: 97.5% | AUC: 0.989
- Only 5 errors out of 200 test emails
- 42-point improvement over 4-class inbox classifier (Part 1)
Why It Works:
- Spam and ham have distinct vocabularies (<5% overlap)
- Clear indicators: spam uses "free"/"winner"/"urgent", ham uses "meeting"/"project"/"team"
- Binary classification simpler than multi-class problems
Part 2 of a two-part email classification study. [Link to Part 1](https://rpubs.com/Candace63/GmailClassifier)
This document includes complete methodology, interactive testing, and reproducible code.
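A minimal sketch of the training step, assuming a document-term-frequency data frame `dtm_df` with a factor column `label` ("spam"/"ham"); the published pipeline and tuning may differ.

```r
library(randomForest)
library(caret)

set.seed(123)
idx   <- createDataPartition(dtm_df$label, p = 0.8, list = FALSE)
train <- dtm_df[idx, ]
test  <- dtm_df[-idx, ]

rf_fit <- randomForest(label ~ ., data = train, ntree = 500)
preds  <- predict(rf_fit, newdata = test)
confusionMatrix(preds, test$label)   # accuracy, precision, recall on the hold-out set
```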
Gmail Email Classifier
Email Classification with Naive Bayes
Master's Degree Data Acquisition Coursework
Project Summary
Built an email classifier using Naive Bayes and TF-IDF to automatically categorize emails into multiple categories. Includes interactive Shiny web application for real-time predictions.
Dataset & Methods
- Data: 3,200 personal Gmail messages (4 categories: Inbox, Promotions, Social, Updates)
- Features: TF-IDF with 500 top terms
- Model: Naive Bayes classifier (80/20 train/test split)
Results
- Overall Accuracy: 55%
- Best Performance: Social emails (87%) - distinctive words like "liked", "commented", "tagged"
- Lowest Performance: Inbox, Promotions, Updates (24-41%) - similar transactional vocabulary
Key Finding
Category distinctiveness drives performance. Social media emails have unique vocabulary, while promotional and transactional emails share similar language patterns, making them harder to distinguish.
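A minimal sketch of the classifier itself, assuming a data frame `tfidf_df` of TF-IDF features (the 500 top terms) plus a factor column `category` with the four labels; preprocessing and the Shiny front end are omitted.

```r
library(e1071)

set.seed(42)
idx   <- sample(seq_len(nrow(tfidf_df)), 0.8 * nrow(tfidf_df))
train <- tfidf_df[idx, ]
test  <- tfidf_df[-idx, ]

nb_fit <- naiveBayes(category ~ ., data = train)
preds  <- predict(nb_fit, newdata = test)

mean(preds == test$category)                      # overall accuracy
table(predicted = preds, actual = test$category)  # per-category confusion matrix
```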
Interactive Application
Built a Shiny web app with:
- Real-time email classification
- Text input and file upload
- Color-coded predictions
Regression Tree Modeling
Regression Tree Analysis: Predictive Modeling for Risk Assessment
Project Overview:
Developed an interpretable machine learning model using regression tree methodology to identify key risk factors and predict outcomes in a complex multi-variable environment. This project demonstrates expertise in tree-based algorithms, feature selection, and actionable business intelligence extraction.
Technical Approach:
- Implemented recursive binary splitting algorithm to build decision tree models
- Applied cross-validation techniques to optimize tree depth and prevent overfitting
- Conducted comprehensive feature importance analysis across 20+ predictor variables
- Validated model performance using holdout testing methodology
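A minimal sketch of this workflow using `rpart`, with a placeholder data frame `risk_df` and outcome column `outcome`; the project's actual variables and tuning are not shown here.

```r
library(rpart)
library(rpart.plot)

set.seed(1)
tree_fit <- rpart(outcome ~ ., data = risk_df,
                  method  = "anova",                               # regression tree
                  control = rpart.control(cp = 0.001, xval = 10))  # 10-fold cross-validation

printcp(tree_fit)                                   # cross-validated error by complexity
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree_fit, cp = best_cp)            # prune to control depth and overfitting

rpart.plot(pruned)   # transparent decision rules and thresholds
```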
Key Findings:
- Identified hierarchical relationships between predictor variables, revealing that certain factors serve as gateway conditions for other variables to become influential
- Discovered actionable thresholds that provide specific operational targets for process improvement
- Achieved balanced contribution from multiple variable categories, demonstrating that no single factor dominates outcomes
- Established clear decision rules that enable immediate implementation of targeted interventions
Business Impact:
The regression tree model revealed that 60% of cases operate below optimal thresholds, representing significant improvement opportunities. Unlike black-box ensemble methods, this approach provides transparent decision logic that stakeholders can directly implement. The model delivers both predictive accuracy and interpretability, enabling data-driven decision making with clear rationale.
Technical Skills Demonstrated:
- Tree-based machine learning algorithms
- Feature engineering and selection
- Model validation and performance optimization
- Statistical analysis and interpretation
- Business intelligence translation
This project showcases ability to balance technical sophistication with practical business application, delivering insights that drive measurable operational improvements.
Regression Trees and Rule-Based Modeling
This analysis explores tree-based regression methods and variable importance metrics across four comprehensive exercises using R. The project demonstrates advanced machine learning techniques for handling correlated features, optimizing model complexity through bias-variance tradeoff analysis, and deploying interpretable models for production manufacturing optimization.
Using the Friedman simulation dataset and real-world chemical manufacturing data, the analysis compares traditional and conditional variable importance methods, evaluates hyperparameter effects on model generalization, and showcases the strategic value of combining interpretable single trees with high-performance ensemble methods for business decision-making.
Key Accomplishments:
• Variable Importance Methods: Compared traditional Random Forest importance against conditional importance (Strobl et al., 2007), demonstrating that conditional methods correctly penalize redundant correlated features while traditional methods artificially split importance—critical for feature selection in production ML systems
• Model Comparison: Evaluated 5 tree-based methods (Single Tree, Bagged Trees, Random Forest, GBM, Cubist) on manufacturing yield prediction, achieving optimal Test R² = 0.62 with Random Forest while identifying 10x variation in importance scores across methods
• Bias-Variance Optimization: Simulated gradient boosting across 6 interaction depths (1-10), confirming optimal depth of 4-6 balances complexity and generalization—shallow trees underfit, deep trees overfit
• Hyperparameter Analysis: Analyzed GBM exploration-exploitation tradeoff, demonstrating that conservative parameters (learning rate = 0.1, bagging fraction = 0.1) produce distributed importance and better generalization versus aggressive settings that concentrate on 2-3 features
• Production Interpretability: Deployed interpretable regression tree revealing ManufacturingProcess32 as critical control parameter (threshold: 159.5, +2.5 yield impact), identifying that 60% of production operates sub-optimally—providing actionable operational targets that ensemble models cannot offer
Technical Stack: R (caret, randomForest, gbm, Cubist, party), 10-fold cross-validation, conditional inference forests, MARS, ensemble methods
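To illustrate the traditional-versus-conditional comparison, here is a hedged sketch with a placeholder data frame `dat` and response `y`; the assignment's exact settings are not reproduced.

```r
library(randomForest)
library(party)

set.seed(100)
rf <- randomForest(y ~ ., data = dat, importance = TRUE)
importance(rf, type = 1)            # traditional permutation importance

cf <- cforest(y ~ ., data = dat,
              controls = cforest_unbiased(ntree = 500, mtry = 3))
varimp(cf, conditional = TRUE)      # conditional importance (Strobl et al., 2007),
                                    # which down-weights redundant correlated predictors
```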
Nonlinear Models
Nonlinear Regression Analysis: Friedman1 Benchmark Dataset
This analysis explores nonlinear regression modeling using the Friedman1 benchmark dataset,
a simulated dataset designed to evaluate machine learning algorithms on complex nonlinear
relationships. The true data-generating function is y = 10·sin(π·X1·X2) + 20·(X3-0.5)² +
10·X4 + 5·X5 + ε, where only five of ten predictors (X1-X5) are informative, while the
remaining five (X6-X10) are pure noise. Across Exercises 7.2 and 7.5, we trained and
evaluated multiple regression models including Linear Regression, GLMNET, K-Nearest Neighbors,
Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM), and Random
Forest. MARS emerged as the optimal model with a test set RMSE of 1.159 and R² of 0.946,
representing a 56.6% improvement over the best linear model (RMSE = 2.670). Remarkably,
MARS achieved perfect feature selection accuracy (100%), correctly identifying all five
informative predictors while completely excluding all noise variables—a capability that
distinguishes it from linear approaches which assigned non-zero importance to spurious
predictors. The analysis demonstrates that MARS's adaptive basis functions not only capture
complex nonlinear patterns including multiplicative interactions (X1·X2) and quadratic
relationships (X3²) but also perform automatic variable selection, making it particularly
valuable for high-dimensional datasets where distinguishing signal from noise is critical.
These findings validate MARS as a powerful tool for nonlinear regression that combines
predictive accuracy with interpretability through its piecewise linear structure and
transparent feature selection mechanism.
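The setup is easy to reproduce in outline: `mlbench` simulates Friedman1 data and `earth` fits the MARS model. The sketch below uses illustrative sample sizes and tuning, not the exercise's exact configuration.

```r
library(mlbench)
library(earth)

set.seed(200)
sim     <- mlbench.friedman1(200, sd = 1)   # X1-X5 informative, X6-X10 pure noise
train_x <- data.frame(sim$x)
train_y <- sim$y

mars_fit <- earth(train_x, train_y, degree = 2)  # degree 2 allows the X1*X2 interaction
summary(mars_fit)
evimp(mars_fit)   # variable importance: the noise predictors should not appear
```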
Extending Code using TidyVerse Functions
Netflix Dataset Analysis - TidyVerse Extension
Graduate-level data analysis project extending exploratory analysis of Netflix Movies and TV Shows dataset using R TidyVerse.
Project Overview
Extended a Netflix dataset analysis by adding comprehensive data quality assessment, advanced data transformations, and five new analytical dimensions. Demonstrates proficiency in R programming, statistical analysis, and data visualization at the graduate level.
Dataset: Netflix Movies & TV Shows (Kaggle) | 8,807 titles × 12 features
Predictive Modeling for High-Dimensional Data: Comparing Regularization Techniques
Introduction
This analysis explores regression techniques for high-dimensional data across three
case studies: near-infrared spectroscopy analysis, pharmaceutical compound
permeability prediction, and chemical manufacturing process optimization.
The assignment compares linear methods (Principal Component Regression, Partial
Least Squares, Ridge Regression, Lasso, and Elastic Net) with nonlinear approaches
(K-Nearest Neighbors and Support Vector Machines) to identify optimal modeling
strategies for different data structures characterized by multicollinearity and
high predictor-to-sample ratios.
Key Questions Addressed:
1. How do dimension reduction methods (PCR, PLS) compare to regularization (Ridge, Lasso)?
2. Can predictive models reduce expensive laboratory experimentation?
3. Which process variables drive manufacturing yield?
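As a small illustration of the regularization side of the comparison, the sketch below fits ridge, lasso, and elastic net with `glmnet`, assuming a predictor matrix `X` and response `y` (for example, the permeability data); the full caret-based workflow is in the report.

```r
library(glmnet)

set.seed(10)
ridge <- cv.glmnet(X, y, alpha = 0)     # ridge: shrinks all coefficients
lasso <- cv.glmnet(X, y, alpha = 1)     # lasso: drives some coefficients to zero
enet  <- cv.glmnet(X, y, alpha = 0.5)   # elastic net: compromise between the two

c(ridge = min(ridge$cvm),               # compare cross-validated error
  lasso = min(lasso$cvm),
  enet  = min(enet$cvm))
```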
Scenario Design
This project reverse engineers Etsy's recommendation system through scenario design analysis and hands-on platform observation. I examine how Etsy balances buyer personalization with seller visibility in a marketplace built on uniqueness and craftsmanship. Through direct interaction as a new user, I documented personalization emerging within 10 minutes and identified a likely 80/20 split favoring content-based filtering (aesthetic matching, visual similarity) over collaborative filtering (popularity signals). Drawing on Amazon and NYT's documented approaches, I provide strategic recommendations for improving user control, seller fairness, and discovery.
This analysis demonstrates skills in algorithm analysis, multi-stakeholder thinking, and ethical ML system design relevant to e-commerce and marketplace platforms.
Logistic Regression
Logistic Regression Analysis
This analysis examined crime rates across 466 Boston neighborhoods using logistic regression to predict whether areas exceed the median crime rate. The dataset contained 12 predictor variables including residential zoning (zn), pollution levels (nox), housing characteristics (rm, age), accessibility metrics (dis, rad), and socioeconomic indicators (lstat, medv), with no missing values. Data preparation involved log-transforming right-skewed variables (nox, lstat) and addressing multicollinearity by removing highly correlated predictors—specifically dropping tax (correlated 0.91 with rad), indus (correlated 0.76 with nox), and medv (correlated -0.74 with lstat)—reducing all VIF values below 5.
Three models were developed: Model 1 used all prepared variables, Model 2 applied stepwise selection for parsimony, and Model 3 incorporated interaction terms (rm × lstat) and polynomial features (rm²) to capture non-linear relationships. Model 2 emerged as the optimal choice, balancing predictive accuracy (88.6% accuracy, 0.874 precision, 0.865 specificity) with model simplicity (lowest AIC=232.6, BIC=269.9), retaining eight significant predictors including nox_log, rad, dis, and rm while excluding the theoretically problematic lstat_log variable that showed a counter-intuitive negative coefficient in Model 1. Despite Model 1's marginally better performance metrics, an ANOVA test revealed no significant improvement from the additional variable (p=0.63), confirming Model 2 as the most parsimonious and interpretable model for predicting high-crime neighborhoods.
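A condensed sketch of the modeling steps described above, assuming a data frame `crime_df` with the binary `target` and the prepared (log-transformed) predictors; it is illustrative rather than the report's exact script.

```r
library(car)    # vif()
library(MASS)   # stepAIC()

model1 <- glm(target ~ ., data = crime_df, family = binomial)
vif(model1)                                    # confirm multicollinearity is controlled (< 5)

model2 <- stepAIC(model1, direction = "both", trace = FALSE)   # parsimonious model
summary(model2)

probs <- predict(model2, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
mean(preds == crime_df$target)                 # classification accuracy
```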
Sentiment Analysis
Part 1: Traditional Tidy Text Approach
Uses tidytext package with lexicon-based methods (AFINN, Bing, NRC)
Word-by-word sentiment scoring
Effective for literary text analysis
Part 2: Advanced Sentiment Analysis (Pride and Prejudice)
Employs multi-dimensional emotion analysis using the NRC lexicon to track eight distinct emotions (joy, anger, fear, trust, anticipation, surprise, sadness, disgust) across the narrative arc
Implements context-aware sentiment scoring with the sentimentr package, which accounts for valence shifters like negations ("not happy") and amplifiers ("very good") for more nuanced analysis
Includes character-specific sentiment tracking to analyze how emotional tone shifts when major characters (Elizabeth, Darcy, Wickham) are mentioned, revealing character development patterns
Compares three distinct lexicons (AFINN, Bing, NRC) at both chapter and sentence levels to demonstrate methodological rigor and validate findings across different sentiment lexicons
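A minimal sketch of the two approaches, assuming the novel's text is taken from the `janeaustenr` package; chapter segmentation and character tracking are omitted.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)
library(sentimentr)

pp <- austen_books() %>% filter(book == "Pride & Prejudice")

# Part 1 style: lexicon-based, word-by-word scoring
word_sentiment <- pp %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)

# Part 2 style: context-aware scoring that handles negations and amplifiers
sentence_scores <- sentiment_by(get_sentences(pp$text[1:200]))
head(sentence_scores)
```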
Nobel Prize API Data Analysis
Nobel Prize API Data Extraction Project
Project Overview
This project involved extracting and analyzing data from the Nobel Prize API to explore patterns and insights about Nobel laureates and their achievements.
Key Components
Data Extraction
- Connected to the Nobel Prize API to retrieve comprehensive data about Nobel Prize winners
- Extracted information including laureate details, prize categories, award years, and affiliations
- Processed JSON data and transformed it into a tidy data format suitable for analysis
Data Processing
- Cleaned and structured the API response data using tidyverse tools
- Created organized dataframes with key variables such as:
- Laureate names and biographical information
- Prize categories (Physics, Chemistry, Medicine, Literature, Peace, Economics)
- Award years and prize motivations
- Institutional affiliations and countries
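A hedged sketch of the extraction step; the endpoint and query parameters below refer to the public Nobel Prize API, but the exact API version and fields used in the project are assumptions.

```r
library(httr)
library(jsonlite)
library(dplyr)

resp   <- GET("https://api.nobelprize.org/2.1/laureates",
              query = list(limit = 100))
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                   flatten = TRUE)

laureates <- as_tibble(parsed$laureates)   # one row per laureate
glimpse(laureates)
```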
Analysis Focus Areas
Potential areas of exploration include:
- Distribution of prizes across categories and time periods
- Gender representation among laureates
- Geographic patterns in prize winners
- Age trends of laureates at time of award
- Institutional affiliations and their prize frequencies
Technical Skills Demonstrated
- API integration and data retrieval
- JSON data parsing and transformation
- Data wrangling with dplyr and tidyr
- Exploratory data analysis
- Data visualization with ggplot2
This project showcases my ability to work with external APIs, handle real-world data structures, and apply tidy data principles to extract meaningful insights from public datasets.
Linear Regression and Its Cousins
This project analyzes high-dimensional regression techniques using two datasets: the Tecator meat spectroscopy data and a pharmaceutical permeability dataset. For the Tecator data, five regression methods (PCR, PLS, Ridge, Lasso, and Elastic Net) were compared to predict moisture and fat content from 100 spectroscopy measurements, with PLS emerging as the best performer using 18 components. Principal Component Analysis revealed that the spectroscopy data's effective dimension is much lower than the original 100 variables, with 95% of variance captured by just a few components. The permeability analysis used molecular fingerprints to predict drug permeability, comparing seven methods including PLS, PCR, regularization techniques, KNN, and SVM after filtering near-zero variance predictors. The optimized PLS model with cross-validation demonstrated strong predictive performance, though the results suggest it should be used for screening rather than completely replacing laboratory experiments.
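A minimal sketch of the PLS piece, assuming the copy of the Tecator data shipped with caret (`absorp` spectra and `endpoints` with moisture, fat, and protein); component counts and resampling differ from the full analysis.

```r
library(caret)   # data(tecator): absorp (100 spectral channels), endpoints
library(pls)

data(tecator)
df <- data.frame(fat = endpoints[, 2], absorp = I(absorp))   # column 2 = fat content

set.seed(1)
pls_fit <- plsr(fat ~ absorp, data = df, ncomp = 25, validation = "CV")
validationplot(pls_fit, val.type = "RMSEP")   # choose the number of components
```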
Tidyverse_Vignette
Using one or more TidyVerse packages and a dataset from fivethirtyeight.com, Kaggle, or another source, this programming "vignette" demonstrates how to use one or more capabilities of the selected TidyVerse package with the chosen dataset.
DATA 624 High Level Predictive Analysis
This portfolio demonstrates end-to-end predictive analytics capabilities across three business forecasting challenges. Part A implements ATM cash withdrawal forecasting for financial services, processing 1,500+ transactions across 4 machines to generate month-ahead predictions using ensemble methods (Prophet, SARIMA, ETS), achieving 7.2% MAPE through sophisticated feature engineering including payday effects, day-of-week patterns, and rolling statistics, delivering $70K annual savings per ATM through optimized cash management. Part C tackles infrastructure monitoring by processing irregularly-sampled water flow sensor data from two pipes with misaligned timestamps, performing time-base sequencing to hourly intervals, conducting rigorous stationarity testing (ADF, KPSS, Phillips-Perron), and generating 7-day forecasts using Prophet with transparent uncertainty quantification. The complete portfolio showcases production-ready modeling, systematic data quality assessment, business-focused feature engineering, and professional communication of technical results to non-technical stakeholders, with all deliverables published via RPubs and provided in Excel format for operational deployment.
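A hedged sketch of the Prophet component of Part A, assuming a hypothetical data frame `daily` of per-ATM daily withdrawals; the payday/day-of-week feature engineering and the SARIMA/ETS ensemble members are omitted.

```r
library(prophet)

atm_df <- data.frame(ds = daily$date, y = daily$cash_withdrawn)  # placeholder columns

m      <- prophet(atm_df, weekly.seasonality = TRUE, yearly.seasonality = TRUE)
future <- make_future_dataframe(m, periods = 31)   # month-ahead horizon
fcst   <- predict(m, future)
plot(m, fcst)   # forecast with uncertainty intervals
```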
Working with Web APIs
The New York Times provides a rich set of APIs for accessing their content. In this assignment, I will use the Top Stories API to retrieve the most important or currently featured articles from the Science section. I will construct an interface in R to read JSON data from the API and transform it into an R data frame for analysis.
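A minimal sketch of that interface, assuming an API key stored in the `NYT_KEY` environment variable; the selected fields are illustrative.

```r
library(httr)
library(jsonlite)
library(dplyr)

url  <- "https://api.nytimes.com/svc/topstories/v2/science.json"
resp <- GET(url, query = list(`api-key` = Sys.getenv("NYT_KEY")))

stories <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$results %>%
  as_tibble() %>%
  select(title, byline, published_date, abstract)
glimpse(stories)
```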
ARIMA
ARIMA stands for AutoRegressive Integrated Moving Average, a class of statistical models used for analyzing and forecasting time series data. The model has three key components represented by the notation ARIMA(p,d,q): the autoregressive term (p) uses past values to predict future values, the integrated term (d) represents the number of times the data needs to be differenced to achieve stationarity, and the moving average term (q) uses past forecast errors to improve predictions.
ARIMA models are particularly effective for non-stationary time series data that exhibit trends or patterns over time. The autoregressive component captures the relationship between an observation and a lagged observation, while the moving average component models the dependency between an observation and residual errors from past predictions. Differencing removes trends and seasonal patterns, transforming the data into a stationary series where statistical properties remain constant over time.
Model selection typically involves examining ACF and PACF plots, comparing information criteria like AIC or BIC, and validating that residuals resemble white noise. ARIMA models are widely used in economics, finance, and forecasting applications because they balance flexibility with interpretability and can capture complex temporal dependencies in data.
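A minimal example of this workflow with the `forecast` package, using a placeholder time series `ts_data`:

```r
library(forecast)

fit <- auto.arima(ts_data)      # searches over (p, d, q) by AICc
summary(fit)                    # selected orders and coefficients
checkresiduals(fit)             # residuals should resemble white noise
plot(forecast(fit, h = 12))     # 12-step-ahead forecast with intervals
```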
Analysis of Search Trends
This analysis examines Google search trends for three essential data science skills—Python, SQL, and Tableau—over a five-year period from October 2020 to October 2025. By analyzing search interest patterns, we can identify which skills are gaining traction, which are declining, and what this means for data professionals and employers.
Key Questions We’ll Answer:
Which skill dominates the data science landscape?
How have these skills evolved over time?
What does the future hold for Python demand?
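A hedged sketch of pulling comparable data with `gtrendsR`; the published analysis may instead use CSV exports from Google Trends, and the cleaning step below is an assumption.

```r
library(gtrendsR)
library(dplyr)
library(ggplot2)

trends <- gtrends(keyword = c("Python", "SQL", "Tableau"),
                  time = "today+5-y")          # last five years of search interest

trends$interest_over_time %>%
  mutate(hits = as.numeric(gsub("<1", "0", hits))) %>%   # "<1" values arrive as text
  ggplot(aes(date, hits, colour = keyword)) +
  geom_line() +
  labs(title = "Google search interest", y = "Relative interest (0-100)")
```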
DATA607Project2_Employment
The dataset for this project can be classified as untidy because the columns have multiple variables and the rows have multiple observations. In this project I will tidy the dataset by transforming the data from a wide to a long format that is easier for a data analyst or data scientist to work with.
Insurance Data Analysis
This project demonstrates the transformation of the Insurance dataset from wide format to long (tidy) format using R's tidyverse package. The Insurance dataset contains health insurance information for 1,338 individuals, including demographic characteristics and healthcare charges.
The primary objectives of this data transformation project are to:
- Restructure the data from wide format (where multiple measurements exist as separate columns) to long format (where each measurement becomes its own row)
- Apply tidy data principles to make the dataset more suitable for statistical analysis and visualization
- Demonstrate best practices in data wrangling and preparation for data science workflows
- Fulfill DATA 624 course requirements by showcasing proficiency in data transformation techniques essential for masters-level data analysis
Tidying Wide Datasets to produce Long Datasets
Tidying wide datasets involves transforming data from a format where multiple measurements are spread across separate columns into a long format where each row represents a single observation. In wide format, each subject or entity occupies one row with many columns representing different variables or time periods, which can make filtering, grouping, and visualization challenging.
The transformation process uses functions like `pivot_longer()` in R or `melt()` in Python to collapse multiple measurement columns into two key columns: one identifying the type of measurement and another containing the actual value. This restructuring follows tidy data principles where each variable forms a column, each observation forms a row, and each type of observational unit forms a table, making the data more suitable for statistical analysis and machine learning algorithms.
The result is a dataset with more rows but fewer columns that is easier to filter by measurement type, create visualizations with, and analyze using modern data science tools.
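A small worked example of the reshape, using a made-up wide table of quarterly measurements:

```r
library(tidyr)

wide <- data.frame(
  subject = c("A", "B"),
  q1 = c(10, 12),
  q2 = c(11, 15),
  q3 = c(13, 14)
)

long <- pivot_longer(wide,
                     cols      = q1:q3,
                     names_to  = "quarter",
                     values_to = "value")
long   # one row per subject-quarter observation
```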
Loading HTML, XML, and JSON Files into R
Project Summary:
This project demonstrates my proficiency in data acquisition and manipulation by reading and parsing identical datasets stored in three different file formats: XML, JSON, and HTML. Using R and specialized packages (xml2, jsonlite, and rvest), I successfully extracted structured book data from each format, transformed it into clean data frames, and validated consistency across formats. The project showcases essential data engineering skills including web-based data retrieval via GitHub URLs, format-specific parsing techniques, reproducible research through R Markdown, and professional documentation. This work highlights my ability to handle diverse data sources—a critical skill in modern data science where information comes from APIs (JSON), enterprise systems (XML), and web scraping (HTML). The complete analysis is published on RPubs with source files hosted on GitHub, demonstrating my commitment to reproducible research and version control best practices.
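A condensed sketch of the three parsers; the file names stand in for the GitHub-hosted copies and the field names are illustrative.

```r
library(xml2)
library(jsonlite)
library(rvest)
library(dplyr)

# JSON: parses directly into a data frame
books_json <- fromJSON("books.json") %>% as_tibble()

# XML: pull repeated <book> nodes, then extract child fields
xml_doc   <- read_xml("books.xml")
books_xml <- tibble(
  title  = xml_text(xml_find_all(xml_doc, "//book/title")),
  author = xml_text(xml_find_all(xml_doc, "//book/author"))
)

# HTML: scrape the table from the rendered page
books_html <- read_html("books.html") %>%
  html_element("table") %>%
  html_table()
```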
Project2- Transform Data
Transforming data from wide to long format
Exponential Smoothing
Exponential Smoothing Analysis: Time Series Forecasting Study
This comprehensive analysis explores exponential smoothing methods for forecasting time series data across multiple datasets including Australian livestock, Botswana exports, Chinese GDP, Australian gas production, and retail sales. The study systematically compares simple exponential smoothing (ETS(A,N,N)) with trend-based models (ETS(A,A,N)) and damped trend variants (ETS(A,Ad,N)), evaluating their performance through metrics like RMSE, AIC, and BIC while examining when multiplicative seasonality outperforms additive approaches. Key findings demonstrate that multiplicative seasonality is essential for data with proportionally growing variance, damped trends provide more conservative long-term forecasts though not always better statistical fit, and STL decomposition with Box-Cox transformation can improve forecast accuracy for complex seasonal patterns. The analysis includes detailed residual diagnostics, prediction interval calculations, and test set validation to determine which forecasting methods best balance accuracy and practical applicability for different types of time series data.
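A minimal sketch of the ETS comparison in the fable framework, where `series` stands in for any of the tsibbles analyzed and `value` for its measured variable:

```r
library(fpp3)

fits <- series %>%
  model(
    ses    = ETS(value ~ error("A") + trend("N") + season("N")),   # ETS(A,N,N)
    holt   = ETS(value ~ error("A") + trend("A") + season("N")),   # ETS(A,A,N)
    damped = ETS(value ~ error("A") + trend("Ad") + season("N"))   # ETS(A,Ad,N)
  )

glance(fits) %>% select(.model, AIC, BIC)               # compare information criteria
fits %>% forecast(h = "5 years") %>% autoplot(series)   # contrast the forecast paths
```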
Pre-Processing Data with Visualizations
Uses data visualizations to explore and pre-process data.
Chess Tournament Performance Analysis Using ELO Expected Scores
This analysis evaluates player performance in a chess tournament by comparing actual scores to ELO-based expected scores using the USCF standard formula: Expected Score = 1 / (1 + 10^((Opponent Rating - Player Rating)/400)). Using R to process tournament data for 63 players, we calculated each player's expected score against their specific opponents and identified the five biggest overperformers and underperformers. The results revealed dramatic performance variations, with ADITYA BAJAJ (MI) showing the most remarkable overperformance at +3.14 points above expected (actual: 6.0, expected: 2.86), while LOREN SCHWIEBERT (MI) had the largest underperformance at -2.51 points below expected (actual: 3.5, expected: 6.01). The analysis demonstrates how ELO-based expectations can quantify tournament performance relative to pre-tournament ratings, providing valuable insights for chess rating systems and player assessment in competitive tournaments.
Document
Chess ELO Expected Score Calculator
Formula source: Solon, Nate. "How Elo Ratings Actually Work." Zwischenzug, https://zwischenzug.substack.com/p/how-elo-ratings-actually-work
Expected Score = 1 / (1 + 10^((opponent_rating - player_rating)/400))
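A direct implementation of the formula, as a small helper rather than the full tournament script:

```r
elo_expected <- function(player_rating, opponent_rating) {
  1 / (1 + 10 ^ ((opponent_rating - player_rating) / 400))
}

elo_expected(1600, 1500)                      # ~0.64: higher-rated player is favored
sum(elo_expected(1600, c(1500, 1700, 1650)))  # expected score over three rounds
```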
Data Wrangling and Visualization with R
Flight Performance Analysis: Alaska Airlines vs AmWest Airlines
This comprehensive analysis examines flight delay performance for Alaska Airlines and AmWest Airlines across five major West Coast destinations (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle) using data transformation, statistical analysis, and visualization techniques in R. The study demonstrates how overall network performance metrics can mask significant city-by-city operational variations, revealing that while AmWest achieves superior overall performance with a 10.9% delay rate compared to Alaska's 13.3%, the "better" airline varies substantially by destination. Through data tidying with tidyr, statistical summaries with dplyr, and professional visualizations using ggplot2, the analysis illustrates the critical importance of route-specific performance evaluation for both passengers making travel decisions and airlines optimizing operational strategies. Key findings show that both airlines maintain excellent performance with delay rates below 15%, but city-by-city analysis reveals location-specific operational competencies that are obscured when relying solely on aggregate network statistics, highlighting the analytical value of granular data examination in transportation performance assessment.
Document
Baseball Data Exploration Project Summary
Data Preparation for a Baseball Dataset
This R Markdown document implements a comprehensive baseball data preparation pipeline that systematically cleans and enhances a dataset of 259 team observations with 16 original variables. The workflow begins by creating missing value indicator flags to preserve information about data patterns, then applies IQR-based outlier detection across all numeric variables. After dropping the highly incomplete TEAM_BATTING_HBP variable (92.7% missing), it imputes remaining missing values using median substitution for robustness against outliers. The feature engineering section creates meaningful baseball-specific metrics including offensive power ratios, base-running efficiency, pitching effectiveness (WHIP proxy), and disciplinary measures (walk-to-strikeout ratios). The pipeline applies log transformations to highly skewed variables, creates categorical performance tiers (High/Medium/Low offensive performance, Elite/Average/Poor pitching, and error rate buckets), and concludes with correlation analysis and data quality validation. This systematic approach transforms raw baseball statistics into a modeling-ready dataset with both original variables and engineered features that capture key aspects of team performance across batting, pitching, base-running, and defensive capabilities.
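A sketch of the missing-value flags, median imputation, and IQR outlier detection, assuming a numeric data frame `moneyball`; thresholds follow the standard 1.5 × IQR rule and the exact variable handling differs in the full pipeline.

```r
library(dplyr)

flag_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

prepped <- moneyball %>%
  mutate(across(where(is.numeric),
                list(miss = ~ as.integer(is.na(.x))))) %>%           # missing-value flags
  mutate(across(where(is.numeric) & !ends_with("_miss"),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))  # median imputation

outlier_counts <- sapply(select(moneyball, where(is.numeric)),
                         function(x) sum(flag_outliers(x), na.rm = TRUE))
```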
Forecasting
# Forecasting Australian Retail Time Series: A Comprehensive Analysis
## Dataset Overview
This analysis examines Australian retail turnover data from the `aus_retail` dataset, focusing on time series forecasting methodologies and residual diagnostics. The study encompasses multiple retail sectors and employs various forecasting techniques to evaluate predictive performance.
## Key Analytical Components
**Time Series Characteristics**: The dataset reveals diverse patterns across different retail categories, with seasonal variations, trending behaviors, and structural changes evident throughout the observation period from the 1980s through 2010s.
**Forecasting Methods Applied**:
- Seasonal Naive (SNAIVE) for capturing repetitive seasonal patterns
- Random Walk with Drift for trending data
- Naive methods for baseline comparisons
**Model Validation Framework**: Comprehensive residual analysis using three-panel diagnostic plots examining temporal patterns, autocorrelation functions (ACF), and distributional properties to assess white noise assumptions.
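A condensed sketch of the benchmark fits and residual check on one randomly chosen retail series, following the standard fpp3 workflow:

```r
library(fpp3)

set.seed(333)
series <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

fit <- series %>%
  model(
    snaive = SNAIVE(Turnover),
    drift  = RW(Turnover ~ drift()),
    naive  = NAIVE(Turnover)
  )

fit %>% select(snaive) %>% gg_tsresiduals()            # time plot, ACF, histogram
fit %>% forecast(h = "2 years") %>% autoplot(series)   # compare the three benchmarks
```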
## Notable Findings
**Residual Analysis**: The study revealed that simple forecasting methods often fail to capture complex underlying structures in retail data. Residuals frequently exhibited non-random patterns, autocorrelation, and heteroscedasticity, indicating opportunities for more sophisticated modeling approaches.
**Structural Changes**: Evidence of significant structural breaks and unusual events (particularly around 1995-1997) suggests external economic factors substantially impact retail performance beyond seasonal patterns.
**Training Data Sensitivity**: Forecast accuracy demonstrates notable sensitivity to training period selection, with implications for practical forecasting applications in retail planning.
## Technical Implementation
The analysis leverages the `fpp3` package ecosystem in R, employing modern tidyverse principles for data manipulation and the `tsibble` framework for time series operations. Cross-validation techniques separate training and test periods to ensure robust accuracy assessment.
This comprehensive approach provides valuable insights into Australian retail dynamics while demonstrating practical applications of time series forecasting methodologies in economic analysis.
Data Analysis: Converting .txt to .csv
Chess Tournament Data Analysis Project
This project converts unstructured chess tournament data from a fixed-width, pipe-delimited text file containing 64 players into a clean, analyzable CSV dataset through systematic parsing and data extraction. The raw data presents multiple challenges including player information spanning two lines, inconsistent spacing, embedded separators, and mixed alphanumeric content requiring careful line-by-line processing using R's string manipulation functions and regular expressions. Key extracted fields include player identification (name, state, USCF ID), rating information (pre- and post-tournament ratings), tournament performance (total points, round-by-round results, opponent numbers, colors played), and calculated metrics such as average opponent rating for strength-of-schedule analysis.
The technical approach employs R programming with core libraries including `stringr` for text processing, `dplyr` for data manipulation, and `readr` for file I/O operations, implementing robust data cleaning to remove separator lines and headers, type conversion from text to numeric values, cross-referencing to match opponent numbers, and quality validation to check for missing values and data consistency. Project deliverables include a structured CSV file with complete player and tournament data, a comprehensive HTML report documenting the conversion process using R Markdown for reproducible analysis, data quality assessments highlighting limitations, and summary statistics with interactive data tables using the DT package, ultimately transforming complex human-readable tournament records into machine-readable format suitable for statistical analysis, database storage, or integration with tournament management systems.
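A heavily hedged sketch of the parsing approach; the file name, field positions, and two-lines-per-player layout are assumptions about the cross-table format rather than the project's verified code.

```r
library(stringr)
library(readr)

raw <- readLines("tournamentinfo.txt")                     # placeholder file name
raw <- raw[!str_detect(raw, "^-+$")]                       # drop dashed separator lines

player_lines <- raw[str_detect(raw, "^\\s*\\d+\\s*\\|")]   # assumed first line per player
rating_lines <- raw[str_detect(raw, "R:\\s*\\d+")]         # assumed second line per player

fields  <- str_split_fixed(player_lines, "\\|", 11)
players <- data.frame(
  name       = str_trim(fields[, 2]),
  points     = as.numeric(str_trim(fields[, 3])),
  pre_rating = as.integer(str_match(rating_lines, "R:\\s*(\\d+)")[, 2])
)

write_csv(players, "tournament_clean.csv")
```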
Time Series Decomposition by Candace Grant
Advanced Time Series Analysis and Decomposition Techniques
This comprehensive time series analysis demonstrates advanced statistical modeling across multiple economic datasets, employing classical multiplicative decomposition, STL decomposition, and X-11 seasonal adjustment to isolate trend, seasonal, and irregular components. Particular emphasis is placed on Australian labour force dynamics (1978-1995), where a 38% secular increase is dominated by the trend component.
Key technical achievements include a systematic Box-Cox transformation analysis that determines optimal variance-stabilizing parameters across diverse datasets (Canadian gas production, λ = 0.577; Australian retail series, λ = 0.371; tobacco production, λ = 0.926; airline passengers, λ = 2.0; pedestrian traffic, λ = 0.273) using Guerrero-method optimization, with a clear decision framework for when a transformation is warranted. The analysis also uses X-11 irregular components for outlier detection, identifying structural breaks and anomalous periods in the retail data (including significant outliers during the early-2000s economic expansion), quantifying the effect of outliers on seasonal adjustment, and demonstrating detection capabilities superior to classical methods.
A consistent comparative framework evaluates transformation effectiveness through before/after visualizations and statistical validation, applying the same protocol across heterogeneous datasets to demonstrate a scalable approach suitable for production-level forecasting environments. Decomposing complex temporal signals into interpretable components directly supports strategic decision-making in economic forecasting, retail planning, and resource allocation, and enables evidence-based policy recommendations and risk assessment. The project showcases proficiency in R/fpp3, advanced time series modeling, statistical transformation theory, and macroeconomic data analysis with clear business applications.
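A minimal sketch of the transformation-plus-decomposition step on one of the series (Canadian gas production, available when fpp3 is loaded); the λ values quoted above come from the full analysis, not this snippet.

```r
library(fpp3)

lambda <- canadian_gas %>%
  features(Volume, features = guerrero) %>%   # Guerrero-method lambda
  pull(lambda_guerrero)

canadian_gas %>%
  model(STL(box_cox(Volume, lambda) ~ season(window = "periodic"))) %>%
  components() %>%
  autoplot()   # trend, seasonal, and remainder components
```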
Data 602 Wk2 | Intro to Data | Candace Grant
In this lab I explore a random sample of domestic flights that departed from the three major New York City airports in 2013. I generate simple graphical and numerical summaries of the data on these flights and explore delay times.
Banking Data Analysis
This data report presents an analysis of a marketing dataset from a Portuguese banking institution's direct marketing campaigns. The dataset focuses on phone-based marketing efforts aimed at promoting term deposits to clients.
The primary objective is to develop a predictive classification model that determines whether a client will subscribe to a term deposit (binary outcome: 'yes' or 'no'). The campaigns often required multiple contacts with the same client to achieve successful conversions, making this a complex customer behavior prediction problem.
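A hedged sketch of a baseline classifier for this task, assuming the UCI bank-marketing CSV (semicolon-delimited, with a yes/no column `y`); it is not the report's final model.

```r
library(dplyr)
library(readr)

bank <- read_csv2("bank-additional-full.csv") %>%   # assumed UCI file name
  mutate(y = factor(y, levels = c("no", "yes")))

fit <- glm(y ~ ., data = bank, family = binomial)   # logistic baseline

probs <- predict(fit, type = "response")
table(predicted = probs > 0.5, actual = bank$y)     # in-sample baseline; the report
                                                    # would use a proper train/test split
```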