
Candace63

Candace Grant

Recently Published

Logistic Regression
Logistic Regression Analysis

This analysis examined crime rates across 466 Boston neighborhoods using logistic regression to predict whether areas exceed the median crime rate. The dataset contained 12 predictor variables including residential zoning (zn), pollution levels (nox), housing characteristics (rm, age), accessibility metrics (dis, rad), and socioeconomic indicators (lstat, medv), with no missing values. Data preparation involved log-transforming right-skewed variables (nox, lstat) and addressing multicollinearity by removing highly correlated predictors—specifically dropping tax (correlated 0.91 with rad), indus (correlated 0.76 with nox), and medv (correlated -0.74 with lstat)—reducing all VIF values below 5. Three models were developed: Model 1 used all prepared variables, Model 2 applied stepwise selection for parsimony, and Model 3 incorporated interaction terms (rm × lstat) and polynomial features (rm²) to capture non-linear relationships. Model 2 emerged as the optimal choice, balancing predictive accuracy (88.6% accuracy, 0.874 precision, 0.865 specificity) with model simplicity (lowest AIC=232.6, BIC=269.9), retaining eight significant predictors including nox_log, rad, dis, and rm while excluding the theoretically problematic lstat_log variable that showed a counter-intuitive negative coefficient in Model 1. Despite Model 1's marginally better performance metrics, an ANOVA test revealed no significant improvement from the additional variable (p=0.63), confirming Model 2 as the most parsimonious and interpretable model for predicting high-crime neighborhoods.
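A minimal R sketch of the workflow described above, assuming a prepared data frame `crime_df` with a binary `target` column and Boston-style predictors (names are illustrative, not the project's actual code):

```r
library(dplyr)
library(car)    # vif()
library(MASS)   # stepAIC()

# Log-transform skewed predictors and drop the collinear ones
crime_prep <- crime_df %>%
  mutate(nox_log = log(nox), lstat_log = log(lstat)) %>%
  dplyr::select(-nox, -lstat, -tax, -indus, -medv)

# Model 1: all prepared predictors
m1 <- glm(target ~ ., data = crime_prep, family = binomial)
car::vif(m1)                                  # check that VIFs are below 5

# Model 2: stepwise selection for parsimony
m2 <- MASS::stepAIC(m1, direction = "both", trace = FALSE)
summary(m2)

# Nested-model comparison (likelihood ratio test)
anova(m2, m1, test = "Chisq")
```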
Sentiment Analysis
Part 1: Traditional Tidy Text Approach
- Uses the tidytext package with lexicon-based methods (AFINN, Bing, NRC)
- Word-by-word sentiment scoring
- Effective for literary text analysis

Part 2: Advanced Sentiment Analysis (Pride and Prejudice)
- Employs multi-dimensional emotion analysis using the NRC lexicon to track eight distinct emotions (joy, anger, fear, trust, anticipation, surprise, sadness, disgust) across the narrative arc
- Implements context-aware sentiment scoring with the sentimentr package, which accounts for valence shifters like negations (“not happy”) and amplifiers (“very good”) for more nuanced analysis
- Includes character-specific sentiment tracking to analyze how emotional tone shifts when major characters (Elizabeth, Darcy, Wickham) are mentioned, revealing character development patterns
- Compares three distinct lexicons (AFINN, Bing, NRC) at both chapter and sentence levels to demonstrate methodological rigor and validate findings across different sentiment lexicons
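A compact sketch of both approaches: word-level Bing scoring with tidytext and context-aware scoring with sentimentr (Pride and Prejudice is assumed to come from the janeaustenr package):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)
library(sentimentr)

pp <- austen_books() %>% filter(book == "Pride & Prejudice")

# Part 1 style: word-by-word lexicon scoring with the Bing lexicon
pp %>%
  mutate(chapter = cumsum(stringr::str_detect(text, "^Chapter "))) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(chapter, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

# Part 2 style: sentence-level scoring that handles negators and amplifiers
sentiment_by(get_sentences("Elizabeth was not happy, but the evening was very good."))
```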
Nobel Prize API Data Analysis
Nobel Prize API Data Extraction Project

Project Overview
This project involved extracting and analyzing data from the Nobel Prize API to explore patterns and insights about Nobel laureates and their achievements.

Key Components

Data Extraction
- Connected to the Nobel Prize API to retrieve comprehensive data about Nobel Prize winners
- Extracted information including laureate details, prize categories, award years, and affiliations
- Processed JSON data and transformed it into a tidy data format suitable for analysis

Data Processing
- Cleaned and structured the API response data using tidyverse tools
- Created organized dataframes with key variables such as:
  - Laureate names and biographical information
  - Prize categories (Physics, Chemistry, Medicine, Literature, Peace, Economics)
  - Award years and prize motivations
  - Institutional affiliations and countries

Analysis Focus Areas
Potential areas explored include:
- Distribution of prizes across categories and time periods
- Gender representation among laureates
- Geographic patterns in prize winners
- Age trends of laureates at time of award
- Institutional affiliations and their prize frequencies

Technical Skills Demonstrated
- API integration and data retrieval
- JSON data parsing and transformation
- Data wrangling with dplyr and tidyr
- Exploratory data analysis
- Data visualization with ggplot2

This project showcases my ability to work with external APIs, handle real-world data structures, and apply tidy data principles to extract meaningful insights from public datasets.
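A hedged sketch of the extraction step using jsonlite against the public Nobel Prize API (the v1 endpoint and the field names shown here are assumptions and may differ from the project's actual calls):

```r
library(jsonlite)
library(dplyr)
library(tidyr)

# Retrieve all laureate records as nested JSON and flatten what we can
raw <- fromJSON("http://api.nobelprize.org/v1/laureate.json", flatten = TRUE)

laureates <- as_tibble(raw$laureates)

# Each laureate may hold several prizes; unnest to one row per laureate-prize
prizes <- laureates %>%
  unnest(prizes, names_sep = "_") %>%
  select(id, firstname, surname, gender, prizes_year, prizes_category)

count(prizes, prizes_category, sort = TRUE)
```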
Linear Regression and Its Cousins
This project analyzes high-dimensional regression techniques using two datasets: the Tecator meat spectroscopy data and a pharmaceutical permeability dataset. For the Tecator data, five regression methods (PCR, PLS, Ridge, Lasso, and Elastic Net) were compared to predict moisture and fat content from 100 spectroscopy measurements, with PLS emerging as the best performer using 18 components. Principal Component Analysis revealed that the spectroscopy data's effective dimension is much lower than the original 100 variables, with 95% of variance captured by just a few components. The permeability analysis used molecular fingerprints to predict drug permeability, comparing seven methods including PLS, PCR, regularization techniques, KNN, and SVM after filtering near-zero variance predictors. The optimized PLS model with cross-validation demonstrated strong predictive performance, though the results suggest it should be used for screening rather than completely replacing laboratory experiments.
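A minimal caret sketch of how one of these comparisons could be set up, with the spectra in a matrix `absorp` and the response in `moisture` (both names are placeholders):

```r
library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)

# Partial least squares: cross-validation picks the number of components
pls_fit <- train(x = absorp, y = moisture,
                 method = "pls", tuneLength = 20,
                 preProcess = c("center", "scale"), trControl = ctrl)

# Elastic net: glmnet searches over mixing (alpha) and penalty (lambda)
enet_fit <- train(x = absorp, y = moisture,
                  method = "glmnet", tuneLength = 10,
                  preProcess = c("center", "scale"), trControl = ctrl)

# Compare resampled RMSE across the fitted models
summary(resamples(list(PLS = pls_fit, ElasticNet = enet_fit)))
```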
Tidyverse_Vignette
Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle or another source of your choosing, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.
DATA 624 High Level Predictive Analysis
This portfolio demonstrates end-to-end predictive analytics capabilities across three business forecasting challenges. Part A implements ATM cash withdrawal forecasting for financial services, processing 1,500+ transactions across 4 machines to generate month-ahead predictions using ensemble methods (Prophet, SARIMA, ETS), achieving 7.2% MAPE through sophisticated feature engineering including payday effects, day-of-week patterns, and rolling statistics—delivering $70K annual savings per ATM through optimized cash management. Part C tackles infrastructure monitoring by processing irregularly-sampled water flow sensor data from two pipes with misaligned timestamps, performing time-base sequencing to hourly intervals, conducting rigorous stationarity testing (ADF, KPSS, Phillips-Perron), and generating 7-day forecasts using Prophet with transparent uncertainty quantification. The complete portfolio showcases production-ready modeling, systematic data quality assessment, business-focused feature engineering, and professional communication of technical results to non-technical stakeholders, with all deliverables published via RPubs and provided in Excel format for operational deployment.
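A simplified fable sketch of the model-comparison step for one part of this workflow, assuming the withdrawals live in a daily tsibble `atm_tsibble` with a `Cash` column (Prophet would come from the separate fable.prophet package):

```r
library(fpp3)

# Fit benchmark and ensemble candidates per ATM
fits <- atm_tsibble |>
  model(ets    = ETS(Cash),
        arima  = ARIMA(Cash),
        snaive = SNAIVE(Cash))

# Month-ahead forecasts and training-set accuracy (MAPE, RMSE, ...)
fc <- fits |> forecast(h = "1 month")
fits |> accuracy() |> arrange(MAPE)

fc |> autoplot(atm_tsibble)
```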
Working with Web APIs
The New York Times provides a rich set of APIs for accessing their content. In this assignment, I will use the Top Stories API to retrieve the most important or currently featured articles from the Science section. I will construct an interface in R to read JSON data from the API and transform it into an R DataFrame for analysis.
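A sketch of the request-and-flatten step, assuming an API key stored in the `NYT_API_KEY` environment variable (the endpoint and field names follow the documented Top Stories pattern, but treat the details as illustrative):

```r
library(httr)
library(jsonlite)
library(dplyr)

resp <- GET("https://api.nytimes.com/svc/topstories/v2/science.json",
            query = list(`api-key` = Sys.getenv("NYT_API_KEY")))
stop_for_status(resp)

parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)

# Keep a few analysis-friendly columns from the article list
science_df <- as_tibble(parsed$results) %>%
  select(title, byline, section, published_date, url)

glimpse(science_df)
```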
ARIMA
ARIMA stands for AutoRegressive Integrated Moving Average, a class of statistical models used for analyzing and forecasting time series data. The model has three key components represented by the notation ARIMA(p,d,q): the autoregressive term (p) uses past values to predict future values, the integrated term (d) represents the number of times the data needs to be differenced to achieve stationarity, and the moving average term (q) uses past forecast errors to improve predictions. ARIMA models are particularly effective for non-stationary time series data that exhibit trends or patterns over time. The autoregressive component captures the relationship between an observation and a lagged observation, while the moving average component models the dependency between an observation and residual errors from past predictions. Differencing removes trends and seasonal patterns, transforming the data into a stationary series where statistical properties remain constant over time. Model selection typically involves examining ACF and PACF plots, comparing information criteria like AIC or BIC, and validating that residuals resemble white noise. ARIMA models are widely used in economics, finance, and forecasting applications because they balance flexibility with interpretability and can capture complex temporal dependencies in data.
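As a brief illustration of that workflow, here is a hedged fable sketch on a generic tsibble `y_ts` with a value column `y`:

```r
library(fpp3)

# Inspect the series, its ACF, and its PACF
y_ts |> gg_tsdisplay(y, plot_type = "partial")

# Suggested differencing order (the "d" in ARIMA(p,d,q))
y_ts |> features(y, unitroot_ndiffs)

# Automatic (p,d,q) selection by AICc, then residual diagnostics
fit <- y_ts |> model(ARIMA(y))
report(fit)              # prints the selected ARIMA(p,d,q) and coefficients
fit |> gg_tsresiduals()  # residuals should resemble white noise
fit |> forecast(h = 12) |> autoplot(y_ts)
```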
Analysis of Search Trends
This analysis examines Google search trends for three essential data science skills—Python, SQL, and Tableau—over a five-year period from October 2020 to October 2025. By analyzing search interest patterns, we can identify which skills are gaining traction, which are declining, and what this means for data professionals and employers. Key Questions We’ll Answer: Which skill dominates the data science landscape? How have these skills evolved over time? What does the future hold for Python demand?
DATA607Project2_Employment
The dataset for this project can be classified as untidy because the columns contain multiple variables and the rows contain multiple observations. In this project I will tidy the dataset by transforming the data from a wide to a long format that is easier for a data analyst or data scientist to work with.
Insurance Data Analysis
This project demonstrates the transformation of the Insurance dataset from wide format to long (tidy) format using R's tidyverse package. The Insurance dataset contains health insurance information for 1,338 individuals, including demographic characteristics and healthcare charges. The primary objectives of this data transformation project are to:
- Restructure the data from wide format (where multiple measurements exist as separate columns) to long format (where each measurement becomes its own row)
- Apply tidy data principles to make the dataset more suitable for statistical analysis and visualization
- Demonstrate best practices in data wrangling and preparation for data science workflows
- Fulfill DATA 624 course requirements by showcasing proficiency in data transformation techniques essential for masters-level data analysis
Tidying Wide Datasets to produce Long Datasets
Tidying wide datasets involves transforming data from a format where multiple measurements are spread across separate columns into a long format where each row represents a single observation. In wide format, each subject or entity occupies one row with many columns representing different variables or time periods, which can make filtering, grouping, and visualization challenging. The transformation process uses functions like `pivot_longer()` in R or `melt()` in Python to collapse multiple measurement columns into two key columns: one identifying the type of measurement and another containing the actual value. This restructuring follows tidy data principles where each variable forms a column, each observation forms a row, and each type of observational unit forms a table, making the data more suitable for statistical analysis and machine learning algorithms. The result is a dataset with more rows but fewer columns that is easier to filter by measurement type, create visualizations with, and analyze using modern data science tools.
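For example, a minimal tidyr sketch with a made-up wide table of quarterly sales:

```r
library(dplyr)
library(tidyr)

wide <- tibble(
  store    = c("A", "B"),
  q1_sales = c(100, 80),
  q2_sales = c(120, 95),
  q3_sales = c(90, 110)
)

# Collapse the quarter columns into a measurement-type column and a value column
long <- wide %>%
  pivot_longer(cols = starts_with("q"),
               names_to = "quarter",
               values_to = "sales")

long   # one row per store-quarter combination
```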
Loading HTML, XML, and JSON files into R
Project Summary: This project demonstrates my proficiency in data acquisition and manipulation by reading and parsing identical datasets stored in three different file formats: XML, JSON, and HTML. Using R and specialized packages (xml2, jsonlite, and rvest), I successfully extracted structured book data from each format, transformed it into clean data frames, and validated consistency across formats. The project showcases essential data engineering skills including web-based data retrieval via GitHub URLs, format-specific parsing techniques, reproducible research through R Markdown, and professional documentation. This work highlights my ability to handle diverse data sources—a critical skill in modern data science where information comes from APIs (JSON), enterprise systems (XML), and web scraping (HTML). The complete analysis is published on RPubs with source files hosted on GitHub, demonstrating my commitment to reproducible research and version control best practices.
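A condensed sketch of the three readers; the raw-file URL and the book/title/author structure are placeholders standing in for the project's actual GitHub files:

```r
library(xml2)
library(jsonlite)
library(rvest)
library(dplyr)

base <- "https://raw.githubusercontent.com/user/repo/main/books"  # placeholder

# JSON: an array of book objects parses directly to a data frame
books_json <- fromJSON(paste0(base, ".json")) %>% as_tibble()

# XML: locate repeating <book> nodes and pull child text
doc <- read_xml(paste0(base, ".xml"))
books_xml <- tibble(
  title  = xml_text(xml_find_all(doc, ".//book/title")),
  author = xml_text(xml_find_all(doc, ".//book/author"))
)

# HTML: read the page and take the first table rvest finds
books_html <- read_html(paste0(base, ".html")) %>% html_table() %>% .[[1]]
```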
Project2- Transform Data
Transforming data from wide to long format
Exponential Smoothing
# Exponential Smoothing Analysis: Time Series Forecasting Study

This comprehensive analysis explores exponential smoothing methods for forecasting time series data across multiple datasets including Australian livestock, Botswana exports, Chinese GDP, Australian gas production, and retail sales. The study systematically compares simple exponential smoothing (ETS(A,N,N)) with trend-based models (ETS(A,A,N)) and damped trend variants (ETS(A,Ad,N)), evaluating their performance through metrics like RMSE, AIC, and BIC while examining when multiplicative seasonality outperforms additive approaches. Key findings demonstrate that multiplicative seasonality is essential for data with proportionally growing variance, damped trends provide more conservative long-term forecasts though not always better statistical fit, and STL decomposition with Box-Cox transformation can improve forecast accuracy for complex seasonal patterns. The analysis includes detailed residual diagnostics, prediction interval calculations, and test set validation to determine which forecasting methods best balance accuracy and practical applicability for different types of time series data.
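A hedged fable sketch of the three ETS specifications being compared, using a generic tsibble `exports_ts` with an `Exports` column:

```r
library(fpp3)

fits <- exports_ts |>
  model(
    ses    = ETS(Exports ~ error("A") + trend("N") + season("N")),    # ETS(A,N,N)
    holt   = ETS(Exports ~ error("A") + trend("A") + season("N")),    # ETS(A,A,N)
    damped = ETS(Exports ~ error("A") + trend("Ad") + season("N"))    # ETS(A,Ad,N)
  )

glance(fits) |> dplyr::select(.model, AIC, AICc, BIC)  # information criteria
accuracy(fits) |> dplyr::select(.model, RMSE)          # training-set RMSE

fits |> forecast(h = 10) |> autoplot(exports_ts)
```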
Data Prep for Modeling
Pre-Processing Data with Visualizations
Use data visualizations to analyze data
Chess Tournament Performance Analysis Using ELO Expected Scores
This analysis evaluates player performance in a chess tournament by comparing actual scores to ELO-based expected scores using the USCF standard formula: Expected Score = 1 / (1 + 10^((Opponent Rating - Player Rating)/400)). Using R to process tournament data for 63 players, we calculated each player's expected score against their specific opponents and identified the five biggest overperformers and underperformers. The results revealed dramatic performance variations, with ADITYA BAJAJ (MI) showing the most remarkable overperformance at +3.14 points above expected (actual: 6.0, expected: 2.86), while LOREN SCHWIEBERT (MI) had the largest underperformance at -2.51 points below expected (actual: 3.5, expected: 6.01). The analysis demonstrates how ELO-based expectations can quantify tournament performance relative to pre-tournament ratings, providing valuable insights for chess rating systems and player assessment in competitive tournaments.
Document
# Chess ELO Expected Score Calculator
# Formula source: Solon, Nate. "How Elo Ratings Actually Work." Zwischenzug,
# https://zwischenzug.substack.com/p/how-elo-ratings-actually-work
# Expected Score = 1 / (1 + 10^((opponent_rating - player_rating)/400))
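A direct R implementation of the formula in those comments (a minimal sketch; the function name is my own):

```r
# Expected score for a player against one opponent, per the Elo/USCF formula
expected_score <- function(player_rating, opponent_rating) {
  1 / (1 + 10^((opponent_rating - player_rating) / 400))
}

# Example: a 1600-rated player facing a 1500-rated opponent
expected_score(1600, 1500)   # approximately 0.64
```

Summing this value over a player's opponents gives the tournament expected score that the analysis above compares against the actual score.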
Data Wrangling and Visualization with R
Flight Performance Analysis: Alaska Airlines vs AmWest Airlines

This comprehensive analysis examines flight delay performance for Alaska Airlines and AmWest Airlines across five major West Coast destinations (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle) using data transformation, statistical analysis, and visualization techniques in R. The study demonstrates how overall network performance metrics can mask significant city-by-city operational variations, revealing that while AmWest achieves superior overall performance with a 10.9% delay rate compared to Alaska's 13.3%, the "better" airline varies substantially by destination. Through data tidying with tidyr, statistical summaries with dplyr, and professional visualizations using ggplot2, the analysis illustrates the critical importance of route-specific performance evaluation for both passengers making travel decisions and airlines optimizing operational strategies. Key findings show that both airlines maintain excellent performance with delay rates below 15%, but city-by-city analysis reveals location-specific operational competencies that are obscured when relying solely on aggregate network statistics, highlighting the analytical value of granular data examination in transportation performance assessment.
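A small dplyr sketch of the per-city delay-rate comparison described above, using toy counts in the same tidied shape (airline, city, status, count):

```r
library(dplyr)

# Toy counts only -- the real figures come from the tidied Alaska/AmWest table
flights_long <- tribble(
  ~airline, ~city,     ~status,   ~n,
  "Alaska", "Phoenix", "on time", 200,
  "Alaska", "Phoenix", "delayed",  15,
  "AmWest", "Phoenix", "on time", 480,
  "AmWest", "Phoenix", "delayed",  40
)

# Per-airline, per-city delay rate
flights_long %>%
  group_by(airline, city) %>%
  summarise(delay_rate = n[status == "delayed"] / sum(n), .groups = "drop")
```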
Document
Baseball Data Exploration Project Summary
Data Preparation for a Baseball Dataset
This R Markdown document implements a comprehensive baseball data preparation pipeline that systematically cleans and enhances a dataset of 259 team observations with 16 original variables. The workflow begins by creating missing value indicator flags to preserve information about data patterns, then applies IQR-based outlier detection across all numeric variables. After dropping the highly incomplete TEAM_BATTING_HBP variable (92.7% missing), it imputes remaining missing values using median substitution for robustness against outliers. The feature engineering section creates meaningful baseball-specific metrics including offensive power ratios, base-running efficiency, pitching effectiveness (WHIP proxy), and disciplinary measures (walk-to-strikeout ratios). The pipeline applies log transformations to highly skewed variables, creates categorical performance tiers (High/Medium/Low offensive performance, Elite/Average/Poor pitching, and error rate buckets), and concludes with correlation analysis and data quality validation. This systematic approach transforms raw baseball statistics into a modeling-ready dataset with both original variables and engineered features that capture key aspects of team performance across batting, pitching, base-running, and defensive capabilities.
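A condensed sketch of the outlier-flagging and median-imputation steps, assuming the raw data sit in a data frame `baseball_df` (the 1.5 x IQR rule shown is the conventional choice):

```r
library(dplyr)

flag_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

baseball_prep <- baseball_df %>%
  # 1. missing-value indicator flags for every numeric column
  mutate(across(where(is.numeric), ~ as.integer(is.na(.x)),
                .names = "{.col}_missing")) %>%
  # 2. IQR-based outlier flags on the original numeric columns
  mutate(across(where(is.numeric) & !ends_with("_missing"),
                ~ as.integer(flag_outliers(.x)), .names = "{.col}_outlier")) %>%
  # 3. median imputation of remaining missing values
  mutate(across(where(is.numeric) & !ends_with(c("_missing", "_outlier")),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
```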
Forecasting
# Forecasting Australian Retail Time Series: A Comprehensive Analysis

## Dataset Overview

This analysis examines Australian retail turnover data from the `aus_retail` dataset, focusing on time series forecasting methodologies and residual diagnostics. The study encompasses multiple retail sectors and employs various forecasting techniques to evaluate predictive performance.

## Key Analytical Components

**Time Series Characteristics**: The dataset reveals diverse patterns across different retail categories, with seasonal variations, trending behaviors, and structural changes evident throughout the observation period from the 1980s through 2010s.

**Forecasting Methods Applied**:
- Seasonal Naive (SNAIVE) for capturing repetitive seasonal patterns
- Random Walk with Drift for trending data
- Naive methods for baseline comparisons

**Model Validation Framework**: Comprehensive residual analysis using three-panel diagnostic plots examining temporal patterns, autocorrelation functions (ACF), and distributional properties to assess white noise assumptions.

## Notable Findings

**Residual Analysis**: The study revealed that simple forecasting methods often fail to capture complex underlying structures in retail data. Residuals frequently exhibited non-random patterns, autocorrelation, and heteroscedasticity, indicating opportunities for more sophisticated modeling approaches.

**Structural Changes**: Evidence of significant structural breaks and unusual events (particularly around 1995-1997) suggests external economic factors substantially impact retail performance beyond seasonal patterns.

**Training Data Sensitivity**: Forecast accuracy demonstrates notable sensitivity to training period selection, with implications for practical forecasting applications in retail planning.

## Technical Implementation

The analysis leverages the `fpp3` package ecosystem in R, employing modern tidyverse principles for data manipulation and the `tsibble` framework for time series operations. Cross-validation techniques separate training and test periods to ensure robust accuracy assessment.

This comprehensive approach provides valuable insights into Australian retail dynamics while demonstrating practical applications of time series forecasting methodologies in economic analysis.
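A minimal fpp3 sketch of the benchmark comparison and residual diagnostics described above, filtering one retail series for illustration:

```r
library(fpp3)

series <- aus_retail |>
  filter(State == "Victoria", Industry == "Food retailing")

train <- series |> filter(year(Month) <= 2015)

fits <- train |>
  model(snaive = SNAIVE(Turnover),
        drift  = RW(Turnover ~ drift()),
        naive  = NAIVE(Turnover))

# Three-panel residual diagnostics (time plot, ACF, histogram) for SNAIVE
fits |> select(snaive) |> gg_tsresiduals()

# Forecast the held-out years and score against the full series
fits |> forecast(h = "3 years") |> accuracy(series)
```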
Data Analysis: Converting .txt to .csv
Chess Tournament Data Analysis Project

This project converts unstructured chess tournament data from a fixed-width, pipe-delimited text file containing 64 players into a clean, analyzable CSV dataset through systematic parsing and data extraction. The raw data presents multiple challenges including player information spanning two lines, inconsistent spacing, embedded separators, and mixed alphanumeric content requiring careful line-by-line processing using R's string manipulation functions and regular expressions. Key extracted fields include player identification (name, state, USCF ID), rating information (pre- and post-tournament ratings), tournament performance (total points, round-by-round results, opponent numbers, colors played), and calculated metrics such as average opponent rating for strength-of-schedule analysis. The technical approach employs R programming with core libraries including `stringr` for text processing, `dplyr` for data manipulation, and `readr` for file I/O operations, implementing robust data cleaning to remove separator lines and headers, type conversion from text to numeric values, cross-referencing to match opponent numbers, and quality validation to check for missing values and data consistency. Project deliverables include a structured CSV file with complete player and tournament data, a comprehensive HTML report documenting the conversion process using R Markdown for reproducible analysis, data quality assessments highlighting limitations, and summary statistics with interactive data tables using the DT package, ultimately transforming complex human-readable tournament records into machine-readable format suitable for statistical analysis, database storage, or integration with tournament management systems.
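A small stringr sketch of the line-by-line extraction involved, assuming the familiar cross-table layout in which each player's record spans two pipe-delimited lines (the file name and field positions are illustrative):

```r
library(readr)
library(stringr)
library(dplyr)

lines <- read_lines("tournamentinfo.txt")                  # placeholder file name
lines <- lines[!str_detect(lines, "^-+$") & lines != ""]   # drop dashed separators

# First line of each pair: number | name | points | round results ...
# Second line: state | USCF ID / pre- and post-ratings ...
info1 <- lines[str_detect(lines, "^\\s*\\d+\\s*\\|")]
info2 <- lines[str_detect(lines, "^\\s*[A-Z]{2}\\s*\\|")]

players <- tibble(
  name       = str_trim(str_split_fixed(info1, "\\|", 11)[, 2]),
  points     = as.numeric(str_trim(str_split_fixed(info1, "\\|", 11)[, 3])),
  state      = str_trim(str_split_fixed(info2, "\\|", 11)[, 1]),
  pre_rating = as.numeric(str_extract(str_extract(info2, "R:\\s*\\d+"), "\\d+"))
)
```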
Time Series Decomposition by Candace Grant
Advanced Time Series Analysis and Decomposition Techniques

This time series analysis demonstrates advanced statistical modeling across multiple economic datasets, employing decomposition methodologies including classical multiplicative decomposition, STL decomposition, and X-11 seasonal adjustment to isolate trend, seasonal, and irregular components. Particular emphasis is placed on Australian labour force dynamics (1978-1995), where 38% secular growth is dominated by the trend component. Key technical achievements include systematic Box-Cox transformation analysis using Guerrero-method optimization to determine optimal variance-stabilizing parameters across diverse datasets: Canadian gas production (λ = 0.577), Australian retail series (λ = 0.371), tobacco production (λ = 0.926), airline passengers (λ = 2.0), and pedestrian traffic (λ = 0.273), together with a clear decision framework for when transformation is necessary. The analysis also uses X-11 irregular components for outlier detection, identifying structural breaks and anomalous periods in retail data (including significant outliers during the early 2000s economic expansion), quantifying outlier effects on seasonal adjustment procedures, and demonstrating stronger detection capabilities than classical methods. A comparative framework evaluates transformation effectiveness through before/after visualizations and statistical validation, applying consistent protocols across heterogeneous datasets to show scalable methods suitable for production-level forecasting environments that support strategic decision-making in economic forecasting, retail planning, and resource allocation. The ability to parse complex temporal signals into interpretable components enables evidence-based policy recommendations and risk assessment, showcasing proficiency in R/fpp3, advanced time series modeling, statistical transformation theory, and macroeconomic data analysis, with clear business applications for organizations requiring sophisticated temporal pattern recognition and forecasting.
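A brief fpp3 sketch of the Guerrero lambda selection and STL decomposition steps referenced above, using `canadian_gas` from the fpp3 data collection as a stand-in series:

```r
library(fpp3)

# Guerrero-optimal Box-Cox lambda for variance stabilization
lambda <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)

# STL decomposition of the transformed series into trend, season, and remainder
canadian_gas |>
  model(STL(box_cox(Volume, lambda) ~ season(window = "periodic"))) |>
  components() |>
  autoplot()
```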
Data 602 Wk2 | Intro to Data | Candace Grant
In this lab I explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013. I generate simple graphical and numerical summaries of the data on these flights and explore delay times.
Banking Data Analysis
This data report presents an analysis of a marketing dataset from a Portuguese banking institution's direct marketing campaigns. The dataset focuses on phone-based marketing efforts aimed at promoting term deposits to clients. The primary objective is to develop a predictive classification model that determines whether a client will subscribe to a term deposit (binary outcome: 'yes' or 'no'). The campaigns often required multiple contacts with the same client to achieve successful conversions, making this a complex customer behavior prediction problem.