Recently Published
DATA607Project2_Employment
The dataset for this project can be classified as untidy because its columns contain multiple variables and its rows contain multiple observations. In this project I tidy the dataset by transforming it from a wide format to a long format that is easier for a data analyst or data scientist to work with.
Insurance Data Analysis
This project demonstrates the transformation of the Insurance dataset from wide format to long (tidy) format using R's tidyverse package. The Insurance dataset contains health insurance information for 1,338 individuals, including demographic characteristics and healthcare charges.
The primary objectives of this data transformation project are to:
- Restructure the data from wide format (where multiple measurements exist as separate columns) to long format (where each measurement becomes its own row)
- Apply tidy data principles to make the dataset more suitable for statistical analysis and visualization
- Demonstrate best practices in data wrangling and preparation for data science workflows
- Fulfill DATA 624 course requirements by showcasing proficiency in data transformation techniques essential for masters-level data analysis
Tidying Wide Datasets to Produce Long Datasets
Tidying wide datasets involves transforming data from a format where multiple measurements are spread across separate columns into a long format where each row represents a single observation. In wide format, each subject or entity occupies one row with many columns representing different variables or time periods, which can make filtering, grouping, and visualization challenging.
The transformation process uses functions like `pivot_longer()` in R or `melt()` in Python to collapse multiple measurement columns into two key columns: one identifying the type of measurement and another containing the actual value. This restructuring follows tidy data principles where each variable forms a column, each observation forms a row, and each type of observational unit forms a table, making the data more suitable for statistical analysis and machine learning algorithms.
The result is a dataset with more rows but fewer columns that is easier to filter by measurement type, create visualizations with, and analyze using modern data science tools.
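As a brief illustration of this reshaping step, the sketch below pivots a small, hypothetical wide table into long format with `pivot_longer()`; the column names are invented for demonstration and are not taken from the project dataset.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per subject, one column per quarter
wide <- tibble(
  subject = c("A", "B"),
  q1 = c(10, 12),
  q2 = c(14, 11),
  q3 = c(9, 15)
)

# Collapse the quarter columns into a measurement/value pair
long <- wide |>
  pivot_longer(
    cols = q1:q3,
    names_to = "quarter",
    values_to = "value"
  )

long   # one row per subject-quarter combination: 6 rows, 3 columns
```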
Loading HTML, XML, and JSON files into R
Project Summary:
This project demonstrates my proficiency in data acquisition and manipulation by reading and parsing identical datasets stored in three different file formats: XML, JSON, and HTML. Using R and specialized packages (xml2, jsonlite, and rvest), I successfully extracted structured book data from each format, transformed it into clean data frames, and validated consistency across formats. The project showcases essential data engineering skills including web-based data retrieval via GitHub URLs, format-specific parsing techniques, reproducible research through R Markdown, and professional documentation. This work highlights my ability to handle diverse data sources—a critical skill in modern data science where information comes from APIs (JSON), enterprise systems (XML), and web scraping (HTML). The complete analysis is published on RPubs with source files hosted on GitHub, demonstrating my commitment to reproducible research and version control best practices.
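A minimal sketch of the three parsing paths is shown below; the file paths are placeholders (the actual GitHub raw URLs are in the published source), and the XPath and table selectors assume a simple book schema rather than the project's exact structure.

```r
library(jsonlite)  # JSON parsing
library(xml2)      # XML parsing
library(rvest)     # HTML scraping

# Placeholder paths -- substitute the GitHub raw URLs used in the project
books_json <- fromJSON("books.json")          # parses directly to a data frame or list

books_xml_doc <- read_xml("books.xml")
books_xml <- data.frame(                      # XPath assumes <book><title>/<author> nodes
  title  = xml_text(xml_find_all(books_xml_doc, "//book/title")),
  author = xml_text(xml_find_all(books_xml_doc, "//book/author"))
)

books_html <- read_html("books.html") |>
  html_element("table") |>                    # assumes the book data is the first table
  html_table()
```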
Project 2 - Transform Data
Transforming data from wide to long format
Exponential Smoothing
# Exponential Smoothing Analysis: Time Series Forecasting Study
This comprehensive analysis explores exponential smoothing methods for forecasting time series data across multiple datasets including Australian livestock, Botswana exports, Chinese GDP, Australian gas production, and retail sales. The study systematically compares simple exponential smoothing (ETS(A,N,N)) with trend-based models (ETS(A,A,N)) and damped trend variants (ETS(A,Ad,N)), evaluating their performance through metrics like RMSE, AIC, and BIC while examining when multiplicative seasonality outperforms additive approaches. Key findings demonstrate that multiplicative seasonality is essential for data with proportionally growing variance, damped trends provide more conservative long-term forecasts though not always better statistical fit, and STL decomposition with Box-Cox transformation can improve forecast accuracy for complex seasonal patterns. The analysis includes detailed residual diagnostics, prediction interval calculations, and test set validation to determine which forecasting methods best balance accuracy and practical applicability for different types of time series data.
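A condensed version of the model comparison described above might look like the following fpp3 sketch. The Victorian pigs series from `aus_livestock` stands in for the various datasets analysed, and the forecast horizon is illustrative only.

```r
library(fpp3)

# Victorian pig slaughter from aus_livestock (one of the series discussed above)
pigs <- aus_livestock |>
  filter(Animal == "Pigs", State == "Victoria")

# Simple, trend, and damped-trend exponential smoothing
fits <- pigs |>
  model(
    ses    = ETS(Count ~ error("A") + trend("N")  + season("N")),
    holt   = ETS(Count ~ error("A") + trend("A")  + season("N")),
    damped = ETS(Count ~ error("A") + trend("Ad") + season("N"))
  )

glance(fits)   |> select(.model, AIC, BIC)   # information criteria
accuracy(fits) |> select(.model, RMSE)       # training-set RMSE

fits |> forecast(h = 12) |> autoplot(pigs)
```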
Pre-Processing Data with Visualizations
Using data visualizations to explore and analyze data
Chess Tournament Performance Analysis Using ELO Expected Scores
This analysis evaluates player performance in a chess tournament by comparing actual scores to ELO-based expected scores using the USCF standard formula: Expected Score = 1 / (1 + 10^((Opponent Rating - Player Rating)/400)). Using R to process tournament data for 63 players, we calculated each player's expected score against their specific opponents and identified the five biggest overperformers and underperformers. The results revealed dramatic performance variations, with ADITYA BAJAJ (MI) showing the most remarkable overperformance at +3.14 points above expected (actual: 6.0, expected: 2.86), while LOREN SCHWIEBERT (MI) had the largest underperformance at -2.51 points below expected (actual: 3.5, expected: 6.01). The analysis demonstrates how ELO-based expectations can quantify tournament performance relative to pre-tournament ratings, providing valuable insights for chess rating systems and player assessment in competitive tournaments.
# Chess ELO Expected Score Calculator
# Formula source: Solon, Nate. "How Elo Ratings Actually Work." Zwischenzug,
# https://zwischenzug.substack.com/p/how-elo-ratings-actually-work
# Expected Score = 1 / (1 + 10^((opponent_rating - player_rating)/400))
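Building on the comments above, a small helper of this sort could implement the cited formula; the function and argument names here are mine, and the opponent ratings in the usage example are illustrative rather than taken from the tournament file.

```r
# Expected score under the USCF/Elo formula cited above
elo_expected <- function(player_rating, opponent_rating) {
  1 / (1 + 10^((opponent_rating - player_rating) / 400))
}

# A 1794-rated player facing a 1553-rated opponent (illustrative ratings)
elo_expected(1794, 1553)
#> roughly 0.80, i.e. an 80% expected score

# Expected tournament score = sum of per-game expectations (illustrative opponents)
sum(elo_expected(1794, c(1553, 1663, 1716, 1629, 1604, 1595, 1649)))
```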
Data Wrangling and Visualization with R
Flight Performance Analysis: Alaska Airlines vs AmWest Airlines
This comprehensive analysis examines flight delay performance for Alaska Airlines and AmWest Airlines across five major West Coast destinations (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle) using data transformation, statistical analysis, and visualization techniques in R. The study demonstrates how overall network performance metrics can mask significant city-by-city operational variations, revealing that while AmWest achieves superior overall performance with a 10.9% delay rate compared to Alaska's 13.3%, the "better" airline varies substantially by destination. Through data tidying with tidyr, statistical summaries with dplyr, and professional visualizations using ggplot2, the analysis illustrates the critical importance of route-specific performance evaluation for both passengers making travel decisions and airlines optimizing operational strategies. Key findings show that both airlines maintain excellent performance with delay rates below 15%, but city-by-city analysis reveals location-specific operational competencies that are obscured when relying solely on aggregate network statistics, highlighting the analytical value of granular data examination in transportation performance assessment.
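The sketch below shows the shape of this tidy-then-summarise workflow with tidyr, dplyr, and ggplot2. The counts in `flights_wide` are hypothetical placeholders in the same wide layout as the project data, not the actual delay figures.

```r
library(tidyverse)

# Hypothetical counts in the same wide layout as the project data
flights_wide <- tribble(
  ~airline, ~status,   ~`Los Angeles`, ~Phoenix, ~`San Diego`, ~`San Francisco`, ~Seattle,
  "Alaska", "on time", 500,            220,      210,          500,              1840,
  "Alaska", "delayed", 60,             12,       20,           100,              300,
  "AmWest", "on time", 690,            4800,     380,          320,              200,
  "AmWest", "delayed", 120,            410,      65,           130,              60
)

# Tidy to one row per airline-city-status, then compute per-city delay rates
delay_rates <- flights_wide |>
  pivot_longer(-c(airline, status), names_to = "city", values_to = "n") |>
  pivot_wider(names_from = status, values_from = n) |>
  mutate(delay_rate = delayed / (delayed + `on time`))

# Delay rate by city and airline, dodged for direct comparison
ggplot(delay_rates, aes(city, delay_rate, fill = airline)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Share of flights delayed")
```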
Baseball Data Exploration Project Summary
Data Preparation for a baseball dataset
This R Markdown document implements a comprehensive baseball data preparation pipeline that systematically cleans and enhances a dataset of 259 team observations with 16 original variables. The workflow begins by creating missing value indicator flags to preserve information about data patterns, then applies IQR-based outlier detection across all numeric variables. After dropping the highly incomplete TEAM_BATTING_HBP variable (92.7% missing), it imputes remaining missing values using median substitution for robustness against outliers. The feature engineering section creates meaningful baseball-specific metrics including offensive power ratios, base-running efficiency, pitching effectiveness (WHIP proxy), and disciplinary measures (walk-to-strikeout ratios). The pipeline applies log transformations to highly skewed variables, creates categorical performance tiers (High/Medium/Low offensive performance, Elite/Average/Poor pitching, and error rate buckets), and concludes with correlation analysis and data quality validation. This systematic approach transforms raw baseball statistics into a modeling-ready dataset with both original variables and engineered features that capture key aspects of team performance across batting, pitching, base-running, and defensive capabilities.
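A compressed sketch of the first stages of this pipeline is shown below; `df` and all column names other than `TEAM_BATTING_HBP` are assumptions about the dataset, not the published code.

```r
library(dplyr)

# df: the raw team dataset (assumed name; TEAM_BATTING_HBP is from the project summary)
prep <- df |>
  # 1. Missing-value indicator flags, preserving the missingness pattern
  mutate(across(where(is.numeric), ~ as.integer(is.na(.x)), .names = "{.col}_missing")) |>
  # 2. Drop the mostly-missing variable noted above (its flag column is retained)
  select(-TEAM_BATTING_HBP) |>
  # 3. Median imputation for remaining numeric NAs (robust to outliers)
  mutate(across(where(is.numeric), ~ coalesce(.x, median(.x, na.rm = TRUE))))

# IQR-based outlier flag for a single numeric column
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x < q[1] - k * diff(q) | x > q[2] + k * diff(q)
}
```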
Forecasting
# Forecasting Australian Retail Time Series: A Comprehensive Analysis
## Dataset Overview
This analysis examines Australian retail turnover data from the `aus_retail` dataset, focusing on time series forecasting methodologies and residual diagnostics. The study encompasses multiple retail sectors and employs various forecasting techniques to evaluate predictive performance.
## Key Analytical Components
**Time Series Characteristics**: The dataset reveals diverse patterns across different retail categories, with seasonal variations, trending behaviors, and structural changes evident throughout the observation period from the 1980s through the 2010s.
**Forecasting Methods Applied**:
- Seasonal Naive (SNAIVE) for capturing repetitive seasonal patterns
- Random Walk with Drift for trending data
- Naive methods for baseline comparisons
**Model Validation Framework**: Comprehensive residual analysis using three-panel diagnostic plots examining temporal patterns, autocorrelation functions (ACF), and distributional properties to assess white noise assumptions.
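An fpp3 sketch of this fit-and-diagnose loop is shown below; the particular Victorian retail series is chosen only for illustration, and the forecast horizon is arbitrary.

```r
library(fpp3)

# One illustrative series from aus_retail
series <- aus_retail |>
  filter(State == "Victoria",
         Industry == "Cafes, restaurants and catering services")

# Fit the three benchmark methods listed above
fits <- series |>
  model(
    snaive = SNAIVE(Turnover),
    drift  = RW(Turnover ~ drift()),
    naive  = NAIVE(Turnover)
  )

# Three-panel residual diagnostics (time plot, ACF, histogram) for the seasonal naive model
fits |> select(snaive) |> gg_tsresiduals()

# Forecasts overlaid on the observed series
fits |> forecast(h = "2 years") |> autoplot(series)
```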
## Notable Findings
**Residual Analysis**: The study revealed that simple forecasting methods often fail to capture complex underlying structures in retail data. Residuals frequently exhibited non-random patterns, autocorrelation, and heteroscedasticity, indicating opportunities for more sophisticated modeling approaches.
**Structural Changes**: Evidence of significant structural breaks and unusual events (particularly around 1995-1997) suggests external economic factors substantially impact retail performance beyond seasonal patterns.
**Training Data Sensitivity**: Forecast accuracy demonstrates notable sensitivity to training period selection, with implications for practical forecasting applications in retail planning.
## Technical Implementation
The analysis leverages the `fpp3` package ecosystem in R, employing modern tidyverse principles for data manipulation and the `tsibble` framework for time series operations. Cross-validation techniques separate training and test periods to ensure robust accuracy assessment.
This comprehensive approach provides valuable insights into Australian retail dynamics while demonstrating practical applications of time series forecasting methodologies in economic analysis.
Data Analysis: Converting .txt to .csv
# Chess Tournament Data Analysis Project
## Project Overview
This project involves converting unstructured chess tournament data from a text file format into a clean, analyzable CSV dataset. The raw data contains player information, ratings, tournament results, and round-by-round game outcomes in a complex tabular text format that requires careful parsing and data extraction.
## Data Source Characteristics
• **Format**: Fixed-width text file with pipe-delimited sections
• **Structure**: Player data spans two lines - basic info and rating/state details
• **Content**: 64 players with complete tournament records
• **Complexity**: Mixed alphanumeric data with varying field lengths
• **Challenges**: Inconsistent spacing, embedded separators, multi-line player records
## Technical Approach
### **Data Extraction Strategy**
• **Line-by-line parsing** using R's string manipulation functions
• **Pattern matching** with regular expressions to identify data fields
• **Two-line processing** to capture complete player information
• **Robust error handling** for malformed or missing data entries
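A minimal sketch of this two-line parsing approach appears below. The file name, the header/separator layout, and the regular expressions are assumptions based on the format described above, not the exact published code.

```r
library(stringr)
library(readr)

lines <- read_lines("tournamentinfo.txt")                 # path assumed

# Drop dashed separator rows and blanks; skip the two header rows that remain
records <- lines[!str_detect(lines, "^-{3,}") & str_trim(lines) != ""]
records <- records[-(1:2)]

# Player records alternate: an info line, then a rating/state line
info_lines   <- records[seq(1, length(records), by = 2)]  # name, total points, rounds
rating_lines <- records[seq(2, length(records), by = 2)]  # state, USCF ID, pre/post ratings

# Example field extractions (regexes assume the layout described above)
name       <- str_trim(str_split_fixed(info_lines, "\\|", n = 11)[, 2])
total_pts  <- parse_number(str_split_fixed(info_lines, "\\|", n = 11)[, 3])
state      <- str_trim(str_split_fixed(rating_lines, "\\|", n = 11)[, 1])
pre_rating <- parse_number(str_extract(rating_lines, "R:\\s*\\d+"))
```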
### **Key Data Fields Extracted**
• **Player identification**: Name, state, USCF ID numbers
• **Rating information**: Pre-tournament and post-tournament ratings
• **Tournament performance**: Total points earned, round-by-round results
• **Game details**: Opponent numbers, colors played (White/Black)
• **Calculated metrics**: Average opponent rating for strength-of-schedule analysis
### **Data Processing Steps**
• **Data cleaning**: Remove separator lines, headers, and empty entries
• **String parsing**: Split pipe-delimited fields and extract relevant information
• **Type conversion**: Convert text ratings and points to numeric values
• **Cross-referencing**: Match opponent numbers to calculate average opponent ratings
• **Quality validation**: Check for missing values and data consistency
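For the cross-referencing step, a dplyr sketch along the following lines could compute the average opponent rating; `players`, `games`, and their column names are hypothetical stand-ins for the parsed tables.

```r
library(dplyr)

# players: one row per player (player_num, name, pre_rating, ...)
# games:   one row per player-round with the opponent's pair number (opponent_num)
avg_opp <- games |>
  left_join(players |> select(player_num, opp_pre_rating = pre_rating),
            by = c("opponent_num" = "player_num")) |>
  group_by(player_num) |>
  summarise(avg_opponent_rating = round(mean(opp_pre_rating, na.rm = TRUE)))

# Attach the strength-of-schedule metric back onto the player table
players <- players |> left_join(avg_opp, by = "player_num")
```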
## Tools and Technologies Used
### **R Programming Environment**
• **Core libraries**: `stringr` for text processing, `dplyr` for data manipulation
• **File handling**: `readr` for robust file I/O operations
• **Documentation**: R Markdown for reproducible analysis and reporting
• **Output generation**: CSV export for universal data compatibility
### **Analysis Features**
• **Interactive data tables** using DT package for exploration
• **Summary statistics** for rating distributions and tournament metrics
• **Data quality reporting** with missing value analysis
• **Verification procedures** to ensure accurate data conversion
## Expected Deliverables
### **Primary Outputs**
• **Clean CSV file** with structured player and tournament data
• **Comprehensive HTML report** documenting the conversion process
• **Data quality assessment** highlighting any issues or limitations
• **Summary statistics** providing tournament overview insights
This approach transforms complex, human-readable tournament data into a machine-readable format suitable for statistical analysis, database storage, or integration with tournament management systems.
Time Series Decomposition by Candace Grant
Advanced Time Series Analysis and Decomposition Techniques
This comprehensive time series analysis demonstrates advanced statistical modeling capabilities across multiple economic datasets, employing sophisticated decomposition methodologies and transformation techniques. The assignment showcases proficiency in handling complex temporal data structures, applying appropriate statistical transformations, and extracting meaningful insights from macroeconomic indicators.
Key Technical Achievements
Box-Cox Transformation Analysis: Systematically determined optimal variance-stabilizing transformations across diverse datasets including Canadian gas production (λ = 0.577), Australian retail series (λ = 0.371), tobacco production (λ = 0.926), airline passengers (λ = 2.0), and pedestrian traffic (λ = 0.273). Applied Guerrero method optimization to identify appropriate transformation parameters and demonstrated clear decision frameworks for transformation necessity.
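As a small illustration of the Guerrero selection step, the fpp3 sketch below estimates and applies a lambda for the `canadian_gas` series; the exact value obtained depends on the data version installed.

```r
library(fpp3)

# Guerrero-selected lambda for Canadian gas production
lambda <- canadian_gas |>
  features(Volume, features = guerrero) |>
  pull(lambda_guerrero)

# Apply the Box-Cox transformation with the selected lambda and plot the result
canadian_gas |>
  autoplot(box_cox(Volume, lambda))
```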
Advanced Decomposition Methodologies: Implemented multiple decomposition techniques including classical multiplicative decomposition, STL decomposition, and X-11 seasonal adjustment procedures. Successfully isolated trend, seasonal, and irregular components across various time series, with particular emphasis on Australian labour force dynamics (1978-1995) revealing 38% secular growth dominated by trend components.
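The decomposition step might be sketched as follows, again using `canadian_gas` rather than the exact assignment series; the STL window settings are illustrative, and the X-11 call requires the `seasonal` package to be installed.

```r
library(fpp3)

# STL decomposition: trend, seasonal, and remainder components
canadian_gas |>
  model(stl = STL(Volume ~ trend(window = 13) + season(window = "periodic"))) |>
  components() |>
  autoplot()

# X-11 seasonal adjustment (X-13ARIMA-SEATS wrapper in feasts)
canadian_gas |>
  model(x11 = X_13ARIMA_SEATS(Volume ~ x11())) |>
  components() |>
  autoplot()
```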
Outlier Detection and Impact Analysis: Utilized X-11 irregular components to identify structural breaks and anomalous periods in retail data, including significant outliers during the early 2000s economic expansion. Quantified outlier effects on seasonal adjustment procedures and demonstrated superior outlier detection capabilities compared to classical methods.
Comparative Analytical Framework: Systematically evaluated transformation effectiveness through before/after visualizations and statistical validation. Applied consistent analytical protocols across heterogeneous datasets, demonstrating scalable methodological approaches suitable for production-level forecasting environments.
Strategic Business Applications
This analysis framework directly supports strategic decision-making in economic forecasting, retail planning, and resource allocation optimization. The demonstrated capability to parse complex signals into interpretable components enables evidence-based policy recommendations and risk assessment protocols essential for senior analytical roles in data-driven organizations.
Technical Stack: R/fpp3, advanced time series modeling, statistical transformation theory, macroeconomic data analysis
Data 602 Wk2 | Intro to Data | Candace Grant
In this lab I explore a random sample of domestic flights that departed from the three major New York City airports in 2013, generating simple graphical and numerical summaries of the data and exploring delay times.
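A few representative commands are sketched below, assuming the `nycflights` sample from the openintro package that this style of lab is typically built on; the variable names are assumptions rather than confirmed from the published lab.

```r
library(openintro)   # assumed source of the nycflights sample
library(dplyr)
library(ggplot2)

data(nycflights)

# Distribution of departure delays
ggplot(nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

# Numerical summary of delays by origin airport
nycflights |>
  group_by(origin) |>
  summarise(
    mean_delay   = mean(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE),
    n            = n()
  )
```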
Banking Data Analysis
This data report presents an analysis of a marketing dataset from a Portuguese banking institution's direct marketing campaigns. The dataset focuses on phone-based marketing efforts aimed at promoting term deposits to clients.
The primary objective is to develop a predictive classification model that determines whether a client will subscribe to a term deposit (binary outcome: 'yes' or 'no'). The campaigns often required multiple contacts with the same client to achieve successful conversions, making this a complex customer behavior prediction problem.
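To show what a first modeling pass could look like, here is a hedged logistic-regression sketch; the file name `bank-full.csv`, the semicolon delimiter, and the outcome column `y` follow the standard UCI Bank Marketing release and are assumptions about this project's exact data.

```r
# Assumed layout: semicolon-delimited UCI Bank Marketing file with outcome column y ("yes"/"no")
bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = TRUE)

# Baseline classifier: logistic regression on all predictors
fit <- glm(y ~ ., data = bank, family = binomial)

# Predicted subscription probabilities and a simple 0.5 cutoff
bank$pred_prob <- predict(fit, type = "response")
table(predicted = bank$pred_prob > 0.5, actual = bank$y)
```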