mark_data

mark

Recently Published

Data Job Salaries
An R-based data visualization examining how role specialization and geography shape salaries across five data careers — Data Architect, Data Engineer, Data Scientist, Business Analyst, and Data Analyst — using BLS OEWS May 2024 state-level wage data.
# Screenplay Structure Analysis

An exploratory data analysis of fourteen screenplays: eleven critically acclaimed films and three lower-rated comparison films. The central question: do great screenplays share a common structure, or only a common structural discipline?

## Films Analyzed

### Acclaimed

| Title | Writer(s) | Year |
|-------|-----------|------|
| Ikiru | Hashimoto, Kurosawa, Oguni | 1952 |
| Thelma & Louise | Callie Khouri | 1991 |
| Boogie Nights | Paul Thomas Anderson | 1997 |
| Eyes Wide Shut | Stanley Kubrick | 1999 |
| Moonlight | Barry Jenkins | 2016 |
| Get Out | Jordan Peele | 2017 |
| Parasite | Bong Joon-ho & Han Jin-won | 2019 |
| Portrait of a Lady on Fire | Céline Sciamma | 2019 |
| Aftersun | Charlotte Wells | 2022 |
| Sentimental Value | Joachim Trier & Eskil Vogt | 2025 |
| Sinners | Ryan Coogler | 2025 |

### Comparison

| Title | Writer(s) | Year | RT Score |
|-------|-----------|------|----------|
| The Mummy | Koepp, McQuarrie, Kussman | 2017 | 24% |
| Amsterdam | David O. Russell | 2022 | 33% |
| Don't Worry Darling | Katie Silberman | 2022 | 38% |

## Project Structure

    ├── data/
    │   ├── screenplays/                    ← PDF screenplay files
    │   ├── screenplays.html                ← Metadata table (hand-built)
    │   ├── screenplays.json                ← Metadata (hand-built)
    │   └── script_structure_dataset.csv    ← Hand-coded story beat sheet
    ├── Project_2.Rmd                       ← Full analysis
    ├── Project_2_Approach.Rmd              ← Approach document
    └── README.md

## Datasets

**Metadata**: title, writers, year, genre, runtime, page count, Rotten Tomatoes score, and original vs. adapted. Built by hand in both HTML and JSON, then loaded and compared in R.

**Story beat sheet**: six structural moments coded by hand for each screenplay: inciting incident, Act I break, midpoint, Act II break, climax, and resolution. All positions are recorded as page numbers and converted to percentages of total script length for cross-film comparison.

**PDF extraction**: dialogue exchange frequency and average lines per page, extracted programmatically using pdftools and stringr. Four films were excluded from extraction-based metrics because their PDFs are scans with no extractable text.

## Visualizations

- Estimated act structure with act break positions
- Beat interval heatmap: where each screenplay spends its pages
- Climax to resolution length: post-climax runway by film

## Key Finding

Structural timing alone does not explain quality. The comparison films hit broadly similar structural marks to the acclaimed films. What appears to differ is harder to quantify: the purposefulness of what happens between those marks. Great scripts in this sample share structural discipline more than structural uniformity.

## Tools

- **R / RStudio**: data loading, cleaning, extraction, analysis
- **ggplot2**: all visualizations
- **Quarto**: final report, published to RPubs

## Packages

- pdftools: PDF text extraction
- stringr: string parsing
- rvest: loading HTML
- jsonlite: loading JSON
- tidyr: reshaping
- dplyr: data manipulation
- ggplot2: visualization
- readr: CSV loading
- purrr: functional tools

## Notes

Ikiru appears in the metadata and act structure analysis but is omitted from the beat interval heatmap and ending compression chart because its Act II break could not be coded with confidence.

Four screenplays (Boogie Nights, Portrait of a Lady on Fire, Thelma & Louise, and one additional title) had PDF scan issues that made text extraction unreliable. Those values are recorded as NA and noted where relevant.
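The beat sheet step described under Datasets (page numbers converted to percentages of total script length) might look roughly like the sketch below. It is an illustration, not the project's actual code: the column names and every page number are hypothetical, and it uses base R only so that it runs without extra packages, whereas the project itself lists dplyr and tidyr.

```r
# Hypothetical beat sheet. Titles, column names, and page numbers are
# invented for illustration; they are not values from the project data.
beats <- data.frame(
  title             = c("Film A", "Film B"),
  total_pages       = c(120, 100),
  inciting_incident = c(12, 12),
  act1_break        = c(30, 28),
  midpoint          = c(60, 55),
  act2_break        = c(90, 80),
  climax            = c(108, 92),
  resolution        = c(114, 98)
)

beat_cols <- c("inciting_incident", "act1_break", "midpoint",
               "act2_break", "climax", "resolution")

# Express each beat as a percentage of that film's script length.
# Dividing a matrix by a vector of length nrow() recycles down the
# columns, so row i is divided by total_pages[i].
pct <- 100 * as.matrix(beats[beat_cols]) / beats$total_pages
rownames(pct) <- beats$title

# Post-climax runway: share of the script between climax and resolution.
runway <- pct[, "resolution"] - pct[, "climax"]

print(round(pct, 1))
print(runway)
```

Working in percentages rather than raw page numbers is what makes scripts of different lengths comparable on a single axis.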
Screenplay
# Screenplay Structure Analysis

An exploratory data analysis of eleven critically acclaimed screenplays spanning 1952 to 2025. The goal is to find out whether great screenplays share measurable structural and narrative patterns, and whether those patterns hold across genre, era, culture, and authorship.

## Screenplays

| Title | Writer(s) | Year | Original / Adapted |
|-------|-----------|------|--------------------|
| Ikiru | Hashimoto, Kurosawa, Oguni | 1952 | Original |
| Thelma & Louise | Callie Khouri | 1991 | Original |
| Boogie Nights | Paul Thomas Anderson | 1997 | Original |
| Eyes Wide Shut | Stanley Kubrick | 1999 | Adapted |
| Moonlight | Barry Jenkins | 2016 | Adapted |
| Get Out | Jordan Peele | 2017 | Original |
| Parasite | Bong Joon-ho & Han Jin-won | 2019 | Original |
| Portrait of a Lady on Fire | Céline Sciamma | 2019 | Original |
| Aftersun | Charlotte Wells | 2022 | Original |
| Sentimental Value | Joachim Trier & Eskil Vogt | 2025 | Original |
| Sinners | Ryan Coogler | 2025 | Original |

## Project Structure

```
├── data/
│   ├── screenplays/                    # PDF screenplay files
│   ├── screenplays.html                # Metadata table (hand-built)
│   ├── screenplays.json                # Metadata (hand-built)
│   └── script_structure_dataset.csv    # Hand-coded story beat sheet
├── Project_2_Approach.Rmd              # Approach document
├── Project_2.qmd                       # Final analysis (Quarto)
└── README.md
```

## Datasets

**Metadata**: title, writers, year, genre, runtime, page count, Rotten Tomatoes score, awards, and original vs. adapted. Built by hand in both HTML and JSON, then loaded and compared in R.

**Structural features**: scene count, scene density, dialogue ratio, action line ratio, unique character count, and unique location count. Extracted programmatically from PDF screenplays using `pdftools` and `stringr`.

**Story beat sheet**: six structural moments coded by hand for each screenplay: inciting incident, Act 1 climax, midpoint, Act 2 climax, climax, and resolution. All positions are recorded as page numbers and converted to percentages of total script length for cross-film comparison.

## Tools

- **R**: data loading, cleaning, extraction, and analysis
- **ggplot2**: visualizations
- **Tableau**: dashboard and exploratory visuals
- **Quarto**: final report, published to RPubs

## Packages

```r
pdftools    # PDF text extraction
stringr     # regex and string parsing
tidytext    # sentiment and lexical analysis
rvest       # loading HTML data
jsonlite    # loading JSON data
tidyr       # reshaping data
dplyr       # data manipulation
ggplot2     # visualization
ggrepel     # non-overlapping labels
```

## Notes

Four screenplays (Boogie Nights, Portrait of a Lady on Fire, Ikiru, and Thelma & Louise) had OCR or formatting issues that made automated scene counting unreliable. Those values are recorded as `NA` and excluded from scene-specific comparisons.

Ikiru is also written in a Japanese screenplay tradition that doesn't map cleanly onto Hollywood three-act structure, which is noted wherever it affects the analysis.
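As a rough illustration of the automated scene counting mentioned in these notes, the sketch below counts scene headings in text that has already been extracted from a PDF. It is an assumption about how such a step could work, not the project's code: the sample lines are invented, the `INT.`/`EXT.` pattern relies on the common screenplay slugline convention, and base R regex functions stand in for `stringr` so the example has no package dependencies.

```r
# Stand-in for text already extracted from one screenplay page.
# These lines are invented for illustration.
page_text <- c(
  "INT. KITCHEN - NIGHT",
  "",
  "Anna sets two cups on the table.",
  "",
  "                    ANNA",
  "          You're late.",
  "",
  "EXT. STREET - DAY",
  "",
  "Rain. A bus pulls away."
)

# Scene headings (sluglines) conventionally open with INT. or EXT.
heading_pattern <- "^\\s*(INT|EXT)\\."
is_heading <- grepl(heading_pattern, page_text)

# A scanned PDF with no text layer yields empty strings; record NA
# in that case instead of a misleading zero.
scene_count <- if (!any(nzchar(page_text))) NA_integer_ else sum(is_heading)

# Unique locations: drop the INT./EXT. prefix and the time-of-day suffix.
locations <- sub("^\\s*(INT|EXT)\\.\\s*", "", page_text[is_heading])
locations <- sub("\\s+-\\s+.*$", "", locations)
unique_locations <- unique(locations)

print(scene_count)
print(unique_locations)
```

A pattern like this depends on clean, consistently formatted headings, which is why OCR or formatting problems of the kind noted above can make the resulting counts unreliable.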
GCC Energy Consumption Analysis
This project examines energy production and consumption trends in Gulf Cooperation Council (GCC) countries. The analysis explores changes in energy sources and the growth of energy demand over time, highlighting regional energy patterns.

Tools used:

- R
- Quarto
- Data visualization
Food Affordability
# Food Affordability Analysis

This project analyzes the cost of maintaining a healthy diet across countries using publicly available global data. The analysis explores how the affordability of a healthy diet varies across regions and over time. Visualizations highlight differences between countries and help illustrate broader global patterns in food access.

Tools used:

- R
- Quarto
- Data visualization
Workout Data
# Workout Data Analysis

This project analyzes workout and fitness metrics to explore relationships between exercise patterns and health indicators. The analysis investigates how workout duration and intensity relate to metrics such as heart rate and overall fitness trends.

Tools used:

- R
- R Markdown
- Exploratory data analysis