Recently Published

Lab 3
Reproducible data collection
An exercise I have assigned to students in the past is to go to the online supplements and datasheets page for the Journal of Accounting Research, pick an issue of the journal, and evaluate whether one could reproduce the analysis in the associated paper using the materials made available there. Generally, the answer has been negative. That said, the Journal of Accounting Research still seems to be the (relative) leader among accounting-focused academic journals in requiring authors to supply materials. In this note, I show how data collection can be made more reproducible.
ggplot2 Annotations
Shared code
Open-source software dominates in certain areas. A lot of data science relies on thousands of open-source packages that are continually being improved in part because anyone can see how they work. Yet the open-source model has not taken off in academia. A lot of the publicly available code in accounting research relates to two seemingly obscure topics: Fama-French industries and winsorization. I discuss both in this note.
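For readers unfamiliar with the second topic, winsorization replaces values beyond chosen percentile cutoffs with the cutoff values themselves, rather than dropping them. The following is a minimal sketch in plain Python, not the code discussed in the note. The function names `quantile` and `winsorize` are hypothetical, the 1% and 99% defaults are an assumption, and linear interpolation is only one of several quantile conventions, so results near the cutoffs may differ slightly from other implementations.

```python
from typing import List, Sequence


def quantile(sorted_vals: Sequence[float], p: float) -> float:
    """Linearly interpolated quantile of pre-sorted values, with p in [0, 1]."""
    if not sorted_vals:
        raise ValueError("cannot take a quantile of an empty sequence")
    pos = p * (len(sorted_vals) - 1)          # fractional index into the data
    lo = int(pos)                              # index at or below the position
    hi = min(lo + 1, len(sorted_vals) - 1)     # next index, capped at the end
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def winsorize(values: Sequence[float],
              lower: float = 0.01,
              upper: float = 0.99) -> List[float]:
    """Clamp values to the [lower, upper] quantiles (assumed cutoffs)."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("need 0 <= lower <= upper <= 1")
    ordered = sorted(values)
    lo_cut = quantile(ordered, lower)
    hi_cut = quantile(ordered, upper)
    # Extreme observations are pulled in to the cutoffs, not removed,
    # so the output has the same length and order as the input.
    return [min(max(v, lo_cut), hi_cut) for v in values]


if __name__ == "__main__":
    sample = [100.0, 1.0, 2.0, 3.0, -50.0]
    print(winsorize(sample, lower=0.25, upper=0.75))
```

Because the cutoffs are computed from the data passed in, applying the function to a pooled sample and applying it year by year will generally give different results; which is appropriate depends on the research design.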
Some benchmarks with comp.g_secd
I use the WRDS data set `comp.g_secd` to do some benchmarking. A representative query takes 6 minutes using SAS on the WRDS servers, about 1 minute using the WRDS PostgreSQL server, and about 0.2 seconds using a local parquet file. The parquet file occupies less than 4 GB on my hard drive, compared with about 145 GB for the SAS file on the WRDS server. While creating the parquet file takes 45 minutes, this may be a reasonable trade-off for a researcher who analyses `comp.g_secd` frequently and does not need the very latest version of the data for research purposes.
The best of both worlds: Using modern data frame libraries to create pandas data
A number of modern data frame libraries have emerged that address weaknesses of pandas. In this note, I use polars and Ibis to show how one can use these libraries to get the data into a form in which pandas can shine.
Using SAS to create pandas data
SAS is another possible approach to manipulating data for pandas. My Python package `wrds2pg` offers a `sas_to_pandas()` function that can run code on the WRDS server and return the results as a pandas data frame. While not quite as fast as using Ibis with the PostgreSQL server, SAS performs pretty well on this task.
Workshop: Introduction to R Statistics for Insect Ecology
Welcome to the digital home of our workshop! Insect ecology is uniquely messy—zero-inflated counts, overdispersed populations, and more variables than a centipede has legs. This guide is designed to take you from "R-anxiety" to "R-competence," focusing on the specific statistical hurdles we face as entomologists.