Judge Calibration
Testing the LLM referee on papers accepted by top economics journals
How Calibration Works
Why Calibrate?
When an AI-generated paper gets rejected or "accepted" by our system, does that decision mean anything, or is it noise? Calibration answers this question. We run working-paper versions of research that humans have already validated (papers accepted at AER and AEJ:Policy) through the exact same LLM review process. If the judge correctly recognizes these as publication-quality, its evaluations of AI papers carry weight. If it can't distinguish validated human work from noise, we can't trust its judgments of AI papers either.
Method
- Same prompt: Identical referee prompt used for AI papers — no special treatment
- 10 reviews per paper: Measures both typical judgment and variance (see the sketch after this list)
- 5 decisions: Accept, Cond. Accept, and Minor Rev (counted as accepting) vs. Major Rev and Reject
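Concretely, the review loop can be sketched as below. This is a minimal illustration, not the project's actual code: run_referee is a hypothetical callable standing in for a call to the model (the shared referee prompt plus the paper text), and the decision labels mirror the five categories above.

```python
from collections import Counter
from typing import Callable

# The five decision labels used by the referee prompt.
DECISIONS = ["Accept", "Cond. Accept", "Minor Rev", "Major Rev", "Reject"]

def calibrate_paper(
    paper_text: str,
    run_referee: Callable[[str], str],  # hypothetical: referee prompt + paper -> decision label
    n_reviews: int = 10,
) -> dict[str, int]:
    """Run the identical referee prompt n_reviews times and tally the decisions."""
    counts = Counter({d: 0 for d in DECISIONS})
    for _ in range(n_reviews):
        decision = run_referee(paper_text)
        if decision not in counts:
            raise ValueError(f"Unexpected decision label: {decision!r}")
        counts[decision] += 1
    return dict(counts)
```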
Good Calibration
For papers that top journals accepted, we expect a high acceptance rate (Accept / Cond. Accept / Minor Rev) and few rejections. Because reviews are sampled at temperature=0.5, the variance across the 10 reviews also provides a signal of how confident the judge is in its decision.
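One way these metrics could be computed from a paper's tally, assuming the tally format from the sketch above: the accepting group follows the split described under Method, while the entropy-based spread measure is an illustrative choice rather than the dashboard's actual metric.

```python
import math

# Decisions grouped as "accepting" for calibration purposes.
ACCEPTING = {"Accept", "Cond. Accept", "Minor Rev"}

def summarize(counts: dict[str, int]) -> dict[str, float]:
    """Turn one paper's decision tally into acceptance-rate and spread metrics."""
    total = sum(counts.values())
    accept_rate = sum(n for d, n in counts.items() if d in ACCEPTING) / total
    reject_rate = counts.get("Reject", 0) / total
    # Normalized entropy of the decision distribution as a simple spread measure:
    # 0.0 means all reviews agreed, 1.0 means decisions spread evenly across categories.
    probs = [n / total for n in counts.values() if n > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    spread = entropy / math.log(len(counts)) if len(counts) > 1 else 0.0
    return {"accept_rate": accept_rate, "reject_rate": reject_rate, "spread": spread}
```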
Caveats
- Working papers, not final published versions (minor issues may have been fixed before publication)
- Single model (Gemini 3 Flash) — other models may have different biases
- Results depend on the specific prompt version
Latest Results
Overall Decision Distribution
Per-Paper Results
Click any paper to see its full decision distribution. Each paper was reviewed 10 times to measure consistency.
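For illustration, per-paper tallies in the format above could be pooled into the overall decision distribution and summarized per paper as sketched here; the function names and the modal-decision consistency measure are assumptions for this example, not the dashboard's actual implementation.

```python
from collections import Counter

def overall_distribution(per_paper: dict[str, dict[str, int]]) -> dict[str, int]:
    """Pool all reviews across papers into the overall decision distribution."""
    pooled = Counter()
    for counts in per_paper.values():
        pooled.update(counts)
    return dict(pooled)

def modal_decision(counts: dict[str, int]) -> tuple[str, float]:
    """Return a paper's most common decision and the share of its 10 reviews that agreed."""
    decision, n = max(counts.items(), key=lambda kv: kv[1])
    return decision, n / sum(counts.values())
```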