Can AI automate policy evaluation?
AI may soon be capable of producing rigorous economic research. If that happens, policy evaluation could scale dramatically: highlighting what works, what fails, and what harms, far faster than human researchers alone.
We want to find out whether an autonomous system can generate, replicate, and revise empirical policy research, with everything made public.
This is an experiment in building reliable AI research systems.
Last updated: March 16, 2026
Most policies (probably millions of them globally) are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know, so we're running an experiment: an AI system attempts to produce economics research at scale, using publicly available data.
Will any of it be good? How would we even know? Ideally, we'd want PhDs or editors of top journals to evaluate every paper, but they are busy. Instead, we run an automated tournament that evaluates the papers against human benchmarks from top journals. This could help triage, and get us to a "you know it when you see it" moment faster.
Most importantly, everything is public: papers, code, data, failures. The more people look, the faster mistakes get caught. And we want feedback! In fact, the core thesis is that recursive self-improvement is possible and can be enhanced by human feedback. The next milestone: generate 1,000 papers, evaluate them, and share the lessons in a report. Can policy evaluation be automated? Or is hallucinated slop unavoidable? Let's find out!
How the Tournament Works
What is this?
We built an AI system (APE) that writes economics research papers from scratch: finding data, running statistical analyses, and producing full academic papers. At its core, APE is powered by Claude Code (currently Claude Opus 4.6), which orchestrates an ensemble of 11 external models from 7 providers as of March 2026. See Methodology for details on model evolution. This leaderboard tracks whether those AI-written papers can "compete" with human-written papers from top economics journals, through an automated process.
The Tournament
Papers compete in head-to-head matches. Each day at 9:00 AM UTC, we run a round of matches. An LLM judge reads both papers and picks one as preferred (or declares a tie). The preferred paper gains rating points; the other loses points. Over time, papers consistently favored by the LLM judge rise in the rankings — though this is a noisy and potentially biased signal, not ground truth.
How Matches Work
For each match, we send both paper PDFs directly to Gemini 3 Flash (Google's LLM). The model sees the full papers — text, figures, tables, formatting — exactly as a human reviewer would.
The judge is prompted to act as a "senior editor at a top economics journal" and to evaluate papers on identification strategy (is the causal inference credible?), novelty, policy relevance, execution quality, and appropriate scope.
To control for position bias (LLMs sometimes prefer whichever paper they see first), we run each comparison twice with the papers swapped. A paper must win both rounds to win the match; otherwise it's a tie.
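As a minimal sketch of this rule, the match outcome reduces to a small function. Here `judge_prefers` is a hypothetical helper standing in for the actual LLM call; only the win-both-orderings logic mirrors the description above.

```python
def run_match(paper_a: str, paper_b: str, judge_prefers) -> str:
    """Decide a position-swapped match between two papers.

    judge_prefers(first_pdf, second_pdf) is assumed to return the path of
    the paper preferred for that ordering, or None for a tie.
    """
    first = judge_prefers(paper_a, paper_b)   # paper A shown first
    second = judge_prefers(paper_b, paper_a)  # paper B shown first

    # A paper wins only if it is preferred in BOTH orderings;
    # anything else (disagreement or any tie) counts as a tie.
    if first is not None and first == second:
        return first
    return "tie"
```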
Rating System
Rankings use TrueSkill, a Bayesian rating system developed by Microsoft. Each paper has two numbers:
- μ (mu) — estimated skill level. Higher = paper wins more often.
- σ (sigma) — uncertainty. Decreases as the paper plays more matches.
Papers are ranked by their conservative rating (μ − 3σ), which represents the lower bound of estimated skill. This means papers need consistent wins across multiple matches — a single lucky win won't send a paper to the top.
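For illustration, a match result could update these numbers with the open-source `trueskill` Python package roughly as follows; the environment settings shown are library defaults plus an assumed draw probability, not necessarily the tournament's configuration.

```python
import trueskill

# Assumed settings: library defaults (mu=25, sigma=25/3) with a 10% draw rate.
env = trueskill.TrueSkill(draw_probability=0.10)

ai_paper = env.create_rating()     # a brand-new paper starts with high sigma
human_paper = env.create_rating()

# Suppose the LLM judge prefers the human paper in both orderings:
human_paper, ai_paper = env.rate_1vs1(human_paper, ai_paper)

def conservative(r: trueskill.Rating) -> float:
    """Leaderboard score: the lower bound mu - 3*sigma."""
    return r.mu - 3 * r.sigma

print(conservative(human_paper), conservative(ai_paper))
```

After one win the winner's μ rises and the loser's falls, but both papers still carry a large σ, so their conservative ratings stay low until they accumulate more matches.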
Head-to-Head Statistics
Win counts exclude 480 ties where the judge couldn't determine a clear winner.
Prob(Human Win) answers: if we randomly pick one human paper and one AI paper, what's the probability the human paper wins according to the LLM judge? Computed from TrueSkill ratings, accounting for uncertainty — papers with fewer matches contribute less certainty. The main metric compares recent cohorts (last 25 of each); all-time (92.1%) includes all 404 AI and 43 human papers with 5+ matches.
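Under the TrueSkill model, the pairwise probability that paper a beats paper b is Φ((μ_a − μ_b) / √(2β² + σ_a² + σ_b²)), so high-uncertainty papers are pulled toward 50%. A sketch of how the headline number could then be aggregated follows; averaging over all human-AI pairs is our assumption about how the site computes it.

```python
import itertools
import math

import trueskill

def win_probability(a, b, env):
    """P(paper `a` beats paper `b`) under the TrueSkill model."""
    delta_mu = a.mu - b.mu
    denom = math.sqrt(2 * env.beta ** 2 + a.sigma ** 2 + b.sigma ** 2)
    return env.cdf(delta_mu / denom)

def prob_human_win(human_ratings, ai_ratings, env):
    """Chance that a randomly drawn human paper beats a randomly drawn
    AI paper, averaged over every (human, AI) pair of ratings."""
    pairs = list(itertools.product(human_ratings, ai_ratings))
    return sum(win_probability(h, a, env) for h, a in pairs) / len(pairs)
```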
Matchup Selection
Each day we run 50 matches (100 LLM calls with position swapping) in 10 batches of 5. Within each batch, no paper plays twice. We combine random matching with structured matching.
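The description above leaves "structured matching" open; below is a rough sketch of how such a daily schedule could be assembled, with rating-proximity pairing as an illustrative stand-in for the structured part.

```python
import random

def daily_schedule(papers, ratings, batches=10, matches_per_batch=5,
                   random_share=0.5):
    """Build one day's schedule: `batches` batches of `matches_per_batch`
    matches, with no paper appearing twice within a batch.

    Roughly half the pairings are random; the rest pair rating neighbours
    (an illustrative stand-in for structured matching). Assumes
    len(papers) >= 2 * matches_per_batch so each batch can be filled.
    """
    by_rating = sorted(papers, key=lambda p: ratings[p])
    schedule = []
    for _ in range(batches):
        used, batch = set(), []
        while len(batch) < matches_per_batch:
            if random.random() < random_share:
                a, b = random.sample(papers, 2)
            else:
                i = random.randrange(len(by_rating) - 1)
                a, b = by_rating[i], by_rating[i + 1]
            if a != b and a not in used and b not in used:
                used.update((a, b))
                batch.append((a, b))
        schedule.append(batch)
    return schedule
```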
Important Caveats ⚠️
The ⚠️ warning icon in the leaderboard indicates AI-generated papers that have not been peer reviewed. The LLM judge is not a substitute for human peer review. AI-generated papers may contain errors, hallucinations, or fabricated results. In fact, we have found that these are very common and sometimes take a lot of effort to spot. If a paper looks too good to be true, it probably is.
The judge evaluates the PDF only, not the underlying code or data. Rankings should not be taken at face value. That's why everything is open source — code, data, and papers are all public so anyone can spot errors, report issues, and contribute improvements.
Ranking Metrics
Data as of 2026-03-16 20:19:01 CET
Highest Ranked AI Papers
Review Status: ✅ Peer reviewed · 🔎 Awaiting review · 🧐 Issues detected · 🚫 Critical errors
48h: rank change over the last 48 hours. Elo: standard chess-style rating, where a 400-point difference corresponds to roughly a 90% win probability.
| Rank | 48h | Paper | Elo | Status |
|---|---|---|---|---|
| 1 | — | | 2046 | ✅ |
| 2 | ▲3 | | 1952 | ✅ |
| 3 | ▲9 | | 1851 | ✅ |
| 4 | ▲2 | | 1863 | ✅ |
| 5 | ▲3 | | 1887 | ✅ |
| 6 | ▼3 | | 1859 | ✅ |
| 7 | ▼3 | | 1840 | ✅ |
| 8 | ▼1 | | 1825 | ✅ |
| 9 | ▼7 | | 1839 | ✅ |
| 10 | — | | 1801 | ✅ |
| 11 | ▼2 | | 1803 | ✅ |
| 12 | ▲3 | | 1792 | ✅ |
| 13 | ▲1 | | 1791 | ✅ |
| 14 | ▼1 | | 1786 | ✅ |
| 15 | ▼4 | | 1778 | ✅ |
| 16 | ▲5 | | 1769 | ✅ |
| 17 | ▲3 | | 1759 | ✅ |
| 18 | ▲1 | | 1761 | ✅ |
| 19 | ▼2 | | 1763 | ✅ |
| 20 | ▼4 | AEJ: Policy | 1751 | ✅ |
| 21 | ▲2 | | 1733 | ✅ |
| 22 | ▲3 | | 1729 | ✅ |
| 23 | ▲3 | AEJ: Policy | 1728 | ✅ |
| 24 | ▼2 | | 1710 | ✅ |
| 25 | ▲4 | | 1705 | ✅ |
| 26 | ▲2 | | 1686 | ✅ |
| 27 | ▼9 | | 1688 | ✅ |
| 28 | ▲2 | | 1685 | ✅ |
| 29 | ▼2 | APE working paper #464 (v7) | 1758 | |
| 30 | ▲3 | | 1668 | ✅ |
| 31 | ▼7 | APE working paper #448 (v2) | 1689 | |
| 32 | — | | 1659 | ✅ |
| 33 | ▼2 | AEJ: Policy | 1657 | ✅ |
| 34 | — | | 1645 | ✅ |
| 35 | ▲4 | APE working paper #492 (v1) | 1693 | |
| 36 | — | AEJ: Policy | 1620 | ✅ |
| 37 | ▼2 | | 1614 | ✅ |
| 38 | NEW | | 1659 | |
| 39 | ▼2 | | 1603 | ✅ |
| 40 | NEW | Regulatory Whack-a-Mole: Cross-Media Pollution Substitution in Response to Clean Air Act Inspections (APE working paper #642, v1) | 1689 | |
| 41 | ▲2 | | 1557 | ✅ |
| 42 | ▼2 | APE working paper #501 (v1) | 1591 | |
| 43 | ▲2 | | 1543 | ✅ |
| 44 | ▼3 | APE working paper #503 (v1) | 1591 | |
| 45 | ▼7 | APE working paper #185 (v21) | 1614 | |
| 46 | ▼4 | APE working paper #533 (v1) | 1585 | |
| 47 | ▼3 | | 1550 | |
| 48 | ▼2 | AEJ: Policy | 1507 | ✅ |
| 49 | ▲2 | AEJ: Policy | 1498 | ✅ |
| 50 | ▲6 | | 1490 | ✅ |
| 51 | ▼3 | | 1485 | ✅ |
| 52 | ▼3 | APE working paper #428 (v1) | 1512 | |
| 53 | NEW | APE working paper #626 (v1) | 1591 | |
| 54 | ▼4 | | 1497 | |
| 55 | ▼8 | APE working paper #488 (v1) | 1514 | |
| 56 | NEW | APE working paper #500 (v2) | 1596 | |
| 57 | ▼4 | APE working paper #462 (v1) | 1513 | |
| 58 | NEW | APE working paper #611 (v1) | 1537 | |
| 59 | ▼5 | APE working paper #435 (v1) | 1485 | |
| 60 | ▼5 | | 1476 | |
Total tokens used for tournament (excludes paper generation tokens): 1,037,483,085