Most policies, probably millions of them globally, are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know, so we're running an experiment. An AI system attempts to produce economics research at scale using publicly available data. We aim for at least 1,000 papers before the end of 2026. How do we sort the good from the bad? An automated tournament measures them against human benchmarks from top journals to triage them for human oversight. Most importantly, everything is public: papers, code, data, failures.
54 AI papers (+10 this week) · 43 human · 2,300 matches
Models: Claude Opus 4.5 (Generation) · GPT-5.2 (Review) · Gemini 3 Flash (Review, Judge)
How the Tournament Works
What is this?
We built an AI system (APE) that writes economics research papers from scratch: finding data, running statistical analysis, and producing full academic papers. At its core, it is currently powered by Claude Code, with input from GPT-5.2 via API calls for additional diversity and entropy. This leaderboard tracks whether those AI-written papers can "compete" with human-written papers from top economics journals through an automated process.
The Tournament
Papers compete in head-to-head matches. Each day at 9:00 AM UTC, we run a round of matches. An LLM judge reads both papers and picks one as preferred (or declares a tie). The preferred paper gains rating points; the other loses points. Over time, papers consistently favored by the LLM judge rise in the rankings — though this reflects LLM preferences, not necessarily true quality.
How Matches Work
For each match, we send both paper PDFs directly to Gemini 3 Flash (Google's LLM). The model sees the full papers — text, figures, tables, formatting — exactly as a human reviewer would.
The judge is prompted to act as a "senior editor at a top economics journal" and evaluate papers on: identification strategy (is the causal inference credible?), novelty, policy relevance, execution quality, and appropriate scope.
To control for position bias (LLMs sometimes prefer whichever paper they see first), we run each comparison twice with the papers swapped. A paper must win both rounds to win the match; otherwise it's a tie. All matches can be found here.
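The two-round rule above can be sketched as a small function. This is a minimal illustration, not the project's actual code: the `Verdict` enum and the `match_outcome` name are hypothetical, and the sketch assumes the judge reports which position (first or second) it preferred in each round.

```python
from enum import Enum


class Verdict(Enum):
    """Judge's answer for one round, expressed by position (hypothetical encoding)."""
    FIRST = "first"
    SECOND = "second"
    TIE = "tie"


def match_outcome(round_ab: Verdict, round_ba: Verdict) -> str:
    """Combine two position-swapped rounds into one match result.

    round_ab: verdict when paper A is shown first and paper B second.
    round_ba: verdict when paper B is shown first and paper A second.
    Returns "A", "B", or "tie".
    """
    a_won_both = round_ab is Verdict.FIRST and round_ba is Verdict.SECOND
    b_won_both = round_ab is Verdict.SECOND and round_ba is Verdict.FIRST
    if a_won_both:
        return "A"
    if b_won_both:
        return "B"
    # Split decisions and explicit ties both count as a tie.
    return "tie"
```

A judge that always prefers whichever paper it sees first returns `FIRST` in both rounds, which this rule scores as a tie rather than a win for either paper.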
Rating System
Rankings use TrueSkill, a Bayesian rating system developed by Microsoft for Xbox matchmaking. Each paper has two numbers:
- μ (mu) — estimated skill level. Higher = paper wins more often.
- σ (sigma) — uncertainty. Decreases as the paper plays more matches.
Papers are ranked by their conservative rating (μ − 3σ), which represents the lower bound of estimated skill. This means papers need consistent wins across multiple matches — a single lucky win won't send a paper to the top.
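To make the conservative rating concrete, here is a minimal sketch of ranking by μ − 3σ. The `Rating` class, the paper names, and the numbers are hypothetical, chosen on TrueSkill's default scale rather than taken from the leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    name: str
    mu: float     # estimated skill
    sigma: float  # uncertainty about that estimate

    @property
    def conservative(self) -> float:
        # Lower bound used for ranking: mu minus three sigmas.
        return self.mu - 3.0 * self.sigma


def rank(papers):
    """Sort papers from highest to lowest conservative rating."""
    return sorted(papers, key=lambda p: p.conservative, reverse=True)


# Hypothetical example: a well-tested paper versus a newcomer with one big win.
veteran = Rating("veteran", mu=28.0, sigma=1.5)    # conservative = 23.5
newcomer = Rating("newcomer", mu=32.0, sigma=6.0)  # conservative = 14.0
leaderboard = rank([newcomer, veteran])            # veteran ranks first
```

The newcomer has the higher μ, but its large σ keeps it below the veteran until more matches shrink the uncertainty.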
Head-to-Head Statistics
Win counts exclude 12 ties where the judge couldn't determine a clear winner.
Prob(Human Win) answers: if we randomly pick one human paper and one AI paper, what's the probability the human paper wins according to the LLM judge? It is computed from TrueSkill ratings and accounts for uncertainty: papers with fewer matches contribute less certainty. The main metric compares recent cohorts (the last 25 papers of each type); the all-time figure (94.3%) includes all 54 AI and 43 human papers with 5+ matches.
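One common way to turn TrueSkill ratings into a pairwise win probability is the normal-CDF formula below, averaged over every human/AI pair. This is a sketch under assumptions, not necessarily the computation used here: the `BETA` value is TrueSkill's default performance noise, the function names are hypothetical, and draws are ignored.

```python
import math
from itertools import product

# Assumption: TrueSkill's default performance noise (25 / 6) on its default scale.
BETA = 25.0 / 6.0


def win_probability(mu_a, sigma_a, mu_b, sigma_b, beta=BETA):
    """P(a beats b) = Phi((mu_a - mu_b) / sqrt(2*beta^2 + sigma_a^2 + sigma_b^2))."""
    denom = math.sqrt(2.0 * beta**2 + sigma_a**2 + sigma_b**2)
    z = (mu_a - mu_b) / denom
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_human_win(humans, ais, beta=BETA):
    """Average win probability over all (human, AI) pairs.

    humans, ais: non-empty lists of (mu, sigma) tuples.
    """
    pairs = list(product(humans, ais))
    total = sum(win_probability(h[0], h[1], a[0], a[1], beta) for h, a in pairs)
    return total / len(pairs)
```

Because each paper's σ enters the denominator, a pair involving a paper with few matches is pulled toward 0.5, which is one way uncertainty can be accounted for.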
Matchup Selection
Each day we run 50 matches (100 LLM calls with position swapping) in 10 batches of 5. Within each batch, no paper plays twice.
80% of matches use information-gain sampling — we prioritize pairs where both papers have high rating uncertainty and similar skill levels, maximizing what we learn from each expensive LLM call. Of these, half are constrained to cross-type (one AI paper vs one human paper), ensuring we continuously compare AI and human papers rather than just ranking within each group. The remaining 20% are random for exploration and robustness.
This design is cost-efficient: LLM judge calls are expensive, so we extract maximum signal per match. Papers with low uncertainty or extreme ratings may not match every day — that's by design.
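The selection scheme can be sketched for a single batch of 5 matches: 4 chosen by information gain (2 of them forced to be AI vs human) and 1 chosen at random. This is a rough illustration only. Applying the 80/20 split within each batch is an assumption, and `pair_score` is a hypothetical stand-in for the real information-gain criterion, as are the dictionary keys `id`, `is_ai`, `mu`, and `sigma`.

```python
import random
from itertools import combinations


def pair_score(a, b):
    # Hypothetical heuristic: favor high combined uncertainty and a small skill gap.
    return (a["sigma"] + b["sigma"]) / (1.0 + abs(a["mu"] - b["mu"]))


def build_batch(papers, rng, size=5, n_info=4, n_cross=2):
    """Pick `size` matches in which no paper plays twice.

    n_info matches come from the top-scoring pairs, n_cross of them restricted
    to AI-vs-human pairs; the remaining matches are drawn at random.
    """
    used = set()   # ids of papers already scheduled in this batch
    batch = []     # (id_a, id_b, kind) tuples
    ranked = sorted(combinations(papers, 2), key=lambda p: pair_score(*p), reverse=True)

    def take(candidates, count, kind):
        taken = 0
        for a, b in candidates:
            if taken == count:
                break
            if a["id"] in used or b["id"] in used:
                continue  # enforce "no paper plays twice"
            used.update((a["id"], b["id"]))
            batch.append((a["id"], b["id"], kind))
            taken += 1

    cross = [p for p in ranked if p[0]["is_ai"] != p[1]["is_ai"]]
    take(cross, n_cross, "info-cross")
    take(ranked, n_info - n_cross, "info-any")
    shuffled = list(ranked)
    rng.shuffle(shuffled)
    take(shuffled, size - n_info, "random")
    return batch
```

Running `build_batch` ten times per day would give 50 matches, of which 40 are information-gain picks (20 cross-type) and 10 are random, matching the proportions described above.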
Important Caveats ⚠️
The ⚠️ warning icon in the leaderboard indicates AI-generated papers that have not been peer reviewed. The LLM judge is not a substitute for human peer review. AI-generated papers may contain errors, hallucinations, or fabricated results. In fact, we have found that these are very common and sometimes take a lot of effort to spot. If a paper looks too good to be true, it probably is.
The judge evaluates the PDF only, not the underlying code or data. Rankings should not be taken at face value. That's why everything is open source — code, data, and papers are all public so anyone can spot errors, report issues, and contribute improvements.
Data as of 2026-01-29 09:56:54 CET
Ranking Metrics
Highest Ranked AI Papers
| Rank | 48h change | Paper | Elo | Review |
|---|---|---|---|---|
| 1 | ▲1 | | 2100 | ✅ |
| 2 | ▼1 | | 2109 | ✅ |
| 3 | ▲1 | | 2068 | ✅ |
| 4 | ▼1 | | 1941 | ✅ |
| 5 | ▲1 | | 1901 | ✅ |
| 6 | ▲1 | | 1919 | ✅ |
| 7 | ▲5 | | 2004 | ✅ |
| 8 | ▼3 | | 1903 | ✅ |
| 9 | ▲6 | | 1810 | ✅ |
| 10 | ▲1 | | 1940 | ✅ |
| 11 | ▼1 | | 2027 | ✅ |
| 12 | ▼4 | | 1930 | ✅ |
| 13 | ▲3 | | 1845 | ✅ |
| 14 | ▲3 | | 1767 | ✅ |
| 15 | ▼2 | | 1822 | ✅ |
| 16 | ▼7 | | 1862 | ✅ |
| 17 | ▼3 | | 1845 | ✅ |
| 18 | — | | 1780 | ✅ |
| 19 | ▲2 | | 1744 | ✅ |
| 20 | ▲8 | AEJ: Policy | 1682 | ✅ |
| 21 | ▲6 | AEJ: Policy | 1681 | ✅ |
| 22 | ▼3 | | 1736 | ✅ |
| 23 | ▼3 | AEJ: Policy | 1728 | ✅ |
| 24 | ▼2 | | 1774 | ✅ |
| 25 | ▲1 | | 1697 | ✅ |
| 26 | ▼1 | | 1668 | ✅ |
| 27 | ▲2 | | 1680 | ✅ |
| 28 | ▲5 | AEJ: Policy | 1588 | ✅ |
| 29 | ▲2 | | 1624 | ✅ |
| 30 | ▲9 | | 1567 | ✅ |
| 31 | ▲3 | | 1614 | ✅ |
| 32 | ▼9 | | 1817 | ✅ |
| 33 | ▼9 | | 1601 | ✅ |
| 34 | ▼4 | | 1591 | ✅ |
| 35 | — | | 1577 | ✅ |
| 36 | — | | 1591 | ✅ |
| 37 | ▼5 | AEJ: Policy | 1589 | ✅ |
| 38 | ▲3 | | 1528 | ✅ |
| 39 | ▲4 | | 1516 | ✅ |
| 40 | ▼3 | | 1503 | ✅ |
| 41 | ▼3 | | 1544 | ✅ |
| 42 | — | AEJ: Policy | 1524 | ✅ |
| 43 | ▼3 | | 1480 | ⚠️ |
| 44 | ▲2 | APE working paper #51 | 1447 | ⚠️ |
| 45 | — | | 1363 | ⚠️ |
| 46 | ▲1 | | 1372 | ⚠️ |
| 47 | ▲1 | | 1383 | ⚠️ |
| 48 | ▲6 | APE working paper #79 | 1398 | ⚠️ |
| 49 | ▲7 | | 1349 | ✅ |
| 50 | ▲3 | | 1365 | ⚠️ |
| 51 | ▼1 | | 1325 | ⚠️ |
| 52 | ▼1 | | 1324 | ⚠️ |
| 53 | ▼1 | | 1319 | ⚠️ |
| 54 | ▼10 | APE working paper #29 | 1422 | ⚠️ |
| 55 | ▲2 | | 1327 | ⚠️ |
| 56 | ▲4 | | 1341 | ⚠️ |
| 57 | ▲2 | | 1232 | ⚠️ |
| 58 | ▼3 | | 1216 | ⚠️ |
| 59 | ▼1 | | 1411 | ⚠️ |
| 60 | ▼11 | | 1292 | ⚠️ |

✅ Forthcoming paper in a leading economics journal, peer reviewed by expert referees.
⚠️ AI papers have not been peer reviewed and may contain errors including hallucinations, manufactured data, or incorrect references.
Total tokens used for tournament (excluding generation): 96,125,097