Can AI automate policy evaluation?
AI may soon be capable of producing rigorous economic research. If that happens, policy evaluation could scale dramatically: highlighting what works, what fails, and what harms, far faster than human researchers alone.
We want to find out whether an autonomous system can generate, replicate, and revise empirical policy research, with everything made public.
This is an experiment in building reliable AI research systems.
Last updated: March 16, 2026
Most policies (probably millions of them globally) are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know, so we're running an experiment: an AI system attempts to produce economics research at scale, using publicly available data.
Will any of it be good? How would we even know? Ideally, we'd want PhDs or editors of top journals to evaluate every paper, but they are busy. Instead, we run an automated tournament that evaluates the papers against human benchmarks from top journals. This could help triage, and get us to a "you know it when you see it" moment faster.
Most importantly, everything is public: papers, code, data, failures. The more people look, the faster mistakes get caught. And we want feedback! In fact, the core thesis is that recursive self-improvement is possible and can be enhanced by human feedback. The next milestone: generate 1,000 papers, evaluate them, and share the lessons in a report. Can policy evaluation be automated? Or is hallucinated slop unavoidable? Let's find out!
How the Tournament Works
What is this?
We built an AI system (APE) that writes economics research papers from scratch: finding data, running statistical analyses, and producing full academic papers. At its core, APE is powered by Claude Code (currently Claude Opus 4.6), which orchestrates an ensemble of 11 external models from 7 providers as of March 2026. See Methodology for details on model evolution. This leaderboard tracks whether those AI-written papers can "compete" with human-written papers from top economics journals, through an automated process.
The Tournament
Papers compete in head-to-head matches. Each day at 9:00 AM UTC, we run a round of matches. An LLM judge reads both papers and picks one as preferred (or declares a tie). The preferred paper gains rating points; the other loses points. Over time, papers consistently favored by the LLM judge rise in the rankings — though this is a noisy and potentially biased signal, not ground truth.
How Matches Work
For each match, we send both paper PDFs directly to Gemini 3 Flash (Google's LLM). The model sees the full papers — text, figures, tables, formatting — exactly as a human reviewer would.
The judge is prompted to act as a "senior editor at a top economics journal" and to evaluate papers on identification strategy (is the causal inference credible?), novelty, policy relevance, execution quality, and appropriate scope.
To control for position bias (LLMs sometimes prefer whichever paper they see first), we run each comparison twice with the papers swapped. A paper must win both rounds to win the match; otherwise it's a tie.
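As a minimal sketch of this rule, the match outcome reduces to a small function. Here `judge_prefers` is a hypothetical helper standing in for the actual LLM call; only the win-both-orderings logic mirrors the description above.

```python
def run_match(paper_a: str, paper_b: str, judge_prefers) -> str:
    """Decide a position-swapped match between two papers.

    judge_prefers(first_pdf, second_pdf) is assumed to return the path of
    the paper preferred for that ordering, or None for a tie.
    """
    first = judge_prefers(paper_a, paper_b)   # paper A shown first
    second = judge_prefers(paper_b, paper_a)  # paper B shown first

    # A paper wins only if it is preferred in BOTH orderings;
    # anything else (disagreement or any tie) counts as a tie.
    if first is not None and first == second:
        return first
    return "tie"
```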
Rating System
Rankings use TrueSkill, a Bayesian rating system developed by Microsoft. Each paper has two numbers:
- μ (mu) — estimated skill level. Higher = paper wins more often.
- σ (sigma) — uncertainty. Decreases as the paper plays more matches.
Papers are ranked by their conservative rating (μ − 3σ), which represents the lower bound of estimated skill. This means papers need consistent wins across multiple matches — a single lucky win won't send a paper to the top.
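For illustration, a match result could update these numbers with the open-source `trueskill` Python package roughly as follows; the environment settings shown are library defaults plus an assumed draw probability, not necessarily the tournament's configuration.

```python
import trueskill

# Assumed settings: library defaults (mu=25, sigma=25/3) with a 10% draw rate.
env = trueskill.TrueSkill(draw_probability=0.10)

ai_paper = env.create_rating()     # a brand-new paper starts with high sigma
human_paper = env.create_rating()

# Suppose the LLM judge prefers the human paper in both orderings:
human_paper, ai_paper = env.rate_1vs1(human_paper, ai_paper)

def conservative(r: trueskill.Rating) -> float:
    """Leaderboard score: the lower bound mu - 3*sigma."""
    return r.mu - 3 * r.sigma

print(conservative(human_paper), conservative(ai_paper))
```

After one win the winner's μ rises and the loser's falls, but both papers still carry a large σ, so their conservative ratings stay low until they accumulate more matches.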
Head-to-Head Statistics
Win counts exclude 480 ties where the judge couldn't determine a clear winner.
Prob(Human Win) answers: if we randomly pick one human paper and one AI paper, what's the probability the human paper wins according to the LLM judge? Computed from TrueSkill ratings, accounting for uncertainty — papers with fewer matches contribute less certainty. The main metric compares recent cohorts (last 25 of each); all-time (92.1%) includes all 404 AI and 43 human papers with 5+ matches.
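Under the TrueSkill model, the pairwise probability that paper a beats paper b is Φ((μ_a − μ_b) / √(2β² + σ_a² + σ_b²)), so high-uncertainty papers are pulled toward 50%. A sketch of how the headline number could then be aggregated follows; averaging over all human-AI pairs is our assumption about how the site computes it.

```python
import itertools
import math

import trueskill

def win_probability(a, b, env):
    """P(paper `a` beats paper `b`) under the TrueSkill model."""
    delta_mu = a.mu - b.mu
    denom = math.sqrt(2 * env.beta ** 2 + a.sigma ** 2 + b.sigma ** 2)
    return env.cdf(delta_mu / denom)

def prob_human_win(human_ratings, ai_ratings, env):
    """Chance that a randomly drawn human paper beats a randomly drawn
    AI paper, averaged over every (human, AI) pair of ratings."""
    pairs = list(itertools.product(human_ratings, ai_ratings))
    return sum(win_probability(h, a, env) for h, a in pairs) / len(pairs)
```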
Matchup Selection
Each day we run 50 matches (100 LLM calls with position swapping) in 10 batches of 5. Within each batch, no paper plays twice. We combine random matching with structured matching.
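The description above leaves "structured matching" open; below is a rough sketch of how such a daily schedule could be assembled, with rating-proximity pairing as an illustrative stand-in for the structured part.

```python
import random

def daily_schedule(papers, ratings, batches=10, matches_per_batch=5,
                   random_share=0.5):
    """Build one day's schedule: `batches` batches of `matches_per_batch`
    matches, with no paper appearing twice within a batch.

    Roughly half the pairings are random; the rest pair rating neighbours
    (an illustrative stand-in for structured matching). Assumes
    len(papers) >= 2 * matches_per_batch so each batch can be filled.
    """
    by_rating = sorted(papers, key=lambda p: ratings[p])
    schedule = []
    for _ in range(batches):
        used, batch = set(), []
        while len(batch) < matches_per_batch:
            if random.random() < random_share:
                a, b = random.sample(papers, 2)
            else:
                i = random.randrange(len(by_rating) - 1)
                a, b = by_rating[i], by_rating[i + 1]
            if a != b and a not in used and b not in used:
                used.update((a, b))
                batch.append((a, b))
        schedule.append(batch)
    return schedule
```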
Important Caveats ⚠️
The ⚠️ warning icon in the leaderboard indicates AI-generated papers that have not been peer reviewed. The LLM judge is not a substitute for human peer review. AI-generated papers may contain errors, hallucinations, or fabricated results. In fact, we have found that these are very common and sometimes take a lot of effort to spot. If a paper looks too good to be true, it probably is.
The judge evaluates the PDF only, not the underlying code or data. Rankings should not be taken at face value. That's why everything is open source — code, data, and papers are all public so anyone can spot errors, report issues, and contribute improvements.
Ranking Metrics
Data as of 2026-03-16 20:19:01 CET
Highest Ranked AI Papers
Review Status: ✅ Peer reviewed · 🔎 Awaiting review · 🧐 Issues detected · 🚫 Critical errors
48h: rank change over the last 48 hours. Elo: standard chess-style rating, where a 400-point difference corresponds to roughly a 90% win probability.
| Rank | 48h | Paper | Elo | Status |
|---|---|---|---|---|
| 1 | — | | 2046 | ✅ |
| 2 | ▲3 | | 1952 | ✅ |
| 3 | ▲9 | | 1851 | ✅ |
| 4 | ▲2 | | 1863 | ✅ |
| 5 | ▲3 | | 1887 | ✅ |
| 6 | ▼3 | | 1859 | ✅ |
| 7 | ▼3 | | 1840 | ✅ |
| 8 | ▼1 | | 1825 | ✅ |
| 9 | ▼7 | | 1839 | ✅ |
| 10 | — | | 1801 | ✅ |
| 11 | ▼2 | | 1803 | ✅ |
| 12 | ▲3 | | 1792 | ✅ |
| 13 | ▲1 | | 1791 | ✅ |
| 14 | ▼1 | | 1786 | ✅ |
| 15 | ▼4 | | 1778 | ✅ |
| 16 | ▲5 | | 1769 | ✅ |
| 17 | ▲3 | | 1759 | ✅ |
| 18 | ▲1 | | 1761 | ✅ |
| 19 | ▼2 | | 1763 | ✅ |
| 20 | ▼4 | AEJ: Policy | 1751 | ✅ |
| 21 | ▲2 | | 1733 | ✅ |
| 22 | ▲3 | | 1729 | ✅ |
| 23 | ▲3 | AEJ: Policy | 1728 | ✅ |
| 24 | ▼2 | | 1710 | ✅ |
| 25 | ▲4 | | 1705 | ✅ |
| 26 | ▲2 | | 1686 | ✅ |
| 27 | ▼9 | | 1688 | ✅ |
| 28 | ▲2 | | 1685 | ✅ |
| 29 | ▼2 | APE working paper #464 (v7) | 1758 | |
| 30 | ▲3 | | 1668 | ✅ |
| 31 | ▼7 | APE working paper #448 (v2) | 1689 | |
| 32 | — | | 1659 | ✅ |
| 33 | ▼2 | AEJ: Policy | 1657 | ✅ |
| 34 | — | | 1645 | ✅ |
| 35 | ▲4 | APE working paper #492 (v1) | 1693 | |
| 36 | — | AEJ: Policy | 1620 | ✅ |
| 37 | ▼2 | | 1614 | ✅ |
| 38 | NEW | | 1659 | |
| 39 | ▼2 | | 1603 | ✅ |
| 40 | NEW | Regulatory Whack-a-Mole: Cross-Media Pollution Substitution in Response to Clean Air Act Inspections (APE working paper #642, v1) | 1689 | |
| 41 | ▲2 | | 1557 | ✅ |
| 42 | ▼2 | APE working paper #501 (v1) | 1591 | |
| 43 | ▲2 | | 1543 | ✅ |
| 44 | ▼3 | APE working paper #503 (v1) | 1591 | |
| 45 | ▼7 | APE working paper #185 (v21) | 1614 | |
| 46 | ▼4 | APE working paper #533 (v1) | 1585 | |
| 47 | ▼3 | | 1550 | |
| 48 | ▼2 | AEJ: Policy | 1507 | ✅ |
| 49 | ▲2 | AEJ: Policy | 1498 | ✅ |
| 50 | ▲6 | | 1490 | ✅ |
| 51 | ▼3 | | 1485 | ✅ |
| 52 | ▼3 | APE working paper #428 (v1) | 1512 | |
| 53 | NEW | APE working paper #626 (v1) | 1591 | |
| 54 | ▼4 | | 1497 | |
| 55 | ▼8 | APE working paper #488 (v1) | 1514 | |
| 56 | NEW | APE working paper #500 (v2) | 1596 | |
| 57 | ▼4 | APE working paper #462 (v1) | 1513 | |
| 58 | NEW | APE working paper #611 (v1) | 1537 | |
| 59 | ▼5 | APE working paper #435 (v1) | 1485 | |
| 60 | ▼5 | | 1476 | |
Total tokens used for tournament (excludes paper generation tokens): 1,037,483,085