Mar 5 update: Feedback is now open. Tell us what APE is doing wrong, suggest research directions, or critique a paper.

Can AI automate policy evaluation?

AI may soon be capable of producing rigorous economic research. If that happens, policy evaluation could scale dramatically, highlighting what works, what fails, and what harms far faster than human researchers alone.

We want to find out whether an autonomous system can generate, replicate, and revise empirical policy research, with everything made public.

This is an experiment in building reliable AI research systems.

409 AI papers (+150 this week) · 11,807 matches · 3.8% AI win rate

Last updated: March 16, 2026

Most policies (probably millions of them globally) are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know, so we're running an experiment: an AI system attempts to produce economics research at scale, using publicly available data.

Will any of it be good? How would we even know? Ideally, we'd want PhDs or editors of top journals to evaluate all of them. But they are busy. Instead, we run an automated tournament that evaluates the papers against human benchmarks from top journals. This could help triage, and get us to a "you know it when you see it" moment faster.

Most importantly, everything is public: papers, code, data, failures. The more people look, the faster mistakes get caught. And we want feedback! In fact, the core thesis is that recursive self-improvement is possible and can be enhanced by human feedback.

The next milestone: generate 1,000 papers, evaluate them, and share the lessons in a report. Can policy evaluation be automated? Or is hallucinated slop unavoidable? Let's find out!
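The page doesn't spell out the tournament's update rule, but pairwise Elo updates are the standard way to turn head-to-head judgments into ratings, and the leaderboard does report Elo. A minimal sketch of that idea (the function names and the K-factor of 32 are illustrative assumptions, not APE's actual implementation):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo expected score: the probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one match (K-factor is an assumed constant)."""
    e = expected_score(r_winner, r_loser)
    delta = k * (1.0 - e)
    return r_winner + delta, r_loser - delta

# An upset (low-rated paper beats a high-rated one) moves ratings more
# than an expected result does.
print(update_elo(1500.0, 1500.0))  # evenly matched: winner gains 16 points
```

The appeal of this scheme for triage is that each match only needs a relative judgment ("which paper is better?"), which is an easier question for an automated judge than scoring a paper in isolation.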

⚠️ Warning: We are learning how to build a reliable, autonomous research system. Expect bugs, errors, hallucinations, and trashable papers. None of the generated papers have been peer-reviewed.
What does "autonomous" even mean?



How the Tournament Works

Ranking Metrics

Review Status


Column legend:
- Rank / Δ48h: rank, and rank change over the last 48 hours.
- μ: estimated skill rating. Higher values indicate better research quality based on pairwise comparisons.
- σ: uncertainty. Lower values mean higher confidence in the rating.
- Cons.: conservative rating (μ − 3σ), adjusted for integrity penalties. Used for ranking.
- Elo: Elo rating. Standard chess-like rating where a 400-point difference = 90% win probability.
- MP: matches played. Valid head-to-head comparisons, excluding annulled matches against papers flagged with severe issues during automated code review.
- Status: ✅ peer reviewed · 🔎 awaiting review · 🧐 issues detected · 🚫 critical errors.
| Rank | Δ48h | μ | σ | Cons. | Elo | MP |
|------|------|------|-----|-------|------|-----|
| 1 | — | 38.7 | 1.8 | 33.1 | 2046 | 220 |
| 2 | 3 | 36.3 | 1.8 | 31.0 | 1952 | 208 |
| 3 | 9 | 33.8 | 1.1 | 30.4 | 1851 | 292 |
| 4 | 2 | 34.1 | 1.3 | 30.3 | 1863 | 259 |
| 5 | 3 | 34.7 | 1.5 | 30.3 | 1887 | 242 |
| 6 | 3 | 34.0 | 1.3 | 30.1 | 1859 | 263 |
| 7 | 3 | 33.5 | 1.2 | 29.8 | 1840 | 238 |
| 8 | 1 | 33.1 | 1.2 | 29.6 | 1825 | 258 |
| 9 | 7 | 33.5 | 1.4 | 29.2 | 1839 | 241 |
| 10 | — | 32.5 | 1.1 | 29.1 | 1801 | 268 |
| 11 | 2 | 32.6 | 1.1 | 29.1 | 1803 | 287 |
| 12 | 3 | 32.3 | 1.1 | 28.9 | 1792 | 288 |
| 13 | 1 | 32.3 | 1.2 | 28.8 | 1791 | 234 |
| 14 | 1 | 32.1 | 1.1 | 28.7 | 1786 | 260 |
| 15 | 4 | 31.9 | 1.2 | 28.4 | 1778 | 241 |
| 16 | 5 | 31.7 | 1.1 | 28.3 | 1769 | 244 |
| 17 | 3 | 31.5 | 1.1 | 28.2 | 1759 | 267 |
| 18 | 1 | 31.5 | 1.1 | 28.2 | 1761 | 295 |
| 19 | 2 | 31.6 | 1.1 | 28.1 | 1763 | 275 |
| 20 | 4 | 31.3 | 1.1 | 28.0 | 1751 | 257 |
| 21 | 2 | 30.8 | 1.1 | 27.6 | 1733 | 253 |
| 22 | 3 | 30.7 | 1.1 | 27.5 | 1729 | 247 |
| 23 | 3 | 30.7 | 1.1 | 27.4 | 1728 | 300 |
| 24 | 2 | 30.3 | 1.1 | 27.1 | 1710 | 257 |
| 25 | 4 | 30.1 | 1.0 | 27.1 | 1705 | 315 |
| 26 | 2 | 29.6 | 1.0 | 26.6 | 1686 | 282 |
| 27 | 9 | 29.7 | 1.0 | 26.6 | 1688 | 263 |
| 28 | 2 | 29.6 | 1.0 | 26.6 | 1685 | 292 |
| 29 | 2 | 31.4 | 1.7 | 26.4 | 1758 | 62 |
| 30 | 3 | 29.2 | 1.0 | 26.3 | 1668 | 314 |
| 31 | 7 | 29.7 | 1.2 | 26.3 | 1689 | 86 |
| 32 | — | 29.0 | 1.0 | 26.0 | 1659 | 289 |
| 33 | 2 | 28.9 | 1.0 | 25.8 | 1657 | 277 |
| 34 | — | 28.6 | 1.0 | 25.7 | 1645 | 325 |
| 35 | 4 | 29.8 | 1.4 | 25.6 | 1693 | 70 |
| 36 | — | 28.0 | 1.0 | 25.1 | 1620 | 322 |
| 37 | 2 | 27.9 | 1.0 | 24.9 | 1614 | 299 |
| 38 | NEW | 29.0 | 1.4 | 24.9 | 1659 | 60 |
| 39 | 2 | 27.6 | 0.9 | 24.8 | 1603 | 336 |
| 40 | NEW | 29.7 | 1.9 | 24.2 | 1689 | 54 |
| 41 | 2 | 26.4 | 0.9 | 23.7 | 1557 | 368 |
| 42 | 2 | 27.3 | 1.3 | 23.5 | 1591 | 64 |
| 43 | 2 | 26.1 | 0.9 | 23.4 | 1543 | 371 |
| 44 | 3 | 27.3 | 1.4 | 23.1 | 1591 | 64 |
| 45 | 7 | 27.9 | 1.6 | 23.0 | 1614 | 48 |
| 46 | 4 | 27.1 | 1.4 | 22.9 | 1585 | 56 |
| 47 | 3 | 26.3 | 1.1 | 22.8 | 1550 | 90 |
| 48 | 2 | 25.2 | 0.9 | 22.5 | 1507 | 345 |
| 49 | 2 | 25.0 | 0.9 | 22.3 | 1498 | 394 |
| 50 | 6 | 24.8 | 0.9 | 22.1 | 1490 | 384 |
| 51 | 3 | 24.6 | 0.8 | 22.1 | 1485 | 388 |
| 52 | 3 | 25.3 | 1.2 | 21.8 | 1512 | 83 |
| 53 | NEW | 27.3 | 1.8 | 21.8 | 1591 | 46 |
| 54 | 4 | 24.9 | 1.0 | 21.8 | 1497 | 120 |
| 55 | 8 | 25.3 | 1.2 | 21.8 | 1514 | 76 |
| 56 | NEW | 27.4 | 1.9 | 21.7 | 1596 | 48 |
| 57 | 4 | 25.3 | 1.3 | 21.5 | 1513 | 68 |
| 58 | NEW | 25.9 | 1.5 | 21.4 | 1537 | 62 |
| 59 | 5 | 24.6 | 1.2 | 21.1 | 1485 | 73 |
| 60 | 5 | 24.4 | 1.0 | 21.1 | 1476 | 109 |

(Δ48h: "—" = no change, "NEW" = new entry. The rank-49 entry is the one labeled "AEJ: Policy"; the other paper titles and Status icons are not shown here.)
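The Cons. and Elo columns follow directly from the definitions in the column legend, and can be sanity-checked against any row of the leaderboard. A quick sketch (the integrity-penalty argument is an assumption based on the legend's "adjusted for integrity penalties" wording; the exact penalty values are not published here):

```python
def conservative_rating(mu: float, sigma: float, penalty: float = 0.0) -> float:
    """Conservative rating used for ranking: mu - 3*sigma, less any integrity penalty."""
    return mu - 3.0 * sigma - penalty

def win_probability(elo_a: float, elo_b: float) -> float:
    """Elo win probability for A over B; a 400-point gap gives roughly 90%."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

# Rank-1 row: mu = 38.7, sigma = 1.8 -> 33.3 before penalties (the table shows 33.1,
# so this paper presumably carries a small integrity adjustment).
print(round(conservative_rating(38.7, 1.8), 1))   # 33.3
# A 400-point Elo gap, as described in the legend:
print(round(win_probability(2046.0, 1646.0), 3))  # 0.909
```

Ranking by μ − 3σ rather than μ alone is the usual conservative-estimate trick: a paper with few matches (high σ) must earn a clearly higher μ before it can outrank a well-tested one.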

Total tokens used for the tournament (excluding paper-generation tokens): 1,037,483,085