Apr 13 update: We have completed generation of the first 1,000 APE papers and are now focused on evaluation.

Can we automate policy evaluation?

AI may soon be capable of producing rigorous economic research. If that happens, policy evaluation could scale dramatically: highlighting what works, what fails, and what harms, far faster than human researchers alone.

We want to find out whether an autonomous system can generate, replicate, and revise empirical policy research, with everything made public.

This is an experiment in building reliable AI research systems. For a global overview, click here.

Ideas: 2,638 (+174 this week)
Papers: 1,000 (+58 this week)
Matches: 18k+

Last updated: April 14, 2026

Most policies, probably millions of them globally, are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know. So we're running an experiment: an AI system attempts to produce economics research at scale, using publicly available data.

Will any of it be good? How would we even know? Ideally, we'd want PhDs or editors of top journals to evaluate all of them. But they are busy. Instead, we run an automated tournament that evaluates the papers against human benchmarks from top journals. This could help triage, and get us to a "you know it when you see it" moment faster.

Most importantly, everything is public: papers, code, data, failures. The more people look, the faster mistakes get caught. And we want feedback! In fact, the core thesis is that recursive self-improvement is possible and can be enhanced by human feedback.

The next milestone: generate 1,000 papers, evaluate them, and share lessons in a report. Can policy evaluation be automated? Or is hallucinated slop unavoidable? Let's find out!
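The tournament ranks papers through pairwise head-to-head comparisons. As a minimal sketch of how such a ranking can emerge from individual matches, here is the standard textbook Elo update; the K-factor and starting ratings below are illustrative assumptions, not the system's actual parameters:

```python
# Hedged sketch: turning pairwise match outcomes into ratings via
# standard Elo updates. K and the starting ratings are assumptions
# for illustration only.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head match."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# One match: a paper rated 1500 beats a benchmark rated 1600.
ra, rb = update(1500.0, 1600.0, a_won=True)
```

Because the winner gains exactly what the loser gives up, total rating is conserved across a match; repeated over thousands of matches, ratings drift toward each paper's head-to-head strength.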

⚠️ Warning: We are learning how to build a reliable, autonomous research system. Expect bugs, errors, hallucinations, and trashable papers. None of the generated papers have been peer-reviewed and should not be used for evidence-based policy making.
What does "autonomous" even mean?



Ranking Metrics

While we review the first 1,000 APE papers, tournament scores and rankings are hidden to avoid biasing judgment during human validation.

Review Status


Column legend:

- Rank 48h: rank change over the last 48 hours.
- Paper μ: estimated skill rating (μ). Higher values indicate better research quality based on pairwise comparisons.
- σ: uncertainty. Lower values mean higher confidence in the rating.
- Cons.: conservative rating (μ − 3σ), adjusted for integrity penalties. Used for ranking.
- Elo: Elo rating. Standard chess-like rating where a 400-point difference ≈ 90% win probability.
- MP: matches played. Valid head-to-head comparisons, excluding annulled matches against papers flagged with severe issues during automated code review.
- Status: ✅ Peer reviewed · 🔎 Awaiting review · 🧐 Issues detected · 🚫 Critical errors
| Rank | Δ48h | Paper | μ | σ | Cons. | Elo | MP |
|---|---|---|---|---|---|---|---|
| 1 | | | 40.1 | 1.6 | 35.2 | 2104 | 190 |
| 2 | | | 37.2 | 1.5 | 32.8 | 1989 | 171 |
| 3 | | | 35.5 | 1.2 | 31.9 | 1920 | 185 |
| 4 | | | 35.8 | 1.4 | 31.6 | 1932 | 158 |
| 5 | 1 | | 35.3 | 1.3 | 31.6 | 1914 | 155 |
| 6 | 3 | | 35.0 | 1.2 | 31.4 | 1900 | 165 |
| 7 | | | 34.9 | 1.2 | 31.4 | 1896 | 195 |
| 8 | 3 | | 34.9 | 1.2 | 31.3 | 1897 | 188 |
| 9 | 4 | | 34.8 | 1.2 | 31.3 | 1892 | 191 |
| 10 | | | 34.8 | 1.2 | 31.2 | 1893 | 185 |
| 11 | 3 | | 34.6 | 1.2 | 31.0 | 1883 | 151 |
| 12 | | | 33.9 | 1.2 | 30.3 | 1856 | 168 |
| 13 | | | 33.8 | 1.2 | 30.1 | 1851 | 169 |
| 14 | 2 | | 33.3 | 1.1 | 29.9 | 1830 | 166 |
| 16 | 1 | | 33.0 | 1.1 | 29.6 | 1820 | 164 |
| 17 | | | 32.6 | 1.1 | 29.4 | 1803 | 180 |
| 18 | | | 32.6 | 1.1 | 29.3 | 1802 | 180 |
| 19 | 1 | | 32.6 | 1.1 | 29.2 | 1803 | 171 |
| 20 | 1 | | 32.4 | 1.1 | 29.2 | 1795 | 173 |
| 21 | 2 | | 32.3 | 1.1 | 29.1 | 1792 | 179 |
| 22 | 2 | | 32.2 | 1.1 | 28.8 | 1788 | 194 |
| 25 | 2 | | 31.5 | 1.1 | 28.4 | 1761 | 199 |
| 26 | 2 | | 31.3 | 1.1 | 28.1 | 1750 | 177 |
| 27 | 2 | | 31.1 | 1.1 | 27.9 | 1742 | 186 |
| 28 | 2 | | 31.0 | 1.1 | 27.8 | 1741 | 203 |
| 29 | 4 | | 30.7 | 1.0 | 27.6 | 1727 | 175 |
| 32 | 4 | | 30.3 | 1.0 | 27.1 | 1710 | 168 |
| 34 | 5 | | 29.5 | 1.0 | 26.6 | 1681 | 188 |
| 35 | 10 | | 29.3 | 0.9 | 26.5 | 1672 | 213 |
| 36 | 8 | | 29.4 | 1.0 | 26.4 | 1677 | 180 |
| 37 | 11 | | 29.2 | 1.0 | 26.2 | 1669 | 192 |
| 38 | 3 | | 29.1 | 1.0 | 26.1 | 1663 | 198 |
| 40 | | | 28.9 | 1.0 | 26.0 | 1657 | 186 |
| 44 | 11 | | 28.1 | 1.0 | 25.2 | 1625 | 184 |
| 46 | 10 | | 27.6 | 1.0 | 24.7 | 1605 | 201 |
| 47 | 13 | | 27.3 | 0.9 | 24.5 | 1593 | 223 |
| 61 | 19 | | 26.1 | 0.9 | 23.4 | 1545 | 235 |
| 67 | 2 | | 25.8 | 0.9 | 23.1 | 1533 | 217 |
| 74 | 18 | AEJ: Policy | 25.4 | 0.9 | 22.7 | 1515 | 238 |
| 104 | 26 | | 24.3 | 0.8 | 21.8 | 1471 | 236 |
| 117 | 28 | | 24.0 | 0.9 | 21.4 | 1461 | 233 |
| 153 | 28 | | 22.6 | 0.8 | 20.0 | 1403 | 256 |
| 421 | 43 | | 16.7 | 0.9 | 14.1 | 1169 | 288 |
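The two derived columns above can be sketched directly. This is a hedged illustration: the conservative rating below is the plain μ − 3σ formula before any integrity penalty (the table's values also fold in penalties), and the win probability is the standard logistic Elo curve:

```python
# Hedged sketch of the table's derived quantities.

def conservative(mu: float, sigma: float) -> float:
    # Pessimistic skill estimate: 3 sigma below the mean,
    # before any integrity penalty is applied.
    return mu - 3.0 * sigma

def win_prob(elo_diff: float) -> float:
    # Logistic Elo curve: a 400-point edge gives ~10:1 odds.
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

print(round(conservative(40.1, 1.6), 1))  # top row, pre-penalty -> 35.3
print(round(win_prob(400), 2))            # 400-point gap -> 0.91
```

A 400-point gap therefore corresponds to roughly a 91% win probability, matching the "≈ 90%" rule of thumb in the column legend.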

Total tokens used for tournament (excludes paper generation tokens): 1,476,570,316