Apr 13 update: We have completed generation of the first 1,000 APE papers and are now focused on evaluation.

Can we automate
policy evaluation?

AI may soon be capable of producing rigorous economic research. If that happens, policy evaluation could scale dramatically: surfacing what works, what fails, and what causes harm far faster than human researchers alone could.

We want to find out whether an autonomous system can generate, replicate, and revise empirical policy research, with everything made public.

This is an experiment in building reliable AI research systems. For a global overview, click here.

2,878 Ideas (+414 this week)
1,000 Papers (+10 this week)
18k+ Matches

Last updated: April 16, 2026

Most policies, probably millions of them globally, are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know, so we're running an experiment: an AI system attempts to produce economics research at scale, using publicly available data.

Will any of it be good? How would we even know? Ideally, we'd want PhDs or editors of top journals to evaluate all of them. But they are busy. Instead, we run an automated tournament that evaluates the papers against human benchmarks from top journals. This could help triage, and get us to a "you know it when you see it" moment faster.

Most importantly, everything is public: papers, code, data, failures. The more people look, the faster mistakes get caught. And we want feedback! In fact, the core thesis is that recursive self-improvement is possible and can be enhanced by human feedback.

The next milestone: generate 1,000 papers, evaluate them, and share the lessons in a report. Can policy evaluation be automated? Or is hallucinated slop unavoidable? Let's find out!

⚠️ Warning: We are learning how to build a reliable, autonomous research system. Expect bugs, errors, hallucinations, and trashable papers. None of the generated papers have been peer-reviewed, and they should not be used for evidence-based policy making.
What does "autonomous" even mean?



Ranking Metrics

While we review the first 1,000 APE papers, tournament scores and rankings are hidden to avoid biasing judgment during human validation.

Review Status


Rank
48h: Rank change over the last 48 hours.
Paper
μ: Estimated skill rating. Higher values indicate better research quality based on pairwise comparisons.
σ: Uncertainty. Lower values mean higher confidence in the rating.
Cons.: Conservative rating (μ - 3σ), adjusted for integrity penalties. Used for ranking.
Elo: Elo rating. Standard chess-like rating where a 400-point difference ≈ 90% win probability.
MP: Matches played. Valid head-to-head comparisons, excluding annulled matches against papers flagged with severe issues during automated code review.
Status: ✅ Peer reviewed · 🔎 Awaiting review · 🧐 Issues detected · 🚫 Critical errors
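The two formulas in the legend can be sketched in a few lines of Python. This is a minimal sketch, not the tournament's actual code: the exact integrity-penalty adjustment isn't documented here, so it is modeled as a plain subtraction.

```python
def conservative_rating(mu: float, sigma: float, penalty: float = 0.0) -> float:
    """Conservative rating used for ranking: mu - 3*sigma.

    The integrity penalty is an assumption (modeled as a simple
    subtraction); the legend only says ratings are "adjusted for
    integrity penalties".
    """
    return mu - 3.0 * sigma - penalty

def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard logistic Elo expectation: P(paper A beats paper B).

    A 400-point edge gives 10:1 odds, i.e. ~90.9% (the "90%" in the
    legend is this figure rounded).
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

# Rank-1 row vs. the AEJ: Policy benchmark row from the table below:
print(round(conservative_rating(40.1, 1.6), 1))    # 35.3 (table shows 35.2 after penalties)
print(round(elo_win_probability(2104, 1515), 3))   # 0.967
```

Note that ranking by μ - 3σ rather than μ rewards papers that have survived many matches: a high μ with few matches (large σ) still ranks low until the uncertainty shrinks.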
Rank | 48h | Paper       | μ    | σ   | Cons. | Elo  | MP
-----|-----|-------------|------|-----|-------|------|----
1    |     |             | 40.1 | 1.6 | 35.2  | 2104 | 190
2    |     |             | 37.2 | 1.5 | 32.8  | 1989 | 171
3    |     |             | 35.5 | 1.2 | 31.9  | 1920 | 185
4    |     |             | 35.8 | 1.4 | 31.6  | 1932 | 158
5    |     |             | 35.3 | 1.3 | 31.6  | 1914 | 155
6    | 3   |             | 35.0 | 1.2 | 31.4  | 1900 | 165
7    |     |             | 34.9 | 1.2 | 31.4  | 1896 | 195
8    | 3   |             | 34.9 | 1.2 | 31.3  | 1897 | 188
9    | 3   |             | 34.8 | 1.2 | 31.3  | 1892 | 191
10   |     |             | 34.8 | 1.2 | 31.2  | 1893 | 185
11   | 3   |             | 34.6 | 1.2 | 31.0  | 1883 | 151
12   |     |             | 33.9 | 1.2 | 30.3  | 1856 | 168
13   |     |             | 33.8 | 1.2 | 30.1  | 1851 | 169
14   | 1   |             | 33.3 | 1.1 | 29.9  | 1830 | 166
16   |     |             | 33.0 | 1.1 | 29.6  | 1820 | 164
17   |     |             | 32.6 | 1.1 | 29.4  | 1803 | 180
18   |     |             | 32.6 | 1.1 | 29.3  | 1802 | 180
19   | 1   |             | 32.6 | 1.1 | 29.2  | 1803 | 171
20   | 1   |             | 32.4 | 1.1 | 29.2  | 1795 | 173
21   | 2   |             | 32.3 | 1.1 | 29.1  | 1792 | 179
22   | 2   |             | 32.2 | 1.1 | 28.8  | 1788 | 194
25   | 2   |             | 31.5 | 1.1 | 28.4  | 1761 | 199
26   | 2   |             | 31.3 | 1.1 | 28.1  | 1750 | 177
27   | 2   |             | 31.1 | 1.1 | 27.9  | 1742 | 186
28   | 3   |             | 31.0 | 1.1 | 27.8  | 1741 | 203
29   | 4   |             | 30.7 | 1.0 | 27.6  | 1727 | 175
32   | 4   |             | 30.3 | 1.0 | 27.1  | 1710 | 168
34   | 6   |             | 29.5 | 1.0 | 26.6  | 1681 | 188
35   | 4   |             | 29.3 | 0.9 | 26.5  | 1672 | 213
36   | 6   |             | 29.4 | 1.0 | 26.4  | 1677 | 180
37   | 13  |             | 29.2 | 1.0 | 26.2  | 1669 | 192
38   | 8   |             | 29.1 | 1.0 | 26.1  | 1663 | 198
40   | 5   |             | 28.9 | 1.0 | 26.0  | 1657 | 186
44   | 12  |             | 28.1 | 1.0 | 25.2  | 1625 | 184
46   | 11  |             | 27.6 | 1.0 | 24.7  | 1605 | 201
47   | 14  |             | 27.3 | 0.9 | 24.5  | 1593 | 223
61   | 17  |             | 26.1 | 0.9 | 23.4  | 1545 | 235
67   | 33  |             | 25.8 | 0.9 | 23.1  | 1533 | 217
74   | 21  | AEJ: Policy | 25.4 | 0.9 | 22.7  | 1515 | 238
104  | 53  |             | 24.3 | 0.8 | 21.8  | 1471 | 236
117  | 33  |             | 24.0 | 0.9 | 21.4  | 1461 | 233
153  | 55  |             | 22.6 | 0.8 | 20.0  | 1403 | 256
421  | 47  |             | 16.7 | 0.9 | 14.1  | 1169 | 288

Total tokens used for the tournament (excluding paper-generation tokens): 1,476,570,316