Apr 13 update: We have completed generation of the first 1,000 APE papers and are now focused on evaluation.

Can we automate
policy evaluation?

AI may soon be capable of producing rigorous economic research. If that happens, policy evaluation could scale dramatically: surfacing what works, what fails, and what causes harm far faster than human researchers alone could.

We want to find out whether an autonomous system can generate, replicate, and revise empirical policy research, with everything made public.

This is an experiment in building reliable AI research systems. For a global overview, click here.

2,878 Ideas (+414 this week)
1,000 Papers (+10 this week)
18k+ Matches

Last updated: April 16, 2026

Most policies, probably millions of them globally, are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know, so we're running an experiment: an AI system attempts to produce economics research at scale, using publicly available data.

Will any of it be good? How would we even know? Ideally, we'd want PhDs or editors of top journals to evaluate all of them. But they are busy. Instead, we run an automated tournament that evaluates the papers against human benchmarks from top journals. This could help triage, and get us to a "you know it when you see it" moment faster.

Most importantly, everything is public: papers, code, data, failures. The more people look, the faster mistakes get caught. And we want feedback! In fact, the core thesis is that recursive self-improvement is possible and can be enhanced by human feedback.

The next milestone: generate 1,000 papers, evaluate them, and share the lessons in a report. Can policy evaluation be automated? Or is hallucinated slop unavoidable? Let's find out!

⚠️ Warning: We are learning how to build a reliable, autonomous research system. Expect bugs, errors, hallucinations, and trashable papers. None of the generated papers have been peer-reviewed, and they should not be used for evidence-based policy making.
What does "autonomous" even mean?



Ranking Metrics

While we review the first 1,000 APE papers, tournament scores and rankings are hidden to avoid biasing judgment during human validation.

Review Status


Rank
48h: Rank change over the last 48 hours.
Paper
μ: Estimated skill rating. Higher values indicate better research quality based on pairwise comparisons.
σ: Uncertainty. Lower values mean higher confidence in the rating.
Cons.: Conservative rating (μ - 3σ), adjusted for integrity penalties. Used for ranking.
Elo: Elo rating. Standard chess-like rating where a 400-point difference ≈ 90% win probability.
MP: Matches played. Valid head-to-head comparisons, excluding annulled matches against papers flagged with severe issues during automated code review.
Status: ✅ Peer reviewed · 🔎 Awaiting review · 🧐 Issues detected · 🚫 Critical errors
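The two formulas in the legend can be sketched in a few lines of Python. This is a minimal sketch, not the tournament's actual code: the exact integrity-penalty adjustment isn't documented here, so it is modeled as a plain subtraction.

```python
def conservative_rating(mu: float, sigma: float, penalty: float = 0.0) -> float:
    """Conservative rating used for ranking: mu - 3*sigma.

    The integrity penalty is an assumption (modeled as a simple
    subtraction); the legend only says ratings are "adjusted for
    integrity penalties".
    """
    return mu - 3.0 * sigma - penalty

def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard logistic Elo expectation: P(paper A beats paper B).

    A 400-point edge gives 10:1 odds, i.e. ~90.9% (the "90%" in the
    legend is this figure rounded).
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

# Rank-1 row vs. the AEJ: Policy benchmark row from the table below:
print(round(conservative_rating(40.1, 1.6), 1))    # 35.3 (table shows 35.2 after penalties)
print(round(elo_win_probability(2104, 1515), 3))   # 0.967
```

Note that ranking by μ - 3σ rather than μ rewards papers that have survived many matches: a high μ with few matches (large σ) still ranks low until the uncertainty shrinks.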
Rank | 48h | Paper       | μ    | σ   | Cons. | Elo  | MP
-----|-----|-------------|------|-----|-------|------|----
1    |     |             | 40.1 | 1.6 | 35.2  | 2104 | 190
2    |     |             | 37.2 | 1.5 | 32.8  | 1989 | 171
3    |     |             | 35.5 | 1.2 | 31.9  | 1920 | 185
4    |     |             | 35.8 | 1.4 | 31.6  | 1932 | 158
5    |     |             | 35.3 | 1.3 | 31.6  | 1914 | 155
6    | 3   |             | 35.0 | 1.2 | 31.4  | 1900 | 165
7    |     |             | 34.9 | 1.2 | 31.4  | 1896 | 195
8    | 3   |             | 34.9 | 1.2 | 31.3  | 1897 | 188
9    | 3   |             | 34.8 | 1.2 | 31.3  | 1892 | 191
10   |     |             | 34.8 | 1.2 | 31.2  | 1893 | 185
11   | 3   |             | 34.6 | 1.2 | 31.0  | 1883 | 151
12   |     |             | 33.9 | 1.2 | 30.3  | 1856 | 168
13   |     |             | 33.8 | 1.2 | 30.1  | 1851 | 169
14   | 1   |             | 33.3 | 1.1 | 29.9  | 1830 | 166
16   |     |             | 33.0 | 1.1 | 29.6  | 1820 | 164
17   |     |             | 32.6 | 1.1 | 29.4  | 1803 | 180
18   |     |             | 32.6 | 1.1 | 29.3  | 1802 | 180
19   | 1   |             | 32.6 | 1.1 | 29.2  | 1803 | 171
20   | 1   |             | 32.4 | 1.1 | 29.2  | 1795 | 173
21   | 2   |             | 32.3 | 1.1 | 29.1  | 1792 | 179
22   | 2   |             | 32.2 | 1.1 | 28.8  | 1788 | 194
25   | 2   |             | 31.5 | 1.1 | 28.4  | 1761 | 199
26   | 2   |             | 31.3 | 1.1 | 28.1  | 1750 | 177
27   | 2   |             | 31.1 | 1.1 | 27.9  | 1742 | 186
28   | 3   |             | 31.0 | 1.1 | 27.8  | 1741 | 203
29   | 4   |             | 30.7 | 1.0 | 27.6  | 1727 | 175
32   | 4   |             | 30.3 | 1.0 | 27.1  | 1710 | 168
34   | 6   |             | 29.5 | 1.0 | 26.6  | 1681 | 188
35   | 4   |             | 29.3 | 0.9 | 26.5  | 1672 | 213
36   | 6   |             | 29.4 | 1.0 | 26.4  | 1677 | 180
37   | 13  |             | 29.2 | 1.0 | 26.2  | 1669 | 192
38   | 8   |             | 29.1 | 1.0 | 26.1  | 1663 | 198
40   | 5   |             | 28.9 | 1.0 | 26.0  | 1657 | 186
44   | 12  |             | 28.1 | 1.0 | 25.2  | 1625 | 184
46   | 11  |             | 27.6 | 1.0 | 24.7  | 1605 | 201
47   | 14  |             | 27.3 | 0.9 | 24.5  | 1593 | 223
61   | 17  |             | 26.1 | 0.9 | 23.4  | 1545 | 235
67   | 33  |             | 25.8 | 0.9 | 23.1  | 1533 | 217
74   | 21  | AEJ: Policy | 25.4 | 0.9 | 22.7  | 1515 | 238
104  | 53  |             | 24.3 | 0.8 | 21.8  | 1471 | 236
117  | 33  |             | 24.0 | 0.9 | 21.4  | 1461 | 233
153  | 55  |             | 22.6 | 0.8 | 20.0  | 1403 | 256
421  | 47  |             | 16.7 | 0.9 | 14.1  | 1169 | 288

Total tokens used for the tournament (excluding paper-generation tokens): 1,476,570,316