Methodology

How APE Produces and Evaluates Research

Note: This page is a work in progress. Methods are being actively developed and refined as the project evolves.

Core Principles

Real Data Only

All analysis uses real data from public APIs. If data is unavailable, the system pivots to a different research question — it never simulates or fabricates data. Papers using simulated data are automatically rejected.

R for Econometrics

All econometric analysis is done in R, using established packages for causal inference.

Full Transparency

Every paper includes complete replication materials: code, data, and figures. Git commits create an audit trail, and key files are locked after creation.

Models

APE uses multiple AI models in different roles. Configurations evolve as we experiment with different mixtures and workflows — individual papers may have been produced with different model versions.

Paper Generation

The core engine is Claude Code (currently Claude Opus 4.6, as of February 2026). Claude Code drives the entire pipeline: generating research ideas, writing R scripts, fetching data, running analysis, drafting LaTeX, and revising based on feedback. It operates as an agentic coding system — planning, executing, and iterating autonomously within guardrails.

Multi-Model Ensemble

To avoid single-model blind spots, Claude Code currently orchestrates an ensemble of 11 external models from 7 providers across idea ranking, multi-stage review, code auditing, and tournament evaluation. The specific models and roles evolve — what matters is the principle: no single model evaluates its own work.
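That principle can be sketched in a few lines. This is an illustrative reviewer-assignment rule, not APE's actual configuration: the model names and the helper `assign_reviewers` are hypothetical placeholders.

```python
# Sketch of the "no model evaluates its own work" rule.
# Model names below are illustrative, not APE's real roster.

GENERATOR = "generator-model"  # the model that produced the paper

REVIEW_POOL = [
    "generator-model",
    "reviewer-a",
    "reviewer-b",
    "reviewer-c",
    "reviewer-d",
]

def assign_reviewers(generator: str, pool: list[str], k: int = 3) -> list[str]:
    """Pick k reviewers, never including the model that wrote the paper."""
    eligible = [m for m in pool if m != generator]
    if len(eligible) < k:
        raise ValueError("not enough independent reviewers")
    return eligible[:k]

chosen = assign_reviewers(GENERATOR, REVIEW_POOL)
```

The key design choice is that exclusion happens structurally, at assignment time, rather than relying on any model to recuse itself.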

Tournament Judge

Head-to-head comparisons use Gemini 3 Flash via the Google API with native PDF upload. The judge sees both papers exactly as a human reader would — text, figures, tables, and formatting. We chose a non-Anthropic model for judging to avoid potential self-preference bias, since the papers are generated by Claude.

A Note on Model Evolution

This is an active experiment. We regularly update models, adjust prompts, and restructure workflows. Papers produced in January 2026 used Claude Opus 4.5; papers from February 2026 onward use Opus 4.6. Review and judging models have also changed over time. We do not retroactively re-generate papers when models change — each paper reflects the configuration at the time it was produced. The tournament evaluates the output, regardless of which model produced it.

The Tournament

Every paper enters a head-to-head tournament against published research from top economics journals.

How It Works

  • Position swapping: Each pair is judged twice with the paper order swapped to control for position bias
  • TrueSkill ratings: Papers accumulate skill ratings that update after each match
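The two mechanics above can be sketched together. This is a minimal TrueSkill-style update (win/loss only, no draw margin or dynamics factor), using the standard defaults (mu = 25, sigma = 25/3, beta = 25/6); the `judge` callable is a hypothetical stand-in for the external model judge, and the agreement rule for swapped orderings is an assumption, not APE's documented aggregation.

```python
import math

BETA = 25 / 6  # performance variability, TrueSkill default

def _pdf(x): return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
def _cdf(x): return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def update(winner, loser):
    """Return new (mu, sigma) pairs after winner beats loser."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2 * BETA**2 + sig_w**2 + sig_l**2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)   # mean-shift factor
    w = v * (v + t)         # variance-shrink factor
    new_w = (mu_w + sig_w**2 / c * v,
             sig_w * math.sqrt(max(1 - sig_w**2 / c**2 * w, 1e-9)))
    new_l = (mu_l - sig_l**2 / c * v,
             sig_l * math.sqrt(max(1 - sig_l**2 / c**2 * w, 1e-9)))
    return new_w, new_l

def judged_pair(judge, paper_a, paper_b):
    """Judge twice with positions swapped; count a result only if both
    orderings agree, otherwise treat the match as inconclusive."""
    first = judge(paper_a, paper_b)   # returns the preferred paper
    second = judge(paper_b, paper_a)
    return first if first == second else None
```

Starting from equal priors, a single win raises the winner's mean, lowers the loser's, and shrinks both uncertainties, which is why early matches move ratings much more than later ones.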

Judge Calibration

We validate the judge by running peer-reviewed human papers through the same evaluation. If the judge correctly recognizes publication-quality work, its evaluations of AI papers carry weight.

View calibration results →

What the Judge Rewards

  • Novel questions that challenge conventional wisdom
  • Rigorous identification (even with null results)
  • Honest engagement with limitations

What the Judge Penalizes

  • Weak identification
  • Failed placebos or violated assumptions
  • Shallow analysis, missing robustness checks

Integrity Checks

Beyond head-to-head matches, papers undergo automated integrity verification:

  • Code scanning: Detects fabricated data, hard-coded results, or suspicious patterns
  • Replication testing: Verifies that code runs and reproduces claimed results
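A code-scanning pass of this kind can be approximated with pattern matching over the R source. The patterns below are simplified illustrations, not the scanner's real rule set, and a real scanner would need to distinguish legitimate uses (e.g. random draws inside a bootstrap) from fabrication:

```python
import re

# Illustrative suspicious-pattern rules for R analysis scripts.
# These are simplified examples, not APE's actual scanner rules.
SUSPICIOUS = {
    "simulated data": re.compile(r"\b(rnorm|runif|rbinom|rpois)\s*\("),
    "hard-coded p-value": re.compile(r"p[_.]?value\s*(<-|=)\s*0?\.\d+"),
}

def scan_r_source(source: str) -> list[str]:
    """Return the name of every suspicious pattern found in the source."""
    return [name for name, pat in SUSPICIOUS.items() if pat.search(source)]

flags = scan_r_source("y <- rnorm(1000)\nfit <- lm(y ~ x)")
```

Static scanning is cheap but coarse, which is why the replication test above complements it: actually re-running the code catches fabrication that no regex would.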

Serious issues result in virtual losses — rating penalties applied as if the paper lost matches against a median-quality opponent. This ensures papers cannot rank highly through match luck alone if their code has integrity problems.
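A virtual loss can be sketched as repeatedly applying the losing side of a TrueSkill-style mean update against a median opponent, with no real match played. This is a stand-in under the standard defaults (mu = 25, sigma = 25/3, beta = 25/6); the exact penalty rule APE applies is not specified here, and the sigma update is omitted for brevity:

```python
import math

BETA = 25 / 6  # performance variability, TrueSkill default

def _v(t):
    """Mean-shift factor from the winner's perspective."""
    pdf = math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(t / math.sqrt(2)))
    return pdf / cdf

def virtual_losses(mu, sigma, median_mu, median_sigma, n=1):
    """Penalize (mu, sigma) as if it lost n matches to the median paper.
    (sigma update omitted for brevity)"""
    for _ in range(n):
        c = math.sqrt(2 * BETA**2 + sigma**2 + median_sigma**2)
        t = (median_mu - mu) / c
        mu -= sigma**2 / c * _v(t)  # loser's mean drops
    return mu, sigma

# A paper rated above the median loses rating after virtual losses:
mu, _ = virtual_losses(30.0, 25 / 3, 25.0, 25 / 3, n=2)
```

Note that losing to a weaker (median) opponent costs a highly rated paper more than losing to a peer would, which is exactly what makes the penalty bite for papers near the top of the rankings.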

Review Process

Every paper goes through multi-stage review before entering the tournament.

Six Stages

  1. Advisor review: Four models independently check for fatal errors (3 of 4 must pass)
  2. Theory review: GPT-5.2-pro verifies formal theory (conditional — only for papers with structural models, proofs, or calibration)
  3. Exhibit review: Visual feedback on tables and figures (mandatory, iterative)
  4. Prose review: Writing quality feedback (mandatory, iterative)
  5. Referee review: Three parallel reviewers provide journal-style peer review
  6. Revision: Address all feedback before publication

Stages 2–4 are independent and run in parallel when possible.
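The stage-1 advisor gate reduces to a simple vote count. A minimal sketch, with placeholder advisor names (the real advisor models are not named here):

```python
# Stage-1 advisor gate: four models vote independently on fatal errors,
# and at least three must pass. Advisor names are illustrative.

def advisor_gate(verdicts: dict[str, bool], required: int = 3) -> bool:
    """True when at least `required` advisors found no fatal errors."""
    return sum(verdicts.values()) >= required

votes = {
    "advisor-a": True,
    "advisor-b": True,
    "advisor-c": False,  # one advisor flags a fatal error
    "advisor-d": True,
}
passed = advisor_gate(votes)  # → True: 3 of 4 passed
```

Requiring agreement from three of four independent reviews means a single model's false alarm cannot block a paper, while two independent objections will.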