Methodology (Work in Progress)
How APE Produces and Evaluates Research
Core Principles
1. Real Data Only — No Simulated Data
NEVER generate, simulate, or fabricate data. All analysis must use real data from actual APIs. This is the most critical principle.
- If an API call fails → pivot to a different research question with working data
- If data is unavailable → pivot, do not simulate "placeholder" or "illustrative" data
- Using np.random, rnorm(), or similar to create outcome data is strictly forbidden
Papers using simulated data are automatically rejected. The review process will catch it, but prevention is better than rejection.
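Enforcement can be partly mechanical. The sketch below shows one way a coarse pre-review check could work: a pattern scan over the replication code for common simulation calls. The directory layout and pattern list are assumptions for illustration; the authoritative check is the review process itself.

```r
# Hypothetical pre-review check: flag common simulation calls in replication code.
# A coarse heuristic sketch for illustration, not the actual APE review step.
forbidden <- c("np\\.random", "rnorm\\(", "runif\\(", "rbinom\\(")

scan_for_simulation <- function(code_dir = "code") {
  pattern <- paste(forbidden, collapse = "|")
  files <- list.files(code_dir, pattern = "\\.(R|py)$",
                      recursive = TRUE, full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines)
    if (length(idx)) data.frame(file = f, line = idx, code = lines[idx])
  })
  do.call(rbind, hits)  # NULL means no suspicious calls were found
}

scan_for_simulation("code")
```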
2. R for Econometrics
All econometric analysis must be done in R. The R ecosystem for causal inference is definitive: Callaway and Sant'Anna's did package, Cattaneo's rdrobust, Roth's HonestDiD.
3. Git is the Audit Trail
Commit and push at mandatory checkpoints. Key files like initialization.md and initial_plan.md are locked after creation and cannot be modified.
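A minimal sketch of how a lock could be verified before each checkpoint, assuming the hash file stores a SHA-256 hex digest in sha256sum format; the digest package is one way to recompute it in R.

```r
# Sketch: confirm a locked file still matches its recorded SHA-256 digest.
library(digest)

verify_lock <- function(path = "initialization.md",
                        hash_file = "initialization.sha256") {
  # Assumes sha256sum-style format: "<hex digest>  <filename>".
  recorded <- strsplit(trimws(readLines(hash_file, warn = FALSE)[1]), "\\s+")[[1]][1]
  current  <- digest(path, algo = "sha256", file = TRUE)
  if (!identical(current, recorded)) {
    stop(sprintf("%s was modified after locking.", path))
  }
  invisible(TRUE)
}

verify_lock()  # run before committing at each checkpoint
```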
The Tournament
Every paper enters a head-to-head tournament against published research from top economics journals (AER, AEJ: Policy).
How It Works
- Blind comparison: An LLM judge (Gemini 3 Flash) compares two papers without knowing which is AI-generated
- Position swapping: Each pair is judged twice with paper order swapped to control for bias
- TrueSkill ratings: Papers accumulate skill estimates (μ) with uncertainty (σ) that update after each match; a simplified update is sketched after this list
- Public leaderboard: Rankings are visible on the homepage, updated daily
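For intuition, here is a minimal sketch of the simplified two-player TrueSkill update (no draws, no dynamics factor): the winner's mean rises, the loser's falls, and both uncertainties shrink. The β value and starting ratings are illustrative defaults, not APE's actual configuration.

```r
# Simplified two-player TrueSkill update (no draw margin, no dynamics factor tau).
# Illustrative only; the production leaderboard may use a library implementation.
trueskill_update <- function(winner, loser, beta = 25 / 6) {
  c2 <- 2 * beta^2 + winner$sigma^2 + loser$sigma^2
  c  <- sqrt(c2)
  t  <- (winner$mu - loser$mu) / c
  v  <- dnorm(t) / pnorm(t)   # mean shift factor
  w  <- v * (v + t)           # variance shrinkage factor
  winner$mu    <- winner$mu + (winner$sigma^2 / c) * v
  loser$mu     <- loser$mu  - (loser$sigma^2  / c) * v
  winner$sigma <- sqrt(winner$sigma^2 * (1 - (winner$sigma^2 / c2) * w))
  loser$sigma  <- sqrt(loser$sigma^2  * (1 - (loser$sigma^2  / c2) * w))
  list(winner = winner, loser = loser)
}

# One pairing is judged twice (positions swapped), so ratings update once per verdict.
ai_paper  <- list(mu = 25, sigma = 25 / 3)
benchmark <- list(mu = 28, sigma = 4)
trueskill_update(winner = benchmark, loser = ai_paper)
```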
Judge Calibration
When the LLM judge rejects or accepts an AI paper, does that mean anything — or is it noise? We test this by running validated human papers (working paper versions of research accepted at AER and AEJ:Policy) through the same review process.
If the judge correctly recognizes peer-reviewed work as publication-quality, its evaluations of AI papers carry weight. If it cannot reliably recognize validated human work, its judgments of AI papers are little more than noise.
What the Judge Rewards
- Novel questions that challenge conventional wisdom
- Rigorous identification (even with null results)
- Methodological sophistication and appropriate diagnostics
- Honest engagement with limitations
What the Judge Penalizes
- Weak identification (e.g., single treated unit)
- Failed placebos or violated assumptions
- Underpowered designs
- Shallow analysis, missing robustness checks
The Pipeline
The APE pipeline enforces strict file requirements at each phase transition. There are two workflows: Produce (new papers) and Revise (improve existing papers). Scripts automatically validate prerequisites before proceeding; a sketch of such a check follows the table below.
| Phase | Required Files | Description |
|---|---|---|
| 1. Setup | initialization.md, initialization.sha256 | Human Q&A choices and integrity hash. Locked after creation. |
| 2. Discover | ideas.md | 3-5 vetted research ideas with feasibility assessment. The most important phase — policy first, then data. |
| 3. Rank | ideas_ranked.json, ranking.md | LLM ranking evaluates novelty, identification, and feasibility. Max 3 iterations before abort. |
| 4. Execute | initial_plan.md, research_plan.md, paper.tex, paper.pdf, code/, figures/, data/ | Complete paper with replication materials. All econometrics in R. |
| 5. Review | advisor_*.md, review_*.md, revision_plan.md | Three-stage review: Advisor (fatal errors) → Referee (journal-style) → Revision. |
| 6. Publish | papers/apep_XXXX/, final_review.md, manifest.json | Paper registered in tournament. Official decision logged. Cleanup completed. |
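As a sketch of the prerequisite validation, the snippet below checks that the previous phase's required files exist before a transition is allowed. The file lists mirror the table; the real pipeline scripts may check contents and hashes, not just existence.

```r
# Sketch: refuse to advance unless the prior phase's required files are present.
phase_files <- list(
  setup    = c("initialization.md", "initialization.sha256"),
  discover = c("ideas.md"),
  rank     = c("ideas_ranked.json", "ranking.md"),
  execute  = c("initial_plan.md", "research_plan.md", "paper.tex", "paper.pdf"),
  review   = c("revision_plan.md")
)

require_phase <- function(phase) {
  required <- phase_files[[phase]]
  missing  <- required[!file.exists(required)]
  if (length(missing)) {
    stop(sprintf("Phase '%s' incomplete; missing: %s",
                 phase, paste(missing, collapse = ", ")))
  }
  invisible(TRUE)
}

require_phase("rank")  # e.g. run before starting the Execute phase
```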
Supported Methods
APE supports three identification strategies. Each has specific feasibility requirements that must be met before execution.
Difference-in-Differences (DiD)
For policies with staggered adoption across units (states, counties, etc.).
Requirements:
- ≥5 pre-treatment periods
- ≥20 treated clusters
- Exogenous treatment timing
- Valid comparison group (similar never-treated units)
- No major concurrent policy changes
Uses the Callaway-Sant'Anna or Sun-Abraham estimators, which remain valid under staggered adoption with heterogeneous treatment effects.
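A minimal sketch of a Callaway-Sant'Anna call in R, assuming a state-year panel; the data frame panel and its column names are hypothetical placeholders, and the data itself must come from a real API.

```r
# Sketch: Callaway-Sant'Anna group-time ATTs with never-treated units as controls.
library(did)

atts <- att_gt(
  yname         = "outcome",     # outcome variable
  tname         = "year",        # calendar time
  idname        = "state_id",    # unit identifier
  gname         = "treat_year",  # first treatment year (0 = never treated)
  data          = panel,         # hypothetical real-data panel
  control_group = "nevertreated",
  est_method    = "dr"           # doubly robust group-time estimates
)

# Event-study aggregation: pre-period coefficients double as a pre-trend check.
event_study <- aggte(atts, type = "dynamic", na.rm = TRUE)
summary(event_study)
ggdid(event_study)
```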
Regression Discontinuity (RDD)
For policies with sharp eligibility thresholds.
Requirements:
- Sharp threshold with clear discontinuity
- Sufficient observations near cutoff
- No manipulation of running variable (McCrary test)
- Bandwidth sensitivity analysis
- Covariate balance at threshold
Uses Cattaneo's rdrobust package with optimal bandwidth selection.
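A minimal sketch of the corresponding R workflow; score, outcome, and the cutoff value are hypothetical placeholders for the policy's running variable, outcome, and eligibility threshold.

```r
# Sketch: sharp RD estimation, manipulation test, and bandwidth sensitivity.
library(rdrobust)
library(rddensity)

cutoff <- 0  # placeholder eligibility threshold

# McCrary-style density test for manipulation of the running variable.
summary(rddensity(score, c = cutoff))

# Robust bias-corrected RD estimate with MSE-optimal bandwidth.
rd <- rdrobust(y = outcome, x = score, c = cutoff)
summary(rd)

# Bandwidth sensitivity: re-estimate at half and double the selected bandwidth.
h_opt <- rd$bws[1, 1]
summary(rdrobust(y = outcome, x = score, c = cutoff, h = 0.5 * h_opt))
summary(rdrobust(y = outcome, x = score, c = cutoff, h = 2.0 * h_opt))

# Visual check of the discontinuity.
rdplot(y = outcome, x = score, c = cutoff)
```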
Doubly Robust (DR)
For observational studies without quasi-experimental variation.
Requirements:
- Unconfoundedness plausible given covariates
- Propensity score overlap (no extreme weights)
- K-fold cross-fitting (mandatory)
- E-value sensitivity analysis
- Calibrated sensitivity bounds
Combines an outcome model with propensity-score weighting; the estimate remains consistent if either model is correctly specified.
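A minimal sketch of a hand-rolled AIPW estimator with K-fold cross-fitting, using simple parametric nuisance models for clarity. The data frame df with treatment d, outcome y, and covariates x1, x2 is a hypothetical placeholder; production code would use richer nuisance models and add the E-value sensitivity analysis.

```r
# Sketch: augmented inverse-propensity weighting (AIPW) with 5-fold cross-fitting.
set.seed(1)
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(df)))
psi   <- numeric(nrow(df))

for (k in 1:K) {
  train <- df[folds != k, ]
  test  <- df[folds == k, ]

  # Nuisance models are fit on the training folds only (cross-fitting).
  ps_fit <- glm(d ~ x1 + x2, family = binomial(), data = train)
  m1_fit <- lm(y ~ x1 + x2, data = subset(train, d == 1))
  m0_fit <- lm(y ~ x1 + x2, data = subset(train, d == 0))

  e  <- predict(ps_fit, newdata = test, type = "response")
  e  <- pmin(pmax(e, 0.01), 0.99)   # trim extreme propensity scores (overlap)
  m1 <- predict(m1_fit, newdata = test)
  m0 <- predict(m0_fit, newdata = test)

  # AIPW influence-function values for the held-out fold.
  psi[folds == k] <- (m1 - m0) +
    test$d * (test$y - m1) / e -
    (1 - test$d) * (test$y - m0) / (1 - e)
}

c(ATE = mean(psi), SE = sd(psi) / sqrt(length(psi)))
```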
Quality Standards
Paper Structure
Papers must meet minimum structural requirements:
- At least 25 pages in the main text
- Complete replication package (code, data, figures)
- All code must run without errors
Three-Stage Review Process
Every paper goes through rigorous multi-stage review:
- Stage A: Advisor review — LLM advisors check for fatal errors: data-design misalignment, broken regressions, placeholders, internal inconsistencies.
- Stage B: Referee review — 3 parallel LLM reviewers provide full journal-style peer review.
- Stage C: Revision — Address all feedback, create revision plan, comprehensive fixes.
Decisions: ACCEPT | MINOR REVISION | MAJOR REVISION | REJECT AND RESUBMIT