What Does "Autonomous" Even Mean?
Defining the boundaries of AI-generated policy evaluation
The Honest Answer: We Don't Know Yet
There are no established categories for autonomous research. No industry standards. No agreed-upon definitions. We're in the early stages of something genuinely new, and this page documents our attempt to think through it.
Some of the papers APE produces will be obviously flawed on basic inspection. Others may contain subtle errors that only careful human scrutiny can catch. We're running this experiment transparently precisely because we don't know where the boundaries are.
What APE Does Autonomously
Code: 100% AI-Written
All analysis code is written by the AI system: data processing, econometric analysis, figure generation. All LaTeX manuscript code is also AI-written. No human writes or edits code.
Data: Observational, Not Human-Collected
APE uses publicly available observational data, such as government statistics, administrative microdata, or survey data. It fetches this data from the internet, often through APIs. It can also collect unstructured data and use code to convert it into structured form. No human collects or curates the data for individual papers.
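As a minimal sketch of the unstructured-to-structured step, consider turning loosely formatted scraped text into tidy rows. The record format, field names, and example lines below are hypothetical, not APE's actual data sources or pipeline:

```python
import csv
import io
import re

# Hypothetical raw text scraped from a government bulletin,
# one policy announcement per line. Invented for illustration.
raw_text = """\
2019-07-01 | Smoking ban extended to outdoor cafes | region=North
2020-03-15 | Fuel tax raised by 5 cents/litre | region=South
"""

def parse_records(text):
    """Turn loosely formatted lines into structured dicts."""
    pattern = re.compile(
        r"(?P<date>\d{4}-\d{2}-\d{2})\s*\|\s*(?P<policy>[^|]+?)\s*\|\s*region=(?P<region>\w+)"
    )
    return [m.groupdict() for m in map(pattern.search, text.splitlines()) if m]

records = parse_records(raw_text)

# Write the structured rows out as CSV — the kind of tidy file
# that downstream analysis code would then consume.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "policy", "region"])
writer.writeheader()
writer.writerows(records)
```

The point is not the parsing itself but the division of labor: code like this is written by the AI system, while the underlying data remains observational.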
Paper = Data + Code
If the data is observational and the code is AI-written, then in a meaningful sense the paper is autonomously produced. The AI system transforms raw data into a complete research manuscript with analysis, figures, and interpretation.
What Requires Human Input
Initialization
Currently, a human must initialize each research session. This guidance can range from minimal to detailed:
- Minimal: "Write a paper and surprise me"
- Moderate: "Evaluate a health policy using difference-in-differences"
- Specific: "Evaluate policy X using dataset Y and method Z"
The degree of human guidance affects what the system produces, but even with minimal guidance, all code is still AI-written.
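To make the "difference-in-differences" prompt concrete, here is a minimal two-period, two-group version of that estimator. The outcome values are invented for illustration and are not from any APE paper:

```python
# A minimal two-period, two-group difference-in-differences estimate.
# Outcome means are hypothetical, keyed by (group, period).
outcomes = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 14.0,
    ("control", "pre"): 9.0,
    ("control", "post"): 11.0,
}

def did_estimate(y):
    """(treated post - treated pre) - (control post - control pre)."""
    treated_change = y[("treated", "post")] - y[("treated", "pre")]
    control_change = y[("control", "post")] - y[("control", "pre")]
    return treated_change - control_change

# The treated group changed by 4.0, the control group by 2.0,
# so the estimated effect is 2.0 (under parallel-trends assumptions).
effect = did_estimate(outcomes)
```

A real paper would estimate this via regression with controls and standard errors; the sketch only shows the core contrast a human's one-line prompt can set in motion.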
Course Correction (Optional)
Humans can provide feedback during the research process — suggesting revisions, pointing out errors, redirecting the analysis. This is possible but not required. Many papers are produced with only the initial guidance.
Revision Mode
After a paper is produced, a human can trigger a revision. Again, this can range from minimal to detailed:
- Minimal: "Revise the paper" — only automated feedback is used
- Detailed: Specific directions on what to fix, expand, or reconsider
There's no limit to how much guidance a human can provide. But even with minimal revision instructions, all new code and text is still AI-written.
In practice, revisions appear to be very helpful — both for reducing errors and for producing more compelling output.
A Thought Experiment
Imagine the only human input is: "Evaluate a policy nobody has evaluated."
The system then:
- Identifies an understudied policy
- Finds and fetches relevant public data
- Designs an identification strategy with reasonable assumptions
- Writes all analysis code
- Produces figures and tables
- Cites the appropriate literature
- Drafts a complete manuscript
Suppose, further, that the code replicates and that human readers find the research valuable.
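The division of labor above can be sketched as a pipeline of stubbed stages. Every function name and return value here is hypothetical — a sketch of the control flow, not APE's implementation:

```python
# Hypothetical orchestration of the thought experiment's steps.
# Each stub stands in for a full AI subsystem.

def identify_policy():           return "understudied_policy_X"
def fetch_data(policy):          return {"policy": policy, "rows": 1000}
def design_identification(data): return "difference-in-differences"
def run_analysis(data, design):  return "results.csv"
def make_figures(results):       return ["fig1.pdf"]
def cite_literature(policy):     return ["Author (2021)"]
def draft_manuscript(*parts):    return {"sections": parts}

def run_session(goal):
    # The human supplies only `goal`; every step below is machine-driven.
    policy = identify_policy()
    data = fetch_data(policy)
    design = design_identification(data)
    results = run_analysis(data, design)
    figures = make_figures(results)
    citations = cite_literature(policy)
    return draft_manuscript(goal, policy, design, results, figures, citations)

paper = run_session("Evaluate a policy nobody has evaluated")
```

Notice that the human input appears exactly once, as the goal string: the method, data, and manuscript are all chosen inside the loop.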
In this scenario, most of the research was done autonomously by the computer. The human provided a goal, not a method. Is this "autonomous research"? We think so. But it's unclear what fraction of cases will look like this. Many papers won't meet this bar: some will have errors, some will be trivial, some will be wrong. That's why we're running this as an experiment with full transparency, not claiming we've solved it.
The Autonomous Vehicles Analogy
The Society of Automotive Engineers defines six levels of driving automation (L0–L5). This framework is useful because it acknowledges that autonomy is a spectrum, not a binary.
- L0 (No Automation): Human does everything; vehicle only provides warnings
- L1–L2 (Driver Assistance): Vehicle helps with some tasks; human must stay fully alert
- L3–L4 (Conditional to High Automation): Vehicle handles most tasks; human available for edge cases
- L5 (Full Automation): No human needed; steering wheel optional
Where is APE? Somewhere in the middle, loosely speaking. But the analogy only goes so far. Cars have a clear success criterion (get from A to B safely); research doesn't. A paper can "run" successfully and still be methodologically flawed, or correct but trivial, or rigorous but wrong. We don't have an equivalent established framework for research autonomy. In fact, taste is probably a big factor: for a policymaker, a policy evaluation is arguably only valuable if it is understandable, intuitive, and actionable.
The "Nines" Problem
We Have No Idea What Reliability Means Here
In AI safety, the "nines of reliability" framework asks: how many 9s of reliability do we need? 99%? 99.9%? 99.999%? For autonomous vehicles, the answer matters because failures kill people. The industry has developed standards, testing regimes, and regulatory frameworks.
For autonomous policy evaluation? Nothing like this exists. We don't even know what metrics to use:
- Replication rate? (How often does the code reproduce the results?)
- Methodological validity? (Is the identification strategy sound?)
- Error rate? (How often are there serious mistakes?)
- Tournament performance? (How do papers compare to human benchmarks?)
The stakes are different (a bad paper doesn't kill anyone directly), but bad research can influence policy, waste resources, and erode trust in science. What reliability standard should apply? We genuinely don't know. This is an open problem we're grappling with, not one we've solved.
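For intuition, the "nines" framing maps a reliability level to a count of nines, and the candidate metrics above could in principle be tracked over a corpus of papers. The thresholds and paper records below are illustrative inventions, not proposed standards or real APE data:

```python
import math

def nines(reliability):
    """Count of 'nines' in a reliability level: 0.99 -> 2, 0.999 -> 3."""
    return -math.log10(1.0 - reliability)

# Hypothetical per-paper records: did the code replicate, and did
# review find a serious mistake? Invented data for illustration.
papers = [
    {"replicated": True,  "serious_error": False},
    {"replicated": True,  "serious_error": True},
    {"replicated": False, "serious_error": True},
]

replication_rate = sum(p["replicated"] for p in papers) / len(papers)
error_rate = sum(p["serious_error"] for p in papers) / len(papers)
```

Even this toy dashboard shows the open problem: replication rate and error rate can disagree (a paper can replicate perfectly and still be wrong), and nobody yet knows how many nines either metric should hit.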