What Does "Autonomous" Even Mean?
Defining the boundaries of AI-generated policy evaluation
The Honest Answer: We Don't Know Yet
There are no established categories for autonomous research. No industry standards. No agreed-upon definitions. We're in the early stages of something genuinely new, and this page documents our attempt to think through it.
Some of the papers APE produces will be obviously flawed on basic inspection. Others may contain subtle errors that only careful human scrutiny can catch. We're running this experiment transparently precisely because we don't know where the boundaries are.
What APE Does Autonomously
Code: 100% AI-Written
All analysis code is written by the AI system: data processing, econometric analysis, figure generation. All LaTeX manuscript code is also AI-written. No human writes or edits code.
Data: Observational, Not Human-Collected
APE uses publicly available observational data, such as government statistics, administrative microdata, or survey data. It fetches this data from the internet, often through APIs. It can also collect unstructured data and use code to convert it into structured form. No human collects or curates the data for individual papers.
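As a minimal sketch of the unstructured-to-structured step, consider turning loosely formatted scraped text into tidy rows. The record format, field names, and example lines below are hypothetical, not APE's actual data sources or pipeline:

```python
import csv
import io
import re

# Hypothetical raw text scraped from a government bulletin,
# one policy announcement per line. Invented for illustration.
raw_text = """\
2019-07-01 | Smoking ban extended to outdoor cafes | region=North
2020-03-15 | Fuel tax raised by 5 cents/litre | region=South
"""

def parse_records(text):
    """Turn loosely formatted lines into structured dicts."""
    pattern = re.compile(
        r"(?P<date>\d{4}-\d{2}-\d{2})\s*\|\s*(?P<policy>[^|]+?)\s*\|\s*region=(?P<region>\w+)"
    )
    return [m.groupdict() for m in map(pattern.search, text.splitlines()) if m]

records = parse_records(raw_text)

# Write the structured rows out as CSV — the kind of tidy file
# that downstream analysis code would then consume.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "policy", "region"])
writer.writeheader()
writer.writerows(records)
```

The point is not the parsing itself but the division of labor: code like this is written by the AI system, while the underlying data remains observational.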
Paper = Data + Code
If the data is observational and the code is AI-written, then in a meaningful sense the paper is autonomously produced. The AI system transforms raw data into a complete research manuscript with analysis, figures, and interpretation.
What Requires Human Input
Initialization
Currently, a human must initialize each research session. This guidance can range from minimal to detailed:
- Minimal: "Write a paper and surprise me"
- Moderate: "Evaluate a health policy using difference-in-differences"
- Specific: "Evaluate policy X using dataset Y and method Z"
The degree of human guidance affects what the system produces, but even with minimal guidance, all code is still AI-written.
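To make the "difference-in-differences" prompt concrete, here is a minimal two-period, two-group version of that estimator. The outcome values are invented for illustration and are not from any APE paper:

```python
# A minimal two-period, two-group difference-in-differences estimate.
# Outcome means are hypothetical, keyed by (group, period).
outcomes = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 14.0,
    ("control", "pre"): 9.0,
    ("control", "post"): 11.0,
}

def did_estimate(y):
    """(treated post - treated pre) - (control post - control pre)."""
    treated_change = y[("treated", "post")] - y[("treated", "pre")]
    control_change = y[("control", "post")] - y[("control", "pre")]
    return treated_change - control_change

# The treated group changed by 4.0, the control group by 2.0,
# so the estimated effect is 2.0 (under parallel-trends assumptions).
effect = did_estimate(outcomes)
```

A real paper would estimate this via regression with controls and standard errors; the sketch only shows the core contrast a human's one-line prompt can set in motion.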
Course Correction (Optional)
Humans can provide feedback during the research process — suggesting revisions, pointing out errors, redirecting the analysis. This is possible but not required. Many papers are produced with only the initial guidance.
Revision Mode
After a paper is produced, a human can trigger a revision. Again, this can range from minimal to detailed:
- Minimal: "Revise the paper" — only automated feedback is used
- Detailed: Specific directions on what to fix, expand, or reconsider
There's no limit to how much guidance a human can provide. But even with minimal revision instructions, all new code and text is still AI-written.
In practice, revisions appear to be very helpful — both for reducing errors and for producing more compelling output.
A Thought Experiment
Imagine the only human input is: "Evaluate a policy nobody has evaluated."
The system then:
- Identifies an understudied policy
- Finds and fetches relevant public data
- Designs an identification strategy with reasonable assumptions
- Writes all analysis code
- Produces figures and tables
- Cites the appropriate literature
- Drafts a complete manuscript
Suppose, further, that the code replicates and that human readers find the research valuable.
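The division of labor above can be sketched as a pipeline of stubbed stages. Every function name and return value here is hypothetical — a sketch of the control flow, not APE's implementation:

```python
# Hypothetical orchestration of the thought experiment's steps.
# Each stub stands in for a full AI subsystem.

def identify_policy():           return "understudied_policy_X"
def fetch_data(policy):          return {"policy": policy, "rows": 1000}
def design_identification(data): return "difference-in-differences"
def run_analysis(data, design):  return "results.csv"
def make_figures(results):       return ["fig1.pdf"]
def cite_literature(policy):     return ["Author (2021)"]
def draft_manuscript(*parts):    return {"sections": parts}

def run_session(goal):
    # The human supplies only `goal`; every step below is machine-driven.
    policy = identify_policy()
    data = fetch_data(policy)
    design = design_identification(data)
    results = run_analysis(data, design)
    figures = make_figures(results)
    citations = cite_literature(policy)
    return draft_manuscript(goal, policy, design, results, figures, citations)

paper = run_session("Evaluate a policy nobody has evaluated")
```

Notice that the human input appears exactly once, as the goal string: the method, data, and manuscript are all chosen inside the loop.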
In this scenario, most of the research was done autonomously by the computer. The human provided a goal, not a method. Is this "autonomous research"? We think so. But it's unclear what fraction of cases will look like this. Many papers won't meet this bar: some will have errors, some will be trivial, some will be wrong. That's why we're running this as an experiment with full transparency, not claiming we've solved it.
The Autonomous Vehicles Analogy
The Society of Automotive Engineers defines six levels of driving automation (L0–L5). This framework is useful because it acknowledges that autonomy is a spectrum, not a binary.
- L0 (No Automation): Human does everything; vehicle only provides warnings
- L1–L2 (Driver Assistance): Vehicle helps with some tasks; human must stay fully alert
- L3–L4 (Conditional to High Automation): Vehicle handles most tasks; human available for edge cases
- L5 (Full Automation): No human needed; steering wheel optional
Where is APE? Somewhere in the middle, loosely speaking. But the analogy only goes so far. Cars have a clear success criterion (get from A to B safely); research doesn't. A paper can "run" successfully and still be methodologically flawed, or correct but trivial, or rigorous but wrong. We don't have an equivalent established framework for research autonomy. In fact, taste is probably a big factor: for a policymaker, a policy evaluation is arguably only valuable if it is understandable, intuitive, and actionable.
The "Nines" Problem
We Have No Idea What Reliability Means Here
In AI safety, the "nines of reliability" framework asks: how many 9s of reliability do we need? 99%? 99.9%? 99.999%? For autonomous vehicles, the answer matters because failures kill people. The industry has developed standards, testing regimes, and regulatory frameworks.
For autonomous policy evaluation? Nothing like this exists. We don't even know what metrics to use:
- Replication rate? (How often does the code reproduce the results?)
- Methodological validity? (Is the identification strategy sound?)
- Error rate? (How often are there serious mistakes?)
- Tournament performance? (How do papers compare to human benchmarks?)
The stakes are different (a bad paper doesn't kill anyone directly), but bad research can influence policy, waste resources, and erode trust in science. What reliability standard should apply? We genuinely don't know. This is an open problem we're grappling with, not one we've solved.
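For intuition, the "nines" framing maps a reliability level to a count of nines, and the candidate metrics above could in principle be tracked over a corpus of papers. The thresholds and paper records below are illustrative inventions, not proposed standards or real APE data:

```python
import math

def nines(reliability):
    """Count of 'nines' in a reliability level: 0.99 -> 2, 0.999 -> 3."""
    return -math.log10(1.0 - reliability)

# Hypothetical per-paper records: did the code replicate, and did
# review find a serious mistake? Invented data for illustration.
papers = [
    {"replicated": True,  "serious_error": False},
    {"replicated": True,  "serious_error": True},
    {"replicated": False, "serious_error": True},
]

replication_rate = sum(p["replicated"] for p in papers) / len(papers)
error_rate = sum(p["serious_error"] for p in papers) / len(papers)
```

Even this toy dashboard shows the open problem: replication rate and error rate can disagree (a paper can replicate perfectly and still be wrong), and nobody yet knows how many nines either metric should hit.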