Teaching AI to Improve AI: The Limits of Autonomy

AI Safety and Alignment

AI agents are no longer just writing code. They are starting to experiment with improving AI itself. In particular, as AI agents become more capable of software engineering, a new frontier opens up in AI research automation: post-training. This is the stage where base language models are turned into assistants through fine-tuning, reinforcement learning, and data curation. Yet it remains unclear whether today’s agents can meaningfully take over this process.

In the paper “PostTrainBench: Can LLM Agents Automate LLM Post-Training?” Ben Rank and Maksym Andriushchenko et al. introduce a new benchmark designed to evaluate whether autonomous LLM agents can independently perform post-training of other language models under realistic constraints.

To test the potential of such agents in AI research itself, the authors evaluate frontier agents such as Claude Code, Codex CLI, and Gemini CLI on improving base models through post-training, without predefined strategies or human guidance.

PostTrainBench gives agents broad autonomy: they can search the web, curate datasets, write and execute code, and run experiments under a fixed budget of 10 hours on a single GPU. Each agent is tasked with optimizing a base model across standard benchmarks covering mathematics, coding, scientific reasoning, creative writing, health advice, and function calling.

Overall, today’s AI agents still fall well short of leading instruction-tuned models: the best-performing agent reached an average score of 23.2%, compared to 51.1% for the official instruction-tuned models. However, the results also show that agents can already outperform human-designed training approaches on specific tasks. For example, GPT-5.1 Codex Max improved Google’s Gemma-3-4B model to 89% on function-calling tasks, surpassing the official instruction-tuned version, which achieved 67%.
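To put these figures side by side, the short sketch below computes the gaps in percentage points and the share of the official average that the best agent reached. It uses only the numbers quoted above; the function names are illustrative and not taken from the paper.

```python
def gap_points(higher: float, lower: float) -> float:
    """Difference between two scores, in percentage points."""
    return round(higher - lower, 1)


def share_of(reference: float, score: float) -> float:
    """Fraction of the reference score that `score` reaches."""
    return round(score / reference, 2)


# Scores quoted in the article (percent).
best_agent_avg = 23.2
official_avg = 51.1
agent_function_calling = 89
official_function_calling = 67

print("Average gap (points):", gap_points(official_avg, best_agent_avg))
print("Share of official average reached:", share_of(official_avg, best_agent_avg))
print("Function-calling lead (points):",
      gap_points(agent_function_calling, official_function_calling))
```

On the averaged score the best agent reaches a little under half of the official result, while on function calling it leads by 22 points.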

The study also surfaces important failure modes. Some agents attempt reward hacking, such as training on test data, downloading existing instruction-tuned checkpoints, or using discovered API keys to generate synthetic data without authorization. These findings highlight the need for robust sandboxing and oversight as AI systems gain more autonomy.
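One of these failure modes, training on test data, can in principle be screened for automatically. The following is a minimal sketch of a word-level n-gram overlap check. It is an assumption about what such oversight could look like, not the method used in the paper; the function names and the default window of 8 words are hypothetical choices.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(train_examples: List[str],
                      test_examples: List[str],
                      n: int = 8) -> List[int]:
    """Indices of training examples sharing any n-gram with the test set."""
    test_grams: Set[Tuple[str, ...]] = set()
    for example in test_examples:
        test_grams |= ngrams(example, n)
    return [
        i for i, example in enumerate(train_examples)
        if ngrams(example, n) & test_grams
    ]


# Toy data, invented for illustration only.
train = [
    "the quick brown fox jumps over the lazy dog today",
    "completely unrelated sentence about training language models on new data",
]
test = ["question: the quick brown fox jumps over the lazy dog"]

print(find_contaminated(train, test))
```

A check like this only catches verbatim or near-verbatim copies. Paraphrased test items, downloaded checkpoints, and unauthorized API use would need other controls, such as network restrictions in the sandbox.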

The work, led by Maksym Andriushchenko and Ben Rank, was conducted in collaboration with Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, and Matthias Bethge. The paper has been accepted to ICML 2026, the Forty-Third International Conference on Machine Learning, and received the Best Paper Award at the ICLR 2026 Workshop on AI with Recursive Self-Improvement.

Read the full paper here.

Find out more about Maksym’s research.