The Problem: Context is Expensive
Modern LLM agents can implement complex changes across dozens of files autonomously. But this requires vast numbers of tokens, creating two critical problems:
- Linear cost scaling: Per-token costs rise linearly with context length
- Context rot: Even the best models' performance drops as contexts grow longer
Current solutions like Claude Code and OpenAI's Codex use scaffolding—a succession of agents connected by prompts and file states, with LLM summarization to compress context. But this is just one approach.
Context Folding: A Different Approach
Instead of external files and summaries, context folding manages the context window itself to keep it short while maintaining a continual, growing rollout. It's compatible with file-based scaffolding—from the outside, it just looks like a normal LLM.
Existing Context Folding Methods:
- Branching: The agent can branch its rollout and return from branches. Within a branch, it retains full context; after returning, only a self-chosen summary remains.
- Hierarchical summaries: Every action produces both a result and a summary of the action and reasoning. Summaries can be hierarchical, consolidating lessons from multiple actions.
- Three-agent systems: A Generator (creates the rollout), a Reflector (takes lessons), and a Curator (adapts the knowledge base).
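The branch-and-return idea can be made concrete with a short sketch. This is a hypothetical illustration, not any paper's actual implementation: `run_branch` and `fold` are invented names, and the agent is a stub. The point is that the branch's working context is discarded and only a one-line summary flows back into the main rollout.

```python
# Hypothetical sketch of the branch-and-return folding pattern: the agent
# explores inside a branch with full context, then only a self-chosen
# summary survives back in the main rollout.

def run_branch(context: list[str], task: str, agent) -> str:
    """Explore `task` with full context; return only a summary."""
    branch_context = context + [f"BRANCH: {task}"]
    agent(branch_context)  # the agent works freely inside the branch
    # On return, the branch context is discarded; one summary line remains.
    return f"SUMMARY({task}): key findings only"

def fold(context: list[str], tasks: list[str], agent) -> list[str]:
    """Main rollout grows by one summary line per branch, not per action."""
    for task in tasks:
        context = context + [run_branch(context, task, agent)]
    return context
```

However much work happens inside each branch, the main context grows by a single line per branch, which is what keeps the rollout short.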
The RLM Solution: Self-Managing Context
Prime Intellect believes the Recursive Language Model (RLM) is the simplest and most flexible of these methods. It was introduced by Alex Zhang in October 2025 and is now available as a full paper.
How RLM Works:
Rather than ingesting potentially huge input data directly, the RLM uses a persistent Python REPL to inspect and transform input, and call sub-LLMs from within Python.
❌ Traditional Approach
- Stuff all data into context
- Process everything sequentially
- Summarize to compress
- Lose information
✅ RLM Approach
- Access data programmatically
- Delegate to sub-LLMs
- Search & filter with Python
- Preserve all information
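The contrast above can be sketched in a few lines. This is a minimal, hypothetical illustration of the RLM pattern: `sub_llm` and `rlm_step` are invented stand-ins, not the actual API. The root model never loads the raw document; it filters it with Python and pipes only a focused slice to a fresh sub-LLM.

```python
# Hypothetical sketch of the RLM pattern: the root model queries the input
# from a Python REPL instead of ingesting it, then delegates to a sub-LLM.

def sub_llm(prompt: str) -> str:
    """Stand-in for spawning a fresh sub-LLM on a short, focused prompt."""
    return f"answer based on {len(prompt)} chars"

def rlm_step(document: str, question: str) -> str:
    # 1. Filter with Python instead of reading everything into context.
    relevant = [line for line in document.splitlines()
                if "error" in line.lower()]
    # 2. Delegate only the focused slice to a sub-LLM.
    return sub_llm(question + "\n" + "\n".join(relevant[:50]))
```

Nothing is summarized away: the full document remains available in the REPL for later queries, while the root model's context holds only code and short results.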
RLM Capabilities
- No direct data loading: Huge inputs (PDFs, datasets, videos) don't clog the context—the model stays lean and avoids context rot
- Python-powered filtering: Search, filter, and transform context using Python, avoiding redundant processing
- Sub-LLM delegation: Spawn fresh instances of itself to perform work, piping specific data to them programmatically
- Aligns with The Bitter Lesson: More in line with learned approaches than hand-crafted summarization strategies
- No information loss: Never summarizes—delegates instead
Prime Intellect's Implementation
Available in their verifiers repository, with RLM-based environments on the Environments Hub.
Key Enhancements:
The main RLM never sees tool-output tokens; it delegates tool-using work to sub-LLMs. Because many tools produce large volumes of output, this keeps the main model lean.
An llm_batch function processes multiple prompts in parallel, speeding up complex workflows.
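The `llm_batch` name comes from the implementation above, but its internals are not described there; the following is a plausible sketch using a thread pool, with `llm_call` as a placeholder for a single sub-LLM request.

```python
import concurrent.futures

def llm_call(prompt: str) -> str:
    """Placeholder for one sub-LLM call (assumed, not the real API)."""
    return prompt.upper()

def llm_batch(prompts: list[str], max_workers: int = 8) -> list[str]:
    """Run many sub-LLM calls in parallel, preserving input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order prompts were submitted.
        return list(pool.map(llm_call, prompts))
```

A thread pool fits here because sub-LLM calls are I/O-bound network requests, so many can be in flight at once.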
The model provides its answer through a Python dictionary:
- answer["content"]: Can be edited or deleted over multiple turns
- answer["ready"]: Only when set to True does the rollout end
This enables diffusion-style generation of the final answer over the reasoning chain.
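A minimal sketch of this answer-dict protocol, assuming a simplified turn loop (the `rollout` and `turns` names are hypothetical): the model revises `answer["content"]` across turns, and termination happens only when it explicitly sets `answer["ready"] = True`.

```python
# Hypothetical sketch: the rollout ends only when the model says so,
# and the answer can be rewritten at any point before that.

def rollout(turns) -> str:
    answer = {"content": "", "ready": False}
    for turn in turns:
        turn(answer)          # the model may revise or delete content
        if answer["ready"]:   # ending is an explicit model decision
            break
    return answer["content"]
```

Because the answer is mutable state rather than a final emission, the model can refine it incrementally over the reasoning chain, which is the "diffusion-style" generation described above.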
The model can install whatever packages it needs (numpy, scipy, sympy, etc.). All code executes in isolated Sandboxes.
Only 8192 characters of REPL output are shown to the RLM per turn (user-adjustable). This forces the model to use Python and sub-LLMs intelligently rather than dumping everything into context.
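The truncation itself is simple; a sketch of one plausible implementation (the function name and truncation notice are assumptions, not the repository's actual code):

```python
REPL_OUTPUT_LIMIT = 8192  # characters shown to the RLM per turn (adjustable)

def truncate_repl_output(output: str, limit: int = REPL_OUTPUT_LIMIT) -> str:
    """Clip REPL output so the main model's context stays lean."""
    if len(output) <= limit:
        return output
    dropped = len(output) - limit
    return output[:limit] + f"\n...[truncated {dropped} chars]"
```

Seeing the truncation notice nudges the model to slice, filter, or delegate instead of re-printing a large object.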
Why This Matters
Prime Intellect believes teaching models to manage their own context end-to-end through reinforcement learning will be the next major breakthrough, enabling agents to solve long-horizon tasks spanning weeks to months.
Current work focuses on ablations with the RLM scaffolding on existing models called through APIs. Future work will scale RLM training on environments that reward effective very long-horizon reasoning.
The RLM is powerful, flexible, strong at tool use, and perfect for a world where context is a scarce resource.
The Big Picture
RLM represents a shift from "how do we compress context?" to "how do we teach models to actively manage context like a skilled developer?"
Instead of fighting context limits with bigger windows or lossy summaries, RLMs embrace the constraint and learn to work within it—delegating, filtering, and focusing programmatically. It's scaffolding that scales with learning, not just engineering.