A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples

- LIMI (Less Is More for Agency): A novel supervision method for AI agents that achieves superior performance with significantly less data (78 samples vs. 10,000+).
- Quality Over Quantity: LIMI’s success stems from the “Agency Efficiency Principle,” prioritizing high-fidelity, long-horizon, tool-use trajectories over sheer volume of training data.
- Meticulous Data Curation: Training data consists of carefully crafted, multi-turn workflows (average 42,400 tokens) capturing internal reasoning, tool calls, and environment responses, ensuring dense, actionable learning signals.
- Exceptional Performance & Generalization: LIMI-trained agents scored 73.5% on AgencyBench, outperforming baselines by substantial margins, and demonstrated robust generalization across diverse benchmarks.
- Real-World Impact: This approach enables the development of more autonomous and efficient AI agents for complex tasks like software development and research, reducing resource intensity and accelerating AI capabilities.
- The Paradigm Shift: Quality Over Quantity in Agent Training
- Unpacking LIMI’s Methodology: How Less Achieves More
- Astounding Results: Outperforming Baselines with Unprecedented Efficiency
- Real-World Application: An AI Agent for Streamlined Software Development
- Actionable Steps for Adopting a “Less Is More” Approach
- Conclusion
- FAQ: Frequently Asked Questions
The quest for truly autonomous and capable AI agents often seems like a monumental task, demanding vast oceans of data and computational power. Traditional approaches have leaned heavily on scaling up the volume of training data, assuming that more examples inherently lead to better performance. However, a groundbreaking new research initiative challenges this paradigm, proposing that when it comes to training highly effective software AI agents, quality trumps quantity.
Imagine training a sophisticated AI agent for complex software development or research tasks with an astonishingly small dataset. This isn’t a futuristic fantasy; it’s the core of a new methodology poised to revolutionize how we build AI agents. This innovation promises to unlock a new era of efficiency and efficacy in AI development, making advanced agent capabilities more accessible and less resource-intensive.
“Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using 78 samples. LIMI scores 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples—with 128× less data.”
The Paradigm Shift: Quality Over Quantity in Agent Training
At the heart of LIMI’s success lies the “Agency Efficiency Principle,” a fundamental rethinking of how agentic competence scales. Instead of prioritizing raw sample count, LIMI champions the belief that an agent’s capability scales more effectively with the quality and structure of its training data. This principle suggests a profound shift from a data-hungry approach to one that is data-wise, focusing on the informational density and practical relevance of each training example.
What exactly does “quality” mean in this context? For LIMI, it translates into “minimal but dense supervision.” The research team fine-tuned base models like GLM-4.5 and GLM-4.5-Air on just 78 long-horizon, tool-use trajectories. Each trajectory is not a simple command or a short exchange; it’s a meticulously crafted, complete multi-turn workflow. These comprehensive sequences, ranging from approximately 13,000 to 152,000 tokens (with an average of about 42,400 tokens), encapsulate the full spectrum of an agent’s activity.
This includes the model’s internal reasoning, explicit tool calls, and observed environment responses—all meticulously captured within the SII-CLI execution environment. The tasks themselves are drawn from complex, real-world scenarios, spanning “vibe coding” (interactive software development) and intricate research workflows that involve search, detailed analysis, and experimental design. This holistic capture of an agent’s problem-solving journey provides an incredibly rich and efficient learning signal, allowing the model to internalize sophisticated planning and execution strategies with minimal exposure.
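The paper's exact data schema is not reproduced here, but the structure described above — interleaved reasoning, tool calls, and environment responses inside one multi-turn workflow — can be sketched as follows. All field names and the example content are illustrative assumptions, not LIMI's actual format:

```python
# Illustrative sketch of one LIMI-style training trajectory.
# Field names and contents are assumptions, not the paper's actual schema.
trajectory = {
    "query": "Fix the failing unit test in the payments module",
    "turns": [
        {"role": "assistant",
         "reasoning": "The test fails on rounding; inspect the helper first.",
         "tool_call": {"name": "read_file", "args": {"path": "payments/utils.py"}}},
        {"role": "environment",
         "response": "def round_cents(x): return int(x * 100) / 100"},
        {"role": "assistant",
         "reasoning": "That truncates instead of rounding. Patch it and re-run the tests.",
         "tool_call": {"name": "run_tests", "args": {"target": "payments"}}},
        {"role": "environment", "response": "4 passed, 0 failed"},
    ],
}

# "Dense supervision": every assistant turn pairs reasoning with an action,
# and every action is followed by observed environment feedback.
n_actions = sum(1 for t in trajectory["turns"] if "tool_call" in t)
print(n_actions)  # 2
```

A real trajectory in the study would be far longer (13k–152k tokens), but the per-turn shape — reason, act, observe — is what makes each example a rich learning signal.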
Unpacking LIMI’s Methodology: How Less Achieves More
The effectiveness of LIMI isn’t magic; it’s the result of a carefully designed methodology that leverages specific base models and a rigorous data construction process. The experiments primarily utilized powerful large language models: GLM-4.5 (355B parameters) and its lighter variant, GLM-4.5-Air (106B parameters). To ensure the integrity of their findings and isolate the impact of their data, the training employed the ‘slime SFT’ (Supervised Fine-Tuning) framework, maintaining identical configurations across all comparative analyses.
The data construction phase is where LIMI truly shines. The 78 training samples are not arbitrary; they are a strategic blend of 60 real-world queries submitted by expert practitioners and 18 queries synthesized from high-starred GitHub Pull Requests. This ensures the data reflects genuine challenges faced in professional software development and research. Crucially, each of these queries underwent tight quality assurance by PhD annotators, guaranteeing accuracy and relevance.
For every single query, LIMI diligently logged the full agent trajectory, from the initial prompt to successful completion, all within the controlled and capable SII-CLI execution environment. This meticulous logging captures not just the final answer, but every step of the agent’s thought process, tool interaction, and environmental feedback. This level of detail in data collection is paramount to the method’s success, providing dense, actionable intelligence for the fine-tuning process.
Evaluation of LIMI’s performance was equally comprehensive. The primary benchmark was AgencyBench, which assesses agent capabilities over three rounds (R=3) using metrics such as first-turn functional completeness (FTFC), success rate within three rounds (SR@3), and remaining chances at round three (RC@3). Beyond AgencyBench, the researchers also evaluated generalization across a suite of established benchmarks, including TAU2-bench, EvalPlus-HE/MBPP, DS-1000, and SciCode, demonstrating the agents’ robustness across diverse applications.
Astounding Results: Outperforming Baselines with Unprecedented Efficiency
The empirical results of the LIMI approach are compelling, demonstrating significant advancements in AI agent capabilities through an exceptionally data-efficient method. On AgencyBench, LIMI achieved an impressive average score of 73.5%, marking a substantial improvement of +28.4 percentage points over the baseline GLM-4.5 (45.1%). Delving into the sub-metrics, LIMI scored 71.7% for FTFC compared to GLM-4.5’s 37.8%, and 74.6% for SR@3 against GLM-4.5’s 47.4%. These figures underscore a profound enhancement in the agent’s ability to confidently follow instructions, self-correct, and comprehend retrieved information.
Perhaps the most striking finding is the sheer data efficiency. LIMI, trained on a mere 78 samples, not only beat strong baselines but also outperformed GLM-4.5 variants trained on vastly larger datasets. For instance, LIMI’s 73.5% average on AgencyBench significantly surpasses the 47.8% achieved by GLM-4.5 trained on AFM-CodeAgent SFT, which used 10,000 samples. That is a gain of 25.7 percentage points, a roughly 53.8% relative improvement, achieved with 128 times less data. Similar gaps were observed against AFM-WebAgent (7,610 samples) and CC-Bench-Traj (260 samples), firmly establishing LIMI as a leader in data-efficient agent training.
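The headline efficiency figures are easy to verify with a line of arithmetic:

```python
# Sanity-check the data-efficiency numbers quoted above.
limi_samples, baseline_samples = 78, 10_000
ratio = baseline_samples / limi_samples
print(round(ratio))  # 128, i.e. "128x less data"

limi_score, baseline_score = 73.5, 47.8  # AgencyBench averages (%)
absolute_gain = limi_score - baseline_score            # in percentage points
relative_gain = absolute_gain / baseline_score * 100   # in percent
print(round(absolute_gain, 1), round(relative_gain, 1))  # 25.7 53.8
```

So the oft-quoted "~53.7%" figure is a relative improvement over the 10,000-sample baseline, not an absolute one; the absolute gap is about 25.7 points.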
The method also demonstrated excellent generalization capabilities. Across a diverse range of tasks encompassing tool-use, coding, and scientific computing, LIMI averaged approximately 57% performance, consistently exceeding GLM-4.5 and other baselines. Even in scenarios where tool access was explicitly removed, LIMI still maintained a slight lead (50.0% compared to GLM-4.5’s 48.7%), indicating that the fine-tuning process instills intrinsic gains in the agent’s reasoning and problem-solving abilities, independent of its external tools.
These “across-metric gains” on AgencyBench, alongside strong performance on generalization suites like TAU2, EvalPlus-HE/MBPP, DS-1000, and SciCode (averaging 57.2%), confirm the robustness and versatility of agents trained with LIMI. Furthermore, the approach proved scalable across different model sizes, with both GLM-4.5 (355B) and GLM-4.5-Air (106B) showing significant performance deltas over their base versions, affirming the method’s effectiveness regardless of model scale.
Real-World Application: An AI Agent for Streamlined Software Development
Consider a software development team grappling with a complex bug fix or implementing a new feature that requires navigating multiple codebases, API documentation, and debugging tools. Traditionally, an AI agent might struggle with the multi-step reasoning, tool orchestration, and error recovery needed for such a task. With LIMI’s approach, an agent could be trained on a handful of meticulously curated trajectories that demonstrate an expert developer’s workflow: identifying the bug, searching relevant files, proposing code changes, running tests, interpreting outputs, and iteratively refining the solution.
Instead of merely generating code snippets, this LIMI-trained agent would understand the entire “vibe” of the development process. It could autonomously reason, call a linter, query a knowledge base, run unit tests, and even suggest alternative strategies if an initial attempt fails. This ability to handle long-horizon, tool-grounded workflows makes the agent an invaluable co-pilot, significantly reducing development cycles and allowing human developers to focus on higher-level architectural challenges rather than repetitive debugging or boilerplate coding.
Actionable Steps for Adopting a “Less Is More” Approach
The success of LIMI offers valuable insights for anyone looking to develop more capable and efficient AI agents. Here are three actionable steps informed by the “Agency Efficiency Principle”:
- Prioritize High-Fidelity, Long-Horizon Data Collection: Instead of collecting massive amounts of fragmented data, focus on capturing complete, end-to-end task trajectories. Ensure these trajectories represent complex, multi-step problem-solving processes, including all intermediate thoughts, tool interactions, and environmental feedback. Invest in capturing quality over sheer volume.
- Emphasize Structured, Multi-Turn Tool-Use Workflows: Design your data collection to explicitly demonstrate how an agent should orchestrate multiple tools, perform iterative reasoning, and recover from failures. The data should teach the agent not just what to do, but how to plan, execute, and verify its actions within an interactive environment, much like the SII-CLI.
- Invest in Expert Human Curation and Annotation: The quality of your training data will directly impact agent performance. Engage subject matter experts (e.g., PhD annotators in the LIMI study) to review, refine, and potentially synthesize trajectories. Their domain knowledge ensures that the training examples are not only accurate but also reflect optimal, intelligent problem-solving strategies.
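The three steps above amount to a quality gate on every candidate trajectory. A minimal sketch of such a gate is below; the thresholds (taken from the token range reported in the study), field names, and checks are illustrative assumptions, not LIMI's actual curation pipeline:

```python
# Hypothetical completeness checks for a curated trajectory before it
# enters the training set. Field names and thresholds are illustrative.
def validate_trajectory(traj: dict,
                        min_tokens: int = 13_000,
                        max_tokens: int = 152_000) -> list[str]:
    problems = []
    tokens = traj.get("token_count", 0)
    if not (min_tokens <= tokens <= max_tokens):
        problems.append("token count outside expected long-horizon range")
    turns = traj.get("turns", [])
    if not any("reasoning" in t for t in turns):
        problems.append("no internal reasoning captured")
    if not any("tool_call" in t for t in turns):
        problems.append("no tool calls captured")
    if not any(t.get("role") == "environment" for t in turns):
        problems.append("no environment responses captured")
    if not traj.get("expert_reviewed", False):
        problems.append("missing expert (annotator) sign-off")
    return problems

sample = {
    "token_count": 42_400,
    "expert_reviewed": True,
    "turns": [
        {"role": "assistant", "reasoning": "...", "tool_call": {"name": "run_tests"}},
        {"role": "environment", "response": "ok"},
    ],
}
print(validate_trajectory(sample))  # [] -- passes every check
```

Automating checks like these lets human experts spend their review time on the substance of the workflow rather than on structural bookkeeping.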
Conclusion
The LIMI research from Shanghai Jiao Tong University and SII Generative AI Research Lab marks a significant milestone in the development of AI agents. By demonstrating that an agency-focused supervision approach, leveraging an incredibly small yet dense dataset of 78 curated trajectories, can outperform models trained on hundreds or thousands of times more data, it fundamentally reshapes our understanding of agent training.
This “Less Is More for Agency” principle underscores the critical role of data quality, structure, and expert curation in building highly competent and generalizable AI agents. The ability to achieve 73.5% average on AgencyBench, with strong performance across various metrics and generalization suites, using such minimal data, presents a compelling case for a more efficient and impactful future for AI development. As the research team highlights, these multi-turn, token-dense trajectories, emphasizing planning, tool orchestration, and verification, unlock intrinsic gains in agent intelligence, paving the way for more robust and autonomous software and research assistants.
Check out the Paper, GitHub Page and Model Card on Hugging Face.
The post A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples appeared first on MarkTechPost.
FAQ: Frequently Asked Questions
What is the core principle behind LIMI’s success?
LIMI’s success is rooted in the “Agency Efficiency Principle,” which posits that an agent’s capability scales more effectively with the quality and structure of its training data rather than just the quantity. It emphasizes “minimal but dense supervision” using meticulously crafted, long-horizon trajectories.
How much data did LIMI use compared to traditional methods?
LIMI achieved its results using only 78 curated training samples, which is an astonishing 128 times less data than some baseline variants that used 10,000 samples, yet it still significantly outperformed them.
What kind of data constitutes a “high-fidelity, long-horizon” trajectory?
These trajectories are complete multi-turn workflows, ranging from 13,000 to 152,000 tokens. They encapsulate the agent’s full activity, including its internal reasoning, explicit tool calls, and observed environment responses, drawn from complex real-world scenarios like interactive software development (“vibe coding”) and intricate research.
What are the practical implications of LIMI for AI development?
LIMI promises to make advanced AI agent capabilities more accessible and less resource-intensive. It allows for training sophisticated agents for complex tasks like software development, bug fixing, and research with significantly less data, accelerating development cycles and enabling human experts to focus on higher-level problems.
Does LIMI’s approach apply only to large models?
No, the research demonstrated that LIMI’s approach is scalable across different model sizes. Both the larger GLM-4.5 (355B parameters) and the lighter GLM-4.5-Air (106B parameters) showed significant performance improvements over their base versions when trained with LIMI, affirming its effectiveness regardless of model scale.