Alibaba’s AgentEvolver boosts tool-use model performance by roughly 30% using automatically generated synthetic tasks

Researchers at Alibaba’s Tongyi Lab have developed a new framework for self-evolving agents that create their own training data by exploring their application environments. The framework, AgentEvolver, uses the knowledge and reasoning capabilities of large language models for autonomous learning, addressing the high cost and manual effort typically required to collect task-specific datasets.

Experiments show that compared to traditional reinforcement learning-based frameworks, AgentEvolver is more efficient in exploring its environment, makes better use of data, and adapts faster to application environments. For businesses, this is important because it lowers the barrier to training agents for custom applications, making powerful, personalized AI assistants more accessible to a broader range of organizations.

The high cost of training AI agents

Reinforcement learning has become an important paradigm for training LLMs to act as agents that can interact with digital environments and learn from feedback. However, developing agents with RL faces fundamental challenges. First, collecting the necessary training datasets is often prohibitively expensive and requires a large amount of manual work to create task examples, especially in novel or proprietary software environments where no ready-made datasets are available.

Second, RL techniques commonly used for LLMs require the model to be run through a large number of trial and error attempts to learn effectively. This process is computationally expensive and inefficient. As a result, training capable LLM agents through RL remains laborious and expensive, limiting its implementation in customized enterprise environments.

How AgentEvolver works

The main idea behind AgentEvolver is to give models more autonomy in their own learning process. The researchers describe it as a “system of self-evolving agents” designed to “achieve autonomous and efficient evolution of capabilities through environmental interaction.” It uses the reasoning power of an LLM to create a self-training cycle, allowing the agent to continually improve by directly interacting with its target environment without the need for predefined tasks or reward functions.

“We envision an agent system in which the LLM actively guides exploration, task generation, and performance refinement,” the researchers wrote in their paper.

The process of self-evolution is driven by three central mechanisms that work together.

The first is self-questioning, where the agent explores its environment to discover the limits of its functions and identify useful states. It’s like a new user clicking around an app to see what’s possible. Based on this exploration, the agent generates its own diverse set of tasks that align with the user’s general preferences. This reduces the need for handcrafted datasets and allows the agent and its tasks to co-evolve, progressively enabling it to handle more complex challenges.
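The exploration-then-task-generation loop described above can be sketched in a few lines. This is an illustrative toy, not AgentEvolver’s actual API: `MockEnv`, `list_tools`, and `propose_tasks` are hypothetical names, and a real system would ask an LLM to invent tasks rather than fill a template.

```python
import random

class MockEnv:
    """Stand-in application environment exposing a few tools."""
    TOOLS = {
        "search_orders": "Find orders matching a query",
        "refund_order": "Issue a refund for an order id",
        "send_email": "Send an email to a customer",
    }

    def list_tools(self):
        return dict(self.TOOLS)

def propose_tasks(discovered_tools, n=3, seed=0):
    """Turn discovered capabilities into candidate training tasks.
    A real system would prompt an LLM; here we template for illustration."""
    rng = random.Random(seed)
    names = sorted(discovered_tools)
    tasks = []
    for _ in range(n):
        a, b = rng.sample(names, 2)
        tasks.append(f"Complete a workflow that uses {a} and then {b}.")
    return tasks

env = MockEnv()
tools = env.list_tools()      # exploration: what can the agent do here?
tasks = propose_tasks(tools)  # self-questioning: invent its own tasks
for t in tasks:
    print(t)
```

The key design point is that the environment itself, not a human annotator, is the source of task diversity: the agent can only propose tasks over capabilities it has actually observed.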

According to Yunpeng Zhai, a researcher at Alibaba and co-author of the paper, who spoke to VentureBeat, the self-questioning mechanism effectively converts the model from a “data consumer to a data producer,” dramatically reducing the time and cost required to deploy an agent in a proprietary environment.

The second mechanism is self-navigation, which improves exploration efficiency by reusing and generalizing past experiences. AgentEvolver extracts information from successful and failed attempts and uses it to guide future actions. For example, if an agent tries to use an API function that does not exist in an application, it logs the failure as an experience and learns to verify that functions exist before attempting to use them in the future.
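The failed-API example above amounts to a simple experience memory consulted before acting. The sketch below uses hypothetical names (`ExperienceMemory`, `call_api`) and a hard-coded tool list purely for illustration; it is not the paper’s implementation.

```python
class ExperienceMemory:
    """Records (action, success) pairs so past failures steer future choices."""
    def __init__(self):
        self._records = []  # list of (action, success, note)

    def log(self, action, success, note=""):
        self._records.append((action, success, note))

    def known_failure(self, action):
        return any(a == action and not ok for a, ok, _ in self._records)

AVAILABLE_APIS = {"get_user", "list_orders"}

def call_api(name, memory):
    """Attempt an API call, consulting memory to skip known-bad actions."""
    if memory.known_failure(name):
        return f"skipped {name}: previously failed"
    if name not in AVAILABLE_APIS:
        memory.log(name, False, "no such API")
        return f"error: {name} does not exist"
    memory.log(name, True)
    return f"ok: {name}"

mem = ExperienceMemory()
first = call_api("delete_user", mem)   # fails and is recorded
second = call_api("delete_user", mem)  # memory prevents a repeat attempt
third = call_api("get_user", mem)      # valid call succeeds
print(first, second, third, sep="\n")
```

The payoff is fewer wasted trial-and-error rollouts: every failure is paid for once and then amortized across all future exploration.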

The third mechanism, self-attributing, improves learning efficiency by providing more detailed feedback. Instead of a single final success-or-failure signal (a common practice in RL that results in sparse rewards), this mechanism uses an LLM to evaluate the contribution of each individual action in a multi-step task. It retrospectively determines whether each step contributed positively or negatively to the final result, providing the agent with fine-grained feedback that accelerates learning.
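Step-level credit assignment of this kind can be sketched as blending a sparse terminal reward with dense per-step judgments. In the toy below, `judge_step` is a heuristic stand-in for the LLM judge, and the names and 50/50 blend are assumptions for illustration, not the paper’s formula.

```python
def judge_step(step):
    """Score one step as helpful (+1) or harmful (-1).
    Placeholder for an LLM-based judge of the step's contribution."""
    return -1.0 if "error" in step["observation"] else 1.0

def attribute_rewards(trajectory, final_reward, step_weight=0.5):
    """Blend the sparse terminal reward with dense per-step judgments,
    so each action gets its own training signal."""
    rewards = []
    for step in trajectory:
        dense = judge_step(step)
        rewards.append((1 - step_weight) * final_reward + step_weight * dense)
    return rewards

trajectory = [
    {"action": "search_orders", "observation": "3 results"},
    {"action": "refund_order",  "observation": "error: order not found"},
    {"action": "refund_order",  "observation": "refund issued"},
]
rewards = attribute_rewards(trajectory, final_reward=1.0)
print(rewards)  # the faulty middle step is penalized despite overall success
```

Note that the second step receives a lower reward even though the task as a whole succeeded, which is exactly the signal a sparse terminal reward cannot provide.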

This is crucial for regulated industries where the way an agent solves a problem is as important as the outcome. “Instead of rewarding a student only for the final answer, we also evaluate the clarity and correctness of each step of their reasoning,” Zhai explained. This improves transparency and encourages the agent to adopt more robust and auditable problem-solving patterns.

“By shifting the training initiative from human-designed processes to LLM-guided self-improvement, AgentEvolver establishes a new paradigm that paves the way toward scalable, cost-effective, and continually improving intelligent systems,” the researchers say.

The team has also developed a practical end-to-end training framework that integrates these three mechanisms. A key part of this framework is the Context Manager, a component that controls the agent’s memory and interaction history. While current benchmarks test a limited number of tools, real enterprise environments can involve thousands of APIs.
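The paper does not detail the Context Manager’s internals, but its stated job, controlling memory and interaction history, typically means keeping the most recent turns within a context budget. The sketch below is a hypothetical minimal design, approximating tokens as whitespace-separated words.

```python
class ContextManager:
    """Keeps an agent's interaction history within a token budget
    (illustrative design, not the paper's actual component)."""
    def __init__(self, max_tokens=50):
        self.max_tokens = max_tokens
        self.history = []  # list of (role, text) turns

    def add(self, role, text):
        self.history.append((role, text))

    def window(self):
        """Return the most recent turns that fit the budget,
        counting whitespace-separated words as a token proxy."""
        kept, used = [], 0
        for role, text in reversed(self.history):
            cost = len(text.split())
            if used + cost > self.max_tokens:
                break
            kept.append((role, text))
            used += cost
        return list(reversed(kept))

cm = ContextManager(max_tokens=6)
cm.add("user", "find all overdue invoices")  # 4 words
cm.add("tool", "12 invoices returned")       # 3 words
cm.add("user", "refund the oldest one")      # 4 words
print(cm.window())  # only the turns that fit the budget survive
```

With thousands of APIs in play, some policy like this is unavoidable: the agent cannot carry its full interaction history, so the manager decides what the model sees at each step.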

Zhai acknowledges that this is a fundamental challenge for the field, but notes that AgentEvolver was designed to scale. “Retrieval in extremely large action spaces will always present computational challenges, but AgentEvolver’s architecture provides a clear path toward scalable tool reasoning in enterprise environments,” he said.

A more efficient path to agent training

To measure the effectiveness of their framework, the researchers tested it on AppWorld and BFCL v3, two benchmarks that require agents to perform long, multi-step tasks using external tools. They used models from Alibaba’s Qwen2.5 family (7B and 14B parameters) and compared their performance with a baseline model trained with GRPO, a popular RL technique used to develop reasoning models such as DeepSeek-R1.

The results showed that integrating all three mechanisms into AgentEvolver led to substantial performance improvements. For the 7B model, the average score improved by 29.4%, and for the 14B model it increased by 27.8% over the baseline. The framework consistently improved the models’ reasoning and task execution capabilities across both benchmarks. The most significant improvement came from the self-questioning module, which autonomously generates diverse training tasks and directly addresses the problem of data sparsity.

Experiments also demonstrated that AgentEvolver can efficiently synthesize a large volume of high-quality training data. The tasks generated by the self-questioning module turned out to be diverse enough to achieve good training efficiency even with a small amount of data.

For enterprises, this provides a path to create agents for custom applications and internal workflows while minimizing the need for manual data annotation. By providing high-level objectives and allowing the agent to generate its own training experiences, organizations can develop personalized AI assistants more simply and cost-effectively.

“This combination of algorithmic design and engineering pragmatics positions AgentEvolver as a research vehicle and reusable foundation for building tool-enhanced adaptive agents,” the researchers conclude.

Looking ahead, the ultimate goal is much greater. “A true ‘singular model’ that can be introduced into any software environment and mastered overnight is undoubtedly the holy grail of agent AI,” Zhai said. “We see AgentEvolver as a necessary step in that direction.” While that future still requires advances in model reasoning and infrastructure, self-evolving approaches are paving the way.
