Introduction
Large Language Model (LLM) multi-agent systems have become a popular paradigm for tackling complex tasks through collaborative intelligence. Multiple autonomous agents work together, each contributing specialized capabilities. However, these systems are notoriously brittle: a single misstep by one agent can cascade into a full task failure. Developers then face the daunting task of identifying which agent caused the failure and at what point in the process, a needle-in-a-haystack search that consumes hours of manual log inspection and demands deep domain expertise.

To address this, a collaborative team from Penn State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University has introduced the concept of Automated Failure Attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides the first benchmark dataset, Who&When, and evaluates multiple automated attribution methods. The research paves a new path toward making multi-agent systems more reliable and easier to debug.
The Challenge of Debugging Multi-Agent Systems
LLM-driven multi-agent systems excel in domains like software development, research analysis, and creative writing. Yet their very strength — autonomous, loosely coupled collaboration — also creates fragility. Failures can stem from any of three sources:
- Agent-level errors: An individual agent misinterprets its role or generates incorrect output.
- Inter-agent miscommunication: Agents fail to share information accurately or in a timely manner.
- Information transmission faults: Data is lost or corrupted as it moves between agents.
Currently, debugging these systems relies on manual methods often described as 'log archaeology.' Developers must comb through lengthy interaction logs, each containing thousands of lines of dialogue and intermediate results. The debugging process requires deep familiarity with the system's design, making it not only time-consuming but also inaccessible to newcomers. Without a systematic way to pinpoint failure origins, iterative improvement becomes nearly impossible.
Introducing Automated Failure Attribution
To overcome these limitations, the research team formalized the problem of Automated Failure Attribution: given a multi-agent system and a failed task, automatically determine which agent is responsible and at which step the failure occurred. This goes beyond simple error detection — it requires understanding the causal chain of actions in a multi-step, multi-agent interaction.
The team proposes that an ideal attribution method should be (a) precise, identifying the specific agent and the failure point, (b) efficient, requiring minimal computational overhead, and (c) generalizable across different system architectures. They explore multiple approaches, including log-based heuristics, LLM-based reasoning, and graph traversal techniques.
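Formally, an attribution method is a function from a failed run's interaction log to an (agent, step) pair. A minimal sketch of this interface follows; the names are invented for illustration, since the paper does not prescribe a particular API. The naive baseline of blaming the final step is included only to make the interface concrete.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One entry in a multi-agent interaction log (illustrative schema)."""
    index: int    # position in the conversation
    agent: str    # name of the agent that produced this entry
    content: str  # the message or intermediate result

def attribute_failure(log: list[Step]) -> tuple[str, int]:
    """Return (responsible_agent, failing_step_index) for a failed run.

    Placeholder: a real attribution method (heuristic, LLM-based, or
    graph-based) would analyze the whole log; this naive baseline blames
    the final step, which is usually wrong once errors propagate.
    """
    last = log[-1]
    return last.agent, last.index
```

Any of the approaches discussed below can be seen as progressively smarter implementations of this one function.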
The Who&When Benchmark Dataset
A critical contribution of this research is the creation of the Who&When benchmark dataset. This dataset is designed to evaluate failure attribution methods in a controlled, reproducible manner. It includes:
- Multiple multi-agent system configurations (varying number of agents, roles, and communication patterns).
- A diverse set of tasks (e.g., code generation, report synthesis, question answering) that are prone to failures.
- Ground-truth labels indicating the responsible agent and the failure timestamp, carefully annotated by human experts.
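One way to picture a benchmark record is sketched below; the field names are assumptions for illustration, not the dataset's actual schema. The evaluation helper reflects that an attribution is only correct when both the "who" and the "when" match the ground truth.

```python
from dataclasses import dataclass

@dataclass
class WhoWhenExample:
    """Illustrative structure for one benchmark record
    (field names are assumptions, not the dataset's real schema)."""
    task: str        # the task the system attempted
    log: list[dict]  # ordered entries, e.g. {"agent": ..., "content": ...}
    who: str         # ground-truth responsible agent
    when: int        # ground-truth failing step index

def attribution_accuracy(predictions, examples):
    """Fraction of examples where both agent and step are predicted correctly."""
    correct = sum(
        1 for (agent, step), ex in zip(predictions, examples)
        if agent == ex.who and step == ex.when
    )
    return correct / len(examples)
```
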
The dataset is publicly available on Hugging Face, enabling other researchers to benchmark their own attribution techniques. By providing a standardized testbed, Who&When accelerates progress in this nascent field.
Evaluation of Attribution Methods
The researchers evaluated several automated attribution methods on the Who&When dataset. Their findings highlight the complexity of the problem:
- Heuristic-based methods (e.g., tracking last-error occurrences) performed poorly because failures often propagate, making it hard to distinguish cause from effect.
- LLM-based reasoning methods (prompting a language model to analyze logs) showed moderate success but struggled with long contexts and subtle dependencies between agents.
- Graph-based methods that model agent interactions as a directed graph and trace failure signals backward achieved the highest accuracy, especially when combined with explicit role and message semantics.
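The cause-versus-effect problem with last-error heuristics can be shown on a toy failed run (the log entries below are invented): the Coder introduces the bug, but the Tester is the last agent to *report* an error, so the heuristic blames the wrong agent.

```python
def last_error_attribution(log):
    """Blame the agent whose entry last mentions an error keyword.
    This is the naive heuristic; it confuses symptoms with causes."""
    for entry in reversed(log):
        if "error" in entry["content"].lower():
            return entry["agent"], entry["step"]
    return None

# Toy failed run: the Coder introduces the bug at step 1, but only the
# Tester's message at step 2 contains the word "error".
log = [
    {"step": 0, "agent": "Planner", "content": "Plan: implement is_even(n)."},
    {"step": 1, "agent": "Coder",   "content": "def is_even(n): return n % 2 == 1"},
    {"step": 2, "agent": "Tester",  "content": "AssertionError: is_even(2) failed"},
]
print(last_error_attribution(log))  # ("Tester", 2) -- the symptom, not the cause
```
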
Notably, even the best-performing method left considerable room for improvement, indicating that failure attribution remains an open research problem. The team also observed that attribution accuracy varies markedly with the type of failure: errors in communication are harder to attribute than isolated agent mistakes.
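The graph-based direction described above can be sketched as a backward reachability pass over a dependency graph of log entries. The code below is illustrative, not the paper's algorithm: it only collects the candidate set, whereas a real method would then score candidates using role and message semantics.

```python
def backward_trace(failing_step, parents):
    """Collect every step reachable by walking dependency edges backward
    from the failing step -- the candidate set for the root cause.
    `parents` maps a step index to the step indices it consumed."""
    frontier, seen = [failing_step], set()
    while frontier:
        step = frontier.pop()
        if step not in seen:
            seen.add(step)
            frontier.extend(parents.get(step, []))
    return sorted(seen)

# Toy dependency graph: step 2 (the observed failure) consumed step 1
# (the buggy code), which consumed step 0 (the plan). Step 3 is off the
# failure path, so it is never suspected.
parents = {2: [1], 1: [0], 3: [0]}
print(backward_trace(2, parents))  # [0, 1, 2] -- step 3 is excluded
```
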
Implications for Multi-Agent System Reliability
This work has several practical implications. First, it provides developers with tools to quickly diagnose failures, reducing debugging time from hours to minutes. Second, the Who&When dataset serves as a common benchmark, fostering competition and collaboration in the community. Third, the research opens up new questions:
- How can attribution methods be made causal rather than correlational?
- Can failure attribution be integrated with runtime monitoring to prevent failures in the first place?
- What are the ethical implications of automatically blaming an agent for a failure, especially in systems that learn from past mistakes?
The code and dataset are fully open-sourced, available at the links below, encouraging further exploration.
Conclusion
As LLM-based multi-agent systems become more prevalent, the need for robust debugging tools grows. The introduction of Automated Failure Attribution by the Penn State-Duke-led team marks a significant step forward. By formalizing the problem, creating a benchmark, and evaluating initial methods, they have laid a foundation for more reliable and easier-to-maintain multi-agent systems. Future work will likely refine these techniques, making failure attribution as routine as testing a single-agent system today.