Who Failed and When? New Benchmark Helps Diagnose Multi-Agent System Breakdowns

When multiple AI agents collaborate on a complex task, things can go wrong in countless ways. A single miscommunication or erroneous step can derail the entire project, leaving developers to sift through mountains of logs to find the culprit. Researchers from Penn State, Duke, Google DeepMind, and other leading institutions have introduced a systematic solution: automated failure attribution. Their work, accepted as a Spotlight presentation at ICML 2025, contributes Who&When, the first benchmark dataset for this task, and evaluates multiple methods for pinpointing which agent, at which step, caused a failure. Below we explore the work in a Q&A format.

Why Is Diagnosing Failures in Multi-Agent Systems So Challenging?

Unlike single-agent systems, multi-agent setups involve autonomous, interdependent decision-making. A mistake by one agent—say, an incorrect calculation—can cascade through the chain, causing downstream agents to act on flawed information. Developers are forced to manually parse lengthy interaction logs, a process the researchers liken to “finding a needle in a haystack.” This manual debugging relies heavily on deep system expertise and becomes nearly impossible as the number of agents and interaction rounds grows. Without automated tools, system iteration and optimization grind to a halt. The new research tackles this pain point directly by framing the problem as automated failure attribution and building the tools to solve it.

What Exactly Is the “Who&When” Dataset?

Who&When is the first dedicated benchmark for the task of automated failure attribution in LLM-based multi-agent systems. The dataset consists of thousands of multi-agent conversation traces where task outcomes are known—some succeeded, some failed. For each failure, the ground truth labels indicate which agent (Who) was responsible and at which interaction step (When) the error occurred. The researchers constructed this dataset by simulating diverse multi-agent scenarios—such as collaborative writing, code generation, and planning—using common LLM backends. Each trace is carefully annotated, providing a gold standard for evaluating attribution methods. Who&When is fully open-source, available on Hugging Face, and designed to accelerate research in this new area.
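Conceptually, each record pairs a conversation trace with ground-truth “Who” and “When” labels. The minimal Python sketch below illustrates that structure; the class and field names are assumptions made for illustration, not the dataset’s actual schema, which is documented in the Hugging Face release.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    """One interaction step in a multi-agent conversation trace."""
    step: int      # position in the trace (the "When" candidate)
    agent: str     # agent that produced this message (the "Who" candidate)
    content: str   # the message text

@dataclass
class FailureTrace:
    """A failed run annotated with ground-truth failure attribution."""
    task: str                     # the task the agents attempted
    turns: List[Turn]             # full conversation log
    failed: bool                  # task outcome
    blamed_agent: Optional[str]   # "Who": agent responsible for the failure
    blamed_step: Optional[int]    # "When": step where the decisive error occurred

# Toy record, invented for illustration (not taken from the real dataset)
trace = FailureTrace(
    task="Compute the total cost of the order",
    turns=[
        Turn(0, "Planner", "Split the task: look up prices, then multiply."),
        Turn(1, "Calculator", "Total = 17 * 3 = 41"),   # the erroneous step
        Turn(2, "Reporter", "The total cost is 41."),
    ],
    failed=True,
    blamed_agent="Calculator",
    blamed_step=1,
)
```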

What Automated Methods Did the Team Develop and Evaluate?

The authors proposed and tested several approaches, ranging from simple heuristic baselines to more sophisticated reasoning-based methods. One approach uses a post-hoc decomposition technique that analyzes the entire failure trajectory to assign blame by tracing back dependencies. Another method leverages the agents’ own introspective capabilities: each agent is prompted to summarize its own actions and flag potential errors. The most effective method, termed “critical path analysis,” constructs a directed graph of information flow and identifies the earliest node where incorrect information was introduced. The team evaluated these methods on the Who&When benchmark, measuring precision, recall, and accuracy in identifying the responsible agent and step. The results show that while simple baselines struggle, the critical path approach achieves substantial gains, though there remains significant room for improvement.
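To make the earliest-faulty-node idea concrete, here is a minimal sketch that reuses the Turn and trace structures from the example above. It is illustrative only, not the authors’ implementation: the depends_on map and the is_incorrect judge are assumed interfaces, with is_incorrect standing in for what would, in practice, be an LLM-based check of each step against the context it consumed.

```python
from typing import Callable, Dict, List, Optional, Tuple

def earliest_faulty_step(
    turns: List[Turn],
    depends_on: Dict[int, List[int]],
    is_incorrect: Callable[[Turn, List[Turn]], bool],
) -> Optional[Tuple[str, int]]:
    """Scan the trace in step order and return the first (agent, step) whose
    output is judged incorrect given only the earlier steps it consumed.

    depends_on[i] lists the earlier steps feeding step i (the information-flow
    edges); is_incorrect is a pluggable judge, in practice an LLM prompt.
    """
    for turn in sorted(turns, key=lambda t: t.step):
        context = [t for t in turns if t.step in depends_on.get(turn.step, [])]
        if is_incorrect(turn, context):
            return turn.agent, turn.step
    return None

# Toy usage with a trivial keyword check standing in for an LLM judge.
toy_judge = lambda turn, ctx: turn.agent == "Calculator" and "41" in turn.content
print(earliest_faulty_step(trace.turns, {1: [0], 2: [1]}, toy_judge))
# -> ('Calculator', 1)
```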

How Was the Research Validated and What Are the Next Steps?

The paper was accepted as a Spotlight presentation at ICML 2025, one of the top machine learning conferences, a strong signal from peer review. All code and data are open-sourced to encourage community participation. The researchers plan to extend the benchmark to cover more diverse tasks and agent architectures, and to explore failure attribution in real time, during agent execution. They also aim to develop methods that not only attribute failures but also suggest corrective actions. The long-term goal is self-healing multi-agent systems that can automatically detect, diagnose, and recover from errors without human intervention.

What Impact Could This Work Have on Multi-Agent System Development?

By providing a systematic way to answer “who failed and when,” this research can dramatically reduce debugging time for developers building LLM-powered agent teams. Faster attribution means faster iteration, leading to more robust and reliable systems. The benchmark also sets a standard for comparing future attribution methods, much like ImageNet did for computer vision. Practical applications include automated customer support pipelines, collaborative code generation, and multi-modal reasoning systems. Ultimately, this work paves the way toward trustworthy multi-agent AI that can explain its own failures, an essential step for deploying such systems in high-stakes domains like healthcare or finance.

Which Institutions and Researchers Are Behind This Work?

This collaborative effort involves Penn State University and Duke University as leading institutions, with co-first authors Shaokun Zhang (Penn State) and Ming Yin (Duke). Additional contributors come from Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University. The interdisciplinary team combines expertise in machine learning, software engineering, and multi-agent systems. Their joint effort underscores the growing recognition of failure attribution as a critical problem in the field. The paper is available on arXiv and the code on GitHub.
