[WIP] Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
link: https://arxiv.org/pdf/2502.02533v1
Han Zhou¹ ², Xingchen Wan¹, Ruoxi Sun¹, Hamid Palangi¹, Shariq Iqbal¹, Ivan Vulić¹ ², Anna Korhonen² and Sercan Ö. Arık¹ ¹Google, ²University of Cambridge
(1) A: Welcome everyone to today's lab meeting. We have a presentation from our colleague, the author of a new paper on multi-agent systems. This work explores how to optimize these systems through better prompts and topologies. Let's get started with the author's presentation.
(2) Author: Thank you for the opportunity to present our work on Multi-Agent System Search, or MASS for short. I'll spend the next few minutes giving an overview of our paper.
Large language models, when employed as multiple agents that interact and collaborate with each other, have shown remarkable abilities in solving complex tasks. These agents are programmed with prompts that define their functionality, along with topologies that orchestrate interactions across agents. However, designing effective prompts and topologies for multi-agent systems is inherently complex.
Our paper has three main contributions. First, we conduct an in-depth analysis of the design space to understand the factors behind building effective multi-agent systems. Second, based on these insights, we propose MASS, a novel optimization framework that efficiently exploits the complex design space by interleaving optimization stages from local to global, from prompts to topologies. And third, we demonstrate that MASS-optimized systems significantly outperform existing alternatives and provide design principles for building effective multi-agent systems.
Let me explain the challenges that motivated our work. First, individual agents often suffer from prompt sensitivity, where small changes in prompts can dramatically affect performance. When these sensitive agents are cascaded in a multi-agent system, the compounding effect can be amplified. Additionally, crafting effective topologies requires substantial manual experimentation. The overall search space is extremely large, encompassing both the unbounded space of prompt design and the decisions about which agents to integrate into the topology.
Now I'll describe our approach. MASS optimizes multi-agent systems in three stages:
- Block-level (local) prompt optimization for each topology building block
- Workflow topology optimization in a pruned set of topology space
- Workflow-level (global) prompt optimization given the best-found topology
Our experiments across various benchmarks—including reasoning tasks like MATH and DROP, multi-hop understanding tasks like HotpotQA, and code generation tasks—show that MASS significantly outperforms existing manually-crafted baselines and automatically-generated alternatives. [Section 1]
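To make the three stages concrete, here is a minimal Python sketch of the pipeline. Everything here is an illustrative toy: the function names, candidate pools, and scoring hooks are placeholders standing in for our actual optimizer, not its real API.

```python
def optimize_prompt(candidates, score):
    """Toy stand-in for a prompt optimizer: pick the best-scoring candidate."""
    return max(candidates, key=score)

def mass_optimize(block_prompts, topologies, score_prompt, score_topology):
    # Stage 1: block-level (local) prompt optimization, one building block at a time.
    best_prompts = {name: optimize_prompt(cands, score_prompt)
                    for name, cands in block_prompts.items()}
    # Stage 2: topology search over a (pruned) candidate set, reusing the
    # locally optimized blocks from Stage 1.
    best_topology = max(topologies, key=lambda t: score_topology(t, best_prompts))
    # Stage 3: workflow-level (global) prompt re-optimization for the winning
    # topology; the appended suffix is just an illustrative refinement candidate.
    final_prompts = {name: optimize_prompt([p, p + " Answer concisely."], score_prompt)
                     for name, p in best_prompts.items()}
    return best_topology, final_prompts
```

In practice the scoring hooks would run the candidate system on a validation set; here they can be any callables, which keeps the control flow of the three stages visible.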
(3) HoL: Thank you for that overview. Before we open up to questions, could you elaborate on the key design insights that led you to this three-stage approach? What made you think about separating local and global optimizations?
(4) Author: That's an excellent question. Our approach was motivated by several key design insights from our analysis. We found that optimizing prompts at the individual agent level before composing them into a multi-agent system yields better results than trying to optimize the entire system at once. This is because the complexity of joint optimization across multiple agents grows exponentially with the number of agents.
We also discovered that not all topologies are beneficial—some can even degrade performance. Influential topologies represent only a small fraction of the full search space. By pruning the search space to focus on these influential topologies, we can make the search process much more efficient.
This naturally led us to our three-stage approach: first optimize individual agents (block-level), then search for the best topology combinations using these optimized blocks, and finally fine-tune the prompts for the entire workflow. This approach effectively manages the complexity of the search space while producing high-quality results. [Section 2]
(5) Junior: I'm a bit confused about what exactly you mean by "topology" in this context. Could you define that more clearly and maybe give an example?
(6) Author: Great question! By "topology," we mean the structural arrangement and connectivity pattern between different agents in the multi-agent system. It defines how agents are connected and how information flows between them.
Let me give you a concrete example. Consider these common topologies:
- Aggregate topology: Here, multiple agents work in parallel to solve the same problem independently, and then their outputs are aggregated (like through majority voting). For example, having 5 different LLM agents solve a math problem and taking the most common answer.
- Reflect topology: In this case, one agent generates an initial solution, and another agent reviews it, providing feedback. The first agent then uses this feedback to refine its solution. This creates a serial flow with reflection.
- Debate topology: Here, multiple agents exchange information with each other in a debate-like format, where each agent can see others' reasoning and update their own. This creates a more complex, mesh-like information flow.
- Custom topologies: Task-specific arrangements like a summarizer agent that processes long texts before passing them to a question-answering agent.
The challenge is finding which of these topologies (or combinations of them) works best for a given task, and how many agents should be involved in each topology. [Section 2.2]
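Two of these topologies are simple enough to sketch directly. The following is a minimal, model-free illustration, assuming each agent is just a callable that maps a question (and optional feedback) to a string; it is not the paper's implementation:

```python
from collections import Counter

def aggregate(agents, question):
    """Aggregate topology: run agents in parallel, then majority-vote answers."""
    answers = [agent(question) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

def reflect(proposer, critic, question, rounds=2):
    """Reflect topology: a proposer drafts, a critic gives feedback,
    and the proposer refines over a fixed number of rounds."""
    draft = proposer(question, feedback=None)
    for _ in range(rounds):
        feedback = critic(question, draft)
        draft = proposer(question, feedback=feedback)
    return draft
```

Swapping in real LLM calls for the callables gives the actual building blocks; the topology itself is just this orchestration pattern.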
(7) Dr. P: I'd like to dig into your evaluation methodology. On page 8, you mention using Gemini 1.5 models for your experiments. Can you elaborate on the specific benchmarks you used, how you split the data for validation and testing, and the metrics you used? Also, what baseline approaches did you compare against?
(8) Author: We conducted experiments on an extensive collection of tasks across different categories:
For reasoning tasks, we used Hendrycks' MATH and DROP datasets. For long-context understanding, we evaluated on HotpotQA, MuSiQue, and 2WikiMultiHopQA from LongBench. For coding tasks, we used MBPP, HumanEval, and the test-output-prediction subtask from LiveCodeBench.
To save computational resources, we randomly sampled subsets from the original validation and test splits. For metrics, we reported accuracy for MATH and LiveCodeBench, F1 score for DROP, HotpotQA, MuSiQue, and 2WikiMQA, and pass@1 for MBPP and HumanEval.
We compared against several baselines:
- Chain-of-Thought (CoT) - A single agent using zero-shot prompting
- Self-Consistency - Using multiple agents and majority voting
- Self-Refine - Using a reflector agent to verify and improve predictions
- Multi-Agent Debate - Having agents debate to reach better answers
- ADAS - An automatic agent design framework using an LLM meta-agent
- AFlow - A topology search algorithm using Monte Carlo Tree Search
For fair comparison, we limited the maximum number of agents to 10 across all methods. [Section 5]
(9) Senior: I'm interested in the prompting aspect of your work. You mentioned that prompt optimization plays a crucial role. How exactly does MASS optimize the prompts, and what makes this approach novel compared to existing prompt optimization techniques?
(10) Author: For prompt optimization, MASS builds on and extends existing techniques while integrating them into our multi-stage framework. Specifically, we use MIPRO, a state-of-the-art prompt optimizer that jointly optimizes both instructions and demonstrations.
What makes our approach novel is how we handle prompt optimization in the multi-agent context:
- In the first stage (block-level optimization), we optimize prompts for each individual agent type (predictor, reflector, debater, etc.) separately. This includes optimizing both the instruction component and up to 3 in-context examples per agent.
- In the third stage (workflow-level optimization), we re-optimize the prompts for the entire system. This stage is crucial because it ensures that the prompts are tailored for orchestration within the multi-agent system and that the interdependence between agents is properly captured.
Our analyses showed that prompt optimization can yield substantial performance gains. For example, in our MATH task experiments, a properly prompt-optimized single agent already outperformed several naive multi-agent approaches without optimized prompts. When we then applied self-consistency on top of this optimized agent, we saw even better scaling properties.
This confirms our hypothesis that you should "optimize agents locally before scaling their topology" - a key insight from our work. [Section 2.1]
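A toy version of what the block-level stage searches over might look like this. It is a deliberately brute-force stand-in for MIPRO (which searches far more cleverly); only the joint space of one instruction plus up to three demonstrations matches our setup:

```python
import itertools

def optimize_block(instructions, demo_pool, score, max_demos=3):
    """Brute-force joint search over instruction candidates and up to
    `max_demos` in-context demonstrations, scored on a validation metric."""
    best, best_score = None, float("-inf")
    for inst in instructions:
        for k in range(max_demos + 1):
            for demos in itertools.combinations(demo_pool, k):
                s = score(inst, demos)
                if s > best_score:
                    best, best_score = (inst, demos), s
    return best
```

A real optimizer replaces the exhaustive loops with proposal and surrogate-model steps, but the object being optimized, an (instruction, demonstrations) pair per agent, is the same.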
(11) MML: I'd like to understand the mathematical formulation behind your approach. In Section 2, you formulate workflow topology optimization as an optimization problem. Could you walk us through this formulation and explain how you effectively solve it given the combinatorial nature of the search space?
(12) Author: Yes, from a mathematical perspective, we formulate the workflow topology optimization as follows:
a* = arg max_{a ∈ A} E_{(x,y)~D} [ f(W_a(x), y) ]

Here, a represents a valid configuration from our search space A, W_a is the workflow (the multi-agent system) built from configuration a, D is the target dataset of input-output pairs (x, y), and f is the objective function measuring performance.
Solving this directly is challenging due to the combinatorial nature of the search space. Our approach to making this tractable has several components:
First, we use the results from our block-level prompt optimization to measure the incremental influence I_a of each topology block a compared to the base agent. This helps us identify which topologies are most promising.
Second, we prune the search space by converting these influence measures into selection probabilities using a softmax function with temperature t. This gives us: p_a = Softmax(I_a, t)
Third, we use rejection sampling based on these probabilities to explore the search space efficiently. We only sample valid configurations and reject those that exceed our budget constraints.
Finally, we constrain the workflow with a rule-based order to reduce optimization complexity, following a predefined sequence like [summarize, reflect, debate, aggregate].
This combination of pruning, probabilistic sampling, and ordered constraints allows us to efficiently navigate the otherwise intractable combinatorial search space. [Section 3]
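The pruning and sampling steps can be sketched in a few lines. The influence values, costs, and budget below are made-up numbers; only the mechanism, a softmax over influence followed by rejection sampling under a budget, mirrors what we described:

```python
import math
import random

def selection_probs(influence, t=1.0):
    """Turn measured incremental influence I_a into softmax selection
    probabilities p_a with temperature t."""
    z = [math.exp(v / t) for v in influence.values()]
    total = sum(z)
    return {name: zi / total for name, zi in zip(influence, z)}

def sample_config(probs, budget, cost, rng=None, max_tries=1000):
    """Rejection sampling: activate each block with its selection probability,
    rejecting configurations whose total agent cost exceeds the budget."""
    rng = rng or random.Random(0)
    for _ in range(max_tries):
        config = [b for b in probs if rng.random() < probs[b]]
        if sum(cost[b] for b in config) <= budget:
            return config
    return []
```

The rule-based ordering constraint would then simply sort the sampled blocks into the predefined sequence before the workflow is instantiated.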
(13) LaD: Let's talk about the data. When developing multi-agent systems for tasks like multi-hop question answering, the dataset characteristics significantly impact performance. Did you analyze how different dataset properties affected your optimized topologies? Did you find some topologies worked better for certain types of questions or data structures?
(14) Author: That's a very insightful question about dataset characteristics. We did observe that different dataset properties led to different optimal topologies.
For instance, in multi-hop question answering datasets like HotpotQA, which requires synthesizing information from multiple contexts, we found that debate topologies brought significant gains (+3% improvement), while other topologies like self-consistency and self-refine didn't help or even slightly hurt performance.
In contrast, for math reasoning tasks in the MATH dataset, we found that scaling with more parallel agents (aggregate topology) outperformed debate topologies. This suggests that for problems with clear, objective answers, having multiple independent solutions and then aggregating them works better than having agents debate.
For coding tasks, especially the test-output-prediction subtask in LiveCodeBench, we found that a hybrid approach combining reflection and execution was most effective. The executor agent could run code and provide concrete feedback, which significantly improved the reflection process.
These observations highlight the importance of matching the topology to the specific dataset characteristics and task requirements. One size definitely doesn't fit all when it comes to multi-agent systems, which is precisely why our automated search approach is valuable. [Section 5]
(15) HoL: Looking at Figure 2 in your paper, you show an interesting comparison between different approaches and their token efficiency. Could you elaborate on this analysis? How much more token-efficient is MASS compared to the baselines?
(16) Author: Figure 2 demonstrates the token efficiency of different approaches on the MATH dataset using Gemini 1.5 Pro. What we observed is quite telling about the value of prompt optimization.
The graph plots accuracy against token count per question, showing that prompt-optimized agents (labeled "Prompting" in the figure) achieve significantly better token-efficiency compared to simply scaling the number of agents with default prompts (like Self-Consistency or Reflect approaches).
Specifically, our prompt-optimized approach achieved about 80% accuracy with approximately 2,000 tokens per question, whereas scale-based approaches like Self-Consistency required nearly 3,000 tokens to reach just 76% accuracy.
The most efficient approach was combining prompt optimization with self-consistency (labeled "Prompt->SC" in the figure), which reached 84% accuracy with around 3,000 tokens. This is a substantial improvement over the baseline Chain-of-Thought, which achieved only 73% with 1,000 tokens.
In our broader experiments, MASS-optimized systems typically used 30-40% fewer tokens compared to baseline approaches for equivalent or better performance. This token efficiency is particularly important in practical applications where inference costs scale with token usage. [Section 2.1]
(17) Indus: From an industry perspective, I'm interested in the practical implications. How much computational overhead does the MASS optimization process itself require? And once optimized, what's the inference time difference between a MASS-optimized system versus traditional approaches? Is this something that could be deployed in production systems today?
(18) Author: From a practical implementation standpoint, MASS does require more upfront computational investment during the optimization phase compared to manually designed systems. Our full optimization pipeline involved:
- Block-level prompt optimization: Optimizing instructions and 3 demonstrations per agent type over 10 rounds.
- Topology optimization: Evaluating 10 candidate topologies on the validation set, with 3 evaluations per topology to stabilize the performance estimate.
- Workflow-level prompt optimization: Fine-tuning the prompts for the best topology.
In total, this process required approximately 1,000-2,000 model calls during optimization, depending on the task complexity. However, this is a one-time cost that pays dividends through improved performance.
For inference time and production deployment, a MASS-optimized system is actually more efficient than many baselines. For example, in Figure 9 of our paper, we show that MASS achieves better accuracy on MATH (82%) with fewer total tokens (~3,000) compared to approaches like debate with 2 rounds and 3 agents (~5,000 tokens for 78% accuracy).
Once optimized, the inference time scales linearly with the number of agents and their interactions. For instance, our optimized topology for MATH used 9 parallel agents with aggregation, which can be easily parallelized in production systems.
The real business value comes from both the improved accuracy and token efficiency. For high-value applications where mistakes are costly, the 10-15% accuracy improvements we observed could translate to significant business impact, while the 30-40% token efficiency gains directly reduce operational costs. [Section 5]
(19) Senior: I'd like to understand the novelty of your work compared to previous research. How does MASS differ from systems like ADAS and AFlow, which also try to optimize multi-agent systems?
(20) Author: That's a key question. Our work differentiates itself from systems like ADAS and AFlow in several important ways:
First, MASS specifically addresses the interplay between prompt optimization and topology optimization, which previous works largely treated as separate concerns. ADAS focuses on using an LLM meta-agent to iteratively propose new agents, but without explicit prompt optimization. AFlow searches for better topologies using Monte Carlo Tree Search over predefined operators, but again doesn't systematically optimize the prompts for those operators.
Second, our analyses revealed that prompt optimization often yields larger performance gains than topology optimization alone. This insight wasn't explored in depth by either ADAS or AFlow. In our experiments, we found that ADAS often proposes complex topologies but without optimized prompts, limiting its effectiveness.
Third, our multi-stage optimization approach (local to global, prompts to topologies) is novel. Previous works attempted to directly optimize the entire system, which becomes combinatorially complex as the number of agents increases.
In our experimental results, MASS consistently outperformed both ADAS and AFlow across all eight evaluation tasks. On average, MASS achieved 78.8% performance on Gemini 1.5 Pro, compared to 69.7% for ADAS. The performance gap was particularly noticeable on tasks like MATH, DROP, and MuSiQue, where prompt optimization plays a crucial role. [Section 4]
(21) Dr. P: I noticed in your ablation studies in Figure 5 that you evaluate the contribution of each optimization stage. Could you elaborate on these results? Which optimization stage contributed most to the final performance?
(22) Author: The ablation studies in Figure 5 provide valuable insights into the contribution of each optimization stage. We compared several configurations:
- Base Agent (CoT): A simple chain-of-thought agent without any optimization.
- APO: Single-agent automatic prompt optimization.
- 1PO: Block-level prompt optimization (Stage 1).
- 2TO: Workflow topology optimization (Stage 2).
- 3PO: Workflow-level prompt optimization (Stage 3).
On average across all eight evaluation tasks, we observed:
- Base Agent: 63.5%
- APO: 67.4% (+3.9%)
- 1PO: 74.6% (+7.2%)
- 2TO: 77.6% (+3.0%)
- 3PO: 78.8% (+1.2%)
The largest single improvement came from the first stage of MASS (block-level prompt optimization), contributing a 7.2% absolute improvement. This highlights the importance of optimizing individual agents before composing them.
The second stage (topology optimization) contributed another 3.0% gain, showing the value of searching for effective agent arrangements.
The third stage (workflow-level prompt optimization) added a further 1.2%, which may seem modest but was consistent across tasks and statistically significant.
We also conducted experiments without pruning the search space and without prompt optimization, finding that both components were essential for effective search. This confirms our hypothesis that searching in a pruned, influential space after prompt optimization is more effective than searching in the full space without optimized prompts. [Section 5]
(23) Junior: I'm still trying to understand how the different topologies work together. Could you give a concrete example of what a MASS-optimized system looked like for one of your tasks? What agents were included and how were they connected?
(24) Author: Let me give you a concrete example from our MATH task optimization, which is illustrated in Figure 7 of the paper.
Initially, we started with a simple Chain-of-Thought (CoT) agent that achieved about 62% accuracy. In Stage 1 (block-level prompt optimization), we identified that a multi-agent debate topology with optimized prompts performed best, achieving 79% accuracy.
However, interestingly, in Stage 2 (workflow topology optimization), we discovered that an aggregate topology with 9 parallel agents actually outperformed the debate topology, reaching 83% accuracy. The optimized topology looked like this:
[Predictor 1] ─┐
[Predictor 2] ─┤
[Predictor 3] ─┤
[Predictor 4] ─┤
[Predictor 5] ─┼── [Aggregator] ── Final Answer
[Predictor 6] ─┤
[Predictor 7] ─┤
[Predictor 8] ─┤
[Predictor 9] ─┘
Each predictor had the same optimized prompt, which instructed it to: "Think step by step to solve the given problem. Clearly explain your reasoning process, showing all intermediate calculations and justifications. Express your final answer as a single numerical value or simplified expression enclosed within <answer></answer> tags."
The aggregator would then take all nine solutions and apply a majority voting mechanism to determine the final answer.
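The aggregation step itself is mechanical once every predictor follows the answer-tag convention from the prompt. Here is a minimal sketch; the regex matches the <answer></answer> format quoted above, and the rest is generic majority voting rather than our exact aggregator:

```python
import re
from collections import Counter

def extract_answer(solution):
    """Pull the final answer out of <answer></answer> tags."""
    m = re.search(r"<answer>(.*?)</answer>", solution, re.DOTALL)
    return m.group(1).strip() if m else None

def majority_vote(solutions):
    """Aggregator: majority vote over the parsed answers of the predictors."""
    answers = [a for a in map(extract_answer, solutions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```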
In Stage 3 (workflow-level prompt optimization), we further refined the prompts specifically for this aggregate topology, reaching our final performance of 85%.
This example illustrates how MASS can discover non-obvious topologies - in this case, finding that simple parallelization with good prompts outperformed more complex debate structures. [Section 5]
(25) LaD: I'm curious about the robustness of your findings across different model sizes and families. Most of your experiments use Gemini 1.5 Pro, but do the same topologies and optimization strategies work well when applied to different base models?
(26) Author: That's an important question about the generalizability of our findings. We did investigate this by running experiments on different model sizes and families.
In Table 1, we present results for both Gemini 1.5 Pro and Gemini 1.5 Flash, showing that MASS consistently outperforms baselines across both model sizes. While the absolute performance is lower with the smaller Flash model (74.3% average vs. 78.8% for Pro), the relative improvements from MASS are actually larger for the Flash model in many cases.
We also conducted additional experiments with Claude 3.5 Sonnet, reported in Table 4 in the appendix. Again, MASS significantly outperformed the baselines, achieving 72.4% average performance compared to 60.2% for the base Chain-of-Thought.
Interestingly, we found that the optimal topologies sometimes differed across model families. For example, in HotpotQA, the optimal topology for Gemini models involved a combination of debate and aggregate, while for Claude, a heavier emphasis on aggregate worked better.
This suggests that while the MASS framework itself is model-agnostic, the specific topologies and prompts it discovers may need to be tailored to each model family's strengths and weaknesses. This actually further strengthens the case for automated optimization approaches like MASS, as manually finding these optimal configurations for each model would be even more challenging. [Section 5]
(27) HoL: Looking at the bigger picture, what do you see as the key design principles that emerge from your research? If someone were building a multi-agent system from scratch today, what insights from your work would you most want them to apply?
(28) Author: Based on our research, I'd highlight three key design principles for building effective multi-agent systems:
- Optimize individual agents first before scaling: Our results consistently showed that well-optimized prompts at the individual agent level provide a stronger foundation than naively scaling up the number of agents. Before adding more agents to your system, ensure each agent is functioning optimally.
- Focus on influential topologies: Not all ways of connecting agents are equally effective. For reasoning tasks, aggregate topologies (multiple parallel agents with majority voting) often work best. For multi-hop QA, debate topologies show significant gains. For coding tasks, combining execution with reflection yields the best results. Choose topologies that match your task characteristics.
- Model the interdependence between agents: Once you've established your topology, it's beneficial to conduct another round of prompt optimization that considers how agents interact within the system. Prompts that work well for individual agents might need adjustment when those agents are working together.
I'd also emphasize the importance of empirical testing. Our paper shows that theoretical intuitions about which topologies should work best don't always match empirical results. For instance, we initially expected debate topologies to be superior for math reasoning, but our experiments showed that simple aggregation of multiple well-prompted agents performed better.
Finally, consider the token efficiency of your system. MASS-optimized systems typically used 30-40% fewer tokens than baseline approaches for equivalent or better performance, which has significant practical implications for deployment costs. [Section 6]
(29) Indus: From a commercial standpoint, what do you see as the most promising immediate applications of MASS? Are there specific industries or use cases where you think this approach could have the most impact?
(30) Author: From a commercial perspective, I see several high-impact applications where MASS could provide immediate value:
- Financial analysis and decision-making: Financial institutions dealing with complex numerical reasoning and multi-source information synthesis could benefit significantly. MASS showed particularly strong improvements on tasks like MATH and DROP, which involve numerical reasoning. A properly optimized multi-agent system could provide more accurate financial analyses while maintaining clear reasoning traces for regulatory compliance.
- Enterprise knowledge retrieval and reasoning: Organizations with large, complex knowledge bases could use MASS-optimized systems for more accurate information retrieval and synthesis. Our results on multi-hop QA tasks like HotpotQA show substantial improvements in information synthesis across multiple sources.
- Software development assistance: Our results on coding tasks like MBPP and HumanEval showed significant improvements, particularly when combining execution capabilities with reflection. This could translate to more robust code generation, debugging, and test case creation tools for developer productivity.
- Healthcare information synthesis: Medical diagnosis and treatment planning often require integrating information from multiple sources (patient history, research literature, treatment guidelines). The improvements we saw in multi-hop reasoning could be valuable here.
- Customer support automation: For complex customer issues that require both information retrieval and reasoning, MASS-optimized systems could provide more accurate solutions while maintaining natural conversational flow.
The business value of MASS comes from both improved accuracy (10-15% on average) and efficiency (30-40% token reduction). For high-value decisions where mistakes are costly, these improvements could translate to significant ROI, even accounting for the upfront optimization cost. [Section 5]
(31) MML: I'm interested in the limitations of your approach. What are the scenarios where MASS might struggle or not be the optimal solution? And what directions do you see for future research to address these limitations?
(32) Author: That's a thoughtful question about limitations. There are several scenarios where MASS might face challenges:
- Extremely large design spaces: While MASS effectively prunes the topology search space, for systems with many specialized agent types or complex interaction patterns, the search space could still be prohibitively large. Our current approach is most effective when the number of distinct agent types is relatively small.
- Highly dynamic environments: MASS optimizes for a fixed task distribution represented by the validation set. In environments where the task distribution shifts significantly over time, the optimized system might not adapt well without reoptimization.
- Very long reasoning chains: Our current implementation primarily focuses on topologies with parallel and relatively shallow sequential structures. For tasks requiring very deep reasoning chains (dozens of steps), the current search space may not capture the optimal structures.
- Resource constraints: The optimization process itself requires computational resources that might be prohibitive in some settings, particularly for smaller organizations.
For future research directions, we see several promising avenues:
- Sparsity in agent communications: Recent work has shown that pruning redundant communications between agents can improve efficiency. Incorporating communication sparsity into the search space could yield even more efficient systems.
- Adaptive topologies: Developing methods that can dynamically adjust the topology based on the specific input rather than using a fixed topology for all instances of a task.
- Cross-task generalization: Exploring how insights from optimizing for one task can transfer to related tasks, reducing the need for task-specific optimization.
- Integration with reinforcement learning: Combining our search-based approach with reinforcement learning to enable continuous improvement of the multi-agent system over time.
We acknowledge these limitations in the appendix of our paper and see them as exciting challenges for the research community to address. [Appendix A]
(33) Dr. P: I'd like to understand more about your evaluation methodology. How did you ensure that performance improvements weren't just due to increased model usage? Did you control for the total number of tokens or model calls when comparing different approaches?
(34) Author: That's an excellent methodological question. We were indeed careful to control for model usage when comparing different approaches to ensure fair comparisons.
For all methods, including baselines and MASS, we limited the maximum number of agents to 10 during inference. This ensured that no method could simply achieve better performance by using an excessive number of agents.
In Figure 9 of our paper, we explicitly analyze the token efficiency of different approaches, plotting performance against the total token count (including both input and output tokens). This analysis shows that MASS consistently achieves better performance per token compared to baseline approaches.
For example, on the MATH dataset, MASS achieved 82% accuracy with approximately 3,000 tokens, while debate approaches with 2 rounds and 3 agents required around 5,000 tokens to reach only 78% accuracy. This demonstrates that our improvements aren't simply from using more computational resources but from using them more effectively.
For the optimization process itself, we report the total number of model calls required: approximately 1,000-2,000 depending on the task complexity. While this is a non-trivial upfront cost, it's a one-time investment that leads to persistent improvements in inference efficiency.
We also structured our ablation studies to isolate the contribution of each component. By comparing systems with and without specific optimization stages while keeping all other factors constant, we could attribute performance gains to specific aspects of our approach rather than just increased computation. [Section 5]
(35) Senior: I'm interested in how MASS might fit into the broader ecosystem of LLM research. How do you see your work connecting with other research directions like fine-tuning, retrieval-augmented generation, or agent learning?
(36) Author: MASS addresses a complementary direction to many other research threads in the LLM ecosystem. Let me explain how it connects with several important areas:
- Fine-tuning vs. Prompting: While fine-tuning adapts model weights for improved performance, it requires significant computational resources and data. MASS operates entirely in the prompt space, requiring no model weight updates. These approaches could be complementary—fine-tuned models could serve as even better building blocks within a MASS-optimized topology.
- Retrieval-Augmented Generation (RAG): MASS includes tool use as one of its optimizable components, which can incorporate retrievers for RAG. Our framework could potentially discover optimal ways to integrate retrieval into agent workflows, such as when to retrieve information and how to process it across multiple agents.
- Agent Learning: Recent work like AgentPro and AutoAct focuses on training agents through reinforcement learning. MASS takes a different approach by optimizing the agent architecture and prompts, without weight updates. These approaches could be combined—imagine using MASS to discover effective agent architectures, then using RL to further refine agent behavior.
- Prompt Engineering/Optimization: MASS builds upon and extends prompt optimization techniques like DSPy and MIPRO, adapting them specifically for the multi-agent context. Our insights about the importance of prompt optimization could inform future prompt engineering research.
- Neural Architecture Search (NAS): Methodologically, MASS draws inspiration from NAS, particularly in how we emphasize search space design over search algorithms. Similar to how NAS evolved to focus more on search space design, we expect multi-agent optimization to follow a similar trajectory.
I see MASS as providing a framework that could integrate advances from these other research directions. As LLMs continue to improve, the benefits of effective multi-agent orchestration will likely become even more pronounced. [Section 4]
(37) LaD: You mentioned that different topologies work better for different tasks. Did you analyze any patterns in the data that might explain why certain topologies perform better on specific tasks? For instance, are there linguistic or structural characteristics of questions that would suggest using debate versus aggregate topologies?
(38) Author: We did observe some patterns in how dataset characteristics align with optimal topologies, though this wasn't the primary focus of our paper. Here are some insights:
For multi-hop QA datasets like HotpotQA, which require synthesizing information across multiple paragraphs, debate topologies performed particularly well. We hypothesize this is because debate allows agents to identify and resolve inconsistencies when integrating information from different sources. Questions that require comparing entities or events across multiple paragraphs benefited most from debate.
For mathematical reasoning in the MATH dataset, aggregate topologies (multiple parallel agents with majority voting) outperformed debate. We found this was particularly true for problems with clear, objective answers where different solution paths can lead to the same result. The diversity of approaches from multiple parallel agents helped overcome individual biases or errors.
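To make the aggregate idea concrete, here is a minimal sketch of that topology. This is not the paper's implementation; the stand-in "agents" below are plain functions where a real system would make LLM calls with individually optimized prompts:

```python
from collections import Counter

def aggregate(question, agents):
    """Aggregate topology: run several agents in parallel on the same
    question and return the majority-vote answer (self-consistency)."""
    answers = [agent(question) for agent in agents]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Stand-in "agents" for illustration only: two answer correctly, one errs.
agents = [lambda q: "4", lambda q: "4", lambda q: "5"]
print(aggregate("What is 2 + 2?", agents))  # majority vote -> "4"
```

The key property is that independent solution paths vote on a final answer, which is why it helps most on problems with clear, objective answers.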
For coding tasks, particularly test output prediction in LiveCodeBench, topologies combining execution with reflection worked best. The pattern here was clear: tasks requiring concrete verification against external criteria (like execution results) benefited from topologies that incorporated feedback loops with execution.
Text length was another factor. For datasets with very long contexts like MuSiQue, topologies that included a summarizer agent as the first step showed significant benefits. The summarizer could extract relevant information before passing it to subsequent agents, effectively managing context limitations.
One interesting observation was that tasks with high ambiguity or subjective components benefited more from debate, while highly structured tasks with clear evaluation criteria tended to favor aggregate topologies. This suggests that the level of problem structure is a useful predictor of optimal topology.
These patterns could potentially be used to develop heuristics for choosing topologies for new tasks, though our results suggest that automated optimization still outperforms human intuition in most cases. [Section 5]
(39) Junior: All this optimization seems complex. If someone wanted to implement MASS for their own application but didn't have a lot of resources, is there a simplified version or key parts they could focus on that would still give them good results?
(40) Author: That's a very practical question. For those with limited resources, I'd recommend a simplified version of MASS focusing on the highest-impact components:
- Focus on prompt optimization first: Our results consistently showed that prompt optimization at the individual agent level provides the biggest bang for your buck. Even if you can't do the full topology search, spending time optimizing prompts for individual agents will yield substantial benefits. You could use a simpler prompt optimization approach like MIPRO-lite, which requires fewer computational resources.
- Start with a small set of proven topologies: Instead of searching the full topology space, start with the topologies we found most effective for similar tasks:
- For reasoning tasks: Aggregate topology with 3-5 parallel agents
- For multi-hop QA: A simple debate topology with 3 agents and 1 round
- For coding: Combine a base agent with execution and reflection
- Use a smaller but diverse validation set: You don't need a large validation set to identify effective prompts and topologies. A small but diverse set of examples (20-30) can still provide useful signal for optimization.
- Sequential rather than parallel optimization: Instead of optimizing all components simultaneously, do it sequentially: first optimize the prompt for your base agent, then add one topology component at a time, keeping what works and discarding what doesn't.
- Reuse optimized prompts across similar tasks: The optimized prompts we found for certain agent types (predictors, reflectors, etc.) could be adapted for similar tasks without starting from scratch.
Following these simplified steps could give you perhaps 70-80% of the benefits of the full MASS approach while requiring significantly fewer resources. The most important insight to take away is that well-optimized prompts in even simple topologies often outperform complex topologies with default prompts. [Section 5]
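The sequential recipe above amounts to a greedy search. Here is a minimal sketch of that loop; the component names and validation scores are hypothetical placeholders, where a real `evaluate` would run the candidate topology on your validation set:

```python
def greedy_build(base_score, components, evaluate):
    """Sequential topology construction: starting from an optimized base
    agent, try adding one component at a time on a validation set and
    keep it only if the score improves."""
    kept, best = [], base_score
    for comp in components:
        score = evaluate(kept + [comp])
        if score > best:
            kept.append(comp)
            best = score
    return kept, best

# Hypothetical validation scores per topology (stand-in for real runs).
scores = {
    (): 0.60,
    ("aggregate",): 0.68,
    ("aggregate", "reflect"): 0.66,   # reflect hurts here -> discarded
    ("aggregate", "debate"): 0.72,
}
evaluate = lambda comps: scores[tuple(comps)]
kept, best = greedy_build(scores[()], ["aggregate", "reflect", "debate"], evaluate)
print(kept, best)  # -> ['aggregate', 'debate'] 0.72
```

Greedy search evaluates only one candidate per component rather than the full combinatorial space, which is exactly the resource trade-off this simplified recipe is making.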
(41) HoL: As we near the end of our discussion, I've noticed we haven't discussed potential negative impacts or ethical considerations of this work. Could you address whether there are any concerns with automating multi-agent system optimization, particularly as these systems become more powerful?
(42) Author: That's an important aspect we should address. There are several ethical considerations and potential concerns with automated multi-agent optimization:
First, as multi-agent systems become more capable through optimization, they could potentially amplify existing issues with LLMs, such as hallucinations or biases. By orchestrating multiple agents that reinforce each other's outputs, incorrect information might appear more credible due to the apparent consensus among agents.
Second, there's the issue of transparency and explainability. Optimized multi-agent systems, particularly those with complex topologies, may become more difficult for humans to interpret. Understanding why a system produced a particular output becomes challenging when the reasoning is distributed across multiple agents with complex interactions.
Third, there are resource considerations. The optimization process itself requires significant computational resources, which has both environmental impacts and could exacerbate existing disparities in access to advanced AI technologies.
To address these concerns, we recommend several practices:
- Maintaining human oversight of multi-agent systems, especially in high-stakes applications
- Including diverse evaluation datasets that specifically probe for biases and hallucinations
- Implementing logging mechanisms that capture inter-agent communications for auditability
- Considering efficiency metrics alongside performance during optimization
In our work, we've tried to balance performance gains with efficiency by including token efficiency as an evaluation criterion. We've also focused on creating more robust systems rather than just pushing for performance at any cost. But we acknowledge that as this technology develops further, the research community will need to continuously evaluate and address emerging ethical concerns. [Not explicitly in the paper, but important to note]
(43) Indus: Before we wrap up, I'm thinking about future developments. Where do you see this line of research going in the next 1-2 years? What capabilities or improvements might we expect to see in multi-agent systems?
(44) Author: Looking ahead to the next 1-2 years, I anticipate several exciting developments in multi-agent systems research:
- Emergent capabilities through scale: As we scale up both the number and diversity of agents, we may see emergent capabilities that aren't present in individual agents or simpler systems. These could include more sophisticated collaborative problem-solving, better self-correction, and improved handling of ambiguity.
- Specialized agent architectures: We'll likely see the development of agent architectures specifically designed for certain roles within multi-agent systems, moving beyond the current approach of using the same base model for all agents. This specialization could improve both efficiency and performance.
- Dynamic and adaptive topologies: Instead of fixed topologies, future systems might dynamically adapt their topology based on the specific input or task requirements. This could involve adding or removing agents, or rearranging connections between them in real-time.
- Multi-modal agents: Extending multi-agent systems to incorporate multi-modal capabilities (text, vision, audio) could enable more comprehensive reasoning and problem-solving across diverse inputs.
- Self-improving systems: Combining our approach with reinforcement learning from human feedback could enable multi-agent systems that continuously improve through interaction, potentially discovering novel topologies or prompting strategies beyond what we currently consider.
- Efficiency innovations: Expect significant advances in reducing the computational overhead of multi-agent systems through techniques like selective activation (only invoking specific agents when needed) and distillation (compressing multi-agent reasoning into more efficient forms).
The key challenge will be balancing these advances with considerations of safety, interpretability, and accessibility. I believe the most impactful research will be that which not only pushes the performance frontier but also addresses these broader concerns. [Not explicitly in the paper, but represents a vision for future work]
(45) A: Thank you for this comprehensive discussion. Before we conclude, I'd like to summarize the key insights from today's presentation:
- MASS is a multi-stage optimization framework for LLM-based multi-agent systems that interleaves prompt optimization and topology optimization.
- The research revealed that well-optimized prompts at the individual agent level provide a stronger foundation than merely scaling up agents, highlighting the importance of "optimizing agents locally before scaling their topology."
- Not all topologies are beneficial: influential topologies represent only a small fraction of the search space, making efficient search critically important.
- MASS significantly outperformed existing approaches across a range of tasks including reasoning, multi-hop understanding, and code generation.
- Different tasks benefit from different optimal topologies - mathematical reasoning tasks favor aggregate topologies, multi-hop QA benefits from debate, and coding tasks work best with execution and reflection components.
Are there any aspects of your work we haven't covered that you'd like to highlight?
(46) Author: Thank you for that excellent summary. I'd like to highlight one additional aspect that we haven't fully discussed: the broader implications for how we think about developing and deploying LLM-based systems.
Our work suggests that the architecture of how we compose LLMs may be just as important as the capabilities of the underlying models themselves. As base models become more powerful, the gains from effective orchestration could become even more significant. This points to a future where system design might partially decouple from model development - teams could specialize in creating optimal architectures for orchestrating existing models rather than focusing solely on developing better base models.
I'd also like to emphasize that although our paper focused on performance metrics, the principles we uncovered have implications for reliability and safety as well. Properly designed multi-agent systems can incorporate verification agents, fact-checking mechanisms, and other guardrails that might improve the trustworthiness of AI systems.
(47) A: Thank you for those additional insights. To wrap up, could you share the five most important citations from your work that you would recommend for understanding the foundational research that your paper builds upon?
(48) Author: Here are the five most important citations I'd recommend for understanding the foundations of our work:
- Khattab, O., et al. (2024). "DSPy: Compiling declarative language model calls into state-of-the-art pipelines." This paper introduces a framework for optimizing prompts programmatically, which influenced our approach to prompt optimization.
- Wang, X., et al. (2023). "Self-consistency improves chain of thought reasoning in language models." This foundational paper demonstrates how using multiple reasoning paths and majority voting can improve performance, which is a key component in our aggregate topology.
- Shinn, N., et al. (2023). "Reflexion: Language agents with verbal reinforcement learning." This work introduced the concept of reflective agents that can learn from feedback, influencing our reflect topology designs.
- Du, Y., et al. (2024). "Improving factuality and reasoning in language models through multiagent debate." This paper established the effectiveness of debate topologies for complex reasoning tasks, which is a core component in our topology space.
- White, C., et al. (2023). "Neural architecture search: Insights from 1000 papers." While not directly about LLMs, this survey on neural architecture search influenced our thinking about search space design, particularly the importance of focusing on influential dimensions of the search space rather than search algorithms.
These papers span the key areas that informed our work: prompt optimization, agent design patterns (self-consistency, reflection, debate), and methodological approaches to search space design.
(49) A: Thank you all for your participation in today's lab meeting. Special thanks to our author for presenting this innovative work on multi-agent system optimization and for answering everyone's questions so thoroughly. The insights from MASS could significantly advance how we design and deploy multi-agent systems across various domains.