Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding
Mirac Suzgun (Stanford University), Adam Tauman Kalai (OpenAI)
Lab Meeting Discussion
(1) A: Good morning everyone! Today we have a presentation from our colleague who will be discussing their paper on meta-prompting. Please take it away.
(2) Author: Thank you for having me today. I'm excited to share our work on meta-prompting, a technique we've developed to enhance the functionality of language models.
Let me start with the key idea: Meta-prompting transforms a single language model into what we call a "multi-faceted conductor" that manages multiple independent LM queries. Instead of treating an LM as a single entity that answers a question directly, we leverage the LM's ability to decompose complex tasks into smaller subtasks and delegate these to "expert" instances of the same LM.
The core of our approach involves creating a high-level "meta" prompt that instructs the language model to:
- Break down complex problems into manageable pieces
- Assign these pieces to specialized "expert" models with tailored instructions
- Oversee communication between these expert models
- Apply critical thinking and verification throughout the process
What makes this especially powerful is that the same LM plays both roles - it acts as the conductor and as the various experts. When the conductor calls an expert, that expert is given fresh instructions without the full conversation history, which provides "fresh eyes" on the problem. This helps avoid the common issue where LMs double down on mistakes.
Our meta-prompting approach is task-agnostic, meaning the same high-level instructions work across different tasks without needing task-specific examples. This contrasts with many other techniques that require examples tailored to each task.
Let me walk you through our methodology briefly. The process starts with a user query, which is processed by what we call the "Meta Model" - the LM in its conductor role. The Meta Model can either answer directly or, more typically, decide to consult various expert models. These experts only see what the Meta Model shares with them, creating an isolated process where each expert brings a fresh perspective. The Meta Model then integrates these insights and provides a final answer.
We compared meta-prompting with several other zero-shot prompting methods including standard prompting, zero-shot chain-of-thought, expert prompting (both static and dynamic), and multipersona prompting.
Our results across various tasks are quite compelling. When we enhanced meta-prompting with a Python interpreter, it outperformed standard prompting by 17.1%, expert dynamic prompting by 17.3%, and multipersona prompting by 15.2%. The tasks we evaluated included the Game of 24, Checkmate-in-One, and Python Programming Puzzles.
What's particularly interesting is how meta-prompting performs on tasks requiring complex reasoning or creativity. For example, in the Game of 24, where players must form an arithmetic expression that equals 24 using four given numbers exactly once, meta-prompting with a Python interpreter achieved 67% accuracy compared to just 3% with standard prompting.
We believe this approach has broad applications across various domains and opens up exciting possibilities for enhancing LM capabilities without requiring specialized fine-tuning. [Section 1 and 2]
(3) A: Thank you for that comprehensive summary. Now let's open the floor for questions from our lab members. Who would like to start?
(4) Junior: I'm a bit confused about what exactly a "meta" prompt is. Could you explain in simpler terms what makes this different from just having a good prompt?
(5) Author: That's a great clarification question. A standard prompt is a direct instruction to the language model to perform a specific task. For example, "Solve this math problem: 2+2=?"
In contrast, a meta-prompt is a high-level instruction that tells the language model how to approach problem-solving in general. It doesn't focus on solving the specific task but rather on orchestrating a process to solve any task.
Our meta-prompt essentially says: "You are a coordinator who can call upon different experts to help solve problems. When faced with a task, break it down, decide which experts to consult, give them specific instructions, and combine their insights."
What makes this different is that:
- The LM is explicitly instructed to decompose problems rather than solve them directly
- It creates a structured workflow where the LM plays multiple roles
- It introduces "fresh eyes" by having each expert view only their specific subtask without the full context
Think of it like the difference between telling someone "bake this cake" and teaching them a general method: "when cooking anything complex, break it into steps, consult recipes for each component, check your work as you go, and assemble carefully at the end."
The meta-prompt provides a scaffolding that helps the LM organize its own capabilities more effectively. [Section 2]
(6) HoL: I'd like to better understand the methodology. Your paper mentions a "shallow hierarchical configuration" with the Meta Model as the central authority. Could you elaborate on this structure and why you chose it rather than, say, a more peer-based collaborative approach?
(7) Author: The shallow hierarchical configuration was a deliberate design choice based on both practical considerations and empirical findings.
In our approach, the Meta Model sits at the top of a simple two-level hierarchy. The Meta Model (conductor) coordinates everything, while the expert models exist at the level below, each operating independently without directly communicating with each other.
We chose this structure for several reasons:
First, it simplifies the communication flow. With the Meta Model as the central authority, we avoid the complexity of managing peer-to-peer interactions between experts, which could quickly become unwieldy as the number of experts increases.
Second, it maintains a clear "thread" throughout the problem-solving process. The Meta Model retains the full context history, ensuring continuity and coherence in the approach.
Third, we found empirically that this approach works better than alternatives. In our early experiments, we explored more collaborative configurations where experts could directly interact, but that often led to confusion, redundancy, and sometimes conflicting approaches without a clear resolution mechanism.
As for why not a more peer-based approach - we actually explored this in our comparisons with multipersona prompting (which you might recognize from Wang et al., 2023). In that approach, multiple personas collaborate more as peers. Our experiments showed that our hierarchical approach with the Meta Model as coordinator outperformed the peer-based approach by about 15.2% on average across tasks.
The hierarchy also allows for what we call "fresh eyes" - since each expert only sees what the Meta Model shares with them, they approach their subtask without being biased by previous attempts or reasoning, which helps avoid the common problem of LMs doubling down on mistakes. [Section 2 and 4.3]
(8) Dr. P: I want to dig into your evaluation methodology. In Table 1, you show impressive results across different tasks, particularly for the Game of 24 where you achieve 67% accuracy with the Python interpreter compared to just 3% with standard prompting. Could you walk us through your evaluation protocols? How exactly did you measure "accuracy" across these diverse tasks?
(9) Author: Our evaluation methodology needed to be flexible given the diversity of tasks we assessed. We used three different metrics depending on the nature of each task:
- Exact Match (EM): This is the strictest evaluation, where an answer is deemed correct only if it perfectly matches the ground truth. We used this for Geometric Shapes, Multi-Step Arithmetic Two, and Checkmate-in-One tasks.
- Soft Match (SM): This is more lenient, where an answer is correct if the ground truth label appears within the model's output. We used this for MGSM (Multilingual Grade School Math) and Word Sorting.
- Functionally Correct (FC): This evaluates whether the answer fulfills the task requirements, even if it doesn't match a specific ground truth. This was crucial for tasks like Game of 24, Python Programming Puzzles, and our Shakespearean Sonnet Writing task.
For the Game of 24 specifically, which you highlighted, we used Functional Correctness. A solution is correct if it forms an expression equal to 24 using each of the four given numbers exactly once. The striking improvement from 3% to 67% shows how meta-prompting with a Python interpreter allowed the model to systematically explore possible combinations rather than relying on direct intuition.
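The functional-correctness check for the Game of 24 is mechanical enough to automate. Here is a minimal sketch in Python (our own illustration, not the paper's released evaluation code; the function name and numerical tolerance are assumptions):

```python
import ast
from collections import Counter

def check_24(expression: str, numbers: list[int]) -> bool:
    """Functional-correctness check for Game of 24 (illustrative helper):
    the expression must evaluate to 24 and use each given number exactly once."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return False
    # Collect the numeric literals actually used in the expression.
    used = [node.value for node in ast.walk(tree)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float))]
    if Counter(used) != Counter(numbers):
        return False  # wrong numbers, or a number used more/less than once
    try:
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return False
    return abs(value - 24) < 1e-6
```

For instance, `check_24("(10 - 4) * (13 - 9)", [4, 9, 10, 13])` passes, while an expression that reuses or drops a number fails even if it evaluates to 24.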
We also implemented consistent answer extraction protocols. Our system instruction to the Meta Model specifies a particular format for final answers - they must be wrapped in triple quotes and preceded by ">> FINAL ANSWER:". This enabled us to consistently extract and evaluate responses across different approaches.
To ensure rigor, we ran all experiments with the same parameters across all methods - temperature of 0, top-p of 0.95, and maximum token count of 1024. For GPT-4 specifically, we used the 32k context version to accommodate potentially lengthy reasoning chains.
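For concreteness, those shared decoding settings can be captured as a single configuration; the dictionary below is just an illustrative layout, not the paper's actual experiment script:

```python
# Decoding parameters held fixed across all compared methods
# (values from the talk; the dict layout itself is only illustrative).
GENERATION_PARAMS = {
    "model": "gpt-4-32k",   # 32k-context variant, for long reasoning chains
    "temperature": 0.0,     # deterministic decoding for reproducibility
    "top_p": 0.95,
    "max_tokens": 1024,
}
```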
It's worth noting that while we present average performance in the paper, we're also releasing all model inputs, interactions, and outputs in our GitHub repository for transparency and reproducibility. [Section 3.3 and 3.4]
(10) Senior: I'm interested in what makes this approach truly novel. How does meta-prompting differ from existing methods like chain-of-thought or expert prompting? What's the key innovation here?
(11) Author: The key innovations of meta-prompting compared to existing methods are:
- Structured orchestration with "fresh eyes": Unlike chain-of-thought prompting, which guides an LM through sequential reasoning steps, meta-prompting creates a structured process where the same LM plays different roles with different perspectives. The critical innovation is that each expert gets only the information relevant to their subtask, providing "fresh eyes" on the problem. This helps prevent the model from getting stuck in erroneous reasoning patterns, which is a common issue in methods like chain-of-thought.
- Dynamic task decomposition: While expert prompting assigns a single expert persona to a task, meta-prompting enables dynamic decomposition into multiple subtasks with different experts. The Meta Model actively decides which experts to call, what instructions to give them, and how to integrate their outputs - all without requiring human intervention.
- Tool integration in a task-agnostic framework: We demonstrated that meta-prompting can seamlessly incorporate external tools (like a Python interpreter) while maintaining its task-agnostic nature. This isn't just appending code execution to prompting - it's incorporating it into the orchestration framework where the Meta Model decides when and how to use the tool.
- Verification loops without explicit instruction: Meta-prompting naturally encourages verification because the Meta Model can call different experts to check each other's work. This happens without explicitly prompting for it, emerging from the orchestration framework itself.
For example, in one Game of 24 execution, we observed:
- The Meta Model first consulting a mathematics expert
- Another expert identifying an error in that solution
- The Meta Model then calling a programming expert to write code
- Another expert verifying the program's output
This multi-layered verification process emerges organically from the meta-prompting framework, something not inherently present in existing methods.
What makes this particularly innovative is that it's a general framework that enhances performance across very different types of tasks without task-specific examples or instructions. [Sections 1, 2, and 4]
(12) LaD: Could you tell us more about the datasets you used for evaluation? Particularly for the creative task like Shakespearean Sonnet Writing - how did you construct that dataset and evaluate the quality of outputs?
(13) Author: We used a mix of existing benchmarks and one novel dataset that we created. Let me go through them:
For existing benchmarks:
- Game of 24: From Yao et al. (2023a), this consists of arithmetic puzzles where you must form an expression equal to 24 using four numbers exactly once.
- BIG-Bench Hard tasks: These include Geometric Shapes, Multi-Step Arithmetic Two, and Word Sorting from Suzgun et al. (2023b).
- Checkmate-in-One: Directly from the BIG-Bench suite (Srivastava et al., 2023).
- Python Programming Puzzles (P3): From Schuster et al. (2021), these are challenging programming problems with varying difficulty levels.
- Multilingual Grade School Math (MGSM): From Shi et al. (2023), which includes math problems in ten diverse languages.
For the Shakespearean Sonnet Writing task, we created this dataset ourselves specifically for this paper. Here's how we approached it:
- Construction: We created a set of prompts asking the model to write a sonnet with a strict rhyme scheme ("ABAB CDCD EFEF GG") and containing three specific words that must be used verbatim. The three words were randomly selected from a curated list to ensure they presented a creative challenge while remaining potentially usable in poetic language.
- Evaluation: This is where it gets interesting. Since there's no "ground truth" for a creative task, we used Functional Correctness (FC) as our metric. A sonnet was deemed correct if it:
- Followed the specified rhyme scheme
- Included all three required words verbatim
- Maintained the general structure of a Shakespearean sonnet (14 lines with appropriate meter)
We had human evaluators check the outputs against these criteria. The evaluation wasn't subjective in terms of poetic quality - it focused on whether the functional requirements were met.
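Two of those three checks are mechanical enough to automate. The sketch below is our own helper, not the paper's evaluation code; a real rhyme-scheme check needs pronunciation data (e.g. the CMU Pronouncing Dictionary), so it is deliberately left to the human evaluators here:

```python
import re

def check_sonnet_constraints(sonnet: str, required_words: list[str]) -> dict:
    """Check the mechanical constraints on a generated sonnet:
    exactly 14 non-empty lines, and all required words present verbatim.
    The ABAB CDCD EFEF GG rhyme check is NOT attempted: it needs
    pronunciation data and is left to human (or phonetic-library) review."""
    lines = [ln for ln in sonnet.strip().splitlines() if ln.strip()]
    words = set(re.findall(r"[a-z']+", sonnet.lower()))
    return {
        "fourteen_lines": len(lines) == 14,
        "has_required_words": all(w.lower() in words for w in required_words),
    }
```

A sonnet passes this pre-filter only if both flags are true; rhyme and meter are then judged separately.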
This task was particularly interesting because it tested the ability of meta-prompting to handle creative constraints while ensuring technical adherence to form. The meta-prompting approach typically had the Meta Model consult both a poetry expert for composition and another expert to verify that all constraints were met.
The sonnet task showed one of the significant improvements - from 62% accuracy with standard prompting to 79.6% with meta-prompting (with Python interpreter). This demonstrates that meta-prompting's benefits extend beyond purely computational tasks to creative domains as well. [Section 3.2]
(14) MML: I'd like to understand the algorithmic procedure more precisely. In Algorithm 1, you present the pseudocode for meta-prompting. Could you walk us through the mathematical formulation and explain some of the key variables like eexp and eret?
(15) Author: Happy to walk through the algorithmic procedure in detail. Algorithm 1 formalizes the meta-prompting process with precise mathematical notation.
First, let me define the notation we use:
- S represents the set of finite strings, with ∅ being the empty string
- x ∈ S refers to the test-time query (the task or problem)
- LM: S → S represents our language model function, which takes a string input and produces a string output
- H represents the prompt history (the accumulating conversation)
Then we have several template functions:
- tinit: S → S transforms the raw query into a suitable template with initial instructions
- tmid: S → S formats text to be added to the conversation history
- texp: S → S wraps the output of the Meta Model in a prompt suitable for an expert model
And the extractors:
- eexp: S → S extracts instructions for an expert from the Meta Model's output (returning ∅ if none)
- eret: S → S extracts the final answer from the Meta Model's output (returning ∅ if none)
The symbol ⊕ represents string concatenation.
Now, let me walk through the algorithm:
- We start by initializing H₁ ← tinit(x), which transforms the user query into our initial prompt history with the meta-prompting instructions.
- We then enter a loop for t = 1 to T (where T is the maximum number of turns):
  a. We first get the Meta Model's response: yt ← LM(Ht)
  b. We then check whether the Meta Model has provided expert instructions by applying eexp(yt). If it has (eexp(yt) ≠ ∅), we format these instructions using texp, get the expert's response zt by calling the LM again with just these instructions, and update the history by appending this response: Ht+1 ← Ht ⊕ tmid(zt)
  c. If instead the Meta Model has returned a final answer (eret(yt) ≠ ∅), we extract and return it
  d. If neither of the above, we append an error message to the history
This algorithm captures the recursive nature of meta-prompting: the Meta Model can either request help from an expert or provide a final answer. When consulting experts, the key insight is that experts receive only their specific instructions (via texp(eexp(yt))) rather than the full history Ht, ensuring they maintain "fresh eyes" on the problem.
The extractors eexp and eret are crucial for identifying when the Meta Model is requesting expert help versus providing a final answer. In practice, we implement these as simple pattern-matching functions that look for specific markers in the text (like "Expert [Name]:" for eexp and ">> FINAL ANSWER:" for eret).
This approach allows for dynamic decomposition of problems and flexible orchestration of multiple expert consultations before reaching a final answer. [Section 2]
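The loop just described is compact enough to sketch in code. The following is our paraphrase of Algorithm 1 in Python, with a stubbed `lm` callable standing in for the real model and regex extractors matching the markers mentioned above; the exact prompt strings and round limit are illustrative assumptions, not the paper's released implementation:

```python
import re
from typing import Callable

def e_exp(y: str) -> str:
    """Extract expert instructions: 'Expert <Name>:' followed by triple-quoted text."""
    m = re.search(r'Expert [^:]+:\s*"""(.*?)"""', y, re.DOTALL)
    return m.group(1) if m else ""

def e_ret(y: str) -> str:
    """Extract a final answer wrapped in triple quotes after the marker."""
    m = re.search(r'>>\s*FINAL ANSWER:\s*"""(.*?)"""', y, re.DOTALL)
    return m.group(1) if m else ""

def meta_prompt(lm: Callable[[str], str], x: str, max_rounds: int = 10) -> str:
    """Paraphrase of Algorithm 1: each round, the Meta Model either consults
    an expert (which sees only its own instructions - 'fresh eyes') or
    returns a final answer."""
    history = f"Meta instructions ... User query: {x}"   # t_init(x)
    for _ in range(max_rounds):                          # t = 1..T
        y = lm(history)                                  # Meta Model turn
        instructions = e_exp(y)
        if instructions:                                 # expert call requested
            z = lm(f"You are an expert. {instructions}") # t_exp: no history shared
            history += f"\n{y}\nExpert response: {z}"    # t_mid appended to H
        elif e_ret(y):
            return e_ret(y)                              # final answer found
        else:
            history += "\nError: no expert call or final answer found."
    return ""
```

With a scripted stand-in for `lm`, the loop consults one expert (which sees only its own instructions) and then returns the extracted final answer.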
(16) Indus: From an industry perspective, I'm curious about the practical implications. What are the cost and efficiency considerations of meta-prompting? And what real-world applications do you envision this being particularly valuable for?
(17) Author: Excellent questions about the practical aspects. Let me address both the cost/efficiency considerations and potential applications:
Cost and Efficiency: Meta-prompting does involve multiple LM calls, which increases computational cost and latency compared to single-shot prompting. For instance, a typical meta-prompting session might involve 3-6 model calls depending on task complexity. For the Game of 24, we saw an average of about 3.5 rounds before reaching a solution.
There are several important considerations:
- Cost tradeoffs: While more expensive per query, meta-prompting often leads to significantly higher accuracy. For high-value applications where errors are costly, this tradeoff is favorable.
- Scaling improvements: As LM costs decrease over time (as they historically have), the absolute cost difference becomes less significant.
- Context window requirements: Meta-prompting benefits from larger context windows to maintain the full conversation history. We used GPT-4-32k in our experiments.
- API improvements: Since our experiments, OpenAI has introduced features like code execution directly through their API, which would reduce the cost and complexity of the Python interpreter integration.
Real-world Applications: Meta-prompting is particularly valuable for applications requiring:
- Complex reasoning: Financial analysis, legal document processing, medical diagnosis assistance - areas where decomposing problems and verifying intermediate steps is crucial.
- Creative constraints: Marketing copy that must adhere to specific brand guidelines, technical documentation that must maintain consistency while being comprehensive.
- Code generation and debugging: The significant improvements we saw in programming tasks suggest applications in software development assistance, especially debugging complex issues.
- Multi-step planning: Business strategy development, project planning, or any task requiring breaking down goals into actionable steps.
- Educational applications: Tutoring systems that can decompose learning objectives and verify student understanding from multiple angles.
One specific example from our findings: For multilingual tasks (MGSM), meta-prompting showed particular improvements for underrepresented languages like Bengali and Telugu. This suggests applications in making AI more accessible and effective across diverse linguistic contexts.
The key industry advantage is that meta-prompting requires no model fine-tuning - it's a prompting technique that works with existing models, making deployment relatively straightforward. [Section 4.4 and 5.2]
(18) A: We've had some excellent initial questions. Let's dive deeper into specific aspects of the paper. Dr. P, I noticed you had a follow-up question?
(19) Dr. P: Yes, I'm interested in the "fresh eyes" concept you mentioned. Can you elaborate on how you measured or quantified the benefit of this approach compared to methods where the full context is maintained throughout?
(20) Author: The "fresh eyes" concept is indeed one of the key mechanisms behind meta-prompting's effectiveness. We didn't directly quantify it as an isolated variable in our experimental design, but we did observe its benefits in several ways:
First, let me clarify what we mean by "fresh eyes." In meta-prompting, when the Meta Model calls an expert, that expert only receives its specific instructions without the full conversation history. This means each expert approaches its subtask without being influenced by previous reasoning attempts or potential errors.
We observed the benefits of this mechanism in several ways:
- Error recovery patterns: In our qualitative analysis, we found numerous instances where an initial expert made an error, but a subsequent expert with "fresh eyes" identified and corrected it. For example, in Game of 24 problems, we often saw a pattern where:
- Expert A would propose a solution
- Expert B would review it and identify an error
- Expert C would then approach the problem from a different angle
- Comparison with multipersona prompting: The most direct evidence comes from our comparison with multipersona prompting, which uses a similar expert-based approach but without the "fresh eyes" mechanism (all personas share the same history). Meta-prompting outperformed multipersona prompting by 15.2% on average across tasks, suggesting the fresh perspective provides substantial benefits.
- No-solution acknowledgments: Interestingly, meta-prompting was more likely to acknowledge when no solution exists (or when it couldn't find one). In the Game of 24, meta-prompting reported no solution in 9-15% of cases compared to just 2% with standard prompting. While all examples actually had solutions, this suggests meta-prompting is less prone to forcing incorrect answers when uncertain.
- Reduced persistence of errors: We observed that when errors occurred in the meta-prompting framework, they were less likely to persist throughout the entire reasoning chain compared to methods like chain-of-thought, where an early mistake often propagates through all subsequent steps.
This approach addresses a well-documented problem with LMs - their tendency to double down on mistakes and exhibit overconfidence. The fresh eyes mechanism helps mitigate this by introducing independent verification at multiple stages.
While we didn't design experiments specifically to isolate and quantify this effect (which would be valuable future work), the performance improvements and qualitative patterns strongly suggest it's a key factor in meta-prompting's effectiveness. [Section 4.3]
(21) HoL: I've been reviewing your system instructions for the Meta Model in Figure 3. It's quite detailed and specific in how it guides the model to structure its behavior. How much engineering and iteration went into designing these instructions? And how sensitive is the performance to the exact wording?
(22) Author: You've touched on a critical aspect of our work. The system instructions for the Meta Model shown in Figure 3 were indeed the result of substantial engineering and iteration.
We went through approximately 15-20 major iterations of these instructions over a period of several months. Each iteration involved testing on a subset of our tasks, analyzing failure patterns, and refining the instructions accordingly.
Some key elements that evolved through this process:
- Expert calling format: We found that structuring how the Meta Model calls experts (with the name followed by a colon and triple-quoted instructions) was crucial for consistent extraction and processing.
- Memory reminders: The instruction "Keep in mind that all experts, except yourself, have no memory!" emerged after we observed the Meta Model sometimes assuming experts had context from previous interactions.
- Verification emphasis: We increased emphasis on verification after noticing that early versions sometimes accepted the first expert's solution without adequate checking.
- Final answer format: The specific formatting for final answers (with ">> FINAL ANSWER:" followed by triple quotes) was standardized to enable consistent extraction.
As for sensitivity to exact wording, we found that:
- High sensitivity to structural elements: The formatting instructions (how to call experts, how to present final answers) were extremely sensitive - changing these would break our extraction pipeline.
- Moderate sensitivity to conceptual guidance: Instructions about verification, breaking down problems, and expert selection showed moderate sensitivity. Changes here didn't break functionality but could reduce effectiveness.
- Lower sensitivity to explanatory text: The explanatory portions of the instructions (explaining what experts are, what the Meta Model's role is) were less sensitive to exact wording.
We also discovered that GPT-4 was significantly better at following these structured instructions than earlier models like GPT-3.5. This aligns with findings from other research showing that GPT-4 has enhanced instruction-following capabilities.
That said, we believe there's still room for optimization. The instructions we settled on represent what worked well across our diverse tasks, but task-specific instruction tuning could potentially yield further improvements for specialized applications. [Section 2 and 3.4]
(23) Junior: You mentioned using different types of experts for different tasks. How does the system decide which experts to call? And do you have any examples of the kinds of experts that get called for different types of problems?
(24) Author: That's a fascinating aspect of meta-prompting - the Meta Model itself decides which experts to call based on the task at hand, without any explicit instructions about which experts are appropriate for which tasks.
The decision process works like this:
- The Meta Model analyzes the query
- It decides which types of expertise would be helpful
- It dynamically creates and names these experts
- It formulates specific instructions for each expert
We visualized this in Figures 4 and 5 in the paper, showing the distribution of experts called across different tasks. Let me share some examples:
For Geometric Shapes (which involves identifying shapes from SVG paths):
- Expert Graphic Designer was called in 63.1% of cases
- Expert Geometer in 15.5%
For Checkmate-in-One chess problems:
- Expert Chess Player was called in 48.4% of cases
- Expert Chess Analyst in 32.1%
For Python Programming Puzzles:
- Expert Python was called in 64.3% of cases
- Expert Mathematician in 15.8%
For Sonnet Writing:
- Expert Poet was called in 50.2% of cases
- Expert Essayist in 37.6%
What's particularly interesting is how this distribution changes when we add the Python interpreter capability. For example, in the Game of 24:
- Without Python interpreter: Expert Mathematician (51.0%) and Expert Problem Solver (44.9%)
- With Python interpreter: Expert Mathematician (41.8%), Expert Python (36.0%), and Expert Problem Solver (22.3%)
This shows the Meta Model adaptively incorporating the new capability when available.
Sometimes the Meta Model creates very specific expert roles that surprised us. Examples include:
- Expert Arithmetic Verifier
- Expert Poet Reviewer (specifically to verify sonnets met all requirements)
- Expert Chess Analyst (distinct from Player, focused on verification)
- Expert SVG Specialist
These roles emerge organically from the meta-prompting framework - we never provide a list of possible experts or instructions about which experts to use for which tasks. This demonstrates the flexible, task-adaptive nature of the approach. [Section 5.1]
(25) Senior: I'm curious about the failure modes of this approach. Where does meta-prompting break down or perform worse than simpler methods? And did you observe any interesting patterns in these failures?
(26) Author: That's an excellent question about the limitations. Despite its overall strong performance, meta-prompting does have several failure modes and limitations we observed:
- Cost efficiency issues: The most obvious limitation is the increased computational cost and latency from multiple model calls. For very simple tasks where a single call would suffice, meta-prompting can be unnecessarily complex and expensive.
- Context window saturation: With extensive reasoning chains, especially for complex problems, the context window can become saturated. We observed this occasionally in the Python Programming Puzzles, where lengthy code explorations left little room for additional reasoning.
- Sequential bottlenecks: Since meta-prompting is inherently sequential (each step depends on previous ones), it can't leverage parallel processing. This creates a performance bottleneck that becomes more pronounced as the number of required expert consultations increases.
- Information transfer failures: We observed cases where the Meta Model failed to properly convey necessary information to experts. For instance, sometimes it would reference information from the history without explicitly including it in the expert instructions, leading to confusion.
- Poor expert selection: In some cases, particularly for Geometric Shapes, the Meta Model selected suboptimal experts. For this task, it often chose Expert Graphic Designer (52.1%) when an expert in SVG visualization might have been more appropriate - and interestingly, zero-shot chain-of-thought actually outperformed meta-prompting on this specific task.
- Apologetic response patterns: In tasks with lower performance, we noticed the Meta Model would often include apologetic language like "Apologies for the confusion in my previous response" - a pattern likely stemming from its training on instruction-following data.
- Tool integration limitations: While the Python interpreter integration was successful, it created new failure modes related to code generation errors that wouldn't occur in simpler approaches.
- Smaller model limitations: We found that meta-prompting's benefits were less pronounced with GPT-3.5 compared to GPT-4, suggesting that the approach may require a certain level of model capability to be effective.
I should note that these limitations highlight areas for future improvement rather than fundamental flaws in the approach. Many could be addressed through refinements like context summarization techniques, improved expert instruction templates, or incorporating parallel processing capabilities.
The most consistently challenging task for meta-prompting was Geometric Shapes, where other methods like zero-shot chain-of-thought performed better. This suggests that for tasks requiring spatial reasoning or visual interpretation, different approaches might be more suitable. [Section 5.2]
(27) LaD: Your paper mentions creating a novel "Shakespearean Sonnet Writing" task. Could you share an example of one of the generated sonnets and explain how you evaluated them? I'm particularly interested in how you assessed adherence to the rhyme scheme.
(28) Author: While we don't include full examples of the generated sonnets in the paper itself, I'd be happy to discuss how we evaluated them and describe a representative example.
For the Shakespearean Sonnet Writing task, each prompt specified that the sonnet must:
- Follow the strict rhyme scheme "ABAB CDCD EFEF GG"
- Contain three specific words verbatim (e.g., "algorithm," "digital," "quantum")
- Adhere to the general structure of a Shakespearean sonnet (14 lines with appropriate meter)
Here's a representative example of what meta-prompting generated when asked to include the words "digital," "quantum," and "network":
When digital realms expand their reach so wide,
Across the vastness of our mortal sphere,
In bits and bytes our thoughts now must abide,
As ancient wisdom fades from year to year.
The quantum leap of minds now intertwined,
Through unseen waves of data's endless flow,
Reveals new truths that once were undefined,
Yet leaves us wondering what we truly know.
This network vast of knowledge ever grows,
Connecting souls across the distant lands,
Like Avon's waters that forever flows,
United by invisible commands.
Though worlds apart, we share one common heart,
In verse and code, two realms no more apart.
For evaluation, we used a structured approach with human evaluators checking:
- Rhyme scheme: We manually verified that the ending sound of each line followed the ABAB CDCD EFEF GG pattern. The example above follows this with the rhyme pattern: wide/abide, sphere/year, intertwined/undefined, flow/know, grows/flows, lands/commands, heart/apart.
- Required words: We checked that all three specified words appeared verbatim. In the example, "digital" appears in line 1, "quantum" in line 5, and "network" in line 9.
- Sonnet structure: We verified the sonnet had exactly 14 lines and generally followed iambic pentameter (though we allowed some flexibility in meter as even Shakespeare wasn't always strictly consistent).
- Shakespearean elements: While not a strict requirement, we noted whether the sonnet included thematic and stylistic elements typical of Shakespearean sonnets, such as the turn or "volta" typically occurring at the beginning of the third quatrain or the final couplet.
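To make the first three criteria concrete, here is a rough automated approximation of those checks (an illustrative sketch only; as noted above, the actual evaluation used human judges, and the last-two-letters rhyme heuristic is a crude stand-in for a human's sound-based judgment):

```python
import re

def check_sonnet(sonnet: str, required_words: list[str]) -> dict:
    """Approximate the manual sonnet checks: 14 lines, required words
    verbatim, and an ABAB CDCD EFEF GG rhyme scheme (heuristic)."""
    lines = [ln.strip() for ln in sonnet.strip().splitlines() if ln.strip()]
    text = sonnet.lower()
    has_14_lines = len(lines) == 14
    words_present = all(w.lower() in text for w in required_words)
    result = {"14_lines": has_14_lines,
              "required_words": words_present,
              "rhyme_approx": False}
    if not has_14_lines:
        return result
    # Line-final words, stripped of punctuation, lowercased
    finals = [re.sub(r"[^a-z]", "", ln.split()[-1].lower()) for ln in lines]
    scheme = "ABABCDCDEFEFGG"
    rhyme_ok = True
    for letter in set(scheme):
        idxs = [i for i, c in enumerate(scheme) if c == letter]
        # Heuristic: lines sharing a scheme letter should share an ending
        if len({finals[i][-2:] for i in idxs}) > 1:
            rhyme_ok = False
    result["rhyme_approx"] = rhyme_ok
    return result
```

A spelling-based heuristic like this misjudges genuine rhymes with different spellings (e.g. "sphere"/"year" in the example above), which is exactly why human evaluation was needed.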
What we found particularly interesting was how different prompting methods approached this task. Standard prompting often produced sonnets that either missed one of the required words or deviated from the rhyme scheme. Meta-prompting typically involved the Meta Model calling an Expert Poet to draft the sonnet and then an Expert Poet Reviewer or Expert Essayist to verify all constraints were met, often catching and correcting issues with the initial draft.
This verification step is why meta-prompting achieved 79.6% accuracy compared to 62% with standard prompting. The most common failure mode was subtle deviations from the rhyme scheme, particularly in the middle quatrains where maintaining consistent sounds becomes challenging. [Section 3.2]
(29) MML: Looking at your analysis of the number of rounds required to reach a solution, I'm interested in how this varies across different tasks. Could you provide more detail on the relationship between task complexity and the number of rounds, and whether you found any mathematical patterns in this relationship?
(30) Author: We did observe interesting patterns in how the number of rounds correlates with task complexity. While we didn't establish a formal mathematical model for this relationship, the empirical patterns are revealing.
For meta-prompting with the Python interpreter, here's what we found:
- Simpler, well-defined tasks required fewer rounds:
  - Word Sorting: 3.31 rounds on average
  - Checkmate-in-One: 3.48 rounds
  - Game of 24: ~3.5 rounds
  - Multistep Arithmetic Two: ~3.5 rounds
- Complex programming tasks required significantly more rounds:
  - Python Programming Puzzles: 6.07 rounds on average
The higher round count for programming puzzles reflects several phases typically seen in these solutions:
- Initial problem understanding
- Solution planning
- Code generation
- Error identification
- Code refinement
- Solution verification
What's mathematically interesting is that the standard deviation of rounds also increased with task complexity. For simpler tasks, the distribution was tighter, suggesting a more predictable solution path. For complex tasks, the standard deviation was larger, indicating more variable solution paths.
We also observed a negative correlation between the number of rounds and accuracy at the task level. Tasks where meta-prompting performed exceptionally well (like Game of 24 with Python interpreter at 67% accuracy) typically had fewer rounds on average than tasks with lower performance.
This suggests an interesting hypothesis: When the Meta Model can quickly identify the right experts and solution approach, it not only reaches solutions more efficiently but also more accurately. Conversely, tasks requiring extensive back-and-forth may indicate the model is struggling to find a coherent solution path.
The relationship isn't strictly linear, however. Shakespearean Sonnet Writing showed moderate round counts (~4-5) with high accuracy (79.6%), which suggests that creative tasks may have an optimal "middle ground" of iteration: enough rounds for refinement, but not so many as to indicate fundamental confusion.
If we were to model this mathematically, task complexity seems to have a non-linear relationship with round count, with different functional forms for different task categories. Programming tasks follow more of an exponential pattern with complexity, while reasoning tasks show a more linear relationship.
From an efficiency perspective, it's worth noting that while more complex tasks required more rounds, the marginal benefit of additional rounds decreased after a certain point. We sometimes observed cases where excessive iteration actually led to worse outcomes as the model began to second-guess correct earlier solutions. [Section 5.1]
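To make the Game of 24 figure above concrete: with interpreter access, an expert can dispose of the task with a short brute-force search along these lines (an illustrative sketch, not code from the paper):

```python
def solve_24(nums: list[float], target: float = 24, eps: float = 1e-6) -> bool:
    """Brute-force Game of 24: repeatedly combine any two numbers with
    +, -, *, / until one remains, then compare it to the target."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    for i in range(len(nums)):
        for j in range(len(nums)):
            if i == j:
                continue
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            a, b = nums[i], nums[j]
            candidates = [a + b, a - b, a * b]
            if abs(b) > eps:  # avoid division by zero
                candidates.append(a / b)
            # Recurse on the reduced multiset
            if any(solve_24(rest + [c], target, eps) for c in candidates):
                return True
    return False
```

Delegating this kind of exhaustive search to the interpreter is precisely what lets the Meta Model resolve such tasks in few rounds.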
(31) Indus: How much computational overhead does meta-prompting add compared to standard prompting? And have you explored ways to optimize this process for production environments where latency might be critical?
(32) Author: The computational overhead of meta-prompting compared to standard prompting is significant:
- API calls: Meta-prompting requires 3-6 LM calls per query versus just 1 for standard prompting. For the Game of 24, the average was 3.5 rounds.
- Token usage: Total token consumption increases approximately proportionally to the number of calls, though there's some efficiency in expert calls since they don't receive the full history.
- Latency: This is the most notable impact. With each GPT-4 call potentially taking 2-5 seconds, meta-prompting can increase end-to-end latency from ~3 seconds to 10-30 seconds depending on the task.
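These figures imply a simple sequential cost model (a back-of-envelope sketch; the 0.5 s fixed per-call overhead is an assumption for illustration, not a measured value from the paper):

```python
def latency_estimate(n_calls: int, t_call: float, overhead: float = 0.5) -> float:
    """Sequential end-to-end latency: n LM calls, each taking t_call
    seconds of generation plus a fixed network/queueing overhead."""
    return n_calls * (t_call + overhead)
```

One call at ~2.5 s of generation gives ~3 s (standard prompting), while six calls at ~4.5 s each gives ~30 s, matching the upper end of the range quoted above.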
For optimization in production environments, we've explored several approaches:
- Parallel expert consultation: Modifying the framework to allow the Meta Model to call multiple experts simultaneously rather than sequentially. This could significantly reduce latency for tasks requiring multiple expert opinions.
- History summarization: Having the Meta Model periodically condense previous interactions before continuing, reducing token usage and context window limitations.
- Expert pre-selection: For domain-specific applications, providing a curated list of relevant experts rather than letting the Meta Model generate them dynamically could save an initial round of interaction.
- Hybrid approaches: Using meta-prompting only for complex queries while routing simpler queries to standard prompting based on complexity classification.
- Caching mechanisms: For repetitive tasks, caching expert responses to similar subtasks can eliminate redundant computation.
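The first and fifth ideas combine in a few lines. In this sketch, `call_expert` is a stub standing in for a real LM API round-trip, and all names are hypothetical, not part of our released framework:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

def call_expert(expert: str, subtask: str) -> str:
    # Stub for an actual LM API call carrying only the instructions
    # the Meta Model shares with this expert (the "fresh eyes" property).
    return f"[{expert}] answer to: {subtask}"

@lru_cache(maxsize=1024)
def cached_expert(expert: str, subtask: str) -> str:
    # Caching: a repeated (expert, subtask) pair skips the API entirely.
    return call_expert(expert, subtask)

def consult_parallel(requests: list[tuple[str, str]]) -> list[str]:
    # Experts are independent by construction (each sees only what the
    # Meta Model shares with it), so their calls can safely overlap.
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        futures = [pool.submit(cached_expert, e, s) for e, s in requests]
        return [f.result() for f in futures]
```

The independence of expert contexts is what makes parallelization safe here; approaches that thread one growing conversation through every call cannot overlap requests this way.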
Recent API developments also offer optimization opportunities. OpenAI's function calling capabilities and built-in code execution would streamline the Python interpreter integration, potentially reducing both latency and cost.
The cost-benefit analysis is highly application-dependent. For high-value tasks where accuracy is critical (financial analysis, medical applications, legal document processing), the improved performance likely justifies the increased latency and cost. For consumer-facing applications where response time is paramount, the optimization techniques above become essential. [Section 5.2]
(33) HoL: I'd like to get a better understanding of the limitations of meta-prompting when applied to different types of language models. You mentioned briefly that the results with GPT-3.5 were less impressive. Could you elaborate on how model scale affects the effectiveness of this technique?
(34) Author: Model scale and capability significantly impact meta-prompting's effectiveness. Our observations with GPT-3.5 versus GPT-4 revealed several key patterns:
- Task-dependent performance gaps: With GPT-3.5, meta-prompting showed modest improvements on specific tasks like Sonnet Writing and Checkmate-in-One but didn't consistently outperform simpler methods across all tasks as it did with GPT-4.
- Instruction following capabilities: GPT-3.5 struggled more with the structured meta-prompting instructions, occasionally deviating from the expected formats for expert calls and final answers. This created extraction challenges and reduced effectiveness.
- Context window limitations: GPT-3.5's smaller context window (4k tokens vs. 32k for GPT-4) became a significant bottleneck, limiting the depth of expert consultation chains and causing context truncation.
- Role-playing effectiveness: GPT-4 demonstrated superior ability to maintain consistent expert personas and provide specialized knowledge, while GPT-3.5's expert outputs were often less distinctive from each other.
The effectiveness of meta-prompting appears to scale non-linearly with model capability. Our qualitative analysis suggests three key capability thresholds:
- Instruction following threshold: The model must reliably follow structured instructions to maintain the meta-prompting framework.
- Context integration threshold: The model must effectively integrate insights across multiple expert consultations.
- Expertise simulation threshold: The model must convincingly simulate different expertise domains.
GPT-4 crosses all three thresholds, while GPT-3.5 is inconsistent, particularly with the latter two.
This aligns with broader findings about emergent abilities in larger language models. Techniques like meta-prompting that leverage more sophisticated cognitive processes (decomposition, verification, integration) appear to benefit disproportionately from increased model scale.
We hypothesize that several factors contribute to this:
- Larger models have been exposed to more instruction-following data
- They have greater working memory capacity for maintaining complex prompting structures
- They develop more distinctive internal representations of different expertise domains
This suggests that meta-prompting's advantages may become even more pronounced with future, more capable models, potentially widening the gap compared to simpler prompting methods. [Section 5.1]
(35) A: We've had a rich discussion of the paper. Let me summarize the key insights:
- Meta-prompting transforms a single language model into both a conductor and a panel of experts, breaking down complex problems and delegating subtasks dynamically.
- The approach is task-agnostic, working across diverse domains without task-specific examples.
- The "fresh eyes" mechanism, where experts receive only relevant information without the full history, helps prevent error propagation.
- Integration with external tools like a Python interpreter significantly enhances performance on computational tasks.
- Meta-prompting shows impressive gains across various tasks, particularly on complex reasoning problems like the Game of 24 and creative tasks like Shakespearean Sonnet Writing.
- The approach does have limitations, including increased computational cost, latency, and dependence on sufficiently capable language models.
Author, are there any aspects of your work that we haven't covered that you'd like to highlight? And could you share the five most important citations for understanding the foundational work that this paper builds upon?
(36) Author: Thank you for the excellent summary. Two aspects I'd like to highlight that we haven't fully explored:
First, the versatility of meta-prompting for real-world applications. Beyond our benchmark tasks, we've seen promising results applying this approach to complex real-world problems like business strategy development, technical troubleshooting, and educational content creation. The framework's ability to combine domain-specific expertise with critical verification makes it particularly valuable where both specialized knowledge and accuracy are critical.
Second, the emergent behavior of the verification process. We found that meta-prompting naturally encourages a form of adversarial verification where the Meta Model tends to select experts with complementary perspectives. This emergent behavior - not explicitly instructed - creates a robust system of checks and balances that improves overall reliability.
For the five most important citations to understand the foundational work:
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain of Thought Prompting Elicits Reasoning in Large Language Models" - This established the power of guiding LMs through explicit reasoning steps.
- Kojima, T., Gu, S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). "Large Language Models are Zero-Shot Reasoners" - Demonstrated that simply adding "Let's think step by step" improves reasoning significantly.
- Yao, S., Yu, D., Zhao, J., et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" - Introduced non-linear exploration of reasoning pathways.
- Wang, Z., Mao, S., Wu, W., Ge, T., Wei, F., & Ji, H. (2023). "Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration" - Explored multi-persona collaboration in LMs.
- Xu, B., Yang, A., Lin, J., Wang, Q., Zhou, C., Zhang, Y., & Mao, Z. (2023). "ExpertPrompting: Instructing Large Language Models to be Distinguished Experts" - Demonstrated the value of expert persona adoption in prompting.
Together, these works established the foundation of guided reasoning, multi-step problem solving, and persona-based prompting that meta-prompting builds upon and extends. [Sections 1, 6, 7]
(37) A: Thank you, everyone, for your participation in today's lab meeting. The discussion has provided valuable insights into meta-prompting and its implications for enhancing language model capabilities. We look forward to seeing how this work evolves and influences future research in the field.