Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models
link: https://arxiv.org/abs/2503.16022
Mario Sanz-Guerrero¹ and Katharina von der Wense¹,² (¹Johannes Gutenberg University Mainz, Germany; ²University of Colorado Boulder, USA)
This paper introduces and evaluates "Corrective In-Context Learning" (CICL), an approach intended to improve in-context learning in large language models by adding a self-correction signal: the model's own mispredictions, paired with their corrections, are included in the prompt. Here's an overview:
Approach
The authors propose CICL as an extension of standard In-Context Learning (ICL). The key difference is that CICL includes both the model's incorrect predictions and their corrections in the prompt. For example:
CICL format:
Text: "Oh, great, another Monday."
Predicted label: positive
Correct label: negative
Standard ICL format:
Text: "Oh, great, another Monday."
Label: negative
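The two demonstration styles above can be sketched as simple string templates (the function names here are illustrative, not from the paper's code):

```python
def icl_example(text: str, label: str) -> str:
    # Standard ICL demonstration: the text with its gold label only
    return f'Text: "{text}"\nLabel: {label}'

def cicl_example(text: str, predicted: str, correct: str) -> str:
    # CICL demonstration: the model's earlier prediction plus the correction
    return (f'Text: "{text}"\n'
            f'Predicted label: {predicted}\n'
            f'Correct label: {correct}')

print(cicl_example("Oh, great, another Monday.", "positive", "negative"))
```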
The hypothesis was that exposing LLMs to their own errors alongside corrections would enable them to learn from these mistakes and improve performance on similar instances.
The CICL methodology follows a three-step process:
- Generate initial predictions using standard ICL
- Create a feedback-augmented prompt that includes both the model's predictions and correct labels
- Generate new predictions using this corrective prompt
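One plausible reading of the three steps, sketched end to end: the model first labels the demonstration texts themselves (whose gold labels are known), so its mispredictions can be paired with corrections. `predict` below is a stand-in for whatever LLM call is used, and the helper names are hypothetical:

```python
def build_prompt(demos: list[str], query: str) -> str:
    # Concatenate demonstrations, then append the unlabeled query
    return "\n\n".join(demos) + f'\n\nText: "{query}"\nLabel:'

def cicl_predict(demos, queries, predict):
    """demos: (text, gold_label) pairs; predict: any text-in/label-out LLM call."""
    # Step 1: initial predictions for the demonstration texts via standard ICL
    icl_demos = [f'Text: "{t}"\nLabel: {y}' for t, y in demos]
    initial = [predict(build_prompt(icl_demos, t)) for t, _ in demos]
    # Step 2: feedback-augmented demonstrations (prediction + correction)
    cicl_demos = [
        f'Text: "{t}"\nPredicted label: {p}\nCorrect label: {y}'
        for (t, y), p in zip(demos, initial)
    ]
    # Step 3: new predictions for the actual queries under the corrective prompt
    return [predict(build_prompt(cicl_demos, q)) for q in queries]
```

With a real model, `predict` would wrap tokenization, generation, and label extraction; the paper's finding is that the Step 3 predictions tend to be worse than the Step 1 baseline.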
Results
Contrary to their expectations, the authors found that CICL consistently underperformed standard ICL across all tested models and datasets. Key findings:
- Performance degraded as the proportion of corrections in the prompt increased
- When the correction proportion was low (0-25%), CICL performed similarly to ICL
- At higher correction proportions (50-100%), performance dropped significantly
- This pattern was consistent across all tested models (Llama-3.1, GPT-J, Mistral 7B, Qwen2.5)
Statistical analysis showed that from a correction proportion of 25% onward, ICL demonstrated statistically significant superiority over CICL.
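The sweep over correction proportions can be sketched as a formatter that renders a chosen fraction of the demonstrations in CICL style and the rest in plain ICL style (all names and inputs here are illustrative):

```python
import random

def mix_demos(demos, preds, proportion, seed=0):
    """demos: (text, gold) pairs; preds: the model's earlier predictions.
    Render `proportion` of the demonstrations with prediction + correction,
    the rest with the gold label only."""
    rng = random.Random(seed)
    k = round(proportion * len(demos))
    corrective = set(rng.sample(range(len(demos)), k))  # which demos get feedback
    rendered = []
    for i, ((text, gold), pred) in enumerate(zip(demos, preds)):
        if i in corrective:
            rendered.append(f'Text: "{text}"\nPredicted label: {pred}\n'
                            f'Correct label: {gold}')
        else:
            rendered.append(f'Text: "{text}"\nLabel: {gold}')
    return rendered
```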
Conclusions
The authors conclude that:
- CICL introduces confusion rather than improving performance
- Including incorrect-correct label pairs disrupts the model's understanding of the task
- Using harder examples (ones the model initially misclassified) in standard ICL does not improve performance
- Example difficulty alone may not be a reliable criterion for effective selection in ICL
They suggest that the corrective mechanism may disrupt the model's internal representations, making it harder to generalize effectively.
Methodology Rigor
The paper demonstrates several indicators of methodological rigor:
- Comprehensive experimentation: Testing across 17 diverse text classification datasets
- Multiple models: Evaluation using four different LLMs (Llama-3.1, GPT-J, Mistral 7B, Qwen2.5)
- Statistical validation: Using appropriate statistical tests (Wilcoxon signed-rank and Kruskal-Wallis)
- Robustness checks: Running each configuration with 5 different random seeds
- Transparency: Clear documentation of the experimental procedure and prompt formats
- Negative results reporting: Openly sharing findings that contradicted their initial hypothesis
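A paired Wilcoxon signed-rank comparison of the kind reported can be sketched with SciPy; the accuracy numbers below are invented for illustration, not the paper's results:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset accuracies for the two methods (illustrative only)
icl_acc  = [0.81, 0.74, 0.69, 0.88, 0.77, 0.83, 0.72, 0.79]
cicl_acc = [0.76, 0.70, 0.66, 0.84, 0.75, 0.78, 0.68, 0.74]

# One-sided paired test: does ICL score higher than CICL across datasets?
stat, p = wilcoxon(icl_acc, cicl_acc, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")
```

The test is paired (same datasets for both methods) and non-parametric, which suits small samples of accuracy scores with no normality assumption.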
The authors are also transparent about limitations, noting that their experiments were restricted to smaller open-source LLMs due to computational constraints, although preliminary tests with larger models showed similar results.
Key Takeaways
- Self-correction mechanisms in LLMs do not necessarily behave the way intuition suggests
- Simply showing a model its own errors and their corrections can hurt performance rather than help it
- Harder examples don't inherently lead to better few-shot performance
- The study provides valuable insights for the development of more effective in-context learning methods
The findings challenge assumptions about how LLMs learn from feedback and demonstrate the importance of empirical evaluation even for seemingly intuitive approaches.