Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models
link: https://arxiv.org/abs/2503.16022
Mario Sanz-Guerrero¹ and Katharina von der Wense¹,² (¹Johannes Gutenberg University Mainz, Germany; ²University of Colorado Boulder, USA)
This paper introduces and evaluates "Corrective In-Context Learning" (CICL), an approach intended to improve in-context learning in large language models by adding a self-correction signal: the model's own mispredictions, paired with their corrections, are included in the prompt. Here's an overview:
Approach
The authors propose CICL as an extension of standard In-Context Learning (ICL). The key difference is that CICL includes both the model's incorrect predictions and their corrections in the prompt. For example:
CICL format:
Text: "Oh, great, another Monday."
Predicted label: positive
Correct label: negative
Standard ICL format:
Text: "Oh, great, another Monday."
Label: negative
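The two demonstration styles above can be sketched as simple string templates (the function names here are illustrative, not from the paper's code):

```python
def icl_example(text: str, label: str) -> str:
    # Standard ICL demonstration: the text with its gold label only
    return f'Text: "{text}"\nLabel: {label}'

def cicl_example(text: str, predicted: str, correct: str) -> str:
    # CICL demonstration: the model's earlier prediction plus the correction
    return (f'Text: "{text}"\n'
            f'Predicted label: {predicted}\n'
            f'Correct label: {correct}')

print(cicl_example("Oh, great, another Monday.", "positive", "negative"))
```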
The hypothesis was that exposing LLMs to their own errors alongside corrections would enable them to learn from these mistakes and improve performance on similar instances.
The CICL methodology follows a three-step process:
- Generate initial predictions using standard ICL
- Create a feedback-augmented prompt that includes both the model's predictions and correct labels
- Generate new predictions using this corrective prompt
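One plausible reading of the three steps, sketched end to end: the model first labels the demonstration texts themselves (whose gold labels are known), so its mispredictions can be paired with corrections. `predict` below is a stand-in for whatever LLM call is used, and the helper names are hypothetical:

```python
def build_prompt(demos: list[str], query: str) -> str:
    # Concatenate demonstrations, then append the unlabeled query
    return "\n\n".join(demos) + f'\n\nText: "{query}"\nLabel:'

def cicl_predict(demos, queries, predict):
    """demos: (text, gold_label) pairs; predict: any text-in/label-out LLM call."""
    # Step 1: initial predictions for the demonstration texts via standard ICL
    icl_demos = [f'Text: "{t}"\nLabel: {y}' for t, y in demos]
    initial = [predict(build_prompt(icl_demos, t)) for t, _ in demos]
    # Step 2: feedback-augmented demonstrations (prediction + correction)
    cicl_demos = [
        f'Text: "{t}"\nPredicted label: {p}\nCorrect label: {y}'
        for (t, y), p in zip(demos, initial)
    ]
    # Step 3: new predictions for the actual queries under the corrective prompt
    return [predict(build_prompt(cicl_demos, q)) for q in queries]
```

With a real model, `predict` would wrap tokenization, generation, and label extraction; the paper's finding is that the Step 3 predictions tend to be worse than the Step 1 baseline.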
Results
Contrary to their expectations, the authors found that CICL consistently underperformed standard ICL across all tested models and datasets. Key findings:
- Performance degraded as the proportion of corrections in the prompt increased
- When the correction proportion was low (0-25%), CICL performed similarly to ICL
- At higher correction proportions (50-100%), performance dropped significantly
- This pattern was consistent across all tested models (Llama-3.1, GPT-J, Mistral 7B, Qwen2.5)
Statistical analysis showed that from a correction proportion of 25% onward, ICL demonstrated statistically significant superiority over CICL.
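The sweep over correction proportions can be sketched as a formatter that renders a chosen fraction of the demonstrations in CICL style and the rest in plain ICL style (all names and inputs here are illustrative):

```python
import random

def mix_demos(demos, preds, proportion, seed=0):
    """demos: (text, gold) pairs; preds: the model's earlier predictions.
    Render `proportion` of the demonstrations with prediction + correction,
    the rest with the gold label only."""
    rng = random.Random(seed)
    k = round(proportion * len(demos))
    corrective = set(rng.sample(range(len(demos)), k))  # which demos get feedback
    rendered = []
    for i, ((text, gold), pred) in enumerate(zip(demos, preds)):
        if i in corrective:
            rendered.append(f'Text: "{text}"\nPredicted label: {pred}\n'
                            f'Correct label: {gold}')
        else:
            rendered.append(f'Text: "{text}"\nLabel: {gold}')
    return rendered
```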
Conclusions
The authors conclude that:
- CICL introduces confusion rather than improving performance
- Including incorrect-correct label pairs disrupts the model's understanding of the task
- Using harder examples (ones the model initially misclassified) in standard ICL does not improve performance
- Example difficulty alone may not be a reliable criterion for effective selection in ICL
They suggest that the corrective mechanism may disrupt the model's internal representations, making it harder to generalize effectively.
Methodology Rigor
The paper demonstrates several indicators of methodological rigor:
- Comprehensive experimentation: Testing across 17 diverse text classification datasets
- Multiple models: Evaluation using four different LLMs (Llama-3.1, GPT-J, Mistral 7B, Qwen2.5)
- Statistical validation: Using appropriate statistical tests (Wilcoxon signed-rank and Kruskal-Wallis)
- Robustness checks: Running each configuration with 5 different random seeds
- Transparency: Clear documentation of the experimental procedure and prompt formats
- Negative results reporting: Openly sharing findings that contradicted their initial hypothesis
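A paired Wilcoxon signed-rank comparison of the kind reported can be sketched with SciPy; the accuracy numbers below are invented for illustration, not the paper's results:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset accuracies for the two methods (illustrative only)
icl_acc  = [0.81, 0.74, 0.69, 0.88, 0.77, 0.83, 0.72, 0.79]
cicl_acc = [0.76, 0.70, 0.66, 0.84, 0.75, 0.78, 0.68, 0.74]

# One-sided paired test: does ICL score higher than CICL across datasets?
stat, p = wilcoxon(icl_acc, cicl_acc, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")
```

The test is paired (same datasets for both methods) and non-parametric, which suits small samples of accuracy scores with no normality assumption.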
The authors are also transparent about limitations, noting that their experiments were restricted to smaller open-source LLMs due to computational constraints, although preliminary tests with larger models showed similar results.
Key Takeaways
- Self-correction mechanisms in LLMs do not necessarily behave the way intuition suggests
- Simply showing a model its own errors and their corrections can hurt performance rather than help it
- Harder examples don't inherently lead to better few-shot performance
- The study provides valuable insights for the development of more effective in-context learning methods
The findings challenge assumptions about how LLMs learn from feedback and demonstrate the importance of empirical evaluation even for seemingly intuitive approaches.