MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion
link: https://arxiv.org/abs/2503.16212
Paper by: Qizhi Pei, Lijun Wu, Zhuoshi Pan, Yu Li, Honglin Lin, Chenlin Ming, Xin Gao, Conghui He, Rui Yan
Introduction
Mathematical reasoning remains a critical benchmark for assessing the cognitive capabilities of Large Language Models (LLMs). While significant progress has been made in this domain, current approaches rely primarily on instance-level modifications that fail to capture the relational structures inherent in mathematical knowledge.
A recent paper introduces MathFusion, a novel framework that enhances mathematical reasoning in LLMs by strategically combining related mathematical problems using three fusion strategies. This approach represents a fundamental shift from enhancing problems independently to synthesizing instructions across problems.
The MathFusion Approach
MathFusion leverages the relational structures inherent in mathematical knowledge through three distinct fusion strategies:
- Sequential Fusion: Creates dependencies between problems by chaining them together, where the answer from one problem becomes an input for another. This mirrors the sequential reasoning often required in complex mathematical problem-solving.
- Parallel Fusion: Combines analogous problems to create a new problem that encapsulates shared mathematical concepts, reinforcing conceptual understanding across different contexts.
- Conditional Fusion: Creates context-aware selective problems where the final answer depends on comparing the results of two independent problems, enhancing reasoning flexibility.
Using these strategies, the authors created MathFusionQA, a dataset built upon GSM8K and MATH datasets. This dataset was then used to fine-tune various LLMs including DeepSeekMath-7B, Mistral-7B, and Llama3-8B.
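The paper describes generating fused problems with a teacher LLM. As a rough sketch of how that orchestration might look (our reconstruction, not the authors' code), the following uses a hypothetical `teacher` callable standing in for the actual model API, and prompt wordings that are illustrative only:

```python
# Sketch of a MathFusion-style data-generation loop (assumed structure, not
# the authors' implementation). `teacher` is a hypothetical callable that
# sends a prompt to a teacher LLM and returns its text completion.

FUSION_PROMPTS = {
    "sequential": (
        "Combine the two problems so that the answer to Problem A becomes "
        "an input quantity in Problem B.\nProblem A: {a}\nProblem B: {b}"
    ),
    "parallel": (
        "Combine the two analogous problems into one new problem that "
        "exercises their shared concept.\nProblem A: {a}\nProblem B: {b}"
    ),
    "conditional": (
        "Combine the two problems so that the final answer requires "
        "comparing their individual results.\nProblem A: {a}\nProblem B: {b}"
    ),
}

def fuse_pair(teacher, problem_a, problem_b, strategy):
    """Ask the teacher model to produce one fused problem for a pair."""
    prompt = FUSION_PROMPTS[strategy].format(a=problem_a, b=problem_b)
    return teacher(prompt)

def build_fused_dataset(teacher, pairs):
    """Apply all three fusion strategies to each problem pair."""
    dataset = []
    for a, b in pairs:
        for strategy in FUSION_PROMPTS:
            dataset.append({
                "strategy": strategy,
                "fused_problem": fuse_pair(teacher, a, b, strategy),
            })
    return dataset
```

Each source pair thus yields up to three fused instructions, one per strategy, which is how a modest pool of seed problems expands into the MathFusionQA training set.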
Examples of Fusion Strategies
To illustrate how these fusion strategies work in practice:
Sequential Fusion
Problem A: "During one day, there are 4 boat trips through the lake. The boat can take up to 12 people during one trip. How many people can the boat transport in 2 days?"
Problem B: "The school is organizing a trip to the museum. 4 buses were hired to take the children and teachers to their destination. The second bus has twice the number of people on it as the first bus. The third bus has 6 fewer people than the second bus. The fourth bus has 9 more people than the first bus. If the first bus has 12 people, how many people are going to the museum in total?"
Fused Problem: "The school has organized a trip to a museum and needs to transport children and teachers. First, calculate how many people can be transported by a boat over 2 days, with 4 boat trips each day, and each trip can carry up to 12 people. Let this total be the number of people in the first bus. The second bus has twice the number of people on the first bus, the third bus has 6 fewer people than the second bus, and the fourth bus has 9 more people than the first bus. How many people are going to the museum in total?"
Here, the answer to Problem A (96 people transported by boat) becomes the input for Problem B (number of people in the first bus).
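The chained arithmetic can be verified in a few lines (our own check, not from the paper):

```python
# Step 1 (Problem A): boat capacity over 2 days.
boat_total = 4 * 12 * 2          # 4 trips/day, 12 people/trip, 2 days -> 96

# Step 2 (Problem B): bus occupancies derived from the chained value.
bus1 = boat_total                # 96 (answer to Problem A feeds Problem B)
bus2 = 2 * bus1                  # 192
bus3 = bus2 - 6                  # 186
bus4 = bus1 + 9                  # 105

total = bus1 + bus2 + bus3 + bus4
print(total)                     # 579 people going to the museum
```

The intermediate value 96 is exactly the dependency that sequential fusion introduces: Problem B cannot be solved without first solving Problem A.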
Parallel Fusion
Problem A: "Add 53.463 to 12.9873 and round to the nearest thousandth."
Problem B: "Add 81.76 to 34.587 and round your answer to the nearest tenth."
Fused Problem: "Calculate the sum of 53.463 and 81.76, then add this result to 34.587. Round the total to the nearest hundredth first, and then take that result and round it to the nearest whole number. What is the final answer?"
This fusion combines similar operations (addition and rounding) but modifies the inputs and creates a multi-step problem that involves elements from both original problems.
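Working the fused problem through (our own check, using `decimal` to avoid binary floating-point rounding surprises):

```python
from decimal import Decimal, ROUND_HALF_UP

# Sum all three values from the fused problem.
total = Decimal("53.463") + Decimal("81.76") + Decimal("34.587")  # 169.810

# Round to the nearest hundredth, then to the nearest whole number.
hundredth = total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)  # 169.81
final = hundredth.quantize(Decimal("1"), rounding=ROUND_HALF_UP)     # 170

print(final)  # 170
```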
Conditional Fusion
Problem A: "Noah is a painter who sells his artwork at the park. He charges $60 for a large painting and $30 for a small painting. Last month, he sold eight large paintings and four small paintings. This month, he has doubled his sales."
Problem B: "Michael, another painter, charges $100 for a large painting and $80 for a small painting. At his last art show, he sold 5 large paintings and 8 small paintings."
Fused Problem: "Noah is a painter who sells his artwork at the park. He charges $60 for a large painting and $30 for a small painting. Last month, he sold eight large paintings and four small paintings. This month, he has doubled his sales. Meanwhile, Michael, another painter, charges $100 for a large painting and $80 for a small painting. At his last art show, he sold 5 large paintings and 8 small paintings. Who earned more from their painting sales this month, Noah or Michael?"
The conditional fusion requires solving both problems independently and then comparing their results to determine the final answer.
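Solving each sub-problem independently and then comparing (our own check):

```python
def revenue(large_price, small_price, n_large, n_small):
    """Total earnings from selling large and small paintings."""
    return large_price * n_large + small_price * n_small

noah_last_month = revenue(60, 30, 8, 4)   # $600
noah_this_month = 2 * noah_last_month     # sales doubled -> $1200
michael = revenue(100, 80, 5, 8)          # $1140

winner = "Noah" if noah_this_month > michael else "Michael"
print(winner)  # Noah
```

Noah's doubled revenue ($1200) edges out Michael's ($1140), so the conditional branch selects Noah.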
Results and Performance
The empirical results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning capabilities while maintaining high data efficiency:
- MathFusion models consistently outperform standard fine-tuning across all base models and benchmarks
- On average, MathFusion achieves an 18.0 percentage point improvement in accuracy over traditional single-instruction approaches
- With Llama3-8B, sequential fusion achieved 21.3 and 12.5 percentage point improvements on MATH and GSM8K respectively
- When comparing with previous top-performing baselines (at equal data size of 60K samples), MathFusion outperformed models like MetaMath and DART-Math
A key strength of MathFusion is its data efficiency. The approach requires only 45K additional synthetic instructions (60K training examples in total), far fewer than the datasets used by other methods:
| Approach | Dataset Size | Relative Size vs. MathFusion |
|---|---|---|
| MMIQC | 2,294,000 | 38× larger |
| MathScaleQA | 2,021,000 | 34× larger |
| KPMath-Plus | 1,576,000 | 26× larger |
| Xwin-Math-V1.1 | 1,440,000 | 24× larger |
| DART-Math | 590,000 | 9.8× larger |
| MetaMathQA | 395,000 | 6.6× larger |
| Orca-Math | 200,000 | 3.3× larger |
| WizardMath | 96,000 | 1.6× larger |
| MathFusion | 60,000 | 1× |
When other approaches are restricted to the same 60K data points as MathFusion, they typically suffer significant performance drops:
| Model (Llama3-8B base) | Original Size | Average Performance | 60K Version Performance |
|---|---|---|---|
| MetaMath | 400K | 30.8% | 29.9% (-0.9) |
| MMIQC | 2.3M | 35.6% | 25.7% (-9.9) |
| RefAug | 30K | 24.7% | 25.3% (+0.6) |
| DART-Math | 590K | 39.7% | 37.6% (-2.1) |
| MathFusion | 60K | 39.0% | 39.0% (same) |
Comparison to Alternative Approaches
MathFusion represents a fundamentally different paradigm compared to existing approaches:
MetaMath (Yu et al., 2024)
Focuses on enhancing individual problems through rephrasing and varied reasoning paths, working at the instance level rather than exploiting relationships between problems.
MMIQC (Liu et al., 2024)
Relies heavily on data volume (2.3M examples) through iterative question composing and refinement of individual problems, without explicitly modeling dependencies between problems.
RefAug (Zhang et al., 2024)
Enhances mathematical reasoning by incorporating reflection mechanisms on a single problem's reasoning process, but does not create new problem structures or relationships.
DART-Math (Tong et al., 2024)
Focuses on problem difficulty rather than structure, using rejection sampling to identify challenging problems and generating multiple solutions per problem.
MathFusion differs from these approaches by:
- Relational Focus: Explicitly modeling relationships between problems through fusion strategies
- Structural Manipulation: Manipulating the underlying structure of problems rather than just varying difficulty or surface features
- Cross-Problem Learning: Enabling the model to learn cross-problem dependencies and patterns
- Data Efficiency: Achieving strong performance with far less data
- Complementary Approach: Working well in combination with methods like DART-Math, suggesting they leverage different aspects of mathematical reasoning
Methodological Rigor
The paper demonstrates strong methodological rigor through:
- Comprehensive Evaluation: Testing across six benchmarks including both in-domain (GSM8K, MATH) and out-of-domain datasets (CollegeMath, DeepMind-Mathematics, OlympiadBench-Math, TheoremQA)
- Multiple Base Models: Testing on both specialized math models (DeepSeekMath-7B) and general models (Mistral-7B, Llama3-8B)
- Ablation Studies: Analyzing the contribution of each fusion strategy, the effect of the teacher model, the relationship between data size and performance, and the impact of combining with other datasets
- Error Analysis: Evaluating the correctness of fused problems and finding that unreasonable problems (5.6% of cases) had little impact on model performance
- Diversity Analysis: Using t-SNE to visualize problem embeddings, confirming that MathFusion augmented problems are more evenly distributed in the embedding space compared to original problems
- Statistical Significance: Conducting t-tests to verify that performance improvements are statistically significant (p < 0.05)
Conclusion
MathFusion represents a significant advancement in enhancing mathematical reasoning capabilities in LLMs through its innovative problem fusion approach. By capturing the relational structures inherent in mathematical knowledge, it achieves substantial improvements while maintaining high data efficiency.
The approach is inspired by how humans learn mathematics—through understanding connections between concepts rather than solving isolated problems—which may explain its improved performance with reduced data requirements.
As mathematical reasoning continues to be a critical benchmark for assessing LLM capabilities, MathFusion provides a valuable framework for creating more sophisticated and interconnected mathematical training data that better captures the relational nature of mathematical knowledge.