RAST

Reasoning Activation in LLMs via Small Model Transfer

1University of Illinois Urbana-Champaign, 2University of Virginia, 3Rice University, 4GE HealthCare

TL;DR

This project introduces RAST, a simple yet effective inference-time strategy that activates reasoning capabilities in LLMs by transferring logit-level adjustments from smaller RL-tuned models, without requiring direct RL tuning of large models.

Our paper begins with a key hypothesis: the distribution shifts that RL induces to favor reasoning are largely invariant to model size, and can therefore be transferred across scales.

To test this hypothesis, we conduct a preliminary study in which the base LLM performs next-token prediction given prefixes generated by the RL-tuned model. We find a high path coverage rate, indicating that only a small set of tokens differ along the decoding path across model scales. Notably, these differing tokens reflect specific reasoning behaviors, including (i) branching out, (ii) backtracking, and (iii) self-verification.

Methodology

  • Motivation:
    • RL boosts reasoning but is costly at scale.
    • Key insight: RL shifts token distributions to favor reasoning, not add new knowledge.
    • Hypothesis: These shifts are model-size invariant and can be transferred.
    • Goal: Use a small RL model's logits to activate reasoning in a large base model without retraining — enabling efficient reasoning via RAST.
  • Reasoning Activation through Small RL Models:
    • Core idea: Amplify reasoning-relevant tokens (e.g., “instead”) while preserving base predictions for non-reasoning tokens (e.g., “of”).
    • Mechanism: At each decoding step, apply the logit delta between the small RL model (S_RL) and the small base model (S_base) to guide the large base model (M_base).
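The decoding mechanism above can be sketched in a few lines. This is a minimal illustration with toy logits, not the released implementation; the weighting knob `lam` and the softmax helper are assumptions for the sketch.

```python
import numpy as np

def rast_logits(logits_m_base, logits_s_rl, logits_s_base, lam=1.0):
    """Shift the large base model's logits by the small RL-vs-base delta.
    `lam` is a hypothetical weighting coefficient for the sketch."""
    delta = logits_s_rl - logits_s_base  # reasoning-induced shift
    return logits_m_base + lam * delta

def softmax(x):
    z = np.asarray(x, dtype=float) - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary: ["instead", "of", "the"].
# The small RL model boosts the reasoning-flavored token "instead",
# while leaving the non-reasoning tokens essentially untouched.
m_base = np.array([1.0, 2.0, 1.5])
s_rl   = np.array([2.5, 1.0, 1.0])
s_base = np.array([1.0, 1.0, 1.0])

combined = rast_logits(m_base, s_rl, s_base)
probs = softmax(combined)
# The base model alone would pick "of"; with the delta applied,
# "instead" becomes the top token, as the core idea describes.
```

Because the delta is zero wherever the two small models agree, the large model's predictions for non-reasoning tokens pass through essentially unchanged.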

Experimental Results

  • RAST enables scalable reasoning gains: It consistently improves pass@1 and recovery rates across model sizes without retraining, sometimes matching or surpassing RL models.
  • Stronger expert deltas yield greater improvement: Using ΔR from larger RL models (e.g., 14B) leads to better reasoning performance, showing that delta logits encode richer reasoning signals.
  • There is a trade-off between base model and delta compatibility: While stronger base models benefit more, excessively mismatched pairs (e.g., 32B base with ΔR from 7B) may hinder transfer effectiveness.

  • RAST increases solution diversity, as evidenced by consistent pass@k improvements with larger k, enabling broader exploration and higher chances of capturing correct answers.
  • It can surpass RL-trained models, achieving pass@k accuracy that matches or exceeds the ceiling performance, especially on complex benchmarks like AMC and MATH500.
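The pass@k numbers referenced above are conventionally computed with the standard unbiased estimator; a minimal sketch (generic, not RAST-specific code) shows why diversity pays off at larger k:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total (c of them correct), is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 are correct, pass@1 = 0.3,
# while pass@5 is much higher: more diverse solutions mean
# a larger chance that at least one sampled answer is right.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

This is the usual estimator used for code/math benchmarks; the paper's exact evaluation script may differ in details such as sample counts.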

Analysis

  • RAST guides the base model to follow more deliberate reasoning patterns, such as proposing, testing, and verifying solutions, unlike the base model, which often outputs linear, error-prone steps.
  • This shift is quantitatively supported by high KL divergence on reasoning-specific tokens (e.g., “check”), reflecting effective activation of reasoning behavior.
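The per-token KL comparison can be sketched as follows. The distributions below are hypothetical, constructed only to illustrate the pattern the analysis reports: large divergence at reasoning-specific tokens, small divergence elsewhere.

```python
import numpy as np

def softmax(x):
    z = np.asarray(x, dtype=float) - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete next-token distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token distributions at two positions.
# At a reasoning step, RAST moves mass onto a token like "check";
# at a routine step, the distribution is nearly unchanged.
base_at_reasoning = softmax([2.0, 1.0, 0.5])
rast_at_reasoning = softmax([0.5, 3.0, 0.5])   # mass shifted to "check"
base_at_routine   = softmax([2.0, 1.0, 0.5])
rast_at_routine   = softmax([2.1, 1.0, 0.5])   # nearly identical

kl_reasoning = kl(rast_at_reasoning, base_at_reasoning)
kl_routine   = kl(rast_at_routine, base_at_routine)
# kl_reasoning >> kl_routine, mirroring the reported token-level finding.
```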

  • Cosine similarity between delta logits from different model scales serves as a strong indicator of transferability. Higher similarity correlates with better recovery rates, suggesting more effective reasoning activation.
  • RAST achieves high reasoning performance while significantly reducing GPU memory and hardware requirements—up to 50% savings compared to full-scale RL—demonstrating strong efficiency without sacrificing recovery rates.
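The cosine-similarity diagnostic above is straightforward to compute; a small sketch with hypothetical delta-logit vectors (the specific numbers are illustrative, not from the paper):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two delta-logit vectors (RL minus base)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical deltas from two model scales over a tiny vocabulary slice.
# Deltas pointing in similar directions suggest the reasoning shift
# transfers well, consistent with the correlation noted above.
delta_small = np.array([1.5, -0.5, 0.0, 0.2])
delta_large = np.array([1.2, -0.4, 0.1, 0.1])
sim = cosine_similarity(delta_small, delta_large)
# sim close to 1.0 indicates well-aligned shifts across scales.
```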

  • RAST maintains strong performance across a wide range of decoding-time hyperparameters.
  • It is robust to variations in temperature (τ) and weight (λ), eliminating the need for heavy hyperparameter tuning.

BibTeX

@misc{ouyang2025rast,
      title={RAST: Reasoning Activation in LLMs via Small Model Transfer}, 
      author={Siru Ouyang and Xinyu Zhu and Zilin Xiao and Minhao Jiang and Yu Meng and Jiawei Han},
      year={2025},
      howpublished = {\url{https://github.com/ozyyshr/RAST}},
}