Yufan Zhuang, Liyuan Liu, Dinghuai Zhang, Chandan Singh, Yelong Shen, Jingbo Shang, Jianfeng Gao

First published on Oct 20, 2025 | GitHub: EvanZhuang/knowledge_flow

<aside>

TL;DR

We explore how Knowledge Flow—iteratively updating a knowledge list between LLM rollouts—overcomes the context limit of LLMs in test-time scaling. This iterative refinement process mirrors human deliberation: the model progressively distills insights from attempts, refining a knowledge list, thereby empowering subsequent attempts.

Knowledge Flow enables both gpt-oss-120b and Qwen3-235B-A22B-Thinking to achieve 100% accuracy on AIME25 without any training/tools/external feedback.

</aside>

Figure 1: We propose Knowledge Flow, a training-free paradigm for test-time scaling that transforms reasoning from a set of isolated traces into a process of iterative knowledge evolution. By treating reasoning as a diffusion-like refinement over a dynamic knowledge list that the model continually revises, it achieves perfect accuracy on AIME25 without relying on external tools or feedback.

Test-Time Scaling Beyond the Context Limit

Allocating more compute at test time is a powerful way to enhance model performance. In Auto-Regressive (AR) LLMs, this scaling is closely tied to the context length. The community has made remarkable progress in expanding context windows, enabling models to tackle more complex tasks.

This approach reaches its limits with grand-challenge problems. For instance, a long runtime like Gemini's 11.3-hour solution to the ICPC competition implies a degree of test-time scaling that a fixed context window cannot sustain. Here, we explore computation scaling beyond this boundary using Knowledge Flow.

Knowledge Flow

As in Figure 1, Knowledge Flow instructs the model (via prompting) to edit a knowledge list (initially empty) between rollouts, enabling reasoning beyond the context limit.

Figure 2 gives a more concrete example: at each iteration, the current knowledge list is passed to the model as part of the prompt. Then, in addition to answering the question, the model is instructed by the prompt to edit the knowledge list: adding one new knowledge item, dropping one old item, or doing both.
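
A minimal sketch of this loop in Python may help. The prompt template, the `ADD:`/`DROP:` edit syntax, and the `llm` callable below are illustrative assumptions on our part, not the exact prompt; see EvanZhuang/knowledge_flow for the actual implementation.

```python
# Hypothetical sketch of Knowledge Flow. `llm` is any chat-completion
# callable; the prompt template and the ADD:/DROP: edit syntax are
# illustrative, not the repo's exact prompt.

def knowledge_flow(llm, question: str, num_rollouts: int = 8) -> list[str]:
    knowledge: list[str] = []  # the knowledge list starts empty
    answers: list[str] = []
    for _ in range(num_rollouts):
        listing = "\n".join(f"{i}. {k}" for i, k in enumerate(knowledge)) or "(empty)"
        prompt = (
            f"Question: {question}\n\n"
            f"Knowledge list:\n{listing}\n\n"
            "Solve the question. Then edit the knowledge list: you may output\n"
            "'ADD: <new item>' and/or 'DROP: <index>'. End with 'ANSWER: <final>'."
        )
        reply = llm(prompt)  # a fresh rollout; only the knowledge list persists
        for line in reply.splitlines():
            if line.startswith("ADD: "):
                knowledge.append(line[len("ADD: "):].strip())
            elif line.startswith("DROP: "):
                idx = int(line[len("DROP: "):].strip())
                if 0 <= idx < len(knowledge):
                    knowledge.pop(idx)
            elif line.startswith("ANSWER: "):
                answers.append(line[len("ANSWER: "):].strip())
    return answers  # e.g., keep the last answer or majority-vote over rollouts
```

The key design point is that only the knowledge list crosses rollout boundaries: each rollout starts from a fresh context, so total compute scales with the number of rollouts rather than with a single context window.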

As in Figure 1, this recursive denoising mechanism guides the knowledge list from an imprecise or partial state toward a deeper and more complete understanding, scaling both gpt-oss-120b and Qwen3-235B-A22B-Thinking to reach 100% accuracy on AIME25.

Figure 2: Knowledge Flow maintains a knowledge list; at each step the model can edit it by adding one new knowledge item, dropping one old item, or doing both. In this example, the model added a new item in the first round (“The second circle intersection…”), then made a drop in the second round.

Connection to Diffusion/Flow Matching

Knowledge Flow mirrors the iterative refinement of diffusion/flow-matching model inference (Dieleman, 2025). Unlike AR models, diffusion/flow-matching models can in principle perform inference within a fixed context length, so their test-time scaling potential is not restricted by long-context capacity. Knowledge Flow simulates this refinement process with AR models by allowing knowledge items to be inserted and deleted; it therefore requires no training and lets us experiment with SOTA models instantly.
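
One way to make the analogy explicit, in notation of our own rather than the post's: the knowledge list $K_t$ plays the role of the noisy iterate $x_t$ in a denoising process, and each rollout applies one refinement step proposed by the frozen model $p_\theta$,

$$
K_{t+1} = \bigl(K_t \setminus \{d_t\}\bigr) \cup \{a_t\}, \qquad (a_t, d_t) \sim p_\theta\left(\cdot \mid q,\ K_t\right),
$$

where $q$ is the question, $a_t$ is the (optional) added item, and $d_t$ the (optional) dropped item. Iterating this map drives $K_t$ from a noisy, partial state toward a refined one, much as diffusion sampling drives $x_t$ toward the data distribution, while the context consumed per step stays bounded by the size of $K_t$ plus one rollout.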

Connection to Parallel Thinking

As shown in Figure 4, parallel thinking approaches for large language models rely primarily on generating multiple independent reasoning paths and selecting among them through majority voting. Unlike in Knowledge Flow, each reasoning trace operates in isolation: the static aggregation mechanism cannot facilitate knowledge exchange across attempts.
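
For contrast, here is a minimal sketch of that baseline (self-consistency-style majority voting), reusing the hypothetical `llm` callable and `ANSWER:` convention from the sketch above. No state is carried between rollouts.

```python
from collections import Counter

def extract_answer(reply: str) -> str:
    # Hypothetical parser: assumes the trace ends with 'ANSWER: <value>'.
    for line in reversed(reply.splitlines()):
        if line.startswith("ANSWER: "):
            return line[len("ANSWER: "):].strip()
    return reply.strip()

def parallel_thinking(llm, question: str, n: int = 8) -> str:
    # n independent rollouts: unlike Knowledge Flow, nothing persists between them.
    prompt = f"Question: {question}\nSolve step by step, end with 'ANSWER: <value>'."
    answers = [extract_answer(llm(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # static majority vote
```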

Figure 4: Parallel Thinking generates multiple independent reasoning paths, with no information exchange or knowledge refinements.

Recent work in parallel thinking also aggregates information across reasoning traces, either by training an aggregation model (Zhao et al., 2025) or a cross-attention module (Dong et al., 2025), or by employing a recurrent-depth language model with diffusion-forcing sampling (Geiping et al., 2025). Knowledge Flow, in contrast, requires no training and works with existing models off the shelf.

Main Results

We conduct experiments on AIME25 and AIME24 with gpt-oss-120b and Qwen3-235B-A22B-Thinking-2507, using pure text reasoning with no tool use and no external feedback.