
Commit 550062f

committed
resolve some issues
Signed-off-by: Fred Wei <[email protected]>
1 parent c96dba4 commit 550062f

File tree

1 file changed: +29 -29 lines changed


docs/source/blogs/tech_blog/blog12_Inference_Time_Compute_Implementation_in_TensorRT-LLM.md

Lines changed: 29 additions & 29 deletions
Original file line number | Diff line number | Diff line change
@@ -1,9 +1,9 @@
1-
# Inference Time Compute Implementation in TensorRT-LLM
1+
# Inference Time Compute Implementation in TensorRT LLM
22

3-
By NVIDIA TensorRT-LLM Team
3+
By NVIDIA TensorRT LLM Team and UCSD Hao AI Lab
44

55
## Table of Contents
6-
- [Inference-Time Compute Implementation in TensorRT-LLM (Part 1: Design and Implementation](#inference-time-compute-implementation-in-tensorrt-llm)
6+
- [Inference-Time Compute Implementation in TensorRT LLM (Part 1: Design and Implementation)](#inference-time-compute-implementation-in-tensorrt-llm)
77
- [Table of Contents](#table-of-contents)
88
- [Background and Motivation](#background-and-motivation)
99
- [Introduction for Scaffolding: A Framework for inference-time compute](#introduction-for-scaffolding)
@@ -16,6 +16,7 @@ By NVIDIA TensorRT-LLM Team
1616
- [Introduction for Dynasor](#dynasor-introduction)
1717
- [Implement Dynasor-CoT in Scaffolding](#dynasor-cot-implement-in-scaffolding)
1818
- [Implement Dynasor-CoT based Majority Voting in Scaffolding](#dynasor-cot-based-majority-vote-in-scaffolding)
19+
- [Reference](#dynasor-reference)
1920
- [Feature List on Scaffolding](#scaffolding-feature-list)
2021
- [Future Work](#scaffolding-future-work)
2122

@@ -24,7 +25,7 @@ By NVIDIA TensorRT-LLM Team
2425
Inference-time compute (aka test-time scaling) is becoming increasingly important. Beyond simply increasing the length of the output, workflows such as best-of-N and MCTS (Monte Carlo Tree Search) are important means of obtaining better answers. Further, most agentic or multi-agent workflows are logically similar to these inference-time compute methods, except that they use more complex tools and context engineering. However, conveniently defining these methods while achieving excellent inference performance has become a new problem: good performance requires careful asynchronous scheduling, and writing asynchronous scheduling programs is not easy for algorithm engineers. Once external tools and token budget management are involved, the problem becomes even more complex.
2526

2627

27-
LLM inference frameworks such as TensorRT-LLM,vLLM and SGLang provide high performance for inference of generation models or reward models, but they are only for single request inference. Popular Agent frameworks such as LangChain and Dify focus on enabling users to develop agents as simply as possible. But precisely because of this, they may have difficulty completing many inference-time compute methods that require precise definition and developments.
28+
LLM inference frameworks such as TensorRT LLM, vLLM, and SGLang provide high-performance inference for generation or reward models, but they only handle single-request inference. Popular agent frameworks such as LangChain and Dify focus on enabling users to develop agents as simply as possible, but precisely because of this, they may have difficulty supporting many inference-time compute methods that require precise definition and development.
2829

2930

3031
So we want to build a framework that supports users in exploring and deploying more inference-time compute methods. It should provide a modular infrastructure and fill the gap between usability and performance for inference-time compute.
@@ -55,7 +56,7 @@ Provides sufficient concurrency to achieve good performance while ease of use. C
5556

5657
This is the call sequence diagram of `Scaffolding`:
5758
<div align="center">
58-
<img src="../media/tech_blog12_scaffolding_sequence.png" alt="Scaffolding Sequence" width="900px">
59+
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog12_scaffolding_sequence.png" alt="Scaffolding Sequence" width="900px">
5960
</div>
6061
<p align="center"><sub><em>Figure 1. Scaffolding Sequence</em></sub></p>
6162

@@ -100,7 +101,7 @@ class Controller(ABC):
100101
def process(self, tasks: List[Task], **kwargs):
101102
raise NotImplementedError
102103
```
103-
Its two core interfaces are `generate()` and `process()`. `generate()` is the entry point for `ScaffoldingLlm` to invoke. In the default implementation of `generte()`, it produces a `Task` and then invokes `process()`. The `process()` is the most important part of every `Contronller` class, as it defines the implementation the workflow of this inference-time compute method.
104+
Its two core interfaces are `generate()` and `process()`. `generate()` is the entry point for `ScaffoldingLlm` to invoke. In the default implementation of `generate()`, it produces a `Task` and then invokes `process()`. `process()` is the most important part of every `Controller` class, as it defines the workflow of this inference-time compute method.
104105

105106

106107
Let's go into a specific subclass of `Controller` to see how `process()` is implemented.
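As a minimal sketch of such a subclass (assuming `process()` is driven as a generator that yields lists of `Task`s for `ScaffoldingLlm` to execute and fill in, and that tasks expose `input_str`/`output_str` as in the Dynasor-CoT walkthrough later in this post; the class name and the best-of-N selection rule are purely illustrative, not part of the library):

```python
# Illustrative sketch only; names and the yield-driven protocol are assumptions.
from copy import deepcopy
from typing import List


class NaiveBestOfNController(Controller):
    def __init__(self, sample_num: int = 4):
        self.sample_num = sample_num

    def process(self, tasks: List[Task], **kwargs):
        # Fan out: run the same generation task several times.
        candidates = [deepcopy(tasks[0]) for _ in range(self.sample_num)]
        yield candidates  # the framework executes these, possibly concurrently
        # Fan in: keep one candidate; here, simply the longest answer.
        best = max(candidates, key=lambda t: len(t.output_str or ""))
        tasks[0].output_str = best.output_str
```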
@@ -171,44 +172,44 @@ results = llm.generate(prompts)
171172
Users first create instances of `Worker` and `Controller` and map them by `WorkerTag` to construct a `ScaffoldingLlm` instance, then call the generate interface of `ScaffoldingLlm` to get the final result.
172173

173174

174-
`ScaffoldingLlm` also provides async inferface.
175+
`ScaffoldingLlm` also provides an async interface.
175176
```python
176177
async for result in llm.generate_async(prompt):
177178
print(">>>", result.outputs[0].text)
178179
```
179-
So an instance of `ScaffoldingLlm` can support the concurrent execution of multiple requests.
180+
Therefore, an instance of `ScaffoldingLlm` supports concurrent execution of multiple requests.
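As a hedged illustration of that concurrency (assuming `llm` is a `ScaffoldingLlm` constructed as above and that `generate_async()` can be consumed from several coroutines at once), multiple prompts could be driven with `asyncio`:

```python
import asyncio

async def run_one(llm, prompt):
    final_text = None
    async for result in llm.generate_async(prompt):
        final_text = result.outputs[0].text  # keep the latest (final) output
    return final_text

async def run_many(llm, prompts):
    # Issue all requests concurrently; Scaffolding handles the async scheduling.
    return await asyncio.gather(*(run_one(llm, p) for p in prompts))

# answers = asyncio.run(run_many(llm, ["question 1", "question 2"]))
```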
180181

181182

182183
Let's summarize the overall implementation of `Scaffolding`. To implement a new inference-time compute method, users develop a new `Controller`, which can also invoke existing `Controllers` as sub-Controllers. To implement a new backend, users either create a new `Worker` or add a new `Task` handler to an existing `Worker`. As for `ScaffoldingLlm`, we have hidden the complex machinery, such as async scheduling, inside `ScaffoldingLlm`, and users do not need to modify its code.
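To make the backend extension point concrete, here is a loose sketch of a custom `Worker`. The real `Worker` interface is not shown here, so the handler-registration pattern, the `GenerationTask` name, and the async handler signature below are assumptions for illustration only; check the Scaffolding source for the actual API.

```python
# Hypothetical example backend; all names and the registration pattern are
# assumed for illustration and may differ from the real Scaffolding API.
class EchoWorker(Worker):
    async def handle_generation(self, task):
        # A trivial "backend": echo the prompt back as the output.
        task.output_str = f"[echo] {task.input_str}"

    # Assumed mechanism: map Task types to handler coroutines.
    task_handlers = {GenerationTask: handle_generation}
```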
183184

184185

185186
## An Example: Implement Dynasor-CoT on Scaffolding
186-
Dynasor-CoT is a certainty-based, training-free approach to accelerate Chain-of-Thought (CoT) inference. This chapter discusses how inference-time compute methods can be smoothly integrated into the TRT-LLM Scaffolding framework, using Dynasor-CoT as an example.
187+
[Dynasor-CoT](https://arxiv.org/abs/2412.20993) is a certainty-based, training-free approach to accelerate Chain-of-Thought (CoT) inference. This section discusses how inference-time compute methods can be smoothly integrated into the TensorRT LLM Scaffolding framework, using Dynasor-CoT as an example.
187188

188189
<div align="center">
189-
<img src="../media/tech_blog12_dynasor_demo.gif" alt="Dynasor Demo" width="900px">
190+
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog12_dynasor_demo.gif" alt="Dynasor Demo" width="900px">
190191
</div>
191192
<p align="center"><sub><em>Figure 2. Demo of DeepSeek-R1-Distill-Qwen-7B achieving a 5.74x speedup compared to the baseline when using Dynasor-CoT on MATH500</em></sub></p>
192193

193-
### Introducation for Dynasor-CoT
194+
### Introduction for Dynasor-CoT
194195
#### Motivation of Dynasor-CoT
195196
LLM reasoning is highly token-inefficient, often requiring far more tokens to achieve the same accuracy as non-reasoning models. A major source of this inefficiency is that reasoning models tend to **self-doubt**; they often reach the correct answer early but then engage in extended verification behaviors like double-checking and reassessment.
196197

197198
For instance, Figure 2 compares a traditional Qwen-7B model with a reasoning-focused, Deepseek-distilled Qwen-7B model on a simple question. While the traditional model reaches its answer in 180 tokens, the reasoning model expends 1,000 tokens on iterative verification, despite having already found the correct answer at token 340. This represents a significant waste of tokens for diminishing returns on accuracy.
198199

199200
<div align="center">
200-
<img src="../media/tech_blog12_dynasor_hesitation.png" alt="Motivation" width="900px">
201+
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog12_dynasor_hesitation.png" alt="Motivation" width="900px">
201202
</div>
202203
<p align="center"><sub><em>Figure 2. An example answer from reasoning model (Deepseek-distilled Qwen-2.5 7B) vs traditional model (Qwen-2.5 7B) on one of the problem in MATH500 dataset.</em></sub></p>
203204

204205
#### The "Probe" technique
205-
Dynasor-CoT uses a **"Probe-In-The-Middle"** (or "probe" for short) technique to force reasoning models to output their early-stage results based on their unfinished reasoning. Imagine you're in a math exam working on a hard problem. When time is up, you're forced to write down your final answer, regardless of how confident you are.
206+
Dynasor-CoT uses a **"Probe-In-The-Middle"** (or "probe" for short) technique, which prompts reasoning models to output early-stage results during intermediate steps of reasoning. Imagine you're in a math exam working on a hard problem. When time is up, you're forced to write down your final answer, regardless of how confident you are.
206207

207208
More specifically, a probe is an extra generation request with an eliciting prompt appended to the intermediate reasoning tokens. One effective eliciting prompt is: `Oh, I suddenly got the answer to the whole problem, Final Answer: boxed{`. Figure 3 shows an analysis comparing the accuracy of directly asking versus probing the model. Taking AMC23 as an example, reasoning models frequently arrive at correct answers early (median: 830 tokens) but continue generating unnecessary tokens due to self-doubt (median: 2.7K tokens).
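As a small illustration of the idea (the helper and variable names here are ours, not the implementation's; the real suffix handling appears in the Scaffolding walkthrough below):

```python
# Sketch only: append the eliciting prompt quoted above to the reasoning
# generated so far, forcing the model to commit to an early answer.
PROBE_SUFFIX = "Oh, I suddenly got the answer to the whole problem, Final Answer: boxed{"

def make_probe_prompt(original_prompt: str, partial_reasoning: str) -> str:
    return original_prompt + partial_reasoning + PROBE_SUFFIX
```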
208209

209210

210211
<div align="center">
211-
<img src="../media/tech_blog12_dynasor_pressure_testing.png" alt="Dynasor Demo" width="900px">
212+
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog12_dynasor_pressure_testing.png" alt="Dynasor Demo" width="900px">
212213
</div>
213214
<p align="center"><sub><em>Figure 3. DeepSeek-R1's performance on AMC23 and AIME24 at varying token budgets. (Left) Standard reasoning with late answer outputs. (Right) Early answer extraction using the Probe-In-The-Middle technique, demonstrating equivalent accuracy with a 50% token reduction. The greener regions in the right panels suggest the model knows the answers much earlier than it reveals in standard reasoning.</em></sub></p>
214215

@@ -217,16 +218,16 @@ Instead of generating a fixed number of tokens or waiting for a stop token, Dyna
217218

218219
Figure 4 provides an illustration:
219220

220-
* **Case 1**: All three probe requests lead to the same answer, "3159." We can assume this is the final answer with high certainty and exit early.
221+
* **Case 1**: All three probe requests yield the same answer, "3159," indicating high certainty, so the process can exit early (a compact sketch of this check follows Figure 4).
221222

222-
* **Case 2**: The early-stage answers are inconsistent, which indicates low confidence, so we continue generation.
223+
* **Case 2**: Early-stage answers are inconsistent, indicating low confidence, so generation continues.
223224

224-
* **Case 3**: The model generates special tokens like "wait" or "hmm," which also indicate hesitation, so we continue the generation.
225+
* **Case 3**: The model generates special tokens such as "wait" or "hmm," signaling hesitation; generation continues.
225226

226227
<div align="center">
227-
<img src="../media/tech_blog12_dynasor_illustration.jpg" alt="Dynasor Illustration" width="900px">
228+
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog12_dynasor_illustration.jpg" alt="Dynasor Illustration" width="900px">
228229
</div>
229-
<p align="center"><sub><em>Figure 4. Illustration of Dynasor-CoT. Case 1: early exit due to consistent early-stage results. Case 2: continue generation due to inconsistent early-stage results. Case 3: responses containing hesitation words (e.g., wait) are disgarded.</em></sub></p>
230+
<p align="center"><sub><em>Figure 4. Illustration of Dynasor-CoT. Case 1: early exit due to consistent early-stage results. Case 2: continue generation due to inconsistent early-stage results. Case 3: responses containing hesitation words (e.g., wait) are discarded.</em></sub></p>
230231

231232
### Implement Dynasor-CoT in Scaffolding
232233
A key difference between inference-time compute methods like Dynasor-CoT and a normal LLM generation request is that the generation process can consist of multiple smaller, user-defined tasks. The results of these tasks can dynamically control the overall logic—for example, by determining whether to expand the scope of subsequent generation or to terminate the process entirely. In a single Dynasor-CoT request, generation proceeds chunk by chunk, with additional "probe" tasks running in parallel with the main generation. Once a consistent answer is formed across recent probes, the process terminates early.
@@ -320,8 +321,6 @@ In the following `for` loop, each iteration performs these steps:
320321

321322
```python
322323
# Iterate over generation rounds until the maximum tokens limit is reached.
323-
# Make sure length of prefilling is always smaller than the max_tokens in TRTLLMWorker.init_with_new_llm
324-
# Otherwise it will through an assertion fail, stated in issue #3576
325324
for _ in range(initial_prompt_token_num + probe_suffix_token_num,
326325
self.max_tokens, self.chunk_size):
327326
proposer_task.input_str = current_prompt
@@ -352,7 +351,7 @@ In the following `for` loop, each iteration performs these steps:
352351
probe_answers[-1] + "}\n\\]")
353352
return
354353

355-
# If not confident, do another round of generation
354+
# If the answer is not deemed confident, perform another round of generation.
356355
# Append the newly generated text from the proposer to the current prompt for the next iteration.
357356
current_prompt += proposer_task.output_str
358357

@@ -362,7 +361,7 @@ In the following `for` loop, each iteration performs these steps:
362361
tasks[0].output_str = current_prompt
363362
return
364363
```
365-
The `probe_task` can utilize prefix kvcache reuse to enhance inference performance. TensorRT-LLM enables the kvcache of an in-progress request to be reused by other requests, so `probe_task` can `proposer_task`'s kvcache even though the `proposer_task` is in a continuous running state.
364+
The `probe_task` can utilize prefix kvcache reuse to enhance inference performance. TensorRT LLM enables the kvcache of an in-progress request to be reused by other requests, so `probe_task` can reuse `proposer_task`'s kvcache even though `proposer_task` is still running.
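As a hedged note, prefix reuse of this kind generally relies on kv cache block reuse being enabled when the underlying LLM is created. The `KvCacheConfig` and `enable_block_reuse` names below come from the public TensorRT LLM LLM API; wiring them through the Scaffolding `TRTLLMWorker` is left as an assumption here.

```python
# Sketch: enable kv cache block reuse so probe tasks can share the proposer's prefix.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)
```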
366365

367366
Now we have implemented a `Controller` for Dynasor-CoT. Here is an example of how to use it:
368367
```python
@@ -398,12 +397,15 @@ llm = ScaffoldingLlm(
398397
results = llm.generate(prompts)
399398
```
400399

400+
### Reference
401+
[1] Y. Fu*, J. Chen*, Y. Zhuang, Z. Fu, I. Stoica, and H. Zhang, "Dynasor: More Efficient Chain-of-Thought Through Certainty Probing," Hao-AI-Lab Blog, Feb. 16, 2025. [Online]. Available: https://hao-ai-lab.github.io/blogs/dynasor-cot/
402+
401403

402404
## Feature List on Scaffolding
403-
Although users can customize their own `Controller`, `Worker` and `Task`, we have still implemented a series of the most used ones as the foundation.
405+
You can customize your own `Controller`, `Worker`, and `Task`; however, we have also provided a foundational set of the most commonly used ones.
404406

405407

406-
`Worker`: TensorRT-LLM, OpenaiAPI, MCP;
408+
`Worker`: TensorRT LLM, OpenaiAPI, MCP;
407409

408410

409411
`Task`: Generation, Reward, ToolCall;
@@ -419,9 +421,7 @@ The future work is divided into two parts.
419421
The first part is to enable `Scaffolding` to support more inference-time compute methods, especially methods for agentic and multi-agent systems.
420422

421423

422-
The second part is that we hope to find more opportunities to optimize TensorRT-LLM based on `Scaffolding` workloads. For examples, in terms of kvcache prefix reuse, `Scaffolding` can identify which parts are system prompts, which parts are likely to be reused in the subsequent requests of the agent task, and which parts cannot be reused and can be evicted immediately.
423-
424-
425-
Finally, what we want to emphasize is that we welcome and look forward to more people joining our open source community. You can find these issues in the [TensorRT-LLM GitHub issues with Scaffolding tag](https://github.com/NVIDIA/TensorRT-LLM/issues?q=state%3Aopen%20label%3AScaffolding).
424+
The second part is that we hope to find more opportunities to optimize TensorRT LLM based on `Scaffolding` workloads. For example, with kvcache prefix reuse, `Scaffolding` can identify which parts are system prompts, which parts are likely to be reused by subsequent requests of the agent task, and which parts cannot be reused and can be evicted immediately.
426425

427426

427+
Finally, we want to emphasize that we welcome and look forward to more people joining our open-source community. You can find open tasks in the [TensorRT LLM GitHub issues with the Scaffolding tag](https://github.com/NVIDIA/TensorRT-LLM/issues?q=state%3Aopen%20label%3AScaffolding).
