# Inference Time Compute Implementation in TensorRT-LLM

By NVIDIA TensorRT-LLM Team and UCSD Hao AI Lab

## Table of Contents

- [Inference-Time Compute Implementation in TensorRT-LLM (Part 1: Design and Implementation)](#inference-time-compute-implementation-in-tensorrt-llm)
- [Table of Content](#table-of-content)
- [Background and Motivation](#background-and-motivation)
- [Introduction for Scaffolding: A Framework for inference-time compute](#introduction-for-scaffolding)
- [Introduction for Dynasor](#dynasor-introduction)
- [Implement Dynasor-CoT in Scaffolding](#dynasor-cot-implement-in-scaffolding)
- [Implement Dynasor-CoT based Majority Voting in Scaffolding](#dynasor-cot-based-majority-vote-in-scaffolding)
- [Reference](#dynasor-reference)
- [Feature List on Scaffolding](#scaffolding-feature-list)
- [Future Work](#scaffolding-future-work)

Inference-time compute (also known as test-time scaling) is becoming increasingly important. Beyond simply increasing the length of the output, workflows such as best-of-N and MCTS (Monte Carlo Tree Search) are important means of obtaining better answers. Furthermore, most agentic and multi-agent workflows are logically similar to these inference-time compute methods, except that they use more complex tools and context engineering. However, conveniently defining these methods while achieving excellent inference performance has become a new challenge: good performance requires careful asynchronous scheduling, but writing asynchronous scheduling programs is not easy for algorithm engineers. Once external tools and token budget management are involved, the problem becomes even more complex.

LLM inference frameworks such as TensorRT-LLM, vLLM, and SGLang provide high-performance inference for generation models and reward models, but they only handle single-request inference. Popular agent frameworks such as LangChain and Dify focus on enabling users to develop agents as simply as possible, but precisely because of this, they may have difficulty implementing inference-time compute methods that require precise definition and development.

So we want to build a framework that supports users in exploring and deploying more inference-time compute methods. It should provide a modular infrastructure and fill the gap by balancing usability and performance for inference-time compute.

This is the call sequence diagram of `Scaffolding`:

Its two core interfaces are `generate()` and `process()`. `generate()` is the entry point invoked by `ScaffoldingLlm`. In the default implementation of `generate()`, it produces a `Task` and then invokes `process()`. `process()` is the most important part of every `Controller` class, as it defines the workflow of the inference-time compute method.
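
To make this control flow concrete, below is a minimal sketch of what a custom `Controller` might look like. It is based on the task fields that appear later in this post (`input_str`, `output_str`) plus assumed details (the import path, `GenerationTask.create_from_prompt`, `worker_tag`, and `max_tokens`), so treat it as an illustration rather than the authoritative API.

```python
from enum import Enum

# Assumed import path; the real module layout may differ.
from tensorrt_llm.scaffolding import Controller, GenerationTask


class ChunkedGenerationController(Controller):
    """Toy controller: generate chunk by chunk until an answer marker appears."""

    class WorkerTag(Enum):
        GENERATION = "generation"

    def __init__(self, chunk_size: int = 256, max_rounds: int = 8):
        super().__init__()
        self.chunk_size = chunk_size
        self.max_rounds = max_rounds

    def process(self, tasks, **kwargs):
        current_prompt = tasks[0].input_str
        for _ in range(self.max_rounds):
            gen_task = GenerationTask.create_from_prompt(current_prompt)
            gen_task.worker_tag = self.WorkerTag.GENERATION
            gen_task.max_tokens = self.chunk_size
            # Yielding hands the task list to ScaffoldingLlm, which schedules it on the
            # Worker mapped to this WorkerTag and resumes the generator once results are filled in.
            yield [gen_task]
            current_prompt += gen_task.output_str
            if "\\boxed{" in gen_task.output_str:  # toy stopping condition
                break
        tasks[0].output_str = current_prompt
```

Assuming this generator-style `process()`, all asynchronous scheduling stays inside `ScaffoldingLlm`, which matches the usability goal described above.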

Let's go into a specific subclass of `Controller` to see how `process()` is implemented.

Users need to first create instances of `Worker` and `Controller`, and map them by `WorkerTag` to construct the `ScaffoldingLlm` instance. Then call the generate interface of `ScaffoldingLlm` to get the final result.
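
A minimal end-to-end sketch of that setup, assuming a TensorRT-LLM backed worker; the model path is a placeholder, and the exact constructor arguments of `NativeGenerationController` and `TRTLLMWorker.init_with_new_llm` are illustrative assumptions:

```python
from tensorrt_llm.scaffolding import (NativeGenerationController, ScaffoldingLlm,
                                      TRTLLMWorker)

# Spin up a worker backed by a local model checkpoint (placeholder path).
worker = TRTLLMWorker.init_with_new_llm("/path/to/DeepSeek-R1-Distill-Qwen-7B",
                                        max_tokens=4096)

# Controller that defines the inference-time compute workflow.
controller = NativeGenerationController()

# Map the WorkerTag used by the controller to a concrete worker instance.
llm = ScaffoldingLlm(
    controller,
    {NativeGenerationController.WorkerTag.GENERATION: worker},
)

results = llm.generate(["What is 1 + 1?"])
print(results[0].outputs[0].text)
```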

`ScaffoldingLlm` also provides an async interface:

```python
async for result in llm.generate_async(prompt):
    print(">>>", result.outputs[0].text)
```

Therefore, an instance of `ScaffoldingLlm` supports concurrent execution of multiple requests.
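
For example, several prompts can be driven concurrently through one instance with ordinary `asyncio` code; the wrapper below is a sketch that relies only on the `generate_async` interface shown above:

```python
import asyncio

async def run_many(llm, prompts):
    async def run_one(prompt):
        text = None
        # All calls share the same ScaffoldingLlm instance, so the underlying
        # tasks from different requests can be scheduled concurrently.
        async for result in llm.generate_async(prompt):
            text = result.outputs[0].text
        return text

    return await asyncio.gather(*(run_one(p) for p in prompts))

# answers = asyncio.run(run_many(llm, ["question 1", "question 2"]))
```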

Let's summarize the overall implementation of `Scaffolding`. If users want to implement a new inference-time compute method, they can develop a new `Controller`; they can also call existing `Controllers` as sub-`Controller`s. If users want to implement a new backend, they can either create a new `Worker` or add a new `Task` handler to an existing `Worker`. As for `ScaffoldingLlm`, we have hidden many complex implementations, such as async scheduling, inside `ScaffoldingLlm`, so users do not need to modify its code.
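
As an illustration of the backend extension point, here is a toy `Worker` that handles `GenerationTask` by echoing the prompt back. The `task_handlers` mapping and the `TaskStatus` return value reflect our reading of the `Worker` interface and should be treated as assumptions, not the exact class contract.

```python
# Assumed import path and interface details; see the scaffolding sources for the exact contract.
from tensorrt_llm.scaffolding import GenerationTask, TaskStatus, Worker


class EchoWorker(Worker):
    """Toy backend that 'answers' a GenerationTask by echoing the prompt back."""

    async def generation_handler(self, task: GenerationTask) -> TaskStatus:
        # A real worker would call an inference engine or a remote API here.
        task.output_str = f"[echo] {task.input_str}"
        return TaskStatus.SUCCESS

    # Route each supported Task type to its handler method.
    task_handlers = {GenerationTask: generation_handler}
```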
## An Example: Implement Dynasor-CoT on Scaffolding

[Dynasor-CoT](https://arxiv.org/abs/2412.20993) is a certainty-based, training-free approach to accelerate Chain-of-Thought (CoT) inference. This chapter discusses how inference-time compute methods can be smoothly integrated into the TRT-LLM Scaffolding framework, using Dynasor-CoT as an example.

<p align="center"><sub><em>Figure 2. Demo of DeepSeek-R1-Distill-Qwen-7B achieving a 5.74x speedup compared to the baseline when using Dynasor-CoT on MATH500</em></sub></p>

### Introduction for Dynasor-CoT

#### Motivation of Dynasor-CoT

LLM reasoning is highly token-inefficient, often requiring far more tokens to achieve the same accuracy as non-reasoning models. A major source of this inefficiency is that reasoning models tend to **self-doubt**; they often reach the correct answer early but then engage in extended verification behaviors like double-checking and reassessment.

For instance, Figure 2 compares a traditional Qwen-7B model with a reasoning-focused, DeepSeek-distilled Qwen-7B model on a simple question. While the traditional model reaches its answer in 180 tokens, the reasoning model expends 1,000 tokens on iterative verification, despite having already found the correct answer at token 340. This represents a significant waste of tokens for diminishing returns on accuracy.

<p align="center"><sub><em>Figure 2. An example answer from a reasoning model (DeepSeek-distilled Qwen-2.5 7B) vs. a traditional model (Qwen-2.5 7B) on one of the problems in the MATH500 dataset.</em></sub></p>

#### The "Probe" technique

Dynasor-CoT uses a **"Probe-In-The-Middle"** (or "probe" for short) technique, which prompts reasoning models to output early-stage results during intermediate steps of reasoning. Imagine you're in a math exam working on a hard problem. When time is up, you're forced to write down your final answer, regardless of how confident you are.

More specifically, a probe is an extra generation request with an eliciting prompt appended to the intermediate reasoning tokens. One effective eliciting prompt is: `Oh, I suddenly got the answer to the whole problem, Final Answer: boxed{`. Figure 3 shows an analysis comparing the accuracy of directly asking versus probing the model. Taking AMC23 as an example, reasoning models frequently arrive at correct answers early (median: 830 tokens) but continue generating unnecessary tokens due to self-doubt (median: 2.7K tokens).

<p align="center"><sub><em>Figure 3. DeepSeek-R1's performance on AMC23 and AIME24 at varying token budgets. (Left) Standard reasoning with late answer outputs. (Right) Early answer extraction using the Probe-In-The-Middle technique, demonstrating equivalent accuracy with a 50% token reduction. The greener regions in the right panels suggest the model knows the answers much earlier than it reveals in standard reasoning.</em></sub></p>
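
In code, building a probe request is just a concatenation of the reasoning generated so far with the eliciting suffix. The helper below is a hypothetical illustration; the exact suffix, including the `\[ \boxed{` formatting, follows the snippets in this post:

```python
# Eliciting suffix that forces the model to commit to an answer immediately.
PROBE_SUFFIX = ("... Oh, I suddenly got the answer to the whole problem, "
                "Final Answer:\n\n\\[\n  \\boxed{")

def make_probe_prompt(partial_reasoning: str) -> str:
    """Build a probe request from the reasoning generated so far."""
    # The probe shares its full prefix with the main generation, so with prefix
    # kvcache reuse (discussed later) it is cheap to evaluate.
    return partial_reasoning + PROBE_SUFFIX
```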

Figure 4 provides an illustration:

* **Case 1**: All three probe requests yield the same answer, "3159", indicating high certainty. The process can exit early.

* **Case 2**: Early-stage answers are inconsistent, indicating low confidence, so generation continues.

* **Case 3**: The model generates special tokens such as "wait" or "hmm," signaling hesitation; generation continues. (A sketch of this decision rule appears after the figure below.)

<p align="center"><sub><em>Figure 4. Illustration of Dynasor-CoT. Case 1: early exit due to consistent early-stage results. Case 2: continue generation due to inconsistent early-stage results. Case 3: responses containing hesitation words (e.g., wait) are discarded.</em></sub></p>
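
These three cases amount to a small decision rule over recent probe answers. Below is a minimal sketch, assuming a window of three probes and the hesitation markers mentioned above; the exact markers and window size used by the real implementation may differ.

```python
HESITATION_MARKERS = ("wait", "hmm")

def is_certain(probe_answers: list[str], window: int = 3) -> bool:
    """Return True when the last `window` probe answers agree and show no hesitation."""
    if len(probe_answers) < window:
        return False
    recent = probe_answers[-window:]
    # Case 3: hesitation tokens in a probe answer mean it should be discarded.
    if any(marker in answer.lower() for answer in recent for marker in HESITATION_MARKERS):
        return False
    # Case 1 vs. Case 2: exit early only if all recent answers are identical.
    return all(answer == recent[0] for answer in recent)
```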
### Implement Dynasor-CoT in Scaffolding

A key difference between inference-time compute methods like Dynasor-CoT and a normal LLM generation request is that the generation process can consist of multiple smaller, user-defined tasks. The results of these tasks can dynamically control the overall logic—for example, by determining whether to expand the scope of subsequent generation or to terminate the process entirely. In a single Dynasor-CoT request, generation proceeds chunk by chunk, with additional "probe" tasks running in parallel with the main generation. Once a consistent answer is formed across recent probes, the process terminates early.

In the following `for` loop, each iteration performs these steps:

```python
# Iterate over generation rounds until the maximum tokens limit is reached.
for _ in range(initial_prompt_token_num + probe_suffix_token_num,
               self.max_tokens, self.chunk_size):
    proposer_task.input_str = current_prompt
    # ... (the proposer and probe tasks are executed here and the recent probe
    #      answers are checked for consistency; if they agree, return early) ...
            probe_answers[-1] + "}\n\\]")
        return

    # If the answer is not deemed confident, perform another round of generation.
    # Append the newly generated text from the proposer to the current prompt for the next iteration.
    current_prompt += proposer_task.output_str

# ...
tasks[0].output_str = current_prompt
return
```

The `probe_task` can utilize prefix kvcache reuse to enhance inference performance. TensorRT-LLM enables the kvcache of an in-progress request to be reused by other requests, so `probe_task` can reuse `proposer_task`'s kvcache even though `proposer_task` is still running.
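
Prefix reuse is controlled through the TensorRT-LLM LLM API's KV cache configuration. A sketch of enabling it is below; whether and how `TRTLLMWorker` forwards these options is an assumption here, so check the worker's constructor for the exact knob.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Enable KV cache block reuse so probe requests can share the proposer's cached prefix.
llm_engine = LLM(
    model="/path/to/DeepSeek-R1-Distill-Qwen-7B",  # placeholder path
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)
```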

Now we have implemented a `Controller` for Dynasor-CoT. Here is an example of how to use it:

```python
llm = ScaffoldingLlm(
    # ... (controller and worker mapping omitted in this excerpt) ...
)
results = llm.generate(prompts)
```

### Reference

[1] Y. Fu*, J. Chen*, Y. Zhuang, Z. Fu, I. Stoica, and H. Zhang, "Dynasor: More Efficient Chain-of-Thought Through Certainty Probing," Hao-AI-Lab Blog, Feb. 16, 2025. [Online]. Available: https://hao-ai-lab.github.io/blogs/dynasor-cot/
## Feature List on Scaffolding

You can customize your own `Controller`, `Worker`, and `Task`; however, we have already provided a foundational set of the most commonly used ones:

`Worker`: TensorRT-LLM, OpenaiAPI, MCP;

`Task`: Generation, Reward, ToolCall;

The future work is divided into two parts.

The first part is to enable `Scaffolding` to support more inference-time compute methods, especially methods for agentic and multi-agent systems.

The second part is that we hope to find more opportunities to optimize TensorRT-LLM based on `Scaffolding` workloads. For example, in terms of kvcache prefix reuse, `Scaffolding` can identify which parts are system prompts, which parts are likely to be reused in subsequent requests of the agent task, and which parts cannot be reused and can be evicted immediately.

Finally, what we want to emphasize is that we welcome and look forward to more people joining our open source community. You can find these issues in the [TensorRT-LLM GitHub issues with the Scaffolding tag](https://github.com/NVIDIA/TensorRT-LLM/issues?q=state%3Aopen%20label%3AScaffolding).