**`docs/source/features/paged-attention-ifb-scheduler.md`** (1 addition, 1 deletion)
@@ -135,7 +135,7 @@ Overall, the max batch size and max num tokens limits play a key role in determi
## Revisiting Paged Context Attention and Context Chunking
-[Previously](./useful-build-time-flags.md#paged-context-attention) we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.
+Previously we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.
The [visualization](#the-schedulers) of the TensorRT LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.
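
For context on the behavior this hunk describes, here is a minimal sketch of how context chunking might be enabled through the Python `LLM` API. The `enable_chunked_prefill` and `max_num_tokens` argument names and the model name are assumptions for illustration, not taken from this diff:

```python
from tensorrt_llm import LLM

# Sketch: enable chunked context (prefill) so a long prompt's context phase
# can be split across several scheduler iterations instead of having to fit
# into a single iteration's max-num-tokens budget.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed example model
    enable_chunked_prefill=True,  # assumed flag; relies on paged context attention
    max_num_tokens=8192,          # per-iteration token budget the chunks must fit in
)
```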
**`docs/source/features/sampling.md`** (5 additions, 5 deletions)
@@ -6,7 +6,7 @@ The PyTorch backend supports most of the sampling features that are supported on
To use the feature:
1. Enable the `enable_trtllm_sampler` option in the `LLM` class
-2. Pass a [`SamplingParams`](../../../../tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function
+2. Pass a [`SamplingParams`](source:tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function
The following example prepares two identical prompts which will give different results due to the sampling parameters chosen:
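
The example itself falls outside this hunk; a minimal sketch of what it could look like, assuming the model name and sampling values (and assuming `generate()` accepts a list of `SamplingParams` paired one-to-one with the prompts):

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed example model
    enable_trtllm_sampler=True,                  # step 1: enable the sampler
)

# Step 2: two identical prompts paired with different sampling parameters,
# so the same input yields different results.
prompts = ["The capital of France is"] * 2
sampling = [
    SamplingParams(temperature=0.0),              # effectively greedy
    SamplingParams(temperature=0.8, top_p=0.95),  # stochastic sampling
]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```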
@@ -74,7 +74,7 @@ The PyTorch backend supports guided decoding with the XGrammar and Low-level Gui
To enable guided decoding, you must:
1. Set the `guided_decoding_backend` parameter to `'xgrammar'` or `'llguidance'` in the `LLM` class
-2. Create a [`GuidedDecodingParams`](../../../../tensorrt_llm/sampling_params.py#L14) object with the desired format specification
+2. Create a [`GuidedDecodingParams`](source:tensorrt_llm/sampling_params.py#L14) object with the desired format specification
* Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
3. Pass the `GuidedDecodingParams` object to the `guided_decoding` parameter of the `SamplingParams` object
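
A minimal sketch tying the three steps together; the model name and JSON schema are illustrative assumptions, not taken from this diff:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

# Step 1: pick a guided-decoding backend.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed example model
    guided_decoding_backend="xgrammar",
)

# Step 2: describe the required output format. Here it is a JSON schema
# (illustrative), so `json` is the constructor parameter to use.
schema = '{"type": "object", "properties": {"answer": {"type": "string"}}}'
guided = GuidedDecodingParams(json=schema)

# Step 3: attach it to the SamplingParams passed to generate().
params = SamplingParams(guided_decoding=guided)
outputs = llm.generate(["Answer in JSON: what is 2 + 2?"], params)
print(outputs[0].outputs[0].text)
```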
-To use the C++ benchmark scripts under [benchmark/cpp](/benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:
+To use the C++ benchmark scripts under [benchmark/cpp](source:benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:
```bash
python3 ./scripts/build_wheel.py --benchmarks
```
@@ -207,6 +207,7 @@ relevant classes. The associated unit tests should also be consulted for underst
This feature will not be enabled when [`building only the C++ runtime`](#link-with-the-tensorrt-llm-c++-runtime).
+(link-with-the-tensorrt-llm-c++-runtime)=
#### Linking with the TensorRT LLM C++ Runtime
The `build_wheel.py` script will also compile the library containing the C++ runtime of TensorRT LLM. If Python support and `torch` modules are not required, the script provides the option `--cpp_only` which restricts the build to the C++ runtime only.
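
For example, restricting the build to the C++ runtime would look something like this (a sketch based only on the script and flag named above):

```bash
# Build only the TensorRT LLM C++ runtime, skipping Python bindings.
python3 ./scripts/build_wheel.py --cpp_only
```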