5 changes: 2 additions & 3 deletions docs/source/conf.py
@@ -16,10 +16,9 @@

sys.path.insert(0, os.path.abspath('.'))

-project = 'TensorRT-LLM'
+project = 'TensorRT LLM'
copyright = '2025, NVidia'
author = 'NVidia'
-branch_name = pygit2.Repository('.').head.shorthand
html_show_sphinx = False

# Get the git commit hash
@@ -78,7 +77,7 @@
"https":
None,
"source":
"https://github.com/NVIDIA/TensorRT-LLM/tree/" + branch_name + "/{{path}}",
"https://github.com/NVIDIA/TensorRT-LLM/tree/" + commit_hash + "/{{path}}",
}

myst_heading_anchors = 4
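
The comment `# Get the git commit hash` above points at collapsed lines that now define `commit_hash`; the PR's exact code is not visible here, but a minimal pygit2 sketch of one way to obtain it (an assumption, not the file's actual lines) is:

```python
import pygit2

# Hypothetical reconstruction; the real lines are collapsed in this diff.
# For a non-detached HEAD, head.target is the commit Oid; str() yields the
# full hex SHA that the "source" URL template above can embed.
repo = pygit2.Repository('.')
commit_hash = str(repo.head.target)
```
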
2 changes: 1 addition & 1 deletion docs/source/features/paged-attention-ifb-scheduler.md
@@ -135,7 +135,7 @@ Overall, the max batch size and max num tokens limits play a key role in determi

## Revisiting Paged Context Attention and Context Chunking

-[Previously](./useful-build-time-flags.md#paged-context-attention) we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.
+Previously we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.

The [visualization](#the-schedulers) of the TensorRT LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.
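
As a side note, a minimal LLM-API sketch of what enabling context chunking can look like; the flag names (`enable_chunked_prefill`, `max_num_tokens`) and the model id are assumptions that may differ between releases, so treat it as illustrative rather than the documented interface:

```python
from tensorrt_llm import LLM

# Illustrative sketch, not part of this PR: check your release's LlmArgs for
# the exact option names before relying on them.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any HF model id
    enable_chunked_prefill=True,  # allow context phases to be split into chunks
    max_num_tokens=2048,          # per-iteration token budget the scheduler enforces
)
```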

10 changes: 5 additions & 5 deletions docs/source/features/sampling.md
@@ -6,7 +6,7 @@ The PyTorch backend supports most of the sampling features that are supported on
To use the feature:

1. Enable the `enable_trtllm_sampler` option in the `LLM` class
-2. Pass a [`SamplingParams`](../../../../tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function
+2. Pass a [`SamplingParams`](source:tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function

The following example prepares two identical prompts which will give different results due to the sampling parameters chosen:
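
The example itself is collapsed in this view; a minimal sketch along the same lines (the model id and sampling values are illustrative assumptions, not the file's exact code):

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
    enable_trtllm_sampler=True,                  # opt in to the TRT-LLM sampler
)

# Two identical prompts; temperature/top_p sampling makes their outputs diverge.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
for output in llm.generate(["Hello, my name is", "Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```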

@@ -74,7 +74,7 @@ The PyTorch backend supports guided decoding with the XGrammar and Low-level Gui
To enable guided decoding, you must:

1. Set the `guided_decoding_backend` parameter to `'xgrammar'` or `'llguidance'` in the `LLM` class
-2. Create a [`GuidedDecodingParams`](../../../../tensorrt_llm/sampling_params.py#L14) object with the desired format specification
+2. Create a [`GuidedDecodingParams`](source:tensorrt_llm/sampling_params.py#L14) object with the desired format specification
* Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
3. Pass the `GuidedDecodingParams` object to the `guided_decoding` parameter of the `SamplingParams` object

@@ -94,15 +94,15 @@ sampling_params = SamplingParams(
llm.generate("Generate a JSON response", sampling_params)
```
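
Most of that example is collapsed above; a self-contained sketch of the same flow (the model id, schema, and import path are assumptions, so see the linked `llm_guided_decoding.py` example for the authoritative version):

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
    guided_decoding_backend="xgrammar",          # or "llguidance"
)

# Constrain the output to a tiny JSON schema; `json` is one of the accepted
# format parameters (the others are `regex`, `grammar`, `structural_tag`).
schema = '{"type": "object", "properties": {"name": {"type": "string"}}}'
sampling_params = SamplingParams(
    guided_decoding=GuidedDecodingParams(json=schema),
)
print(llm.generate("Generate a JSON response", sampling_params).outputs[0].text)
```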

-You can find a more detailed example on guided decoding [here](../../../../examples/llm-api/llm_guided_decoding.py).
+You can find a more detailed example on guided decoding [here](source:examples/llm-api/llm_guided_decoding.py).

## Logits processor

Logits processors allow you to modify the logits produced by the network before sampling, enabling custom generation behavior and constraints.

To use a custom logits processor:

-1. Create a custom class that inherits from [`LogitsProcessor`](../../../../tensorrt_llm/sampling_params.py#L48) and implements the `__call__` method
+1. Create a custom class that inherits from [`LogitsProcessor`](source:tensorrt_llm/sampling_params.py#L48) and implements the `__call__` method
2. Pass an instance of this class to the `logits_processor` parameter of `SamplingParams`

The following example demonstrates logits processing:
@@ -132,4 +132,4 @@ sampling_params = SamplingParams(
llm.generate(["Hello, my name is"], sampling_params)
```

-You can find a more detailed example on logits processors [here](../../../../examples/llm-api/llm_logits_processor.py).
+You can find a more detailed example on logits processors [here](source:examples/llm-api/llm_logits_processor.py).
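
As with the other examples, the in-page listing is collapsed here; a minimal sketch of a custom processor (the `__call__` argument list is an assumption; the authoritative signature is on the `LogitsProcessor` base class and in the linked `llm_logits_processor.py`):

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import LogitsProcessor

class BanTokenLogitsProcessor(LogitsProcessor):
    """Push one token id's logit to -inf so it can never be sampled."""

    def __init__(self, banned_token_id: int):
        self.banned_token_id = banned_token_id

    def __call__(self, req_id, logits, token_ids, stream_ptr, client_id):
        # Assumed argument list; adapt it to the base class in your release.
        logits[..., self.banned_token_id] = float("-inf")

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model id
sampling_params = SamplingParams(
    logits_processor=BanTokenLogitsProcessor(banned_token_id=42),  # arbitrary token id
)
llm.generate(["Hello, my name is"], sampling_params)
```
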
3 changes: 2 additions & 1 deletion docs/source/installation/build-from-source-linux.md
@@ -158,7 +158,7 @@ example:
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real"
```

-To use the C++ benchmark scripts under [benchmark/cpp](/benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:
+To use the C++ benchmark scripts under [benchmark/cpp](source:benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:

```bash
python3 ./scripts/build_wheel.py --benchmarks
@@ -180,6 +180,7 @@ relevant classes. The associated unit tests should also be consulted for underst

This feature will not be enabled when [`building only the C++ runtime`](#link-with-the-tensorrt-llm-c++-runtime).

+(link-with-the-tensorrt-llm-c++-runtime)=
#### Linking with the TensorRT LLM C++ Runtime

The `build_wheel.py` script will also compile the library containing the C++ runtime of TensorRT LLM. If Python support and `torch` modules are not required, the script provides the option `--cpp_only` which restricts the build to the C++ runtime only.