5 changes: 2 additions & 3 deletions docs/source/conf.py
@@ -16,10 +16,9 @@

sys.path.insert(0, os.path.abspath('.'))

-project = 'TensorRT-LLM'
+project = 'TensorRT LLM'
copyright = '2025, NVidia'
author = 'NVidia'
-branch_name = pygit2.Repository('.').head.shorthand
html_show_sphinx = False

# Get the git commit hash
@@ -78,7 +77,7 @@
"https":
None,
"source":
"https://github.com/NVIDIA/TensorRT-LLM/tree/" + branch_name + "/{{path}}",
"https://github.com/NVIDIA/TensorRT-LLM/tree/" + commit_hash + "/{{path}}",
}

myst_heading_anchors = 4
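
The comment `# Get the git commit hash` above points at collapsed lines that now define `commit_hash`; the PR's exact code is not visible here, but a minimal pygit2 sketch of one way to obtain it (an assumption, not the file's actual lines) is:

```python
import pygit2

# Hypothetical reconstruction; the real lines are collapsed in this diff.
# For a non-detached HEAD, head.target is the commit Oid; str() yields the
# full hex SHA that the "source" URL template above can embed.
repo = pygit2.Repository('.')
commit_hash = str(repo.head.target)
```
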
2 changes: 1 addition & 1 deletion docs/source/features/paged-attention-ifb-scheduler.md
@@ -135,7 +135,7 @@ Overall, the max batch size and max num tokens limits play a key role in determi

## Revisiting Paged Context Attention and Context Chunking

-[Previously](./useful-build-time-flags.md#paged-context-attention) we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.
+Previously we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.

The [visualization](#the-schedulers) of the TensorRT LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.
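
As a side note, a minimal LLM-API sketch of what enabling context chunking can look like; the flag names (`enable_chunked_prefill`, `max_num_tokens`) and the model id are assumptions that may differ between releases, so treat it as illustrative rather than the documented interface:

```python
from tensorrt_llm import LLM

# Illustrative sketch, not part of this PR: check your release's LlmArgs for
# the exact option names before relying on them.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any HF model id
    enable_chunked_prefill=True,  # allow context phases to be split into chunks
    max_num_tokens=2048,          # per-iteration token budget the scheduler enforces
)
```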

10 changes: 5 additions & 5 deletions docs/source/features/sampling.md
@@ -6,7 +6,7 @@ The PyTorch backend supports most of the sampling features that are supported on
To use the feature:

1. Enable the `enable_trtllm_sampler` option in the `LLM` class
-2. Pass a [`SamplingParams`](../../../../tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function
+2. Pass a [`SamplingParams`](source:tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function

The following example prepares two identical prompts which will give different results due to the sampling parameters chosen:
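
The example itself is collapsed in this view; a minimal sketch along the same lines (the model id and sampling values are illustrative assumptions, not the file's exact code):

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
    enable_trtllm_sampler=True,                  # opt in to the TRT-LLM sampler
)

# Two identical prompts; temperature/top_p sampling makes their outputs diverge.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
for output in llm.generate(["Hello, my name is", "Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```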

@@ -74,7 +74,7 @@ The PyTorch backend supports guided decoding with the XGrammar and Low-level Gui
To enable guided decoding, you must:

1. Set the `guided_decoding_backend` parameter to `'xgrammar'` or `'llguidance'` in the `LLM` class
-2. Create a [`GuidedDecodingParams`](../../../../tensorrt_llm/sampling_params.py#L14) object with the desired format specification
+2. Create a [`GuidedDecodingParams`](source:tensorrt_llm/sampling_params.py#L14) object with the desired format specification
* Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
3. Pass the `GuidedDecodingParams` object to the `guided_decoding` parameter of the `SamplingParams` object

@@ -94,15 +94,15 @@ sampling_params = SamplingParams(
llm.generate("Generate a JSON response", sampling_params)
```
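
Most of that example is collapsed above; a self-contained sketch of the same flow (the model id, schema, and import path are assumptions, so see the linked `llm_guided_decoding.py` example for the authoritative version):

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
    guided_decoding_backend="xgrammar",          # or "llguidance"
)

# Constrain the output to a tiny JSON schema; `json` is one of the accepted
# format parameters (the others are `regex`, `grammar`, `structural_tag`).
schema = '{"type": "object", "properties": {"name": {"type": "string"}}}'
sampling_params = SamplingParams(
    guided_decoding=GuidedDecodingParams(json=schema),
)
print(llm.generate("Generate a JSON response", sampling_params).outputs[0].text)
```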

-You can find a more detailed example on guided decoding [here](../../../../examples/llm-api/llm_guided_decoding.py).
+You can find a more detailed example on guided decoding [here](source:examples/llm-api/llm_guided_decoding.py).

## Logits processor

Logits processors allow you to modify the logits produced by the network before sampling, enabling custom generation behavior and constraints.

To use a custom logits processor:

-1. Create a custom class that inherits from [`LogitsProcessor`](../../../../tensorrt_llm/sampling_params.py#L48) and implements the `__call__` method
+1. Create a custom class that inherits from [`LogitsProcessor`](source:tensorrt_llm/sampling_params.py#L48) and implements the `__call__` method
2. Pass an instance of this class to the `logits_processor` parameter of `SamplingParams`

The following example demonstrates logits processing:
@@ -132,4 +132,4 @@ sampling_params = SamplingParams(
llm.generate(["Hello, my name is"], sampling_params)
```

-You can find a more detailed example on logits processors [here](../../../../examples/llm-api/llm_logits_processor.py).
+You can find a more detailed example on logits processors [here](source:examples/llm-api/llm_logits_processor.py).
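
As with the other examples, the in-page listing is collapsed here; a minimal sketch of a custom processor (the `__call__` argument list is an assumption; the authoritative signature is on the `LogitsProcessor` base class and in the linked `llm_logits_processor.py`):

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import LogitsProcessor

class BanTokenLogitsProcessor(LogitsProcessor):
    """Push one token id's logit to -inf so it can never be sampled."""

    def __init__(self, banned_token_id: int):
        self.banned_token_id = banned_token_id

    def __call__(self, req_id, logits, token_ids, stream_ptr, client_id):
        # Assumed argument list; adapt it to the base class in your release.
        logits[..., self.banned_token_id] = float("-inf")

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model id
sampling_params = SamplingParams(
    logits_processor=BanTokenLogitsProcessor(banned_token_id=42),  # arbitrary token id
)
llm.generate(["Hello, my name is"], sampling_params)
```
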
3 changes: 2 additions & 1 deletion docs/source/installation/build-from-source-linux.md
@@ -158,7 +158,7 @@ example:
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real"
```

-To use the C++ benchmark scripts under [benchmark/cpp](/benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:
+To use the C++ benchmark scripts under [benchmark/cpp](source:benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:

```bash
python3 ./scripts/build_wheel.py --benchmarks
@@ -180,6 +180,7 @@ relevant classes. The associated unit tests should also be consulted for underst

This feature will not be enabled when [`building only the C++ runtime`](#link-with-the-tensorrt-llm-c++-runtime).

+(link-with-the-tensorrt-llm-c++-runtime)=
#### Linking with the TensorRT LLM C++ Runtime

The `build_wheel.py` script will also compile the library containing the C++ runtime of TensorRT LLM. If Python support and `torch` modules are not required, the script provides the option `--cpp_only` which restricts the build to the C++ runtime only.