diff --git a/docs/source/conf.py b/docs/source/conf.py
index def277aba43..650e6be87c1 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -16,10 +16,9 @@

 sys.path.insert(0, os.path.abspath('.'))

-project = 'TensorRT-LLM'
+project = 'TensorRT LLM'
 copyright = '2025, NVidia'
 author = 'NVidia'
-branch_name = pygit2.Repository('.').head.shorthand
 html_show_sphinx = False

 # Get the git commit hash
@@ -78,7 +77,7 @@
     "https": None,
     "source":
-    "https://github.com/NVIDIA/TensorRT-LLM/tree/" + branch_name + "/{{path}}",
+    "https://github.com/NVIDIA/TensorRT-LLM/tree/" + commit_hash + "/{{path}}",
 }

 myst_heading_anchors = 4
diff --git a/docs/source/features/paged-attention-ifb-scheduler.md b/docs/source/features/paged-attention-ifb-scheduler.md
index 2057be56048..61900572a84 100644
--- a/docs/source/features/paged-attention-ifb-scheduler.md
+++ b/docs/source/features/paged-attention-ifb-scheduler.md
@@ -135,7 +135,7 @@ Overall, the max batch size and max num tokens limits play a key role in determi

 ## Revisiting Paged Context Attention and Context Chunking

-[Previously](./useful-build-time-flags.md#paged-context-attention) we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.
+Previously we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.

 The [visualization](#the-schedulers) of the TensorRT LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.
diff --git a/docs/source/features/sampling.md b/docs/source/features/sampling.md
index 686a4e7bd64..e0a44c67d34 100644
--- a/docs/source/features/sampling.md
+++ b/docs/source/features/sampling.md
@@ -6,7 +6,7 @@ The PyTorch backend supports most of the sampling features that are supported on

 To use the feature:
 1. Enable the `enable_trtllm_sampler` option in the `LLM` class
-2. Pass a [`SamplingParams`](../../../../tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function
+2. Pass a [`SamplingParams`](source:tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function

 The following example prepares two identical prompts which will give different results due to the sampling parameters chosen:
@@ -74,7 +74,7 @@ The PyTorch backend supports guided decoding with the XGrammar and Low-level Gui

 To enable guided decoding, you must:
 1. Set the `guided_decoding_backend` parameter to `'xgrammar'` or `'llguidance'` in the `LLM` class
-2. Create a [`GuidedDecodingParams`](../../../../tensorrt_llm/sampling_params.py#L14) object with the desired format specification
+2. Create a [`GuidedDecodingParams`](source:tensorrt_llm/sampling_params.py#L14) object with the desired format specification
 * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
 3. Pass the `GuidedDecodingParams` object to the `guided_decoding` parameter of the `SamplingParams` object
@@ -94,7 +94,7 @@ sampling_params = SamplingParams(
 llm.generate("Generate a JSON response", sampling_params)
 ```

-You can find a more detailed example on guided decoding [here](../../../../examples/llm-api/llm_guided_decoding.py).
+You can find a more detailed example on guided decoding [here](source:examples/llm-api/llm_guided_decoding.py).

 ## Logits processor
@@ -102,7 +102,7 @@ Logits processors allow you to modify the logits produced by the network before

 To use a custom logits processor:

-1. Create a custom class that inherits from [`LogitsProcessor`](../../../../tensorrt_llm/sampling_params.py#L48) and implements the `__call__` method
+1. Create a custom class that inherits from [`LogitsProcessor`](source:tensorrt_llm/sampling_params.py#L48) and implements the `__call__` method
 2. Pass an instance of this class to the `logits_processor` parameter of `SamplingParams`

 The following example demonstrates logits processing:
@@ -132,4 +132,4 @@ sampling_params = SamplingParams(
 llm.generate(["Hello, my name is"], sampling_params)
 ```

-You can find a more detailed example on logits processors [here](../../../../examples/llm-api/llm_logits_processor.py).
+You can find a more detailed example on logits processors [here](source:examples/llm-api/llm_logits_processor.py).
diff --git a/docs/source/installation/build-from-source-linux.md b/docs/source/installation/build-from-source-linux.md
index f4b6f3836ff..53e1e5c3348 100644
--- a/docs/source/installation/build-from-source-linux.md
+++ b/docs/source/installation/build-from-source-linux.md
@@ -158,7 +158,7 @@ example:
 python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real"
 ```

-To use the C++ benchmark scripts under [benchmark/cpp](/benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:
+To use the C++ benchmark scripts under [benchmark/cpp](source:benchmarks/cpp/), for example `gptManagerBenchmark.cpp`, add the `--benchmarks` option:
 ```bash
 python3 ./scripts/build_wheel.py --benchmarks
 ```
@@ -180,6 +180,7 @@ relevant classes. The associated unit tests should also be consulted for underst

 This feature will not be enabled when [`building only the C++ runtime`](#link-with-the-tensorrt-llm-c++-runtime).

+(link-with-the-tensorrt-llm-c++-runtime)=
 #### Linking with the TensorRT LLM C++ Runtime

 The `build_wheel.py` script will also compile the library containing the C++ runtime of TensorRT LLM. If Python support and `torch` modules are not required, the script provides the option `--cpp_only` which restricts the build to the C++ runtime only.
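
Reviewer note: the second `conf.py` hunk switches the source links from `branch_name` to `commit_hash`, but the definition of `commit_hash` lies outside the diff context; only the `# Get the git commit hash` comment is visible. Below is a minimal sketch of how such a value could be obtained. The use of `subprocess` and the `"main"` fallback are assumptions for illustration, not the repository's actual implementation.

```python
# Illustrative sketch only (assumption): one way conf.py could define `commit_hash`.
import subprocess

try:
    # Resolve the commit that HEAD points to. Unlike the removed `branch_name`,
    # this also works on detached-HEAD checkouts (e.g. CI builds of a tag).
    commit_hash = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
except (OSError, subprocess.CalledProcessError):
    # Fall back to a branch name so the generated source links still resolve.
    commit_hash = "main"
```

Pinning the URL to a commit hash also keeps the `source:` links introduced in `sampling.md` and `build-from-source-linux.md` stable after the branch moves: with the `myst_url_schemes` template shown in the hunk, a link such as `source:examples/llm-api/llm_guided_decoding.py` expands to `https://github.com/NVIDIA/TensorRT-LLM/tree/<commit_hash>/examples/llm-api/llm_guided_decoding.py`.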