
Commit c8f6d4d

docs: add TRTLLM variable sliding window attention example for gemma3 model (#2134)
1 parent 347620a commit c8f6d4d

File tree

5 files changed: +135 −1 lines changed
components/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

tensor_parallel_size: 1
backend: pytorch

kv_cache_config:
  max_attention_window:
  - 512
  - 512
  - 512
  - 512
  - 512
  - 32768
  enable_block_reuse: false
components/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

tensor_parallel_size: 1
backend: pytorch

kv_cache_config:
  max_attention_window:
  - 512
  - 512
  - 512
  - 512
  - 512
  - 32768
  enable_block_reuse: false

cache_transceiver_config:
  backend: default
components/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

tensor_parallel_size: 1
backend: pytorch
disable_overlap_scheduler: True

kv_cache_config:
  max_attention_window:
  - 512
  - 512
  - 512
  - 512
  - 512
  - 32768
  enable_block_reuse: false

cache_transceiver_config:
  backend: default
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Gemma 3 with Variable Sliding Window Attention

This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, prefill, or decode worker requires only a single H100 or GB200 GPU.

VSWA is a mechanism in which a model's layers alternate between multiple sliding window sizes. Gemma 3 is an example: it interleaves global attention layers with sliding-window attention layers.
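
In the configs above, `kv_cache_config.max_attention_window` lists six per-layer window sizes: five 512-token sliding-window layers followed by one 32768-token global layer, matching Gemma 3's 5:1 interleaving. Below is a minimal sketch of how such a pattern maps onto layers, assuming TensorRT-LLM repeats the list cyclically when it is shorter than the model's layer count:

```bash
# Illustration only (not part of the deployment): map the six-entry
# max_attention_window pattern onto the first twelve layers, assuming
# the list is repeated cyclically across layers.
pattern=(512 512 512 512 512 32768)
for layer in $(seq 0 11); do
  echo "layer ${layer}: window ${pattern[$((layer % ${#pattern[@]}))]}"
done
```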

## Notes
* To run Gemma 3 with VSWA, ensure that the container has TensorRT-LLM v1.0.0rc4 installed.
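
To confirm the installed version inside the container, one quick check (assuming the standard tensorrt_llm Python package is on the path):

```bash
# Prints the installed TensorRT-LLM version, e.g. 1.0.0rc4.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```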

## Limitation
* KV-event-based KV routing currently does not work well with VSWA. The Dynamo team is actively working on adding support for distinguishing events that come from different layer groups.

## Aggregated Serving
```bash
cd $DYNAMO_HOME/components/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
export SERVED_MODEL_NAME=$MODEL_PATH
export AGG_ENGINE_ARGS=engine_configs/gemma3/vswa_agg.yaml
./launch/agg.sh
```
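
Once the worker is up, the deployment can be smoke-tested with an OpenAI-compatible request. The sketch below assumes the Dynamo frontend listens on localhost:8000 (adjust the host and port to your deployment); the same request also works for the disaggregated setup below.

```bash
# Hypothetical smoke test; assumes the OpenAI-compatible frontend is on localhost:8000.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Briefly, what is sliding window attention?"}],
    "max_tokens": 64
  }'
```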

## Disaggregated Serving
```bash
cd $DYNAMO_HOME/components/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
export SERVED_MODEL_NAME=$MODEL_PATH
export PREFILL_ENGINE_ARGS=engine_configs/gemma3/vswa_prefill.yaml
export DECODE_ENGINE_ARGS=engine_configs/gemma3/vswa_decode.yaml
./launch/disagg.sh
```

components/backends/trtllm/launch/disagg_router.sh

Lines changed: 1 addition & 1 deletion
@@ -53,4 +53,4 @@ CUDA_VISIBLE_DEVICES=$DECODE_CUDA_VISIBLE_DEVICES python3 -m dynamo.trtllm \
    --extra-engine-args "$DECODE_ENGINE_ARGS" \
    --disaggregation-mode decode \
    --disaggregation-strategy "$DISAGGREGATION_STRATEGY" \
-   "${EXTRA_DECODE_ARGS[@]}"
+   "${EXTRA_DECODE_ARGS[@]}"
