From f85cfcfacdbb593d2ec6c5409e99194d87599352 Mon Sep 17 00:00:00 2001
From: Zhenhua Wang <4936589+zhenhuaw-me@users.noreply.github.com>
Date: Wed, 13 Aug 2025 14:46:21 +0800
Subject: [PATCH] Add the workaround doc for H200 OOM

Signed-off-by: Zhenhua Wang <4936589+zhenhuaw-me@users.noreply.github.com>
---
 .../quick-start-recipe-for-deepseek-r1-on-trtllm.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md b/docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
index 8e06b8c55f9..070b2c18038 100644
--- a/docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
+++ b/docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
@@ -235,11 +235,12 @@ Here is an example response, showing that the TRT-LLM server returns “New York
 
 ### Troubleshooting Tips
 
-* If you encounter CUDA out-of-memory errors, try reducing max\_batch\_size or max\_seq\_len
-* Ensure your model checkpoints are compatible with the expected format
-* For performance issues, check GPU utilization with nvidia-smi while the server is running
-* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
-* For connection issues, make sure port 8000 is not being used by another application
+* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
+  * For input/output sequence lengths of 8K/1K on H200, there is a known CUDA out-of-memory issue caused by the PyTorch CUDA caching allocator fragmenting memory. As a workaround, set the environment variable `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:8192`. For more details, refer to the [PyTorch documentation on optimizing memory usage](https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
+* Ensure your model checkpoints are compatible with the expected format.
+* For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
+* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
+* For connection issues, make sure port 8000 is not being used by another application.
 
 ### Running Evaluations to Verify Accuracy (Optional)
 
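
As a minimal sketch of the workaround this patch documents: the environment variable is exported before starting the server so the PyTorch CUDA caching allocator picks it up. The model name, host, port, and `trtllm-serve` arguments below are illustrative assumptions; adjust them to match the serve command used earlier in the guide.

```bash
# Sketch only: apply the H200 OOM workaround before launching the server.
# Cap the allocator's split size to reduce memory fragmentation (value from the patch above).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:8192

# Assumed invocation for illustration; the environment variable is inherited by the server process.
trtllm-serve deepseek-ai/DeepSeek-R1 --host 0.0.0.0 --port 8000
```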