Describe the bug
During a stability test, we found that the router Pod keeps restarting after running for some time. It appears that Kubernetes restarts the router because it runs out of memory (OOM). We initially set the router Pod's memory limit to 2 GB. After raising the limit to 16 GB, we observed that the router's memory usage is unbounded and keeps growing.
To Reproduce
- Create a vLLM production stack router connected to multiple vLLM model server Pods
- Set the router's memory limit to 2 GB
- Run a continuous long-document QA benchmark for over 2 hours
- Observe that the router Pod is periodically restarted because of OOM
Expected behavior
The HashTrie size should be capped, and a periodic GC could clean up stale nodes to free memory, e.g., using a TTL- or LRU-based eviction policy.
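To make the request concrete, here is a minimal sketch of a size-capped prefix trie with LRU eviction and a periodic TTL sweep. It is not the production stack's HashTrie: the class `BoundedHashTrie`, its parameters `max_nodes` and `ttl_s`, and the methods `insert`, `longest_prefix_match`, and `sweep_expired` are hypothetical names, and it assumes the trie is keyed by per-chunk prompt hashes with a set of endpoint URLs stored at each node.

```python
import time
from collections import OrderedDict
from typing import Callable, Dict, Hashable, List, Optional, Set, Tuple


class _Node:
    __slots__ = ("key", "parent", "children", "endpoints", "last_access")

    def __init__(self, key: Optional[Hashable], parent: Optional["_Node"]) -> None:
        self.key = key
        self.parent = parent
        self.children: Dict[Hashable, "_Node"] = {}
        self.endpoints: Set[str] = set()  # endpoints that have seen this prefix
        self.last_access = 0.0


class BoundedHashTrie:
    """Hypothetical prefix trie over chunk hashes, capped by node count (LRU) and TTL."""

    def __init__(self, max_nodes: int, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_nodes = max_nodes
        self.ttl_s = ttl_s
        self.clock = clock
        self.root = _Node(None, None)
        # Non-root nodes, oldest access first. Ancestors are always touched
        # after their descendants, so the oldest node is guaranteed to be a leaf.
        self._lru: "OrderedDict[_Node, None]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._lru)

    def _touch(self, path: List[_Node]) -> None:
        now = self.clock()
        for node in reversed(path):  # deepest first, so parents end up newer
            node.last_access = now
            self._lru[node] = None
            self._lru.move_to_end(node)

    def _evict(self, node: _Node) -> None:
        # Only leaves reach this point, so no subtree is ever orphaned.
        del self._lru[node]
        del node.parent.children[node.key]

    def insert(self, chunk_hashes: List[Hashable], endpoint: str) -> None:
        node, path = self.root, []
        for h in chunk_hashes:
            child = node.children.get(h)
            if child is None:
                child = _Node(h, node)
                node.children[h] = child
            child.endpoints.add(endpoint)
            path.append(child)
            node = child
        self._touch(path)
        # Enforce the hard cap by dropping least-recently-used leaves.
        while len(self._lru) > self.max_nodes:
            self._evict(next(iter(self._lru)))

    def longest_prefix_match(self, chunk_hashes: List[Hashable]) -> Tuple[int, Set[str]]:
        node, path = self.root, []
        for h in chunk_hashes:
            child = node.children.get(h)
            if child is None:
                break
            path.append(child)
            node = child
        self._touch(path)
        return len(path), set(node.endpoints)

    def sweep_expired(self) -> int:
        """Drop nodes idle for longer than ttl_s; meant to run periodically."""
        cutoff = self.clock() - self.ttl_s
        removed = 0
        # LRU order is also last_access order, so expired nodes sit at the front.
        while self._lru:
            oldest = next(iter(self._lru))
            if oldest.last_access >= cutoff:
                break
            self._evict(oldest)
            removed += 1
        return removed
```

Because every node is touched after its descendants, the least-recently-used node is always a leaf, so both the size cap and the TTL sweep can evict from the front of one `OrderedDict` without walking the tree. Memory is then bounded by `max_nodes` regardless of how long a benchmark runs; in a real router, `sweep_expired` would presumably be driven by a background task.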
Additional context
Does the vLLM production stack team have any plans to fix this?