
Conversation

@can-sun commented Sep 22, 2025

Description

Currently the prefix-aware routing HashTrie lacks node eviction, so router pods crash once memory fills up with TrieNodes. During the pod restart period, users experience downtime and in-progress requests can be terminated mid-stream.

This PR adds an LRU to the HashTrie to evict nodes when the memory limit is hit, resolving the issue of the router being repeatedly restarted by Kubernetes due to insufficient memory. In addition, a HashTrieConfig is added to configure the parameters that affect performance, e.g. eviction threshold, max memory limit, and target utilization. These are configured via environment variables.

Nodes are added to the LRU when they are inserted into the Trie. When a node is revisited by another request, it is popped and re-appended to the tail of the LRU. This approach evicts child nodes first, since child nodes are always accessed less frequently than their parents.
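A minimal sketch of the touch/evict bookkeeping described above, assuming an `OrderedDict`-based cache keyed by `TrieNode`. The class and method names (`LRUNodeCache`, `touch`, etc.) are illustrative, not the PR's exact implementation:

```python
from collections import OrderedDict


class TrieNode:
    """Simplified trie node; the real node also carries more state."""
    __slots__ = ("children", "endpoints")

    def __init__(self):
        self.children = {}
        self.endpoints = set()


class LRUNodeCache:
    """Most-recently-used nodes live at the tail; eviction pops from the head."""

    def __init__(self):
        self.node_cache: "OrderedDict[TrieNode, None]" = OrderedDict()

    def add(self, node: TrieNode) -> None:
        # New nodes enter at the tail (most recently used).
        self.node_cache[node] = None

    def touch(self, node: TrieNode) -> None:
        # A revisited node is moved back to the tail.
        if node in self.node_cache:
            self.node_cache.move_to_end(node)

    def evict(self, count: int) -> list:
        # Oldest entries sit at the head; children are touched less often
        # than their parents, so they tend to be evicted first.
        evicted = []
        for _ in range(min(count, len(self.node_cache))):
            node, _ = self.node_cache.popitem(last=False)
            evicted.append(node)
        return evicted
```

In the actual Trie, evicted nodes would also need to be detached from their parent's `children` map so they can be garbage-collected.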

Benchmark Tool

We did the benchmarking by referencing the LMCache tooling, namely the vLLM production-stack multi-round QA script.

We also applied some local changes so that the benchmark can run continuously, allowing us to test the router's stability. Stats like TTFT, success/failure counts, and throughput are emitted to AWS cloud metrics for better visibility.

Benchmark Configs

The benchmark runs with 100 concurrent users continuously performing multi-turn chat. The input system prompt context length is randomly selected in the 8k-64k range. Each user asks 5 rounds of follow-up questions and then quits, after which a new user is added to the queue to send new requests. All benchmarks run continuously for over 3 hours to verify there are no stability problems.
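At a high level the driver looks roughly like the illustrative asyncio skeleton below. This is not the LMCache/production-stack script itself; the router URL, model name, prompt generator, and metric emission are placeholders:

```python
import asyncio
import random
import time

from openai import AsyncOpenAI  # any OpenAI-compatible client can target the router

ROUTER_URL = "http://<router-endpoint>/v1"  # placeholder, not a real endpoint
MODEL = "Llama-3.1-8b-instruct"             # served model name; adjust to the deployment
ROUNDS_PER_USER = 5
CONCURRENT_USERS = 100

client = AsyncOpenAI(base_url=ROUTER_URL, api_key="EMPTY")


async def run_user(user_id: int) -> None:
    # Stand-in for the 8k-64k-token random system prompt used in the benchmark.
    system_prompt = "lorem " * random.randint(8_000, 64_000)
    messages = [{"role": "system", "content": system_prompt}]
    for turn in range(ROUNDS_PER_USER):
        messages.append({"role": "user", "content": f"Follow-up question {turn}"})
        start = time.perf_counter()
        ttft = None
        pieces = []
        stream = await client.chat.completions.create(
            model=MODEL, messages=messages, stream=True, max_tokens=256
        )
        async for chunk in stream:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            pieces.append(chunk.choices[0].delta.content or "")
        messages.append({"role": "assistant", "content": "".join(pieces)})
        # The real harness emits TTFT/ITL/throughput/success metrics here.


async def main() -> None:
    # Keep CONCURRENT_USERS sessions alive; replace each finished user with a new one.
    pending = {asyncio.create_task(run_user(i)) for i in range(CONCURRENT_USERS)}
    next_id = CONCURRENT_USERS
    while True:  # the benchmark runs continuously (3+ hours in our tests)
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is not None:
                pass  # the real harness records this as a failed session
            pending.add(asyncio.create_task(run_user(next_id)))
            next_id += 1


if __name__ == "__main__":
    asyncio.run(main())
```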

Test Infra Setup

Platform: AWS EKS with SageMaker Hyperpod
Model Server Instance: 40 g5.24xlarge instances (4 NVIDIA A10G GPUs with 96 GB total GPU memory per instance)
Router Instance: c5.12xlarge (96 GB CPU memory per instance)
Invocation Path: Public ALB -> vLLM Production Stack Router -> vLLM Model Server
Model: Llama-3.1-8b-instruct
GPU Parallelization: 4 (thus each model replica is deployed on a dedicated g5 instance)
Router Replicas: 1 (no autoscaling enabled)
Router Pod Memory Limit: 4GB

Eviction Configs

Max Router Memory: 1GB
HashTrie Eviction Threshold: 0.9 (start eviction when 900MB+ memory is consumed)
HashTrie Utilization Target: 0.5 (evict down to 50% of total nodes)
HashTrie Chunk Size: 128
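For reference, these settings map onto the environment variables read by `HashTrieConfig.from_env` (variable names taken from the snippet quoted in the review below). A minimal sketch of setting them programmatically, e.g. in a test:

```python
import os

# Eviction settings used for this benchmark run, expressed via the environment
# variables that HashTrieConfig.from_env() reads (see the review discussion below).
os.environ["PREFIXAWARE_MAX_MEMORY_SIZE_GB"] = "1"  # Max Router Memory: 1GB
os.environ["HASHTRIE_EVICTION_THRESHOLD"] = "0.9"   # start evicting above ~900MB
os.environ["HASHTRIE_TARGET_UTILIZATION"] = "0.5"   # evict down to 50% of nodes
os.environ["HASHTRIE_CHUNK_SIZE"] = "128"
```

In a Kubernetes deployment these would typically be set in the router container's `env` block instead.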

Memory Usage before adding Eviction

Since HashTrie nodes are never evicted to free up memory, pod memory utilization keeps climbing until the memory allocated to the pod is exhausted; the pod then crashes and is restarted.

Memory Usage (chart)

TTFT (chart)

ITL (chart)

Availability (chart)

Throughput Output Tokens/s (chart)

Summary

| Metric Name | Stats | Value |
|---|---|---|
| Availability | Average | 96.2% |
| Throughput Tokens/s | Average | 153 |
| TTFT | Average | 5.07s |
| TTFT | P50 | 0.29s |
| TTFT | P90 | 21.58s |
| TTFT | P99 | 48.3s |
| ITL | Average | 0.05s |

Memory Usage after adding Eviction

After the eviction threshold is reached, nodes are evicted down to the target utilization. As a result, memory usage stays stable and no longer ramps up.

Memory Usage (chart)

TTFT (chart)

ITL (chart)

Availability (chart)

Throughput Output Tokens/s (chart)

Summary

| Metric Name | Stats | Value |
|---|---|---|
| Availability | Average | 98.2% |
| Throughput Tokens/s | Average | 146 |
| TTFT | Average | 6.06s |
| TTFT | P50 | 0.32s |
| TTFT | P90 | 23.37s |
| TTFT | P99 | 45.79s |
| ITL | Average | 0.05s |

FIX #687

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit by using -s when doing git commit
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @can-sun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial memory management enhancement to the PrefixAwareRouter by integrating an LRU eviction mechanism into its HashTrie data structure. This change directly addresses the problem of unbounded memory growth, which previously led to router pod crashes and service disruptions. By allowing the HashTrie to intelligently evict less frequently used nodes based on configurable memory thresholds, the router can now maintain stable memory consumption, significantly improving its resilience and preventing OOM-induced restarts.

Highlights

  • LRU Eviction for HashTrie: Implemented a Least Recently Used (LRU) eviction mechanism within the HashTrie to prevent Out-Of-Memory (OOM) errors in the router, ensuring stable memory usage.
  • Configurable Parameters: Introduced HashTrieConfig to allow configuration of HashTrie behavior, including chunk_size, max_memory_size, eviction_threshold, target_utilization, and memory_check_request_batch_size via environment variables.
  • Child Node Prioritization: The LRU eviction strategy is designed to evict child nodes before parent nodes, minimizing impact on frequently accessed prefixes and maintaining routing efficiency.
  • New Test Coverage: Added test_hashtrie_eviction to src/tests/test_prefixaware_router.py to validate the new eviction functionality and ensure its correctness.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a crucial feature: an LRU eviction mechanism for the HashTrie to prevent out-of-memory issues. The implementation is well-structured, using an OrderedDict for the LRU cache, a HashTrieConfig for configuration, and psutil for memory monitoring. The logic to periodically check memory and evict nodes when a threshold is reached is a solid approach to managing memory growth. My review includes a few suggestions to improve robustness, configuration clarity, and memory efficiency during the eviction process. Overall, this is a great addition that will significantly improve the router's stability.
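For readers unfamiliar with the mechanism: the periodic check described here amounts to comparing the router process's resident memory against the configured threshold. A minimal sketch using psutil (standard psutil calls; the function name, `memory_threshold_mb`, and the logging style are illustrative, not the PR's exact code):

```python
import logging

import psutil

logger = logging.getLogger(__name__)

BYTES_PER_MB = 1024 * 1024


def should_evict(memory_threshold_mb: float) -> bool:
    """Return True when the current process RSS exceeds the eviction threshold."""
    try:
        rss_mb = psutil.Process().memory_info().rss / BYTES_PER_MB
    except Exception as exc:  # never let monitoring failures crash request handling
        logger.warning("Memory check failed: %s", exc)
        return False
    if rss_mb > memory_threshold_mb:
        logger.warning("Memory %.1fMB exceeds threshold %.1fMB", rss_mb, memory_threshold_mb)
        return True
    return False
```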

Comment on lines 145 to 146

```python
except:
    pass
```

critical

Using a bare except: pass is dangerous as it silently swallows all exceptions, including potentially critical ones like psutil errors or even KeyboardInterrupt. This could cause the eviction mechanism to fail silently, defeating the purpose of this PR and re-introducing the risk of OOM crashes.

It's crucial to catch specific exceptions or, at a minimum, Exception as e and log the error to ensure that any failures in the memory checking or eviction process are visible for debugging.

Suggested change:

```diff
-except:
-    pass
+except Exception as e:
+    logger.warning(f"Failed to check memory for eviction: {e}")
```

Comment on lines +24 to +45
```python
    chunk_size: int = 0
    max_memory_size: int = 0  # GB
    eviction_threshold: float = 0  # start evicting at
    target_utilization: float = 0  # evict down to
    memory_check_request_batch_size: int = 0

    @staticmethod
    def from_defaults(
        chunk_size: int = 128,
        max_memory_size: int = 2,
        eviction_threshold: float = 0.9,
        target_utilization: float = 0.5,
        memory_check_request_batch_size: int = 10,
    ) -> "HashTrieConfig":
        """Create configuration with default values"""
        return HashTrieConfig(
            chunk_size,
            max_memory_size,
            eviction_threshold,
            target_utilization,
            memory_check_request_batch_size,
        )
```

medium

The current implementation of HashTrieConfig has default values of 0 in the dataclass definition, which can be misleading and lead to a non-functional configuration if instantiated with HashTrieConfig(). The actual default values are in a separate from_defaults static method.

To improve clarity and usability, I suggest moving the default values directly into the dataclass attributes and removing the from_defaults method. This makes the default configuration explicit and simplifies instantiation.

You will need to update the call sites:

  • HashTrieConfig.from_defaults() becomes HashTrieConfig()
  • HashTrieConfig.from_defaults(...) becomes HashTrieConfig(...)
```python
@dataclass
class HashTrieConfig:
    """Configuration for HashTrie with LRU eviction"""

    chunk_size: int = 128
    max_memory_size: int = 2  # GB
    eviction_threshold: float = 0.9  # start evicting at
    target_utilization: float = 0.5  # evict down to
    memory_check_request_batch_size: int = 10

    @staticmethod
    def from_env() -> "HashTrieConfig":
        """Load configuration from environment variables"""
        return HashTrieConfig(
            chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", "128")),
            max_memory_size=int(os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", "2")),
            eviction_threshold=float(os.getenv("HASHTRIE_EVICTION_THRESHOLD", "0.9")),
            target_utilization=float(os.getenv("HASHTRIE_TARGET_UTILIZATION", "0.5")),
            memory_check_request_batch_size=int(
                os.getenv("MEMORY_CHECK_REQUEST_BATCH_SIZE", "10")
            ),
        )
```

Comment on lines +189 to +191
```python
evict_nodes = []
for node in list(self.node_cache.keys())[:target_evictions]:
    evict_nodes.append(node)
```

medium

The current implementation for collecting nodes to evict can be inefficient. list(self.node_cache.keys()) creates a copy of all keys in memory. When memory is already under pressure (the very reason this function is called), creating a potentially large list could exacerbate the problem and is also redundant with the subsequent loop.

A more memory-efficient and concise approach is to use itertools.islice to create a list containing only the keys of the nodes that need to be evicted.

```python
import itertools

evict_nodes = list(itertools.islice(self.node_cache, target_evictions))
```

@YuhanLiu11 (Collaborator) commented

@can-sun Thanks for the fix! Could you fix the pre-commit error and show an easy way for us to see the effect of your fix?

@can-sun (Author) commented Sep 23, 2025

> @can-sun Thanks for the fix! Could you fix the pre-commit error and show an easy way for us to see the effect of your fix?

Thanks Yuhan, let me double-check the pre-commit failures tomorrow. I am fairly sure I ran the pre-commit check before sending out the PR. I will also pull together some metrics tomorrow to demonstrate the effect.

@YuhanLiu11 (Collaborator) commented

@can-sun Thanks for the fix. BTW, I realize that if we evict nodes from the HashTrie, the vLLM-side latency of requests may be affected, since the information about where previous requests went is lost. We will have to run some systematic benchmarks to test the e2e performance of your PR.
cc @zerofishnoodles

@can-sun (Author) commented Sep 24, 2025

> @can-sun Thanks for the fix. BTW, I realize that if we evict nodes from the HashTrie, the vLLM-side latency of requests may be affected, since the information about where previous requests went is lost. We will have to run some systematic benchmarks to test the e2e performance of your PR. cc @zerofishnoodles

You are definitely right. We also did some benchmarking locally and evaluated the performance impact. Let me share that with you today.

@jiapinw commented Sep 24, 2025

> the vLLM-side latency of requests may be affected, since the information about where previous requests went is lost

I think this is intentional for this PR; it's a balance between availability and latency, in my opinion. There won't be any performance regression until we hit the configured memory limit. If I am reading the PR correctly, node eviction only happens when we reach that limit, i.e. if we trim a root node, it frees up quite a bit of space and gives an extended period during which eviction is no longer needed. We do need to be careful about how we set the defaults, though; one improvement could be deriving the default from the instance type or other environment variables, for instance the CPU limits configured during a Kubernetes deployment (see the sketch below).
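A variation of that idea, using the container's memory limit rather than CPU limits, might look like the hedged sketch below. The cgroup paths are the standard Linux locations Kubernetes writes limits to, while the function name and the 50% fraction are purely illustrative:

```python
_GB = 1024 ** 3


def default_max_memory_gb(fraction: float = 0.5, fallback_gb: int = 2) -> int:
    """Derive a default HashTrie memory budget from the container's memory limit."""
    # Try cgroup v2 first, then cgroup v1.
    for path in ("/sys/fs/cgroup/memory.max",
                 "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
        try:
            with open(path) as f:
                raw = f.read().strip()
            if raw == "max":  # cgroup v2 reports "max" when no limit is set
                break
            # Note: cgroup v1 reports a huge sentinel value when unlimited;
            # a real implementation should sanity-check this.
            return max(1, int(int(raw) * fraction) // _GB)
        except (OSError, ValueError):
            continue
    return fallback_gb
```

`HashTrieConfig.from_env` could fall back to such a value when `PREFIXAWARE_MAX_MEMORY_SIZE_GB` is not set.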

@can-sun (Author) commented Sep 24, 2025

Done, I added some data collected during benchmarking. Hope it helps. Feel free to take the code and run your own local performance evaluations. Thanks!

@can-sun (Author) commented Sep 25, 2025

Hey @YuhanLiu11, do you know the next steps for this change? We are currently waiting for it to land before putting the router into use.

@can-sun (Author) commented Sep 25, 2025

Double-checked: the E2E test failed because of `ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device`.

@YuhanLiu11 (Collaborator) commented

hey @can-sun thanks for adding the benchmarking results!

@zerofishnoodles can you check why this CI fails due to space errors? Is our CI server full?

@zerofishnoodles (Collaborator) commented

This is weird. We just cleaned the space not long ago. I will take a look today.

@zerofishnoodles (Collaborator) commented Sep 26, 2025

That one is the GitHub-hosted machine; their machine is out of space. :(
I think on our side we can either modify the CI to reduce the dependency size or move it to our self-hosted machine.

"""Load configuration from environment variables"""
return HashTrieConfig(
chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", "128")),
max_memory_size=int(os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", "2")),


Read default values from constants instead of hardcoding them here (see the sketch below).
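One possible shape for this, with module-level default constants shared by both the dataclass fields and `from_env` (the constant names are illustrative, not from the PR, and the config is trimmed to a few fields):

```python
import os
from dataclasses import dataclass

# Single source of truth for defaults, shared by the dataclass and from_env().
DEFAULT_CHUNK_SIZE = 128
DEFAULT_MAX_MEMORY_SIZE_GB = 2
DEFAULT_EVICTION_THRESHOLD = 0.9
DEFAULT_TARGET_UTILIZATION = 0.5


@dataclass
class HashTrieConfig:
    chunk_size: int = DEFAULT_CHUNK_SIZE
    max_memory_size: int = DEFAULT_MAX_MEMORY_SIZE_GB  # GB
    eviction_threshold: float = DEFAULT_EVICTION_THRESHOLD
    target_utilization: float = DEFAULT_TARGET_UTILIZATION

    @staticmethod
    def from_env() -> "HashTrieConfig":
        return HashTrieConfig(
            chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", str(DEFAULT_CHUNK_SIZE))),
            max_memory_size=int(
                os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", str(DEFAULT_MAX_MEMORY_SIZE_GB))
            ),
            eviction_threshold=float(
                os.getenv("HASHTRIE_EVICTION_THRESHOLD", str(DEFAULT_EVICTION_THRESHOLD))
            ),
            target_utilization=float(
                os.getenv("HASHTRIE_TARGET_UTILIZATION", str(DEFAULT_TARGET_UTILIZATION))
            ),
        )
```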



```python
class NodeMetadata:
    """Metadata for nodes tracked in LRU cache"""
```


You need to make NodeMetadata immutable for nodes tracked in the LRU cache. Please add @dataclass(frozen=True).
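For example (the fields shown here are hypothetical placeholders; the actual fields are defined in the PR's diff):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeMetadata:
    """Metadata for nodes tracked in the LRU cache; frozen makes entries immutable."""

    # Hypothetical fields for illustration only.
    size_bytes: int = 0
    created_at: float = 0.0
```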

```diff
-        self.children = {}
-        self.endpoints = set()
+        self.children: Dict[int, "TrieNode"] = {}
+        self.endpoints: Set[str] = set()
```


Suggest adding is_leaf and other operation methods, something like below:

```python
    @property
    def is_leaf(self) -> bool:
        """Check if the node is a leaf node."""
        return not bool(self.children)

    async def add_child(self, hash_value: int, node: "TrieNode") -> None:
        """Add a child node."""
        async with self.lock:
            self.children[hash_value] = node

    async def remove_child(self, hash_value: int) -> None:
        """Remove a child node."""
        async with self.lock:
            self.children.pop(hash_value, None)

    async def add_endpoint(self, endpoint: str) -> None:
        """Add an endpoint to the node."""
        async with self.lock:
            self.endpoints.add(endpoint)

    def __repr__(self) -> str:
        return f"TrieNode(children={len(self.children)}, endpoints={len(self.endpoints)})"
```

```python
            self.config.max_memory_size * 1024 * self.config.eviction_threshold
        )
        # eviction percentage
        self.eviction_target_percentage = 1.0 - self.config.target_utilization
```


Please move all these constant values to be read from environment variables or default constants. Also see whether it makes sense to introduce a HashTrieStats, something like:

```python
@dataclass
class HashTrieStats:
    current_memory_mb: float = 0.0
    operation_count: int = 0


@dataclass
class HashTrie:
    config: HashTrieConfig = field(default_factory=HashTrieConfig.from_defaults)
    root: TrieNode = field(default_factory=TrieNode)
    node_cache: OrderedDict[TrieNode, NodeMetadata] = field(default_factory=OrderedDict)
    stats: HashTrieStats = field(default_factory=HashTrieStats)
    eviction_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    def __post_init__(self):
        self.memory_threshold_mb = (
            self.config.max_memory_size * BYTES_TO_MB * self.config.eviction_threshold
        )
        self.eviction_target_percentage = 1.0 - self.config.target_utilization

    @property
    def current_memory_mb(self) -> float:
        return self.stats.current_memory_mb

    @current_memory_mb.setter
    def current_memory_mb(self, value: float) -> None:
        self.stats.current_memory_mb = value

    @property
    def operation_count(self) -> int:
        return self.stats.operation_count

    @operation_count.setter
    def operation_count(self, value: int) -> None:
        self.stats.operation_count = value

    def __repr__(self) -> str:
        return (f"HashTrie(nodes={len(self.node_cache)}, "
                f"memory={self.current_memory_mb:.2f}MB)")
```


```python
# Check if we should evict
if self.operation_count == 0:
    try:
```


Can you please move the eviction logic into a separate method on HashTrie, something like:

```python
class HashTrie:
    async def _check_and_perform_eviction(self) -> None:
        """Check memory usage and perform eviction if necessary."""
        try:
            stats = await MemoryMonitor.get_memory_stats(self.memory_threshold_mb)

            if abs(stats.current_mb - self.current_memory_mb) < 0.1:
                logger.debug(f"Eviction skipped - minimal memory change: {stats.current_mb:.1f}MB")
                return

            self.current_memory_mb = stats.current_mb

            if stats.current_mb > stats.threshold_mb:
                logger.warning(f"Memory threshold exceeded: {stats.current_mb:.1f}MB > {stats.threshold_mb:.1f}MB")
                await self._perform_eviction()
        except Exception as e:
            # Add exception handling
            ...

    async def _perform_eviction(self) -> None:
        """Handle the actual eviction process with retry logic."""
        max_retries = self.config.eviction_max_retries
        for retry in range(max_retries):
            try:
                await self.batch_evict(self.eviction_target_percentage)
                return
            except Exception as e:
                # Add exception handling
                ...

        logger.error("Max eviction retries reached")

    async def _handle_monitoring_failure(self) -> None:
        """Handle cases where memory monitoring fails."""
```


Development

Successfully merging this pull request may close these issues.

bug: router crashes because of OOM and prefix cache HashTrie memory usage is unbounded
