[Feat][Router] Add HashTrie LRU to prevent OOM #711
Conversation
Signed-off-by: Can Sun <[email protected]>
Code Review
This pull request introduces a crucial feature: an LRU eviction mechanism for the HashTrie to prevent out-of-memory issues. The implementation is well-structured, using an OrderedDict for the LRU cache, a HashTrieConfig for configuration, and psutil for memory monitoring. The logic to periodically check memory and evict nodes when a threshold is reached is a solid approach to managing memory growth. My review includes a few suggestions to improve robustness, configuration clarity, and memory efficiency during the eviction process. Overall, this is a great addition that will significantly improve the router's stability.
src/vllm_router/prefix/hashtrie.py
except:
    pass
Using a bare except: pass is dangerous as it silently swallows all exceptions, including potentially critical ones like psutil errors or even KeyboardInterrupt. This could cause the eviction mechanism to fail silently, defeating the purpose of this PR and re-introducing the risk of OOM crashes.
It's crucial to catch specific exceptions or, at a minimum, catch Exception as e and log the error, to ensure that any failures in the memory checking or eviction process are visible for debugging.
-except:
-    pass
+except Exception as e:
+    logger.warning(f"Failed to check memory for eviction: {e}")
chunk_size: int = 0
max_memory_size: int = 0  # GB
eviction_threshold: float = 0  # start evicting at
target_utilization: float = 0  # evict down to
memory_check_request_batch_size: int = 0

@staticmethod
def from_defaults(
    chunk_size: int = 128,
    max_memory_size: int = 2,
    eviction_threshold: float = 0.9,
    target_utilization: float = 0.5,
    memory_check_request_batch_size: int = 10,
) -> "HashTrieConfig":
    """Create configuration with default values"""
    return HashTrieConfig(
        chunk_size,
        max_memory_size,
        eviction_threshold,
        target_utilization,
        memory_check_request_batch_size,
    )
The current implementation of HashTrieConfig has default values of 0 in the dataclass definition, which can be misleading and lead to a non-functional configuration if instantiated with HashTrieConfig(). The actual default values are in a separate from_defaults static method.
To improve clarity and usability, I suggest moving the default values directly into the dataclass attributes and removing the from_defaults method. This makes the default configuration explicit and simplifies instantiation.
You will need to update the call sites:
HashTrieConfig.from_defaults() becomes HashTrieConfig()
HashTrieConfig.from_defaults(...) becomes HashTrieConfig(...)
@dataclass
class HashTrieConfig:
    """Configuration for HashTrie with LRU eviction"""

    chunk_size: int = 128
    max_memory_size: int = 2  # GB
    eviction_threshold: float = 0.9  # start evicting at
    target_utilization: float = 0.5  # evict down to
    memory_check_request_batch_size: int = 10

    @staticmethod
    def from_env() -> "HashTrieConfig":
        """Load configuration from environment variables"""
        return HashTrieConfig(
            chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", "128")),
            max_memory_size=int(os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", "2")),
            eviction_threshold=float(os.getenv("HASHTRIE_EVICTION_THRESHOLD", "0.9")),
            target_utilization=float(os.getenv("HASHTRIE_TARGET_UTILIZATION", "0.5")),
            memory_check_request_batch_size=int(
                os.getenv("MEMORY_CHECK_REQUEST_BATCH_SIZE", "10")
            ),
        )

src/vllm_router/prefix/hashtrie.py

evict_nodes = []
for node in list(self.node_cache.keys())[:target_evictions]:
    evict_nodes.append(node)
The current implementation for collecting nodes to evict can be inefficient. list(self.node_cache.keys()) creates a copy of all keys in memory. When memory is already under pressure (the very reason this function is called), creating a potentially large list could exacerbate the problem and is also redundant with the subsequent loop.
A more memory-efficient and concise approach is to use itertools.islice to create a list containing only the keys of the nodes that need to be evicted.
import itertools

evict_nodes = list(itertools.islice(self.node_cache, target_evictions))
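Since an OrderedDict iterates from the oldest entry to the most recently touched one (given that hits call move_to_end), islicing the first target_evictions keys selects the least recently used nodes. A tiny standalone illustration of that ordering, not code from the PR:

import itertools
from collections import OrderedDict

cache = OrderedDict((k, None) for k in ("a", "b", "c", "d"))
cache.move_to_end("a")  # "a" becomes the most recently used entry

# The first two keys in iteration order are the least recently used ones.
print(list(itertools.islice(cache, 2)))  # ['b', 'c']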
@can-sun Thanks for the fix! Could you fix the pre-commit error and show an easy way for us to see the effect of your fix?
Thanks Yuhan, let me double-check the pre-commit failures tomorrow. I am pretty sure I ran the pre-commit check before sending out the PR. I will pull up some metrics that I can share with you tomorrow to justify the effect.
Signed-off-by: Can Sun <[email protected]>
@can-sun Thanks for the fix. BTW I realize that if we evict nodes from the HashTrie, the vLLM-side delay of requests may be affected, since the information about where those requests went is missing. We will have to run some systematic benchmarks to test the e2e performance of your PR.
You are definitely right. We also did some benchmarking locally and evaluated the performance impact. Let me share that with you today.
I think this is intentional for this PR. It's a balance between availability and latency, imo. There won't be any performance regression until we hit the configured memory limit. If I am reading the PR correctly, node eviction only happens when we reach the limit, i.e. if we trim down a root node, it frees up quite a bit of space and gives an extended period during which eviction is no longer needed. Though we do need to be careful about how we set the defaults; one improvement area might be setting the default based on instance type or other env vars, for instance the CPU limits configured during a k8s deployment.
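As a rough sketch of that idea (illustrative only, not part of this PR; it assumes a cgroup v2 environment, and the function name and fallback are hypothetical), the default memory budget could be derived from the container's own limit:

def default_max_memory_gb(fallback_gb: int = 2) -> float:
    """Derive the default HashTrie memory budget from the container limit."""
    try:
        # cgroup v2 exposes the memory limit here; the value is bytes or "max".
        with open("/sys/fs/cgroup/memory.max") as f:
            raw = f.read().strip()
        if raw != "max":
            return int(raw) / (1024**3)
    except OSError:
        pass
    return float(fallback_gb)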
Done, added some data collected during benchmarking. Hope it helps. Feel free to take the code and run local performance evaluations. Thanks.
Hey @YuhanLiu11, do you know the next steps for this change? Currently we are waiting for this change to be in place before putting the router into use.
Double checked, the E2E test failed because the CI machine ran out of disk space.
Hey @can-sun thanks for adding the benchmarking results! @zerofishnoodles can you check why this CI fails due to space errors? Is our CI server full?
This is weird. We just cleaned the space not long ago. I will take a look today.
That one is the GitHub-hosted machine; it's out of space. :(
| """Load configuration from environment variables""" | ||
| return HashTrieConfig( | ||
| chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", "128")), | ||
| max_memory_size=int(os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", "2")), |
Read the default values from constants instead of hardcoding them here.
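A minimal sketch of what that could look like (constant names are hypothetical and only a few fields are shown), combining this with the earlier suggestion to put real defaults on the dataclass fields, so the dataclass and from_env stay in sync:

import os
from dataclasses import dataclass

DEFAULT_HASHTRIE_CHUNK_SIZE = 128
DEFAULT_MAX_MEMORY_SIZE_GB = 2
DEFAULT_EVICTION_THRESHOLD = 0.9


@dataclass
class HashTrieConfig:
    chunk_size: int = DEFAULT_HASHTRIE_CHUNK_SIZE
    max_memory_size: int = DEFAULT_MAX_MEMORY_SIZE_GB  # GB
    eviction_threshold: float = DEFAULT_EVICTION_THRESHOLD

    @staticmethod
    def from_env() -> "HashTrieConfig":
        # The same constants back the environment-variable fallbacks.
        return HashTrieConfig(
            chunk_size=int(
                os.getenv("HASHTRIE_CHUNK_SIZE", str(DEFAULT_HASHTRIE_CHUNK_SIZE))
            ),
            max_memory_size=int(
                os.getenv(
                    "PREFIXAWARE_MAX_MEMORY_SIZE_GB", str(DEFAULT_MAX_MEMORY_SIZE_GB)
                )
            ),
            eviction_threshold=float(
                os.getenv(
                    "HASHTRIE_EVICTION_THRESHOLD", str(DEFAULT_EVICTION_THRESHOLD)
                )
            ),
        )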
class NodeMetadata:
    """Metadata for nodes tracked in LRU cache"""
You need to make NodeMetadata immutable since its instances are tracked in the LRU cache. Please add @dataclass(frozen=True).
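A short sketch of the frozen variant (the fields shown are illustrative, not necessarily the PR's):

from dataclasses import dataclass


@dataclass(frozen=True)
class NodeMetadata:
    """Metadata for nodes tracked in LRU cache"""

    depth: int = 0
    insert_time_s: float = 0.0

With frozen=True, instances are immutable and hashable, so metadata stored in the LRU cache cannot be mutated in place; an update creates a new NodeMetadata.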
-        self.children = {}
-        self.endpoints = set()
+        self.children: Dict[int, "TrieNode"] = {}
+        self.endpoints: Set[str] = set()
Suggest adding is_leaf and other operation methods, something like the following:
@property
def is_leaf(self) -> bool:
    """Check if the node is a leaf node."""
    return not bool(self.children)

async def add_child(self, hash_value: int, node: "TrieNode") -> None:
    """Add a child node."""
    async with self.lock:
        self.children[hash_value] = node

async def remove_child(self, hash_value: int) -> None:
    """Remove a child node."""
    async with self.lock:
        self.children.pop(hash_value, None)

async def add_endpoint(self, endpoint: str) -> None:
    """Add an endpoint to the node."""
    async with self.lock:
        self.endpoints.add(endpoint)

def __repr__(self) -> str:
    return f"TrieNode(children={len(self.children)}, endpoints={len(self.endpoints)})"
    self.config.max_memory_size * 1024 * self.config.eviction_threshold
)
# eviction percentage
self.eviction_target_percentage = 1.0 - self.config.target_utilization
Please move all these constant values to be read from env variables or default constants. Also see if it makes sense to introduce HashTrieStats, something like:
@dataclass
class HashTrieStats:
    current_memory_mb: float = 0.0
    operation_count: int = 0


@dataclass
class HashTrie:
    config: HashTrieConfig = field(default_factory=HashTrieConfig.from_defaults)
    root: TrieNode = field(default_factory=TrieNode)
    node_cache: OrderedDict[TrieNode, NodeMetadata] = field(default_factory=OrderedDict)
    stats: HashTrieStats = field(default_factory=HashTrieStats)
    eviction_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    def __post_init__(self):
        self.memory_threshold_mb = (
            self.config.max_memory_size * BYTES_TO_MB * self.config.eviction_threshold
        )
        self.eviction_target_percentage = 1.0 - self.config.target_utilization

    @property
    def current_memory_mb(self) -> float:
        return self.stats.current_memory_mb

    @current_memory_mb.setter
    def current_memory_mb(self, value: float) -> None:
        self.stats.current_memory_mb = value

    @property
    def operation_count(self) -> int:
        return self.stats.operation_count

    @operation_count.setter
    def operation_count(self, value: int) -> None:
        self.stats.operation_count = value

    def __repr__(self) -> str:
        return (
            f"HashTrie(nodes={len(self.node_cache)}, "
            f"memory={self.current_memory_mb:.2f}MB)"
        )
# Check if we should evict
if self.operation_count == 0:
    try:
Can you please move the eviction logic to a separate method in HashTrie, something like:
class HashTrie:
    async def _check_and_perform_eviction(self) -> None:
        """Check memory usage and perform eviction if necessary."""
        try:
            stats = await MemoryMonitor.get_memory_stats(self.memory_threshold_mb)
            if abs(stats.current_mb - self.current_memory_mb) < 0.1:
                logger.debug(
                    f"Eviction skipped - minimal memory change: {stats.current_mb:.1f}MB"
                )
                return
            self.current_memory_mb = stats.current_mb
            if stats.current_mb > stats.threshold_mb:
                logger.warning(
                    f"Memory threshold exceeded: {stats.current_mb:.1f}MB > {stats.threshold_mb:.1f}MB"
                )
                await self._perform_eviction()
        except Exception as e:
            # Add exception handling
            ...

    async def _perform_eviction(self) -> None:
        """Handle the actual eviction process with retry logic."""
        max_retries = self.config.eviction_max_retries
        for retry in range(max_retries):
            try:
                await self.batch_evict(self.eviction_target_percentage)
                return
            except Exception as e:
                # Add exception handling
                ...
        logger.error("Max eviction retries reached")

    async def _handle_monitoring_failure(self) -> None:
        """Handle cases where memory monitoring fails."""
Description
Currently the prefix-aware routing HashTrie lacks the capability to evict nodes. This results in router pod crashes once memory fills up with TrieNodes. During the pod restart period, users experience downtime, and in-progress requests can be terminated mid-stream.
This PR adds an LRU to the HashTrie to evict nodes when the memory limit is hit. It resolves the issue of the router being repeatedly restarted by Kubernetes due to insufficient memory. In addition, HashTrieConfig is added to configure important parameters that affect performance, e.g. eviction threshold, max memory limit, target utilization, etc. These are configured via environment variables.
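For illustration, the configuration can be driven entirely by the environment variables read in HashTrieConfig.from_env() shown in the diff above; the values here are examples, and the import path is an assumption based on the file location src/vllm_router/prefix/hashtrie.py.

import os

from vllm_router.prefix.hashtrie import HashTrieConfig  # assumed module path

os.environ["HASHTRIE_CHUNK_SIZE"] = "128"
os.environ["PREFIXAWARE_MAX_MEMORY_SIZE_GB"] = "1"
os.environ["HASHTRIE_EVICTION_THRESHOLD"] = "0.9"
os.environ["HASHTRIE_TARGET_UTILIZATION"] = "0.5"
os.environ["MEMORY_CHECK_REQUEST_BATCH_SIZE"] = "10"

config = HashTrieConfig.from_env()  # picks up the values set above

In a Kubernetes deployment these variables would typically be set in the router pod's env section rather than in code.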
Nodes are added to the LRU when they are inserted into the Trie. When a node is revisited by another request, it is popped and re-added to the tail of the LRU. This approach evicts child nodes first, since child nodes are always accessed less frequently than their parents.
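A minimal sketch of that bookkeeping with an OrderedDict (illustrative only; the PR's actual HashTrie and NodeMetadata types are stood in for by plain objects):

from collections import OrderedDict


class LRUIndex:
    """Tracks trie nodes from least to most recently used."""

    def __init__(self) -> None:
        self.node_cache: OrderedDict = OrderedDict()  # oldest entries first

    def touch(self, node, metadata=None) -> None:
        # On insert or revisit, the node moves to the tail (most recently used).
        if node in self.node_cache:
            self.node_cache.move_to_end(node)
        else:
            self.node_cache[node] = metadata

    def evict_oldest(self, count: int) -> list:
        # Eviction pops from the head, so rarely revisited child nodes go
        # before their frequently touched parents.
        evicted = []
        for _ in range(min(count, len(self.node_cache))):
            node, _ = self.node_cache.popitem(last=False)
            evicted.append(node)
        return evicted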
Benchmark Tool
We did the benchmarking by referencing the LMCache tool and the vLLM production stack multi-round QA script.
We also applied some local changes so the benchmark can run continuously, which lets us test the router's stability. Stats like TTFT, success/failure, and throughput are emitted to AWS cloud metrics for better visibility.
Benchmark Configs
The benchmark is performed with 100 concurrent users continuously performing multi-turn chat. The input system prompt context length is randomly selected, ranging from 8k to 64k. Each user asks 5 rounds of follow-up questions and then quits; a new user is then added back to the queue to send new requests. All benchmarks run continuously for over 3 hours to ensure there are no stability problems.
Test Infra Setup
Platform: AWS EKS with SageMaker Hyperpod
Model Server Instance: 40 g5.24xlarge instances (4 Nvidia A10G GPU w/ 96 GB GPU memory per instance)
Router Instance: c5.12xlarge (96 GB CPU memory per instance)
Invocation Path: Public ALB -> vLLM Production Stack Router -> vLLM Model Server
Model: Llama-3.1-8b-instruct
GPU Parallelization: 4 (thus each model is deployed on a dedicated g5 instance)
Router Replicas: 1 (No autoscaling is enabled)
Router Pod Memory Limit: 4GB
Eviction Configs
Max Router Memory: 1GB
HashTrie Eviction Threshold: 0.9 (start eviction when 900MB+ memory is consumed)
HashTrie Utilization Target: 0.5 (evicts 50% of the total nodes)
HashTrie Chunk Size: 128
Memory Usage before adding Eviction
Since HashTrie nodes are never evicted to free up memory, pod memory utilization keeps climbing until the memory allocated to the pod is exhausted; the pod then crashes and gets restarted.
(Graphs: Memory Usage, TTFT, ITL, Availability, Throughput Output Tokens/s, Summary)
Memory Usage after adding Eviction
After the eviction threshold is reached, nodes start being evicted down to the target utilization. As a result, memory usage stays stable instead of ramping up.
(Graphs: Memory Usage, TTFT, ITL, Availability, Throughput Output Tokens/s, Summary)
FIX #687
Detailed Checklist
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
[Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.
Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
Use pre-commit to format your code. See README.md for installation.
DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO. Using -s with git commit will automatically add this header.
What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.