
Conversation

@can-sun commented Sep 22, 2025

Description

Currently the prefix-aware routing HashTrie lacks node eviction, so router pods crash once memory fills up with TrieNodes. During the pod restart period, users experience downtime and in-progress requests can be terminated mid-stream.

This PR adds an LRU to the HashTrie to evict nodes when the memory limit is hit, resolving the issue of the router being repeatedly restarted by Kubernetes due to insufficient memory. In addition, a HashTrieConfig is added to configure the parameters that affect performance, e.g. eviction threshold, max memory limit, and target utilization. These are configured via environment variables.

Nodes are added to the LRU when they are inserted into the Trie. When a node is revisited by another request, it is popped and re-appended to the tail of the LRU. This approach evicts child nodes first, since child nodes are always accessed less frequently than their parents.
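A minimal sketch of the touch/evict bookkeeping described above, assuming an `OrderedDict`-based cache keyed by `TrieNode`. The class and method names (`LRUNodeCache`, `touch`, etc.) are illustrative, not the PR's exact implementation:

```python
from collections import OrderedDict


class TrieNode:
    """Simplified trie node; the real node also carries more state."""
    __slots__ = ("children", "endpoints")

    def __init__(self):
        self.children = {}
        self.endpoints = set()


class LRUNodeCache:
    """Most-recently-used nodes live at the tail; eviction pops from the head."""

    def __init__(self):
        self.node_cache: "OrderedDict[TrieNode, None]" = OrderedDict()

    def add(self, node: TrieNode) -> None:
        # New nodes enter at the tail (most recently used).
        self.node_cache[node] = None

    def touch(self, node: TrieNode) -> None:
        # A revisited node is moved back to the tail.
        if node in self.node_cache:
            self.node_cache.move_to_end(node)

    def evict(self, count: int) -> list:
        # Oldest entries sit at the head; children are touched less often
        # than their parents, so they tend to be evicted first.
        evicted = []
        for _ in range(min(count, len(self.node_cache))):
            node, _ = self.node_cache.popitem(last=False)
            evicted.append(node)
        return evicted
```

In the actual Trie, evicted nodes would also need to be detached from their parent's `children` map so they can be garbage-collected.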

Benchmark Tool

We did the benchmarking by referencing the LMCache tooling, namely the vLLM production-stack multi-round QA script.

We also applied some local changes so that the benchmark can run continuously, allowing us to test the router's stability. Stats like TTFT, success/failure counts, and throughput are emitted to AWS cloud metrics for better visibility.

Benchmark Configs

The benchmark runs with 100 concurrent users continuously performing multi-turn chat. The input system prompt context length is randomly selected in the 8k-64k range. Each user asks 5 rounds of follow-up questions and then quits, after which a new user is added to the queue to send new requests. All benchmarks run continuously for over 3 hours to verify there are no stability problems.
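At a high level the driver looks roughly like the illustrative asyncio skeleton below. This is not the LMCache/production-stack script itself; the router URL, model name, prompt generator, and metric emission are placeholders:

```python
import asyncio
import random
import time

from openai import AsyncOpenAI  # any OpenAI-compatible client can target the router

ROUTER_URL = "http://<router-endpoint>/v1"  # placeholder, not a real endpoint
MODEL = "Llama-3.1-8b-instruct"             # served model name; adjust to the deployment
ROUNDS_PER_USER = 5
CONCURRENT_USERS = 100

client = AsyncOpenAI(base_url=ROUTER_URL, api_key="EMPTY")


async def run_user(user_id: int) -> None:
    # Stand-in for the 8k-64k-token random system prompt used in the benchmark.
    system_prompt = "lorem " * random.randint(8_000, 64_000)
    messages = [{"role": "system", "content": system_prompt}]
    for turn in range(ROUNDS_PER_USER):
        messages.append({"role": "user", "content": f"Follow-up question {turn}"})
        start = time.perf_counter()
        ttft = None
        pieces = []
        stream = await client.chat.completions.create(
            model=MODEL, messages=messages, stream=True, max_tokens=256
        )
        async for chunk in stream:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            pieces.append(chunk.choices[0].delta.content or "")
        messages.append({"role": "assistant", "content": "".join(pieces)})
        # The real harness emits TTFT/ITL/throughput/success metrics here.


async def main() -> None:
    # Keep CONCURRENT_USERS sessions alive; replace each finished user with a new one.
    pending = {asyncio.create_task(run_user(i)) for i in range(CONCURRENT_USERS)}
    next_id = CONCURRENT_USERS
    while True:  # the benchmark runs continuously (3+ hours in our tests)
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is not None:
                pass  # the real harness records this as a failed session
            pending.add(asyncio.create_task(run_user(next_id)))
            next_id += 1


if __name__ == "__main__":
    asyncio.run(main())
```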

Test Infra Setup

Platform: AWS EKS with SageMaker Hyperpod
Model Server Instance: 40 g5.24xlarge instances (4 NVIDIA A10G GPUs with 96 GB total GPU memory per instance)
Router Instance: c5.12xlarge (96 GB CPU memory per instance)
Invocation Path: Public ALB -> vLLM Production Stack Router -> vLLM Model Server
Model: Llama-3.1-8b-instruct
GPU Parallelization: 4 (thus each model replica is deployed on a dedicated g5 instance)
Router Replicas: 1 (no autoscaling enabled)
Router Pod Memory Limit: 4GB

Eviction Configs

Max Router Memory: 1GB
HashTrie Eviction Threshold: 0.9 (start eviction when 900MB+ memory is consumed)
HashTrie Utilization Target: 0.5 (evict down to 50% of total nodes)
HashTrie Chunk Size: 128
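For reference, these settings map onto the environment variables read by `HashTrieConfig.from_env` (variable names taken from the snippet quoted in the review below). A minimal sketch of setting them programmatically, e.g. in a test:

```python
import os

# Eviction settings used for this benchmark run, expressed via the environment
# variables that HashTrieConfig.from_env() reads (see the review discussion below).
os.environ["PREFIXAWARE_MAX_MEMORY_SIZE_GB"] = "1"  # Max Router Memory: 1GB
os.environ["HASHTRIE_EVICTION_THRESHOLD"] = "0.9"   # start evicting above ~900MB
os.environ["HASHTRIE_TARGET_UTILIZATION"] = "0.5"   # evict down to 50% of nodes
os.environ["HASHTRIE_CHUNK_SIZE"] = "128"
```

In a Kubernetes deployment these would typically be set in the router container's `env` block instead.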

Memory Usage before adding Eviction

Since HashTrie nodes are never evicted to free up memory, pod memory utilization keeps climbing until the memory allocated to the pod is exhausted; the pod then crashes and is restarted.

Memory Usage (chart)

TTFT (chart)

ITL (chart)

Availability (chart)

Throughput Output Tokens/s (chart)

Summary

| Metric Name | Stats | Value |
|---|---|---|
| Availability | Average | 96.2% |
| Throughput Tokens/s | Average | 153 |
| TTFT | Average | 5.07s |
| TTFT | P50 | 0.29s |
| TTFT | P90 | 21.58s |
| TTFT | P99 | 48.3s |
| ITL | Average | 0.05s |

Memory Usage after adding Eviction

After the eviction threshold is reached, nodes are evicted down to the target utilization. As a result, memory usage stays stable and no longer ramps up.

Memory Usage (chart)

TTFT (chart)

ITL (chart)

Availability (chart)

Throughput Output Tokens/s (chart)

Summary

| Metric Name | Stats | Value |
|---|---|---|
| Availability | Average | 98.2% |
| Throughput Tokens/s | Average | 146 |
| TTFT | Average | 6.06s |
| TTFT | P50 | 0.32s |
| TTFT | P90 | 23.37s |
| TTFT | P99 | 45.79s |
| ITL | Average | 0.05s |

FIX #687

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit by using -s when doing git commit
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @can-sun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial memory management enhancement to the PrefixAwareRouter by integrating an LRU eviction mechanism into its HashTrie data structure. This change directly addresses the problem of unbounded memory growth, which previously led to router pod crashes and service disruptions. By allowing the HashTrie to intelligently evict less frequently used nodes based on configurable memory thresholds, the router can now maintain stable memory consumption, significantly improving its resilience and preventing OOM-induced restarts.

Highlights

  • LRU Eviction for HashTrie: Implemented a Least Recently Used (LRU) eviction mechanism within the HashTrie to prevent Out-Of-Memory (OOM) errors in the router, ensuring stable memory usage.
  • Configurable Parameters: Introduced HashTrieConfig to allow configuration of HashTrie behavior, including chunk_size, max_memory_size, eviction_threshold, target_utilization, and memory_check_request_batch_size via environment variables.
  • Child Node Prioritization: The LRU eviction strategy is designed to evict child nodes before parent nodes, minimizing impact on frequently accessed prefixes and maintaining routing efficiency.
  • New Test Coverage: Added test_hashtrie_eviction to src/tests/test_prefixaware_router.py to validate the new eviction functionality and ensure its correctness.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a crucial feature: an LRU eviction mechanism for the HashTrie to prevent out-of-memory issues. The implementation is well-structured, using an OrderedDict for the LRU cache, a HashTrieConfig for configuration, and psutil for memory monitoring. The logic to periodically check memory and evict nodes when a threshold is reached is a solid approach to managing memory growth. My review includes a few suggestions to improve robustness, configuration clarity, and memory efficiency during the eviction process. Overall, this is a great addition that will significantly improve the router's stability.
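For readers unfamiliar with the mechanism: the periodic check described here amounts to comparing the router process's resident memory against the configured threshold. A minimal sketch using psutil (standard psutil calls; the function name, `memory_threshold_mb`, and the logging style are illustrative, not the PR's exact code):

```python
import logging

import psutil

logger = logging.getLogger(__name__)

BYTES_PER_MB = 1024 * 1024


def should_evict(memory_threshold_mb: float) -> bool:
    """Return True when the current process RSS exceeds the eviction threshold."""
    try:
        rss_mb = psutil.Process().memory_info().rss / BYTES_PER_MB
    except Exception as exc:  # never let monitoring failures crash request handling
        logger.warning("Memory check failed: %s", exc)
        return False
    if rss_mb > memory_threshold_mb:
        logger.warning("Memory %.1fMB exceeds threshold %.1fMB", rss_mb, memory_threshold_mb)
        return True
    return False
```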

Comment on lines 145 to 146

```python
except:
    pass
```

critical

Using a bare except: pass is dangerous as it silently swallows all exceptions, including potentially critical ones like psutil errors or even KeyboardInterrupt. This could cause the eviction mechanism to fail silently, defeating the purpose of this PR and re-introducing the risk of OOM crashes.

It's crucial to catch specific exceptions or, at a minimum, Exception as e and log the error to ensure that any failures in the memory checking or eviction process are visible for debugging.

Suggested change:

```diff
-except:
-    pass
+except Exception as e:
+    logger.warning(f"Failed to check memory for eviction: {e}")
```

Comment on lines +24 to +45
```python
    chunk_size: int = 0
    max_memory_size: int = 0  # GB
    eviction_threshold: float = 0  # start evicting at
    target_utilization: float = 0  # evict down to
    memory_check_request_batch_size: int = 0

    @staticmethod
    def from_defaults(
        chunk_size: int = 128,
        max_memory_size: int = 2,
        eviction_threshold: float = 0.9,
        target_utilization: float = 0.5,
        memory_check_request_batch_size: int = 10,
    ) -> "HashTrieConfig":
        """Create configuration with default values"""
        return HashTrieConfig(
            chunk_size,
            max_memory_size,
            eviction_threshold,
            target_utilization,
            memory_check_request_batch_size,
        )
```

medium

The current implementation of HashTrieConfig has default values of 0 in the dataclass definition, which can be misleading and lead to a non-functional configuration if instantiated with HashTrieConfig(). The actual default values are in a separate from_defaults static method.

To improve clarity and usability, I suggest moving the default values directly into the dataclass attributes and removing the from_defaults method. This makes the default configuration explicit and simplifies instantiation.

You will need to update the call sites:

  • HashTrieConfig.from_defaults() becomes HashTrieConfig()
  • HashTrieConfig.from_defaults(...) becomes HashTrieConfig(...)
```python
@dataclass
class HashTrieConfig:
    """Configuration for HashTrie with LRU eviction"""

    chunk_size: int = 128
    max_memory_size: int = 2  # GB
    eviction_threshold: float = 0.9  # start evicting at
    target_utilization: float = 0.5  # evict down to
    memory_check_request_batch_size: int = 10

    @staticmethod
    def from_env() -> "HashTrieConfig":
        """Load configuration from environment variables"""
        return HashTrieConfig(
            chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", "128")),
            max_memory_size=int(os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", "2")),
            eviction_threshold=float(os.getenv("HASHTRIE_EVICTION_THRESHOLD", "0.9")),
            target_utilization=float(os.getenv("HASHTRIE_TARGET_UTILIZATION", "0.5")),
            memory_check_request_batch_size=int(
                os.getenv("MEMORY_CHECK_REQUEST_BATCH_SIZE", "10")
            ),
        )
```

Comment on lines +189 to +191
```python
evict_nodes = []
for node in list(self.node_cache.keys())[:target_evictions]:
    evict_nodes.append(node)
```

medium

The current implementation for collecting nodes to evict can be inefficient. list(self.node_cache.keys()) creates a copy of all keys in memory. When memory is already under pressure (the very reason this function is called), creating a potentially large list could exacerbate the problem and is also redundant with the subsequent loop.

A more memory-efficient and concise approach is to use itertools.islice to create a list containing only the keys of the nodes that need to be evicted.

```python
import itertools

evict_nodes = list(itertools.islice(self.node_cache, target_evictions))
```

@YuhanLiu11 (Collaborator) commented

@can-sun Thanks for the fix! Could you fix the pre-commit error and show an easy way for us to see the effect of your fix?

@can-sun (Author) commented Sep 23, 2025

> @can-sun Thanks for the fix! Could you fix the pre-commit error and show an easy way for us to see the effect of your fix?

Thanks Yuhan, let me double-check the pre-commit failures tomorrow. I am fairly sure I ran the pre-commit check before sending out the PR. I will also pull together some metrics tomorrow to demonstrate the effect.

@YuhanLiu11 (Collaborator) commented

@can-sun Thanks for the fix. BTW, I realize that if we evict nodes from the HashTrie, the vLLM-side latency of requests may be affected, since the information about where previous requests went is lost. We will have to run some systematic benchmarks to test the e2e performance of your PR.
cc @zerofishnoodles

@can-sun (Author) commented Sep 24, 2025

> @can-sun Thanks for the fix. BTW, I realize that if we evict nodes from the HashTrie, the vLLM-side latency of requests may be affected, since the information about where previous requests went is lost. We will have to run some systematic benchmarks to test the e2e performance of your PR. cc @zerofishnoodles

You are definitely right. We also did some benchmarking locally and evaluated the performance impact. Let me share that with you today.

@jiapinw commented Sep 24, 2025

> the vLLM-side latency of requests may be affected, since the information about where previous requests went is lost

I think this is intentional for this PR; it's a balance between availability and latency, in my opinion. There won't be any performance regression until we hit the configured memory limit. If I am reading the PR correctly, node eviction only happens when we reach that limit, i.e. if we trim a root node, it frees up quite a bit of space and gives an extended period during which eviction is no longer needed. We do need to be careful about how we set the defaults, though; one improvement could be deriving the default from the instance type or other environment variables, for instance the CPU limits configured during a Kubernetes deployment (see the sketch below).
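A variation of that idea, using the container's memory limit rather than CPU limits, might look like the hedged sketch below. The cgroup paths are the standard Linux locations Kubernetes writes limits to, while the function name and the 50% fraction are purely illustrative:

```python
_GB = 1024 ** 3


def default_max_memory_gb(fraction: float = 0.5, fallback_gb: int = 2) -> int:
    """Derive a default HashTrie memory budget from the container's memory limit."""
    # Try cgroup v2 first, then cgroup v1.
    for path in ("/sys/fs/cgroup/memory.max",
                 "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
        try:
            with open(path) as f:
                raw = f.read().strip()
            if raw == "max":  # cgroup v2 reports "max" when no limit is set
                break
            # Note: cgroup v1 reports a huge sentinel value when unlimited;
            # a real implementation should sanity-check this.
            return max(1, int(int(raw) * fraction) // _GB)
        except (OSError, ValueError):
            continue
    return fallback_gb
```

`HashTrieConfig.from_env` could fall back to such a value when `PREFIXAWARE_MAX_MEMORY_SIZE_GB` is not set.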

@can-sun (Author) commented Sep 24, 2025

Done, I added some data collected during benchmarking. Hope it helps. Feel free to take the code and run your own local performance evaluations. Thanks!

@can-sun (Author) commented Sep 25, 2025

Hey @YuhanLiu11, do you know the next steps for this change? We are currently waiting for it to land before putting the router into use.

@can-sun (Author) commented Sep 25, 2025

Double-checked: the E2E test failed because of `ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device`.

@YuhanLiu11 (Collaborator) commented

hey @can-sun thanks for adding the benchmarking results!

@zerofishnoodles can you check why this CI fails due to space errors? Is our CI server full?

@zerofishnoodles (Collaborator) commented

This is weird. We just cleaned the space not long ago. I will take a look today.

@zerofishnoodles (Collaborator) commented Sep 26, 2025

That one is the GitHub-hosted machine; their machine is out of space. :(
I think on our side we can either modify the CI to reduce the dependency size or move it to our self-hosted machine.

"""Load configuration from environment variables"""
return HashTrieConfig(
chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", "128")),
max_memory_size=int(os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", "2")),


Read default values from constants instead of hardcoding them here (see the sketch below).
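One possible shape for this, with module-level default constants shared by both the dataclass fields and `from_env` (the constant names are illustrative, not from the PR, and the config is trimmed to a few fields):

```python
import os
from dataclasses import dataclass

# Single source of truth for defaults, shared by the dataclass and from_env().
DEFAULT_CHUNK_SIZE = 128
DEFAULT_MAX_MEMORY_SIZE_GB = 2
DEFAULT_EVICTION_THRESHOLD = 0.9
DEFAULT_TARGET_UTILIZATION = 0.5


@dataclass
class HashTrieConfig:
    chunk_size: int = DEFAULT_CHUNK_SIZE
    max_memory_size: int = DEFAULT_MAX_MEMORY_SIZE_GB  # GB
    eviction_threshold: float = DEFAULT_EVICTION_THRESHOLD
    target_utilization: float = DEFAULT_TARGET_UTILIZATION

    @staticmethod
    def from_env() -> "HashTrieConfig":
        return HashTrieConfig(
            chunk_size=int(os.getenv("HASHTRIE_CHUNK_SIZE", str(DEFAULT_CHUNK_SIZE))),
            max_memory_size=int(
                os.getenv("PREFIXAWARE_MAX_MEMORY_SIZE_GB", str(DEFAULT_MAX_MEMORY_SIZE_GB))
            ),
            eviction_threshold=float(
                os.getenv("HASHTRIE_EVICTION_THRESHOLD", str(DEFAULT_EVICTION_THRESHOLD))
            ),
            target_utilization=float(
                os.getenv("HASHTRIE_TARGET_UTILIZATION", str(DEFAULT_TARGET_UTILIZATION))
            ),
        )
```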



```python
class NodeMetadata:
    """Metadata for nodes tracked in LRU cache"""
```


You need to make NodeMetadata immutable for nodes tracked in the LRU cache. Please add @dataclass(frozen=True).
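For example (the fields shown here are hypothetical placeholders; the actual fields are defined in the PR's diff):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeMetadata:
    """Metadata for nodes tracked in the LRU cache; frozen makes entries immutable."""

    # Hypothetical fields for illustration only.
    size_bytes: int = 0
    created_at: float = 0.0
```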

```diff
-        self.children = {}
-        self.endpoints = set()
+        self.children: Dict[int, "TrieNode"] = {}
+        self.endpoints: Set[str] = set()
```


Suggest adding is_leaf and other operation methods, something like below:

```python
    @property
    def is_leaf(self) -> bool:
        """Check if the node is a leaf node."""
        return not bool(self.children)

    async def add_child(self, hash_value: int, node: "TrieNode") -> None:
        """Add a child node."""
        async with self.lock:
            self.children[hash_value] = node

    async def remove_child(self, hash_value: int) -> None:
        """Remove a child node."""
        async with self.lock:
            self.children.pop(hash_value, None)

    async def add_endpoint(self, endpoint: str) -> None:
        """Add an endpoint to the node."""
        async with self.lock:
            self.endpoints.add(endpoint)

    def __repr__(self) -> str:
        return f"TrieNode(children={len(self.children)}, endpoints={len(self.endpoints)})"
```

```python
            self.config.max_memory_size * 1024 * self.config.eviction_threshold
        )
        # eviction percentage
        self.eviction_target_percentage = 1.0 - self.config.target_utilization
```


Please move all these constant values to be read from environment variables or default constants. Also see whether it makes sense to introduce a HashTrieStats, something like:

```python
@dataclass
class HashTrieStats:
    current_memory_mb: float = 0.0
    operation_count: int = 0


@dataclass
class HashTrie:
    config: HashTrieConfig = field(default_factory=HashTrieConfig.from_defaults)
    root: TrieNode = field(default_factory=TrieNode)
    node_cache: OrderedDict[TrieNode, NodeMetadata] = field(default_factory=OrderedDict)
    stats: HashTrieStats = field(default_factory=HashTrieStats)
    eviction_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    def __post_init__(self):
        self.memory_threshold_mb = (
            self.config.max_memory_size * BYTES_TO_MB * self.config.eviction_threshold
        )
        self.eviction_target_percentage = 1.0 - self.config.target_utilization

    @property
    def current_memory_mb(self) -> float:
        return self.stats.current_memory_mb

    @current_memory_mb.setter
    def current_memory_mb(self, value: float) -> None:
        self.stats.current_memory_mb = value

    @property
    def operation_count(self) -> int:
        return self.stats.operation_count

    @operation_count.setter
    def operation_count(self, value: int) -> None:
        self.stats.operation_count = value

    def __repr__(self) -> str:
        return (f"HashTrie(nodes={len(self.node_cache)}, "
                f"memory={self.current_memory_mb:.2f}MB)")
```


```python
# Check if we should evict
if self.operation_count == 0:
    try:
```


Can you please move the eviction logic into a separate method on HashTrie, something like:

```python
class HashTrie:
    async def _check_and_perform_eviction(self) -> None:
        """Check memory usage and perform eviction if necessary."""
        try:
            stats = await MemoryMonitor.get_memory_stats(self.memory_threshold_mb)

            if abs(stats.current_mb - self.current_memory_mb) < 0.1:
                logger.debug(f"Eviction skipped - minimal memory change: {stats.current_mb:.1f}MB")
                return

            self.current_memory_mb = stats.current_mb

            if stats.current_mb > stats.threshold_mb:
                logger.warning(f"Memory threshold exceeded: {stats.current_mb:.1f}MB > {stats.threshold_mb:.1f}MB")
                await self._perform_eviction()
        except Exception as e:
            # Add exception handling
            ...

    async def _perform_eviction(self) -> None:
        """Handle the actual eviction process with retry logic."""
        max_retries = self.config.eviction_max_retries
        for retry in range(max_retries):
            try:
                await self.batch_evict(self.eviction_target_percentage)
                return
            except Exception as e:
                # Add exception handling
                ...

        logger.error("Max eviction retries reached")

    async def _handle_monitoring_failure(self) -> None:
        """Handle cases where memory monitoring fails."""
```


Development

Successfully merging this pull request may close these issues.

bug: router crashes because of OOM and prefix cache HashTrie memory usage is unbounded
