feat: Priority-based scheduling optimization (including default priority, preemption toggle, priority-based metrics, etc.) #17026
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the existing priority-based scheduling system by addressing key functional and observability gaps. It provides a mechanism for assigning default priorities to requests, offers a configurable switch for preemption, and integrates priority-aware metrics. These changes aim to make the scheduler more robust, adaptable to diverse operational needs, and transparent in its performance characteristics.
Code Review
This pull request introduces several valuable optimizations for priority-based scheduling, including support for a default priority, a toggle for preemption, and priority-based metrics. The changes are well-structured and address the motivations outlined in the description.
My review focuses on improving code maintainability and performance by addressing code duplication. I've identified several instances where similar logic is repeated across different parts of the codebase. By refactoring these into helper methods, the code can be made cleaner, more efficient, and easier to maintain. Specifically, I've pointed out opportunities for this in scheduler.py, scheduler_metrics_mixin.py, and collector.py.
```python
if self.enable_priority_scheduling:
    num_running_reqs_by_priority.clear()
    for req in self.running_batch.reqs:
        num_running_reqs_by_priority[req.priority] += 1
```
This block for calculating num_running_reqs_by_priority is inefficient as it runs inside a loop over self.waiting_queue. It is also a duplicate of the logic at lines 1859-1861.
This repeated calculation can impact performance, especially with a large waiting queue. Furthermore, its current placement might not accurately reflect the final state of self.running_batch for logging, as it's updated on each iteration.
A better approach is to calculate num_running_reqs_by_priority only once, after the loop completes and just before log_prefill_stats is called. This ensures correctness and improves efficiency.
I recommend removing this block and the one at lines 1859-1861, and then adding the calculation logic right before the log_prefill_stats call around line 2003.
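The suggested restructuring can be sketched as follows. This is illustrative only, not the actual scheduler code: `Req` here is a stand-in dataclass, and in the real code the result would be computed from `self.running_batch.reqs` right before `log_prefill_stats`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Req:  # placeholder for the scheduler's request type
    priority: int


def count_reqs_by_priority(reqs):
    """Tally requests per priority level in a single pass."""
    counts = defaultdict(int)
    for req in reqs:
        counts[req.priority] += 1
    return dict(counts)


# Compute once after the admission loop completes, just before logging
# prefill stats, instead of recomputing on every waiting-queue iteration.
running = [Req(0), Req(1), Req(1), Req(2)]
num_running_reqs_by_priority = count_reqs_by_priority(running)
print(num_running_reqs_by_priority)  # {0: 1, 1: 2, 2: 1}
```

Moving the tally out of the loop makes it O(running batch) once, rather than O(waiting queue × running batch).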
```python
if self.enable_priority_scheduling:
    num_queue_reqs_by_priority: dict[int, int] = defaultdict(int)
    for req in self.waiting_queue:
        num_queue_reqs_by_priority[req.priority] += 1
    self.stats.num_queue_reqs_by_priority = num_queue_reqs_by_priority
```
The logic for counting requests by priority is repeated multiple times in this file for different queues (e.g., lines 257-265, 278-286, and also in log_decode_stats). This code duplication makes the code harder to maintain.
To improve this, you could extract the counting logic into a helper method. For example:
```python
from collections import defaultdict
from typing import Dict, Iterable


def _compute_reqs_by_priority(self, req_queue: Iterable[Req]) -> Dict[int, int]:
    """Computes the number of requests for each priority in a queue."""
    counts = defaultdict(int)
    for req in req_queue:
        counts[req.priority] += 1
    return counts
```

Then you can use it like this:

```python
if self.enable_priority_scheduling:
    self.stats.num_queue_reqs_by_priority = self._compute_reqs_by_priority(self.waiting_queue)
```

This would make the code cleaner and more maintainable.
```python
if stats.num_running_reqs_by_priority:
    for key, value in stats.num_running_reqs_by_priority.items():
        self._log_gauge(self.num_running_reqs, value, priority=key)
else:
    self._log_gauge(self.num_running_reqs, stats.num_running_reqs)
```
This if/else block for logging gauges based on whether per-priority data is available is repeated for multiple metrics (e.g., num_queue_reqs, num_prefill_prealloc_queue_reqs, etc.). This leads to code duplication and makes the log_stats method lengthy.
Consider refactoring this logic into a helper method to reduce duplication and improve readability. For example:
```python
def _log_gauge_by_priority(self, gauge, total_value: Union[int, float], by_priority_dict: Dict[int, int]):
    if by_priority_dict:
        for priority, value in by_priority_dict.items():
            self._log_gauge(gauge, value, priority=priority)
    else:
        self._log_gauge(gauge, total_value)
```

You could then simplify the calls in log_stats like this:

```python
self._log_gauge_by_priority(
    self.num_running_reqs,
    stats.num_running_reqs,
    stats.num_running_reqs_by_priority,
)
```
/tag-and-rerun-ci

/tag-run-ci-label
@harrisonlimh would appreciate it if you could take a look at this PR.

@harrisonlimh Thanks for reviewing! 🙏

Hi! Thank you so much for the contribution! I will make sure to review the PR in the next two days.
```python
prefill_max_requests: Optional[int] = None
schedule_policy: str = "fcfs"
enable_priority_scheduling: bool = False
enable_try_preemption_by_priority: bool = False
```
Preemption is enabled by default in priority scheduling, so this flag is redundant. Would you be able to update it to disable_try_preemption_by_priority and use it as a toggle to disable the behavior instead?
```python
self.try_preemption = (
    self.enable_priority_scheduling
    and not self.server_args.disable_try_preemption_by_priority
)
```
FYI - an alternative way of disabling preemption is to set priority_scheduling_preemption_threshold to a value greater than abs(highest_priority_int - lowest_priority_int). Sharing it in case the mentioned need to disable preemption is urgent for the business.
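As a numeric illustration of that workaround (a sketch assuming integer priorities; the real knob is the priority_scheduling_preemption_threshold server argument mentioned above):

```python
# Pick a threshold strictly larger than the widest possible priority gap,
# so no pair of requests can ever differ by enough to trigger preemption.
priorities = [-2, 0, 1, 5]  # example priority values in use (assumed)

max_gap = abs(max(priorities) - min(priorities))  # 5 - (-2) = 7
threshold = max_gap + 1  # any value > max_gap effectively disables preemption
print(threshold)  # 8
```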
The parameters have already been updated as required. We would still like to be able to explicitly disable preemption via a parameter, which makes the usage clearer and more explicit.
Thank you for updating it!
Would it be possible to add one or two helper functions that calculate per-priority request counts in scheduler.py and scheduler_metrics_mixin.py and consolidate the usage?
Perhaps one that (a) takes a list of requests and returns an xxx_reqs_by_priority dictionary, or (b) additionally takes the dictionary and updates it in place.
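Variant (b) might look like this. It is a sketch only: `Req` is a placeholder for the scheduler's actual request type, and the function name is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Req:  # placeholder for the scheduler's request type
    priority: int


def update_reqs_by_priority(reqs: Iterable[Req], counts: Dict[int, int]) -> None:
    """Variant (b): accumulate per-priority counts into an existing dict in place."""
    for req in reqs:
        counts[req.priority] = counts.get(req.priority, 0) + 1


# The in-place form lets one dict accumulate counts across several queues.
counts: Dict[int, int] = {}
update_reqs_by_priority([Req(1), Req(1), Req(3)], counts)
update_reqs_by_priority([Req(3)], counts)
print(counts)  # {1: 2, 3: 2}
```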
Thank you very much — I’ll work on fixing these issues.
# Conflicts:
#   python/sglang/srt/managers/scheduler.py
#   python/sglang/srt/managers/scheduler_metrics_mixin.py
#   python/sglang/srt/managers/tokenizer_manager.py
#   python/sglang/srt/metrics/collector.py
/run-failed-ci

LGTM! Thank you!

@hnyls2002 Thanks!
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/rerun-stage stage-c-test-8-gpu-h20

✅ Triggered

/rerun-ut test_priority_metrics.py

❌ No test file found matching
…ity, preemption toggle, priority-based metrics, etc.) (sgl-project#17026) Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Motivation
During the use of priority-based scheduling, we identified several issues:
The original priority scheduling logic does not support setting a default priority.
The preemption logic introduces certain risks in our specific business scenario, so we need a configurable toggle to enable or disable preemption.
We require metrics to support performance analysis based on priority-based scheduling.
Modifications
Building on the original priority-based scheduling, this PR implements several optimizations: support for setting a default priority, a toggle for the preemption logic, and priority-based metrics.

Benchmarking and Profiling
Queueing behavior under high and low priority
Starting from concurrency level 40, requests begin to queue, with the number of queued low-priority requests significantly higher than that of high-priority ones, which aligns with expectations.

Checklist