
feat: Priority-based scheduling optimization (including default priority, preemption toggle, priority-based metrics, etc.)#17026

Merged
hnyls2002 merged 51 commits into sgl-project:main from zhuxinjie-nz:priority_metrics_dev
Mar 4, 2026

Conversation

@zhuxinjie-nz (Contributor) commented Jan 13, 2026

Motivation

During the use of priority-based scheduling, we identified several issues:

  1. The original priority scheduling logic does not support setting a default priority.

  2. The preemption logic introduces certain risks in our specific business scenario, so we need a configurable toggle to enable or disable preemption.

  3. We require metrics to support performance analysis based on priority-based scheduling.

Modifications

Building on the original priority-based scheduling, this PR implements several optimizations: support for setting a default priority, a toggle for the preemption logic, and priority-based metrics.
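As a rough illustration of the three behaviors described above (not the actual diff), the wiring might look like the sketch below. The flag `disable_try_preemption_by_priority` comes up later in review; `default_priority` and the class shapes here are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical server arguments; the real names and defaults in the PR may differ.
@dataclass
class ServerArgs:
    enable_priority_scheduling: bool = False
    disable_try_preemption_by_priority: bool = False  # preemption toggle
    default_priority: int = 0  # used when a request omits `priority`

@dataclass
class Req:
    rid: str
    priority: Optional[int] = None

def assign_priority(req: Req, args: ServerArgs) -> Req:
    """Fill in the default priority for requests that did not set one."""
    if args.enable_priority_scheduling and req.priority is None:
        req.priority = args.default_priority
    return req

def try_preemption_enabled(args: ServerArgs) -> bool:
    """Preemption applies only when priority scheduling is on and not disabled."""
    return args.enable_priority_scheduling and not args.disable_try_preemption_by_priority
```

With `enable_priority_scheduling=True` and `default_priority=5`, a request created without a priority would be scheduled at priority 5, while explicitly set priorities pass through unchanged.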

Benchmarking and Profiling


Queueing behavior under high and low priority

Starting from concurrency level 40, requests begin to queue, with the number of queued low-priority requests significantly higher than that of high-priority ones, which aligns with expectations.


@gemini-code-assist (Contributor):
Summary of Changes

Hello @zhuxinjie-nz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the existing priority-based scheduling system by addressing key functional and observability gaps. It provides mechanisms for assigning default priorities to requests, offers a configurable switch for preemption, and integrates comprehensive priority-aware metrics. These changes aim to make the scheduler more robust, adaptable to diverse operational needs, and transparent in its performance characteristics.

Highlights

  • Default Priority Support: Introduced the ability to set a default priority for requests that do not explicitly specify one, improving flexibility in scheduling.
  • Configurable Preemption Toggle: Added a new configuration option to enable or disable preemption logic based on priority, allowing for better control over scheduling behavior in different business scenarios.
  • Priority-Based Metrics: Implemented new metrics to track request counts by priority across various queues (running, waiting, prefill, decode), enabling detailed performance analysis and monitoring of priority scheduling effectiveness.


@gemini-code-assist Bot left a comment:


Code Review

This pull request introduces several valuable optimizations for priority-based scheduling, including support for a default priority, a toggle for preemption, and priority-based metrics. The changes are well-structured and address the motivations outlined in the description.

My review focuses on improving code maintainability and performance by addressing areas of code duplication. I've identified several instances where similar logic is repeated across different parts of the codebase. By refactoring these into helper methods, the code can be made cleaner, more efficient, and easier to maintain in the future. Specifically, I've pointed out opportunities for this in scheduler.py, scheduler_metrics_mixin.py, and collector.py.

Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment on lines +1934 to +1937
if self.enable_priority_scheduling:
    num_running_reqs_by_priority.clear()
    for req in self.running_batch.reqs:
        num_running_reqs_by_priority[req.priority] += 1

Severity: high

This block for calculating num_running_reqs_by_priority is inefficient as it runs inside a loop over self.waiting_queue. It is also a duplicate of the logic at lines 1859-1861.

This repeated calculation can impact performance, especially with a large waiting queue. Furthermore, its current placement might not accurately reflect the final state of self.running_batch for logging, as it's updated on each iteration.

A better approach is to calculate num_running_reqs_by_priority only once, after the loop completes and just before log_prefill_stats is called. This ensures correctness and improves efficiency.

I recommend removing this block and the one at lines 1859-1861, and then adding the calculation logic right before the log_prefill_stats call around line 2003.

Comment on lines +234 to +238
if self.enable_priority_scheduling:
    num_queue_reqs_by_priority: dict[int, int] = defaultdict(int)
    for req in self.waiting_queue:
        num_queue_reqs_by_priority[req.priority] += 1
    self.stats.num_queue_reqs_by_priority = num_queue_reqs_by_priority

Severity: medium

The logic for counting requests by priority is repeated multiple times in this file for different queues (e.g., lines 257-265, 278-286, and also in log_decode_stats). This code duplication makes the code harder to maintain.

To improve this, you could extract the counting logic into a helper method. For example:

from collections import defaultdict
from typing import Iterable, Dict

def _compute_reqs_by_priority(self, req_queue: Iterable[Req]) -> Dict[int, int]:
    """Computes the number of requests for each priority in a queue."""
    counts = defaultdict(int)
    for req in req_queue:
        counts[req.priority] += 1
    return counts

Then you can use it like this:

if self.enable_priority_scheduling:
    self.stats.num_queue_reqs_by_priority = self._compute_reqs_by_priority(self.waiting_queue)

This would make the code cleaner and more maintainable.

Comment thread python/sglang/srt/metrics/collector.py Outdated
Comment on lines +958 to +962
if stats.num_running_reqs_by_priority:
    for key, value in stats.num_running_reqs_by_priority.items():
        self._log_gauge(self.num_running_reqs, value, priority=key)
else:
    self._log_gauge(self.num_running_reqs, stats.num_running_reqs)

Severity: medium

This if/else block for logging gauges based on whether per-priority data is available is repeated for multiple metrics (e.g., num_queue_reqs, num_prefill_prealloc_queue_reqs, etc.). This leads to code duplication and makes the log_stats method lengthy.

Consider refactoring this logic into a helper method to reduce duplication and improve readability. For example:

def _log_gauge_by_priority(self, gauge, total_value: Union[int, float], by_priority_dict: Dict[int, int]):
    if by_priority_dict:
        for priority, value in by_priority_dict.items():
            self._log_gauge(gauge, value, priority=priority)
    else:
        self._log_gauge(gauge, total_value)

You could then simplify the calls in log_stats like this:

self._log_gauge_by_priority(
    self.num_running_reqs,
    stats.num_running_reqs,
    stats.num_running_reqs_by_priority
)

@stmatengss (Collaborator):

/tag-and-rerun-ci

@huangtingwei9988 (Collaborator):

@hnyls2002 @xiezhq-hermann PTAL

@huangtingwei9988 (Collaborator):

/tag-run-ci-label

@xiezhq-hermann (Collaborator):

@harrisonlimh would appreciate if you can take a look at this PR.

@zhuxinjie-nz (Contributor, Author):

@harrisonlimh would appreciate if you can take a look at this PR.

@harrisonlimh Thanks for reviewing! 🙏

@harrisonlimh (Collaborator):

Hi! Thank you so much for the contribution! I will make sure to review the PR in the next two days.

Comment thread python/sglang/srt/server_args.py Outdated
prefill_max_requests: Optional[int] = None
schedule_policy: str = "fcfs"
enable_priority_scheduling: bool = False
enable_try_preemption_by_priority: bool = False
Collaborator:

Preemption is enabled by default in priority scheduling, so this flag is redundant. Would you be able to update it to disable_try_preemption_by_priority and use it as a toggle to disable the behavior instead?

        self.try_preemption = (
            self.enable_priority_scheduling
            and not self.server_args.disable_try_preemption_by_priority
        )

FYI - an alternative way of disabling preemption is to set priority_scheduling_preemption_threshold to a value greater than abs(highest_priority_int - lowest_priority_int). Sharing it in case the mentioned need to disable preemption is urgent for the business.
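As a rough sketch of that threshold-based alternative (assuming larger integers mean higher priority and that preemption fires only when the priority gap reaches the threshold; the real scheduler's semantics may be inverted, and `should_preempt` is a hypothetical name):

```python
def should_preempt(incoming_priority: int, running_priority: int, threshold: int) -> bool:
    """Hypothetical preemption check: preempt only when the incoming request
    outranks the running one by at least `threshold`."""
    return (incoming_priority - running_priority) >= threshold

# With priorities in [0, 10], a threshold of 11 can never be met,
# effectively disabling preemption without a dedicated flag.
```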

Contributor (Author):

The parameters have already been updated as required. We would still like to be able to explicitly disable preemption via a parameter, which makes usage clearer and more explicit.

Collaborator:

Thank you for updating it!

Collaborator:

Would it be possible to make one or two helper functions that calculate per-priority request counts in scheduler.py and scheduler_metrics_mixin.py and consolidate the usage?

Perhaps one that a) takes the list of requests and returns an xxx_reqs_by_priority dictionary, or b) additionally takes the dictionary and updates it in place.

Contributor (Author):

Thank you very much — I’ll work on fixing these issues.

nanzhi.zxj added 3 commits February 4, 2026 15:06
# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/managers/scheduler_metrics_mixin.py
#	python/sglang/srt/managers/tokenizer_manager.py
#	python/sglang/srt/metrics/collector.py
@stmatengss (Collaborator):

/run-failed-ci

@harrisonlimh (Collaborator):

LGTM! Thank you!

@hnyls2002 hnyls2002 requested a review from ClawSeven as a code owner March 4, 2026 08:34
@stmatengss (Collaborator):

@hnyls2002 Thanks!

1 similar comment
@huangtingwei9988 (Collaborator):

@hnyls2002 Thanks!

@hnyls2002 (Collaborator):

/rerun-stage stage-c-test-8-gpu-h20

github-actions Bot commented Mar 4, 2026

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@hnyls2002 (Collaborator):

/rerun-ut test_priority_metrics.py

github-actions Bot commented Mar 4, 2026

🔗 View workflow run

github-actions Bot commented Mar 4, 2026

❌ No test file found matching test_priority_metrics.py under test/registered/.

@hnyls2002 hnyls2002 merged commit 28c931e into sgl-project:main Mar 4, 2026
53 of 97 checks passed
qeternity pushed a commit to qeternity/sglang that referenced this pull request Mar 6, 2026
…ity, preemption toggle, priority-based metrics, etc.) (sgl-project#17026)

Co-authored-by: hnyls2002 <lsyincs@gmail.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Mar 6, 2026
…ity, preemption toggle, priority-based metrics, etc.) (sgl-project#17026)

Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…ity, preemption toggle, priority-based metrics, etc.) (sgl-project#17026)

Co-authored-by: hnyls2002 <lsyincs@gmail.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…ity, preemption toggle, priority-based metrics, etc.) (sgl-project#17026)

Co-authored-by: hnyls2002 <lsyincs@gmail.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…ity, preemption toggle, priority-based metrics, etc.) (sgl-project#17026)

Co-authored-by: hnyls2002 <lsyincs@gmail.com>
@hnyls2002 hnyls2002 mentioned this pull request Apr 28, 2026

Labels

amd · documentation · run-ci
