Skip to content

[Core][Scheduler] Fix FCFS queue ordering for skipped waiting requests#33522

Draft
harsh543 wants to merge 3 commits intovllm-project:mainfrom
harsh543:fix-scheduler-queue-order
Draft

[Core][Scheduler] Fix FCFS queue ordering for skipped waiting requests#33522
harsh543 wants to merge 3 commits intovllm-project:mainfrom
harsh543:fix-scheduler-queue-order

Conversation

@harsh543
Copy link
Copy Markdown

@harsh543 harsh543 commented Feb 1, 2026

Purpose

Fix FCFS (First-Come-First-Served) queue ordering violation in the v1 scheduler where skipped waiting requests jump ahead of other requests, causing increased tail latency under resource pressure.

Motivation: The vLLM v1 scheduler maintains a waiting queue following FCFS policy. When requests cannot be immediately scheduled (due to resource constraints), they should maintain their relative arrival order. However, the current implementation violates this by prepending skipped requests to the front of the queue.

Without this fix:

  • Requests that arrive earlier get pushed backward when later requests are skipped
  • Under sustained load with resource pressure, some requests experience unbounded delays
  • Tail latency (p99/p100) increases significantly as queue ordering becomes chaotic
  • Fairness guarantees are violated — a request's wait time depends on unrelated scheduling decisions

Example of the bug:

Initial waiting: [Req1, Req2, Req3, Req4]  (Req1 arrived first)
                      ↓ (Req2 can't schedule due to KV blocks)
After scheduling: [Req2, Req1, Req3, Req4]  ❌ Req2 jumped ahead of Req1!

With sustained pressure, Req1 keeps getting pushed back indefinitely.

With this fix:

Initial waiting: [Req1, Req2, Req3, Req4]
                      ↓ (Req2 can't schedule)
After scheduling: [Req1, Req3, Req4, Req2]  ✅ Order preserved, Req2 goes to tail
Scenario Before (Bug) After (Fix)
Skipped request placement Front of queue Tail of queue
FCFS ordering Violated under pressure Preserved
Tail latency (p99/p100) Increased unpredictably Stable/reduced
Request starvation risk Possible Eliminated

Root Cause

The schedule() function uses prepend_request() (front insertion) instead of add_request() (tail insertion) when returning skipped requests to the waiting queue:

# WRONG: Inserts at FRONT, violating FCFS
skipped_waiting_requests.prepend_request(request)
...
self.waiting.prepend_requests(skipped_waiting_requests)

This affects all skip scenarios:

  • Waiting for remote KV blocks (WAITING_FOR_REMOTE_KVS)
  • Waiting for FSM compilation (WAITING_FOR_FSM)
  • Waiting for streaming input (WAITING_FOR_STREAMING_REQ)
  • Max LoRA limit exceeded
  • KV cache allocation failure
  • Async KV loading

Changes

Replace all prepend_request() calls with add_request() for skipped waiting requests:

# CORRECT: Appends to TAIL, preserving FCFS
skipped_waiting_requests.add_request(request)
...
for request in skipped_waiting_requests:
    self.waiting.add_request(request)

Files changed:

  • vllm/v1/core/sched/scheduler.py: 7 call sites updated in schedule()
  • tests/v1/core/test_scheduler.py: Test updated to verify correct FCFS behavior

Note: _preempt_request() intentionally keeps prepend_request() — preempted requests were already running and interrupted, so they should get priority to prevent starvation. This is semantically different from "skipped waiting" requests.

Backward Compatibility

Fully backward compatible:

  • No API changes
  • No configuration changes
  • Only affects internal queue ordering
  • All requests still get scheduled eventually
  • Existing behavior preserved for preempted requests

Test Plan

# Run the updated FCFS ordering test
pytest tests/v1/core/test_scheduler.py::test_skipped_requests_fcfs_order -v

# Run full scheduler test suite
pytest tests/v1/core/test_scheduler.py -v

Test Result

Test test_skipped_requests_fcfs_order verifies:

  1. Create 4 requests: [req_0, req_1, req_2, req_3]
  2. Set first 2 to WAITING_FOR_REMOTE_KVS (will be skipped)
  3. Schedule with max_num_seqs=1 (only req_2 can run)
  4. Verify waiting queue order: [req_3, req_0, req_1]
    • req_3: stayed in waiting (never processed, capacity full)
    • req_0: skipped → added to tail
    • req_1: skipped → added to tail

Before fix: Test expected [req_0, req_1, req_3] (front insertion)
After fix: Test expects [req_3, req_0, req_1] (tail insertion) ✅

Impact

Metric Expected Change
p50 latency No change
p99 latency Improved (less queue reordering)
p100 latency Improved (no request starvation)
Throughput No change
Memory usage No change

Fixes #27441

🤖 Generated with Claude Code

harsh543 and others added 3 commits January 31, 2026 19:43
Replace prepend_request() with add_request() when returning skipped
requests to the waiting queue. Previously, skipped requests were
inserted at the front of the queue, causing them to jump ahead of
other waiting requests and violating FCFS ordering.

With this fix, skipped requests go to the tail of the queue,
preserving arrival order and reducing tail latency under resource
pressure.

Fixes vllm-project#27441

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 1, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@dosubot
Copy link
Copy Markdown

dosubot bot commented Feb 1, 2026

Related Documentation

No published documentation to review for changes on this repository.

Write your first living document

How did I do? Any feedback?  Join Discord

@mergify mergify bot added the v1 label Feb 1, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a First-Come, First-Served (FCFS) ordering violation for skipped requests in the scheduler. By replacing prepend_request with add_request, skipped requests are now correctly appended to the tail of the waiting queue, preserving their order. Additionally, the change to handle KV cache allocation failures by continuing to the next request instead of breaking the scheduling loop is a significant improvement that mitigates head-of-line blocking. The accompanying test updates and the new fairness test correctly validate these important fixes. The changes are well-implemented and improve the scheduler's correctness and performance under resource pressure.

@orozery
Copy link
Copy Markdown
Collaborator

orozery commented Feb 2, 2026

I actually think that the current implementation is the correct way to ensure FCFS.
The scheduler processes waiting requests serially.
Every waiting request that is skipped should be returned to its original position in the self.waiting list.
This is achieved by prepending the list of skipped requests back to the head (position 0) of the waiting list.

@harsh543
Copy link
Copy Markdown
Author

harsh543 commented Feb 2, 2026

Original analysis below was incorrect — see my correction in the follow-up comments.


@orozery Thanks for the review! After further analysis, I realize you're correct about the algorithm.

Corrected analysis of prepend+prepend:

Initial: [Req1, Req2, Req3, Req4]

1. Skip Req1 → skipped.prepend_request(Req1) → skipped = [Req1]
2. Skip Req2 → skipped.prepend_request(Req2) → skipped = [Req2, Req1]  ← First reversal
3. Req3 scheduled
4. waiting = [Req4]
5. waiting.prepend_requests(skipped) → extendleft([Req2, Req1])
   - appendleft(Req2) → [Req2, Req4]
   - appendleft(Req1) → [Req1, Req2, Req4]  ← Second reversal cancels first!

Result: [Req1, Req2, Req4] ✅ Correct FCFS order!

The two reversals (prepend during collection + extendleft during reinsertion) cancel out and preserve FCFS order. My original worked example missed the extendleft reversal step.

The real question: If the current prepend+prepend implementation is algorithmically correct, what's causing the shuffling shown in the issue author's visualization? That's what needs investigation.

Apologies for the initial confusion!

@harsh543
Copy link
Copy Markdown
Author

harsh543 commented Feb 2, 2026

Follow-up clarification:

@orozery You're conceptually correct that prepending skipped requests back to the head is the right approach for FCFS — I initially missed this. The two prepend operations (collection + reinsertion) should create two reversals that cancel out.

However, @CennyMo identified the root cause in the issue thread:

# PR #14002 changed the behavior:
# BEFORE:
self._deque.extendleft(requests)  # extendleft naturally reverses

# AFTER (or at some point):
self._deque.extendleft(reversed(requests))  # Double reversal!

The current prepend_requests implementation uses extendleft(requests) without reversed(), but if prepend_request is building the skipped list in reversed order AND extendleft also reverses, the two should cancel out.

The issue author's visualization shows shuffling occurring in practice, which suggests either:

  1. A mismatch in the reversal logic at some point in the code path
  2. An edge case where the cancellation doesn't work as expected

The add_request() approach in this PR sidesteps the reversal complexity entirely — it's more explicit and less error-prone, even if it changes the semantic to tail placement (which also helps with HOL blocking under KV pressure).

Would you prefer we:

  1. Keep the PR as-is (tail placement, avoids reversal complexity, prevents HOL blocking)
  2. Fix the reversal bug directly while keeping head placement (if that's the preferred FCFS semantic)

Happy to adjust based on your guidance!

@orozery
Copy link
Copy Markdown
Collaborator

orozery commented Feb 2, 2026

I actually fixed the reversing bug in #32173 and added a unit test which verifies it.

@njhill
Copy link
Copy Markdown
Member

njhill commented Feb 2, 2026

It sounds like the problem is just that we didn't close #27441 after it was fixed by #32173?

If so let's close both #27441 and this PR.

@harsh543 harsh543 marked this pull request as draft February 4, 2026 20:03
@harsh543
Copy link
Copy Markdown
Author

harsh543 commented Mar 2, 2026

Reviewed the relevant shutdown/structured-output paths in the repo (EngineCore + scheduler/StructuredOutputManager) and didn't find any remaining issues or conflicting comments. I've tested as much as possible—can we move this forward for now, @njhill?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: vllm/v1/core/sched/scheduler.py: Unintended reordering of requests during scheduling

3 participants