
Add per-request status info for MLLM scheduler#257

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:fix/mllm-running-requests-info

Conversation

@janhilgard
Collaborator

Summary

  • The MLLM scheduler was missing get_running_requests_info(), so /v1/status always returned "requests": [] for MLLM models (all servers started with --mllm)
  • This meant monitoring dashboards could not see real-time per-request progress during streaming — throughput appeared as 0 tok/s until the request completed, then spiked
  • Added get_running_requests_info() to MLLMScheduler (mirrors the existing implementation in scheduler.py) and promoted scheduler stats to top-level in BatchedEngine.get_stats()
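As a rough illustration of the missing piece, here is a minimal sketch of what get_running_requests_info() could look like. The field names in the returned entries (completion_tokens, tokens_per_second, progress, ttft_s, phase) come from this PR's description; the MLLMRequest fields other than first_token_time, and the standalone-function form, are assumptions — the real implementation is a method on MLLMScheduler mirroring scheduler.py:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MLLMRequest:
    # Fields other than first_token_time are illustrative assumptions.
    request_id: str
    max_tokens: int
    start_time: float = field(default_factory=time.monotonic)
    completion_tokens: int = 0
    first_token_time: Optional[float] = None  # recorded on the first generated token

def get_running_requests_info(running: list) -> list:
    """Build per-request status entries with the fields named in the PR."""
    now = time.monotonic()
    info = []
    for req in running:
        decoding = req.first_token_time is not None
        elapsed = now - req.first_token_time if decoding else 0.0
        info.append({
            "request_id": req.request_id,
            "completion_tokens": req.completion_tokens,
            # Decode-phase throughput; None until the first token arrives
            "tokens_per_second": req.completion_tokens / elapsed if elapsed > 0 else None,
            "progress": req.completion_tokens / req.max_tokens if req.max_tokens else 0.0,
            "ttft_s": req.first_token_time - req.start_time if decoding else None,
            "phase": "decode" if decoding else "prefill",
        })
    return info
```

With this shape, a request that has not yet produced a token reports phase "prefill" with null ttft_s and tokens_per_second, which matches the "0 tok/s until completion" symptom the dashboards were seeing.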

Changes

  • mllm_scheduler.py: Add first_token_time to MLLMRequest, record it on first generated token, add get_running_requests_info() returning per-request completion_tokens, tokens_per_second, progress, ttft_s, phase
  • engine/batched.py: Promote num_running, num_waiting, total_prompt_tokens, total_completion_tokens, requests (and other keys) from mllm_stats to top-level stats
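The stats promotion in engine/batched.py can be sketched as below. This is illustrative only: the real BatchedEngine.get_stats() takes no argument and gathers mllm_stats from its own scheduler, and the PR promotes additional keys beyond the ones listed here:

```python
def get_stats(mllm_stats: dict) -> dict:
    """Sketch: lift selected scheduler-stats keys to the top level of the payload."""
    promoted = ("num_running", "num_waiting", "total_prompt_tokens",
                "total_completion_tokens", "requests")
    stats = {"mllm_stats": mllm_stats}
    for key in promoted:
        if key in mllm_stats:
            stats[key] = mllm_stats[key]  # visible at top level in /v1/status
    return stats
```

Keeping the nested mllm_stats dict alongside the promoted copies preserves backward compatibility for any consumer already reading the nested path.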

Test plan

  • Start an MLLM server (--mllm flag)
  • Send a streaming request with high max_tokens
  • Poll /v1/status during generation — verify requests array contains per-request details with increasing completion_tokens and non-null tokens_per_second
  • Verify num_running and total_completion_tokens are visible at top-level in /v1/status
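The polling step of the test plan can be sketched with a small script. The default base URL, and the exact response shape beyond the fields named in this PR, are assumptions:

```python
import json
import time
import urllib.request

def summarize(stats: dict) -> list:
    """Pull the per-request progress fields out of a /v1/status payload."""
    return [
        (r.get("completion_tokens"), r.get("tokens_per_second"), r.get("phase"))
        for r in stats.get("requests", [])
    ]

def poll_status(base_url: str = "http://localhost:8000",
                interval_s: float = 1.0, duration_s: float = 30.0) -> None:
    """Poll /v1/status while a streaming request is in flight."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{base_url}/v1/status") as resp:
            stats = json.load(resp)
        print("num_running:", stats.get("num_running"),
              "total_completion_tokens:", stats.get("total_completion_tokens"),
              summarize(stats))
        time.sleep(interval_s)
```

Successive lines of output should show completion_tokens increasing and tokens_per_second non-null once each request enters the decode phase.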

🤖 Generated with Claude Code

The MLLM scheduler was missing get_running_requests_info(), so
/v1/status always returned an empty requests array for MLLM models.
This made it impossible to see real-time per-request progress
(completion_tokens, tokens_per_second, TTFT) during streaming.

Changes:
- Add first_token_time field to MLLMRequest for TTFT tracking
- Add get_running_requests_info() to MLLMScheduler (mirrors scheduler.py)
- Include requests in get_stats() output
- Promote scheduler stats (num_running, num_waiting, total tokens,
  requests) to top-level in BatchedEngine.get_stats()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard
Collaborator Author

@Thump604 Superseded — per-request RequestStatus tracking is already in main via #278. Closing.

@janhilgard janhilgard closed this Apr 11, 2026
