
Support async scheduling with TPU-inference's RayExecutor#1912

Merged
gxd3 merged 4 commits into main from gxd/ray-async-scheduler
Mar 17, 2026

Conversation

@gxd3 (Collaborator) commented Mar 12, 2026

Support async scheduling with TPU-inference's RayExecutor

Implement the functionality in TPU-inference's RayExecutor subclass rather than in the RayExecutor parent class in the vLLM repo, for more flexibility.

TPUPlatform overrides vLLM's Platform.executors_supports_async_scheduling() so that our custom executor is on the whitelist of executors that support async scheduling.

Sent a separate PR to the vLLM repo: vllm-project/vllm#36924

Shares a similar idea with vllm-project/vllm#29012.
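The override pattern described above can be sketched as follows. This is an illustrative sketch only, not vLLM's actual implementation: the class layout and the executor names ("uni", "mp", "ray") are placeholders, and only the method name executors_supports_async_scheduling() comes from this PR's description.

```python
# Sketch of the whitelist-override pattern: a Platform base class reports
# which executor backends support async scheduling, and a TPU-specific
# platform extends that whitelist with the custom Ray executor.
# All names here are illustrative placeholders, not the real vLLM API.

class Platform:
    @classmethod
    def executors_supports_async_scheduling(cls) -> list[str]:
        # Placeholder upstream default: only built-in executors allowed.
        return ["uni", "mp"]


class TPUPlatform(Platform):
    @classmethod
    def executors_supports_async_scheduling(cls) -> list[str]:
        # Extend the parent whitelist with the custom Ray executor so the
        # engine permits async scheduling when that executor is selected.
        return super().executors_supports_async_scheduling() + ["ray"]


# An engine-side check would then consult the active platform's whitelist
# before enabling async scheduling.
assert "ray" in TPUPlatform.executors_supports_async_scheduling()
```

Keeping the override on the platform class (rather than patching the parent executor upstream) lets the TPU-inference repo control the whitelist without a lockstep vLLM change.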

Tests

  • Unit test.
  • E2E benchmark: xprof: http://xprof/trace_viewer.html?session_id=gxd-2909041915368964943 (there is no TPU bubble now)
  • Quality test:

python ./tpu-inference/scripts/vllm/benchmarking/benchmark_serving.py --backend vllm --model deepseek-ai/DeepSeek-R1 --dataset-name mmlu --dataset-path /home/gxd_google_com/mmlu/data/test --num-prompts 5000 --run_eval --temperature 0

============ Serving Benchmark Result ============
Successful requests:                     5000      
Benchmark duration (s):                  119.35    
Total input tokens:                      1015147   
Total generated tokens:                  10000     
Request throughput (req/s):              41.89     
Output token throughput (tok/s):         83.79     
Total token throughput (tok/s):          8589.47   
---------------Time to First Token----------------
Mean TTFT (ms):                          66489.88  
Median TTFT (ms):                        62561.75  
P99 TTFT (ms):                           117335.94 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          788.79    
Median TPOT (ms):                        787.97    
P99 TPOT (ms):                           806.68    
---------------Inter-token Latency----------------
Mean ITL (ms):                           788.79    
Median ITL (ms):                         787.97    
P99 ITL (ms):                            806.68    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          67278.67  
Median E2EL (ms):                        63351.98  
P99 E2EL (ms):                           118134.76 
==================================================
Evaluating MMLU...

Results

{'accuracy': 0.8742, 'gen_num': 5000}

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

gxd3 added 3 commits March 12, 2026 04:00
Signed-off-by: Guangxiang Du <gxd@google.com>
Signed-off-by: Guangxiang Du <gxd@google.com>
Signed-off-by: Guangxiang Du <gxd@google.com>
@github-actions

Description

Start with a short description of what the PR does and how this is a change from
the past.

The rest of the description includes relevant details and context, examples:

  • why is this change being made,
  • the problem being solved and any relevant context,
  • why this is a good solution,
  • some information about the specific implementation,
  • shortcomings of the solution and possible future improvements.

If the change fixes a Github issue, please include a link, e.g.,:
FIXES: #123456

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@Lumosis Lumosis added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 12, 2026
@Lumosis
Collaborator

Lumosis commented Mar 12, 2026

Great job Guangxiang! We should also enable async scheduling in multi-host/disagg e2e testing.

@gxd3 gxd3 closed this Mar 12, 2026
@gxd3 (Collaborator, Author) commented Mar 12, 2026

> Great job Guangxiang! We should also enable async scheduling in multi-host/disagg e2e testing.

Will do, once the vLLM repo commit is submitted :)

@gxd3 gxd3 reopened this Mar 12, 2026
Signed-off-by: Guangxiang Du <gxd@google.com>
@gxd3 gxd3 merged commit 891ae0d into main Mar 17, 2026
42 checks passed
@wdhongtw wdhongtw deleted the gxd/ray-async-scheduler branch April 7, 2026 09:55
