Support async scheduling with TPU-inference's RayExecutor #1912
Conversation
Signed-off-by: Guangxiang Du <gxd@google.com>
Great job Guangxiang! We should also enable async scheduling in multi-host/disagg e2e testing.
Will do, once the vLLM repo commit is submitted :)
Support async scheduling with TPU-inference's RayExecutor
Implement the functionality in TPU-inference's RayExecutor subclass instead of in vLLM's RayExecutor parent class, for more flexibility.
TPUPlatform overrides vLLM's Platform.executors_supports_async_scheduling() so that our custom executor is on the whitelist of executors that support async scheduling. Sent a separate PR to the vLLM repo: vllm-project/vllm#36924.
Shares a similar idea with vllm-project/vllm#29012.
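The whitelist override described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not vLLM's actual code: the `Platform` base class, the method's return type, and the executor names in the base whitelist are all assumptions made for the example; only the method name `executors_supports_async_scheduling` comes from this PR.

```python
# Hedged sketch of the platform-level whitelist override.
# "Platform" here is a stand-in for vLLM's platform base class;
# the base whitelist contents are illustrative, not vLLM's real values.

class Platform:
    """Minimal stand-in for vLLM's Platform interface."""

    @classmethod
    def executors_supports_async_scheduling(cls) -> list[str]:
        # Names of executor classes allowed to use async scheduling
        # (illustrative values only).
        return ["MultiprocExecutor", "UniProcExecutor"]


class TPUPlatform(Platform):
    """Stand-in for TPU-inference's platform subclass."""

    @classmethod
    def executors_supports_async_scheduling(cls) -> list[str]:
        # Extend the base whitelist so TPU-inference's custom
        # RayExecutor subclass passes the async-scheduling check.
        return super().executors_supports_async_scheduling() + ["RayExecutor"]
```

With this shape, the scheduler-side check only consults the active platform's whitelist, so the custom executor is enabled without touching the parent RayExecutor in the vLLM repo.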
Tests
Unit test.
E2E benchmark, xprof trace: http://xprof/trace_viewer.html?session_id=gxd-2909041915368964943 (there is no TPU bubble now)
Quality test: