Fix #26037: Skip CUDA platform detection when displaying help#33550
Fix #26037: Skip CUDA platform detection when displaying help#33550AbhiOnGithub wants to merge 2 commits into
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request effectively addresses the performance issue of running CLI commands with --help flags by deferring platform initialization. The changes in vllm/entrypoints/cli/main.py and vllm/entrypoints/utils.py are logical and well-supported by the new tests. My main feedback is to refactor a small piece of duplicated logic in main.py to improve code clarity and maintainability.
|
Hi @AbhiOnGithub, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
b0d760e to
d18d65b
Compare
|
But does vllm in the help codepath import torch / attempts dlopen CUDA libraries (I think this happens as part of import torch) / attempt doing CUDA init (if any of CUDA methods are used, torch does this)? If so, it will still take time to load them from disk, and these can be hundreds of megabytes or larger, it takes quite some time to import torch if it's not in OS disk cache |
|
Hi @NickLucche , @chaunceyjiang , @aarnphm , @DarkLight1337 , @robertgshaw2-redhat When users run vllm serve --help or vllm --help, the command takes ~10 seconds because it triggers CUDA/platform initialization before displaying help text. |
Good point! You're absolutely right to ask about this. Let me clarify what this PR does and doesn't prevent: What this PR prevents: ✅ Platform/CUDA initialization - The detection (which calls CUDA APIs) is now skipped for help The impact: This PR addresses that specific bottleneck - the CUDA/platform initialization that happens during [vllm.platforms.current_platform] access. Further optimization: Lazy-load CLI subcommands (only import when that subcommand is actually invoked) |
|
Current Status (with this PR) Looking at the code, the CLI module imports in main.py happen regardless of the --help flag: These imports occur even for --help, which means: YES - torch likely gets imported during help through the dependency chain (e.g., benchmark.main → EngineArgs → eventually torch) What This PR Actually Fixes Platform detection is skipped - No current_platform access means no CUDA API calls Setting CUDA_VISIBLE_DEVICES="" made help instant The 10-second delay matched CUDA init time, not torch import time (which is typically 1-3 seconds) Further Optimization Possible |
|
I think it would be good to at least figure out if So if you figured it out - maybe would be great to create a new issue to track this and decide what to do... Btw maybe a new command can be added for benchmark or "self-test" (to check import torch/flashinfer/backends etc) - or maybe it exists already (except that proper very-accurate benchmarking is hard as it needs to control for various throttling and resource sharing). So that the new installation can run this command - and maybe its structured output could be uploaded somewhere to get some real-world basic benchmark numbers from many users... Seems benchmark command already exists as |
|
@vadimkantorov FindingsThe import chain during help display: So yes, users will experience torch import overhead (~1-3 seconds) even with this PR. What This PR Actually FixesThis PR specifically prevents CUDA initialization (the ~10 second delay from #26037), not torch import overhead. The original issue showed:
This PR eliminates the CUDA initialization overhead by:
Next StepsI've created issue #33741 to track optimizing the remaining torch import overhead. The solution would be to make CLI submodule imports lazy (only import when the subcommand is actually invoked, not during argument parsing).
|
|
Hi @vadimkantorov @NickLucche @chaunceyjiang , @aarnphm , @DarkLight1337 , @robertgshaw2-redhat |
|
This pull request has merge conflicts that must be resolved before it can be |
|
Hi @AbhiOnGithub, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
ad5f10f to
ed2d51e
Compare
|
Hi @vadimkantorov |
|
@AbhiOnGithub I'm not a maintainer, just the reporter of the original issue. For me all is good. Maybe after this is merged, worth creating a separate issue for discussing/tracking removing |
|
✅ Done! I've made the following changes:
The code is now cleaner and follows DRY principle. Thanks for the review! |
| cli_env_setup() | ||
| # Only do environment setup if not showing help | ||
| if not needs_help: | ||
| cli_env_setup() |
There was a problem hiding this comment.
Are there any CLI args that get their default from env?
There was a problem hiding this comment.
No — [cli_env_setup()] only sets VLLM_WORKER_MULTIPROC_METHOD, which isn't used as a default for any CLI argument. All CLI arg defaults come from static dataclass field values, not [os.environ.get()]
The only tangential case is [CompilationConfig.compile_cache_save_format] which reads VLLM_COMPILE_CACHE_SAVE_FORMAT via a [default_factory],
but that's a sub-field of a JSON argument ([--compilation-config], not a standalone CLI arg. Skipping [cli_env_setup()] during [--help] is safe and won't affect displayed defaults."
There was a problem hiding this comment.
This isn't obvious to me. What is the conflict avoided by skipping this with showing_help is True?
There was a problem hiding this comment.
Good question. The bench command block accesses platforms.current_platform.is_unspecified(), which triggers CUDA/platform detection. That detection can take ~10 seconds (or fail entirely without a GPU). When showing help, we just want to print usage info — we don't need to know the platform. Updated the comment in the latest commit to make this clearer.
There was a problem hiding this comment.
This comment, and the code comment update, are not related to the code this comment thread is under.
Please provide a manual response based on your own understanding (don't use claude or whatever to respond).
There was a problem hiding this comment.
Hi @russellb ,my earlier reply was actually about the bench block below, not this code. Looking at it again, theres honestly no good reason to skip cli_env_setup()
All it does is set VLLM_WORKER_MULTIPROC_METHOD=spawn in the env — its instant and dosent trigger any CUDA or platform initialization at all.
I was just being cautious and wrapped everthing in help guard without thinking about it. Removed it now so cli_env_setup() just runs unconditionally again and thanks for pointing this out.
Code was assisted by Claude not comments :P
e044f2d to
a4ad06f
Compare
hmellor
left a comment
There was a problem hiding this comment.
LGTM! Thanks for this change and for being responsive to feedback.
For future reference, if you can avoid force pushing, it would make the reviewing process easier because GitHub can show me only the changes I have not yet reviewed.
|
You will need to merge from main to fix the docs build |
| @@ -14,6 +14,14 @@ | |||
|
|
|||
|
|
|||
| def main(): | |||
| # Check if help is requested before doing any heavy initialization | |||
There was a problem hiding this comment.
Import does not need to be lazy?
There was a problem hiding this comment.
yes, here lazy import will not help
The actual cost is three heavy libraries imported eagerly:
torch ~1.3s (via vllm/init.py -> vllm.env_override)
transformers ~1.4s (via vllm.config.model -> transformers_utils.config)
fastapi+aio ~0.5s (via cli.serve -> api_server)
| # For 'vllm bench *': use CPU instead of UnspecifiedPlatform by default. | ||
| # When showing help, skip this to avoid triggering CUDA/platform init | ||
| # (which can take ~10s or fail without a GPU). | ||
| if len(sys.argv) > 1 and sys.argv[1] == "bench" and not showing_help: |
There was a problem hiding this comment.
| if len(sys.argv) > 1 and sys.argv[1] == "bench" and not showing_help: | |
| if len(sys.argv) > 1 and sys.argv[1] == "bench" and not needs_help(): |
There was a problem hiding this comment.
Also, did you test vllm bench --help after this change? Does that still work OK?
There was a problem hiding this comment.
yes , please refer this comment
#33550 (comment)
| def test_help_flag_skips_platform_detection(argv): | ||
| """Test that help flags don't trigger platform detection.""" | ||
| import vllm.platforms | ||
|
|
||
| vllm.platforms._current_platform = None | ||
|
|
||
| with patch.object(sys, "argv", argv), patch.object(sys, "exit"): | ||
| from vllm.entrypoints.cli.main import main | ||
|
|
||
| with contextlib.suppress(SystemExit): | ||
| main() | ||
|
|
||
| assert vllm.platforms._current_platform is None, ( | ||
| f"Platform should not be detected when showing help with {argv}" | ||
| ) |
There was a problem hiding this comment.
I don't think this is safe. This could conflict with other tests in the same process by editing global state.
There was a problem hiding this comment.
this is now replaced with 3 tests
test_needs_help_detects_help_flags
test_needs_help_returns_false_without_help_flags
test_bench_help_skips_platform_detection
having with statement
| import vllm.platforms | ||
|
|
||
| # Reset platform detection state | ||
| vllm.platforms._current_platform = None |
There was a problem hiding this comment.
Same comment for this test: I don't think this is safe. This could conflict with other tests in the same process by editing global state.
There was a problem hiding this comment.
yes mutating vllm.platforms._current_platform = None is unsafe global state modification.
Now uses patch.object to temporarily set _current_platform = None , the original value is automatically restored when the with block exits, so no global state leaks between tests.
There was a problem hiding this comment.
using unittest.mock.patch is just as unsafe for the same reasons.
There was a problem hiding this comment.
hi @russellb , what is your suggestion/recomendation , what should I use here to make it safe.
There was a problem hiding this comment.
@russellb , IMHO the better approach is to run the import check in a subprocess.
This gives perfect isolation — no mutation of the current process's global state at all.
def test_utils_import_no_platform_detection():
"""Importing vllm.entrypoints.utils must not trigger platform detection."""
# Run in a subprocess for full isolation — no mock / patch needed.
result = subprocess.run(
[
sys.executable,
"-c",
";".join([
"import vllm.platforms",
"import vllm.entrypoints.utils",
"assert vllm.platforms._current_platform is None"
", 'Importing vllm.entrypoints.utils triggered detection'",
]),
],
capture_output=True,
text=True,
)
assert result.returncode == 0, (
f"Subprocess failed (rc={result.returncode}):\n"
f"stdout: {result.stdout}\nstderr: {result.stderr}"
)
Replaced unittest.mock.patch with subprocess isolation in [test_utils_lazy_import.py]
because patch.object combined with importlib.reload was masking a real bug where importing vllm.entrypoints.utils triggered platform detection through a top-level EngineArgs import.
Fixed the root cause by making EngineArgs a lazy import inside function bodies in vllm/entrypoints/utils.py.
Kept patch.object for sys.argv in [test_cli_main.py] since patching a plain list attribute is safe.
Simplified the bench help test to a focused unit test of the guard condition instead of running the full main() entrypoint in a subprocess.
refer my latest commit
| # For 'vllm bench *': use CPU instead of UnspecifiedPlatform by default. | ||
| # When showing help, skip this to avoid triggering CUDA/platform init | ||
| # (which can take ~10s or fail without a GPU). | ||
| if len(sys.argv) > 1 and sys.argv[1] == "bench" and not showing_help: |
There was a problem hiding this comment.
Also, did you test vllm bench --help after this change? Does that still work OK?
removing as I don't want to block this if I don't come back to it.
| from vllm.utils.argparse_utils import FlexibleArgumentParser | ||
|
|
||
| if TYPE_CHECKING: | ||
| from vllm.engine.arg_utils import EngineArgs |
There was a problem hiding this comment.
You can't do this because EngineArgs is used for more than just type checking
There was a problem hiding this comment.
Hi @hmellor ,
thanks for the feedback! I've addressed all the review comments and rebased into a single clean commit. Here's what changed:
utils.py — EngineArgs is kept as a normal import (not TYPE_CHECKING), since it's used at runtime for isinstance() and constructor calls.
The only change here is moving current_platform from a module-level import to a lazy import inside get_max_tokens().
main.py — Using needs_help() inline as you suggested: if len(sys.argv) > 1 and sys.argv[1] == "bench" and not needs_help():
arg_utils.py — Converted NEEDS_HELP constant into a needs_help() function (reusable from main.py), added -h and --help=X support.
NEEDS_HELP isstill set at module level for backward compat.
otel.py — Dropped from this PR as you suggested.
Tests — Parametrized with @pytest.mark.parametrize, lazy import test uses subprocess isolation. All 10 tests pass locally, all pre-commit hooks
| Scenario | Upstream | With this Fix |
|---|---|---|
vllm serve -h → NEEDS_HELP |
False (broken) | True (fixed) |
vllm serve --help → NEEDS_HELP |
True | True |
vllm serve --help=ModelConfig → NEEDS_HELP |
True | True |
When NEEDS_HELP is False on upstream with -h:
• pre_register_and_update(parser) runs — triggers full CUDA platform detection
• Bench command platform override runs — triggers platform detection again
With this fix, -h is properly detected, those heavy calls are skipped.
|
This pull request has merge conflicts that must be resolved before it can be |
f975ab9 to
705e0cc
Compare
|
Hi @AbhiOnGithub, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: AbhiOnGithub <mail2abhishekgupta@gmail.com>
705e0cc to
e83d3a9
Compare
| def test_bench_help_skips_platform_detection(): | ||
| """Test that the bench guard in main() is skipped when --help is present. | ||
|
|
||
| The guard in main.py is: | ||
| if sys.argv[1] == "bench" and not showing_help | ||
| When showing_help is True, current_platform is never accessed for | ||
| the bench CPU-override, avoiding unnecessary platform detection. | ||
| """ | ||
| from vllm.engine.arg_utils import needs_help | ||
|
|
||
| # Verify the guard: needs_help() == True means "not showing_help" is False, | ||
| # so the bench platform-override block is skipped. | ||
| with patch.object(sys, "argv", ["vllm", "bench", "--help"]): | ||
| assert needs_help(), "needs_help() should be True for bench --help" | ||
|
|
||
| # Without --help the guard would be entered | ||
| with patch.object(sys, "argv", ["vllm", "bench", "latency"]): | ||
| assert not needs_help(), "needs_help() should be False without --help" |
There was a problem hiding this comment.
This duplicates what is already done in the previous 2 tests?
| @pytest.mark.parametrize( | ||
| "argv", | ||
| [ | ||
| ["vllm", "--help"], | ||
| ["vllm", "serve", "--help"], | ||
| ["vllm", "-h"], | ||
| ["vllm", "bench", "--help"], | ||
| ["vllm", "serve", "--help=ModelConfig"], | ||
| ], | ||
| ) | ||
| def test_needs_help_detects_help_flags(argv): | ||
| """Test that needs_help() correctly detects help flags in sys.argv.""" | ||
| from vllm.engine.arg_utils import needs_help | ||
|
|
||
| # patch.object on sys.argv is safe — it's a simple list attribute | ||
| # with no lazy-init or side-effect machinery. | ||
| with patch.object(sys, "argv", argv): | ||
| assert needs_help(), f"needs_help() should return True for {argv}" | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "argv", | ||
| [ | ||
| ["vllm", "serve", "--model", "test"], | ||
| ["vllm", "bench", "latency", "--model", "test"], | ||
| ["vllm", "collect-env"], | ||
| ], | ||
| ) | ||
| def test_needs_help_returns_false_without_help_flags(argv): | ||
| """Test that needs_help() returns False when no help flag is present.""" | ||
| from vllm.engine.arg_utils import needs_help | ||
|
|
||
| with patch.object(sys, "argv", argv): | ||
| assert not needs_help(), f"needs_help() should return False for {argv}" |
There was a problem hiding this comment.
These can be merged into
def test_needs_help(argv, expected):
...There was a problem hiding this comment.
This test is strangely specific. Surely we just want to check that there's no vllm.platforms in the global scope, not that there is vllm.platforms in get_max_tokens?
Description
This PR fixes issue #26037 by preventing CUDA/platform initialization when users run
--helpor-hflags, making help display instant instead of taking ~10 seconds.When NEEDS_HELP is False on upstream with -h:
• pre_register_and_update(parser) runs — triggers full CUDA platform detection
• Bench command platform override runs — triggers platform detection again
With our fix, -h is properly detected, those heavy calls are skipped.
On this machine with fast Blackwell GPUs, CUDA init takes milliseconds so the wall-clock difference is tiny. But on machines where CUDA init is slow
(original issue reported ~10s), this fix eliminates that delay entirely for -h
Problem
When users run
vllm serve --helporvllm --help, the command takes ~10 seconds because it triggers CUDA/platform initialization before displaying help text. This makes it difficult to quickly view command options and is especially problematic when CUDA initialization fails.Solution
This PR implements a two-part fix:
1. Early Help Detection in
vllm/entrypoints/cli/main.py--helpor-hflags at the start ofmain()before any heavy initializationcli_env_setup()when displaying help2. Lazy Import in
vllm/entrypoints/utils.pyfrom vllm.platforms import current_platformfrom module-level to function-levelget_max_tokens()where it's actually usedChanges
vllm/entrypoints/cli/main.py: Added help flag detection and conditional initializationvllm/entrypoints/utils.py: Made platform import lazytests/entrypoints/test_cli_main.py: Tests for help flag behavior (5 tests)tests/entrypoints/test_utils_lazy_import.py: Tests for lazy import (2 tests)Benefits
Testing
Fixes #26037