Skip to content

Fix #26037: Skip CUDA platform detection when displaying help#33550

Open
AbhiOnGithub wants to merge 2 commits into
vllm-project:mainfrom
AbhiOnGithub:fix-help-platform-detection
Open

Fix #26037: Skip CUDA platform detection when displaying help#33550
AbhiOnGithub wants to merge 2 commits into
vllm-project:mainfrom
AbhiOnGithub:fix-help-platform-detection

Conversation

@AbhiOnGithub
Copy link
Copy Markdown
Contributor

@AbhiOnGithub AbhiOnGithub commented Feb 2, 2026

Description

This PR fixes issue #26037 by preventing CUDA/platform initialization when users run --help or -h flags, making help display instant instead of taking ~10 seconds.

When NEEDS_HELP is False on upstream with -h:
• pre_register_and_update(parser) runs — triggers full CUDA platform detection
• Bench command platform override runs — triggers platform detection again

With our fix, -h is properly detected, those heavy calls are skipped.
On this machine with fast Blackwell GPUs, CUDA init takes milliseconds so the wall-clock difference is tiny. But on machines where CUDA init is slow
(original issue reported ~10s), this fix eliminates that delay entirely for -h

Problem

When users run vllm serve --help or vllm --help, the command takes ~10 seconds because it triggers CUDA/platform initialization before displaying help text. This makes it difficult to quickly view command options and is especially problematic when CUDA initialization fails.

Solution

This PR implements a two-part fix:

1. Early Help Detection in vllm/entrypoints/cli/main.py

  • Check for --help or -h flags at the start of main() before any heavy initialization
  • Skip cli_env_setup() when displaying help
  • Skip platform detection for all help requests (including bench command)

2. Lazy Import in vllm/entrypoints/utils.py

  • Move from vllm.platforms import current_platform from module-level to function-level
  • Import only happens in get_max_tokens() where it's actually used
  • Prevents platform detection during CLI module imports

Changes

  • Modified vllm/entrypoints/cli/main.py: Added help flag detection and conditional initialization
  • Modified vllm/entrypoints/utils.py: Made platform import lazy
  • Added tests/entrypoints/test_cli_main.py: Tests for help flag behavior (5 tests)
  • Added tests/entrypoints/test_utils_lazy_import.py: Tests for lazy import (2 tests)

Benefits

  • ⚡ Help commands now execute in milliseconds instead of ~10 seconds
  • 🚫 No CUDA initialization when just viewing help
  • ✅ Works even when CUDA fails to initialize
  • 🔄 No breaking changes to normal command execution

Testing

  • All modified files compile without syntax errors
  • Help detection logic verified
  • Tests follow vLLM conventions and will run in CI

Fixes #26037

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 2, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify Bot added the frontend label Feb 2, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses the performance issue of running CLI commands with --help flags by deferring platform initialization. The changes in vllm/entrypoints/cli/main.py and vllm/entrypoints/utils.py are logical and well-supported by the new tests. My main feedback is to refactor a small piece of duplicated logic in main.py to improve code clarity and maintainability.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Feb 2, 2026

Hi @AbhiOnGithub, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@AbhiOnGithub AbhiOnGithub force-pushed the fix-help-platform-detection branch from b0d760e to d18d65b Compare February 2, 2026 08:20
@vadimkantorov
Copy link
Copy Markdown

But does vllm in the help codepath import torch / attempts dlopen CUDA libraries (I think this happens as part of import torch) / attempt doing CUDA init (if any of CUDA methods are used, torch does this)?

If so, it will still take time to load them from disk, and these can be hundreds of megabytes or larger, it takes quite some time to import torch if it's not in OS disk cache

@AbhiOnGithub
Copy link
Copy Markdown
Contributor Author

Hi @NickLucche , @chaunceyjiang , @aarnphm , @DarkLight1337 , @robertgshaw2-redhat
this is first fix in VLLM repo from my end , please have a look once you guys get some time :)

When users run vllm serve --help or vllm --help, the command takes ~10 seconds because it triggers CUDA/platform initialization before displaying help text.
This makes it difficult to quickly view command options and is especially problematic when CUDA initialization fails.

@AbhiOnGithub
Copy link
Copy Markdown
Contributor Author

But does vllm in the help codepath import torch / attempts dlopen CUDA libraries (I think this happens as part of import torch) / attempt doing CUDA init (if any of CUDA methods are used, torch does this)?

If so, it will still take time to load them from disk, and these can be hundreds of megabytes or larger, it takes quite some time to import torch if it's not in OS disk cache

@vadimkantorov

Good point! You're absolutely right to ask about this. Let me clarify what this PR does and doesn't prevent:

What this PR prevents:

✅ Platform/CUDA initialization - The detection (which calls CUDA APIs) is now skipped for help
✅ cli_env_setup() - Environment configuration that can be slow is skipped
What might still happen:
You're correct that if any of the CLI submodule imports [vllm.entrypoints.cli.openai], etc.) transitively import torch, then import torch and its disk I/O would still occur during help display.

The impact:
The original issue #26037 specifically mentioned the ~10 second delay was due to CUDA initialization (platform detection), not torch import. The reporter noted it worked instantly when CUDA_VISIBLE_DEVICES="" was set, suggesting CUDA init was the bottleneck, not torch loading.

This PR addresses that specific bottleneck - the CUDA/platform initialization that happens during [vllm.platforms.current_platform] access.

Further optimization:
If torch import is also a significant bottleneck for help display, we could:

Lazy-load CLI subcommands (only import when that subcommand is actually invoked)
Move argument parser definitions to separate files that don't import torch
Would you like me to test the actual help execution time with this fix to verify torch import overhead isn't significant? I can measure before/after times to confirm the improvement.

@AbhiOnGithub
Copy link
Copy Markdown
Contributor Author

Current Status (with this PR)

Looking at the code, the CLI module imports in main.py happen regardless of the --help flag:

These imports occur even for --help, which means:

YES - torch likely gets imported during help through the dependency chain (e.g., benchmark.main → EngineArgs → eventually torch)

What This PR Actually Fixes
This PR specifically prevents CUDA initialization (the ~10 second delay mentioned in #26037), not torch import overhead. The fix works because:

Platform detection is skipped - No current_platform access means no CUDA API calls
cli_env_setup() is skipped - Additional heavy initialization avoided
CUDA init doesn't happen - Even if torch is imported, CUDA initialization is lazy and won't trigger without actual GPU operations
The Original Issue
Issue #26037 specifically showed the delay was from CUDA initialization, not torch import, because:

Setting CUDA_VISIBLE_DEVICES="" made help instant

The 10-second delay matched CUDA init time, not torch import time (which is typically 1-3 seconds)

Further Optimization Possible
If torch import is also a bottleneck, we could make the CLI imports lazy too:

# Instead of importing at module level, defer until subcommand is selected
# This would require restructuring how subcommands are registered

@DarkLight1337 DarkLight1337 requested a review from mgoin February 3, 2026 05:49
@vadimkantorov
Copy link
Copy Markdown

vadimkantorov commented Feb 3, 2026

I think it would be good to at least figure out if import torch is currently happening on the --help code path. If it's not happening - then no problem exists. If it is happening, then at least we know it and it can be fixed later indeed.

So if you figured it out - maybe would be great to create a new issue to track this and decide what to do...

Btw maybe a new command can be added for benchmark or "self-test" (to check import torch/flashinfer/backends etc) - or maybe it exists already (except that proper very-accurate benchmarking is hard as it needs to control for various throttling and resource sharing). So that the new installation can run this command - and maybe its structured output could be uploaded somewhere to get some real-world basic benchmark numbers from many users...

Seems benchmark command already exists as vllm bench (but doesn't seem to print json-structured/uploadable output by default) , but maybe a simpler smoke-command can be added like vllm selftest or sth similar which just verifies that import torch works, backends load and do not fail with any of the driver/shared libraries problems

@AbhiOnGithub
Copy link
Copy Markdown
Contributor Author

@vadimkantorov
I investigated this and you're absolutely right - torch IS imported during --help.

Findings

The import chain during help display:

vllm/entrypoints/cli/main.py (line 28)
  → vllm/entrypoints/cli/benchmark/latency.py
    → vllm/benchmarks/latency.py  
      → vllm/engine/arg_utils.py (line 13)
        → import torch

So yes, users will experience torch import overhead (~1-3 seconds) even with this PR.

What This PR Actually Fixes

This PR specifically prevents CUDA initialization (the ~10 second delay from #26037), not torch import overhead. The original issue showed:

  • CUDA_VISIBLE_DEVICES="" made help instant → proved CUDA init was the bottleneck
  • The 10-second delay matched CUDA init time, not torch import (~1-3s)

This PR eliminates the CUDA initialization overhead by:

  1. Skipping current_platform access (no CUDA API calls)
  2. Skipping cli_env_setup()
  3. Preventing CUDA initialization (even if torch is imported, CUDA init is lazy)

Next Steps

I've created issue #33741 to track optimizing the remaining torch import overhead. The solution would be to make CLI submodule imports lazy (only import when the subcommand is actually invoked, not during argument parsing).

vllm selftest Idea

Great suggestion! I've included it in #33741. A vllm selftest or vllm doctor command could:

  • Verify torch/backends load correctly
  • Test basic GPU/driver functionality
  • Output JSON diagnostics for bug reports
  • Provide basic benchmarking data

This would be super useful for debugging installation issues and collecting real-world performance data.

@AbhiOnGithub
Copy link
Copy Markdown
Contributor Author

Hi @vadimkantorov @NickLucche @chaunceyjiang , @aarnphm , @DarkLight1337 , @robertgshaw2-redhat
Can you please review it.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Feb 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Feb 9, 2026

Hi @AbhiOnGithub, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@AbhiOnGithub AbhiOnGithub force-pushed the fix-help-platform-detection branch from ad5f10f to ed2d51e Compare February 9, 2026 09:08
@AbhiOnGithub
Copy link
Copy Markdown
Contributor Author

Hi @vadimkantorov
please suggest are we good to merge or are there any steps for me.

@vadimkantorov
Copy link
Copy Markdown

@AbhiOnGithub I'm not a maintainer, just the reporter of the original issue. For me all is good. Maybe after this is merged, worth creating a separate issue for discussing/tracking removing import torch from the help code path

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 10, 2026
Comment thread vllm/entrypoints/cli/main.py Outdated
Comment thread vllm/entrypoints/cli/main.py Outdated
Comment thread vllm/entrypoints/cli/main.py Outdated
@AbhiOnGithub
Copy link
Copy Markdown
Contributor Author

@hmellor

✅ Done! I've made the following changes:

  1. Updated NEEDS_HELP in arg_utils.py to include -h flag
  2. Replaced all manual help checks with NEEDS_HELP throughout main.py
  3. Refactored the code to use if not NEEDS_HELP: instead of if NEEDS_HELP: pass else:

The code is now cleaner and follows DRY principle. Thanks for the review!

@mergify mergify Bot added the nvidia label Feb 13, 2026
@github-project-automation github-project-automation Bot moved this to In review in NVIDIA Feb 13, 2026
@AbhiOnGithub AbhiOnGithub requested a review from hmellor February 16, 2026 01:01
Comment thread vllm/entrypoints/cli/main.py Outdated
Comment thread vllm/entrypoints/cli/main.py Outdated
cli_env_setup()
# Only do environment setup if not showing help
if not needs_help:
cli_env_setup()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any CLI args that get their default from env?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No — [cli_env_setup()] only sets VLLM_WORKER_MULTIPROC_METHOD, which isn't used as a default for any CLI argument. All CLI arg defaults come from static dataclass field values, not [os.environ.get()]

The only tangential case is [CompilationConfig.compile_cache_save_format] which reads VLLM_COMPILE_CACHE_SAVE_FORMAT via a [default_factory],

but that's a sub-field of a JSON argument ([--compilation-config], not a standalone CLI arg. Skipping [cli_env_setup()] during [--help] is safe and won't affect displayed defaults."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't obvious to me. What is the conflict avoided by skipping this with showing_help is True?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The bench command block accesses platforms.current_platform.is_unspecified(), which triggers CUDA/platform detection. That detection can take ~10 seconds (or fail entirely without a GPU). When showing help, we just want to print usage info — we don't need to know the platform. Updated the comment in the latest commit to make this clearer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment, and the code comment update, are not related to the code this comment thread is under.

Please provide a manual response based on your own understanding (don't use claude or whatever to respond).

Copy link
Copy Markdown
Contributor Author

@AbhiOnGithub AbhiOnGithub Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @russellb ,my earlier reply was actually about the bench block below, not this code. Looking at it again, theres honestly no good reason to skip cli_env_setup()

All it does is set VLLM_WORKER_MULTIPROC_METHOD=spawn in the env — its instant and dosent trigger any CUDA or platform initialization at all.

I was just being cautious and wrapped everthing in help guard without thinking about it. Removed it now so cli_env_setup() just runs unconditionally again and thanks for pointing this out.

Code was assisted by Claude not comments :P

@AbhiOnGithub AbhiOnGithub force-pushed the fix-help-platform-detection branch 2 times, most recently from e044f2d to a4ad06f Compare February 17, 2026 08:41
Copy link
Copy Markdown
Member

@hmellor hmellor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for this change and for being responsive to feedback.

For future reference, if you can avoid force pushing, it would make the reviewing process easier because GitHub can show me only the changes I have not yet reviewed.

@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA Feb 17, 2026
@hmellor
Copy link
Copy Markdown
Member

hmellor commented Feb 17, 2026

You will need to merge from main to fix the docs build

Comment thread vllm/entrypoints/cli/main.py Outdated
Comment thread tests/entrypoints/test_utils_lazy_import.py Outdated
@github-project-automation github-project-automation Bot moved this from Ready to In review in NVIDIA Feb 17, 2026
Comment thread vllm/entrypoints/cli/main.py Outdated
@@ -14,6 +14,14 @@


def main():
# Check if help is requested before doing any heavy initialization
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import does not need to be lazy?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, here lazy import will not help

The actual cost is three heavy libraries imported eagerly:
torch ~1.3s (via vllm/init.py -> vllm.env_override)
transformers ~1.4s (via vllm.config.model -> transformers_utils.config)
fastapi+aio ~0.5s (via cli.serve -> api_server)

Comment thread vllm/entrypoints/cli/main.py Outdated
# For 'vllm bench *': use CPU instead of UnspecifiedPlatform by default.
# When showing help, skip this to avoid triggering CUDA/platform init
# (which can take ~10s or fail without a GPU).
if len(sys.argv) > 1 and sys.argv[1] == "bench" and not showing_help:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(sys.argv) > 1 and sys.argv[1] == "bench" and not showing_help:
if len(sys.argv) > 1 and sys.argv[1] == "bench" and not needs_help():

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, did you test vllm bench --help after this change? Does that still work OK?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes , please refer this comment
#33550 (comment)

Comment thread tests/entrypoints/test_cli_main.py Outdated
Comment on lines +23 to +37
def test_help_flag_skips_platform_detection(argv):
"""Test that help flags don't trigger platform detection."""
import vllm.platforms

vllm.platforms._current_platform = None

with patch.object(sys, "argv", argv), patch.object(sys, "exit"):
from vllm.entrypoints.cli.main import main

with contextlib.suppress(SystemExit):
main()

assert vllm.platforms._current_platform is None, (
f"Platform should not be detected when showing help with {argv}"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is safe. This could conflict with other tests in the same process by editing global state.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now replaced with 3 tests
test_needs_help_detects_help_flags
test_needs_help_returns_false_without_help_flags
test_bench_help_skips_platform_detection

having with statement

import vllm.platforms

# Reset platform detection state
vllm.platforms._current_platform = None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for this test: I don't think this is safe. This could conflict with other tests in the same process by editing global state.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes mutating vllm.platforms._current_platform = None is unsafe global state modification.

Now uses patch.object to temporarily set _current_platform = None , the original value is automatically restored when the with block exits, so no global state leaks between tests.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using unittest.mock.patch is just as unsafe for the same reasons.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @russellb , what is your suggestion/recomendation , what should I use here to make it safe.

Copy link
Copy Markdown
Contributor Author

@AbhiOnGithub AbhiOnGithub Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@russellb , IMHO the better approach is to run the import check in a subprocess.
This gives perfect isolation — no mutation of the current process's global state at all.

def test_utils_import_no_platform_detection():
    """Importing vllm.entrypoints.utils must not trigger platform detection."""
    # Run in a subprocess for full isolation — no mock / patch needed.
    result = subprocess.run(
        [
            sys.executable,
            "-c",
            ";".join([
                "import vllm.platforms",
                "import vllm.entrypoints.utils",
                "assert vllm.platforms._current_platform is None"
                ", 'Importing vllm.entrypoints.utils triggered detection'",
            ]),
        ],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, (
        f"Subprocess failed (rc={result.returncode}):\n"
        f"stdout: {result.stdout}\nstderr: {result.stderr}"
    )

Replaced unittest.mock.patch with subprocess isolation in [test_utils_lazy_import.py]
because patch.object combined with importlib.reload was masking a real bug where importing vllm.entrypoints.utils triggered platform detection through a top-level EngineArgs import.

Fixed the root cause by making EngineArgs a lazy import inside function bodies in vllm/entrypoints/utils.py.

Kept patch.object for sys.argv in [test_cli_main.py] since patching a plain list attribute is safe.

Simplified the bench help test to a focused unit test of the guard condition instead of running the full main() entrypoint in a subprocess.
refer my latest commit

Comment thread vllm/entrypoints/cli/main.py Outdated
# For 'vllm bench *': use CPU instead of UnspecifiedPlatform by default.
# When showing help, skip this to avoid triggering CUDA/platform init
# (which can take ~10s or fail without a GPU).
if len(sys.argv) > 1 and sys.argv[1] == "bench" and not showing_help:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, did you test vllm bench --help after this change? Does that still work OK?

@russellb russellb dismissed their stale review February 24, 2026 20:09

removing as I don't want to block this if I don't come back to it.

Comment thread vllm/tracing/otel.py
Comment thread vllm/entrypoints/utils.py Outdated
from vllm.utils.argparse_utils import FlexibleArgumentParser

if TYPE_CHECKING:
from vllm.engine.arg_utils import EngineArgs
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't do this because EngineArgs is used for more than just type checking

Copy link
Copy Markdown
Contributor Author

@AbhiOnGithub AbhiOnGithub Mar 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hmellor ,
thanks for the feedback! I've addressed all the review comments and rebased into a single clean commit. Here's what changed:

utils.py — EngineArgs is kept as a normal import (not TYPE_CHECKING), since it's used at runtime for isinstance() and constructor calls.
The only change here is moving current_platform from a module-level import to a lazy import inside get_max_tokens().

main.py — Using needs_help() inline as you suggested: if len(sys.argv) > 1 and sys.argv[1] == "bench" and not needs_help():

arg_utils.py — Converted NEEDS_HELP constant into a needs_help() function (reusable from main.py), added -h and --help=X support.

NEEDS_HELP isstill set at module level for backward compat.

otel.py — Dropped from this PR as you suggested.

Tests — Parametrized with @pytest.mark.parametrize, lazy import test uses subprocess isolation. All 10 tests pass locally, all pre-commit hooks

Scenario Upstream With this Fix
vllm serve -hNEEDS_HELP False (broken) True (fixed)
vllm serve --helpNEEDS_HELP True True
vllm serve --help=ModelConfigNEEDS_HELP True True

When NEEDS_HELP is False on upstream with -h:
• pre_register_and_update(parser) runs — triggers full CUDA platform detection
• Bench command platform override runs — triggers platform detection again

With this fix, -h is properly detected, those heavy calls are skipped.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Mar 3, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AbhiOnGithub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 3, 2026
@AbhiOnGithub AbhiOnGithub force-pushed the fix-help-platform-detection branch from f975ab9 to 705e0cc Compare March 14, 2026 15:38
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Mar 14, 2026

Hi @AbhiOnGithub, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify mergify Bot removed the needs-rebase label Mar 14, 2026
Signed-off-by: AbhiOnGithub <mail2abhishekgupta@gmail.com>
@AbhiOnGithub AbhiOnGithub force-pushed the fix-help-platform-detection branch from 705e0cc to e83d3a9 Compare March 14, 2026 16:22
Comment on lines +49 to +66
def test_bench_help_skips_platform_detection():
"""Test that the bench guard in main() is skipped when --help is present.

The guard in main.py is:
if sys.argv[1] == "bench" and not showing_help
When showing_help is True, current_platform is never accessed for
the bench CPU-override, avoiding unnecessary platform detection.
"""
from vllm.engine.arg_utils import needs_help

# Verify the guard: needs_help() == True means "not showing_help" is False,
# so the bench platform-override block is skipped.
with patch.object(sys, "argv", ["vllm", "bench", "--help"]):
assert needs_help(), "needs_help() should be True for bench --help"

# Without --help the guard would be entered
with patch.object(sys, "argv", ["vllm", "bench", "latency"]):
assert not needs_help(), "needs_help() should be False without --help"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates what is already done in the previous 2 tests?

Comment on lines +13 to +46
@pytest.mark.parametrize(
"argv",
[
["vllm", "--help"],
["vllm", "serve", "--help"],
["vllm", "-h"],
["vllm", "bench", "--help"],
["vllm", "serve", "--help=ModelConfig"],
],
)
def test_needs_help_detects_help_flags(argv):
"""Test that needs_help() correctly detects help flags in sys.argv."""
from vllm.engine.arg_utils import needs_help

# patch.object on sys.argv is safe — it's a simple list attribute
# with no lazy-init or side-effect machinery.
with patch.object(sys, "argv", argv):
assert needs_help(), f"needs_help() should return True for {argv}"


@pytest.mark.parametrize(
"argv",
[
["vllm", "serve", "--model", "test"],
["vllm", "bench", "latency", "--model", "test"],
["vllm", "collect-env"],
],
)
def test_needs_help_returns_false_without_help_flags(argv):
"""Test that needs_help() returns False when no help flag is present."""
from vllm.engine.arg_utils import needs_help

with patch.object(sys, "argv", argv):
assert not needs_help(), f"needs_help() should return False for {argv}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be merged into

def test_needs_help(argv, expected):
    ...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is strangely specific. Surely we just want to check that there's no vllm.platforms in the global scope, not that there is vllm.platforms in get_max_tokens?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

[Bug]: vllm serve --help still spends time on CUDA init

5 participants