[GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S) #39471
Merged
Isotr0py merged 3 commits on Apr 10, 2026
Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Code Review
This pull request introduces support for non-standard GGUF quantization types that use dash-separated prefixes, such as 'UD-Q4_K_XL'. The is_remote_gguf function was updated to recognize these types by validating the suffix after the last dash, and a new helper function is_nonstandard_gguf_quant_type was added. Additionally, the error message in split_remote_gguf was updated to reflect this support, and comprehensive unit tests were included to cover various prefix scenarios. I have no feedback to provide.
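The suffix check described in the review can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: `STANDARD_QUANT_TYPES` is a small hand-picked sample of quant type names, the size-suffix handling is assumed, and both function names (ending in `_sketch`) are hypothetical.

```python
# Hypothetical sketch: a hand-picked subset of standard quant type names (assumption).
STANDARD_QUANT_TYPES = {"Q4_0", "Q4_K", "Q5_K", "Q6_K", "Q8_0", "IQ1_S", "IQ2_XXS"}

# Assumed size suffixes that may follow a base type (e.g. Q4_K_M, Q4_K_XL).
SIZE_SUFFIXES = ("_XXS", "_XS", "_XL", "_S", "_M", "_L")


def is_standard_quant_type_sketch(name: str) -> bool:
    """Return True for a known base type, or a known base type plus one size suffix."""
    if name in STANDARD_QUANT_TYPES:
        return True
    for suffix in SIZE_SUFFIXES:
        if name.endswith(suffix) and name[: -len(suffix)] in STANDARD_QUANT_TYPES:
            return True
    return False


def is_nonstandard_quant_type_sketch(name: str) -> bool:
    """Return True for names such as 'UD-IQ1_S': a non-empty prefix, a dash,
    and a standard quant type after the last dash."""
    prefix, dash, suffix = name.rpartition("-")
    return bool(dash) and bool(prefix) and is_standard_quant_type_sketch(suffix)
```

With this sample set, `UD-IQ1_S` and `UD-Q4_K_XL` are treated as prefixed types, while a bare `IQ1_S` or a name with an unknown suffix such as `UD-FOO` is not.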
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request on Apr 13, 2026: …llm-project#39471) Signed-off-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Purpose
Support non-standard quant types with prefix (e.g. UD-IQ1_S).
Fixes: #39469
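For context, the model argument used in the test runs has the form `repo_id:quant_type`, and the successful run downloads a file named `Qwen3-0.6B-UD-IQ1_S.gguf`. A minimal sketch of how such a reference might be split and matched against a file name is given here. It assumes the last colon separates the two parts and that the quant type appears just before the `.gguf` extension; `split_model_ref_sketch` and `matches_quant_type_sketch` are hypothetical helpers, not vLLM's `split_remote_gguf`.

```python
from fnmatch import fnmatch


def split_model_ref_sketch(model: str) -> tuple[str, str]:
    """Split 'repo_id:quant_type' at the last colon (assumed format)."""
    repo_id, sep, quant_type = model.rpartition(":")
    if not sep or not repo_id or not quant_type:
        raise ValueError(f"expected 'repo_id:quant_type', got {model!r}")
    return repo_id, quant_type


def matches_quant_type_sketch(filename: str, quant_type: str) -> bool:
    """Check whether a file name ends with '<quant_type>.gguf' (assumed naming pattern)."""
    return fnmatch(filename, f"*{quant_type}.gguf")


repo_id, quant_type = split_model_ref_sketch("unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S")
print(repo_id)     # unsloth/Qwen3-0.6B-GGUF
print(quant_type)  # UD-IQ1_S
print(matches_quant_type_sketch("Qwen3-0.6B-UD-IQ1_S.gguf", quant_type))  # True
```

Splitting at the last colon keeps the dash-separated prefix attached to the quant type, so the whole string `UD-IQ1_S` is used when looking for the file.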
Test Plan
Test Result
before
vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299] █▄█▀ █ █ █ █ model unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
(APIServer pid=2603210) hf_hub_download(
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210) validate_repo_id(arg_value)
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210) raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
(APIServer pid=2603210) resolved_config_file = cached_file(
(APIServer pid=2603210) ^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 322, in cached_file
(APIServer pid=2603210) file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 532, in cached_files
(APIServer pid=2603210) _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
(APIServer pid=2603210) resolved_file = try_to_load_from_cache(
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210) validate_repo_id(arg_value)
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210) raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=2603210) sys.exit(main())
(APIServer pid=2603210) ^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=2603210) args.dispatch_function(args)
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=2603210) uvloop.run(run_server(args))
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2603210) return __asyncio.run(
(APIServer pid=2603210) ^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2603210) return runner.run(main)
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2603210) return self._loop.run_until_complete(task)
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2603210) return await main
(APIServer pid=2603210) ^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 686, in run_server
(APIServer pid=2603210) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 700, in run_server_worker
(APIServer pid=2603210) async with build_async_engine_client(
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210) return await anext(self.gen)
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2603210) async with build_async_engine_client_from_engine_args(
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210) return await anext(self.gen)
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 124, in build_async_engine_client_from_engine_args
(APIServer pid=2603210) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/engine/arg_utils.py", line 1574, in create_engine_config
(APIServer pid=2603210) maybe_override_with_speculators(
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/vllm/transformers_utils/config.py", line 584, in maybe_override_with_speculators
(APIServer pid=2603210) config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
(APIServer pid=2603210) config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
(APIServer pid=2603210) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210) File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
(APIServer pid=2603210) raise OSError(
(APIServer pid=2603210) OSError: Can't load the configuration of 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S' is the correct path to a directory containing a config.json file
after
vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299] █▄█▀ █ █ █ █ model unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2598019) WARNING 04-10 09:31:38 [gguf_utils.py:60] Non-standard GGUF quant type 'UD-IQ1_S' detected.
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:554] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:1684] Using max model len 40960
(APIServer pid=2598019) INFO 04-10 09:31:39 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=2598019) INFO 04-10 09:31:39 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=2598461) INFO 04-10 09:31:44 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev122+g83aea2147) with config: model='unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=gguf, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.10:52857 backend=nccl
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=2598461) INFO 04-10 09:31:45 [gpu_model_runner.py:4735] Starting to load model unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S...
Qwen3-0.6B-UD-IQ1_S.gguf: 100%|████████████████████████████████████████████████████████████████████████████████| 215M/215M [00:21<00:00, 9.93MB/s]
(EngineCore pid=2598461) INFO 04-10 09:32:07 [weight_utils.py:615] Time spent downloading weights for unsloth/Qwen3-0.6B-GGUF: 22.251029 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:16 [cuda.py:362] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=2598461) INFO 04-10 09:32:16 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=2598461) INFO 04-10 09:32:21 [gpu_model_runner.py:4820] Model loading took 0.22 GiB memory and 35.669191 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1055] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/564aa12500/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1115] Dynamo bytecode transform time: 3.07 s
(EngineCore pid=2598461) INFO 04-10 09:32:26 [backends.py:373] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=2598461) INFO 04-10 09:32:33 [backends.py:391] Compiling a graph for compile range (1, 2048) takes 8.53 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [decorators.py:655] saved AOT compiled function to /home/name/.cache/vllm/torch_compile_cache/torch_aot_compile/d5db8a5d1bc2f897526bb947908032d2f1ae13b65f8af58e817018da7e2e59ce/rank_0_0/model
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:48] torch.compile took 13.63 s in total
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:76] Initial profiling/warmup run took 0.24 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=2598461) INFO 04-10 09:32:35 [gpu_model_runner.py:5893] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_model_runner.py:5972] Estimated CUDA graph memory: 0.64 GiB total
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:436] Available KV cache memory: 20.23 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9274 to maintain the same effective KV cache size.
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1319] GPU KV cache size: 189,408 tokens
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1324] Maximum concurrency for 40,960 tokens per request: 4.62x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████| 51/51 [00:01<00:00, 44.49it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 52.28it/s]
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_model_runner.py:6063] Graph capturing finished in 2 secs, took 0.72 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_worker.py:597] CUDA graph pool memory: 0.72 GiB (actual), 0.64 GiB (estimated), difference: 0.07 GiB (10.1%).
(EngineCore pid=2598461) INFO 04-10 09:32:39 [core.py:285] init engine (profile, create kv cache, warmup model) took 18.07 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:41 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=2598461) INFO 04-10 09:32:41 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=2598019) INFO 04-10 09:32:41 [api_server.py:606] Supported tasks: ['generate']
(APIServer pid=2598019) INFO 04-10 09:32:43 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2598019) INFO 04-10 09:32:43 [api_server.py:610] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:37] Available routes are:
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=2598019) INFO: Started server process [2598019]
(APIServer pid=2598019) INFO: Waiting for application startup.
(APIServer pid=2598019) INFO: Application startup complete.