[GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S) #39471

Merged
Isotr0py merged 3 commits into vllm-project:main from sts07142:fix/gguf-nonstandard-quant-type on Apr 10, 2026

@sts07142

Contributor

Purpose

Support non-standard GGUF quant types that carry a prefix (e.g. UD-IQ1_S).

Fixes: #39469

Test Plan

vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B

Test Result

Before (startup fails with HFValidationError, then OSError):
vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
(APIServer pid=2603210)     hf_hub_download(
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210)     validate_repo_id(arg_value)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210)     raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
(APIServer pid=2603210)     resolved_config_file = cached_file(
(APIServer pid=2603210)                            ^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 322, in cached_file
(APIServer pid=2603210)     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 532, in cached_files
(APIServer pid=2603210)     _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
(APIServer pid=2603210)     resolved_file = try_to_load_from_cache(
(APIServer pid=2603210)                     ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210)     validate_repo_id(arg_value)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210)     raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=2603210)     sys.exit(main())
(APIServer pid=2603210)              ^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=2603210)     args.dispatch_function(args)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=2603210)     uvloop.run(run_server(args))
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2603210)     return __asyncio.run(
(APIServer pid=2603210)            ^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2603210)     return runner.run(main)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2603210)     return self._loop.run_until_complete(task)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2603210)     return await main
(APIServer pid=2603210)            ^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 686, in run_server
(APIServer pid=2603210)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 700, in run_server_worker
(APIServer pid=2603210)     async with build_async_engine_client(
(APIServer pid=2603210)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210)     return await anext(self.gen)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2603210)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2603210)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210)     return await anext(self.gen)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 124, in build_async_engine_client_from_engine_args
(APIServer pid=2603210)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=2603210)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/engine/arg_utils.py", line 1574, in create_engine_config
(APIServer pid=2603210)     maybe_override_with_speculators(
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/transformers_utils/config.py", line 584, in maybe_override_with_speculators
(APIServer pid=2603210)     config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=2603210)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
(APIServer pid=2603210)     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
(APIServer pid=2603210)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
(APIServer pid=2603210)     raise OSError(
(APIServer pid=2603210) OSError: Can't load the configuration of 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S' is the correct path to a directory containing a config.json file
After (the non-standard quant type is logged as a warning and the server starts):
vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2598019) WARNING 04-10 09:31:38 [gguf_utils.py:60] Non-standard GGUF quant type 'UD-IQ1_S' detected.
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:554] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:1684] Using max model len 40960
(APIServer pid=2598019) INFO 04-10 09:31:39 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=2598019) INFO 04-10 09:31:39 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=2598461) INFO 04-10 09:31:44 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev122+g83aea2147) with config: model='unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=gguf, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.10:52857 backend=nccl
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=2598461) INFO 04-10 09:31:45 [gpu_model_runner.py:4735] Starting to load model unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S...
Qwen3-0.6B-UD-IQ1_S.gguf: 100%|████████████████████████████████████████████████████████████████████████████████| 215M/215M [00:21<00:00, 9.93MB/s]
(EngineCore pid=2598461) INFO 04-10 09:32:07 [weight_utils.py:615] Time spent downloading weights for unsloth/Qwen3-0.6B-GGUF: 22.251029 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:16 [cuda.py:362] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=2598461) INFO 04-10 09:32:16 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=2598461) INFO 04-10 09:32:21 [gpu_model_runner.py:4820] Model loading took 0.22 GiB memory and 35.669191 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1055] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/564aa12500/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1115] Dynamo bytecode transform time: 3.07 s
(EngineCore pid=2598461) INFO 04-10 09:32:26 [backends.py:373] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=2598461) INFO 04-10 09:32:33 [backends.py:391] Compiling a graph for compile range (1, 2048) takes 8.53 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [decorators.py:655] saved AOT compiled function to /home/name/.cache/vllm/torch_compile_cache/torch_aot_compile/d5db8a5d1bc2f897526bb947908032d2f1ae13b65f8af58e817018da7e2e59ce/rank_0_0/model
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:48] torch.compile took 13.63 s in total
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:76] Initial profiling/warmup run took 0.24 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=2598461) INFO 04-10 09:32:35 [gpu_model_runner.py:5893] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_model_runner.py:5972] Estimated CUDA graph memory: 0.64 GiB total
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:436] Available KV cache memory: 20.23 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9274 to maintain the same effective KV cache size.
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1319] GPU KV cache size: 189,408 tokens
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1324] Maximum concurrency for 40,960 tokens per request: 4.62x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████| 51/51 [00:01<00:00, 44.49it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 52.28it/s]
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_model_runner.py:6063] Graph capturing finished in 2 secs, took 0.72 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_worker.py:597] CUDA graph pool memory: 0.72 GiB (actual), 0.64 GiB (estimated), difference: 0.07 GiB (10.1%).
(EngineCore pid=2598461) INFO 04-10 09:32:39 [core.py:285] init engine (profile, create kv cache, warmup model) took 18.07 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:41 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=2598461) INFO 04-10 09:32:41 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=2598019) INFO 04-10 09:32:41 [api_server.py:606] Supported tasks: ['generate']
(APIServer pid=2598019) INFO 04-10 09:32:43 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2598019) INFO 04-10 09:32:43 [api_server.py:610] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:37] Available routes are:
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=2598019) INFO:     Started server process [2598019]
(APIServer pid=2598019) INFO:     Waiting for application startup.
(APIServer pid=2598019) INFO:     Application startup complete.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Signed-off-by: Injae Ryou <injaeryou@gmail.com>
@sts07142 sts07142 force-pushed the fix/gguf-nonstandard-quant-type branch from 436bbc6 to eed3842 Compare April 10, 2026 03:20

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for non-standard GGUF quantization types that use dash-separated prefixes, such as 'UD-Q4_K_XL'. The is_remote_gguf function was updated to recognize these types by validating the suffix after the last dash, and a new helper function is_nonstandard_gguf_quant_type was added. Additionally, the error message in split_remote_gguf was updated to reflect this support, and comprehensive unit tests were included to cover various prefix scenarios. I have no feedback to provide.
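The suffix check described in the review can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `is_standard_quant_type`, `is_nonstandard_quant_type`, and `split_model_ref`, and the small `KNOWN_QUANT_TYPES` set, are hypothetical stand-ins chosen for the example.

```python
# Hypothetical, illustrative subset of base quant type names.
# The real list used by vLLM is not reproduced here.
KNOWN_QUANT_TYPES = {"Q4_0", "Q4_K", "Q5_K", "Q6_K", "Q8_0", "IQ1_S", "F16", "BF16"}


def is_standard_quant_type(quant_type: str) -> bool:
    """True if the name is one of the known base quant types."""
    return quant_type in KNOWN_QUANT_TYPES


def is_nonstandard_quant_type(quant_type: str) -> bool:
    """True for a prefixed name whose text after the last dash is a known type."""
    prefix, dash, suffix = quant_type.rpartition("-")
    # Require a non-empty prefix and an actual dash, e.g. 'UD-IQ1_S'.
    return bool(prefix) and bool(dash) and is_standard_quant_type(suffix)


def split_model_ref(model: str) -> tuple[str, str]:
    """Split 'org/repo:QUANT' into (repo_id, quant_type), validating the quant type."""
    repo_id, sep, quant_type = model.rpartition(":")
    if not sep or "/" not in repo_id:
        raise ValueError(f"Expected 'org/repo:quant_type', got {model!r}")
    if not (is_standard_quant_type(quant_type) or is_nonstandard_quant_type(quant_type)):
        raise ValueError(f"Unrecognized GGUF quant type {quant_type!r}")
    return repo_id, quant_type


if __name__ == "__main__":
    print(split_model_ref("unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S"))
```

Splitting the quant type off before anything reaches the Hugging Face Hub matters because a repo id containing `:` is rejected by `validate_repo_id`, which is the failure shown in the "before" log.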

@Isotr0py Isotr0py left a comment

Member

LGTM, thanks!

@Isotr0py Isotr0py added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 10, 2026
@Isotr0py Isotr0py enabled auto-merge (squash) April 10, 2026 05:47
@Isotr0py Isotr0py merged commit 447ce22 into vllm-project:main Apr 10, 2026
48 checks passed
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026
…llm-project#39471)

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
…llm-project#39471)

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…llm-project#39471)

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@lowlyocean lowlyocean mentioned this pull request May 15, 2026
1 task
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…llm-project#39471)

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…llm-project#39471)

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…llm-project#39471)

Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@sts07142 sts07142 deleted the fix/gguf-nonstandard-quant-type branch June 12, 2026 00:15

Labels

ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues.

[Feature]: Support non-standard GGUF quant type prefixes (e.g. Unsloth Dynamic UD-IQ1_S)

2 participants