[GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S) by sts07142 · Pull Request #39471 · vllm-project/vllm

sts07142 · 2026-04-10T03:15:34Z

Purpose

Support non-standard quant types with prefix (e.g. UD-IQ1_S )

Test Plan

vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B

Test Result

before

vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:299]
(APIServer pid=2603210) INFO 04-10 09:33:46 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
(APIServer pid=2603210)     hf_hub_download(
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210)     validate_repo_id(arg_value)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210)     raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 721, in _get_config_dict
(APIServer pid=2603210)     resolved_config_file = cached_file(
(APIServer pid=2603210)                            ^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 322, in cached_file
(APIServer pid=2603210)     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 532, in cached_files
(APIServer pid=2603210)     _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision, repo_type)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 143, in _get_cache_file_to_return
(APIServer pid=2603210)     resolved_file = try_to_load_from_cache(
(APIServer pid=2603210)                     ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
(APIServer pid=2603210)     validate_repo_id(arg_value)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
(APIServer pid=2603210)     raise HFValidationError(
(APIServer pid=2603210) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars, '-', '_' or '.'. The name cannot start or end with '-' or '.' and the maximum length is 96: 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'.
(APIServer pid=2603210)
(APIServer pid=2603210) During handling of the above exception, another exception occurred:
(APIServer pid=2603210)
(APIServer pid=2603210) Traceback (most recent call last):
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=2603210)     sys.exit(main())
(APIServer pid=2603210)              ^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=2603210)     args.dispatch_function(args)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=2603210)     uvloop.run(run_server(args))
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2603210)     return __asyncio.run(
(APIServer pid=2603210)            ^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2603210)     return runner.run(main)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2603210)     return self._loop.run_until_complete(task)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2603210)     return await main
(APIServer pid=2603210)            ^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 686, in run_server
(APIServer pid=2603210)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 700, in run_server_worker
(APIServer pid=2603210)     async with build_async_engine_client(
(APIServer pid=2603210)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210)     return await anext(self.gen)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2603210)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2603210)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2603210)     return await anext(self.gen)
(APIServer pid=2603210)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 124, in build_async_engine_client_from_engine_args
(APIServer pid=2603210)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=2603210)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/engine/arg_utils.py", line 1574, in create_engine_config
(APIServer pid=2603210)     maybe_override_with_speculators(
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/vllm/transformers_utils/config.py", line 584, in maybe_override_with_speculators
(APIServer pid=2603210)     config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=2603210)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 662, in get_config_dict
(APIServer pid=2603210)     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
(APIServer pid=2603210)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2603210)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 744, in _get_config_dict
(APIServer pid=2603210)     raise OSError(
(APIServer pid=2603210) OSError: Can't load the configuration of 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S' is the correct path to a directory containing a config.json file

after

vllm serve unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S --tokenizer Qwen/Qwen3-0.6B
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:299]
(APIServer pid=2598019) INFO 04-10 09:31:38 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'model': 'unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', 'tokenizer': 'Qwen/Qwen3-0.6B'}
(APIServer pid=2598019) WARNING 04-10 09:31:38 [gguf_utils.py:60] Non-standard GGUF quant type 'UD-IQ1_S' detected.
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:554] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=2598019) INFO 04-10 09:31:39 [model.py:1684] Using max model len 40960
(APIServer pid=2598019) INFO 04-10 09:31:39 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=2598019) INFO 04-10 09:31:39 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=2598461) INFO 04-10 09:31:44 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev122+g83aea2147) with config: model='unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=gguf, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.10:52857 backend=nccl
(EngineCore pid=2598461) INFO 04-10 09:31:44 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=2598461) INFO 04-10 09:31:45 [gpu_model_runner.py:4735] Starting to load model unsloth/Qwen3-0.6B-GGUF:UD-IQ1_S...
Qwen3-0.6B-UD-IQ1_S.gguf: 100%|████████████████████████████████████████████████████████████████████████████████| 215M/215M [00:21<00:00, 9.93MB/s]
(EngineCore pid=2598461) INFO 04-10 09:32:07 [weight_utils.py:615] Time spent downloading weights for unsloth/Qwen3-0.6B-GGUF: 22.251029 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:16 [cuda.py:362] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=2598461) INFO 04-10 09:32:16 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=2598461) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=2598461) INFO 04-10 09:32:21 [gpu_model_runner.py:4820] Model loading took 0.22 GiB memory and 35.669191 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1055] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/564aa12500/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=2598461) INFO 04-10 09:32:24 [backends.py:1115] Dynamo bytecode transform time: 3.07 s
(EngineCore pid=2598461) INFO 04-10 09:32:26 [backends.py:373] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=2598461) INFO 04-10 09:32:33 [backends.py:391] Compiling a graph for compile range (1, 2048) takes 8.53 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [decorators.py:655] saved AOT compiled function to /home/name/.cache/vllm/torch_compile_cache/torch_aot_compile/d5db8a5d1bc2f897526bb947908032d2f1ae13b65f8af58e817018da7e2e59ce/rank_0_0/model
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:48] torch.compile took 13.63 s in total
(EngineCore pid=2598461) INFO 04-10 09:32:35 [monitor.py:76] Initial profiling/warmup run took 0.24 s
(EngineCore pid=2598461) INFO 04-10 09:32:35 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=2598461) INFO 04-10 09:32:35 [gpu_model_runner.py:5893] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_model_runner.py:5972] Estimated CUDA graph memory: 0.64 GiB total
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:436] Available KV cache memory: 20.23 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:36 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9274 to maintain the same effective KV cache size.
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1319] GPU KV cache size: 189,408 tokens
(EngineCore pid=2598461) INFO 04-10 09:32:36 [kv_cache_utils.py:1324] Maximum concurrency for 40,960 tokens per request: 4.62x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████| 51/51 [00:01<00:00, 44.49it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 52.28it/s]
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_model_runner.py:6063] Graph capturing finished in 2 secs, took 0.72 GiB
(EngineCore pid=2598461) INFO 04-10 09:32:39 [gpu_worker.py:597] CUDA graph pool memory: 0.72 GiB (actual), 0.64 GiB (estimated), difference: 0.07 GiB (10.1%).
(EngineCore pid=2598461) INFO 04-10 09:32:39 [core.py:285] init engine (profile, create kv cache, warmup model) took 18.07 seconds
(EngineCore pid=2598461) INFO 04-10 09:32:41 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=2598461) INFO 04-10 09:32:41 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=2598019) INFO 04-10 09:32:41 [api_server.py:606] Supported tasks: ['generate']
(APIServer pid=2598019) INFO 04-10 09:32:43 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2598019) INFO 04-10 09:32:43 [api_server.py:610] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:37] Available routes are:
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=2598019) INFO 04-10 09:32:43 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=2598019) INFO:     Started server process [2598019]
(APIServer pid=2598019) INFO:     Waiting for application startup.
(APIServer pid=2598019) INFO:     Application startup complete.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Injae Ryou <injaeryou@gmail.com>

gemini-code-assist

Code Review

This pull request introduces support for non-standard GGUF quantization types that use dash-separated prefixes, such as 'UD-Q4_K_XL'. The is_remote_gguf function was updated to recognize these types by validating the suffix after the last dash, and a new helper function is_nonstandard_gguf_quant_type was added. Additionally, the error message in split_remote_gguf was updated to reflect this support, and comprehensive unit tests were included to cover various prefix scenarios. I have no feedback to provide.

Isotr0py

LGTM, thanks!

…llm-project#39471) Signed-off-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

…llm-project#39471) Signed-off-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

sts07142 added 2 commits April 10, 2026 12:18

[GGUF] Support non-standard quant types with prefix (e.g. UD-Q4_K_XL)

98c65a3

Signed-off-by: Injae Ryou <injaeryou@gmail.com>

[GGUF] Add tests for non-standard quant type validation

eed3842

Signed-off-by: Injae Ryou <injaeryou@gmail.com>

sts07142 force-pushed the fix/gguf-nonstandard-quant-type branch from 436bbc6 to eed3842 Compare April 10, 2026 03:20

gemini-code-assist Bot reviewed Apr 10, 2026

View reviewed changes

Isotr0py approved these changes Apr 10, 2026

View reviewed changes

Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 10, 2026

Merge branch 'main' into fix/gguf-nonstandard-quant-type

9de5730

Isotr0py enabled auto-merge (squash) April 10, 2026 05:47

Isotr0py merged commit 447ce22 into vllm-project:main Apr 10, 2026
48 checks passed

This was referenced Apr 12, 2026

fix: handle vendor-prefixed GGUF quant types (e.g., UD-Q4_K_XL) #39470

Closed

[Bug]: HFValidationError when trying to run a GGUF model with quants #39198

Closed

whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026

[GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S) (v…

174510b

…llm-project#39471) Signed-off-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

lowlyocean mentioned this pull request May 15, 2026

[Bug]: UD-IQ2_M not supported #42734

Closed

1 task

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S) (v…

f8411b1

…llm-project#39471) Signed-off-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Sunt-ing mentioned this pull request Jun 1, 2026

[Bugfix][GGUF] Accept file-type-only quant types (IQ2_M, IQ3_XS, ...) in remote GGUF model IDs #44218

Open

4 tasks

sts07142 deleted the fix/gguf-nonstandard-quant-type branch June 12, 2026 00:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S)#39471

[GGUF] Support non-standard quant types with prefix (e.g. UD-IQ1_S)#39471
Isotr0py merged 3 commits into
vllm-project:mainfrom
sts07142:fix/gguf-nonstandard-quant-type

sts07142 commented Apr 10, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Isotr0py left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sts07142 commented Apr 10, 2026

Purpose

Test Plan

Test Result

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Isotr0py left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants