Your current environment
Environment:
vLLM Version: vllm/vllm-openai:v0.10.1
Model: Qwen3-4B-Thinking-2507-Q4_1.gguf (GGUF format)
docker run -itd --gpus all -e CUDA_VISIBLE_DEVICES=0 \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_LOG_LEVEL=TRACE \
  --name vllm_gguf-tools-Qwen3-4B-Thinking \
  --network lt01net \
  -v /mnt/g/models/:/models/ \
  -v /mnt/e/:/mnt/e/ \
  -p 30022:8000 \
  --env TZ=Asia/Shanghai \
  --ipc=host \
  --shm-size=8g \
  --entrypoint vllm \
  vllm/vllm-openai:v0.10.1 \
  serve /models/unsloth/Qwen3-4B-Thinking-2507-GGUF/Qwen3-4B-Thinking-2507-Q4_1.gguf \
  --served-model-name Qwen3-4B-Thinking-2507-GGUF \
  --swap-space 8 \
  --reasoning-parser deepseek_r1 \
  --gpu-memory-utilization 0.75 \
  --enable-auto-tool-choice \
  --max-model-len 1000 \
  --tool-call-parser hermes \
  --kv-cache-dtype fp8 \
  --no-enable-chunked-prefill \
  --max-num-seqs 4 \
  --enable-log-requests \
  --enable-log-outputs \
  --max-log-len 2048 \
  --uvicorn-log-level trace
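As a sanity check that the flag itself is recognized by this image, the CLI help can be inspected (a sketch only; the exact help text may differ between versions):

docker run --rm --entrypoint vllm vllm/vllm-openai:v0.10.1 serve --help | grep -i "log-outputs"
# The flag is listed, and the server above starts without complaining about
# unknown arguments, so --enable-log-outputs is being accepted.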
🐛 Describe the bug
The --enable-log-outputs flag in vLLM (version vllm/vllm-openai:v0.10.1) does not log model outputs in the server logs, despite being enabled alongside --enable-log-requests, VLLM_LOG_LEVEL=TRACE, and --uvicorn-log-level trace. The logs show request details (e.g., prompt and sampling parameters) and server status (HTTP 200 OK), but the generated output text is missing.
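For reference, the requests were plain chat completions. A minimal sketch of the kind of call that produced the logs below (our actual client differs, but the model name, prompt, and max_tokens match the first logged request):

curl http://localhost:30022/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-4B-Thinking-2507-GGUF",
        "messages": [{"role": "user", "content": "你好,请简单介绍一下自己。"}],
        "max_tokens": 100
      }'

The response comes back to the client as expected (HTTP 200 with roughly 1 KB of JSON body, per the ASGI trace below), so generation itself works; only the server-side output logging is missing.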
The generated text is nowhere in the logs. Our server logs:
INFO 09-10 19:10:43 [__init__.py:241] Automatically detected platform cuda.
INFO 09-10 19:10:44 [llm_engine.py:222] Initializing a V0 LLM engine (v0.10.1) with config: model='/models/unsloth/Qwen3-4B-Thinking-2507-GGUF/Qwen3-4B-Thinking-2507-Q4_1.gguf', speculative_config=None, tokenizer='/models/unsloth/Qwen3-4B-Thinking-2507-GGUF/Qwen3-4B-Thinking-2507-Q4_1.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1000, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-4B-Thinking-2507-GGUF, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{"enable_fusion":false,"enable_noop":false},"max_capture_size":4,"local_cache_dir":null}, use_cached_outputs=True,
WARNING 09-10 19:11:04 [interface.py:389] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 09-10 19:11:04 [cuda.py:354] Using FlashInfer backend.
INFO 09-10 19:11:04 [cuda.py:358] Using HND KV cache layout on V1 engine by default for Blackwell (SM 10.0) GPUs.
INFO 09-10 19:11:04 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-10 19:11:04 [model_runner.py:1080] Starting to load model /models/unsloth/Qwen3-4B-Thinking-2507-GGUF/Qwen3-4B-Thinking-2507-Q4_1.gguf...
INFO 09-10 19:11:24 [model_runner.py:1112] Model loading took 2.4875 GiB and 19.238120 seconds
INFO 09-10 19:11:25 [flashinfer.py:216] Using KV cache layout HND
INFO 09-10 19:11:41 [worker.py:295] Memory profiling takes 16.23 seconds
INFO 09-10 19:11:41 [worker.py:295] the current vLLM instance can use total_gpu_memory (7.96GiB) x gpu_memory_utilization (0.75) = 5.97GiB
INFO 09-10 19:11:41 [worker.py:295] model weights take 2.49GiB; non_torch_memory takes 0.02GiB; PyTorch activation peak memory takes 0.51GiB; the rest of the memory reserved for KV Cache is 2.96GiB.
INFO 09-10 19:11:41 [executor_base.py:114] # cuda blocks: 2692, # CPU blocks: 7281
INFO 09-10 19:11:41 [executor_base.py:119] Maximum concurrency for 1000 tokens per request: 43.07x
INFO 09-10 19:11:42 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization
or switching to eager mode. You can also reduce the max_num_seqs
as needed to decrease memory usage.
Capturing CUDA graph shapes: 0%| | 0/3 [00:00<?, ?it/s]/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00, 5.08s/it]
INFO 09-10 19:11:57 [model_runner.py:1535] Graph capturing finished in 15 secs, took -0.01 GiB
INFO 09-10 19:11:57 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 32.93 seconds
(APIServer pid=1) INFO 09-10 19:11:57 [api_server.py:1611] Supported_tasks: ['generate']
(APIServer pid=1) INFO 09-10 19:12:08 [serving_responses.py:149] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=1) INFO 09-10 19:12:08 [serving_chat.py:94] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=1) INFO 09-10 19:12:30 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:29] Available routes are:
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 09-10 19:12:30 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) TRACE: ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}, 'state': {}}
(APIServer pid=1) TRACE: ASGI [1] Receive {'type': 'lifespan.startup'}
(APIServer pid=1) TRACE: ASGI [1] Send {'type': 'lifespan.startup.complete'}
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) TRACE: 172.18.0.1:54130 - HTTP connection made
(APIServer pid=1) TRACE: 172.18.0.1:54130 - ASGI [2] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.1', 'server': ('172.18.0.3', 8000), 'client': ('172.18.0.1', 54130), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'POST', 'path': '/v1/chat/completions', 'raw_path': b'/v1/chat/completions', 'query_string': b''}
(APIServer pid=1) TRACE: 172.18.0.1:54130 - ASGI [2] Receive {'type': 'http.request', 'body': '<145 bytes>', 'more_body': False}
(APIServer pid=1) INFO 09-10 19:12:40 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set --chat-template-content-format
to override this.
(APIServer pid=1) INFO 09-10 19:12:40 [logger.py:40] Received request chatcmpl-83e8e8272c0843959a6f4ed9442f0b9a: prompt: '<|im_start|>user\n你好,请简单介绍一下自己。<|im_end|>\n<|im_start|>assistant\n\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
INFO 09-10 19:12:40 [engine.py:326] Added request chatcmpl-83e8e8272c0843959a6f4ed9442f0b9a.
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
INFO 09-10 19:12:57 [metrics.py:386] Avg prompt throughput: 0.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
(APIServer pid=1) TRACE: 172.18.0.1:54130 - ASGI [2] Send {'type': 'http.response.start', 'status': 200, 'headers': '<...>'}
(APIServer pid=1) INFO: 172.18.0.1:54130 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) TRACE: 172.18.0.1:54130 - ASGI [2] Send {'type': 'http.response.body', 'body': '<1074 bytes>'}
(APIServer pid=1) TRACE: 172.18.0.1:54130 - ASGI [2] Completed
(APIServer pid=1) TRACE: 172.18.0.1:54130 - HTTP connection lost
(APIServer pid=1) TRACE: 172.18.0.1:47262 - HTTP connection made
(APIServer pid=1) TRACE: 172.18.0.1:47262 - ASGI [3] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.1', 'server': ('172.18.0.3', 8000), 'client': ('172.18.0.1', 47262), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'POST', 'path': '/v1/chat/completions', 'raw_path': b'/v1/chat/completions', 'query_string': b''}
(APIServer pid=1) TRACE: 172.18.0.1:47262 - ASGI [3] Receive {'type': 'http.request', 'body': '<212 bytes>', 'more_body': False}
(APIServer pid=1) INFO 09-10 19:13:01 [logger.py:40] Received request chatcmpl-47e9114ab2e640f19aee75ba784fd102: prompt: '<|im_start|>user\n你好,请简单介绍一下AI。<|im_end|>\n<|im_start|>user\n你好,请简单介绍一下自己。<|im_end|>\n<|im_start|>assistant\n\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None.
INFO 09-10 19:13:01 [engine.py:326] Added request chatcmpl-47e9114ab2e640f19aee75ba784fd102.
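Filtering the container logs for one of the request IDs shows only the "Received request" and "Added request" entries, with nothing containing the generated text (a sketch of the check, using the container name from the command above):

docker logs vllm_gguf-tools-Qwen3-4B-Thinking 2>&1 \
  | grep chatcmpl-83e8e8272c0843959a6f4ed9442f0b9a
# Prints only the "Received request ..." and "Added request ..." lines shown
# above; no line containing the generated output text is emitted.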