Added regression test for openai/harmony/issues/78 #29830

Open
jacobthebanana wants to merge 4 commits into vllm-project:main from VectorInstitute:harmony-issue-78-regression
Conversation

@jacobthebanana (Contributor) commented Dec 2, 2025

Purpose

Add a regression test for openai/harmony#78, in particular covering the FastAPI `vllm serve` path, which was not fully addressed in #26185.

Test Plan

Modify the existing `test_output_messages_enabled` test in `test_response_api_with_harmony.py` to add the following validations:

    for _message in [*response.input_messages, *response.output_messages]:
        for _item in _message.get("content"):
            assert isinstance(_item, dict), _message
            assert len(_item) > 0, _message
Full test code:
@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_output_messages_enabled(client: OpenAI, model_name: str, server):
    response = await client.responses.create(
        model=model_name,
        input="What is the capital of South Korea?",
        extra_body={"enable_response_messages": True},
    )

    assert response is not None
    assert response.status == "completed"
    assert len(response.input_messages) > 0
    assert len(response.output_messages) > 0

    # Regression test for github.com/openai/harmony/issues/78 (empty content)
    is_query_returned: bool = False
    for _message in [*response.input_messages, *response.output_messages]:
        for _item in _message.get("content"):
            assert isinstance(_item, dict), _message
            assert len(_item) > 0, _message

            # Ensure original input is returned
            _item_text = _item.get("text")
            if _item_text and "South Korea" in _item_text:
                is_query_returned = True

    assert is_query_returned
$ uv run pytest -v tests/entrypoints/openai/test_response_api_with_harmony.py::test_output_messages_enabled
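
The failure mode being guarded against can be sketched without a running server. The snippet below (not part of the PR) applies the same validation logic to a hypothetical mock payload; the empty dict in `content` mirrors the `'content': [{}]` seen in the failing system message below:

```python
# Standalone sketch of the regression check, run against mock data.
# The message shapes here are hypothetical, modeled on the assertion
# output from the failing test (empty content item, openai/harmony#78).
mock_messages = [
    {
        "author": {"name": None, "role": "system"},
        "channel": None,
        # Empty content item: the regression the new assertions catch.
        "content": [{}],
    },
    {
        "author": {"name": None, "role": "user"},
        "channel": None,
        "content": [{"type": "text", "text": "What is the capital of South Korea?"}],
    },
]


def find_empty_content_items(messages):
    """Return (message, item) pairs where a content item is an empty dict."""
    return [
        (message, item)
        for message in messages
        for item in message.get("content", [])
        if isinstance(item, dict) and len(item) == 0
    ]


offenders = find_empty_content_items(mock_messages)
print(len(offenders))  # 1: the system message carries an empty content item
```

With the harmony fix applied on the serving path, `find_empty_content_items` should return an empty list for every response, which is exactly what the added `assert len(_item) > 0` enforces.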

Test Result

FAILED tests/entrypoints/openai/test_response_api_with_harmony.py::test_output_messages_enabled[openai/gpt-oss-20b] - AssertionError: {'author': {'name': None, 'role': 'system'}, 'channel': None, 'content': [{}], 'content_type': None, ...}

client = <openai.AsyncOpenAI object at 0x1553fa3b1550>, model_name = 'openai/gpt-oss-20b', server = <tests.utils.RemoteOpenAIServer object at 0x1553fb984ad0>

    @pytest.mark.asyncio
    @pytest.mark.parametrize("model_name", [MODEL_NAME])
    async def test_output_messages_enabled(client: OpenAI, model_name: str, server):
        response = await client.responses.create(
            model=model_name,
            input="What is the capital of South Korea?",
            extra_body={"enable_response_messages": True},
        )
    
        assert response is not None
        assert response.status == "completed"
        assert len(response.input_messages) > 0
        assert len(response.output_messages) > 0
    
        # Regression test for github.com/openai/harmony/issues/78 (empty content)
        is_query_returned: bool = False
        for _message in [*response.input_messages, *response.output_messages]:
            for _item in _message.get("content"):
                assert isinstance(_item, dict), _message
>               assert len(_item) > 0, _message
E               AssertionError: {'author': {'name': None, 'role': 'system'}, 'channel': None, 'content': [{}], 'content_type': None, ...}
E               assert 0 > 0
E                +  where 0 = len({})

tests/entrypoints/openai/test_response_api_with_harmony.py:872: AssertionError
Full test output
$ uv run pytest -v tests/entrypoints/openai/test_response_api_with_harmony.py::test_output_messages_enabled 
============================================================================== test session starts ===============================================================================
platform linux -- Python 3.13.7, pytest-9.0.1, pluggy-1.6.0 -- <workdir>/.venv/bin/python3
cachedir: .pytest_cache
rootdir: <workdir>
configfile: pyproject.toml
plugins: anyio-4.12.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                                 

tests/entrypoints/openai/test_response_api_with_harmony.py::test_output_messages_enabled[openai/gpt-oss-20b] FAILED                                                        [100%]

==================================================================================== FAILURES ====================================================================================
________________________________________________________________ test_output_messages_enabled[openai/gpt-oss-20b] ________________________________________________________________

client = <openai.AsyncOpenAI object at 0x1553fa3b1550>, model_name = 'openai/gpt-oss-20b', server = <tests.utils.RemoteOpenAIServer object at 0x1553fb984ad0>

    @pytest.mark.asyncio
    @pytest.mark.parametrize("model_name", [MODEL_NAME])
    async def test_output_messages_enabled(client: OpenAI, model_name: str, server):
        response = await client.responses.create(
            model=model_name,
            input="What is the capital of South Korea?",
            extra_body={"enable_response_messages": True},
        )
    
        assert response is not None
        assert response.status == "completed"
        assert len(response.input_messages) > 0
        assert len(response.output_messages) > 0
    
        # Regression test for github.com/openai/harmony/issues/78 (empty content)
        is_query_returned: bool = False
        for _message in [*response.input_messages, *response.output_messages]:
            for _item in _message.get("content"):
                assert isinstance(_item, dict), _message
>               assert len(_item) > 0, _message
E               AssertionError: {'author': {'name': None, 'role': 'system'}, 'channel': None, 'content': [{}], 'content_type': None, ...}
E               assert 0 > 0
E                +  where 0 = len({})

tests/entrypoints/openai/test_response_api_with_harmony.py:872: AssertionError
----------------------------------------------------------------------------- Captured stdout setup ------------------------------------------------------------------------------
DEBUG 12-01 19:48:37 [plugins/__init__.py:43] Available plugins for group vllm.general_plugins:
DEBUG 12-01 19:48:37 [plugins/__init__.py:45] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 12-01 19:48:37 [plugins/__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
DEBUG 12-01 19:48:37 [model_executor/models/registry.py:675] Loaded model info for class vllm.model_executor.models.gpt_oss.GptOssForCausalLM from cache
DEBUG 12-01 19:48:37 [logging_utils/log_time.py:29] Registry inspect model class: Elapsed time 0.0018544 secs
INFO 12-01 19:48:37 [config/model.py:637] Resolved architecture: GptOssForCausalLM
INFO 12-01 19:48:37 [config/model.py:1750] Using max model len 5000
DEBUG 12-01 19:48:38 [_ipex_ops.py:15] Import error msg: No module named 'intel_extension_for_pytorch'
DEBUG 12-01 19:48:38 [model_executor/model_loader/weight_utils.py:466] Using model weights format ['*.safetensors']
Launching RemoteOpenAIServer with: vllm serve openai/gpt-oss-20b --enforce-eager --tool-server demo --max_model_len 5000 --port 54241 --seed 0
DEBUG 12-01 19:48:42 [plugins/__init__.py:35] No plugins for group vllm.platform_plugins found.
DEBUG 12-01 19:48:42 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 12-01 19:48:42 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:126] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 12-01 19:48:42 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:153] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 12-01 19:48:42 [platforms/__init__.py:160] Checking if CPU platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 12-01 19:48:42 [platforms/__init__.py:225] Automatically detected platform cuda.
DEBUG 12-01 19:48:44 [utils/flashinfer.py:45] flashinfer-cubin package was not found
DEBUG 12-01 19:48:44 [utils/flashinfer.py:60] FlashInfer unavailable since nvcc was not found and not using pre-downloaded cubins
DEBUG 12-01 19:48:46 [plugins/__init__.py:43] Available plugins for group vllm.general_plugins:
DEBUG 12-01 19:48:46 [plugins/__init__.py:45] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 12-01 19:48:46 [plugins/__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(APIServer pid=3413524) INFO 12-01 19:48:46 [entrypoints/openai/api_server.py:1770] vLLM API server version 0.11.2.dev440+g342c4f147
(APIServer pid=3413524) INFO 12-01 19:48:46 [entrypoints/utils.py:253] non-default args: {'model_tag': 'openai/gpt-oss-20b', 'port': 54241, 'tool_server': 'demo', 'model': 'openai/gpt-oss-20b', 'max_model_len': 5000, 'enforce_eager': True}
(APIServer pid=3413524) DEBUG 12-01 19:48:46 [model_executor/models/registry.py:675] Loaded model info for class vllm.model_executor.models.gpt_oss.GptOssForCausalLM from cache
(APIServer pid=3413524) DEBUG 12-01 19:48:46 [logging_utils/log_time.py:29] Registry inspect model class: Elapsed time 0.0013798 secs
(APIServer pid=3413524) INFO 12-01 19:48:46 [config/model.py:637] Resolved architecture: GptOssForCausalLM
(APIServer pid=3413524) INFO 12-01 19:48:46 [config/model.py:1750] Using max model len 5000
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [_ipex_ops.py:15] Import error msg: No module named 'intel_extension_for_pytorch'
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [config/model.py:1805] Generative models support chunked prefill.
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [config/model.py:1855] Generative models support prefix caching.
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [engine/arg_utils.py:1914] Enabling chunked prefill by default
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [engine/arg_utils.py:1944] Enabling prefix caching by default
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [engine/arg_utils.py:2022] Defaulting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [engine/arg_utils.py:2032] Defaulting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
(APIServer pid=3413524) INFO 12-01 19:48:47 [config/scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3413524) INFO 12-01 19:48:47 [model_executor/models/config.py:274] Overriding max cuda graph capture size to 1024 for performance.
(APIServer pid=3413524) WARNING 12-01 19:48:47 [config/vllm.py:589] Enforce eager set, overriding optimization level to -O0
(APIServer pid=3413524) INFO 12-01 19:48:47 [config/vllm.py:695] Cudagraph is disabled under eager mode
(APIServer pid=3413524) DEBUG 12-01 19:48:47 [plugins/__init__.py:35] No plugins for group vllm.stat_logger_plugins found.
(APIServer pid=3413524) DEBUG 12-01 19:48:49 [plugins/io_processors/__init__.py:33] No IOProcessor plugins requested by the model
DEBUG 12-01 19:48:52 [plugins/__init__.py:35] No plugins for group vllm.platform_plugins found.
DEBUG 12-01 19:48:52 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 12-01 19:48:52 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:126] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 12-01 19:48:52 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:153] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 12-01 19:48:52 [platforms/__init__.py:160] Checking if CPU platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 12-01 19:48:52 [platforms/__init__.py:225] Automatically detected platform cuda.
DEBUG 12-01 19:48:55 [utils/flashinfer.py:45] flashinfer-cubin package was not found
DEBUG 12-01 19:48:55 [utils/flashinfer.py:60] FlashInfer unavailable since nvcc was not found and not using pre-downloaded cubins
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [v1/engine/core.py:780] Waiting for init message from front-end.
(APIServer pid=3413524) DEBUG 12-01 19:48:56 [v1/engine/utils.py:1063] HELLO from local core engine process 0.
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [v1/engine/core.py:791] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/b787941e-f122-4abe-8460-c915b8a43283'], outputs=['ipc:///tmp/d48a33d1-023a-4aa5-859e-a7ce54d28380'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, '_data_parallel_master_port_list': [], 'data_parallel_size': 1}, parallel_config_hash=None)
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [v1/engine/core.py:596] Has DP Coordinator: False, stats publish address: None
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [plugins/__init__.py:43] Available plugins for group vllm.general_plugins:
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [plugins/__init__.py:45] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [plugins/__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(EngineCore_DP0 pid=3413554) INFO 12-01 19:48:56 [v1/engine/core.py:93] Initializing a V1 LLM engine (v0.11.2.dev440+g342c4f147) with config: model='openai/gpt-oss-20b', speculative_config=None, tokenizer='openai/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=5000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=openai/gpt-oss-20b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'enable_fusion': False, 'enable_attn_fusion': False, 'enable_noop': False, 'enable_sequence_parallelism': False, 'enable_async_tp': False, 'enable_fi_allreduce_fusion': False}, 
'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [compilation/decorators.py:194] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:56 [compilation/decorators.py:194] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:57 [distributed/parallel_state.py:1161] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.1.1.152:60465 backend=nccl
(EngineCore_DP0 pid=3413554) INFO 12-01 19:48:57 [distributed/parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.1.1.152:60465 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:57 [distributed/parallel_state.py:1244] Detected 1 nodes in the distributed environment
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=3413554) INFO 12-01 19:48:57 [distributed/parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:57 [v1/sample/ops/topk_topp_sampler.py:53] FlashInfer top-p/top-k sampling is available but disabled by default. Set VLLM_USE_FLASHINFER_SAMPLER=1 to opt in after verifying accuracy for your workloads.
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:57 [v1/sample/logits_processor/__init__.py:63] No logitsprocs plugins installed (group vllm.logits_processors).
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:57 [compilation/decorators.py:194] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.gpt_oss.GptOssModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=3413554) INFO 12-01 19:48:57 [v1/worker/gpu_model_runner.py:3469] Starting to load model openai/gpt-oss-20b...
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:58 [model_executor/.../quantization/mxfp4.py:215] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:58 [model_executor/.../quantization/mxfp4.py:230] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:58 [platforms/cuda.py:393] Some attention backends are not valid for cuda with head_size: 64, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 16, use_mla: False, has_sink: True, use_sparse: False. Reasons: {FLASH_ATTN: [sink setting not supported, sink not supported on compute capability < 9.0], FLASHINFER: [sink setting not supported], FLEX_ATTENTION: [sink setting not supported]}.
(EngineCore_DP0 pid=3413554) INFO 12-01 19:48:58 [platforms/cuda.py:411] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN']
(EngineCore_DP0 pid=3413554) INFO 12-01 19:48:58 [model_executor/.../fused_moe/layer.py:379] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=3413554) INFO 12-01 19:48:58 [model_executor/.../quantization/mxfp4.py:162] Using Marlin backend
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:58 [config/compilation.py:966] enabled custom ops: Counter({'rms_norm': 49, 'column_parallel_linear': 24, 'row_parallel_linear': 24, 'fused_moe': 24, 'vocab_parallel_embedding': 1, 'rotary_embedding': 1, 'parallel_lm_head': 1, 'logits_processor': 1})
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:58 [config/compilation.py:967] disabled custom ops: Counter()
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:58 [model_executor/model_loader/base_loader.py:53] Loading weights on cuda ...
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:48:58 [model_executor/model_loader/weight_utils.py:466] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:00 [model_executor/models/utils.py:220] Loaded weight lm_head.weight with shape torch.Size([201088, 2880])
(EngineCore_DP0 pid=3413554) INFO 12-01 19:49:00 [model_executor/model_loader/default_loader.py:308] Loading weights took 2.22 seconds
(EngineCore_DP0 pid=3413554) WARNING 12-01 19:49:00 [model_executor/.../utils/marlin_utils_fp4.py:226] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore_DP0 pid=3413554) INFO 12-01 19:49:01 [v1/worker/gpu_model_runner.py:3551] Model loading took 13.7194 GiB memory and 3.255989 seconds
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:02 [v1/worker/gpu_worker.py:346] Initial free memory: 43.91 GiB; Requested memory: 0.90 (util), 39.96 GiB
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:02 [v1/worker/gpu_worker.py:352] Free memory after profiling: 29.94 GiB (total), 25.99 GiB (within requested)
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:02 [v1/worker/gpu_worker.py:358] Memory profiling takes 0.83 seconds. Total non KV cache memory: 15.59GiB; torch peak memory increase: 1.83GiB; non-torch forward increase memory: 0.04GiB; weights memory: 13.72GiB.
(EngineCore_DP0 pid=3413554) INFO 12-01 19:49:02 [v1/worker/gpu_worker.py:359] Available KV cache memory: 24.38 GiB
(EngineCore_DP0 pid=3413554) INFO 12-01 19:49:03 [v1/core/kv_cache_utils.py:1286] GPU KV cache size: 532,528 tokens
(EngineCore_DP0 pid=3413554) INFO 12-01 19:49:03 [v1/core/kv_cache_utils.py:1291] Maximum concurrency for 5,000 tokens per request: 147.92x
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:03 [v1/worker/gpu_worker.py:484] Free memory on device (43.91/44.4 GiB) on startup. Desired GPU memory utilization is (0.9, 39.96 GiB). Actual usage is 13.72 GiB for weight, 1.83 GiB for peak activation, 0.04 GiB for non-torch memory, and 0.0 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=26017729126` (24.23 GiB) to fit into requested memory, or `--kv-cache-memory=30260962304` (28.18 GiB) to fully utilize gpu memory. Current kv cache memory in use is 24.38 GiB.
(EngineCore_DP0 pid=3413554) INFO 12-01 19:49:03 [v1/engine/core.py:254] init engine (profile, create kv cache, warmup model) took 1.34 seconds
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:04 [utils/gc_utils.py:40] GC Debug Config. enabled:False,top_objects:-1
(APIServer pid=3413524) DEBUG 12-01 19:49:04 [v1/engine/utils.py:1063] READY from local core engine process 0.
(EngineCore_DP0 pid=3413554) WARNING 12-01 19:49:04 [config/vllm.py:596] Inductor compilation was disabled by user settings,Optimizations settings that are only active duringInductor compilation will be ignored.
(EngineCore_DP0 pid=3413554) INFO 12-01 19:49:04 [config/vllm.py:695] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:04 [v1/engine/core.py:875] EngineCore waiting for work.
(APIServer pid=3413524) DEBUG 12-01 19:49:04 [v1/metrics/loggers.py:246] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 66566
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:04 [v1/engine/core.py:875] EngineCore waiting for work.
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:05 [v1/engine/core.py:875] EngineCore waiting for work.
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/openai/api_server.py:1518] Supported tasks: ['generate']
(APIServer pid=3413524) WARNING 12-01 19:49:05 [entrypoints/tool.py:56] EXA_API_KEY is not set, browsing is disabled
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/tool.py:129] Code interpreter tool initialized
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/tool_server.py:209] DemoToolServer initialized with tools: ['python']
(APIServer pid=3413524) WARNING 12-01 19:49:05 [entrypoints/openai/serving_responses.py:207] `VLLM_ENABLE_RESPONSES_API_STORE` is enabled. This may cause a memory leak since we never remove responses from the store.
(APIServer pid=3413524) WARNING 12-01 19:49:05 [entrypoints/openai/serving_responses.py:215] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/openai/api_server.py:1845] Starting vLLM API server 0 on http://0.0.0.0:54241
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:38] Available routes are:
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /health, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /load, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /version, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /score, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=3413524) INFO 12-01 19:49:05 [entrypoints/launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=3413524) DEBUG 12-01 19:49:06 [v1/engine/async_llm.py:733] Called check_health.
(APIServer pid=3413524) INFO:     127.0.0.1:39606 - "GET /health HTTP/1.1" 200 OK
----------------------------------------------------------------------------- Captured stderr setup ------------------------------------------------------------------------------
Parse safetensors files: 100%|██████████| 3/3 [00:00<00:00, 31.45it/s]
Parse safetensors files: 100%|██████████| 3/3 [00:00<00:00, 24.38it/s]
(EngineCore_DP0 pid=3413554) <workdir>/.venv/lib/python3.13/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:161: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled.
(EngineCore_DP0 pid=3413554) We recommend installing via `pip install torch-c-dlpack-ext`
(EngineCore_DP0 pid=3413554)   warnings.warn(
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.42it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.35it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.40it/s]
(EngineCore_DP0 pid=3413554) 
(APIServer pid=3413524) INFO:     Started server process [3413524]
(APIServer pid=3413524) INFO:     Waiting for application startup.
(APIServer pid=3413524) INFO:     Application startup complete.
------------------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------------------
(APIServer pid=3413524) INFO 12-01 19:49:07 [reasoning/gptoss_reasoning_parser.py:162] Builtin_tool_list: ['python']
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:08 [v1/engine/core.py:881] EngineCore loop active.
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:08 [v1/worker/gpu_model_runner.py:2931] Running batch with cudagraph_mode: NONE, batch_descriptor: BatchDescriptor(num_tokens=73, num_reqs=None, uniform=False, has_lora=False), ubatch_slices: None, num_tokens_across_dp: None
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:08 [v1/worker/gpu_model_runner.py:2931] Running batch with cudagraph_mode: NONE, batch_descriptor: BatchDescriptor(num_tokens=1, num_reqs=None, uniform=False, has_lora=False), ubatch_slices: None, num_tokens_across_dp: None
(EngineCore_DP0 pid=3413554) DEBUG [... repeated per-token "Running batch" decode-step lines omitted ...]
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:09 [v1/engine/core.py:875] EngineCore waiting for work.
(APIServer pid=3413524) INFO:     127.0.0.1:39620 - "POST /v1/responses HTTP/1.1" 200 OK
------------------------------------------------------------------------------ Captured stderr call ------------------------------------------------------------------------------
[2025-12-01 19:49:09] INFO _client.py:1740: HTTP Request: POST http://127.0.0.1:54241/v1/responses "HTTP/1.1 200 OK"
------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------
INFO     httpx:_client.py:1740 HTTP Request: POST http://127.0.0.1:54241/v1/responses "HTTP/1.1 200 OK"
---------------------------------------------------------------------------- Captured stdout teardown ----------------------------------------------------------------------------
(APIServer pid=3413524) INFO 12-01 19:49:10 [entrypoints/launcher.py:110] Shutting down FastAPI HTTP server.
(EngineCore_DP0 pid=3413554) DEBUG 12-01 19:49:10 [v1/engine/core.py:839] EngineCore exiting.
---------------------------------------------------------------------------- Captured stderr teardown ----------------------------------------------------------------------------
[rank0]:[W1201 19:49:10.263624448 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
nanobind: leaked 2 instances!
 - leaked instance 0x1553f81fdbd8 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked instance 0x1553d2ec9098 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 6 types!
 - leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked type "xgrammar.xgrammar_bindings.TokenizerInfo"
 - leaked type "xgrammar.xgrammar_bindings.Grammar"
 - leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
 - leaked type "xgrammar.xgrammar_bindings.BatchGrammarMatcher"
 - leaked type "xgrammar.xgrammar_bindings.GrammarCompiler"
nanobind: leaked 51 functions!
 - leaked function "serialize_json"
 - leaked function "builtin_json_grammar"
 - leaked function ""
 - leaked function "compile_json_schema"
 - leaked function "fill_next_token_bitmask"
 - leaked function ""
 - leaked function "deserialize_json"
 - leaked function "compile_regex"
 - leaked function "__init__"
 - leaked function ""
 - leaked function "batch_fill_next_token_bitmask"
 - leaked function "batch_accept_string"
 - leaked function "reset"
 - leaked function "dump_metadata"
 - leaked function "from_json_schema"
 - leaked function "_debug_print_internal_state"
 - leaked function "serialize_json"
 - leaked function "__init__"
 - leaked function "from_regex"
 - leaked function ""
 - leaked function "from_vocab_and_metadata"
 - leaked function "__init__"
 - leaked function ""
 - leaked function "compile_builtin_json_grammar"
 - leaked function "find_jump_forward_string"
 - leaked function "union"
 - leaked function "serialize_json"
 - leaked function "concat"
 - leaked function ""
 - leaked function ""
 - leaked function "clear_cache"
 - leaked function "accept_token"
 - leaked function "batch_accept_token"
 - leaked function ""
 - leaked function ""
 - leaked function "to_string"
 - leaked function "deserialize_json"
 - leaked function "compile_grammar"
 - leaked function "from_structural_tag"
 - leaked function "accept_string"
 - leaked function "_detect_metadata_from_hf"
 - leaked function "deserialize_json"
 - leaked function "get_cache_size_bytes"
 - leaked function ""
 - leaked function ""
 - leaked function "from_ebnf"
 - leaked function "__init__"
 - leaked function "compile_structural_tag"
 - leaked function "rollback"
 - leaked function "is_terminated"
 - leaked function ""
nanobind: this is likely caused by a reference counting issue in the binding code.
(APIServer pid=3413524) INFO:     Shutting down
(APIServer pid=3413524) INFO:     Waiting for application shutdown.
(APIServer pid=3413524) INFO:     Application shutdown complete.
================================================================================ warnings summary ================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

tests/entrypoints/openai/test_response_api_with_harmony.py:636
  <workdir>/tests/entrypoints/openai/test_response_api_with_harmony.py:636: PytestUnknownMarkWarning: Unknown pytest.mark.flaky - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    @pytest.mark.flaky(reruns=5)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================ short test summary info =============================================================================
FAILED tests/entrypoints/openai/test_response_api_with_harmony.py::test_output_messages_enabled[openai/gpt-oss-20b] - AssertionError: {'author': {'name': None, 'role': 'system'}, 'channel': None, 'content': [{}], 'content_type': None, ...}
========================================================================= 1 failed, 3 warnings in 35.23s =========================================================================


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a regression test for an issue where response messages can contain empty content. The test correctly reproduces the bug and is well-documented. I have one suggestion to make the test more robust against potential TypeError exceptions.

# Regression test for github.com/openai/harmony/issues/78 (empty content)
is_query_returned: bool = False
for _message in [*response.input_messages, *response.output_messages]:
    for _item in _message.get("content"):
Contributor


high

The content of a message can be None, for example in an assistant message that only contains tool calls. If _message.get("content") returns None, iterating over it will raise a TypeError. To make the test more robust, you should handle the None case, for example by iterating over an empty list if content is None.

Suggested change
-    for _item in _message.get("content"):
+    for _item in _message.get("content") or []:
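For illustration, a minimal sketch of the `or []` guard in isolation, using hypothetical message dicts (not actual vLLM output):

```python
# Hypothetical message dicts illustrating the suggested `or []` guard.
messages = [
    {"content": [{"type": "text", "text": "hello"}]},
    {"content": None},  # e.g. an assistant message carrying only tool calls
]

seen = []
for message in messages:
    # `message.get("content") or []` turns None into an empty iterable,
    # so the inner loop is skipped instead of raising a TypeError.
    for item in message.get("content") or []:
        seen.append(item["text"])
print(seen)
```

This silently skips `None` content rather than failing, which is the trade-off the review comments below weigh against failing fast.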

Contributor Author


Updated to use `_message["content"]` instead of `_message.get("content")`, since "content" is required per the OAI Harmony specs; it is better to raise a KeyError if that key is missing from vLLM output.
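As a quick sketch of the difference between the two access styles, with hypothetical message dicts (not actual vLLM output):

```python
# Hypothetical message dicts contrasting dict.get() with direct indexing
# for the required "content" key.
well_formed = {"author": {"role": "user"}, "content": [{"type": "text", "text": "hi"}]}
malformed = {"author": {"role": "system"}}  # "content" missing entirely

# .get() returns None silently; the bug would only surface later
# as a TypeError when iterating over None.
assert malformed.get("content") is None

# Direct indexing fails fast with a KeyError at the malformed message,
# which is the behavior the test opts into.
raised = False
try:
    malformed["content"]
except KeyError:
    raised = True
print("KeyError raised:", raised)
```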

Signed-off-by: Jacob-Junqi Tian <jacob@banana.abay.cf>
@jacobthebanana jacobthebanana force-pushed the harmony-issue-78-regression branch from 23de85f to f44dda6 Compare December 2, 2025 01:08
@chatgpt-codex-connector

💡 Codex Review

https://github.com/vllm-project/vllm/blob/23de85f3697b84c698c465fd6d3d911adb091bbf/tests/entrypoints/openai/test_response_api_with_harmony.py#L869-L872
P1: New regression assertion keeps suite red

The new loop asserts every content element in both input_messages and output_messages is a non-empty dict. With enable_response_messages on the FastAPI/gpt-oss path, the server currently returns system input content as [{}] (see the failure output in the PR description), so this test fails immediately with AssertionError and there is no accompanying fix to make the assertion pass. Landing this change will keep tests/entrypoints/openai/test_response_api_with_harmony.py::test_output_messages_enabled failing on the only supported model path, blocking merges until the serialization bug is addressed or the expectation is relaxed.
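For illustration, a hypothetical reproduction of why the assertion fires, assuming the serialized system message matches the `'content': [{}]` shape shown in the short test summary above:

```python
# Hypothetical reproduction of the failing assertion, using the message
# shape from the AssertionError in the test output above.
message = {
    "author": {"name": None, "role": "system"},
    "channel": None,
    "content": [{}],  # empty content item serialized by the server
}

violations = [
    item
    for item in message["content"]
    if not (isinstance(item, dict) and len(item) > 0)
]
# The empty dict {} has length 0, so the regression assertion fires.
print("empty content items:", len(violations))
```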

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Harmony specs
Signed-off-by: Jacob-Junqi Tian <jacob@banana.abay.cf>
Signed-off-by: Jacob-Junqi Tian <jacob@banana.abay.cf>
@robertgshaw2-redhat
Collaborator

Can you make the pre-commit pass?

Signed-off-by: Jacob-Junqi Tian <jacob@banana.abay.cf>
@jacobthebanana jacobthebanana force-pushed the harmony-issue-78-regression branch from 553706f to a3ea500 Compare December 4, 2025 15:25
@jacobthebanana
Contributor Author

Can you make the pre-commit pass?

Done, though the full end-to-end tests likely will not pass because of #29831. I created a quick fix in #27377 if you'd like to take a look.

@github-actions

github-actions bot commented Mar 6, 2026

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Mar 6, 2026

Labels

gpt-oss Related to GPT-OSS models stale Over 90 days of inactivity

Projects

Status: To Triage
