Conversation

@qandrew (Contributor) commented Sep 10, 2025

Purpose

  • add IncompleteDetails to vLLM's implementation of ResponsesResponse.
  • first use case: if the max token length has been hit, we add the reason for incompleteness. This is done in the non-streaming version; the streaming version will be a follow-up PR (because it needs to be based off of [gpt-oss][1][bugfix] fix streaming final output #24466).
  • fix some formatting issues

I realize there are other things we need to fix for IncompleteDetails; for now we guarantee that IF incomplete_details is output, it is correct. We still need to test flows such as the generator being interrupted abruptly, and output the reason why. And I don't think GPT-OSS does content filtering for now, but that would be another IncompleteDetails case to implement.
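The gist of the non-streaming change can be sketched as follows (a minimal sketch only; the function and field names here are illustrative, not vLLM's actual internals):

```python
from typing import Optional

# Illustrative sketch only: build_incomplete_details is a hypothetical
# helper, not vLLM's actual implementation.
def build_incomplete_details(finish_reason: str) -> Optional[dict]:
    """Map a generation finish reason onto the Responses API
    incomplete_details field."""
    if finish_reason == "length":
        # Generation stopped because max_output_tokens was reached.
        return {"reason": "max_output_tokens"}
    # TODO: "content_filter" is the other reason the Responses API
    # defines, but vLLM has no content-filter abort path today.
    return None

details = build_incomplete_details("length")
status = "incomplete" if details is not None else "completed"
```

When the model stops for any other reason (e.g. a natural stop token), the helper returns None and the response stays "completed", matching the two test runs below.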

Test Plan

Server

(gpt_oss_edit) [[email protected] /data/users/axia/gitrepos/vllm (andrew/incomplete-details)]$ CUDA_VISIBLE_DEVICES=2,3 with-proxy vllm serve "/data/users/axia/checkpoints/gpt-oss-120b" -tp 2 --port 20001

Client

(gpt_oss_edit) [[email protected] /data/users/axia/gitrepos/vllm (andrew/incomplete-details)]$ curl http://localhost:20001/v1/responses   -H "Content-Type: application/json"   -N   -d '{
    "model": "/data/users/axia/checkpoints/gpt-oss-120b",
    "input": [
        {
            "role": "user",
            "content": "Write two paragraphs on the weather."
        }
    ],
    "temperature": 0.7,
    "max_output_tokens": 256
}' | jq
# output
{
  "id": "resp_5b6f93e8efa4445497f6b6bc052b6dac",
  "created_at": 1757701130,
  "incomplete_details": {
    "reason": "max_output_tokens" # we get incomplete reasons here
  },
  "instructions": null,
  "metadata": null,
  "model": "/data/users/axia/checkpoints/gpt-oss-120b",
  "object": "response",
  "output": [
    {
      "id": "rs_541bc7e4806f4630b04703e72e14e025",
      "summary": [],
      "type": "reasoning",
      "content": [
        {
          "text": "User asks: \"Write two paragraphs on the weather.\" We need to produce two paragraphs about weather. Should be descriptive, could talk about different aspects. Probably just two paragraphs. Ensure it's coherent. No extra instructions.",
          "type": "reasoning_text"
        }
      ],
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_122778466dba478aa761b0542f2cf81f",
      "content": [
        {
          "annotations": [],
          "text": "The sky stretched a bruised violet this morning, the first hints of sunrise shyly peeking through a thin veil of low‑lying clouds. A gentle breeze whispered through the oak leaves, carrying the faint scent of damp earth and pine sap, while droplets of drizzle clung to the windowpanes like tiny crystal beads. The temperature hovered just above the dew point, making the air feel cool enough to pull a light sweater from the closet, yet not so cold as to bite the skin. As the sun climbed higher, its golden rays began to melt the lingering mist, turning the wet streets into shimmering ribbons that reflected the city’s hurried rhythm.\n\nBy afternoon, the weather shifted dramatically. Dark, billowing cumulonimbus towers gathered on the horizon, their edges tinged with electric blue, signalling an approaching thunderstorm. A sudden gust surged through the streets, rattling shutters and sending loose papers swirling in a chaotic dance. The first crack of thunder rolled like distant drums, followed quickly by a cascade of rain that",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 0.7,
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": 256,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "reasoning": null,
  "service_tier": "auto",
  "status": "incomplete",
  "text": null,
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 72,
    "input_tokens_details": {
      "cached_tokens": 64
    },
    "output_tokens": 256,
    "output_tokens_details": {
      "reasoning_tokens": 44,
      "tool_output_tokens": 0
    },
    "total_tokens": 328
  },
  "user": null
}
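A client can detect the truncation programmatically from the fields above; a minimal sketch (the payload is abbreviated from the response shown; in practice it would come from the curl call or `requests.post(...).json()`):

```python
import json

# Abbreviated from the truncated response above; in practice this would
# come from requests.post("http://localhost:20001/v1/responses", ...).json()
payload = json.loads("""
{
  "status": "incomplete",
  "incomplete_details": {"reason": "max_output_tokens"},
  "usage": {"output_tokens": 256}
}
""")

if payload["status"] == "incomplete":
    reason = payload["incomplete_details"]["reason"]
    print(f"response truncated: {reason}")  # response truncated: max_output_tokens
```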

Client, where we don't hit the max token limit

(gpt_oss_edit) [[email protected] /data/users/axia/gitrepos/vllm (andrew/incomplete-details)]$ curl http://localhost:20001/v1/responses   -H "Content-Type: application/json"   -N   -d '{
    "model": "/data/users/axia/checkpoints/gpt-oss-120b",
    "input": [
        {
            "role": "user",
            "content": "Write two words on the weather."
        }
    ],
    "temperature": 0.7,
    "max_output_tokens": 256
}' 
# output
{
  "id": "resp_60215e7212c044669bf5b85015fe19e4",
  "created_at": 1757701190,
  "incomplete_details": null,
  "instructions": null,
  "metadata": null,
  "model": "/data/users/axia/checkpoints/gpt-oss-120b",
  "object": "response",
  "output": [
    {
      "id": "rs_02262a1cf9cf4ff3891d07d282f2e176",
      "summary": [],
      "type": "reasoning",
      "content": [
        {
          "text": "The user asks: \"Write two words on the weather.\" Probably they want a short phrase of two words describing the weather. Could be \"sunny skies\", \"stormy night\", etc. Provide two words. Could also be a short description: \"Cloudy day\". Provide two words. Probably just output two words. I'll respond with two words.",
          "type": "reasoning_text"
        }
      ],
      "encrypted_content": null,
      "status": null
    },
    {
      "id": "msg_52743c005ca94adf9687a4ffebb0107e",
      "content": [
        {
          "annotations": [],
          "text": "Sunny skies.",
          "type": "output_text",
          "logprobs": null
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 0.7,
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "background": false,
  "max_output_tokens": 256,
  "max_tool_calls": null,
  "previous_response_id": null,
  "prompt": null,
  "reasoning": null,
  "service_tier": "auto",
  "status": "completed",
  "text": null,
  "top_logprobs": null,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 72,
    "input_tokens_details": {
      "cached_tokens": 64
    },
    "output_tokens": 84,
    "output_tokens_details": {
      "reasoning_tokens": 72,
      "tool_output_tokens": 0
    },
    "total_tokens": 156
  },
  "user": null
}
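Comparing the two runs: the truncated response consumes exactly max_output_tokens (256 output tokens) and reports status "incomplete", while the short run stops at 84 tokens and stays "completed". A quick sanity check over the numbers copied from the two usage blocks above:

```python
# Numbers copied from the two usage blocks above.
runs = [
    {"output_tokens": 256, "status": "incomplete"},  # "two paragraphs" run
    {"output_tokens": 84, "status": "completed"},    # "two words" run
]
MAX_OUTPUT_TOKENS = 256  # from both requests

for run in runs:
    hit_limit = run["output_tokens"] >= MAX_OUTPUT_TOKENS
    # A run is incomplete exactly when it hit the token limit.
    assert hit_limit == (run["status"] == "incomplete")
```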

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the frontend, gpt-oss (Related to GPT-OSS models), and v1 labels Sep 10, 2025
Signed-off-by: Andrew Xia <[email protected]>
mergify bot commented Sep 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @qandrew.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 12, 2025
Signed-off-by: Andrew Xia <[email protected]>
@mergify mergify bot removed the needs-rebase label Sep 12, 2025
Signed-off-by: Andrew Xia <[email protected]>
@qandrew qandrew force-pushed the andrew/incomplete-details branch from 1fe6099 to 1718d05 Compare September 12, 2025 22:35
Signed-off-by: Andrew Xia <[email protected]>
Signed-off-by: Andrew Xia <[email protected]>
@yeqcharlotte (Collaborator) left a comment


thanks for the test plan! cc: @heheda12345 @houseroad

Comment on lines +1900 to +1902
# TODO: implement the other reason for incomplete_details,
# which is content_filter
# incomplete_details = IncompleteDetails(reason='content_filter')
Collaborator:

What's missing from the current logic, btw?

@qandrew (author):

I don't think VLLM baseline implementation supports content filter as an abort reason currently: https://github.com/vllm-project/vllm/blob/main/vllm/v1/request.py#L206

@qandrew (author):

If the parser still has messages (i.e. if the generator got cut off abruptly), the response should be marked incomplete and not completed.

Collaborator:

Makes sense, please move your comment to the code :)


@houseroad houseroad added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 14, 2025
Signed-off-by: Andrew Xia <[email protected]>
@yeqcharlotte (Collaborator):

@qandrew @houseroad this needs a rebase over the structured output disable commit

@houseroad houseroad merged commit 25aba2b into vllm-project:main Sep 15, 2025
45 checks passed
tlrmchlsmth pushed a commit to tlrmchlsmth/vllm that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
QierLi pushed a commit to QierLi/vllm that referenced this pull request Oct 5, 2025
[gpt-oss] Add IncompleteDetails to ResponsesRepsonse (vllm-project#24561)

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025