[BugFix][Frontend] pass kv_transfer_params through to sampling_params #38094

Open
hhk7734 wants to merge 1 commit into vllm-project:main from moreh-dev:fix/tito_kv_transfer_params
@hhk7734 hhk7734 commented Mar 25, 2026

Purpose

Fix two issues in the disaggregated serving GenerateRequest:

  1. kv_transfer_params silently dropped: When a client sends kv_transfer_params in a generate request, the field is parsed but never forwarded to the engine's SamplingParams. This means disaggregated prefill/decode scheduling flags (do_remote_decode, do_remote_prefill, etc.) have no effect via the token-level serving endpoint.

  2. sampling_params unnecessarily required: The field has no default, forcing every caller to provide it even when default sampling is acceptable.

Changes

  • Add GenerateRequest.to_sampling_params() that merges kv_transfer_params into SamplingParams.extra_args before handing off to the engine.
  • Default sampling_params to SamplingParams() via Field(default_factory=...).
  • Call request.to_sampling_params() in ServingTokens instead of accessing the raw field.
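
As a rough illustration of the changes listed above, the sketch below uses simplified stand-in dataclasses rather than the real vLLM classes. The `SamplingParams` and `GenerateRequest` definitions here are hypothetical and reduced to the fields named in this description, and the `"kv_transfer_params"` key inside `extra_args` is an assumption about where the engine looks for it. The stand-in also copies before merging, which may differ from the code in this PR.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SamplingParams:
    """Hypothetical, simplified stand-in for the engine's SamplingParams."""
    max_tokens: int = 16
    extra_args: dict[str, Any] | None = None


@dataclass
class GenerateRequest:
    """Hypothetical stand-in; field names follow the PR description."""
    token_ids: list[int]
    # Defaulted so callers may omit sampling_params entirely.
    sampling_params: SamplingParams = field(default_factory=SamplingParams)
    kv_transfer_params: dict[str, Any] | None = None

    def to_sampling_params(self) -> SamplingParams:
        # Work on a copy so the request object itself is left untouched.
        params = copy.deepcopy(self.sampling_params)
        if self.kv_transfer_params:
            extra = dict(params.extra_args or {})
            # Assumed key name: the engine is taken to read this entry.
            extra["kv_transfer_params"] = self.kv_transfer_params
            params.extra_args = extra
        return params


# Usage: sampling_params omitted, kv_transfer_params forwarded.
request = GenerateRequest(
    token_ids=[15339, 11, 1268, 527, 499, 30],
    kv_transfer_params={"do_remote_decode": True, "do_remote_prefill": False},
)
engine_params = request.to_sampling_params()
```

With this shape, a request that carries no `kv_transfer_params` yields sampling parameters whose `extra_args` is left as it was.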

Test Plan

curl -XPOST http://<prefillerIP>:8000/inference/v1/generate \
    -H "Content-Type: application/json" \
    -d '{
      "token_ids": [15339, 11, 1268, 527, 499, 30],
      "sampling_params": {
        "max_tokens": 20
      },
      "kv_transfer_params": {
        "do_remote_decode": true,
        "do_remote_prefill": false
      }
    }' | jq
curl -XPOST http://<decoderIP>:8000/inference/v1/generate \
    -H "Content-Type: application/json" \
    -d '{
      "token_ids": [15339, 11, 1268, 527, 499, 30],
      "sampling_params": {
        "max_tokens": 20
      },
      "kv_transfer_params": {
        "do_remote_prefill": true,
        "do_remote_decode": false,
        "remote_block_ids": [
          [
            1,
            2
          ]
        ],
        "remote_engine_id": "0ce78dd0-7144-4375-86ed-71dee6d9a81c_dp0",
        "remote_request_id": "generate-tokens-bcc9f408bb55b99e-9446e206",
        "remote_host": "<prefillerIP>",
        "remote_port": 5600,
        "tp_size": 1
      }
    }' | jq
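
The two requests above read as a prefill-then-decode handoff. The sketch below shows one way a client might build the decoder request, assuming the decoder's `kv_transfer_params` are taken verbatim from the prefiller's response body. `build_decode_payload` is a hypothetical helper, not part of vLLM, and no network call is made here.

```python
import json
from typing import Any


def build_decode_payload(
    prefill_response: dict[str, Any],
    token_ids: list[int],
    max_tokens: int,
) -> dict[str, Any]:
    """Build the decoder request body from a prefiller response (hypothetical helper)."""
    kv = prefill_response.get("kv_transfer_params")
    if kv is None:
        # Without transfer params the decoder has nothing to pull.
        raise ValueError("prefill response carried no kv_transfer_params")
    return {
        "token_ids": token_ids,
        "sampling_params": {"max_tokens": max_tokens},
        "kv_transfer_params": kv,
    }


# Usage with a trimmed, illustrative prefiller response.
prefill_response = {
    "request_id": "example",
    "kv_transfer_params": {
        "do_remote_prefill": True,
        "do_remote_decode": False,
        "remote_block_ids": [[1, 2]],
        "remote_port": 5600,
        "tp_size": 1,
    },
}
payload = build_decode_payload(prefill_response, [15339, 11, 1268, 527, 499, 30], 20)
body = json.dumps(payload)  # JSON string to send as the decoder request body
```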

Test Result

{
  "request_id": "905c6d76ee0a7603",
  "choices": [
    {
      "index": 0,
      "logprobs": null,
      "finish_reason": "length",
      "token_ids": [
        29882,
        29906,
        29900,
        29906,
        29906,
        31054,
        31093,
        29871,
        237,
        181,
        131,
        239,
        134,
        140,
        240,
        152,
        163,
        29871,
        30970,
        29871
      ]
    }
  ],
  "prompt_logprobs": null,
  "kv_transfer_params": {
    "do_remote_prefill": true,
    "do_remote_decode": false,
    "remote_block_ids": [
      [
        1,
        2
      ]
    ],
    "remote_engine_id": "0ce78dd0-7144-4375-86ed-71dee6d9a81c_dp0",
    "remote_request_id": "generate-tokens-bcc9f408bb55b99e-9446e206",
    "remote_host": "<prefillerIP>",
    "remote_port": 5600,
    "tp_size": 1
  }
}
{
  "request_id": "b7fe3643234c27e3",
  "choices": [
    {
      "index": 0,
      "logprobs": null,
      "finish_reason": "length",
      "token_ids": [
        29871,
        239,
        161,
        139,
        31137,
        31054,
        31136,
        29871,
        237,
        181,
        194,
        239,
        188,
        30393,
        240,
        152,
        163,
        239,
        159,
        191
      ]
    }
  ],
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
main (APIServer pid=8) INFO 03-26 14:20:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 2.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 100.0%
main (APIServer pid=8) INFO 03-26 14:20:00 [metrics.py:103] KV Transfer metrics: Num successful transfers=1, Avg xfer time (ms)=11.411, P90 xfer time (ms)=11.411, Avg post time (ms)=0.33, P90 post time (ms)=0.33, Avg MB per transfer=2.0, Throughput (MB/s)=175.269, Avg number of descriptors=32.0
main (APIServer pid=8) INFO 03-26 14:20:10 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 100.0%

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for kv_transfer_params in disaggregated serving requests and makes sampling_params optional by providing a default. New tests have been added to cover these scenarios. A high-severity issue was identified in the to_sampling_params method, which currently mutates the original sampling_params object instead of working on a copy, potentially leading to unexpected side effects.

    )

    def to_sampling_params(self) -> SamplingParams:
        params = self.sampling_params
high

This method mutates self.sampling_params in place because params is a reference, not a copy. This can lead to unexpected side effects, as the state of the GenerateRequest object is modified. A method with a to_... naming convention should not have side effects on the instance it's called on.

The caller in serving.py also modifies the returned object. This reinforces the need to work on a copy to avoid altering the original request object's state.

Please create a deep copy of self.sampling_params. Assuming it's a Pydantic model, model_copy(deep=True) is the idiomatic way to do this.

Suggested change
- params = self.sampling_params
+ params = self.sampling_params.model_copy(deep=True)
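
To make the aliasing concern concrete, here is a minimal sketch using a hypothetical `Params` dataclass in place of `SamplingParams`. It uses `copy.deepcopy` from the standard library, which does not depend on whether the real class is a Pydantic model. On a Pydantic v2 model, `model_copy(deep=True)` would play the same role.

```python
import copy
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Params:
    """Hypothetical stand-in for SamplingParams."""
    extra_args: dict[str, Any] = field(default_factory=dict)


def merge_in_place(params: Params, kv: dict[str, Any]) -> Params:
    # `params` is the caller's object, so this write is visible to the caller.
    params.extra_args["kv_transfer_params"] = kv
    return params


def merge_on_copy(params: Params, kv: dict[str, Any]) -> Params:
    # Deep copy first so nested dicts are not shared with the original.
    result = copy.deepcopy(params)
    result.extra_args["kv_transfer_params"] = kv
    return result


original = Params()

merge_on_copy(original, {"do_remote_decode": True})
unchanged_after_copy = original.extra_args == {}

merge_in_place(original, {"do_remote_decode": True})
mutated_after_in_place = "kv_transfer_params" in original.extra_args
```

Both flags end up `True`: the copy-based merge leaves the original untouched, while the in-place merge changes it.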

…ake sampling_params optional

kv_transfer_params in GenerateRequest were not being forwarded to the
engine. Add to_sampling_params() that merges kv_transfer_params into
extra_args, and default sampling_params so callers can omit it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hyeonki Hong <hyeonki.hong@moreh.io>
Labels: bug, frontend