[HiCache] return cached_tokens_details in sglext for streaming responses#22055

Open
vladnosiv wants to merge 1 commit into sgl-project:main from vladnosiv:fix-cache-hit-breakdown-chat-streaming
Conversation

@vladnosiv
Contributor

Motivation

Previous PRs: #17648 & #21764

sglext.cached_tokens_details is returned correctly in non-streaming chat/completions responses, but is silently dropped in streaming mode. The backend populates cached_tokens_details in meta_info for every request, and the streaming loop already collects it, but it was never extracted and emitted in the response.
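A minimal sketch of the fix described above: extract cached_tokens_details from the request's meta_info and attach it to the usage payload of the final streaming chunk, mirroring what the non-streaming path already does. The helper name and the exact dict shapes here are illustrative assumptions, not the actual SGLang internals.

```python
# Hypothetical helper illustrating the streaming-path fix: copy
# cached_tokens_details from meta_info into the usage payload under the
# sglext extension field, as non-streaming responses already do.
# The field names and nesting are assumptions for illustration.

def attach_cached_tokens_details(usage: dict, meta_info: dict) -> dict:
    """Copy cached_tokens_details from meta_info into the usage payload."""
    details = meta_info.get("cached_tokens_details")
    if details is not None:
        # Emit under the sglext extension field on the final usage chunk.
        usage.setdefault("sglext", {})["cached_tokens_details"] = details
    return usage

# Example: usage of the final streaming chunk before and after the fix.
meta_info = {"cached_tokens": 128, "cached_tokens_details": {"l1": 96, "l2": 32}}
usage = {"prompt_tokens": 200, "completion_tokens": 50}
usage = attach_cached_tokens_details(usage, meta_info)
print(usage["sglext"]["cached_tokens_details"])  # {'l1': 96, 'l2': 32}
```

With a helper like this, the streaming loop can call it once when building the last chunk, so the extension data is consolidated into a single response chunk rather than being dropped.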

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for cached token details in OpenAI-compatible chat and completion streams, consolidating extension data into a single response chunk. It also adds a utility function for processing cached token metadata. Review feedback suggests simplifying the conditional logic for assigning routed experts to improve readability.

Collaborator

@huangtingwei9988 huangtingwei9988 left a comment

LGTM
