feat: add configurable thinking output format support for vLLM #8901
base: main
Conversation
I have read the CLA Document and I hereby sign the CLA. (1 out of 2 committers have signed the CLA.)

I have read the CLA Document and I hereby sign the CLA.

recheck
- Add new section in vLLM provider docs explaining thinking output format options
- Document thinkingOpenTag and thinkingCloseTag properties in YAML reference
- Document thinkingOpenTag and thinkingCloseTag properties in JSON reference
- Include configuration examples for both YAML and JSON formats

Co-authored-by: nate <[email protected]>
Generated with [Continue](https://continue.dev)
Co-Authored-By: Continue <[email protected]>
Added documentation for the new thinking output format configuration feature. The documentation maintains the existing level of detail and focuses on practical usage of the new feature.
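For reference, here is a minimal sketch of what the YAML side of this configuration might look like. It assumes the documented `thinkingOpenTag`/`thinkingCloseTag` properties sit directly on the model entry; the model name, apiBase, and tag values are placeholders, not taken from the actual docs commit.

```yaml
# Hypothetical config.yaml sketch; property placement and all values are assumptions.
models:
  - name: Local reasoning model
    provider: vllm
    model: my-reasoning-model
    apiBase: http://localhost:8000/v1
    thinkingOpenTag: "<think>"
    thinkingCloseTag: "</think>"
```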
…eaming

Add comprehensive integration tests to verify the ThinkingTagExtractor works correctly when integrated with BaseLLM's streamChat method. Tests cover:

- Single and multiple chunk scenarios
- Partial tag handling at chunk boundaries
- Flush behavior at stream end
- Multiple thinking blocks
- Custom tag formats
- Interaction with native thinking role chunks

Co-authored-by: nate <[email protected]>
Added integration tests in `thinkingTagIntegration.vitest.ts`. The tests cover the scenarios listed in the commit message above. These integration tests complement the existing unit tests for ThinkingTagExtractor by verifying the end-to-end behavior when the extractor is integrated with the actual streaming pipeline.
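As a rough illustration only (not the file from this PR), a vitest case for the chunk-boundary scenario could look like the sketch below. The import path and the `process`/`flush` interface of `ThinkingTagExtractor` are assumptions inferred from this thread.

```typescript
// Hypothetical sketch: the import path and the process()/flush() API are assumed
// from the discussion above, not copied from the PR diff.
import { describe, expect, it } from "vitest";
import { ThinkingTagExtractor } from "./thinkingTagExtractor";

describe("ThinkingTagExtractor across chunk boundaries", () => {
  it("handles an open tag split across two streamed chunks", () => {
    const extractor = new ThinkingTagExtractor("<think>", "</think>");

    // The first chunk ends in the middle of the open tag, so nothing is classified yet.
    const first = extractor.process("Hello <thi");
    // The second chunk completes the tag, carries the thinking text, and closes it.
    const second = extractor.process("nk>step 1</think> world");
    // Flushing at stream end releases anything still buffered as a possible partial tag.
    const tail = extractor.flush();

    const thinking = first.thinking + second.thinking + tail.thinking;
    const content = first.content + second.content + tail.content;

    expect(thinking).toBe("step 1");
    // Double space is expected: one from each side of the removed thinking block.
    expect(content).toBe("Hello  world");
  });
});
```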
cubic-dev-ai review: 1 issue found across 4 files.

Violation 1 (core/llm/index.ts:1412): Custom provider streaming now suppresses assistant tool-call messages with empty `content`, so downstream tool executions are lost.
cubic-dev-ai review: 1 issue found across 8 files.

Violation 1 (core/llm/index.ts:1413): Tool-call/usage-only assistant chunks are dropped because they are now yielded only when `content` is non-empty, preventing tools from ever executing.
cubic-dev-ai review: 1 issue found across 7 files.

Violation 1 (core/llm/index.ts:1412): Tool/function-call chunks are dropped because the new guard filters out assistant messages whose content is an empty string, so streaming tool calls no longer reach the caller.
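All three findings point at the same pattern: a guard that keys only on non-empty `content` will drop assistant chunks whose only payload is a tool call or usage metadata. Below is a minimal sketch of the kind of guard being asked for; the field names (`content`, `toolCalls`, `usage`) are assumptions, not the exact types from core/llm/index.ts.

```typescript
// Hypothetical sketch; the chunk shape below is an assumption, not the project's types.
interface AssistantChunk {
  role: "assistant";
  content: string;
  toolCalls?: unknown[];
  usage?: unknown;
}

function shouldYieldChunk(chunk: AssistantChunk): boolean {
  // Keep chunks that carry visible text...
  if (chunk.content !== "") {
    return true;
  }
  // ...and never drop chunks whose only payload is tool calls or usage metadata,
  // otherwise downstream tool execution silently stops working.
  return Boolean(chunk.toolCalls?.length) || chunk.usage !== undefined;
}
```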
RomneyDa left a comment:
@AyRickk see comments, I think we could merge if:

- functionality is siloed to the VLLM class
- use a single thinkingTagName, not duplicate tag configs (unless it would be normal for them to differ in VLLM, I'm not sure)
- Update the config yaml schema as in the comment, and in the config-yaml package (or it will not work for YAML configs)
- On first glance I wonder if the tag parsing logic could be significantly more concise, maybe worth another look/approach? (One possible shape is sketched after this list.)
- Follow up on/clean up the test/docs commits. This is a new feature and we are tweaking when it runs (i.e. we might not run it for community PRs), but I think in this case it made sense.
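On the "could the tag parsing be more concise" point, one possible shape is a small stateful extractor that buffers only a potential partial tag between chunks. This is a hypothetical sketch; the class name and method shapes are inferred from this thread, not the actual diff.

```typescript
// Hypothetical sketch of a concise streaming tag extractor; the class name and
// return shape mirror the descriptions in this PR but are not the actual code.
interface ExtractedChunk {
  thinking: string;
  content: string;
}

export class ThinkingTagExtractor {
  private buffer = "";
  private inThinking = false;

  constructor(
    private openTag: string,
    private closeTag: string,
  ) {}

  process(text: string): ExtractedChunk {
    this.buffer += text;
    const out: ExtractedChunk = { thinking: "", content: "" };

    while (this.buffer.length > 0) {
      const tag = this.inThinking ? this.closeTag : this.openTag;
      const idx = this.buffer.indexOf(tag);

      if (idx !== -1) {
        // Everything before the tag belongs to the current mode, then the mode flips.
        this.emit(out, this.buffer.slice(0, idx));
        this.buffer = this.buffer.slice(idx + tag.length);
        this.inThinking = !this.inThinking;
        continue;
      }

      // No complete tag: emit everything except a suffix that could still
      // become the start of the tag in the next chunk.
      const keep = this.partialTagLength(tag);
      this.emit(out, this.buffer.slice(0, this.buffer.length - keep));
      this.buffer = this.buffer.slice(this.buffer.length - keep);
      break;
    }
    return out;
  }

  flush(): ExtractedChunk {
    // Stream ended: whatever is buffered can no longer turn into a tag.
    const out: ExtractedChunk = { thinking: "", content: "" };
    this.emit(out, this.buffer);
    this.buffer = "";
    return out;
  }

  private emit(out: ExtractedChunk, text: string): void {
    if (this.inThinking) {
      out.thinking += text;
    } else {
      out.content += text;
    }
  }

  // Length of the longest buffer suffix that is a proper prefix of `tag`.
  private partialTagLength(tag: string): number {
    const max = Math.min(tag.length - 1, this.buffer.length);
    for (let len = max; len > 0; len--) {
      if (tag.startsWith(this.buffer.slice(this.buffer.length - len))) {
        return len;
      }
    }
    return 0;
  }
}
```

The flush step matters because a stream can end mid-tag; without it, buffered text at stream end would be silently dropped.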
…utput-format-support-for-vLLM
Per reviewer feedback on PR continuedev#8901:

- Remove ThinkingTagExtractor class from core/llm/index.ts (keep in separate file)
- Remove thinkingOpenTag/thinkingCloseTag from BaseLLM class
- Remove thinking extractor logic from processChatChunk and streamChat in BaseLLM
- Remove thinkingOpenTag/thinkingCloseTag from LLMOptions in core/index.d.ts
- Remove thinkingTagIntegration.vitest.ts (BaseLLM integration test)

The feature is now vLLM-specific only, handled by the Vllm class.

Co-authored-by: AyRickk <[email protected]>
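For context on what "handled by the Vllm class" might look like, here is a hypothetical sketch of vLLM-only wiring that reuses the extractor sketched above. The chunk shape and the `withThinkingTags` name are illustration-only assumptions, not the actual `Vllm._streamChat` from the PR.

```typescript
// Hypothetical wrapper, not the real Vllm._streamChat; StreamedMessage and
// withThinkingTags are assumed names. Reuses the ThinkingTagExtractor sketched above.
type StreamedMessage =
  | { role: "assistant"; content: string }
  | { role: "thinking"; content: string };

async function* withThinkingTags(
  upstream: AsyncIterable<StreamedMessage>,
  openTag: string,
  closeTag: string,
): AsyncGenerator<StreamedMessage> {
  const extractor = new ThinkingTagExtractor(openTag, closeTag);

  for await (const chunk of upstream) {
    // Native reasoning_content already arrives as "thinking" chunks; pass it through untouched.
    if (chunk.role !== "assistant") {
      yield chunk;
      continue;
    }
    const { thinking, content } = extractor.process(chunk.content);
    if (thinking) {
      yield { role: "thinking", content: thinking };
    }
    if (content) {
      yield { role: "assistant", content };
    }
  }

  // Flush once the upstream stream ends so buffered partial-tag text is not lost.
  const tail = extractor.flush();
  if (tail.thinking) {
    yield { role: "thinking", content: tail.thinking };
  }
  if (tail.content) {
    yield { role: "assistant", content: tail.content };
  }
}
```

Keeping the wrapper at this level, rather than in BaseLLM, matches the reviewer's request to silo the behavior to the vLLM provider.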
cubic-dev-ai review: 1 issue found across 2 files (reviewed changes from recent commits).

Violation 1 (core/llm/thinkingTagIntegration.vitest.ts:30): Tests duplicate the production `_streamChat` logic inside `MockVllm`, so none of these cases exercise the real vLLM streaming implementation; regressions in `Vllm._streamChat` would still pass.
@RomneyDa is it ok?
Related to #5992
Summary by cubic
Adds configurable thinking output format support for vLLM by parsing custom tags in streamed responses, while preserving vLLM's native reasoning_content during streaming. Models that wrap their reasoning in custom tag pairs now emit proper "thinking" and "assistant" chunks.
Written for commit d99b93c.