Allow declaring FI source and commit hash in env #29
sempervictus wants to merge 1 commit into guoqingbao:main
Conversation
Hardcoded repository and commit sources mean testing FI changes requires an additional out-of-tree step: the library must be updated and consumers pointed to the updated branch. Allow env vars to override the defaults, with safe fallback to the prior settings - CARGO_FEATURE_FLASHINFER_REPO for the repo URL and CARGO_FEATURE_FLASHINFER_COMMIT for the commit hash.
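A minimal build-script sketch of the override logic this PR describes. The default URL and commit below are placeholders, not the crate's actual pinned values; only the two env var names come from the PR itself.

```rust
use std::env;

// Placeholder defaults -- the real pinned repo/commit live in the crate's build.rs.
const DEFAULT_REPO: &str = "https://github.com/flashinfer-ai/flashinfer.git";
const DEFAULT_COMMIT: &str = "pinned-commit-hash";

/// Return the FlashInfer source to build against: the env override if set,
/// otherwise fall back safely to the hardcoded default.
fn flashinfer_source() -> (String, String) {
    let repo = env::var("CARGO_FEATURE_FLASHINFER_REPO")
        .unwrap_or_else(|_| DEFAULT_REPO.to_string());
    let commit = env::var("CARGO_FEATURE_FLASHINFER_COMMIT")
        .unwrap_or_else(|_| DEFAULT_COMMIT.to_string());
    (repo, commit)
}

fn main() {
    let (repo, commit) = flashinfer_source();
    // Surface the chosen source in the cargo build output.
    println!("cargo:warning=building FlashInfer from {repo} @ {commit}");
}
```

With this shape, `CARGO_FEATURE_FLASHINFER_REPO=https://github.com/you/flashinfer.git cargo build` is enough to point a consumer at a fork, and an unset env leaves behavior identical to before.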
@guoqingbao: the reasoning here being that FI is a fast-moving target re testing, and this will allow such testing of pull requests/merges/etc. in external repos without bothering you too much to update yours
We used their interface from a specific version and are not sure if they can keep it backward compatible in their new versions.
This is a good way to test that exact thing 😁
@guoqingbao - this works, i've been using it to target my repo and branch and running the results all weekend (i'm running FI master + the sm120 kernel PR though need
When testing main and the llguidance PR, I found a key issue that must be fixed: the model is sometimes unable to produce a tool call during a sequence of tasks (not one-by-one human interactions). The model said it needed to read a file, for example, but simply no tool call was produced, which is weird since it can cause certain agents like opencode and Claude Code to stop in the middle of the task. Are you able to help analyze this issue? This will be much appreciated. And I think llguidance is not our current first priority given that it didn't work well with tool calls for popular agents; it is only used when agents explicitly request a certain format.
Yes. This is one of the things i've been seeing since the beginning and far less now as both this and the LLG PR are intended to fix things like this. You and i use different clients but this is what i've been seeing:
1a. This does not happen with . If your clients contain any non-ascii chars (local characters, for example) then LLG may suppress them. I believe that commit is also filtering tools for ones provided as valid and rejecting incorrect matches, all of which you should be able to see clearly in the . Easiest test i can propose:
Change

```rust
tool_schema: Some(tool_schema),
tool_constraint: tool_constraint,
reasoning_start_id,
```

to

```rust
tool_schema: Some(tool_schema),
tool_constraint: None,
reasoning_start_id,
```
Strongly urge using #30 because it allows work up to 2M tokens; i was seeing mistakes at ~20-50K without it, especially w/ tool calls.
Far as FI API stability - this is what the Q3Coder80 has to say while digging around:
which seems rather relevant to this lib for prefix indexing, and could possibly facilitate in-sequence page swapping, effectively removing the ctx-size limitation imposed on models by VRAM, short of loading them and at least a few blocks/pages of kvcache
This happens with or w/o llguidance, and everything seems normal before the last no-tool-call response: the model simply doesn't produce any tool call, including the start mark, and ends with a normal stop token (EOS) after a sentence.
This has been a problem since i started using
We've got a few gaps where we silently drop tokens - adding logging to said gaps now and handling the un-decoded ID track. I think we can also simplify and strengthen the tool-handling a bit with proper chat template extraction, processing, and tool-call extractor selection (right now we're auto-selecting what should be the right one, but it seems the 80B alternates between ). That said, the specific issue which you describe that produces "just a stop with no content" is something i've seen as well and am hoping to illuminate with the added logging, as stop tokens, role change, etc. appear to be handled silently
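As an illustration of the "un-decoded ID track" idea, a minimal sketch of holding token IDs that don't yet decode to text (e.g. a split multi-byte UTF-8 sequence) instead of silently dropping them. All names here (`PendingIds`, the `decode` callback) are hypothetical, not the actual vllm-rs API:

```rust
/// Hypothetical buffer for token IDs that have not yet produced
/// decodable text; illustrative only, not vllm-rs internals.
struct PendingIds(Vec<u32>);

impl PendingIds {
    fn new() -> Self {
        PendingIds(Vec::new())
    }

    /// Try to decode the pending IDs plus the new one. If the decoder
    /// yields no text yet (e.g. an incomplete UTF-8 sequence), keep the
    /// IDs and log them rather than silently dropping tokens.
    fn push(&mut self, id: u32, decode: impl Fn(&[u32]) -> Option<String>) -> Option<String> {
        self.0.push(id);
        match decode(&self.0) {
            Some(text) if !text.is_empty() => {
                self.0.clear();
                Some(text)
            }
            _ => {
                eprintln!("holding {} un-decoded token id(s): {:?}", self.0.len(), self.0);
                None
            }
        }
    }
}
```

The point is the logging in the fallthrough arm: every gap where a token fails to surface as text becomes visible instead of vanishing from the stream.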
@guoqingbao - you mean this error? 😁
Still trying to work out how that works but "yes" - empty calls happen (even without this commit), kilo just re-prompts though
Have you encountered this in vLLM with the nvfp4 80B model? |
I think we followed what the openapi spec said.
No, that's why i have the Spark running the openai compat fixes in kilo kode in the background, but it will take well into tomorrow - it has to re-prefill each turn (no prefix cache), context is about 300k right now, so it's... not happy. I'd love to use the SM120s but the more tool errors there are, the more it makes, and the more you have to "talk to it" to get it out of that state. Guidance should help there but it needs to trend the declination, so i'm restructuring it a bit to keep state over sequences and expose itself correctly in the API.

Until i started using Python vLLM, i had not seen chained/multi-tool calls work, nor run a long session without at least a few errors on tools. So i'm trying to get us 1:1 with their impl. My apologies, i think i'm not being clear somehow, but tool calling on vllm-rs has always been really shaky, especially in long-context. I'm digging through how we're pulling and parsing chat templates, the tokenizing, everything i can think of to figure out where and how we're causing the effects i showed above. KiloCode and OpenWebUI just handle the failures better, but TUI clients which actually consume bad UTF-8 characters into the terminal state suffer the most when it partially dumps out some bytes from the stream of its tool response.
You mean nvfp4 80B on spark?
Is this vLLM.rs with the guidance PR?
I've fixed the flash-attn backend for qwen3 next and qwen3.5; should push a PR for that.
Ironically, the model stops at the exact same point each time during a sequence of calls. The last time it said "Now let me check the docs directory for important documentation files and look at the Python package structure." but produced no tool call at all; it just stopped (normal EOS) after this sentence.
Or this gem ... sometimes it just "ignores you" for a couple of turns (also an old behavior at reasonable context sizes):
I found the root cause: we simply passed the previous tool-call context through to the new request. This has been fixed in #235, resulting in robust tool-call handling in both Claude Code and opencode.
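Purely as an illustration of the shape of that kind of fix (the real change is in #235, and these types are hypothetical, not the actual vllm-rs structures): reset per-request tool-call state instead of inheriting whatever the previous request left behind.

```rust
/// Hypothetical per-request tool-call state; illustrative names only.
#[derive(Default, Debug, PartialEq)]
struct ToolCallState {
    partial_call: Option<String>, // any half-parsed tool-call text
    in_tool_call: bool,
}

struct RequestContext {
    tool_state: ToolCallState,
}

impl RequestContext {
    /// Build the context for a new request with fresh tool-call state,
    /// rather than carrying the previous request's context forward.
    fn new_request() -> Self {
        RequestContext {
            tool_state: ToolCallState::default(),
        }
    }
}
```

Starting each request from a clean `ToolCallState` avoids stale partial-call text or an `in_tool_call` flag from an earlier turn suppressing the tool-call start mark on the next one.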

