
Allow declaring FI source and commit hash in env #29

Open
sempervictus wants to merge 1 commit into guoqingbao:main from sempervictus:feature/select_flashinfer

Conversation

@sempervictus
Contributor

Hardcoded repository and commit sources make testing require an additional step out-of-tree wherein the library must be updated and consumers pointed to the updated branch in order to test FI changes.

Allow use of env vars to override the defaults with safe fallback to the prior settings - CARGO_FEATURE_FLASHINFER_REPO for repo URL and CARGO_FEATURE_FLASHINFER_COMMIT for commit hash.
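A minimal sketch of the override-with-fallback logic (the default URL and commit here are placeholders standing in for the hardcoded values in build.rs, not the actual pins):

```rust
use std::env;

// Placeholder defaults; the real repo pins a specific URL and commit hash.
const DEFAULT_REPO: &str = "https://github.com/flashinfer-ai/flashinfer";
const DEFAULT_COMMIT: &str = "0000000";

// Env var wins if set; otherwise fall back to the prior hardcoded value.
fn resolve(var: &str, default: &str) -> String {
    env::var(var).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let repo = resolve("CARGO_FEATURE_FLASHINFER_REPO", DEFAULT_REPO);
    let commit = resolve("CARGO_FEATURE_FLASHINFER_COMMIT", DEFAULT_COMMIT);
    println!("fetching {repo} @ {commit}");
}
```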

@sempervictus
Contributor Author

sempervictus commented Feb 15, 2026

@guoqingbao: reasoning here being that FI is a fast-moving target re testing and this will allow such testing of pull requests/merges/etc in external repos without bothering you too much to update yours

@guoqingbao
Owner

@guoqingbao: reasoning here being that FI is a fast-moving target re testing and this will allow such testing of pull requests/merges/etc in external repos without bothering you too much to update yours

We use their interface from a specific version, and I'm not sure they can keep it backward compatible in their new versions.

@sempervictus
Contributor Author

This is a good way to test that exact thing 😁

@sempervictus
Contributor Author

@guoqingbao - this works, i've been using it to target my repo and branch and running the results all weekend (i'm running FI master + the sm120 kernel PR, though i need trtllm back for that to work)

@guoqingbao
Owner

@guoqingbao - this works, i've been using it to target my repo and branch and running the results all weekend (i'm running FI master + the sm120 kernel PR, though i need trtllm back for that to work)

When testing main and the llguidance PR, I found a key issue that must be fixed: the model is sometimes unable to produce a tool call during a sequence of tasks (not one-by-one human interactions). The model said it needed to read a file, for example, but simply no tool call was produced, which is weird since it can cause agents like opencode and Claude Code to stop in the middle of a task. Are you able to help analyze this issue? That would be much appreciated. And I think llguidance is not our current first priority, given that it didn't work well with tool calls for popular agents; it is only used when agents explicitly request a certain format.

@sempervictus
Contributor Author

sempervictus commented Feb 17, 2026

Are you able to help analyze this issue?

Yes. This is one of the things i've been seeing since the beginning and far less now as both this and the LLG PR are intended to fix things like this. You and i use different clients but this is what i've been seeing:

  1. aichat - malformation of the chat stream immediately after a tool-call which sometimes reflects tool-call output back into the chat stream presented by aichat from vllm.rs:

Call web_search_searxng {"query":"llguidance inference performance token mask computation chat templates HuggingFace"}
Processing 4096 tokens per chunk

Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_templating"}
Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_templating"}
Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_extras"}
Using apply_chat_template add_generation_prompt continue_final_message Model training\n"}
<|im_start|><|im_start|>assistant

1a. This does not happen with vllm or mistral.rs, it is specific to the API interactions with vllm.rs
2. Kilo Code - highly visible errors of empty responses/requests or just cessation of task execution
3. OpenWebUI - these fail silently in the back-end on research tasks causing the model to try and "fill in the search result gaps" without much explanation


If your clients contain any non-ASCII chars (local characters, for example) then LLG may suppress them. I believe that commit is also filtering tools for ones provided as valid and rejecting incorrect matches, all of which you should be able to see clearly in the vllm.rs log stream.
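To illustrate the suspected failure mode: an ASCII-only constraint masks any token carrying bytes outside the ASCII range. The filter below is an illustrative stand-in, not the actual build_tool_call_constraint_ascii implementation:

```rust
// Stand-in for an ASCII-only token constraint: any token containing a
// byte >= 0x80 (i.e. any non-ASCII character) would be masked out of the
// sampling distribution, silently suppressing local-language text.
fn passes_ascii_constraint(token: &str) -> bool {
    token.is_ascii()
}

fn main() {
    // Plain JSON arguments pass; localized content does not.
    assert!(passes_ascii_constraint("{\"query\":\"test\"}"));
    assert!(!passes_ascii_constraint("{\"query\":\"日本語\"}"));
    println!("non-ASCII tokens would be masked under the ascii constraint");
}
```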

Easiest test i can propose:

  1. Comment out build_tool_call_constraint_ascii and select build_tool_call_constraint_utf8 to determine if the char-constraint is the issue
  2. Failing that, set None in:
            tool_schema: Some(tool_schema),
            tool_constraint: tool_constraint,
            reasoning_start_id,

to

            tool_schema: Some(tool_schema),
            tool_constraint: None,
            reasoning_start_id,
  3. If all else fails, only use this target branch without Implement LLGuidance vllm.rs#232 to fully eliminate the code changes made on the vllm.rs side (they're intended to be complementary but not interdependent).

Strongly urge using #30 because it allows work into 2M tokens, but i was seeing mistakes at ~20-50K without it, especially w/ tool calls.

@sempervictus
Contributor Author

Far as FI API stability - this is what the Q3Coder80 has to say while digging around:

FlashInfer 0.6.3 documentation reflects an update over 0.6.2 in its API surface, with one clear addition: the flashinfer.page module now includes a new function flashinfer.page.get_batch_indices_positions, which supplements append_paged_kv_cache and append_paged_mla_kv_cache. All other modules retain their previously listed APIs from version 0.6.2.

The page table-related functionality is now:

  • flashinfer.page.append_paged_kv_cache
  • flashinfer.page.append_paged_mla_kv_cache
  • flashinfer.page.get_batch_indices_positions

This addition likely supports improved batch index management for page-aware KV-cache append operations, relevant for inference with paged attention and MLA (Multi-Head Latent Attention) workloads.
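Assuming get_batch_indices_positions does what its name suggests (an inference from the changelog summary above, not verified against the flashinfer source), the core index math could be sketched in plain Rust: given the CSR indptr of tokens appended per request, emit each token's request (batch) index and its position within that request's append. The real kernel would additionally offset positions by existing sequence lengths:

```rust
// Sketch of batch-index/position computation for a paged KV-cache append.
// indptr is CSR-style: request b owns tokens indptr[b]..indptr[b+1].
fn batch_indices_positions(indptr: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut batch = Vec::new();
    let mut pos = Vec::new();
    for b in 0..indptr.len() - 1 {
        for p in 0..(indptr[b + 1] - indptr[b]) {
            batch.push(b); // which request this token belongs to
            pos.push(p);   // offset within that request's appended slice
        }
    }
    (batch, pos)
}

fn main() {
    // Two requests appending 2 and 3 tokens respectively.
    let (batch, pos) = batch_indices_positions(&[0, 2, 5]);
    println!("batch = {batch:?}, pos = {pos:?}");
}
```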

All other modules (flashinfer.decode, flashinfer.prefill, flashinfer.gemm, flashinfer.fused_moe, flashinfer.cascade, flashinfer.comm, flashinfer.sparse, flashinfer.sampling, flashinfer.topk, flashinfer.logits_processor, flashinfer.norm, flashinfer.rope, flashinfer.activation, flashinfer.quantization, flashinfer.green_ctx, flashinfer.fp4_quantization, flashinfer.testing) retain the same function signatures as in 0.6.2, based on the provided context.

which seems rather relevant to this lib for prefix indexing, and could possibly facilitate in-sequence page swapping, effectively removing the context-size limitation on models imposed by VRAM (short of loading the model and at least a few blocks/pages of kvcache)

@guoqingbao
Owner

guoqingbao commented Feb 18, 2026

Yes. This is one of the things i've been seeing since the beginning and far less now as both this and the LLG PR are intended to fix things like this. You and i use different clients but this is what i've been seeing:

This happens both with and without llguidance. Everything seems normal before the last no-tool-call response; the model simply doesn't produce any tool call, including the start mark, and ends with a normal stop token (EOS) after a sentence.

@sempervictus
Contributor Author

This happens both with and without llguidance. Everything seems normal before the last no-tool-call response; the model simply doesn't produce any tool call, including the start mark, and ends with a normal stop token (EOS) after a sentence.

This has been a problem since i started using vllm.rs - with, without, doesn't matter. The stream-structural defects i listed have to be addressed to ensure we only send valid text back to the user and log every time we have a problem doing that.

...
Provider: openai (proxy)
Model: Qwen3-235B

Kilo Code tried to use read_file without value for required parameter 'args (containing valid file paths)'. Retrying...

We've got a few gaps where we silently drop tokens - adding logging to those gaps now and handling the un-decoded ID track.

I think we can also simplify and strengthen the tool-handling a bit with proper chat template extraction, processing, and tool-call extractor selection (right now we're auto-selecting what should be the right one, but the 80B seems to alternate between <tool_call>{json}</tool_call> and the QwenCoder magic markup during execution).
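For reference, a tolerant matcher for the <tool_call>{json}</tool_call> form could look like this sketch (illustration only, not the vllm.rs extractor; the QwenCoder markup variant would need a second matcher alongside it):

```rust
// Extract the JSON payload between <tool_call> and </tool_call> markers,
// returning None when no complete call is present in the text.
fn extract_tool_call(text: &str) -> Option<&str> {
    let start = text.find("<tool_call>")? + "<tool_call>".len();
    let end = text[start..].find("</tool_call>")? + start;
    Some(text[start..end].trim())
}

fn main() {
    let sample = "thinking...<tool_call>{\"name\":\"read_file\"}</tool_call>";
    println!("{:?}", extract_tool_call(sample));
}
```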

That said, the specific issue you describe that produces "just a stop with no content" is something i've seen as well and am hoping to illuminate with the added logging, since stop tokens, role changes, etc. appear to be handled silently

@sempervictus
Contributor Author

@guoqingbao - you mean this error? 😁

Provider: openai (proxy)
Model: Q3C80B

MODEL_NO_TOOLS_USED

Still trying to work out how that works but "yes" - empty calls happen (even without this commit), kilo just re-prompts though

@sempervictus
Contributor Author

This is what i mean about coder-formatted calls... it seems to start out q3-compatible but gets confused eventually and emanates
[screenshot]

@guoqingbao
Owner

@guoqingbao - you mean this error? 😁

Provider: openai (proxy)
Model: Q3C80B
MODEL_NO_TOOLS_USED

Still trying to work out how that works but "yes" - empty calls happen (even without this commit), kilo just re-prompts though

Have you encountered this in vLLM with the nvfp4 80B model?

@guoqingbao
Owner

This is what i mean about coder-formatted calls... it seems to start out q3-compatible but gets confused eventually and emanates

I think we followed what the openapi spec says.

@sempervictus
Contributor Author

Have you encountered this in vLLM with the nvfp4 80B model?

No. That's why i have the Spark running the openai compat fixes in Kilo Code in the background, but it will take well into tomorrow - it has to re-prefill each turn (no prefix cache), context is about 300k right now, so it's... not happy. I'd love to use the SM120s, but the more tool errors there are, the more it makes, and the more you have to "talk to it" to get it out of that state. Guidance should help there, but it needs to trend the declination, so i'm restructuring it a bit to keep state over sequences and expose itself correctly in the API. Until i started using Python vLLM, i had not seen chained/multi-tool calls work, nor a long session run without at least a few errors on tools. So i'm trying to get us 1:1 with their impl.

My apologies, i think i'm not being clear somehow, but tool calling on vllm.rs has always been really shaky, especially in long context. I'm digging through how we're pulling and parsing chat templates, the tokenizing, everything i can think of, to figure out where and how we're causing the effects i showed above. Kilo Code and OpenWebUI just handle the failures better, but TUI clients which actually consume bad UTF-8 characters into the terminal state suffer the most when it partially dumps out some bytes from the stream of its tool response.

@guoqingbao
Owner

it has to re-prefill each turn (no prefix cache), context is about 300k right now, so its... not happy.

You mean nvfp4 80B on spark?

they more tool errors there are, the more it makes, and the more you have to "talk to it" to get it out of that state. Guidance should help there but it needs to trend the declination so i'm restructuring it a bit to keep state over sequences and expose itself correctly in the API.

Is this vLLM.rs with the guidance PR?

@guoqingbao
Owner

I've fixed the flash-attn backend for qwen3 next and qwen3.5; should push a PR for that.

@guoqingbao
Owner

Ironically, the model stops at the exact same point each time during a sequence of calls.

❯ /init

● Read 2 files (ctrl+o to expand)

● Let me check for existing CLAUDE.md or similar files, and also explore the build and test setup.

● Searched for 2 patterns, read 2 files (ctrl+o to expand)

● Let me check the run.sh and other key files to understand the build system better.

● Read 1 file (ctrl+o to expand)

● Now let me check the docs directory for important documentation files and look at the Python package structure.

────────────────────────────────────────
❯

The last time, it said "Now let me check the docs directory for important documentation files and look at the Python package structure." but produced no tool call at all; it just stopped (normal EOS) after this sentence.

@sempervictus
Contributor Author

"normal EOS" might not be normal though if its a token that did not decode or a chunk out of order that was dropped (fix incoming)

Ironically, i just got bit by the content-leak thing in Kilo Code (the Spark wasn't hacking it, too long) - it read a file via tool call and vomited back a good chunk of it, malforming the response:

[screenshot]

@sempervictus
Contributor Author

Or this gem ... sometimes it just "ignores you" for a couple of turns (also an old behavior at reasonable context sizes):

Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output.

@guoqingbao
Owner

Or this gem ... sometimes it just "ignores you" for a couple of turns (also an old behavior at reasonable context sizes):

I found the root cause: we simply passed the previous tool-call context through to the new request. This has been fixed in #235, resulting in robust tool call handling in both Claude Code and opencode.

