
Allow declaring FI source and commit hash in env #29

Open
sempervictus wants to merge 1 commit into guoqingbao:main from sempervictus:feature/select_flashinfer

Conversation

@sempervictus
Contributor

Hardcoded repository and commit sources make testing require an additional step out-of-tree wherein the library must be updated and consumers pointed to the updated branch in order to test FI changes.

Allow use of env vars to override the defaults with safe fallback to the prior settings - CARGO_FEATURE_FLASHINFER_REPO for repo URL and CARGO_FEATURE_FLASHINFER_COMMIT for commit hash.
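A minimal sketch of the override-with-fallback logic (the default URL and commit here are placeholders standing in for the hardcoded values in build.rs, not the actual pins):

```rust
use std::env;

// Placeholder defaults; the real repo pins a specific URL and commit hash.
const DEFAULT_REPO: &str = "https://github.com/flashinfer-ai/flashinfer";
const DEFAULT_COMMIT: &str = "0000000";

// Env var wins if set; otherwise fall back to the prior hardcoded value.
fn resolve(var: &str, default: &str) -> String {
    env::var(var).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let repo = resolve("CARGO_FEATURE_FLASHINFER_REPO", DEFAULT_REPO);
    let commit = resolve("CARGO_FEATURE_FLASHINFER_COMMIT", DEFAULT_COMMIT);
    println!("fetching {repo} @ {commit}");
}
```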

@sempervictus
Contributor Author

sempervictus commented Feb 15, 2026

@guoqingbao: reasoning here being that FI is a fast-moving target re testing and this will allow such testing of pull requests/merges/etc in external repos without bothering you too much to update yours

@guoqingbao
Owner

@guoqingbao: reasoning here being that FI is a fast-moving target re testing and this will allow such testing of pull requests/merges/etc in external repos without bothering you too much to update yours

We use their interface from a specific version, and I'm not sure they can keep it backward compatible in their new versions.

@sempervictus
Contributor Author

This is a good way to test that exact thing 😁

@sempervictus
Contributor Author

@guoqingbao - this works, i've been using it to target my repo and branch and running the results all weekend (i'm running FI master + the sm120 kernel PR, though i need trtllm back for that to work)

@guoqingbao
Owner

@guoqingbao - this works, i've been using it to target my repo and branch and running the results all weekend (i'm running FI master + the sm120 kernel PR, though i need trtllm back for that to work)

When testing main and the llguidance PR, I found a key issue that must be fixed: the model is sometimes unable to produce a tool call during a sequence of tasks (not one-by-one human interactions). The model said it needed to read a file, for example, but simply no tool call was produced, which is weird since it can cause agents like opencode and Claude Code to stop in the middle of a task. Are you able to help analyze this issue? That would be much appreciated. And I think llguidance is not our current first priority, given that it didn't work well with tool calls for popular agents; it is only used when agents explicitly request a certain format.

@sempervictus
Contributor Author

sempervictus commented Feb 17, 2026

Are you able to help analyze this issue?

Yes. This is one of the things i've been seeing since the beginning and far less now as both this and the LLG PR are intended to fix things like this. You and i use different clients but this is what i've been seeing:

  1. aichat - malformation of the chat stream immediately after a tool-call which sometimes reflects tool-call output back into the chat stream presented by aichat from vllm.rs:

Call web_search_searxng {"query":"llguidance inference performance token mask computation chat templates HuggingFace"}
Processing 4096 tokens per chunk

Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_templating"}
Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_templating"}
Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_extras"}
Using apply_chat_template add_generation_prompt continue_final_message Model training\n"}
<|im_start|><|im_start|>assistant

1a. This does not happen with vllm or mistral.rs, it is specific to the API interactions with vllm.rs
2. Kilo Code - highly visible errors of empty responses/requests or just cessation of task execution
3. OpenWebUI - these fail silently in the back-end on research tasks causing the model to try and "fill in the search result gaps" without much explanation


If your clients contain any non-ASCII chars (local characters, for example) then LLG may suppress them. I believe that commit is also filtering tools for ones provided as valid and rejecting incorrect matches, all of which you should be able to see clearly in the vllm.rs log stream.
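To illustrate the suspected failure mode: an ASCII-only constraint masks any token carrying bytes outside the ASCII range. The filter below is an illustrative stand-in, not the actual build_tool_call_constraint_ascii implementation:

```rust
// Stand-in for an ASCII-only token constraint: any token containing a
// byte >= 0x80 (i.e. any non-ASCII character) would be masked out of the
// sampling distribution, silently suppressing local-language text.
fn passes_ascii_constraint(token: &str) -> bool {
    token.is_ascii()
}

fn main() {
    // Plain JSON arguments pass; localized content does not.
    assert!(passes_ascii_constraint("{\"query\":\"test\"}"));
    assert!(!passes_ascii_constraint("{\"query\":\"日本語\"}"));
    println!("non-ASCII tokens would be masked under the ascii constraint");
}
```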

Easiest test i can propose:

  1. Comment out build_tool_call_constraint_ascii and select build_tool_call_constraint_utf8 to determine if the char-constraint is the issue
  2. Failing that, set None in:
            tool_schema: Some(tool_schema),
            tool_constraint: tool_constraint,
            reasoning_start_id,

to

            tool_schema: Some(tool_schema),
            tool_constraint: None,
            reasoning_start_id,
  3. If all else fails, only use this target branch without Implement LLGuidance vllm.rs#232 to fully eliminate the code changes made on the vllm.rs side (they're intended to be complementary but not interdependent).

Strongly urge using #30 because it allows work into 2M tokens, but i was seeing mistakes at ~20-50K without it, especially w/ tool calls.

@sempervictus
Contributor Author

Far as FI API stability - this is what the Q3Coder80 has to say while digging around:

FlashInfer 0.6.3 documentation reflects an update over 0.6.2 in its API surface, with one clear addition: the flashinfer.page module now includes a new function flashinfer.page.get_batch_indices_positions, which supplements append_paged_kv_cache and append_paged_mla_kv_cache. All other modules retain their previously listed APIs from version 0.6.2.

The page table-related functionality is now:

  • flashinfer.page.append_paged_kv_cache
  • flashinfer.page.append_paged_mla_kv_cache
  • flashinfer.page.get_batch_indices_positions

This addition likely supports improved batch index management for page-aware KV-cache append operations, relevant for inference with paged attention and MLA (Multi-Head Latent Attention) workloads.
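Assuming get_batch_indices_positions does what its name suggests (an inference from the changelog summary above, not verified against the flashinfer source), the core index math could be sketched in plain Rust: given the CSR indptr of tokens appended per request, emit each token's request (batch) index and its position within that request's append. The real kernel would additionally offset positions by existing sequence lengths:

```rust
// Sketch of batch-index/position computation for a paged KV-cache append.
// indptr is CSR-style: request b owns tokens indptr[b]..indptr[b+1].
fn batch_indices_positions(indptr: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut batch = Vec::new();
    let mut pos = Vec::new();
    for b in 0..indptr.len() - 1 {
        for p in 0..(indptr[b + 1] - indptr[b]) {
            batch.push(b); // which request this token belongs to
            pos.push(p);   // offset within that request's appended slice
        }
    }
    (batch, pos)
}

fn main() {
    // Two requests appending 2 and 3 tokens respectively.
    let (batch, pos) = batch_indices_positions(&[0, 2, 5]);
    println!("batch = {batch:?}, pos = {pos:?}");
}
```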

All other modules (flashinfer.decode, flashinfer.prefill, flashinfer.gemm, flashinfer.fused_moe, flashinfer.cascade, flashinfer.comm, flashinfer.sparse, flashinfer.sampling, flashinfer.topk, flashinfer.logits_processor, flashinfer.norm, flashinfer.rope, flashinfer.activation, flashinfer.quantization, flashinfer.green_ctx, flashinfer.fp4_quantization, flashinfer.testing) retain the same function signatures as in 0.6.2, based on the provided context.

which seems rather relevant to this lib for prefix indexing, and could possibly facilitate in-sequence page swapping, effectively removing the context-size limitation on models imposed by VRAM (short of loading the model and at least a few blocks/pages of kvcache)

@guoqingbao
Owner

guoqingbao commented Feb 18, 2026

Yes. This is one of the things i've been seeing since the beginning and far less now as both this and the LLG PR are intended to fix things like this. You and i use different clients but this is what i've been seeing:

This happens both with and without llguidance. Everything seems normal before the last no-tool-call response; the model simply doesn't produce any tool call, including the start mark, and ends with a normal stop token (EOS) after a sentence.

@sempervictus
Contributor Author

This happens both with and without llguidance. Everything seems normal before the last no-tool-call response; the model simply doesn't produce any tool call, including the start mark, and ends with a normal stop token (EOS) after a sentence.

This has been a problem since i started using vllm.rs - with, without, doesn't matter. The stream-structural defects i listed have to be addressed to ensure we only send valid text back to the user and log every time we have a problem doing that.

...
Provider: openai (proxy)
Model: Qwen3-235B

Kilo Code tried to use read_file without value for required parameter 'args (containing valid file paths)'. Retrying...

We've got a few gaps where we silently drop tokens - adding logging to those gaps now and handling the un-decoded ID track.

I think we can also simplify and strengthen the tool-handling a bit with proper chat template extraction, processing, and tool-call extractor selection (right now we're auto-selecting what should be the right one, but the 80B seems to alternate between <tool_call>{json}</tool_call> and the QwenCoder magic markup during execution).
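For reference, a tolerant matcher for the <tool_call>{json}</tool_call> form could look like this sketch (illustration only, not the vllm.rs extractor; the QwenCoder markup variant would need a second matcher alongside it):

```rust
// Extract the JSON payload between <tool_call> and </tool_call> markers,
// returning None when no complete call is present in the text.
fn extract_tool_call(text: &str) -> Option<&str> {
    let start = text.find("<tool_call>")? + "<tool_call>".len();
    let end = text[start..].find("</tool_call>")? + start;
    Some(text[start..end].trim())
}

fn main() {
    let sample = "thinking...<tool_call>{\"name\":\"read_file\"}</tool_call>";
    println!("{:?}", extract_tool_call(sample));
}
```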

That said, the specific issue you describe that produces "just a stop with no content" is something i've seen as well and am hoping to illuminate with the added logging, since stop tokens, role changes, etc. appear to be handled silently

@sempervictus
Contributor Author

@guoqingbao - you mean this error? 😁

Provider: openai (proxy)
Model: Q3C80B

MODEL_NO_TOOLS_USED

Still trying to work out how that works but "yes" - empty calls happen (even without this commit), kilo just re-prompts though

@sempervictus
Contributor Author

This is what i mean about coder-formatted calls... it seems to start out q3-compatible but gets confused eventually and emanates
[screenshot]

@guoqingbao
Owner

@guoqingbao - you mean this error? 😁

Provider: openai (proxy)
Model: Q3C80B
MODEL_NO_TOOLS_USED

Still trying to work out how that works but "yes" - empty calls happen (even without this commit), kilo just re-prompts though

Have you encountered this in vLLM with the nvfp4 80B model?

@guoqingbao
Owner

This is what i mean about coder-formatted calls... it seems to start out q3-compatible but gets confused eventually and emanates

I think we followed what the openapi spec says.

@sempervictus
Contributor Author

Have you encountered this in vLLM with the nvfp4 80B model?

No. That's why i have the Spark running the openai compat fixes in Kilo Code in the background, but it will take well into tomorrow - it has to re-prefill each turn (no prefix cache), context is about 300k right now, so it's... not happy. I'd love to use the SM120s, but the more tool errors there are, the more it makes, and the more you have to "talk to it" to get it out of that state. Guidance should help there, but it needs to trend the declination, so i'm restructuring it a bit to keep state over sequences and expose itself correctly in the API. Until i started using Python vLLM, i had not seen chained/multi-tool calls work, nor a long session run without at least a few errors on tools. So i'm trying to get us 1:1 with their impl.

My apologies, i think i'm not being clear somehow, but tool calling on vllm.rs has always been really shaky, especially in long context. I'm digging through how we're pulling and parsing chat templates, the tokenizing, everything i can think of, to figure out where and how we're causing the effects i showed above. Kilo Code and OpenWebUI just handle the failures better, but TUI clients which actually consume bad UTF-8 characters into the terminal state suffer the most when it partially dumps out some bytes from the stream of its tool response.

@guoqingbao
Owner

it has to re-prefill each turn (no prefix cache), context is about 300k right now, so its... not happy.

You mean nvfp4 80B on spark?

they more tool errors there are, the more it makes, and the more you have to "talk to it" to get it out of that state. Guidance should help there but it needs to trend the declination so i'm restructuring it a bit to keep state over sequences and expose itself correctly in the API.

Is this vLLM.rs with the guidance PR?

@guoqingbao
Owner

I've fixed the flash-attn backend for qwen3 next and qwen3.5; should push a PR for that.

@guoqingbao
Owner

Ironically, the model stops at the exact same point each time during a sequence of calls.

❯ /init

● Read 2 files (ctrl+o to expand)

● Let me check for existing CLAUDE.md or similar files, and also explore the build and test setup.

● Searched for 2 patterns, read 2 files (ctrl+o to expand)

● Let me check the run.sh and other key files to understand the build system better.

● Read 1 file (ctrl+o to expand)

● Now let me check the docs directory for important documentation files and look at the Python package structure.

────────────────────────────────────────
❯

The last time, it said "Now let me check the docs directory for important documentation files and look at the Python package structure." but produced no tool call at all; it just stopped (normal EOS) after this sentence.

@sempervictus
Contributor Author

"normal EOS" might not be normal though if its a token that did not decode or a chunk out of order that was dropped (fix incoming)

Ironically, i just got bit by the content-leak thing in Kilo Code (the Spark wasn't hacking it, too long) - it read a file via tool call and vomited back a good chunk of it, malforming the response:

[screenshot]

@sempervictus
Contributor Author

Or this gem ... sometimes it just "ignores you" for a couple of turns (also an old behavior at reasonable context sizes):

Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output.

@guoqingbao
Owner

Or this gem ... sometimes it just "ignores you" for a couple of turns (also an old behavior at reasonable context sizes):

I found the root cause: we simply passed the previous tool-call context through to the new request. This has been fixed in #235, resulting in robust tool call handling in both Claude Code and opencode.

