
Enable precision flags for flashinfer compilation#30

Open
sempervictus wants to merge 2 commits into guoqingbao:main from sempervictus:flashinfer/precision_flags

Conversation

@sempervictus
Contributor

Enable precision flags for flashinfer compilation

Guard against deterioration of precision in attention generation in flashinfer/cutlass caused by fast-math-style optimizations, which result in iterative "dumbing down" of the values relevant to accurate sampling and final decode:

  1. Disable FMA rounding non-determinism
  2. Preserve subnormals (critical for attention)
  3. Enable precise division
  4. Enable precise square-root

None of these things should be happening in the compilation process given the number of cumulatively deteriorating iterations piling up in a million-token-long KVCache, so let's guard against them explicitly.
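For concreteness, the four guards above correspond to standard nvcc codegen flags (--fmad, --ftz, --prec-div, --prec-sqrt). A minimal sketch of the mapping; the function name and build wiring are illustrative, not this PR's actual code:

```rust
// Hypothetical sketch: map the four precision guards onto their nvcc
// codegen flags. How they get threaded into the kernel build is not
// shown here and may differ from the actual change.
fn precision_flags() -> Vec<&'static str> {
    vec![
        "--fmad=false",     // 1. disable FMA contraction (rounding non-determinism)
        "--ftz=false",      // 2. preserve subnormals instead of flushing to zero
        "--prec-div=true",  // 3. IEEE-rounded division instead of the fast approximation
        "--prec-sqrt=true", // 4. IEEE-rounded square root
    ]
}

fn main() {
    for f in precision_flags() {
        println!("{f}");
    }
}
```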

@sempervictus sempervictus force-pushed the flashinfer/precision_flags branch 2 times, most recently from 4386393 to a2c2130 Compare February 15, 2026 22:07
@sempervictus
Contributor Author

@guoqingbao - some not so great news on this front:

  1. This makes a significant difference in long-context inference and fine-grained analysis of repetitive code patterns such as protocol libraries... that's the good part.
  2. The majority of the benefit comes from the fmad/ftz enforcement, which makes sense. So does all the cost (I haven't dug into which of them is the worst yet, but it's one or both of them).
  3. On 4xSM120 this PR drops prefill rate by almost 1K T/S and decode rate by ~10 T/S. This is the case for your current rev as well as what I've been testing on upstream/my branches; both compile fine, with the same exact effect with and without this.

I don't know if upstream is overly aggressive with their kernel optimizations, whether something "leaked in" during the import, or if the padding thing you mentioned in FP8 is amplifying the skew/performance hit (I'm guessing "yes" for the latter). This PR appears to prevent the looping/insanity we see with analysis of dense JSON/patterned code/etc., but it does seem to take a bite out of performance for that efficacy (personally I'd rather go slower and correctly than fast and into the wall).

@sempervictus
Contributor Author

Also having these flags enabled appears to trip

vllm-rs-svc0  | 2026-02-15T22:30:37.151593Z  WARN vllm_rs::core::runner: Using sampling from generation_config: temp=Some(0.5), top_k=Some(20), top_p=Some(0.95), freq_penalty=Some(1.2), pres_penalty=Some(1.2)
vllm-rs-svc0  | 2026-02-15T22:30:37.170915Z  INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 8192 (remain 259023 tokens)
vllm-rs-svc0  | 2026-02-15T22:30:38.949701Z  INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 16384 (remain 250831 tokens)
vllm-rs-svc0  | 2026-02-15T22:30:40.741954Z  INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 24576 (remain 242639 tokens)
vllm-rs-svc0  | 2026-02-15T22:30:42.548652Z  INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 32768 (remain 234447 tokens)
vllm-rs-svc0  | 2026-02-15T22:30:44.362205Z  INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 40960 (remain 226255 tokens)
vllm-rs-svc0  | terminate called after throwing an instance of 'thrust::THRUST_300001_SM_1200_NS::system::system_error'
vllm-rs-svc0  |   what():  inclusive_scan failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

on occasion using your git repo and v0.6.2 commit hash (the defaults).

I've run into this a few times over the last couple of days anyway, so I know this isn't causal, but it might exacerbate the issue to the point of tracking it down. Testing right now on a newer flashinfer target to see if that helps.

@sempervictus
Contributor Author

Yeah, this is seriously impactful for production work. One of the problems I often run into with long-context conversations is "momentum" of the KVCache causing the conversation to collapse back into pathological patterns, especially with code concerns (including malformed indents and the like). It's a sampling/decode collapse sort of effect where the softmax ends up occupying some 20% of the range, apparently caused by some of the things this PR avoids. Currently (with this and upstream's current FI instead of 0.6.2), I am able to steer a 300k+ ctx conversation like one would with a person: no emphasis, prompt abuse, etc. It's behaving a lot more like the vllm version running in nvfp4 on the Spark, except it's technically slower on prefill and about 10 T/S faster on decode for now in FP8, but the working prefix cache makes that imperceptible in comparison.
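As a toy illustration of why preserving subnormals matters (this models the ftz effect on host-side f32, not the actual fused attention kernels): once exp() of a large negative logit gap lands in the subnormal range, flush-to-zero erases the softmax tail entirely instead of keeping it tiny but nonzero:

```rust
// Toy illustration, not kernel code: emulate flush-to-zero ("ftz") on
// the exponentials of a softmax and compare against keeping subnormals.
fn softmax(logits: &[f32], flush_subnormals: bool) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::MIN, f32::max);
    let mut exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    if flush_subnormals {
        for e in exps.iter_mut() {
            // Anything below the smallest positive normal f32 is subnormal.
            if *e != 0.0 && e.abs() < f32::MIN_POSITIVE {
                *e = 0.0;
            }
        }
    }
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // A dominant logit plus a tail ~90 units below it: exp(-90) is subnormal in f32.
    let logits = [0.0_f32, -90.0, -95.0];
    let kept = softmax(&logits, false);
    let flushed = softmax(&logits, true);
    // Subnormals preserved: the tail stays nonzero. Flushed: it is exactly 0.
    assert!(kept[1] > 0.0);
    assert_eq!(flushed[1], 0.0);
    println!("kept tail = {:e}, flushed tail = {:e}", kept[1], flushed[1]);
}
```

One flushed tail entry is harmless on its own; the point is that in a million-token cache these zeroed tails accumulate across iterations.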

@guoqingbao
Owner

We don't have mamba cache eviction in the current main; have you used this PR guoqingbao/vllm.rs#231?

@guoqingbao
Owner

I dont know if upstream is overly aggressive with their kernel optimizations, something "leaked in" during the import,

I think they used aggressive optimizations for optimal performance (in terms of speed) and tested under limited context lengths, e.g., 128k or up to 256k, but not at 1 million or more. I suggest we make the compiler flags you mentioned env-guided: if, for example, HIGH_PRECISION_KERNEL is enabled, we compile against the highest precision.
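A sketch of what that env-guided branch could look like in a build script; HIGH_PRECISION_KERNEL is the suggestion above and the helper name is made up, not existing code:

```rust
// Hypothetical build.rs fragment: gate the precision flags behind an
// env var. HIGH_PRECISION_KERNEL is the name proposed in this thread,
// not a variable the codebase currently reads.
fn nvcc_extra_flags() -> Vec<&'static str> {
    match std::env::var("HIGH_PRECISION_KERNEL") {
        Ok(v) if v == "1" || v.eq_ignore_ascii_case("true") => vec![
            "--fmad=false",
            "--ftz=false",
            "--prec-div=true",
            "--prec-sqrt=true",
        ],
        // Default: leave upstream's fast-math-style codegen alone.
        _ => vec![],
    }
}

fn main() {
    std::env::set_var("HIGH_PRECISION_KERNEL", "1");
    assert_eq!(nvcc_extra_flags().len(), 4);
    std::env::remove_var("HIGH_PRECISION_KERNEL");
    assert!(nvcc_extra_flags().is_empty());
    println!("env gating works");
}
```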

@sempervictus
Contributor Author

sempervictus commented Feb 16, 2026

Feature precise? ;)
Architectural consideration: if we ALWAYS aim for higher precision, then we optimize from the strongest initial position overall. Let someone else hold the 1/4-mile drag-racing title; we'll take the Le Mans one by efficacy over brute force.
Maybe we go the other way: "feature sloppy"?

@sempervictus
Contributor Author

We don't have mamba cache eviction in the current main; have you used this PR guoqingbao/vllm.rs#231?

Hasn't caught up to me yet? I've only run a few dozen per build, though.

@sempervictus
Contributor Author

Running with that and this now; seems to work well, 340k context and going. 4500 vs 5500 T/S prefill doesn't hurt much, but the accuracy improvement is huge.

sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request Feb 16, 2026
This commit implements llguidance v1.5 integration to enable
constrained decoding (JSON schema, regex, Lark grammar) for LLM
output generation. This provides strict validation of tool call
arguments and structured outputs.

Tool Call Enhancements (Independent of llguidance):
- Updated StreamToolParser to validate tool calls before
accepting them
- Added per-tool validation against available tools in
resolved_tools
- Implemented JSON schema validation for tool call arguments
- Added strict filtering: only accept tool calls matching
available tool names
- Enhanced error handling with detailed logging for rejected or
invalid tool calls
- Fixed partial start tag detection for text-mode tool calls
- Improved reasoning block detection to avoid false positives in
code blocks

llguidance Integration (Reinforces Tool Call Validation):
- Added ParserFactory building once per model load from tokenizer
- Implemented per-sequence Matcher for FSM state management
- Added logit masking via compute_mask() to prevent invalid tokens
at sampling time
- Added token commitment via commit_token() to advance FSM after
valid tokens
- Added rollback() support for future speculative decoding work
- Supports JSON Schema, Regex, and Lark grammar constraints
- Graceful degradation: no constraints when factory building fails

Architecture:
- Two-tier: Global ParserFactory (shared) + Per-sequence Matcher
- llguidance masks logits BEFORE sampling (prevents invalid tokens)
- Tool call validation happens AFTER sampling (filters outputs)
- Pre-sampling + post-sampling validation = strict tool calls

Key Files:
- src/utils/guidance.rs:
 - GuidanceState, build_llg_factory, llg_grammar_from_constraint
- src/utils/config.rs:
 - Constraint enum with JsonSchema, Regex, Lark variants
- src/core/runner.rs:
 - ModelRunner integration with llg_factory and guidance_states
- src/server/parser.rs:
 - StreamToolParser improvements for partial tag detection
- src/tools/parser.rs:
 - has_tool_calls() now detects both XML and raw JSON formats
- src/tools/schema.rs:
 - Added validate_arguments() for strict schema validation

Performance:
- Cold start:
[Seq 0] ⏱️ Prompt: 320085 tokens in 80.15s (3993.62 t/s)
[Seq 0] ⏱️ Decoded: 2571 tokens in 80.12s (32.09 t/s)
- Warm operation:
[Seq 3] ⏱️ Prompt: 10549 tokens in 2.31s (4564.69 t/s)
[Seq 3] ⏱️ Decoded: 182 tokens in 5.09s (35.76 t/s)
- No difference whatsoever on SM120 host from the metrics seen in
the high-precision PR for attention-rs which is also running with
this commit (guoqingbao/attention.rs#30)

Accuracy:
- Tool calls still seem to work correctly

Future Work:
- Speculative decoding integration
 - rollback() method ready, needs library support
- Python API exposure for structured outputs (currently Rust-only)
- Performance optimization for large batch sizes

This enables strict validation of tool call arguments and
structured outputs, preventing invalid JSON, malformed tool calls,
and out-of-schema outputs.
@sempervictus
Contributor Author

... Seq 202 - chunk prefill finished (500704 tokens), and that was before it rewrote src/server and src/util in vllm.rs from memory, having read those files sometime around the 50k-token size of the cache. That's a lot more time saved than 1k t/s in prefill would get me 😁

@sempervictus
Contributor Author

... "and going, and going" ...

2026-02-17T06:43:12.203519Z  INFO vllm_rs::core::block_manager: Prefix cache insert seq 24 (1436779 tokens, 22449 blocks)
2026-02-17T06:43:12.206566Z  WARN vllm_rs::server::server: --- Performance Metrics ---
2026-02-17T06:43:12.206584Z  INFO vllm_rs::server::server: [Seq 24] ⏱️ Prompt: 1430072 tokens in 0.14s (10070929.00 t/s)
2026-02-17T06:43:12.206591Z  INFO vllm_rs::server::server: [Seq 24] ⏱️ Decoded: 6707 tokens in 266.86s (25.13 t/s)
2026-02-17T06:43:12.211178Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 43076 blocks (2756864 tokens) free, used 34.3% (16.45GB/48.00GB); CPU swap used 0.0% (0.00GB/96.00GB)

^^ it's read a large chunk of cutlass about sm120 for this and is spitting out kernels which seem to compile at least... correct formatting, logic, dynamic tile sizing, etc.

@sempervictus
Contributor Author

sempervictus commented Feb 17, 2026

@guoqingbao - probably test on this vs #29 if you can, it's a world of difference - FA does not compare

sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request Feb 18, 2026
sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request Feb 18, 2026
@sempervictus sempervictus force-pushed the flashinfer/precision_flags branch from a2c2130 to 2e16358 Compare February 18, 2026 17:24
@sempervictus
Contributor Author

@guoqingbao - using latest flashinfer commit with this PR i'm seeing:

vllm-rs-svc0  | 2026-02-19T21:55:41.375178Z  INFO vllm_rs::server::server: [Seq 1] ⏱️ Prompt: 116747 tokens in 22.23s (5251.07 t/s)
vllm-rs-svc0  | 2026-02-19T21:55:41.375189Z  INFO vllm_rs::server::server: [Seq 1] ⏱️ Decoded: 567 tokens in 13.87s (40.87 t/s)

on 4xSM120 with the 80B coder

sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request Feb 20, 2026
@sempervictus
Contributor Author

@guoqingbao - so this is neat, the 0.6.4 flashinfer handles concurrent prefill pretty well:

vllm-rs-svc0  | 2026-02-20T10:05:48.090050Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 8192 (remain 915137 tokens)
vllm-rs-svc0  | 2026-02-20T10:05:50.705752Z  INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 606208 (remain 314283 tokens)
vllm-rs-svc0  | 2026-02-20T10:05:52.230881Z  INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 49152 (remain 76387 tokens)
vllm-rs-svc0  | 2026-02-20T10:05:53.701617Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 16384 (remain 906945 tokens)
vllm-rs-svc0  | 2026-02-20T10:05:56.342359Z  INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 614400 (remain 306091 tokens)
vllm-rs-svc0  | 2026-02-20T10:05:57.883521Z  INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 57344 (remain 68195 tokens)
vllm-rs-svc0  | 2026-02-20T10:05:59.366263Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 24576 (remain 898753 tokens)
vllm-rs-svc0  | 2026-02-20T10:06:02.018052Z  INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 622592 (remain 297899 tokens)
vllm-rs-svc0  | 2026-02-20T10:06:03.569144Z  INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 65536 (remain 60003 tokens)
vllm-rs-svc0  | 2026-02-20T10:06:05.067563Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 32768 (remain 890561 tokens)
vllm-rs-svc0  | 2026-02-20T10:06:07.739425Z  INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 630784 (remain 289707 tokens)
vllm-rs-svc0  | 2026-02-20T10:06:09.309375Z  INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 73728 (remain 51811 tokens)

could you please merge this? I've put >10M tokens through it and it works up to the ~2M I can fit in VRAM with rather amazing precision. Compared to the fury of having to restart sessions when the model beats itself into a hole, I think that whatever small cost we pay initially in throughput is a tiny fraction of the time we gain from accurate inference (esp. in long ctx)

@sempervictus
Contributor Author

sempervictus commented Feb 20, 2026

Decoding (FP8) is definitely still the Achilles heel, but having learned how the bloody scheduler/sequences work (very reminiscent of noria), I now think I have some ideas for balancing prefill/decode when this happens:

vllm-rs-svc0  | 2026-02-20T10:09:48.604676Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 499712 (remain 423617 tokens)
vllm-rs-svc0  | 2026-02-20T10:09:48.638055Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [1]], avg. 0 tokens/s per request (total: 0 tokens/s)
vllm-rs-svc0  | 2026-02-20T10:09:51.079400Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 507904 (remain 415425 tokens)
vllm-rs-svc0  | 2026-02-20T10:09:53.563517Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 516096 (remain 407233 tokens)
vllm-rs-svc0  | 2026-02-20T10:09:53.596895Z  INFO vllm_rs::server::parser: Stream parsing: [ToolCallItem { tool_index: 0, name: Some("switch_mode"), parameters: "" }]
vllm-rs-svc0  | 2026-02-20T10:09:56.048797Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 524288 (remain 399041 tokens)
vllm-rs-svc0  | 2026-02-20T10:09:56.082209Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [1]], avg. 0 tokens/s per request (total: 0 tokens/s)
vllm-rs-svc0  | 2026-02-20T10:09:58.564154Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 532480 (remain 390849 tokens)
vllm-rs-svc0  | 2026-02-20T10:10:01.088371Z  INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 540672 (remain 382657 tokens)
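The balancing idea could be sketched as a per-step token-budget split that serves decode sequences before spending the remainder on the next prefill chunk (names and policy here are hypothetical, not vllm.rs's actual scheduler API):

```rust
// Hypothetical scheduler sketch: each engine step has a token budget.
// Decode sequences get their single token first so they never starve
// (avoiding the "avg. 0 tokens/s" stalls in the log above); whatever
// budget remains goes to the next prefill chunk.
fn split_budget(step_budget: usize, decode_seqs: usize, prefill_backlog: usize) -> (usize, usize) {
    let decode_tokens = decode_seqs.min(step_budget); // 1 token per decoding seq
    let prefill_tokens = (step_budget - decode_tokens).min(prefill_backlog);
    (decode_tokens, prefill_tokens)
}

fn main() {
    // 8192-token chunks with one active decode seq, as in the log above.
    assert_eq!(split_budget(8192, 1, 423_617), (1, 8191));
    // Decode-heavy step: all decodes are served before any prefill.
    assert_eq!(split_budget(64, 64, 1_000), (64, 0));
    println!("ok");
}
```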

@guoqingbao
Owner

so this is neat, the 0.6.4 flashinfer handles concurrent prefill pretty well:

I think your revision applies the compiler flags to other kernels as well, not only the flashinfer kernel itself.

@guoqingbao
Owner

could you please merge this? I've put >10M tokens through it and it works up to the ~2M i can fit in VRAM with rather amazing precision.

I will test this PR shortly.

@guoqingbao
Owner

Decoding (FP8) is definitely still the Achilles heel, but having learned how the bloody scheduler/sequences work (very reminiscent of noria), I now think I have some ideas for balancing prefill/decode when this happens:

I will improve the fp8 decoding speed next week.

@sempervictus
Contributor Author

could you please merge this? I've put >10M tokens through it and it works up to the ~2M i can fit in VRAM with rather amazing precision.

I will test this PR shortly.

It's been a hell of an eye-opener for me about model quality. Using this, you can YaRN a tiny model to absurd context lengths and it still works. Whoever invented -ffast-math is responsible for more energy wasted on bad inference loops than was spent during WW2.

@sempervictus
Contributor Author

@guoqingbao should we be seeing higher mamba occupancy than

vllm-rs-svc0  | 2026-02-21T00:55:54.787146Z  INFO vllm_rs::server::server: [Seq 296] ⏱️ Prompt: 370690 tokens in 6.65s (55709.35 t/s)
vllm-rs-svc0  | 2026-02-21T00:55:54.787156Z  INFO vllm_rs::server::server: [Seq 296] ⏱️ Decoded: 1908 tokens in 50.29s (37.94 t/s)
vllm-rs-svc0  | 2026-02-21T00:55:54.795438Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 32768 blocks (2097152 tokens) free, used 50.0% (24.00GB/48.00GB); CPU swap used 0.0% (0.00GB/96.00GB)
vllm-rs-svc0  | 2026-02-21T00:55:54.795454Z  INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 379 slots used (0.3%), approx 0.02GB/6.82GB (slot 18.42MB)

@guoqingbao
Owner

@guoqingbao should we be seeing higher mamba occupancy than

vllm-rs-svc0  | 2026-02-21T00:55:54.787146Z  INFO vllm_rs::server::server: [Seq 296] ⏱️ Prompt: 370690 tokens in 6.65s (55709.35 t/s)
vllm-rs-svc0  | 2026-02-21T00:55:54.787156Z  INFO vllm_rs::server::server: [Seq 296] ⏱️ Decoded: 1908 tokens in 50.29s (37.94 t/s)
vllm-rs-svc0  | 2026-02-21T00:55:54.795438Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 32768 blocks (2097152 tokens) free, used 50.0% (24.00GB/48.00GB); CPU swap used 0.0% (0.00GB/96.00GB)
vllm-rs-svc0  | 2026-02-21T00:55:54.795454Z  INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 379 slots used (0.3%), approx 0.02GB/6.82GB (slot 18.42MB)

No, the number of mamba slots used depends only on the number of concurrent requests.

@guoqingbao
Owner


That said, it supports 379 concurrent requests at the current config.
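As a quick sanity check, the reported pool size is consistent with the per-slot size times the slot count (rough arithmetic on the log above):

```rust
// Rough arithmetic check of the scheduler log: 379 slots at 18.42 MB
// per slot should come to approximately the reported 6.82 GB pool.
fn main() {
    let slots = 379.0_f64;
    let slot_mb = 18.42_f64;
    let total_gb = slots * slot_mb / 1024.0;
    assert!((total_gb - 6.82).abs() < 0.01);
    println!("{total_gb:.2} GB"); // ≈ 6.82 GB, matching the log
}
```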

@sempervictus
Contributor Author

Good point. Once we figure out concurrent paged-attention swap-out, we'll be able to run 370 tiny demons! 😛

Currently getting this on the 4xSM120 with this PR and flashinfer 0.6.4:

vllm-rs-svc0  | 2026-02-21T03:35:08.354563Z  INFO vllm_rs::core::block_manager: Prefix cache miss seq 476 (94162 tokens, 32768 cached blocks, raw_match=142 blocks)
vllm-rs-svc0  | 2026-02-21T03:35:09.854756Z  INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0  | 2026-02-21T03:35:09.854770Z  WARN vllm_rs::core::runner: Using sampling from generation_config: temp=Some(0.5), top_k=Some(20), top_p=Some(0.95), freq_penalty=Some(1.2), pres_penalty=Some(1.2)
vllm-rs-svc0  | 2026-02-21T03:35:09.869283Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 8192 (remain 85970 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:11.343483Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 16384 (remain 77778 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:12.832627Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 24576 (remain 69586 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:14.339742Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 32768 (remain 61394 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:15.850428Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 40960 (remain 53202 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:17.376420Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 49152 (remain 45010 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:18.914484Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 57344 (remain 36818 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:20.470355Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 65536 (remain 28626 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:22.039551Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 73728 (remain 20434 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:23.623046Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 81920 (remain 12242 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:25.223287Z  INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 90112 (remain 4050 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:26.052102Z  WARN vllm_rs::core::scheduler: Seq 476 - chunk prefill finished (94162 tokens)
vllm-rs-svc0  | 2026-02-21T03:35:26.052112Z  INFO vllm_rs::core::engine: Prefilling [seq_id 476]: 94163 tokens in 17.71s (5316.04 tokens/s, cache included)
vllm-rs-svc0  | 2026-02-21T03:35:26.052222Z  INFO vllm_rs::server::parser: Tool call <tool_call> (151657) found, start buffering!
...
vllm-rs-svc0  | 2026-02-21T03:35:40.854482Z  WARN vllm_rs::server::server: --- Performance Metrics ---
vllm-rs-svc0  | 2026-02-21T03:35:40.854488Z  INFO vllm_rs::server::server: [Seq 476] ⏱️ Prompt: 94162 tokens in 17.71s (5315.98 t/s)
vllm-rs-svc0  | 2026-02-21T03:35:40.854496Z  INFO vllm_rs::server::server: [Seq 476] ⏱️ Decoded: 622 tokens in 14.80s (42.02 t/s)

@sempervictus sempervictus force-pushed the flashinfer/precision_flags branch 2 times, most recently from 7a05ca3 to aab4de7 Compare February 28, 2026 16:02
@sempervictus
Contributor Author

So... I've had various iterations of

ENV CARGO_FEATURE_FLASHINFER_COMMIT="ad94692d6b911af9498415d2faa946fbb3bba882"
ENV CARGO_FEATURE_FLASHINFER_REPO="https://github.com/flashinfer-ai/flashinfer"

in my Dockerfiles and this branch in my Cargo.tomls for a few weeks now, and I can't detect any adverse deviation from main at this point. Newer FI is now prefilling at ~5K T/S for the FP8 80B on 4xSM120 and decoding at ~50 (per sequence; it looks like running 3 sequences gets us ~130 T/S, so there's likely some optimization room there as well). What this has actually allowed me to do, however, is run coder agents out to ~2M context window, and even that is only limited by my VRAM (FP8 kvcache acceleration seems to be something FI can do, at least on vllm).

Any chance you might be inclined to merge this, or would you prefer to just merge #29 and have me try to build a proper --feature cuda_precise inheritance chain from vllm.rs down to cudaforge?

@guoqingbao
Owner

Newer FI is now prefilling at ~5K T/S for the FP8 80B on 4xSM120 and decoding at ~50 (per sequence; running 3 sequences gets us ~130 T/S

These optimizations are NOT coming from the flashinfer update; instead, the current main recently improved the MoE dispatch logic in attention.rs, which is why you observe the performance improvements. If you use the official flashinfer repo, you will likely be unable to run other types of models, especially when head/kv_head is not within [1,2,4,8].

@sempervictus
Contributor Author

No argument there: if you peek at the forks of your FI repo, you'll notice I have one for just that reason. I've found it faster to just use upstream lately while working with the q3n80b-fp8, since I'm not quantizing anything and only running 4-way.

That said, their velocity does require a lot of change tracking, and now they have a PR up for sm120f too. It would be grand if they would actually take your delta into their code under a feature flag or something (I get not wanting to compile a million kernels); then we COULD really track off them and just pass a flag or env var to compile the additional layouts for this project.

@sempervictus
Contributor Author

They actually have a great link re: sm120a/sm120f in that thread...

@sempervictus
Contributor Author

sempervictus commented Feb 28, 2026

Speaking of your change to FI, there are merge conflicts now:

 103 <<<<<<< HEAD
 104     if (
 105         is_sm100a_supported(device)
 106         or is_sm110a_supported(device)
 107         or is_sm12x_supported(device)
 108     ):
 109 =======
 110     # SM12x (RTX 5090, DGX Spark) cannot use CUTLASS FMHA SM100 (requires tcgen05).
 111     # SM12x should use FA2 JIT kernels or fmha_v2 HMMA kernels instead.
 112     if is_sm100a_supported(device) or is_sm110a_supported(device):
 113 >>>>>>> 03eaa20bc (feat: SM120 standard attention via fmha_v2 + CuTe DSL backend + SM12x fixes)
 114         return gen_fmha_cutlass_sm100a_module(

We don't use that high-level code; we use their headers and write a custom dispatcher in a .cu file. Full attention is not the bottleneck for Qwen3-Next and Qwen3.5, and we also didn't use the FP8 path for attention because it would sacrifice accuracy. If you use FlashAttention, it can generate similar performance. One advantage of flashinfer over FlashAttention is the FP8 KV cache; unfortunately, that doesn't work at the current stage. Although I have finished the dispatcher in attention.rs, it simply does not work on SM90 and SM120, which have poor support for blockwise FP8 scales.

In short, the current bottleneck is the MoE GEMV FP8 kernel. I think I can write a new FP8 GEMV kernel that avoids dequantizing the FP8 weights, which should push decoding speed to 100 tokens/s per request, instead of the current 50.

@sempervictus
Contributor Author

@guoqingbao Let's see how this works ... #29 for the win here to test, and it might not be a terrible idea to do the same for cutlass

@guoqingbao
Owner

@guoqingbao Let's see how this works ... #29 for the win here to test, and it might not be a terrible idea to do the same for cutlass

Another reason we pin to a specific version is that we need to make sure flashinfer and the other cutlass kernels in attention.rs can be compiled successfully on different platforms, not just SM120. The current cutlass kernels use specific features that older or newer cutlass versions are not compatible with, and you need to make sure flashinfer uses that cutlass version correctly on each platform; otherwise we would need to manage different cutlass versions in a single cubin, which can cause conflicts.

@sempervictus
Contributor Author

sempervictus commented Feb 28, 2026

@guoqingbao Right, but doing that in the source trees directly is a process, to put it mildly, and no amount of AI bots can test on all the hardware we need to support. I think that if we expose these env vars for both cutlass and FI, it would enable adoption testing by users via PRs/issues; they could quickly change their target commits or repos as needed to help localize/bisect relevant changes.

That said, I'm testing the same build from SM120 on SM89 every once in a while, with a parallel one for SM70 (no FI, obviously). BTW, I am pretty sure that --feature cutlass is what's breaking FP8 on the V100s right now.

@guoqingbao
Owner

These three flags you added change nothing because those are the compiler's default values (we already removed fast math, so you don't need to pass these defaults explicitly):

.arg("-ftz=false")
.arg("-prec-div=true")
.arg("-prec-sqrt=true")

as for

.arg("-fmad=false")

Here is the ChatGPT reply:

If you set -fmad=false:

This can be a real slowdown if your kernel is heavy on a*b+c patterns (which is common in GEMM-like or attention math).

Worst case, you effectively turn many FMAs into separate mul+add, increasing instruction count and potentially register pressure.
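For what it's worth, the single- vs double-rounding difference that -fmad controls is easy to demonstrate on the host with Rust's `f32::mul_add` (a sketch; the GPU's fmad contraction behaves analogously, and Rust itself never contracts floating-point expressions):

```rust
fn main() {
    // x = 1 + 2^-22, exactly representable in f32.
    let x = 1.0f32 + 2.0 * f32::EPSILON;

    // Two roundings: x*x rounds away the low-order 2^-44 term of the
    // exact square before the subtraction happens.
    let separate = x * x - 1.0;

    // One rounding: the fused multiply-add keeps that term.
    let fused = x.mul_add(x, -1.0);

    assert_ne!(separate, fused); // the results differ in the last bits
    println!("separate = {separate:e}, fused = {fused:e}");
}
```

With -fmad=true the compiler is free to pick either form expression by expression, which is the rounding nondeterminism this PR is trying to pin down.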

@sempervictus
Contributor Author

ChatGPT isn't considering the fact that we don't control the whole source tree, or at least audit it for these flags: upstream projects may set the countermanded flags themselves today, or at some later point. When we got a taste of fast math here for a few days it was bad, and that sort of regression has to be guarded against if nothing else. Moreover, I'm seeing >5500 T/S in prefill with this and #31 while running a 3M context window (draft-model work is being handled in two of those), for which we need precision. I've got a way to measure deviation in the returned logits, but it needs baselines. It should help with prefix-cached computation going sour at later sequences, but I haven't gotten it to the point of being able to say, in this codebase, "confidence in a specific response drops from 90% to 30% with 2M tokens between the measured seq positions".

@guoqingbao
Owner

ChatGPT isn't considering the fact that we don't control the whole source tree, or at least audit it for these flags: upstream projects may set the countermanded flags themselves today, or at some later point. When we got a taste of fast math here for a few days it was bad, and that sort of regression has to be guarded against if nothing else.

We don't use fast math anywhere across the project, including vllm.rs, attention.rs and candle. The only difference between this PR and main is -fmad=false, which is not recommended by NVIDIA and is also not used in popular inference frameworks like vLLM, llama.cpp and SGLang.

@sempervictus sempervictus force-pushed the flashinfer/precision_flags branch 2 times, most recently from 6d44609 to 683e984 Compare March 19, 2026 08:54
@sempervictus sempervictus force-pushed the flashinfer/precision_flags branch 2 times, most recently from a1f74cb to e9104a1 Compare March 21, 2026 14:51
@sempervictus
Contributor Author

So far I've had excellent results with this PR, -fmad=false and all: I see no contraindication on SM120 or SM121, and I am enjoying multi-million-token context windows on occasion.

That said, it's not your code I'm worried about. For example:

cutlass(main)$ grep -ri fastmath | wc -l
110

Most of that is in one file, but a fair deal of it is not. That's just one dep, and flashinfer isn't exactly simple either. The idea is to gate that at compilation time with explicit negations.

@guoqingbao
Owner

guoqingbao commented Mar 25, 2026

So far I've had excellent results with this PR, -fmad=false and all: I see no contraindication on SM120 or SM121, and I am enjoying multi-million-token context windows on occasion.

I think we can add an env var for using "-fmad=false" (off by default). Are you able to revise the PR to support this?

@sempervictus
Contributor Author

Of course. Would it make sense to move this into a Cargo --feature in cudaforge itself, so we ensure all compilation passing through the pipeline hits the precision flags?

Interesting datapoint: I currently have the SM121 running without this at a 6x scaling factor, and the scaling adjustments we merged upstream seem to help quite a bit. With this PR the SM120 box can hit 3M tokens on-task (it's turning open-coreui into an engineering workbench for us; lots of reading involved, and several languages of coding/testing). I have no idea how far this can actually go, but until we have "infinite context" via coherent state swap I probably won't find out 😉.

I am fairly surprised at how much better long context is with the scaling PR even without this one, but with this one we're competitive with commercial-platform accuracy at long context on a few (pricey, but still) PCI cards.

@guoqingbao
Owner

Of course. Would it make sense to move this into a Cargo --feature in cudaforge itself, so we ensure all compilation passing through the pipeline hits the precision flags?

We can support additional compiler flags via an env var.

By the way, I'm fixing a decoding cache-mismatch issue; once it's solved, prefix caching and tool calling can be speedy.

@sempervictus sempervictus force-pushed the flashinfer/precision_flags branch from e9104a1 to 0f34892 Compare April 1, 2026 03:31
RageLtMan added 2 commits April 21, 2026 22:48
Hardcoded repository and commit sources make testing require an
additional step out-of-tree wherein the library must be updated
and consumers pointed to the updated branch in order to test FI
changes.

Allow use of env vars to override the defaults, with safe fallback
to the prior settings: CARGO_FEATURE_FLASHINFER_REPO for the repo URL
and CARGO_FEATURE_FLASHINFER_COMMIT for the commit hash.
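A minimal sketch of that override logic as it might look in a build script (the default strings below are placeholders, not the crate's real pins):

```rust
use std::env;

// Placeholders: the real build script pins its own repo URL and commit.
const DEFAULT_REPO: &str = "https://example.com/pinned/flashinfer";
const DEFAULT_COMMIT: &str = "pinned-commit-hash";

/// Resolve the flashinfer source, preferring the env-var overrides
/// and falling back safely to the pinned defaults.
fn flashinfer_source() -> (String, String) {
    let repo = env::var("CARGO_FEATURE_FLASHINFER_REPO")
        .unwrap_or_else(|_| DEFAULT_REPO.to_string());
    let commit = env::var("CARGO_FEATURE_FLASHINFER_COMMIT")
        .unwrap_or_else(|_| DEFAULT_COMMIT.to_string());
    (repo, commit)
}

fn main() {
    let (repo, commit) = flashinfer_source();
    println!("fetching {repo} @ {commit}");
}
```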
Guard against deterioration of precision in attention generation
in flashinfer/cutlass by way of fast-math-style optimizations
resulting in iterative "dumbing down" of the values relevant to
accurate sampling and final decode:

1. Disable FMA rounding non-determinism
2. Preserve subnormals (critical for attention)
3. Enable precise division
4. Enable precise square-root
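The four guards above map onto nvcc flags roughly as follows; this is an illustrative sketch of how a build script might assemble them (the Command plumbing is hypothetical, but the flag spellings are nvcc's documented ones):

```rust
use std::process::Command;

/// Sketch: an nvcc invocation carrying the four precision guards.
fn precise_nvcc() -> Command {
    let mut nvcc = Command::new("nvcc");
    nvcc.arg("-fmad=false")      // 1. no FMA contraction -> deterministic rounding
        .arg("-ftz=false")       // 2. keep subnormals instead of flushing to zero
        .arg("-prec-div=true")   // 3. IEEE-rounded division
        .arg("-prec-sqrt=true"); // 4. IEEE-rounded square root
    nvcc
}

fn main() {
    let cmd = precise_nvcc();
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    println!("{args:?}");
}
```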
@sempervictus sempervictus force-pushed the flashinfer/precision_flags branch from 0f34892 to 65acf39 Compare April 22, 2026 02:48