Enable precision flags for flashinfer compilation #30
sempervictus wants to merge 2 commits into guoqingbao:main from
Conversation
Force-pushed: 4386393 to a2c2130
|
@guoqingbao - some not-so-great news on this front:
I don't know if upstream is overly aggressive with their kernel optimizations, something "leaked in" during the import, or if the padding thing you mentioned in FP8 is amplifying the skew/performance hit (I'm guessing "yes" for the latter). This PR appears to prevent the looping/insanity we see with analysis of dense JSON/patterned code/etc, but it does seem to take a bite out of performance for that efficacy (personally I'd rather go slower and correct than fast and into the wall). |
|
Also, having these flags enabled appears to trip:
vllm-rs-svc0 | 2026-02-15T22:30:37.151593Z WARN vllm_rs::core::runner: Using sampling from generation_config: temp=Some(0.5), top_k=Some(20), top_p=Some(0.95), freq_penalty=Some(1.2), pres_penalty=Some(1.2)
vllm-rs-svc0 | 2026-02-15T22:30:37.170915Z INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 8192 (remain 259023 tokens)
vllm-rs-svc0 | 2026-02-15T22:30:38.949701Z INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 16384 (remain 250831 tokens)
vllm-rs-svc0 | 2026-02-15T22:30:40.741954Z INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 24576 (remain 242639 tokens)
vllm-rs-svc0 | 2026-02-15T22:30:42.548652Z INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 32768 (remain 234447 tokens)
vllm-rs-svc0 | 2026-02-15T22:30:44.362205Z INFO vllm_rs::core::scheduler: Seq 36 - chunk prefilled 40960 (remain 226255 tokens)
vllm-rs-svc0 | terminate called after throwing an instance of 'thrust::THRUST_300001_SM_1200_NS::system::system_error'
vllm-rs-svc0 | what(): inclusive_scan failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

That's on occasion, using your git repo and the v0.6.2 commit hash (the defaults). I've run into this a few times over the last couple of days anyway, so I know this isn't causal, but it might exacerbate the issue to the point of tracking it down. Testing right now on a newer |
|
Yeah, this is seriously impactful for production work. One of the problems I often run into with long-context conversations is "momentum" of the KVCache causing the conversation to collapse back into pathological patterns, especially with code concerns (including malformed indents and the like). It's a sampling/decode-collapse sort of effect where the softmax ends up occupying some 20% of the range, apparently caused by some of the things this PR avoids. Currently (with this and current upstream FI instead of 0.6.2), I am able to steer a 300k+ ctx conversation like one would with a person: no emphasis, prompt abuse, etc. It's behaving a lot more like the |
|
We don't have mamba cache eviction in the current main; have you used this PR, guoqingbao/vllm.rs#231? |
I think they used aggressive optimizations for optimal performance (in terms of speed) and tested under limited context lengths, e.g., 128k or up to 256k, but not 1 million or more. I suggest we make the compiler flags you mentioned env-guided: if, for example, HIGH_PRECISION_KERNEL is enabled, we compile against the highest precision. |
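If that env-guided branch is adopted, a build.rs gate could look roughly like the sketch below. This is only an illustration: HIGH_PRECISION_KERNEL is the name suggested above, the flags are nvcc's standard precision switches, and how they actually reach nvcc depends on the crate's existing build plumbing.

```rust
// Hypothetical build.rs fragment; HIGH_PRECISION_KERNEL is the env var suggested above.
fn precision_nvcc_flags() -> Vec<&'static str> {
    // Only emit the strict-precision flags when the user opts in.
    let enabled = std::env::var("HIGH_PRECISION_KERNEL")
        .map(|v| v == "1")
        .unwrap_or(false);
    if enabled {
        vec![
            "--fmad=false",     // no FMA contraction
            "--ftz=false",      // keep subnormals
            "--prec-div=true",  // IEEE-compliant division
            "--prec-sqrt=true", // IEEE-compliant square root
        ]
    } else {
        Vec::new() // compiler defaults (fast path)
    }
}

fn main() {
    // Re-run the build script when the toggle changes.
    println!("cargo:rerun-if-env-changed=HIGH_PRECISION_KERNEL");
    for flag in precision_nvcc_flags() {
        // Hypothetical: forward each flag to whatever nvcc builder the crate uses.
        println!("cargo:warning=extra nvcc flag: {flag}");
    }
}
```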
|
Feature precise? ;) |
Hasn't caught up to me yet? I've only run a few dozen per build though |
|
Running with that and this now - seems to work well, 340k context and going. 4500 vs 5500 T/S prefill doesn't hurt much, but the accuracy improvement is huge |
This commit implements llguidance v1.5 integration to enable constrained decoding (JSON schema, regex, Lark grammar) for LLM output generation. This provides strict validation of tool call arguments and structured outputs.

Tool Call Enhancements (independent of llguidance):
- Updated StreamToolParser to validate tool calls before accepting them
- Added per-tool validation against available tools in resolved_tools
- Implemented JSON schema validation for tool call arguments
- Added strict filtering: only accept tool calls matching available tool names
- Enhanced error handling with detailed logging for rejected or invalid tool calls
- Fixed partial start tag detection for text-mode tool calls
- Improved reasoning block detection to avoid false positives in code blocks

llguidance Integration (reinforces tool call validation):
- Added ParserFactory built once per model load from the tokenizer
- Implemented per-sequence Matcher for FSM state management
- Added logit masking via compute_mask() to prevent invalid tokens at sampling time
- Added token commitment via commit_token() to advance the FSM after valid tokens
- Added rollback() support for future speculative decoding work
- Supports JSON Schema, Regex, and Lark grammar constraints
- Graceful degradation: no constraints when factory building fails

Architecture:
- Two-tier: global ParserFactory (shared) + per-sequence Matcher
- llguidance masks logits BEFORE sampling (prevents invalid tokens)
- Tool call validation happens AFTER sampling (filters outputs)
- Pre-sampling + post-sampling validation = strict tool calls

Key Files:
- src/utils/guidance.rs: GuidanceState, build_llg_factory, llg_grammar_from_constraint
- src/utils/config.rs: Constraint enum with JsonSchema, Regex, Lark variants
- src/core/runner.rs: ModelRunner integration with llg_factory and guidance_states
- src/server/parser.rs: StreamToolParser improvements for partial tag detection
- src/tools/parser.rs: has_tool_calls() now detects both XML and raw JSON formats
- src/tools/schema.rs: added validate_arguments() for strict schema validation

Performance:
- Cold start: [Seq 0] ⏱️ Prompt: 320085 tokens in 80.15s (3993.62 t/s); [Seq 0] ⏱️ Decoded: 2571 tokens in 80.12s (32.09 t/s)
- Warm operation: [Seq 3] ⏱️ Prompt: 10549 tokens in 2.31s (4564.69 t/s); [Seq 3] ⏱️ Decoded: 182 tokens in 5.09s (35.76 t/s)
- No difference whatsoever on the SM120 host from the metrics seen in the high-precision PR for attention-rs, which is also running with this commit (guoqingbao/attention.rs#30)

Accuracy:
- Tool calls still seem to work correctly

Future Work:
- Speculative decoding integration - rollback() method ready, needs library support
- Python API exposure for structured outputs (currently Rust-only)
- Performance optimization for large batch sizes

This enables strict validation of tool call arguments and structured outputs, preventing invalid JSON, malformed tool calls, and out-of-schema outputs.
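For orientation, here is a minimal Rust sketch of the mask-then-sample-then-commit flow the commit message describes. ConstraintMatcher is a stand-in trait mirroring the method names above (compute_mask / commit_token); the real code would call the per-sequence llguidance Matcher, so treat this as an illustration rather than the actual API.

```rust
// Stand-in trait mirroring the methods named in the commit message.
trait ConstraintMatcher {
    /// Mask over the vocabulary: true = token currently allowed by the grammar.
    fn compute_mask(&mut self) -> Vec<bool>;
    /// Advance the FSM after a token has been sampled and accepted.
    fn commit_token(&mut self, token: u32);
}

/// Mask logits before sampling so grammar-invalid tokens can never be chosen.
fn apply_grammar_mask(logits: &mut [f32], matcher: &mut dyn ConstraintMatcher) {
    let mask = matcher.compute_mask();
    for (logit, allowed) in logits.iter_mut().zip(mask) {
        if !allowed {
            *logit = f32::NEG_INFINITY; // excluded from softmax / sampling
        }
    }
}

/// One decode step: mask -> sample -> commit. Tool-call JSON validation still
/// happens after sampling, as described above.
fn decode_step(
    logits: &mut [f32],
    matcher: &mut dyn ConstraintMatcher,
    sample: impl Fn(&[f32]) -> u32,
) -> u32 {
    apply_grammar_mask(logits, matcher);
    let token = sample(logits);
    matcher.commit_token(token);
    token
}
```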
|
... |
|
... "and going, and going" ... 2026-02-17T06:43:12.203519Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 24 (1436779 tokens, 22449 blocks)
2026-02-17T06:43:12.206566Z WARN vllm_rs::server::server: --- Performance Metrics ---
2026-02-17T06:43:12.206584Z INFO vllm_rs::server::server: [Seq 24] ⏱️ Prompt: 1430072 tokens in 0.14s (10070929.00 t/s)
2026-02-17T06:43:12.206591Z INFO vllm_rs::server::server: [Seq 24] ⏱️ Decoded: 6707 tokens in 266.86s (25.13 t/s)
2026-02-17T06:43:12.211178Z INFO vllm_rs::core::scheduler: GPU Kvcache: 43076 blocks (2756864 tokens) free, used 34.3% (16.45GB/48.00GB); CPU swap used 0.0% (0.00GB/96.00GB)

^^ it's read a large chunk of cutlass about sm120 for this and is spitting out kernels which seem to compile at least... correct formatting, logic, dynamic tile sizing, etc. |
|
@guoqingbao - probably test on this vs #29 if you can, it's a world of difference - FA does not compare |
Force-pushed: a2c2130 to 2e16358
|
@guoqingbao - using the latest flashinfer commit with this PR, I'm seeing:
vllm-rs-svc0 | 2026-02-19T21:55:41.375178Z INFO vllm_rs::server::server: [Seq 1] ⏱️ Prompt: 116747 tokens in 22.23s (5251.07 t/s)
vllm-rs-svc0 | 2026-02-19T21:55:41.375189Z INFO vllm_rs::server::server: [Seq 1] ⏱️ Decoded: 567 tokens in 13.87s (40.87 t/s)
That's on 4xSM120 with the 80B coder |
This commit implements llguidance v1.5 integration to enable constrained decoding (JSON schema, regex, Lark grammar) for LLM output generation. This provides strict validation of tool call arguments and structured outputs. Tool Call Enhancements (Independent of llguidance): - Updated StreamToolParser to validate tool calls before accepting them - Added per-tool validation against available tools in resolved_tools - Implemented JSON schema validation for tool call arguments - Added strict filtering: only accept tool calls matching available tool names - Enhanced error handling with detailed logging for rejected or invalid tool calls - Fixed partial start tag detection for text-mode tool calls - Improved reasoning block detection to avoid false positives in code blocks llguidance Integration (Reinforces Tool Call Validation): - Added ParserFactory building once per model load from tokenizer - Implemented per-sequence Matcher for FSM state management - Added logit masking via compute_mask() to prevent invalid tokens at sampling time - Added token commitment via commit_token() to advance FSM after valid tokens - Added rollback() support for future speculative decoding work - Supports JSON Schema, Regex, and Lark grammar constraints - Graceful degradation: no constraints when factory building fails Architecture: - Two-tier: Global ParserFactory (shared) + Per-sequence Matcher - llguidance masks logits BEFORE sampling (prevents invalid tokens) - Tool call validation happens AFTER sampling (filters outputs) - Ppre-sampling + post-sampling validation = strict tool call Key Files: - src/utils/guidance.rs: - GuidanceState, build_llg_factory, llg_grammar_from_constraint - src/utils/config.rs: - Constraint enum with JsonSchema, Regex, Lark variants - src/core/runner.rs: - ModelRunner integration with llg_factory and guidance_states - src/server/parser.rs: - StreamToolParser improvements for partial tag detection - src/tools/parser.rs: - has_tool_calls() now detects both XML and raw JSON formats - src/tools/schema.rs: - Added validate_arguments() for strict schema validation Performance: - Cold start: [Seq 0] ⏱️ Prompt: 320085 tokens in 80.15s (3993.62 t/s) [Seq 0] ⏱️ Decoded: 2571 tokens in 80.12s (32.09 t/s) - Warm operation: [Seq 3] ⏱️ Prompt: 10549 tokens in 2.31s (4564.69 t/s)) [Seq 3] ⏱️ Decoded: 182 tokens in 5.09s (35.76 t/s) - No difference whatsoever on SM120 host from the metrics seen in the high-precision PR for attention-rs which is also running with this commit (guoqingbao/attention.rs#30) Accuracy: - Tool calls still seem to work correctly Future Work: - Speculative decoding integration - rollback() method ready, needs library support - Python API exposure for structured outputs (currently Rust-only) - Performance optimization for large batch sizes This enables strict validation of tool call arguments and structured outputs, preventing invalid JSON, malformed tool calls, and out-of-schema outputs.
|
@guoqingbao - so this is neat, the 0.6.4:
vllm-rs-svc0 | 2026-02-20T10:05:48.090050Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 8192 (remain 915137 tokens)
vllm-rs-svc0 | 2026-02-20T10:05:50.705752Z INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 606208 (remain 314283 tokens)
vllm-rs-svc0 | 2026-02-20T10:05:52.230881Z INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 49152 (remain 76387 tokens)
vllm-rs-svc0 | 2026-02-20T10:05:53.701617Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 16384 (remain 906945 tokens)
vllm-rs-svc0 | 2026-02-20T10:05:56.342359Z INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 614400 (remain 306091 tokens)
vllm-rs-svc0 | 2026-02-20T10:05:57.883521Z INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 57344 (remain 68195 tokens)
vllm-rs-svc0 | 2026-02-20T10:05:59.366263Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 24576 (remain 898753 tokens)
vllm-rs-svc0 | 2026-02-20T10:06:02.018052Z INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 622592 (remain 297899 tokens)
vllm-rs-svc0 | 2026-02-20T10:06:03.569144Z INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 65536 (remain 60003 tokens)
vllm-rs-svc0 | 2026-02-20T10:06:05.067563Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 32768 (remain 890561 tokens)
vllm-rs-svc0 | 2026-02-20T10:06:07.739425Z INFO vllm_rs::core::scheduler: Seq 1 - chunk prefilled 630784 (remain 289707 tokens)
vllm-rs-svc0 | 2026-02-20T10:06:09.309375Z INFO vllm_rs::core::scheduler: Seq 2 - chunk prefilled 73728 (remain 51811 tokens)

Could you please merge this? I've put >10M tokens through it, and it works up to the ~2M I can fit in VRAM with rather amazing precision. Compared to the fury of having to restart sessions when the model beats itself into a hole, I think whatever small cost we pay initially in throughput is a tiny fraction of the time we gain from accurate inference (esp. in long ctx) |
|
Decoding (FP8) is definitely still the Achilles' heel, but having learned how the bloody scheduler/sequences work (very reminiscent of Noria), I now think I have some ideas for balancing prefill/decode when this happens:
vllm-rs-svc0 | 2026-02-20T10:09:48.604676Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 499712 (remain 423617 tokens)
vllm-rs-svc0 | 2026-02-20T10:09:48.638055Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [1]], avg. 0 tokens/s per request (total: 0 tokens/s)
vllm-rs-svc0 | 2026-02-20T10:09:51.079400Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 507904 (remain 415425 tokens)
vllm-rs-svc0 | 2026-02-20T10:09:53.563517Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 516096 (remain 407233 tokens)
vllm-rs-svc0 | 2026-02-20T10:09:53.596895Z INFO vllm_rs::server::parser: Stream parsing: [ToolCallItem { tool_index: 0, name: Some("switch_mode"), parameters: "" }]
vllm-rs-svc0 | 2026-02-20T10:09:56.048797Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 524288 (remain 399041 tokens)
vllm-rs-svc0 | 2026-02-20T10:09:56.082209Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [1]], avg. 0 tokens/s per request (total: 0 tokens/s)
vllm-rs-svc0 | 2026-02-20T10:09:58.564154Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 532480 (remain 390849 tokens)
vllm-rs-svc0 | 2026-02-20T10:10:01.088371Z INFO vllm_rs::core::scheduler: Seq 3 - chunk prefilled 540672 (remain 382657 tokens) |
I think your revision applies the compiler flags to other kernels as well, not only the flashinfer kernel itself. |
I will test this PR shortly. |
I will improve the fp8 decoding speed next week. |
It's been a hell of an eye-opener for me about model quality. Using this, you can YaRN a tiny model to absurd context lengths and it still works. Whoever invented -fast-math is responsible for more energy wasted on bad inference loops than was spent during WW2.
|
@guoqingbao should we be seeing higher mamba occupancy than this?
vllm-rs-svc0 | 2026-02-21T00:55:54.787146Z INFO vllm_rs::server::server: [Seq 296] ⏱️ Prompt: 370690 tokens in 6.65s (55709.35 t/s)
vllm-rs-svc0 | 2026-02-21T00:55:54.787156Z INFO vllm_rs::server::server: [Seq 296] ⏱️ Decoded: 1908 tokens in 50.29s (37.94 t/s)
vllm-rs-svc0 | 2026-02-21T00:55:54.795438Z INFO vllm_rs::core::scheduler: GPU Kvcache: 32768 blocks (2097152 tokens) free, used 50.0% (24.00GB/48.00GB); CPU swap used 0.0% (0.00GB/96.00GB)
vllm-rs-svc0 | 2026-02-21T00:55:54.795454Z INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 379 slots used (0.3%), approx 0.02GB/6.82GB (slot 18.42MB) |
No, the number of mamba slots used depends only on the number of concurrent requests. |
That said, it supports 379 concurrent requests at the current config. |
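For reference, the numbers in that log line are consistent with a flat per-slot allocation; a quick back-of-the-envelope check (assuming binary GB, as the log appears to use):

```rust
fn main() {
    let slots = 379.0_f64;
    let slot_mb = 18.42_f64;
    // 379 slots * 18.42 MB/slot ≈ 6981 MB ≈ 6.82 GB, matching the "6.82GB" ceiling above.
    println!("total mamba state ≈ {:.2} GB", slots * slot_mb / 1024.0);
    // A single occupied slot ≈ 0.02 GB, matching the "0.02GB" in-use figure.
    println!("one slot ≈ {:.2} GB", slot_mb / 1024.0);
}
```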
|
Good point. Once we figure out concurrent paged attention swap-out, we'll be able to run 370 tiny demons! 😛 Currently getting this on the 4xSM120 with this PR and ...
vllm-rs-svc0 | 2026-02-21T03:35:08.354563Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 476 (94162 tokens, 32768 cached blocks, raw_match=142 blocks)
vllm-rs-svc0 | 2026-02-21T03:35:09.854756Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0 | 2026-02-21T03:35:09.854770Z WARN vllm_rs::core::runner: Using sampling from generation_config: temp=Some(0.5), top_k=Some(20), top_p=Some(0.95), freq_penalty=Some(1.2), pres_penalty=Some(1.2)
vllm-rs-svc0 | 2026-02-21T03:35:09.869283Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 8192 (remain 85970 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:11.343483Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 16384 (remain 77778 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:12.832627Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 24576 (remain 69586 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:14.339742Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 32768 (remain 61394 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:15.850428Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 40960 (remain 53202 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:17.376420Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 49152 (remain 45010 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:18.914484Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 57344 (remain 36818 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:20.470355Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 65536 (remain 28626 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:22.039551Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 73728 (remain 20434 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:23.623046Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 81920 (remain 12242 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:25.223287Z INFO vllm_rs::core::scheduler: Seq 476 - chunk prefilled 90112 (remain 4050 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:26.052102Z WARN vllm_rs::core::scheduler: Seq 476 - chunk prefill finished (94162 tokens)
vllm-rs-svc0 | 2026-02-21T03:35:26.052112Z INFO vllm_rs::core::engine: Prefilling [seq_id 476]: 94163 tokens in 17.71s (5316.04 tokens/s, cache included)
vllm-rs-svc0 | 2026-02-21T03:35:26.052222Z INFO vllm_rs::server::parser: Tool call <tool_call> (151657) found, start buffering!
...
vllm-rs-svc0 | 2026-02-21T03:35:40.854482Z WARN vllm_rs::server::server: --- Performance Metrics ---
vllm-rs-svc0 | 2026-02-21T03:35:40.854488Z INFO vllm_rs::server::server: [Seq 476] ⏱️ Prompt: 94162 tokens in 17.71s (5315.98 t/s)
vllm-rs-svc0 | 2026-02-21T03:35:40.854496Z INFO vllm_rs::server::server: [Seq 476] ⏱️ Decoded: 622 tokens in 14.80s (42.02 t/s) |
Force-pushed: 7a05ca3 to aab4de7
|
So... I've had various iterations of
ENV CARGO_FEATURE_FLASHINFER_COMMIT="ad94692d6b911af9498415d2faa946fbb3bba882"
ENV CARGO_FEATURE_FLASHINFER_REPO="https://github.com/flashinfer-ai/flashinfer"
in my Dockerfiles and this branch in my Cargo.tomls for a few weeks now, and I can't detect any adverse deviation from ...

Any chance you might be inclined to merge this, or would you prefer to just merge #29 and have me try to build a |
These optimizations are NOT coming from the flashinfer update; instead, the current main recently improved the MoE dispatch logic in attention.rs, and that's why you can observe the performance improvements. If you use the official flashinfer repo, you will likely be unable to run other types of models, especially when head/kv_head is not within [1,2,4,8]. |
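A small illustration of the constraint being described (a hypothetical helper, not code from either repo): the upstream kernels cover query/kv head ratios of 1, 2, 4 and 8, so anything else needs the custom dispatcher or a fallback.

```rust
/// Hypothetical guard: reject GQA group sizes the upstream kernels don't cover.
const SUPPORTED_GROUP_SIZES: [usize; 4] = [1, 2, 4, 8];

fn check_gqa_group(num_heads: usize, num_kv_heads: usize) -> Result<usize, String> {
    if num_kv_heads == 0 || num_heads % num_kv_heads != 0 {
        return Err(format!(
            "invalid head config: {num_heads} heads / {num_kv_heads} kv heads"
        ));
    }
    let group = num_heads / num_kv_heads;
    if SUPPORTED_GROUP_SIZES.contains(&group) {
        Ok(group)
    } else {
        Err(format!(
            "head/kv_head ratio {} not in {:?}; needs a custom dispatcher or fallback",
            group, SUPPORTED_GROUP_SIZES
        ))
    }
}
```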
|
No argument there - if you peek at the forks on your FI repo, you'll notice I have one for just that reason. Found it faster to just use upstream lately in working with the q3n80b-fp8, since I'm not quantizing anything and only running 4-way. That said, their velocity does require a lot of change tracking, and now they have a PR up for sm120f too. Would be grand if they would actually take your delta into their code under a feature flag or something (I get not wanting to compile a million kernels) and we COULD really track off them, just passing a flag or env var to compile the additional layouts for this project |
|
They actually have a great link re sm120a/sm120f in that thread.... |
|
Speaking of your change to FI - merge conflicts now:
<<<<<<< HEAD
    if (
        is_sm100a_supported(device)
        or is_sm110a_supported(device)
        or is_sm12x_supported(device)
    ):
=======
    # SM12x (RTX 5090, DGX Spark) cannot use CUTLASS FMHA SM100 (requires tcgen05).
    # SM12x should use FA2 JIT kernels or fmha_v2 HMMA kernels instead.
    if is_sm100a_supported(device) or is_sm110a_supported(device):
>>>>>>> 03eaa20bc (feat: SM120 standard attention via fmha_v2 + CuTe DSL backend + SM12x fixes)
        return gen_fmha_cutlass_sm100a_module(

We don't use these high-level pieces; we use their headers and write a custom dispatcher in a .cu file. The full attention is not the bottleneck for qwen3 next and qwen3.5, and we also didn't use the fp8 path for attention because it would sacrifice accuracy. If you use flash attention, it can also generate similar performance. One advantage flashinfer has over flash attention is fp8 kvcache; unfortunately, this doesn't work at the current stage - although I have finished the dispatcher in attention.rs, it simply does not work on sm90 and sm120, which have poor support for blockwise fp8 scale. In short, the current bottleneck is the MoE gemv fp8 kernel. I think I'm able to write a new gemv fp8 kernel that doesn't dequantize the fp8 weights, which can push the decoding speed to 100 tokens/s per request, instead of 50 in the current form. |
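To make the "no dequant" idea concrete, here is a scalar Rust illustration (emphatically not the planned CUDA kernel, and using i8 with a per-block scale as a stand-in for fp8): each quantized weight is converted to f32 in-register while accumulating, so a dequantized copy of the weight matrix is never materialized.

```rust
// Illustration only: blockwise-scaled 8-bit gemv with in-register dequantization.
fn gemv_blockwise_quant(
    w_q: &[i8],     // quantized weights, row-major [rows x cols]
    scales: &[f32], // one scale per `block` consecutive columns of each row
    x: &[f32],      // input activations, length = cols
    rows: usize,
    cols: usize,
    block: usize,
) -> Vec<f32> {
    assert_eq!(cols % block, 0, "cols must be a multiple of the block size");
    let blocks_per_row = cols / block;
    let mut y = vec![0.0f32; rows];
    for r in 0..rows {
        let mut acc = 0.0f32; // accumulate in f32, never store dequantized weights
        for b in 0..blocks_per_row {
            let scale = scales[r * blocks_per_row + b];
            for c in (b * block)..((b + 1) * block) {
                acc += (w_q[r * cols + c] as f32) * scale * x[c];
            }
        }
        y[r] = acc;
    }
    y
}
```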
|
@guoqingbao Let's see how this works ... #29 for the win here to test, and it might not be a terrible idea to do the same for cutlass |
|
> @guoqingbao Let's see how this works ... #29 for the win here to test, and it might not be a terrible idea to do the same for cutlass

Another reason we pin to a specific version is that we need to make sure flashinfer and the other cutlass kernels in attention.rs can be compiled successfully on different platforms, not just sm120. The current cutlass kernels use specific features that older or newer cutlass versions are not compatible with, and you need to make sure flashinfer uses that cutlass version correctly on each platform; otherwise, we would need to manage different cutlass versions in a single cubin, which can cause conflicts. |
|
@guoqingbao right, but doing that in the source trees directly is a process, to put it mildly, and no amount of AI bots can test on all the hardware we need to support. I think that if we expose these env vars for both cutlass and FI, it would enable adoption testing by users via PRs/issues, quickly changing their target commits or repos as needed to help localize/bisect relevant changes. That said, I'm testing the same build from SM120 on SM89 every once in a while, with a parallel one for SM70 (no FI obviously). BTW, I am pretty sure that |
|
These three flags you added change nothing because those are the compiler's default values (we already removed fast math, so you don't need to pass these default values explicitly). As for ..., here is the ChatGPT reply: |
|
ChatGPT isn't considering the fact that we don't control the whole source tree, or at least audit it for these - upstream projects may set the countermanded flags themselves today or at some later point. When we got a taste of fast math here for a few days it was bad, and that sort of regression has to be guarded against if nothing else. Moreover, I'm seeing >5500 t/s in prefill with this and #31 while running a 3M ctx window (draft model work is being handled in two of those), for which we need precision. I've got a way to measure deviation in the logits returned, but it needs baselines. Should help with prefix-cached computation going sour at later seqs, but I haven't gotten it to the point of being able to say "confidence in a specific response drops from 90% to 30% with 2M tokens between the measured seqpos'" in the codebase yet. |
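The baseline comparison described there could be as simple as the following hypothetical helper (not existing code): record logits from a trusted build, then report the max-absolute deviation and KL divergence against the current build at the same sequence positions.

```rust
/// Compare logits from a baseline run against the current build.
/// Returns (max absolute deviation, KL divergence of softmax distributions).
fn logit_deviation(baseline: &[f32], current: &[f32]) -> (f32, f32) {
    assert_eq!(baseline.len(), current.len());
    // Max absolute deviation across the vocabulary.
    let max_abs = baseline
        .iter()
        .zip(current)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    // Numerically stable softmax.
    let softmax = |v: &[f32]| {
        let m = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = v.iter().map(|x| (x - m).exp()).collect();
        let z: f32 = exps.iter().sum();
        exps.into_iter().map(|e| e / z).collect::<Vec<f32>>()
    };
    let p = softmax(baseline);
    let q = softmax(current);
    // KL(p || q), skipping zero-probability baseline entries.
    let kl: f32 = p
        .iter()
        .zip(&q)
        .filter(|(pi, _)| **pi > 0.0)
        .map(|(pi, qi)| pi * (pi / qi.max(f32::MIN_POSITIVE)).ln())
        .sum();
    (max_abs, kl)
}
```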
We didn't use fast math across the entire project, including vllm.rs, attention.rs and candle. The only difference between this PR and main is the |
Force-pushed: 6d44609 to 683e984
Force-pushed: a1f74cb to e9104a1
|
So far... have had excellent results with this PR. That said, it's not your code I'm worried about. For example:
cutlass(main)$ grep -ri fastmath|wc -l
110
Most of that is in one file, but a fair deal of it is not. That's just one dep, and flashinfer isn't exactly simple either. The idea is to gate that at compilation time with explicit negations. |
I think we can add an env var for using "-fmad=false" (not used by default); are you able to revise the PR to support this? |
|
Of course - would it make sense to move this into a ...

Interesting datapoint: I have the SM121 currently running without this on a 6x scaling factor, and the scaling adjustments we merged upstream seem to help quite a bit. With this PR, the SM120 box can hit 3M tokens on-task (it's turning ...). I am fairly surprised at how much better long context is with the scaling PR even without this one, but with this one we're competitive with commercial-platform accuracy at long context on a few (pricey, but still) PCI cards. |
We can support additional compiler flags using an env var. By the way, I'm fixing a decoding cache mismatch issue; once it's solved, the prefix cache and tool calling can be speedy. |
Force-pushed: e9104a1 to 0f34892
Hardcoded repository and commit sources make testing require an additional out-of-tree step, wherein the library must be updated and consumers pointed to the updated branch in order to test FI changes. Allow env vars to override the defaults, with safe fallback to the prior settings: CARGO_FEATURE_FLASHINFER_REPO for the repo URL and CARGO_FEATURE_FLASHINFER_COMMIT for the commit hash.
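A minimal sketch of what that override might look like in a build script, assuming the env var names from the commit message; the DEFAULT_* values stand in for whatever the crate currently pins.

```rust
// Placeholders for the crate's pinned defaults (not the actual values).
const DEFAULT_FLASHINFER_REPO: &str = "<pinned default repo URL>";
const DEFAULT_FLASHINFER_COMMIT: &str = "<pinned default commit hash>";

/// Resolve the flashinfer source, preferring the env overrides when set.
fn flashinfer_source() -> (String, String) {
    let repo = std::env::var("CARGO_FEATURE_FLASHINFER_REPO")
        .unwrap_or_else(|_| DEFAULT_FLASHINFER_REPO.to_string());
    let commit = std::env::var("CARGO_FEATURE_FLASHINFER_COMMIT")
        .unwrap_or_else(|_| DEFAULT_FLASHINFER_COMMIT.to_string());
    (repo, commit)
}

fn main() {
    // Rebuild when either override changes.
    println!("cargo:rerun-if-env-changed=CARGO_FEATURE_FLASHINFER_REPO");
    println!("cargo:rerun-if-env-changed=CARGO_FEATURE_FLASHINFER_COMMIT");
    let (repo, commit) = flashinfer_source();
    println!("cargo:warning=building flashinfer from {repo} @ {commit}");
}
```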
Guard against deterioration of precision in attention generation in flashinfer/cutlass by way of fast-math-style optimizations resulting in iterative "dumbing down" of the values relevant to accurate sampling and final decode:
1. Disable FMA rounding non-determinism
2. Preserve subnormals (critical for attention)
3. Enable precise division
4. Enable precise square-root
Force-pushed: 0f34892 to 65acf39
Enable precision flags for flashinfer compilation
Guard against deterioration of precision in attention generation in flashinfer/cutlass by way of fast-math-style optimizations resulting in iterative "dumbing down" of the values relevant to accurate sampling and final decode:
1. Disable FMA rounding non-determinism
2. Preserve subnormals (critical for attention)
3. Enable precise division
4. Enable precise square-root

None of these things should be happening in the compilation process given the number of cumulatively deteriorating iterations piling up in a million-token-long KVCache, so let's guard against them explicitly.