Full graph parallel for Qwen3.5 (dense and MoE) #1388
Conversation
I am running on 2×3090 without P2P. Using -sm graph improves the TG speed from around 20 to 25. However, when mmproj is enabled, sending simple "Hello" causes a crash while -sm layer still works. The model is Qwen3.5-27B, quantized with the following configuration: Details``` blk\.0\.attn_gate\.weight=Q8_0 blk\.0\.attn_norm\.weight=BF16 blk\.0\.attn_qkv\.weight=Q6_K blk\.0\.ffn_down\.weight=Q6_K blk\.0\.ffn_gate\.weight=Q6_K blk\.0\.ffn_up\.weight=Q6_K blk\.0\.post_attention_norm\.weight=BF16 blk\.0\.ssm_a=BF16 blk\.0\.ssm_alpha\.weight=BF16 blk\.0\.ssm_beta\.weight=BF16 blk\.0\.ssm_conv1d\.weight=BF16 blk\.0\.ssm_dt\.bias=BF16 blk\.0\.ssm_norm\.weight=BF16 blk\.0\.ssm_out\.weight=BF16 blk\.1\.attn_gate\.weight=Q8_0 blk\.1\.attn_norm\.weight=BF16 blk\.1\.attn_qkv\.weight=Q6_K blk\.1\.ffn_down\.weight=Q6_K blk\.1\.ffn_gate\.weight=Q6_K blk\.1\.ffn_up\.weight=Q6_K blk\.1\.post_attention_norm\.weight=BF16 blk\.1\.ssm_a=BF16 blk\.1\.ssm_alpha\.weight=BF16 blk\.1\.ssm_beta\.weight=BF16 blk\.1\.ssm_conv1d\.weight=BF16 blk\.1\.ssm_dt\.bias=BF16 blk\.1\.ssm_norm\.weight=BF16 blk\.1\.ssm_out\.weight=BF16 blk\.2\.attn_gate\.weight=Q8_0 blk\.2\.attn_norm\.weight=BF16 blk\.2\.attn_qkv\.weight=Q6_K blk\.2\.ffn_down\.weight=Q6_K blk\.2\.ffn_gate\.weight=Q6_K blk\.2\.ffn_up\.weight=Q6_K blk\.2\.post_attention_norm\.weight=BF16 blk\.2\.ssm_a=BF16 blk\.2\.ssm_alpha\.weight=BF16 blk\.2\.ssm_beta\.weight=BF16 blk\.2\.ssm_conv1d\.weight=BF16 blk\.2\.ssm_dt\.bias=BF16 blk\.2\.ssm_norm\.weight=BF16 blk\.2\.ssm_out\.weight=BF16 blk\.3\.attn_k\.weight=BF16 blk\.3\.attn_k_norm\.weight=BF16 blk\.3\.attn_norm\.weight=BF16 blk\.3\.attn_q\.weight=Q8_0 blk\.3\.attn_q_norm\.weight=BF16 blk\.3\.attn_v\.weight=BF16 blk\.3\.ffn_down\.weight=Q6_K blk\.3\.ffn_gate\.weight=Q6_K blk\.3\.ffn_up\.weight=Q6_K blk\.3\.post_attention_norm\.weight=BF16 blk\.3\.attn_output\.weight=Q6_K blk\.4\.attn_gate\.weight=Q8_0 blk\.4\.attn_norm\.weight=BF16 blk\.4\.attn_qkv\.weight=Q6_K blk\.4\.ffn_down\.weight=Q6_K 
blk\.4\.ffn_gate\.weight=Q6_K blk\.4\.ffn_up\.weight=Q6_K blk\.4\.post_attention_norm\.weight=BF16 blk\.4\.ssm_a=BF16 blk\.4\.ssm_alpha\.weight=BF16 blk\.4\.ssm_beta\.weight=BF16 blk\.4\.ssm_conv1d\.weight=BF16 blk\.4\.ssm_dt\.bias=BF16 blk\.4\.ssm_norm\.weight=BF16 blk\.4\.ssm_out\.weight=BF16 blk\.5\.attn_gate\.weight=Q8_0 blk\.5\.attn_norm\.weight=BF16 blk\.5\.attn_qkv\.weight=Q6_K blk\.5\.ffn_down\.weight=Q6_K blk\.5\.ffn_gate\.weight=Q6_K blk\.5\.ffn_up\.weight=Q6_K blk\.5\.post_attention_norm\.weight=BF16 blk\.5\.ssm_a=BF16 blk\.5\.ssm_alpha\.weight=BF16 blk\.5\.ssm_beta\.weight=BF16 blk\.5\.ssm_conv1d\.weight=BF16 blk\.5\.ssm_dt\.bias=BF16 blk\.5\.ssm_norm\.weight=BF16 blk\.5\.ssm_out\.weight=BF16 blk\.6\.attn_gate\.weight=Q8_0 blk\.6\.attn_norm\.weight=BF16 blk\.6\.attn_qkv\.weight=Q6_K blk\.6\.ffn_down\.weight=Q6_K blk\.6\.ffn_gate\.weight=Q6_K blk\.6\.ffn_up\.weight=Q6_K blk\.6\.post_attention_norm\.weight=BF16 blk\.6\.ssm_a=BF16 blk\.6\.ssm_alpha\.weight=BF16 blk\.6\.ssm_beta\.weight=BF16 blk\.6\.ssm_conv1d\.weight=BF16 blk\.6\.ssm_dt\.bias=BF16 blk\.6\.ssm_norm\.weight=BF16 blk\.6\.ssm_out\.weight=BF16 blk\.7\.attn_k\.weight=BF16 blk\.7\.attn_k_norm\.weight=BF16 blk\.7\.attn_norm\.weight=BF16 blk\.7\.attn_q\.weight=Q8_0 blk\.7\.attn_q_norm\.weight=BF16 blk\.7\.attn_v\.weight=BF16 blk\.7\.ffn_down\.weight=Q6_K blk\.7\.ffn_gate\.weight=Q6_K blk\.7\.ffn_up\.weight=Q6_K blk\.7\.post_attention_norm\.weight=BF16 blk\.7\.attn_output\.weight=Q6_K blk\.8\.attn_gate\.weight=Q8_0 blk\.8\.attn_norm\.weight=BF16 blk\.8\.attn_qkv\.weight=Q6_K blk\.8\.ffn_down\.weight=Q6_K blk\.8\.ffn_gate\.weight=Q6_K blk\.8\.ffn_up\.weight=Q6_K blk\.8\.post_attention_norm\.weight=BF16 blk\.8\.ssm_a=BF16 blk\.8\.ssm_alpha\.weight=BF16 blk\.8\.ssm_beta\.weight=BF16 blk\.8\.ssm_conv1d\.weight=BF16 blk\.8\.ssm_dt\.bias=BF16 blk\.8\.ssm_norm\.weight=BF16 blk\.8\.ssm_out\.weight=BF16 blk\.9\.attn_gate\.weight=Q8_0 blk\.9\.attn_norm\.weight=BF16 blk\.9\.attn_qkv\.weight=Q6_K 
blk\.9\.ffn_down\.weight=Q6_K blk\.9\.ffn_gate\.weight=Q6_K blk\.9\.ffn_up\.weight=Q6_K blk\.9\.post_attention_norm\.weight=BF16 blk\.9\.ssm_a=BF16 blk\.9\.ssm_alpha\.weight=BF16 blk\.9\.ssm_beta\.weight=BF16 blk\.9\.ssm_conv1d\.weight=BF16 blk\.9\.ssm_dt\.bias=BF16 blk\.9\.ssm_norm\.weight=BF16 blk\.9\.ssm_out\.weight=BF16 blk\.10\.attn_gate\.weight=Q8_0 blk\.10\.attn_norm\.weight=BF16 blk\.10\.attn_qkv\.weight=Q6_K blk\.10\.ffn_down\.weight=Q6_K blk\.10\.ffn_gate\.weight=Q6_K blk\.10\.ffn_up\.weight=Q6_K blk\.10\.post_attention_norm\.weight=BF16 blk\.10\.ssm_a=BF16 blk\.10\.ssm_alpha\.weight=BF16 blk\.10\.ssm_beta\.weight=BF16 blk\.10\.ssm_conv1d\.weight=BF16 blk\.10\.ssm_dt\.bias=BF16 blk\.10\.ssm_norm\.weight=BF16 blk\.10\.ssm_out\.weight=BF16 blk\.11\.attn_k\.weight=BF16 blk\.11\.attn_k_norm\.weight=BF16 blk\.11\.attn_norm\.weight=BF16 blk\.11\.attn_q\.weight=Q8_0 blk\.11\.attn_q_norm\.weight=BF16 blk\.11\.attn_v\.weight=BF16 blk\.11\.ffn_down\.weight=Q8_0 blk\.11\.ffn_gate\.weight=Q6_K blk\.11\.ffn_up\.weight=Q6_K blk\.11\.post_attention_norm\.weight=BF16 blk\.11\.attn_output\.weight=Q6_K blk\.12\.attn_gate\.weight=Q8_0 blk\.12\.attn_norm\.weight=BF16 blk\.12\.attn_qkv\.weight=Q6_K blk\.12\.ffn_down\.weight=Q8_0 blk\.12\.ffn_gate\.weight=Q6_K blk\.12\.ffn_up\.weight=Q6_K blk\.12\.post_attention_norm\.weight=BF16 blk\.12\.ssm_a=BF16 blk\.12\.ssm_alpha\.weight=BF16 blk\.12\.ssm_beta\.weight=BF16 blk\.12\.ssm_conv1d\.weight=BF16 blk\.12\.ssm_dt\.bias=BF16 blk\.12\.ssm_norm\.weight=BF16 blk\.12\.ssm_out\.weight=BF16 blk\.13\.attn_gate\.weight=Q8_0 blk\.13\.attn_norm\.weight=BF16 blk\.13\.attn_qkv\.weight=Q6_K blk\.13\.ffn_down\.weight=Q8_0 blk\.13\.ffn_gate\.weight=Q6_K blk\.13\.ffn_up\.weight=Q6_K blk\.13\.post_attention_norm\.weight=BF16 blk\.13\.ssm_a=BF16 blk\.13\.ssm_alpha\.weight=BF16 blk\.13\.ssm_beta\.weight=BF16 blk\.13\.ssm_conv1d\.weight=BF16 blk\.13\.ssm_dt\.bias=BF16 blk\.13\.ssm_norm\.weight=BF16 blk\.13\.ssm_out\.weight=BF16 
blk\.14\.attn_gate\.weight=Q8_0 blk\.14\.attn_norm\.weight=BF16 blk\.14\.attn_qkv\.weight=Q6_K blk\.14\.ffn_down\.weight=Q8_0 blk\.14\.ffn_gate\.weight=Q6_K blk\.14\.ffn_up\.weight=Q6_K blk\.14\.post_attention_norm\.weight=BF16 blk\.14\.ssm_a=BF16 blk\.14\.ssm_alpha\.weight=BF16 blk\.14\.ssm_beta\.weight=BF16 blk\.14\.ssm_conv1d\.weight=BF16 blk\.14\.ssm_dt\.bias=BF16 blk\.14\.ssm_norm\.weight=BF16 blk\.14\.ssm_out\.weight=BF16 blk\.15\.attn_k\.weight=BF16 blk\.15\.attn_k_norm\.weight=BF16 blk\.15\.attn_norm\.weight=BF16 blk\.15\.attn_q\.weight=Q8_0 blk\.15\.attn_q_norm\.weight=BF16 blk\.15\.attn_v\.weight=BF16 blk\.15\.ffn_down\.weight=Q8_0 blk\.15\.ffn_gate\.weight=Q6_K blk\.15\.ffn_up\.weight=Q6_K blk\.15\.post_attention_norm\.weight=BF16 blk\.15\.attn_output\.weight=Q6_K blk\.16\.attn_gate\.weight=Q8_0 blk\.16\.attn_norm\.weight=BF16 blk\.16\.attn_qkv\.weight=Q6_K blk\.16\.ffn_down\.weight=Q8_0 blk\.16\.ffn_gate\.weight=Q6_K blk\.16\.ffn_up\.weight=Q6_K blk\.16\.post_attention_norm\.weight=BF16 blk\.16\.ssm_a=BF16 blk\.16\.ssm_alpha\.weight=BF16 blk\.16\.ssm_beta\.weight=BF16 blk\.16\.ssm_conv1d\.weight=BF16 blk\.16\.ssm_dt\.bias=BF16 blk\.16\.ssm_norm\.weight=BF16 blk\.16\.ssm_out\.weight=BF16 blk\.17\.attn_gate\.weight=Q8_0 blk\.17\.attn_norm\.weight=BF16 blk\.17\.attn_qkv\.weight=Q6_K blk\.17\.ffn_down\.weight=Q6_K blk\.17\.ffn_gate\.weight=Q6_K blk\.17\.ffn_up\.weight=Q6_K blk\.17\.post_attention_norm\.weight=BF16 blk\.17\.ssm_a=BF16 blk\.17\.ssm_alpha\.weight=BF16 blk\.17\.ssm_beta\.weight=BF16 blk\.17\.ssm_conv1d\.weight=BF16 blk\.17\.ssm_dt\.bias=BF16 blk\.17\.ssm_norm\.weight=BF16 blk\.17\.ssm_out\.weight=BF16 blk\.18\.attn_gate\.weight=Q8_0 blk\.18\.attn_norm\.weight=BF16 blk\.18\.attn_qkv\.weight=Q6_K blk\.18\.ffn_down\.weight=Q6_K blk\.18\.ffn_gate\.weight=Q6_K blk\.18\.ffn_up\.weight=Q6_K blk\.18\.post_attention_norm\.weight=BF16 blk\.18\.ssm_a=BF16 blk\.18\.ssm_alpha\.weight=BF16 blk\.18\.ssm_beta\.weight=BF16 blk\.18\.ssm_conv1d\.weight=BF16 
blk\.18\.ssm_dt\.bias=BF16 blk\.18\.ssm_norm\.weight=BF16 blk\.18\.ssm_out\.weight=BF16 blk\.19\.attn_k\.weight=BF16 blk\.19\.attn_k_norm\.weight=BF16 blk\.19\.attn_norm\.weight=BF16 blk\.19\.attn_q\.weight=Q8_0 blk\.19\.attn_q_norm\.weight=BF16 blk\.19\.attn_v\.weight=BF16 blk\.19\.ffn_down\.weight=Q6_K blk\.19\.ffn_gate\.weight=Q6_K blk\.19\.ffn_up\.weight=Q6_K blk\.19\.post_attention_norm\.weight=BF16 blk\.19\.attn_output\.weight=Q6_K blk\.20\.attn_gate\.weight=Q8_0 blk\.20\.attn_norm\.weight=BF16 blk\.20\.attn_qkv\.weight=Q6_K blk\.20\.ffn_down\.weight=Q6_K blk\.20\.ffn_gate\.weight=Q6_K blk\.20\.ffn_up\.weight=Q6_K blk\.20\.post_attention_norm\.weight=BF16 blk\.20\.ssm_a=BF16 blk\.20\.ssm_alpha\.weight=BF16 blk\.20\.ssm_beta\.weight=BF16 blk\.20\.ssm_conv1d\.weight=BF16 blk\.20\.ssm_dt\.bias=BF16 blk\.20\.ssm_norm\.weight=BF16 blk\.20\.ssm_out\.weight=BF16 blk\.21\.attn_gate\.weight=Q8_0 blk\.21\.attn_norm\.weight=BF16 blk\.21\.attn_qkv\.weight=Q6_K blk\.21\.ffn_down\.weight=Q6_K blk\.21\.ffn_gate\.weight=Q6_K blk\.21\.ffn_up\.weight=Q6_K blk\.21\.post_attention_norm\.weight=BF16 blk\.21\.ssm_a=BF16 blk\.21\.ssm_alpha\.weight=BF16 blk\.21\.ssm_beta\.weight=BF16 blk\.21\.ssm_conv1d\.weight=BF16 blk\.21\.ssm_dt\.bias=BF16 blk\.21\.ssm_norm\.weight=BF16 blk\.21\.ssm_out\.weight=BF16 blk\.22\.attn_gate\.weight=Q8_0 blk\.22\.attn_norm\.weight=BF16 blk\.22\.attn_qkv\.weight=Q6_K blk\.22\.ffn_down\.weight=Q6_K blk\.22\.ffn_gate\.weight=Q6_K blk\.22\.ffn_up\.weight=Q6_K blk\.22\.post_attention_norm\.weight=BF16 blk\.22\.ssm_a=BF16 blk\.22\.ssm_alpha\.weight=BF16 blk\.22\.ssm_beta\.weight=BF16 blk\.22\.ssm_conv1d\.weight=BF16 blk\.22\.ssm_dt\.bias=BF16 blk\.22\.ssm_norm\.weight=BF16 blk\.22\.ssm_out\.weight=BF16 blk\.23\.attn_k\.weight=BF16 blk\.23\.attn_k_norm\.weight=BF16 blk\.23\.attn_norm\.weight=BF16 blk\.23\.attn_q\.weight=Q8_0 blk\.23\.attn_q_norm\.weight=BF16 blk\.23\.attn_v\.weight=BF16 blk\.23\.ffn_down\.weight=Q6_K blk\.23\.ffn_gate\.weight=Q6_K 
blk\.23\.ffn_up\.weight=Q6_K blk\.23\.post_attention_norm\.weight=BF16 blk\.23\.attn_output\.weight=Q6_K blk\.24\.attn_gate\.weight=Q8_0 blk\.24\.attn_norm\.weight=BF16 blk\.24\.attn_qkv\.weight=Q6_K blk\.24\.ffn_down\.weight=Q6_K blk\.24\.ffn_gate\.weight=Q6_K blk\.24\.ffn_up\.weight=Q6_K blk\.24\.post_attention_norm\.weight=BF16 blk\.24\.ssm_a=BF16 blk\.24\.ssm_alpha\.weight=BF16 blk\.24\.ssm_beta\.weight=BF16 blk\.24\.ssm_conv1d\.weight=BF16 blk\.24\.ssm_dt\.bias=BF16 blk\.24\.ssm_norm\.weight=BF16 blk\.24\.ssm_out\.weight=BF16 blk\.25\.attn_gate\.weight=Q8_0 blk\.25\.attn_norm\.weight=BF16 blk\.25\.attn_qkv\.weight=Q6_K blk\.25\.ffn_down\.weight=Q6_K blk\.25\.ffn_gate\.weight=Q6_K blk\.25\.ffn_up\.weight=Q6_K blk\.25\.post_attention_norm\.weight=BF16 blk\.25\.ssm_a=BF16 blk\.25\.ssm_alpha\.weight=BF16 blk\.25\.ssm_beta\.weight=BF16 blk\.25\.ssm_conv1d\.weight=BF16 blk\.25\.ssm_dt\.bias=BF16 blk\.25\.ssm_norm\.weight=BF16 blk\.25\.ssm_out\.weight=BF16 blk\.26\.attn_gate\.weight=Q8_0 blk\.26\.attn_norm\.weight=BF16 blk\.26\.attn_qkv\.weight=Q6_K blk\.26\.ffn_down\.weight=Q6_K blk\.26\.ffn_gate\.weight=Q6_K blk\.26\.ffn_up\.weight=Q6_K blk\.26\.post_attention_norm\.weight=BF16 blk\.26\.ssm_a=BF16 blk\.26\.ssm_alpha\.weight=BF16 blk\.26\.ssm_beta\.weight=BF16 blk\.26\.ssm_conv1d\.weight=BF16 blk\.26\.ssm_dt\.bias=BF16 blk\.26\.ssm_norm\.weight=BF16 blk\.26\.ssm_out\.weight=BF16 blk\.27\.attn_k\.weight=BF16 blk\.27\.attn_k_norm\.weight=BF16 blk\.27\.attn_norm\.weight=BF16 blk\.27\.attn_q\.weight=Q8_0 blk\.27\.attn_q_norm\.weight=BF16 blk\.27\.attn_v\.weight=BF16 blk\.27\.ffn_down\.weight=Q6_K blk\.27\.ffn_gate\.weight=Q6_K blk\.27\.ffn_up\.weight=Q6_K blk\.27\.post_attention_norm\.weight=BF16 blk\.27\.attn_output\.weight=Q6_K blk\.28\.attn_gate\.weight=Q8_0 blk\.28\.attn_norm\.weight=BF16 blk\.28\.attn_qkv\.weight=Q6_K blk\.28\.ffn_down\.weight=Q6_K blk\.28\.ffn_gate\.weight=Q6_K blk\.28\.ffn_up\.weight=Q6_K blk\.28\.post_attention_norm\.weight=BF16 
blk\.28\.ssm_a=BF16 blk\.28\.ssm_alpha\.weight=BF16 blk\.28\.ssm_beta\.weight=BF16 blk\.28\.ssm_conv1d\.weight=BF16 blk\.28\.ssm_dt\.bias=BF16 blk\.28\.ssm_norm\.weight=BF16 blk\.28\.ssm_out\.weight=BF16 blk\.29\.attn_gate\.weight=Q8_0 blk\.29\.attn_norm\.weight=BF16 blk\.29\.attn_qkv\.weight=Q6_K blk\.29\.ffn_down\.weight=Q6_K blk\.29\.ffn_gate\.weight=Q6_K blk\.29\.ffn_up\.weight=Q6_K blk\.29\.post_attention_norm\.weight=BF16 blk\.29\.ssm_a=BF16 blk\.29\.ssm_alpha\.weight=BF16 blk\.29\.ssm_beta\.weight=BF16 blk\.29\.ssm_conv1d\.weight=BF16 blk\.29\.ssm_dt\.bias=BF16 blk\.29\.ssm_norm\.weight=BF16 blk\.29\.ssm_out\.weight=BF16 blk\.30\.attn_gate\.weight=Q8_0 blk\.30\.attn_norm\.weight=BF16 blk\.30\.attn_qkv\.weight=Q6_K blk\.30\.ffn_down\.weight=Q6_K blk\.30\.ffn_gate\.weight=Q6_K blk\.30\.ffn_up\.weight=Q6_K blk\.30\.post_attention_norm\.weight=BF16 blk\.30\.ssm_a=BF16 blk\.30\.ssm_alpha\.weight=BF16 blk\.30\.ssm_beta\.weight=BF16 blk\.30\.ssm_conv1d\.weight=BF16 blk\.30\.ssm_dt\.bias=BF16 blk\.30\.ssm_norm\.weight=BF16 blk\.30\.ssm_out\.weight=BF16 blk\.31\.attn_k\.weight=BF16 blk\.31\.attn_k_norm\.weight=BF16 blk\.31\.attn_norm\.weight=BF16 blk\.31\.attn_q\.weight=Q8_0 blk\.31\.attn_q_norm\.weight=BF16 blk\.31\.attn_v\.weight=BF16 blk\.31\.ffn_down\.weight=Q6_K blk\.31\.ffn_gate\.weight=Q6_K blk\.31\.ffn_up\.weight=Q6_K blk\.31\.post_attention_norm\.weight=BF16 blk\.31\.attn_output\.weight=Q6_K blk\.32\.attn_gate\.weight=Q8_0 blk\.32\.attn_norm\.weight=BF16 blk\.32\.attn_qkv\.weight=Q6_K blk\.32\.ffn_down\.weight=Q6_K blk\.32\.ffn_gate\.weight=Q6_K blk\.32\.ffn_up\.weight=Q6_K blk\.32\.post_attention_norm\.weight=BF16 blk\.32\.ssm_a=BF16 blk\.32\.ssm_alpha\.weight=BF16 blk\.32\.ssm_beta\.weight=BF16 blk\.32\.ssm_conv1d\.weight=BF16 blk\.32\.ssm_dt\.bias=BF16 blk\.32\.ssm_norm\.weight=BF16 blk\.32\.ssm_out\.weight=BF16 blk\.33\.attn_gate\.weight=Q8_0 blk\.33\.attn_norm\.weight=BF16 blk\.33\.attn_qkv\.weight=Q6_K blk\.33\.ffn_down\.weight=Q6_K 
blk\.33\.ffn_gate\.weight=Q6_K blk\.33\.ffn_up\.weight=Q6_K blk\.33\.post_attention_norm\.weight=BF16 blk\.33\.ssm_a=BF16 blk\.33\.ssm_alpha\.weight=BF16 blk\.33\.ssm_beta\.weight=BF16 blk\.33\.ssm_conv1d\.weight=BF16 blk\.33\.ssm_dt\.bias=BF16 blk\.33\.ssm_norm\.weight=BF16 blk\.33\.ssm_out\.weight=BF16 blk\.34\.attn_gate\.weight=Q8_0 blk\.34\.attn_norm\.weight=BF16 blk\.34\.attn_qkv\.weight=Q6_K blk\.34\.ffn_down\.weight=Q6_K blk\.34\.ffn_gate\.weight=Q6_K blk\.34\.ffn_up\.weight=Q6_K blk\.34\.post_attention_norm\.weight=BF16 blk\.34\.ssm_a=BF16 blk\.34\.ssm_alpha\.weight=BF16 blk\.34\.ssm_beta\.weight=BF16 blk\.34\.ssm_conv1d\.weight=BF16 blk\.34\.ssm_dt\.bias=BF16 blk\.34\.ssm_norm\.weight=BF16 blk\.34\.ssm_out\.weight=BF16 blk\.35\.attn_k\.weight=BF16 blk\.35\.attn_k_norm\.weight=BF16 blk\.35\.attn_norm\.weight=BF16 blk\.35\.attn_q\.weight=Q8_0 blk\.35\.attn_q_norm\.weight=BF16 blk\.35\.attn_v\.weight=BF16 blk\.35\.ffn_down\.weight=Q6_K blk\.35\.ffn_gate\.weight=Q6_K blk\.35\.ffn_up\.weight=Q6_K blk\.35\.post_attention_norm\.weight=BF16 blk\.35\.attn_output\.weight=Q6_K blk\.36\.attn_gate\.weight=Q8_0 blk\.36\.attn_norm\.weight=BF16 blk\.36\.attn_qkv\.weight=Q6_K blk\.36\.ffn_down\.weight=Q6_K blk\.36\.ffn_gate\.weight=Q6_K blk\.36\.ffn_up\.weight=Q6_K blk\.36\.post_attention_norm\.weight=BF16 blk\.36\.ssm_a=BF16 blk\.36\.ssm_alpha\.weight=BF16 blk\.36\.ssm_beta\.weight=BF16 blk\.36\.ssm_conv1d\.weight=BF16 blk\.36\.ssm_dt\.bias=BF16 blk\.36\.ssm_norm\.weight=BF16 blk\.36\.ssm_out\.weight=BF16 blk\.37\.attn_gate\.weight=Q8_0 blk\.37\.attn_norm\.weight=BF16 blk\.37\.attn_qkv\.weight=Q6_K blk\.37\.ffn_down\.weight=Q6_K blk\.37\.ffn_gate\.weight=Q6_K blk\.37\.ffn_up\.weight=Q6_K blk\.37\.post_attention_norm\.weight=BF16 blk\.37\.ssm_a=BF16 blk\.37\.ssm_alpha\.weight=BF16 blk\.37\.ssm_beta\.weight=BF16 blk\.37\.ssm_conv1d\.weight=BF16 blk\.37\.ssm_dt\.bias=BF16 blk\.37\.ssm_norm\.weight=BF16 blk\.37\.ssm_out\.weight=BF16 blk\.38\.attn_gate\.weight=Q8_0 
blk\.38\.attn_norm\.weight=BF16 blk\.38\.attn_qkv\.weight=Q6_K blk\.38\.ffn_down\.weight=Q6_K blk\.38\.ffn_gate\.weight=Q6_K blk\.38\.ffn_up\.weight=Q6_K blk\.38\.post_attention_norm\.weight=BF16 blk\.38\.ssm_a=BF16 blk\.38\.ssm_alpha\.weight=BF16 blk\.38\.ssm_beta\.weight=BF16 blk\.38\.ssm_conv1d\.weight=BF16 blk\.38\.ssm_dt\.bias=BF16 blk\.38\.ssm_norm\.weight=BF16 blk\.38\.ssm_out\.weight=BF16 blk\.39\.attn_k\.weight=BF16 blk\.39\.attn_k_norm\.weight=BF16 blk\.39\.attn_norm\.weight=BF16 blk\.39\.attn_q\.weight=Q8_0 blk\.39\.attn_q_norm\.weight=BF16 blk\.39\.attn_v\.weight=BF16 blk\.39\.ffn_down\.weight=Q6_K blk\.39\.ffn_gate\.weight=Q6_K blk\.39\.ffn_up\.weight=Q6_K blk\.39\.post_attention_norm\.weight=BF16 blk\.39\.attn_output\.weight=Q6_K blk\.40\.attn_gate\.weight=Q8_0 blk\.40\.attn_norm\.weight=BF16 blk\.40\.attn_qkv\.weight=Q6_K blk\.40\.ffn_down\.weight=Q6_K blk\.40\.ffn_gate\.weight=Q6_K blk\.40\.ffn_up\.weight=Q6_K blk\.40\.post_attention_norm\.weight=BF16 blk\.40\.ssm_a=BF16 blk\.40\.ssm_alpha\.weight=BF16 blk\.40\.ssm_beta\.weight=BF16 blk\.40\.ssm_conv1d\.weight=BF16 blk\.40\.ssm_dt\.bias=BF16 blk\.40\.ssm_norm\.weight=BF16 blk\.40\.ssm_out\.weight=BF16 blk\.41\.attn_gate\.weight=Q8_0 blk\.41\.attn_norm\.weight=BF16 blk\.41\.attn_qkv\.weight=Q6_K blk\.41\.ffn_down\.weight=Q6_K blk\.41\.ffn_gate\.weight=Q6_K blk\.41\.ffn_up\.weight=Q6_K blk\.41\.post_attention_norm\.weight=BF16 blk\.41\.ssm_a=BF16 blk\.41\.ssm_alpha\.weight=BF16 blk\.41\.ssm_beta\.weight=BF16 blk\.41\.ssm_conv1d\.weight=BF16 blk\.41\.ssm_dt\.bias=BF16 blk\.41\.ssm_norm\.weight=BF16 blk\.41\.ssm_out\.weight=BF16 blk\.42\.attn_gate\.weight=Q8_0 blk\.42\.attn_norm\.weight=BF16 blk\.42\.attn_qkv\.weight=Q6_K blk\.42\.ffn_down\.weight=Q6_K blk\.42\.ffn_gate\.weight=Q6_K blk\.42\.ffn_up\.weight=Q6_K blk\.42\.post_attention_norm\.weight=BF16 blk\.42\.ssm_a=BF16 blk\.42\.ssm_alpha\.weight=BF16 blk\.42\.ssm_beta\.weight=BF16 blk\.42\.ssm_conv1d\.weight=BF16 blk\.42\.ssm_dt\.bias=BF16 
blk\.42\.ssm_norm\.weight=BF16 blk\.42\.ssm_out\.weight=BF16 blk\.43\.attn_k\.weight=BF16 blk\.43\.attn_k_norm\.weight=BF16 blk\.43\.attn_norm\.weight=BF16 blk\.43\.attn_q\.weight=Q8_0 blk\.43\.attn_q_norm\.weight=BF16 blk\.43\.attn_v\.weight=BF16 blk\.43\.ffn_down\.weight=Q6_K blk\.43\.ffn_gate\.weight=Q6_K blk\.43\.ffn_up\.weight=Q6_K blk\.43\.post_attention_norm\.weight=BF16 blk\.43\.attn_output\.weight=Q6_K blk\.44\.attn_gate\.weight=Q8_0 blk\.44\.attn_norm\.weight=BF16 blk\.44\.attn_qkv\.weight=Q6_K blk\.44\.ffn_down\.weight=Q6_K blk\.44\.ffn_gate\.weight=Q6_K blk\.44\.ffn_up\.weight=Q6_K blk\.44\.post_attention_norm\.weight=BF16 blk\.44\.ssm_a=BF16 blk\.44\.ssm_alpha\.weight=BF16 blk\.44\.ssm_beta\.weight=BF16 blk\.44\.ssm_conv1d\.weight=BF16 blk\.44\.ssm_dt\.bias=BF16 blk\.44\.ssm_norm\.weight=BF16 blk\.44\.ssm_out\.weight=BF16 blk\.45\.attn_gate\.weight=Q8_0 blk\.45\.attn_norm\.weight=BF16 blk\.45\.attn_qkv\.weight=Q6_K blk\.45\.ffn_down\.weight=Q6_K blk\.45\.ffn_gate\.weight=Q6_K blk\.45\.ffn_up\.weight=Q6_K blk\.45\.post_attention_norm\.weight=BF16 blk\.45\.ssm_a=BF16 blk\.45\.ssm_alpha\.weight=BF16 blk\.45\.ssm_beta\.weight=BF16 blk\.45\.ssm_conv1d\.weight=BF16 blk\.45\.ssm_dt\.bias=BF16 blk\.45\.ssm_norm\.weight=BF16 blk\.45\.ssm_out\.weight=BF16 blk\.46\.attn_gate\.weight=Q8_0 blk\.46\.attn_norm\.weight=BF16 blk\.46\.attn_qkv\.weight=Q6_K blk\.46\.ffn_down\.weight=Q6_K blk\.46\.ffn_gate\.weight=Q6_K blk\.46\.ffn_up\.weight=Q6_K blk\.46\.post_attention_norm\.weight=BF16 blk\.46\.ssm_a=BF16 blk\.46\.ssm_alpha\.weight=BF16 blk\.46\.ssm_beta\.weight=BF16 blk\.46\.ssm_conv1d\.weight=BF16 blk\.46\.ssm_dt\.bias=BF16 blk\.46\.ssm_norm\.weight=BF16 blk\.46\.ssm_out\.weight=BF16 blk\.47\.attn_k\.weight=BF16 blk\.47\.attn_k_norm\.weight=BF16 blk\.47\.attn_norm\.weight=BF16 blk\.47\.attn_q\.weight=Q8_0 blk\.47\.attn_q_norm\.weight=BF16 blk\.47\.attn_v\.weight=BF16 blk\.47\.ffn_down\.weight=Q6_K blk\.47\.ffn_gate\.weight=Q6_K blk\.47\.ffn_up\.weight=Q6_K 
blk\.47\.post_attention_norm\.weight=BF16 blk\.47\.attn_output\.weight=Q6_K blk\.48\.attn_gate\.weight=Q8_0 blk\.48\.attn_norm\.weight=BF16 blk\.48\.attn_qkv\.weight=Q6_K blk\.48\.ffn_down\.weight=Q6_K blk\.48\.ffn_gate\.weight=Q6_K blk\.48\.ffn_up\.weight=Q6_K blk\.48\.post_attention_norm\.weight=BF16 blk\.48\.ssm_a=BF16 blk\.48\.ssm_alpha\.weight=BF16 blk\.48\.ssm_beta\.weight=BF16 blk\.48\.ssm_conv1d\.weight=BF16 blk\.48\.ssm_dt\.bias=BF16 blk\.48\.ssm_norm\.weight=BF16 blk\.48\.ssm_out\.weight=BF16 blk\.49\.attn_gate\.weight=Q8_0 blk\.49\.attn_norm\.weight=BF16 blk\.49\.attn_qkv\.weight=Q6_K blk\.49\.ffn_down\.weight=Q6_K blk\.49\.ffn_gate\.weight=Q6_K blk\.49\.ffn_up\.weight=Q6_K blk\.49\.post_attention_norm\.weight=BF16 blk\.49\.ssm_a=BF16 blk\.49\.ssm_alpha\.weight=BF16 blk\.49\.ssm_beta\.weight=BF16 blk\.49\.ssm_conv1d\.weight=BF16 blk\.49\.ssm_dt\.bias=BF16 blk\.49\.ssm_norm\.weight=BF16 blk\.49\.ssm_out\.weight=BF16 blk\.50\.attn_gate\.weight=Q8_0 blk\.50\.attn_norm\.weight=BF16 blk\.50\.attn_qkv\.weight=Q6_K blk\.50\.ffn_down\.weight=Q6_K blk\.50\.ffn_gate\.weight=Q6_K blk\.50\.ffn_up\.weight=Q6_K blk\.50\.post_attention_norm\.weight=BF16 blk\.50\.ssm_a=BF16 blk\.50\.ssm_alpha\.weight=BF16 blk\.50\.ssm_beta\.weight=BF16 blk\.50\.ssm_conv1d\.weight=BF16 blk\.50\.ssm_dt\.bias=BF16 blk\.50\.ssm_norm\.weight=BF16 blk\.50\.ssm_out\.weight=BF16 blk\.51\.attn_k\.weight=BF16 blk\.51\.attn_k_norm\.weight=BF16 blk\.51\.attn_norm\.weight=BF16 blk\.51\.attn_q\.weight=Q8_0 blk\.51\.attn_q_norm\.weight=BF16 blk\.51\.attn_v\.weight=BF16 blk\.51\.ffn_down\.weight=Q8_0 blk\.51\.ffn_gate\.weight=Q8_0 blk\.51\.ffn_up\.weight=Q8_0 blk\.51\.post_attention_norm\.weight=BF16 blk\.51\.attn_output\.weight=Q6_K blk\.52\.attn_gate\.weight=Q8_0 blk\.52\.attn_norm\.weight=BF16 blk\.52\.attn_qkv\.weight=Q6_K blk\.52\.ffn_down\.weight=Q6_K blk\.52\.ffn_gate\.weight=Q6_K blk\.52\.ffn_up\.weight=Q6_K blk\.52\.post_attention_norm\.weight=BF16 blk\.52\.ssm_a=BF16 
blk\.52\.ssm_alpha\.weight=BF16 blk\.52\.ssm_beta\.weight=BF16 blk\.52\.ssm_conv1d\.weight=BF16 blk\.52\.ssm_dt\.bias=BF16 blk\.52\.ssm_norm\.weight=BF16 blk\.52\.ssm_out\.weight=BF16 blk\.53\.attn_gate\.weight=Q8_0 blk\.53\.attn_norm\.weight=BF16 blk\.53\.attn_qkv\.weight=Q6_K blk\.53\.ffn_down\.weight=Q6_K blk\.53\.ffn_gate\.weight=Q6_K blk\.53\.ffn_up\.weight=Q6_K blk\.53\.post_attention_norm\.weight=BF16 blk\.53\.ssm_a=BF16 blk\.53\.ssm_alpha\.weight=BF16 blk\.53\.ssm_beta\.weight=BF16 blk\.53\.ssm_conv1d\.weight=BF16 blk\.53\.ssm_dt\.bias=BF16 blk\.53\.ssm_norm\.weight=BF16 blk\.53\.ssm_out\.weight=BF16 blk\.54\.attn_gate\.weight=Q8_0 blk\.54\.attn_norm\.weight=BF16 blk\.54\.attn_qkv\.weight=Q6_K blk\.54\.ffn_down\.weight=Q6_K blk\.54\.ffn_gate\.weight=Q6_K blk\.54\.ffn_up\.weight=Q6_K blk\.54\.post_attention_norm\.weight=BF16 blk\.54\.ssm_a=BF16 blk\.54\.ssm_alpha\.weight=BF16 blk\.54\.ssm_beta\.weight=BF16 blk\.54\.ssm_conv1d\.weight=BF16 blk\.54\.ssm_dt\.bias=BF16 blk\.54\.ssm_norm\.weight=BF16 blk\.54\.ssm_out\.weight=BF16 blk\.55\.attn_k\.weight=BF16 blk\.55\.attn_k_norm\.weight=BF16 blk\.55\.attn_norm\.weight=BF16 blk\.55\.attn_q\.weight=Q8_0 blk\.55\.attn_q_norm\.weight=BF16 blk\.55\.attn_v\.weight=BF16 blk\.55\.ffn_down\.weight=Q6_K blk\.55\.ffn_gate\.weight=Q6_K blk\.55\.ffn_up\.weight=Q6_K blk\.55\.post_attention_norm\.weight=BF16 blk\.55\.attn_output\.weight=Q6_K blk\.56\.attn_gate\.weight=Q8_0 blk\.56\.attn_norm\.weight=BF16 blk\.56\.attn_qkv\.weight=Q6_K blk\.56\.ffn_down\.weight=Q6_K blk\.56\.ffn_gate\.weight=Q6_K blk\.56\.ffn_up\.weight=Q6_K blk\.56\.post_attention_norm\.weight=BF16 blk\.56\.ssm_a=BF16 blk\.56\.ssm_alpha\.weight=BF16 blk\.56\.ssm_beta\.weight=BF16 blk\.56\.ssm_conv1d\.weight=BF16 blk\.56\.ssm_dt\.bias=BF16 blk\.56\.ssm_norm\.weight=BF16 blk\.56\.ssm_out\.weight=BF16 blk\.57\.attn_gate\.weight=Q8_0 blk\.57\.attn_norm\.weight=BF16 blk\.57\.attn_qkv\.weight=Q6_K blk\.57\.ffn_down\.weight=Q6_K blk\.57\.ffn_gate\.weight=Q6_K 
blk\.57\.ffn_up\.weight=Q6_K blk\.57\.post_attention_norm\.weight=BF16 blk\.57\.ssm_a=BF16 blk\.57\.ssm_alpha\.weight=BF16 blk\.57\.ssm_beta\.weight=BF16 blk\.57\.ssm_conv1d\.weight=BF16 blk\.57\.ssm_dt\.bias=BF16 blk\.57\.ssm_norm\.weight=BF16 blk\.57\.ssm_out\.weight=BF16 blk\.58\.attn_gate\.weight=Q8_0 blk\.58\.attn_norm\.weight=BF16 blk\.58\.attn_qkv\.weight=Q6_K blk\.58\.ffn_down\.weight=Q6_K blk\.58\.ffn_gate\.weight=Q6_K blk\.58\.ffn_up\.weight=Q6_K blk\.58\.post_attention_norm\.weight=BF16 blk\.58\.ssm_a=BF16 blk\.58\.ssm_alpha\.weight=BF16 blk\.58\.ssm_beta\.weight=BF16 blk\.58\.ssm_conv1d\.weight=BF16 blk\.58\.ssm_dt\.bias=BF16 blk\.58\.ssm_norm\.weight=BF16 blk\.58\.ssm_out\.weight=BF16 blk\.59\.attn_k\.weight=BF16 blk\.59\.attn_k_norm\.weight=BF16 blk\.59\.attn_norm\.weight=BF16 blk\.59\.attn_q\.weight=Q8_0 blk\.59\.attn_q_norm\.weight=BF16 blk\.59\.attn_v\.weight=BF16 blk\.59\.ffn_down\.weight=Q8_0 blk\.59\.ffn_gate\.weight=Q8_0 blk\.59\.ffn_up\.weight=Q8_0 blk\.59\.post_attention_norm\.weight=BF16 blk\.59\.attn_output\.weight=Q6_K blk\.60\.attn_gate\.weight=Q8_0 blk\.60\.attn_norm\.weight=BF16 blk\.60\.attn_qkv\.weight=Q6_K blk\.60\.ffn_down\.weight=Q8_0 blk\.60\.ffn_gate\.weight=Q8_0 blk\.60\.ffn_up\.weight=Q8_0 blk\.60\.post_attention_norm\.weight=BF16 blk\.60\.ssm_a=BF16 blk\.60\.ssm_alpha\.weight=BF16 blk\.60\.ssm_beta\.weight=BF16 blk\.60\.ssm_conv1d\.weight=BF16 blk\.60\.ssm_dt\.bias=BF16 blk\.60\.ssm_norm\.weight=BF16 blk\.60\.ssm_out\.weight=BF16 blk\.61\.attn_gate\.weight=Q8_0 blk\.61\.attn_norm\.weight=BF16 blk\.61\.attn_qkv\.weight=Q6_K blk\.61\.ffn_down\.weight=Q8_0 blk\.61\.ffn_gate\.weight=Q8_0 blk\.61\.ffn_up\.weight=Q8_0 blk\.61\.post_attention_norm\.weight=BF16 blk\.61\.ssm_a=BF16 blk\.61\.ssm_alpha\.weight=BF16 blk\.61\.ssm_beta\.weight=BF16 blk\.61\.ssm_conv1d\.weight=BF16 blk\.61\.ssm_dt\.bias=BF16 blk\.61\.ssm_norm\.weight=BF16 blk\.61\.ssm_out\.weight=BF16 blk\.62\.attn_gate\.weight=Q8_0 blk\.62\.attn_norm\.weight=BF16 
blk\.62\.attn_qkv\.weight=Q6_K blk\.62\.ffn_down\.weight=Q8_0 blk\.62\.ffn_gate\.weight=Q8_0 blk\.62\.ffn_up\.weight=Q8_0 blk\.62\.post_attention_norm\.weight=BF16 blk\.62\.ssm_a=BF16 blk\.62\.ssm_alpha\.weight=BF16 blk\.62\.ssm_beta\.weight=BF16 blk\.62\.ssm_conv1d\.weight=BF16 blk\.62\.ssm_dt\.bias=BF16 blk\.62\.ssm_norm\.weight=BF16 blk\.62\.ssm_out\.weight=BF16 blk\.63\.attn_k\.weight=BF16 blk\.63\.attn_k_norm\.weight=BF16 blk\.63\.attn_norm\.weight=BF16 blk\.63\.attn_q\.weight=Q8_0 blk\.63\.attn_q_norm\.weight=BF16 blk\.63\.attn_v\.weight=BF16 blk\.63\.ffn_down\.weight=Q8_0 blk\.63\.ffn_gate\.weight=Q8_0 blk\.63\.ffn_up\.weight=Q8_0 blk\.63\.post_attention_norm\.weight=BF16 blk\.63\.attn_output\.weight=Q6_K ```

The quantization recipe is copied from Unsloth's Q6_K_XL, with all F16 weights replaced by BF16. The mmproj is in BF16.
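A per-tensor recipe like the one above is usually expressed as comma-separated `regex=type` pairs passed to the quantization tool. A minimal sketch of assembling an abbreviated rule string in shell; the `--custom-q` flag of ik_llama.cpp's `llama-quantize` and the file names are assumptions shown only to illustrate the shape, not the exact recipe used here:

```shell
# Sketch: build an abbreviated version of the recipe above as
# comma-separated regex=type pairs (assumed --custom-q syntax;
# file names are placeholders).
rules='blk\.[0-9]+\.attn_qkv\.weight=q6_k'
rules="$rules,"'blk\.[0-9]+\.ffn_(down|gate|up)\.weight=q6_k'
rules="$rules,"'blk\.[0-9]+\.ssm_.*=bf16'
echo "$rules"
# ./build/bin/llama-quantize --custom-q "$rules" model-bf16.gguf model-out.gguf q6_k
```

The full recipe simply enumerates every tensor explicitly instead of using ranges.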
---
ubergarm/Qwen3.5-397B-A17B-GGUF/IQ2_KL; 8×3090. At the same time, I also got better results for AesSedai's IQ3_S. It's 60 t/s:
---
I'm hitting a segfault.
---
@hksdpc255 ik updated the notes pointing that out. @magikRUKKOLA I'll do some more testing to see if I get any faults; were you using it or sweep-benching when that happened? Looking great on 2×A6000 GPUs:
👈 Details

title: "ik_llama.cpp PR1388 full GPU offload"
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 1.995 | 2052.91 | 2.405 | 53.21 |
| 4096 | 128 | 4096 | 2.038 | 2010.06 | 2.366 | 54.10 |
| 4096 | 128 | 8192 | 2.093 | 1957.31 | 2.381 | 53.75 |
| 4096 | 128 | 12288 | 2.142 | 1912.16 | 2.411 | 53.09 |
| 4096 | 128 | 16384 | 2.213 | 1851.03 | 2.415 | 53.01 |
| 4096 | 128 | 20480 | 2.262 | 1810.90 | 2.428 | 52.72 |
| 4096 | 128 | 24576 | 2.323 | 1763.04 | 2.450 | 52.25 |
| 4096 | 128 | 28672 | 2.374 | 1725.62 | 2.458 | 52.09 |
| 4096 | 128 | 32768 | 2.436 | 1681.53 | 2.482 | 51.57 |
| 4096 | 128 | 36864 | 2.483 | 1649.42 | 2.485 | 51.51 |
| 4096 | 128 | 40960 | 2.539 | 1613.18 | 2.494 | 51.32 |
| 4096 | 128 | 45056 | 2.590 | 1581.39 | 2.517 | 50.85 |
| 4096 | 128 | 49152 | 2.647 | 1547.36 | 2.521 | 50.76 |
| 4096 | 128 | 53248 | 2.695 | 1520.04 | 2.528 | 50.64 |
| 4096 | 128 | 57344 | 2.757 | 1485.82 | 2.549 | 50.22 |
| 4096 | 128 | 61440 | 2.803 | 1461.32 | 2.553 | 50.13 |
| 4096 | 128 | 65536 | 2.847 | 1438.63 | 2.576 | 49.68 |
| 4096 | 128 | 69632 | 2.907 | 1408.89 | 2.584 | 49.54 |
| 4096 | 128 | 73728 | 2.954 | 1386.38 | 2.588 | 49.47 |
| 4096 | 128 | 77824 | 3.009 | 1361.35 | 2.608 | 49.07 |
| 4096 | 128 | 81920 | 3.071 | 1333.92 | 2.614 | 48.97 |
| 4096 | 128 | 86016 | 3.112 | 1316.26 | 2.627 | 48.73 |
| 4096 | 128 | 90112 | 3.171 | 1291.56 | 2.644 | 48.42 |
| 4096 | 128 | 94208 | 3.232 | 1267.21 | 2.657 | 48.18 |
| 4096 | 128 | 98304 | 3.274 | 1251.13 | 2.678 | 47.79 |
| 4096 | 128 | 102400 | 3.329 | 1230.35 | 2.682 | 47.72 |
| 4096 | 128 | 106496 | 3.381 | 1211.61 | 2.691 | 47.57 |
| 4096 | 128 | 110592 | 3.435 | 1192.53 | 2.712 | 47.20 |
| 4096 | 128 | 114688 | 3.486 | 1175.00 | 2.713 | 47.18 |
| 4096 | 128 | 118784 | 3.547 | 1154.73 | 2.737 | 46.77 |
| 4096 | 128 | 122880 | 3.589 | 1141.39 | 2.743 | 46.66 |
| 4096 | 128 | 126976 | 3.645 | 1123.86 | 2.744 | 46.64 |
| 4096 | 128 | 131072 | 3.691 | 1109.67 | 2.769 | 46.22 |
ik_llama.cpp PR1388 ik/sm_graph_delta_net@d1c0acb
model=/mnt/raid/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-c 135168 \
-sm graph \
-ngl 999 \
-ub 4096 -b 4096 \
--threads 1 \
--no-mmap \
-n 128 \
    --warmup-batch

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 1.341 | 3055.34 | 1.657 | 77.25 |
| 4096 | 128 | 4096 | 1.399 | 2927.48 | 1.671 | 76.61 |
| 4096 | 128 | 8192 | 1.447 | 2830.55 | 1.674 | 76.44 |
| 4096 | 128 | 12288 | 1.503 | 2725.59 | 1.706 | 75.05 |
| 4096 | 128 | 16384 | 1.567 | 2613.55 | 1.710 | 74.85 |
| 4096 | 128 | 20480 | 1.629 | 2514.75 | 1.719 | 74.46 |
| 4096 | 128 | 24576 | 1.683 | 2433.20 | 1.747 | 73.25 |
| 4096 | 128 | 28672 | 1.734 | 2361.91 | 1.752 | 73.07 |
| 4096 | 128 | 32768 | 1.789 | 2289.76 | 1.779 | 71.95 |
| 4096 | 128 | 36864 | 1.850 | 2214.35 | 1.783 | 71.77 |
| 4096 | 128 | 40960 | 1.911 | 2143.73 | 1.794 | 71.36 |
| 4096 | 128 | 45056 | 1.955 | 2094.78 | 1.816 | 70.50 |
| 4096 | 128 | 49152 | 2.014 | 2033.58 | 1.819 | 70.37 |
| 4096 | 128 | 53248 | 2.058 | 1990.10 | 1.828 | 70.04 |
| 4096 | 128 | 57344 | 2.110 | 1941.10 | 1.850 | 69.20 |
| 4096 | 128 | 61440 | 2.174 | 1883.73 | 1.855 | 68.99 |
| 4096 | 128 | 65536 | 2.209 | 1853.83 | 1.880 | 68.08 |
| 4096 | 128 | 69632 | 2.275 | 1800.64 | 1.887 | 67.84 |
| 4096 | 128 | 73728 | 2.322 | 1764.14 | 1.889 | 67.76 |
| 4096 | 128 | 77824 | 2.374 | 1725.17 | 1.914 | 66.88 |
| 4096 | 128 | 81920 | 2.432 | 1684.17 | 1.920 | 66.66 |
| 4096 | 128 | 86016 | 2.483 | 1649.71 | 1.939 | 66.02 |
| 4096 | 128 | 90112 | 2.540 | 1612.81 | 1.957 | 65.39 |
| 4096 | 128 | 94208 | 2.581 | 1587.01 | 1.963 | 65.21 |
| 4096 | 128 | 98304 | 2.646 | 1548.28 | 1.984 | 64.51 |
| 4096 | 128 | 102400 | 2.694 | 1520.29 | 1.993 | 64.21 |
| 4096 | 128 | 106496 | 2.743 | 1493.35 | 2.002 | 63.92 |
| 4096 | 128 | 110592 | 2.796 | 1465.17 | 2.023 | 63.27 |
| 4096 | 128 | 114688 | 2.859 | 1432.56 | 2.027 | 63.15 |
| 4096 | 128 | 118784 | 2.897 | 1413.94 | 2.053 | 62.35 |
| 4096 | 128 | 122880 | 2.964 | 1381.86 | 2.058 | 62.20 |
| 4096 | 128 | 126976 | 3.006 | 1362.45 | 2.062 | 62.06 |
| 4096 | 128 | 131072 | 3.049 | 1343.55 | 2.086 | 61.37 |
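As a quick sanity check on how the sweep above degrades with context, the first (N_KV=0) and last (N_KV=131072) rows can be compared directly; the numbers are taken from the table and the arithmetic is just illustrative:

```shell
# PP and TG slowdown from empty context to 128k context,
# using the first and last rows of the table above.
pp_drop=$(awk 'BEGIN { printf "%.1f", (3055.34 - 1343.55) / 3055.34 * 100 }')
tg_drop=$(awk 'BEGIN { printf "%.1f", (77.25 - 61.37) / 77.25 * 100 }')
echo "PP slowdown: ${pp_drop}%  TG slowdown: ${tg_drop}%"
```

So PP roughly halves by 128k of context, while TG only loses about a fifth of its speed.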
---
Yes, I updated the description with a third caveat: reading/writing of the cached recurrent state is not yet implemented, so it crashes.
---
Hrm, when running llama-server it immediately segfaults after I send a prompt... Is there a way to explicitly disable reading/writing of the cached state? (I tried …)

Here is the gdb output after recompiling in debug mode:

👈 Details

gdb -q --args \
./build/bin/llama-server \
--model "$model" \
--alias Qwen3.5-122B-A10B \
-c 262144 \
-sm graph \
-ngl 99 \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 1 \
--host 127.0.0.1 \
--port 8080 \
--jinja \
--no-mmap \
--cache-ram 0
(gdb) set print thread-events off
(gdb) run
...
No tensors in buffer type CUDA0
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 602.46 MiB
llm_load_tensors: CUDA1 buffer size = 602.47 MiB
llm_load_tensors: CUDA_Split buffer size = 61773.38 MiB
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
~ggml_backend_cuda_context: have 0 graphs
....................................................................................................
llama_init_from_model: n_ctx = 262144
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
=== Created recurrent cache cache_s_l0 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l1 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l2 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l4 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l5 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l6 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l8 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l9 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l10 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l12 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l13 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l14 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l16 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l17 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l18 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l20 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l21 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l22 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l24 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l25 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l26 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l28 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l29 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l30 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l32 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l33 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l34 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l36 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l37 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l38 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l40 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l41 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l42 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l44 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l45 as 1085440 x 1 x 1 x 1
=== Created recurrent cache cache_s_l46 as 1085440 x 1 x 1 x 1
llama_kv_cache_init: CUDA_Split KV buffer size = 6293.07 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 3146.53 MiB
Device 1: 3146.53 MiB
llama_init_from_model: KV self size = 6144.00 MiB, K (f16): 3072.00 MiB, V (f16): 3072.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 2592.02 MiB
ggml_gallocr_reserve_n: reallocating CUDA1 buffer from size 0.00 MiB to 3976.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 2096.11 MiB
llama_init_from_model: CUDA0 compute buffer size = 2592.02 MiB
llama_init_from_model: CUDA1 compute buffer size = 3976.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 2096.11 MiB
llama_init_from_model: graph nodes = 6822
llama_init_from_model: graph splits = 289
llama_init_from_model: enabling only_active_experts scheduling
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
INFO [ init] initializing slots | tid="140737350545408" timestamp=1773068930 n_slots=1
INFO [ init] new slot | tid="140737350545408" timestamp=1773068930 id_slot=0 n_ctx_slot=262144
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
update_cuda_graph_executable: CUDA graph update failed
no implementations specified for speculative decoding
slot init: id 0 | task -1 | speculative decoding context not initialized
prompt cache is disabled - use `--cache-ram N` to enable it
INFO [ main] model loaded | tid="140737350545408" timestamp=1773068930
INFO [ main] chat template | tid="140737350545408" timestamp=1773068930 chat_template="..."
INFO [ main] HTTP server listening | tid="140737350545408" timestamp=1773068930 n_threads_http="47" port="8080" hostname="127.0.0.1
"
INFO [ slots_idle] all slots are idle | tid="140737350545408" timestamp=1773068930
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
Grammar: bash-any-arg ::= func-bash-kv-command | func-bash-kv-timeout | func-bash-kv-workdir | func-bash-kv-description
bash-any-arg-with-end ::= bash-any-arg bash-last-arg-end
bash-args-relaxed ::= ( bash-any-arg-with-end )*
bash-call ::= "\n<tool_call>\n<function=" "bash" ">\n" bash-args-relaxed
bash-last-arg-end ::= "\n</parameter>\n"
boolean ::= ("true" | "false") space
char ::= [^"\\\x7F\x00-\x1F] | [\\] (["\\bfnrt] | "u" [0-9a-fA-F]{4})
decimal-part ::= [0-9]{1,16}
edit-any-arg ::= func-edit-kv-filePath | func-edit-kv-oldString | func-edit-kv-newString | func-edit-kv-replaceAll
edit-any-arg-with-end ::= edit-any-arg edit-last-arg-end
edit-args-relaxed ::= ( edit-any-arg-with-end )*
edit-call ::= "\n<tool_call>\n<function=" "edit" ">\n" edit-args-relaxed
...
write-last-arg-end ::= "\n</parameter>\n"
Grammar lazy: true
Chat format: Qwen3 Coder
INFO [ launch_slot_with_task] slot is processing task | tid="140737350545408" timestamp=1773069254 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140737350545408" timestamp=1773069254 id_slot=0 id_task=0 p0=0
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
update_cuda_graph_executable: CUDA graph update failed
Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00007ffff7916acb in llama_data_write_buffer::get_tensor_data_split (ptr=0x7feabeaef4b0 "", tensor=0x5555594c0420, kv=0x0,
aux_buffer=std::vector of length 0, capacity 0, offset=0, size=4341760) at /home/w/projects/ik_llama.cpp/src/llama.cpp:6665
6665 auto ne = kv->ne[1];
(gdb) bt
#0 0x00007ffff7916acb in llama_data_write_buffer::get_tensor_data_split (ptr=0x7feabeaef4b0 "", tensor=0x5555594c0420, kv=0x0,
aux_buffer=std::vector of length 0, capacity 0, offset=0, size=4341760) at /home/w/projects/ik_llama.cpp/src/llama.cpp:6665
#1 0x00007ffff79169e9 in llama_data_write_buffer::get_tensor_data_split (this=0x7fffffff91c0, tensor=0x5555594c0420, offset=0, size=4341760, il=0)
at /home/w/projects/ik_llama.cpp/src/llama.cpp:6660
#2 0x00007ffff7916716 in llama_data_write_buffer::write_tensor_data (this=0x7fffffff91c0, tensor=0x5555594c0420, offset=0, size=4341760, il=0)
at /home/w/projects/ik_llama.cpp/src/llama.cpp:6644
#3 0x00007ffff7913427 in llama_data_write::write_kv_cache_data (this=0x7fffffff91c0, ctx=0x55555d0d5340,
cell_ranges=std::vector of length 1, capacity 1 = {...}, seq_id=0, flags=1) at /home/w/projects/ik_llama.cpp/src/llama.cpp:6067
#4 0x00007ffff7913764 in llama_data_write::write_kv_cache (this=0x7fffffff91c0, ctx=0x55555d0d5340, seq_id=0, flags=1)
at /home/w/projects/ik_llama.cpp/src/llama.cpp:6109
#5 0x00007ffff78fec35 in llama_state_seq_get_data_internal (ctx=0x55555d0d5340, data_ctx=..., seq_id=0, flags=1)
at /home/w/projects/ik_llama.cpp/src/llama.cpp:6952
#6 0x00007ffff78fed6c in llama_state_seq_get_data (ctx=0x55555d0d5340, dst=0x7feabeae7010 "", size=156338064, seq_id=0, flags=1)
at /home/w/projects/ik_llama.cpp/src/llama.cpp:6965
#7 0x000055555576bac3 in server_context::create_checkpoint (this=0x7fffffffca20, slot=...)
at /home/w/projects/ik_llama.cpp/examples/server/server-context.cpp:2754
#8 0x000055555576ae3e in server_context::create_checkpoint_at_interval (this=0x7fffffffca20, slot=..., params_base=...)
at /home/w/projects/ik_llama.cpp/examples/server/server-context.cpp:2652
#9 0x0000555555773450 in server_context::process_batch_tokens (this=0x7fffffffca20, n_batch=@0x7fffffff9664: 4096)
at /home/w/projects/ik_llama.cpp/examples/server/server-context.cpp:3432
#10 0x000055555577451e in server_context::update_slots (this=0x7fffffffca20) at /home/w/projects/ik_llama.cpp/examples/server/server-context.cpp:3582
#11 0x00005555556af0f9 in std::__invoke_impl<void, void (server_context::*&)(), server_context*&> (
__f=@0x555568e717c0: (void (server_context::*)(server_context * const)) 0x5555557740e2 <server_context::update_slots()>,
__t=@0x555568e717d0: 0x7fffffffca20) at /usr/include/c++/13/bits/invoke.h:74
#12 0x00005555556a85b7 in std::__invoke<void (server_context::*&)(), server_context*&> (
__fn=@0x555568e717c0: (void (server_context::*)(server_context * const)) 0x5555557740e2 <server_context::update_slots()>)
at /usr/include/c++/13/bits/invoke.h:96
#13 0x000055555569dd89 in std::_Bind<void (server_context::*(server_context*))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) (
this=0x555568e717c0, __args=...) at /usr/include/c++/13/functional:506
#14 0x0000555555696095 in std::_Bind<void (server_context::*(server_context*))()>::operator()<, void>() (this=0x555568e717c0)
at /usr/include/c++/13/functional:591
#15 0x0000555555689298 in std::__invoke_impl<void, std::_Bind<void (server_context::*(server_context*))()>&>(std::__invoke_other, std::_Bind<void (ser
ver_context::*(server_context*))()>&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#16 0x000055555567d530 in std::__invoke_r<void, std::_Bind<void (server_context::*(server_context*))()>&>(std::_Bind<void (server_context::*(server_co
ntext*))()>&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:111
#17 0x0000555555669959 in std::_Function_handler<void (), std::_Bind<void (server_context::*(server_context*))()> >::_M_invoke(std::_Any_data const&)
(__functor=...) at /usr/include/c++/13/bits/std_function.h:290
--Type <RET> for more, q to quit, c to continue without paging--
#18 0x000055555564c11a in std::function<void ()>::operator()() const (this=0x7fffffffdcc0) at /usr/include/c++/13/bits/std_function.h:591
#19 0x00005555556f94c1 in server_queue::start_loop (this=0x7fffffffdb68) at /home/w/projects/ik_llama.cpp/examples/server/server-queue.cpp:133
#20 0x000055555562c72a in main (argc=27, argv=0x7fffffffdfd8) at /home/w/projects/ik_llama.cpp/examples/server/server.cpp:2139
(gdb) info args
ptr = 0x7feabeaef4b0 ""
tensor = 0x5555594c0420
kv = 0x0
aux_buffer = std::vector of length 0, capacity 0
offset = 0
size = 4341760
(gdb) info locals
ne = 0
full_row_size = 0
first_row = 0
num_rows = 0
extra = 0xec2b22222d21282b
kv_extra = 0xbfbf2e32ed352d24
split_offset = 0
total_size = 0 |
|
I just pushed a change. Does it solve the segfault? |
|
Still faulting immediately after sending a short test prompt from the web UI.
gdb output below: Details |
|
|
Even setups without P2P GPU access will benefit from this. Nice work!

ubergarm/Qwen3.5-27B-GGUF/Qwen3.5-27B-smol-IQ4_NL.gguf, no P2P, 2*3090

without PR, -sm graph: Details
without PR, -sm layer: Details
without PR, 1*3090: Details
with PR, -sm graph: Details
with PR, -sm layer: Details
with PR, 1*3090: Details
|
|
OK, another fix. That should not crash. |
|
$ git diff
diff --git a/src/llama.cpp b/src/llama.cpp
index d93f8818..84beb7d4 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -6717,7 +6717,7 @@ struct llama_data_write_buffer : llama_data_write {
auto split = extra->splits[id];
if (!split) continue;
GGML_ASSERT(split->type == tensor->type);
- auto split_row_size = ggml_row_size(tensor->type, tensor->ne[0]);
+ auto split_row_size = ggml_row_size(tensor->type, split->ne[0]);
auto split_size = split_row_size * num_rows;
if (split_size > aux_buffer.size()) aux_buffer.resize(split_size);
ggml_backend_tensor_get(split, aux_buffer.data(), first_row*split_row_size, split_size); |
|
60 t/s for a 397B model is pretty good! Do we qualify to play in the Big Boys league now (vLLM, sglang)? |
|
Hi, just a question: do you mean this issue between graph and vision is just for Qwen3.5, or in general? I tested last week, and again just some minutes ago with the latest build, and graph even without NCCL was giving nice results, but I was completely unable to get graph and mmproj/vision to work at the same time with ANY model (I also tried ministral-3 and gemma3 with their mmproj). Is this the same issue you mentioned, or is it an unrelated one? |
|
@TomTheWise The inability to use vision (or even just loading an mmproj file) with split mode graph is general, not specific to Qwen3.5. I'm looking into it, but so far I cannot see the reason it does not work. |
|
Broadly, on the Qwen3.5 122B MoE in full offload I get +20% in PP and +50% in TG vs the previous graph implementation. Note: the +50% TG is partly explained by the P-state kick allowing full frequencies for both the GPU core and the VRAM, due to a sufficient load. But side benefits are part of the overall benefit. |
Power consumption per GPU: 133W-295W (avg ~195W). |
|
Here is 4x RTX 6000 PRO Blackwell and SGLang with Qwen3.5-397B-A17B-NVFP4:
Well, yeah, it's about 7 t/s better in decode, but the price is ~5x for their setup. :) |
|
The RTX 6000 PRO Blackwell is quite a bit faster than the 3090. My guess is at least 2x for prefill and 1.5x for generation. Then there is the fact that the overhead for 4 GPUs is much less than the overhead for 8 GPUs. I think you should expect at least 90 t/s from |
|
I have no luck running -sm graph on my 4*3090 system:

CUDA_VISIBLE_DEVICES=4,5,6,7 ./ik_llama.cpp-build-qwen35graph/llama-sweep-bench --split-mode layer --cache-type-k bf16 --cache-type-v bf16 --n-gpu-layers 999 --model Qwen3.5-27B-BF16.gguf --ctx-size 65536
CUDA_VISIBLE_DEVICES=4,5,6,7 ./ik_llama.cpp-build-qwen35graph/llama-sweep-bench --split-mode graph --cache-type-k bf16 --cache-type-v bf16 --n-gpu-layers 999 --model Qwen3.5-27B-BF16.gguf --ctx-size 65536

Details |
|
@hksdpc255 The latest commit should solve your issue of not being able to run Qwen3.5 with 4 GPUs. |
|
@ikawrakow Thank you. I'm building the new branch. |
|
Still crashes: Details |
|
Well, not sure what your issue is. I don't have the
|
|
You're right, my hard disk is broken. I moved the model and llama-sweep-bench to another disk and now it works. Here's the BF16 performance for 4*3090: Details

main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8
update: I mixed up several benchmarks that were running simultaneously earlier. The data I'm posting now should be correct. |
sglang 0.5.9 seems currently broken for --tp 4 with RTX 3090. In my setup the output is broken, so I don't know if the performance in this situation is still meaningful.

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m sglang.launch_server --model Qwen3.5-27B --tp-size 4 --mem-fraction-static 0.7 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --max-running-requests 1 --disable-radix-cache

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m sglang.launch_server --model Qwen3.5-27B --tp-size 4 --mem-fraction-static 0.7 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --max-running-requests 1 --disable-radix-cache

ik_llama.cpp beats sglang when MTP (speculative) is disabled! But we are still slower when MTP is enabled. A speculative-decode accept rate of around 0.6 seems normal compared to my other sglang deployment (using fp8 quant, fitting in 48GB on a single GPU), where the accept rate is usually 0.7 when I give it a hard math problem. |
The 4x3090 (PCIe 4.0 x16) gives such similar results to yours that I am not even going to post them. [EDIT]: Actually, it runs happily with only three GPUs. Will post the results later on. |
…)" This reverts commit f90b4c2.
|
Is it working with Q8 cache? Still crashes for me on Qwen 397B. Also, since 344688c, Devstral prompt speed is cut in half. My branch at that commit is almost 700 t/s and head is 360 t/s PP. |
|
@Ph0rk0z I have put a spell on it. It does not work for ST users most of the time. When it does work, it works at half the performance. Here is what a non-ST user such as myself gets on the latest main branch for Devstral-123B-IQ4_KSS
|
So much for Qwen3.5-397B-A17B not working with |
|
Qwen3.5-9B BF16 performance for 4*3090: Details
main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8
sglang with speculative enabled, 4*3090: Details
sglang with speculative disabled, 4*3090: Details
Qwen3.5-9B BF16 performance for 2*3090: Details
main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8
sglang with speculative enabled, 2*3090: Details
sglang with speculative disabled, 2*3090: Details
Qwen3.5-9B BF16 performance for 4*3090, -sm layer: Details
main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8
Qwen3.5-9B BF16 performance for 2*3090, -sm layer: Details
main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8
Qwen3.5-9B BF16 performance for 1*3090: Details
main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8
Note: sglang is still outputting |
What does it have to do with SillyTavern? I ran sweep benches before/after with the same command line as previously. I am still getting the same t/s for Devstral Q4_K, and somehow it's higher than yours at 37 t/s. If it's not the Q8 cache for Qwen, then I don't know, but that's how it crashes on me, and it seems to have plenty of memory. |
|
Main branch:
Prior version:
I can check the other commits in between and make sure it's this one. It actually started with this one: 14492bf |
|
Do you see this in your output? It shouldn't be there. What exactly did you do for it to be there? |
|
I didn't do anything. I'm loading the Qwen model as well to see if it still crashes in graph mode. |
|
What happens if you remove the |
|
Yes, omitting it works. I was confused and thought this was the other PR discussion, hah. When using |
|
Finally loaded the 397B with the vision commit reverted. I use --no-mmap because loading is slower but output is faster. It's kinda crazy that this gives this much extra t/s but cuts the prompt speed in half. Maybe there's a way to have both :P With the commit reverted I have: And briefly, graph mode doesn't crash:
With the new commit Mistral is back:
|





Graph parallel (a.k.a. split mode graph) support for Qwen3-Next and Qwen3.5 was added in PRs #1292, #1331 and #1347. However, these graph parallel implementations are incomplete, as recurrent attention layers are still computed on a single GPU. This PR adds a full graph parallel implementation for the Qwen3.5 models.
The tricky part was not the parallelization of the compute graph, as I was expecting, but extracting the right portions from the recurrent attention tensors for each GPU.
I observe very significant performance improvements - see graph below.
Caveat 1: this PR disables graph parallel for Qwen3-Next. As mentioned above, the tricky part in enabling graph parallel for recurrent attention layers is extracting the correct tensor portions for each GPU. Mainline developers have, for whatever reason, decided to use different data arrangements for Qwen3-Next and the Qwen3.5 series. Qwen3-Next mostly works, but something is not 100% correct (yet), so I have disabled graph parallel for it for now.
Caveat 2: There is an issue when using vision with split mode graph. Hence, I have disabled split mode graph for now when --mmproj is present in the command line arguments.

Caveat 3: Reading and writing the cached recurrent state is not yet implemented for full split mode graph. Looking into it at this point.

The following two graphs show PP-2048 and TG-128 performance as a function of context length for Qwen3.5-27B-IQ4_XS on a 2x3090 system.