
[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)#17985

Merged
Kangyan-Zhou merged 2 commits into sgl-project:main from froststeam:qzg/mate-fa3
Apr 2, 2026

Conversation

@froststeam (Contributor) commented Jan 30, 2026

Motivation

This PR is the 9th in a series of pull requests (tracked in #16565) to add full support for Moore Threads GPUs, leveraging MUSA (Meta-computing Unified System Architecture) to accelerate LLM inference.

Modifications

This PR adds support for the fa3 attention backend, powered by MATE (MUSA AI Tensor Engine).

Key modifications include:

  1. In python/sglang/srt/layers/attention/attention_registry.py, the fa3 attention backend can be enabled when the MUSA platform's compute capability is 31 or higher.
  2. In python/sglang/srt/utils/common.py, a workspace buffer is initialized during the first forward call, because MATE's flash_attn_with_kvcache interface requires one to generate scheduling information for optimal performance. This currently supports the pipeline parallelism and two-batch overlap scenarios.
  3. In python/sglang/srt/layers/attention/flashattention_backend.py, MATE's flash_attn_varlen_func and flash_attn_with_kvcache interfaces are integrated to back the fa3 attention backend.
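As a rough illustration of the gating in item 1, the check can be sketched as follows. This is a minimal sketch: the constant and helper names below are hypothetical, not the actual attention_registry API.

```python
# Hypothetical sketch of the capability gate described in item 1.
# MUSA_FA3_MIN_CAPABILITY and fa3_supported_on_musa are illustrative
# names; they are not the actual attention_registry API.
MUSA_FA3_MIN_CAPABILITY = 31


def fa3_supported_on_musa(capability: int) -> bool:
    # fa3 is only enabled when the MUSA compute capability is >= 31.
    return capability >= MUSA_FA3_MIN_CAPABILITY
```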

Testing Done

Tested in a clean torch_musa container.
Qwen3-4B with MUSA Graph enabled runs correctly with tp=2 and pp=2:

python -m sglang.launch_server \
 --model-path /mnt/seed17/001688/models/Qwen3-4B/ \
 --trust-remote-code \
 --attention-backend fa3 \
 --tp-size 2 \
 --pp-size 2

INFO 01-30 20:54:38 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-30 20:54:38 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-30 20:54:38 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-30 20:54:38 [__init__.py:232] Platform plugin musa is activated
[2026-01-30 20:54:38] WARNING warnings.py:109: /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:76: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

[2026-01-30 20:54:38] WARNING warnings.py:109: /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:47: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")

[2026-01-30 20:54:38] WARNING server_args.py:2138: Pipeline parallelism is incompatible with overlap schedule.
[2026-01-30 20:54:39] server_args=ServerArgs(model_path='/mnt/seed17/001688/models/Qwen3-4B/', tokenizer_path='/mnt/seed17/001688/models/Qwen3-4B/', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.831, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='musa', tp_size=2, pp_size=2, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=438492825, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, 
custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='/mnt/seed17/001688/models/Qwen3-4B/', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='pytorch', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, 
speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', 
kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, 
triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='in-seq-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, 
pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-01-30 20:54:39] Using default HuggingFace chat template with detected content format: string
[2026-01-30 20:54:47 PP0 TP0] Init torch distributed begin.
[2026-01-30 20:54:47 PP1 TP1] Init torch distributed begin.
[2026-01-30 20:54:47 PP0 TP1] Init torch distributed begin.
[2026-01-30 20:54:47 PP1 TP0] Init torch distributed begin.
[2026-01-30 20:54:48 PP1 TP0] sglang is using nccl==2.11.4
[2026-01-30 20:54:48 PP0 TP0] sglang is using nccl==2.11.4
[2026-01-30 20:54:49 PP0 TP0] sglang is using nccl==2.11.4
[2026-01-30 20:54:49 PP0 TP1] sglang is using nccl==2.11.4
[2026-01-30 20:54:50 PP0 TP0] Init torch distributed ends. elapsed=2.96 s, mem usage=0.88 GB
[2026-01-30 20:54:50 PP0 TP1] Init torch distributed ends. elapsed=2.94 s, mem usage=0.96 GB
[2026-01-30 20:54:50 PP1 TP1] Init torch distributed ends. elapsed=2.95 s, mem usage=0.96 GB
[2026-01-30 20:54:50 PP1 TP0] Init torch distributed ends. elapsed=2.93 s, mem usage=0.96 GB
[2026-01-30 20:54:50 PP0 TP0] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2026-01-30 20:54:50 PP1 TP0] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.falcon_h1: No module named 'cuda'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.jet_nemotron: No module named 'cuda'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.jet_vlm: No module named 'cuda'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.nano_nemotron_vl: No module named 'cuda'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.nemotron_h: No module named 'cuda'
[2026-01-30 20:54:50 PP0 TP1] Ignore import error when loading sglang.srt.models.nemotron_h_mtp: No module named 'cuda'
[2026-01-30 20:54:50 PP0 TP1] Load weight begin. avail mem=78.38 GB
[2026-01-30 20:54:50 PP0 TP0] Load weight begin. avail mem=78.27 GB
[2026-01-30 20:54:50 PP1 TP1] Load weight begin. avail mem=78.38 GB
[2026-01-30 20:54:50 PP1 TP0] Load weight begin. avail mem=78.38 GB
[2026-01-30 20:54:54 PP1 TP0] Beginning to load weights
[2026-01-30 20:54:54 PP0 TP1] Beginning to load weights
[2026-01-30 20:54:54 PP0 TP0] Beginning to load weights
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
[2026-01-30 20:54:54 PP1 TP1] Beginning to load weights
[2026-01-30 20:54:54 PP1 TP1] Parameter model.embed_tokens.weight not found in params_dict
[2026-01-30 20:54:54 PP1 TP0] Parameter model.embed_tokens.weight not found in params_dict
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:02<00:05,  2.50s/it]
[2026-01-30 20:54:56 PP0 TP0] Parameter model.norm.weight not found in params_dict
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:02<00:01,  1.10s/it]
[2026-01-30 20:54:56 PP1 TP1] Loading weights took 2.69 seconds
[2026-01-30 20:54:56 PP1 TP1] Load weight end. elapsed=6.21 s, type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=76.05 GB, mem usage=2.33 GB.
[2026-01-30 20:54:57 PP0 TP1] Parameter model.norm.weight not found in params_dict
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.04it/s]

[2026-01-30 20:54:57 PP0 TP0] Loading weights took 2.89 seconds
[2026-01-30 20:54:57 PP0 TP0] Load weight end. elapsed=6.40 s, type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=75.94 GB, mem usage=2.33 GB.
[2026-01-30 20:54:57 PP0 TP1] Loading weights took 3.11 seconds
[2026-01-30 20:54:57 PP0 TP1] Load weight end. elapsed=6.61 s, type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=76.05 GB, mem usage=2.33 GB.
[2026-01-30 20:54:57 PP0 TP0] Using KV cache dtype: torch.bfloat16
[2026-01-30 20:54:57 PP1 TP0] Loading weights took 3.15 seconds
[2026-01-30 20:54:57 PP1 TP0] Load weight end. elapsed=6.64 s, type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=76.05 GB, mem usage=2.33 GB.
[2026-01-30 20:54:57 PP1 TP0] Using KV cache dtype: torch.bfloat16
[2026-01-30 20:54:57 PP1 TP0] KV Cache is allocated. #tokens: 1826494, K size: 31.35 GB, V size: 31.35 GB
[2026-01-30 20:54:57 PP1 TP0] Memory pool end. avail mem=12.57 GB
[2026-01-30 20:54:57 PP1 TP1] KV Cache is allocated. #tokens: 1826494, K size: 31.35 GB, V size: 31.35 GB
[2026-01-30 20:54:57 PP0 TP0] KV Cache is allocated. #tokens: 1826494, K size: 31.35 GB, V size: 31.35 GB
[2026-01-30 20:54:57 PP0 TP1] KV Cache is allocated. #tokens: 1826494, K size: 31.35 GB, V size: 31.35 GB
[2026-01-30 20:54:57 PP1 TP1] Memory pool end. avail mem=12.57 GB
[2026-01-30 20:54:57 PP0 TP1] Memory pool end. avail mem=12.57 GB
[2026-01-30 20:54:57 PP0 TP0] Memory pool end. avail mem=12.45 GB
[2026-01-30 20:54:58 PP0 TP0] Init attention backend begin.
[2026-01-30 20:54:58 PP1 TP0] Init attention backend begin.
[2026-01-30 20:54:58 PP0 TP1] Init attention backend begin.
[2026-01-30 20:54:58 PP0 TP0] Init attention backend end. elapsed=0.05 s
[2026-01-30 20:54:58 PP0 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=12.40 GB
[2026-01-30 20:54:58 PP0 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2026-01-30 20:54:58 PP1 TP1] Init attention backend begin.
[2026-01-30 20:54:58 PP1 TP0] Init attention backend end. elapsed=0.03 s
[2026-01-30 20:54:58 PP1 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=12.51 GB
[2026-01-30 20:54:58 PP1 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2026-01-30 20:54:58 PP0 TP1] Init attention backend end. elapsed=0.04 s
[2026-01-30 20:54:58 PP0 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=12.51 GB
[2026-01-30 20:54:58 PP1 TP1] Init attention backend end. elapsed=0.04 s
[2026-01-30 20:54:58 PP1 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=12.51 GB
Capturing batches (bs=1 avail_mem=11.12 GB): 100%|████████████████████████████████████████████████████████████████| 36/36 [01:41<00:00,  2.82s/it]
[2026-01-30 20:56:40 PP0 TP0] Registering 1332 cuda graph addresses
[2026-01-30 20:56:40 PP0 TP0] Capture cuda graph end. Time elapsed: 102.32 s. mem usage=1.29 GB. avail mem=11.11 GB.
[2026-01-30 20:56:40 PP0 TP1] Capture cuda graph end. Time elapsed: 102.44 s. mem usage=1.29 GB. avail mem=11.23 GB.
Capturing batches (bs=1 avail_mem=11.07 GB): 100%|████████████████████████████████████████████████████████████████| 36/36 [02:43<00:00,  4.54s/it]
[2026-01-30 20:57:42 PP1 TP0] Registering 1296 cuda graph addresses
[2026-01-30 20:57:44 PP1 TP1] Capture cuda graph end. Time elapsed: 166.16 s. mem usage=1.45 GB. avail mem=11.06 GB.
[2026-01-30 20:57:44 PP1 TP0] Capture cuda graph end. Time elapsed: 166.34 s. mem usage=1.45 GB. avail mem=11.06 GB.
[2026-01-30 20:57:44 PP0 TP0] max_total_num_tokens=1826494, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=11.11 GB
[2026-01-30 20:57:44 PP1 TP0] max_total_num_tokens=1826494, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=11.06 GB
[2026-01-30 20:57:45] INFO:     Started server process [3478419]
[2026-01-30 20:57:45] INFO:     Waiting for application startup.
[2026-01-30 20:57:45] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2026-01-30 20:57:45] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2026-01-30 20:57:45] INFO:     Application startup complete.
[2026-01-30 20:57:45] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-01-30 20:57:46] INFO:     127.0.0.1:38850 - "GET /model_info HTTP/1.1" 200 OK
[2026-01-30 20:57:46 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, 
[2026-01-30 20:57:46 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, 
[2026-01-30 20:57:46 PP0 TP0] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1010: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1577.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-01-30 20:57:46 PP0 TP1] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1010: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1577.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-01-30 20:57:47 PP1 TP0] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1010: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1577.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-01-30 20:57:47 PP1 TP1] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1010: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1577.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-01-30 20:57:47] INFO:     127.0.0.1:38864 - "POST /generate HTTP/1.1" 200 OK
[2026-01-30 20:57:47] The server is fired up and ready to roll!
[2026-01-30 20:58:43 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.11, 
[2026-01-30 20:58:43 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.11, 
[2026-01-30 20:58:44 PP0 TP0] Decode batch, #running-req: 1, #token: 36, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.17, #queue-req: 0, 
[2026-01-30 20:58:44 PP1 TP0] Decode batch, #running-req: 1, #token: 36, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.17, #queue-req: 0, 
[2026-01-30 20:58:45 PP0 TP0] Decode batch, #running-req: 1, #token: 76, token usage: 0.00, cuda graph: True, gen throughput (token/s): 46.09, #queue-req: 0, 
[2026-01-30 20:58:45 PP1 TP0] Decode batch, #running-req: 1, #token: 76, token usage: 0.00, cuda graph: True, gen throughput (token/s): 46.08, #queue-req: 0, 
[2026-01-30 20:58:45] INFO:     127.0.0.1:34282 - "POST /generate HTTP/1.1" 200 OK


@gemini-code-assist
Contributor

Summary of Changes

Hello @froststeam, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates support for MATE's FlashAttention v3 compatibility interface, primarily targeting MUSA (MTHREADS) GPUs. The changes enable the system to dynamically select and utilize MATE's optimized attention kernels when running on MUSA hardware. This involves updating device capability checks, conditionally importing MATE-specific attention functions, and passing necessary scheduler metadata to these kernels. Additionally, the model configuration has been enhanced to accommodate specific model architectures, such as MoE models with initial dense layers, ensuring broader compatibility and optimized performance across different hardware and model types.

Highlights

  • MUSA (MTHREADS) GPU Support for FlashAttention v3: The FlashAttention v3 backend has been extended to support MUSA GPUs, including specific device capability checks for MUSA architecture (MP>=31) alongside existing NVIDIA CUDA requirements (SM>=80 and SM<=90).
  • MATE Integration for MUSA Devices: Conditional imports have been introduced to dynamically use MATE's optimized FlashAttention v3 implementation when running on MUSA devices, ensuring compatibility and performance benefits for this specific hardware.
  • MATE Scheduler Metadata Integration: A new mechanism generates and passes MATE-specific scheduler metadata (including a workspace buffer and initialization status) to the FlashAttention kernels during both forward_extend and forward_decode operations, which is crucial for MATE's efficient operation.
  • Enhanced Model Configuration for MoE Models: The ModelConfig now includes a first_k_dense_replace attribute, allowing for better handling of Mixture-of-Experts (MoE) models that may have an initial set of dense layers before transitioning to MoE layers, especially in a pipeline parallel setup.
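
The lazy workspace initialization described in the scheduler-metadata highlight can be sketched as follows. This is a minimal illustration, not the actual sglang code: the `MateScheduleState` name, the default buffer size, and the use of a plain `bytearray` as a stand-in for a device tensor are all assumptions.

```python
class MateScheduleState:
    """Sketch of a lazily initialized MATE workspace buffer.

    MATE's flash_attn_with_kvcache needs a workspace allocated during the
    first forward call so schedule information can be generated once and
    reused; this sketch models that handshake with a plain bytearray
    instead of a real device tensor.
    """

    def __init__(self, workspace_bytes: int = 32 * 1024 * 1024):
        self.workspace_bytes = workspace_bytes
        self.workspace = None
        self.initialized = False

    def ensure_initialized(self):
        # Allocate exactly once; subsequent forward calls reuse the buffer,
        # keeping a stable address for the generated schedule metadata.
        if not self.initialized:
            self.workspace = bytearray(self.workspace_bytes)
            self.initialized = True
        return self.workspace
```

The same one-shot pattern would extend naturally to per-microbatch buffers for the pipeline-parallel and two-batch-overlap cases the PR mentions.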


@gemini-code-assist bot left a comment

Code Review

This pull request adds support for MATE's FA3 compatibility interface, which appears to be for MUSA/MTHREADS hardware. The changes are well-structured and touch upon model configuration, attention registry, and the FlashAttention backend to accommodate the new platform.

I've identified a critical bug in flashattention_backend.py that could lead to a NameError, along with some opportunities for code improvement regarding type hints, code duplication, and import style. Addressing these points will enhance the code's correctness and maintainability.

@froststeam froststeam changed the title feat: support MATE's FA3 compatibility interface [MUSA][9/N] Add FA3 attention backend support through MATE(MUSA AI Tensor Engine) Jan 30, 2026
@yeahdongcn yeahdongcn changed the title [MUSA][9/N] Add FA3 attention backend support through MATE(MUSA AI Tensor Engine) [MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) Jan 30, 2026
@froststeam froststeam force-pushed the qzg/mate-fa3 branch 2 times, most recently from 6de8cdf to 99462a5 Compare January 30, 2026 12:44
@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Jan 30, 2026
@froststeam froststeam force-pushed the qzg/mate-fa3 branch 3 times, most recently from 49d4c53 to 5668a30 Compare February 9, 2026 08:50
@froststeam froststeam marked this pull request as ready for review February 9, 2026 08:56
@froststeam froststeam requested a review from HaiShaw as a code owner March 12, 2026 03:09
@froststeam froststeam force-pushed the qzg/mate-fa3 branch 2 times, most recently from 2e61875 to cbf7377 Compare March 19, 2026 01:58
@yeahdongcn
Collaborator

/tag-run-ci-label

@froststeam froststeam force-pushed the qzg/mate-fa3 branch 2 times, most recently from 6407037 to 9236b5b Compare March 23, 2026 02:29
@yeahdongcn
Collaborator

/rerun-failed-ci

3 similar comments
"FlashAttention v3 Backend requires SM>=80 and SM<=90. "
"Please use `--attention-backend flashinfer`."
)
major, minor = get_device_capability()
Collaborator

sort of code dupe with re: #18648 (comment)
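
The capability gate being flagged as duplicated could, in spirit, be factored into one helper. This is an illustrative sketch only, not the registry's actual code; the platform strings are assumptions, while the thresholds follow the PR description (NVIDIA SM in [80, 90], MUSA compute capability >= 31).

```python
def fa3_supported(platform: str, major: int, minor: int) -> bool:
    """Gate the fa3 attention backend on device compute capability."""
    cap = major * 10 + minor
    if platform == "cuda":
        # FlashAttention v3 backend requires SM>=80 and SM<=90.
        return 80 <= cap <= 90
    if platform == "musa":
        # MATE's FA3 path requires MUSA capability >= 31 (MP31).
        return cap >= 31
    return False
```

Centralizing the check like this would address the duplication without changing behavior on either platform.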

max_seqlen_k = metadata.max_seq_len_k

# Create MUSA flash attention context manager (or nullcontext for CUDA)
musa_ctx = (
Collaborator

isn't this just more generally

if context is not null, call forward extend implementation?

Contributor Author

You're right that nullcontext() is a no-op and won't affect other platforms. My main hesitation is that this context is currently designed specifically for MUSA, and I'm unsure if other platforms will have similar needs in the future.
Given this uncertainty, what would you recommend as the best approach? Should we abstract it now with some documentation to clarify the intent, or keep the platform check explicit until there's more demand? Any suggestions would be greatly appreciated!

Collaborator

I think for now leave a TODO with a lengthy explanation

Contributor Author

Thanks! TODO comment added to get_flash_attention_context.
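
The resolution above, an explicit platform check that falls back to `nullcontext()` on non-MUSA devices, follows a standard pattern. A hedged sketch, assuming a hypothetical MUSA-specific setup body (the state dict and helper names are invented for illustration):

```python
from contextlib import contextmanager, nullcontext


@contextmanager
def _musa_flash_attention_context(prefix, max_seqlen_k):
    # Hypothetical MUSA-specific setup/teardown; in the real backend this
    # would stage MATE schedule metadata around the kernel call.
    state = {"prefix": prefix, "max_seqlen_k": max_seqlen_k}
    try:
        yield state
    finally:
        state.clear()


def get_flash_attention_context(is_musa: bool, prefix: str, max_seqlen_k: int):
    """Return a platform context manager, or a no-op for other platforms.

    TODO: if another platform grows a similar need, generalize this into a
    platform registry instead of an explicit check.
    """
    if is_musa:
        return _musa_flash_attention_context(prefix, max_seqlen_k)
    return nullcontext()
```

Because `nullcontext()` is a no-op, CUDA and other platforms pay no cost when entering the `with` block.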

Comment on lines +1064 to +1067
update_flash_attention_context(
prefix="forward_extend_use_cascade_attn",
max_seqlen_k=self.forward_metadata_spec_decode_expand.max_seq_len_k,
)
Collaborator

can this context update be done earlier? (I may be mistaken here)

Contributor Author

Thanks for the suggestion. Unfortunately, moving this earlier would cause unnecessary scheduler metadata generation, which would hurt inference performance. I've kept it here to avoid that overhead.

@alexnails (Collaborator) Mar 26, 2026

Gotcha.

Separately, I see you made a refactor — did that not come to pass, or what was the final result?

Contributor Author

I extracted the context creation into a separate method (get_flash_attention_context) and added a TODO comment to clarify the intent. No structural refactoring beyond that - keeping it simple since only MUSA needs this for now.
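
The trade-off discussed in this thread, generating schedule metadata only on the path that actually needs it, amounts to lazy, keyed computation. A generic sketch under assumed names (not the backend's real structure):

```python
class LazyScheduleMetadata:
    """Compute per-(prefix, max_seqlen_k) schedule metadata on demand.

    Updating the context earlier would generate metadata for code paths
    (e.g. the cascade-attention branch) that never run; deferring the
    computation to the call site avoids that overhead, and caching keeps
    repeated calls on the same path cheap.
    """

    def __init__(self, compute_fn):
        self._compute = compute_fn
        self._cache = {}
        self.compute_calls = 0

    def get(self, prefix: str, max_seqlen_k: int):
        key = (prefix, max_seqlen_k)
        if key not in self._cache:
            self.compute_calls += 1
            self._cache[key] = self._compute(prefix, max_seqlen_k)
        return self._cache[key]
```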

@yeahdongcn
Collaborator

/rerun-failed-ci

3 similar comments
@Kangyan-Zhou Kangyan-Zhou merged commit 939cf39 into sgl-project:main Apr 2, 2026
359 of 417 checks passed
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
…ensor Engine) (sgl-project#17985)

Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
realray808 pushed a commit to Ascend/sglang that referenced this pull request Apr 3, 2026
* [AMD] Fix AMD CI monitor GitHub API rate limit exhaustion (sgl-project#21527)

* [CI] Register missing jit_kernel test files (sgl-project#21547)

* [diffusion] fix: return None instead of raising RuntimeError when no model info found (sgl-project#21319)

Co-authored-by: Mick <mickjagger19@icloud.com>

* [rl][sgl] fix tensor mismatch after pause (sgl-project#21514)

* [Hicache & JIT_kernel] Support page first layout  & mla jit kernel (sgl-project#18311)

* test: point DSV3 int8 MLA CI models to lmsys Hugging Face org (sgl-project#21561)

* [CI] Relax several thresholds in flaky CIs (sgl-project#21562)

* feat: add gc_threshold arg (sgl-project#21481)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix flaky test_pp_single_node (sgl-project#21564)

* Split workflow for releasing runtime docker (sgl-project#21563)

* fix tp capture in vit cuda graph (sgl-project#17255)

* [1/n] lora support - Auto detect lora target modules (sgl-project#21439)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* [fix] qwen3.5 fuse_moe_triton_tune bug (sgl-project#20232)

* Remove sync when enabling return_logprob (sgl-project#20972)

* Scope streaming backlog coalescing to incremental_streaming_output mode (sgl-project#21037)

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* docs: flesh out MAINTAINER.md oncall lists and link GitHub profiles (sgl-project#21575)

* [NVIDIA] Enable automatic NUMA configuration (sgl-project#19452)

* [diffusion] UX: aggregate expected dtype-cast logs during weight loading (sgl-project#21552)

* [diffusion] refactor: Unify `TeaCacheParams` and `WanTeaCacheParams` (sgl-project#20706)

Co-authored-by: Mick <mickjagger19@icloud.com>

* [diffusion] chore: remove redundant identity preprocess_text functions(sgl-project#20633)

Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>

* Update CODEOWNERS for transformers.py and docs (sgl-project#21555)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* reduce CPU peak memory in multimodal tensor hashing (sgl-project#21123)

* Fix HFRunner hang when subprocess dies during init (sgl-project#21582)

* Fix Piecewise CUDA Graph crash with `-enable-mixed-chunk` (sgl-project#20441)

Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>

* [CI] Replace upload/download-artifact with job outputs in release-docker workflow (sgl-project#21579)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Patch transformers is_base_mistral in CI to avoid HF 429 rate limiting (sgl-project#21586)

* [CI] Move v32 cp test to deepep running suite (sgl-project#21585)

* [AMD] Add GLM-4.7-FP8 accuracy CI test for MI35x (sgl-project#21534)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [Clean] Remove deprecated environs (sgl-project#21536)

* [diffusion] fix: fix Flux2-Klein prompt tokenization length to 512 and add regression coverage (sgl-project#21407)

* [CI] hot-fix ci lint (sgl-project#21608)

* [diffusion] feat: support overlay model materialization (sgl-project#21600)

* [VLM] Optimize ShmPointerMMData for multi-pickle safety and deferred unwrap (sgl-project#21465)

* feat: enable CUDA graph and timestamp for the whisper model(sgl-project#21190)

* [NPU] Update quantization&CI documentation (sgl-project#21100)

Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>

* Skip ci for .md files (sgl-project#21482)

* Support skip-softmax attention (sgl-project#19089)

* fix: piecewise_cuda_graph get correct qo_indptr (sgl-project#21452)

Co-authored-by: Avery Huang <averyh@nvidia.com>

* fix bench_serving sglang backend to support image dataset  (sgl-project#21294)

* [AMD] Add peft>=0.18.0 to diffusion_hip deps for transformers 5.x compat for AMD diffusion model (sgl-project#21442)

Co-authored-by: HaiShaw <hixiao@gmail.com>

* [GDN] Fuse GDN kkt + solve_tril into one kernel (sgl-project#21411)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [Diffusion] Align diffusion benchmark skill presets with nightly comparison cases (sgl-project#21616)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Clean up detokenizer and remove dead multimodal_gen code (sgl-project#21588)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Skip flaky elastic EP test (sgl-project#21619)

* feat(ci): add GB300 nightly benchmark test suites (sgl-project#21487)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Lossen test_return_routed_experts threshold (sgl-project#21270)

* Add subprocess liveness monitor to detect scheduler crashes (sgl-project#18582)

Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com>

* fix: scheduler launch hang when non-current rank dies (sgl-project#20287)

* Wrap IPv6 addresses in gRPC, bench_serving, and log messages (sgl-project#21236)

Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

* [HiCache] fix: graceful shutdown of pending async tasks in bench_mix.py (sgl-project#20276)

* Clean up _wait_for_scheduler_ready implementation (sgl-project#21626)

* fix cuda graph capturing error in sm120 mxfp8 triton path (sgl-project#19835)

* [sgl] disable piecewise cuda graph when a model doesn't have layers (sgl-project#21565)

* [Feature] Optimizations for JPEG input on NVIDIA GPU (sgl-project#19749)

* [VLM] perf: optimize CUDA IPC for multimodal transfer by caching IPC pool handles (sgl-project#21418)

* [Fix] SGLANG_USE_CUDA_IPC_TRANSPORT=1 and SGLANG_ENABLE_MM_SPLITTING=1 do not work at the same time. (sgl-project#19915)

* [Fix] Remove redundant allreduce fusion block and skip TP=1 (sgl-project#20621)

* Simplify routed experts test and move base64 encoding to tokenizer manager (sgl-project#21634)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Cleanup] Remove unused BatchMultimodalOutput and BatchMultimodalDecodeReq (sgl-project#21640)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean up TokenizerManager: remove dead code and improve rid validation (sgl-project#21639)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* README: coding agent sponsorship for long-term contributors (sgl-project#21642)

* Fix circular reference in CustomTestCase.__init_subclass__ (sgl-project#21650)

Co-authored-by: wan4ch <wan4ch@gmail.com>

* [Fix] Fix Qwen3.5 MoE model loading and Mamba cache sharding in PP mode (sgl-project#21448)

Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com>

* [diffusion] CI: fix dashboard chart (nightly) display issues (sgl-project#21653)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update sponsorship details in README.md (sgl-project#21658)

* [Fix] Handle pre-release tags in nightly wheel version parsing (sgl-project#21656)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Intel GPU] Enable DeepSeek R1 inference on XPU (sgl-project#18461)

Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>

* [Doc] Update tips for developer new-comers (sgl-project#21659)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [CI] [FlashInfer v0.6.7] Use offline quantized checkpoint for MXFP8 Gemm tests (sgl-project#21625)

* MFU metrics in Prometheus  (sgl-project#19395)

* fix topk softmax performance issue (sgl-project#14702)

* [CPU] add kernel apply_rotary_pos_emb_cpu for Qwen3-VL and Qwen3-Omni (sgl-project#13121)

Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>

* [CPU] Implement MXFP4 Gemm kernels for intel AMX to support GPT OSS series. (sgl-project#14385)

* [AMD] Fused rope kv store (sgl-project#21315)

Co-authored-by: wunhuang <wunhuang@amd.com>

* [NPU] Update DeepSeek-V3.2 model deployment instructions in documentation (sgl-project#21468)

Co-authored-by: wuxue (C) <w00964934@china.huawei.com>

* [AMD] Support AMD MXFP4 Qwen3.5-397B-A17B model (sgl-project#21234)

* [Fix] Fix weight_loader property assignment for qwen3-next FP8 models (sgl-project#21662)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix mamba cache leak when adder fails to add a matched req. (sgl-project#21404)

* fix: Mistral Small 4 fails to start due to config/weight format mismatch (sgl-project#21620)

Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [diffusion] feat: enhance overlay mechanism (sgl-project#21648)

* [diffusion] CI: relax pr-test threshold (sgl-project#21682)

* [NPU][Diffusion] fix sp modulate for qwen-image-edit (sgl-project#20974)

Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>

* [NPU] fix eagle3 accept rate (sgl-project#21255)

* DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication (sgl-project#14162)

Co-authored-by: undefined <zhouchen.arrebol@jd.com>

* [NPU] GLM-5 optimize with fused kernels (sgl-project#18617)

* [NPU][diffusion]: support parallel decoding of qwen-image (sgl-project#20757)

Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>

* [diffusion] [NPU] support ring attention on NPU with FA (sgl-project#21383)

* [diffusion][doc]: add ring sp performance benchmark page (sgl-project#20998)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [GLM-V and GLM-4.7] Cast to FP32 before gate projection for GLM model. (sgl-project#21660)

* fix nemotron capture for non attention layers (sgl-project#21436)

* [Bugfix][NPU] Skip FRACTAL_NZ format for MoE weights with unaligned dimensions (sgl-project#21209)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>

* [AMD] Add SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS env var for configurable KV transfer overlap (sgl-project#20410)

Co-authored-by: HaiShaw <hixiao@gmail.com>

* [AMD][MoRI] bump MoRI to v0.1.0 (sgl-project#21673)

* [AMD] fix performance regression issue when run gpt-oss with "--context-length 13824" (sgl-project#21691)

* Remove flashinfer wheel cache cleanup that deletes other versions (sgl-project#21711)

Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>

* [misc] multiprocess compilation to speed up test (sgl-project#21483)

* Fix human-eval CI install on 5090 runners (sgl-project#21714)

Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>

* Revert "DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication" (sgl-project#21719)

* [Fix] Update supported custom_mem_pool types for mooncake (sgl-project#21728)

Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>

* [Perf]Remove H2D  for Qwen3.5 SpecV2 (sgl-project#20864)

* [AMD] Fix CI multimodal-gen-test-1-gpu-amd for gen model  (sgl-project#21621)

* [diffusion] fix: fix Flux.2 with tp(sgl-project#21664)

* Add explicit disable flag for FlashInfer allreduce fusion (sgl-project#21446)

* [NPU] fix conflict between empty_cache and use_mem_pool (sgl-project#21507)

* [AMD] Use tgemm.mm for MoEGate router gemm in deepseek_v2.py (sgl-project#21657)

* [CI]Remove msgm-en and mmlu tests which cause timeout (sgl-project#21733)

* Fix disaggregation hybrid attention ci (sgl-project#21745)

* Rename rerun-ut to rerun-test (sgl-project#21747)

* bugfix(model):fix deepstack index out of range error (sgl-project#21727)

Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com>

* [diffusion] fix: fix typo (sgl-project#21746)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* [CI] Fix rerun-test suite detection to skip commented registrations (sgl-project#21753)

* [PD] Refactor Disagg Conn and Fix Hang with total_request/total_tokens Balancing (sgl-project#21299)

Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal>

* [CI] Fix ring test timeout (sgl-project#21751)

* Enable evict swa with piecewise cuda graph (sgl-project#21754)

* Fix kimi-linear launch server error (sgl-project#21752)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [PD] Tiny cleanup after KVReceiver refactor (sgl-project#21760)

Signed-off-by: Shangming Cai <csmthu@gmail.com>

* Fix remote weight info nnode>1 and dp>1 (sgl-project#17389)

* [diffusion] UX: replace deprecated ORJSONResponse with orjson_response (sgl-project#21755)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [diffusion] fix: fix Wan2.2-I2V-A14B video max size issue(sgl-project#21390)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* [HiMambaTree]: Optimize mamba host lock mechanism (sgl-project#21750)

* [AMD] Fix Handle missing rope_theta in get_rope_config for Grok-1 (sgl-project#21518)

* [bugfix] Fix rope theta config for MiniMax after transformers v5 update (sgl-project#21241)

* Fix ineffective is_base_mistral CI patch for HF API rate limiting (sgl-project#21729)

* [2/n] lora - Shared outer experts and support qwen3_30b_a3b_instruct (sgl-project#21466)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* Fix cuda graph max bs capture upper bound (sgl-project#21005)

* [Fix] Fall back to triton MOE for GPT-OSS on Blackwell with driver >= 595 (sgl-project#21780)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cache nvidia wheels locally to skip repeated 830 MB downloads in CI (sgl-project#21778)

* Add Trivy vulnerability scanning to nightly dev Docker builds (sgl-project#21772)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Remove more redundant PCG tests (sgl-project#21554)

* [moe] add customized option to moe-a2a-backend (sgl-project#21786)

* Add CompletionSampler for non-chat eval in run_eval (sgl-project#21785)

* Remove redundant test_moe_eval_accuracy_large (sgl-project#21787)

* Increase hicache eval to 200 examples (sgl-project#21791)

* Switch MooncakeSpec to EAGLE3 + Llama-3.1 (sgl-project#21794)

* Reduce redundant speculative decoding CI tests (sgl-project#21779)

* Fix killall.py crash when sglang is not yet installed (sgl-project#21797)

* Remove obsolete sgl-kernel legacy paths (sgl-project#21528)

* [jit_kernel] Optimize fused_qknorm_rope: deduplicate sincosf for interleave RoPE  (sgl-project#21654)

* CUTLASS NVFP4 GEMM improvement of SM120 (sgl-project#21314)

* [gRPC] Preserve original ImportError in grpc_server.py (sgl-project#21801)

Signed-off-by: Chang Su <chang.s.su@oracle.com>

* [Misc] Tiny: Add test network timeouts and dynamic max-parallel for 5090/2-gpu runners (sgl-project#21800)

* Fix draft extend cuda graph when spec_step=1 (sgl-project#21709)

* [Diffusion] Add `--uvicorn-access-log-exclude-prefixes` to suppress noisy access logs (sgl-project#20379)

* Add latency and throughput metrics to run_eval (sgl-project#21793)

* [diffusion] CI: improve ci reliability (sgl-project#21763)

* [bugfix] GLM-4V model (sgl-project#17122)

* Fix CVEs in Docker image: pillow, linux-libc-dev, and broken sgl-model-gateway build (sgl-project#21789)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: only showing recent runners from ci failure analysis (sgl-project#21015)

* [MPS] Fix Triton stub sub-module imports on Python 3.12+ (sgl-project#21551)

Co-authored-by: karanb192 <karan@example.com>
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>

* [KDA] Fuse scaled_dot_kkt + solve_tril + recompute_w_u for KDA (sgl-project#21604)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* chore: bump flashinfer version to 0.6.7 (sgl-project#21422)

Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* [3/n] lora moe - Support Qwen3-VL-30B-A3B-Instruct  (sgl-project#21469)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* [Feature Restoration] repetition_penalty is essential for GLM-V models (sgl-project#21258)

Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

* VLM: change default mm-attention backend from triton_attn to fa4 (on blackwell) (sgl-project#21595)

* Fix added tokens config with sensible filter (sgl-project#17905)

* [AMD] Optimize Qwen3-VL decode - fuse QK-norm + 3D mRoPE + KV cache write (sgl-project#21458)

Co-authored-by: Bingxu Chen <bingxche@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>

* [Bugfix] Fix PP tied embeddings weight loading for qwen3.5 4B dense model (sgl-project#21347)

* [CI] Fix lint that was not applied in sgl-project#21458 (sgl-project#21818)

* Bug fix for llama eagle3 (sgl-project#21397)

* glm_interleave for GLM-V (sgl-project#21671)

* style refinement for hisparse (sgl-project#21198)

* [Bug][VLM] Fix shared memory race condition in ShmPointerMMData broadcast for multi-GPU VLM serving (sgl-project#21655)

* [Bugfix] Fix effective_mamba_size over-allocation (sgl-project#20858)

Co-authored-by: Shangming Cai <csmthu@gmail.com>

* Fix in-place mode in pause generation (sgl-project#21705)

* [diffusion] fix: respect --prompt-path (sgl-project#21756)

* [NPU] update ascend docs (sgl-project#21807)

* [VLM] remove AsyncMMDataProcessor wrapper (sgl-project#21651)

* Use CustomTestCase for TestSessionControl to enable CI retry (sgl-project#21830)

* [NPU] Add a full test pipeline on NPU, resolve issues in the NPU test architecture (sgl-project#20751)

* [diffusion][CI]: Add individual component accuracy CI for diffusion models (sgl-project#18709)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* [Feature] JIT rmsnorm update (with claude) (sgl-project#21834)

* [Diffusion][NPU] add ring sp performance benchmark page in npu (sgl-project#21811)

* fix(MiMo-V2-Flash): add mimo reasoning parser (sgl-project#21414)

* [diffusion] hardware: support FA3 attention backend on MUSA (attn backend, 14/N) (sgl-project#18648)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* fix: pre-init tokenizer_manager to avoid AttributeError in shutdown (sgl-project#21824)

* [FlashInfer v0.6.7] Integrate flashinfer_trtllm mxfp8 gemm (sgl-project#21576)

* [Misc] Add network timeout to eval dataset downloads (sgl-project#21873)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [refactor] Clean up duplicate flashinfer trtllm moe code (sgl-project#21233)

* [DSA] Support trtllm sparse mla kernel for prefill batches  (sgl-project#21783)

* [Disagg] GPU staging buffer with dynamic ring allocator for heterogeneous TP KV transfer (sgl-project#19890)

* Add merge prohibition policy during CI maintenance mode (sgl-project#21882)

* [Misc] Fix comparator e2e tests: add polars dep + fix dp-attention test (sgl-project#21804)

Co-authored-by: Alison Shao <alison.shao@mac.lan>

* revert: remove TTL-based hard pin from HiRadixCache (sgl-project#21884)

* Unify GSM8K eval path to Chat API for regression CI readiness (sgl-project#21667)

* [HiCache] fix: Clone host indices to avoid memory leak (sgl-project#21624)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* [HiCache & PD] Fix detailed cache hit breakdown in PD scenarios (sgl-project#21764)

* [CI] Add Llama 3.1 8B Instruct FP4 CI test on SM120 (sgl-project#20648)

* [CI] Add Per-Tensor, Blockwise FP8 Tests on SM120 (sgl-project#20717)

Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* Allow /rerun-test to checkout fork PR branch for trusted users (sgl-project#21890)

* Direct model loading from object storage with Runai Model Streamer (sgl-project#17948)

Signed-off-by: Noa Neria <noa@run.ai>

* fix pcg torch dynamo recompile in mxfp8 Triton path (sgl-project#21888)

Co-authored-by: Hanlin Bi <hanlinbi@umich.edu>

* chore: bump mooncake version to 0.3.10.post1 (sgl-project#21844)

* [VLM] Add VLM TP=4 per-commit CI test and improve MMMU eval prompt/parser (sgl-project#21841)

* fix(ci): update est_time for 57 tests based on runtime analysis (sgl-project#21896)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [CI] Increase multimodal server test timeout from 60 to 90 minutes (sgl-project#21897)

* [CI] Remove crashing Kimi K2.5 EAGLE3/MTP variants, keep TP8 and TP8+DP8 (sgl-project#21898)

* [diffusion] CI: add initial nvfp4 ci test for b200 (sgl-project#21767)

Co-authored-by: Mick <mickjagger19@icloud.com>

* Migrate all callers from /get_server_info to /server_info (sgl-project#21463)

* Support PP key for file backend (sgl-project#21901)

* Enable multi-thread weight loading by default (sgl-project#20289)

* Skip Go stdlib and NVIDIA tool CVEs in Trivy scan (sgl-project#21905)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Kernel] Fuse temperature + softmax in sampling for decode speedup (sgl-project#20501)

* Multi tool streaming fix (sgl-project#20004)

* Return HTTP 400 for streaming validation errors (sgl-project#21900)

* [Spec][Ngram] 4/N: Remove `max_match_window_size` and `min_match_window_size`, matching all suffixes of the Trie (sgl-project#21225)

* Fix ngram doc for speculative_num_draft_tokens default (sgl-project#21910)

* [NVIDIA] Enable fp8 flashinfer_trtllm_routed MoE for MiniMax-M2.5 (sgl-project#20394)

* scheduler: add prefill-only update in merge batch (sgl-project#21840)

* [DSA] Set trtllm kernels as nsa default for Blackwell (sgl-project#21914)

* Revert "Rollback flashmla to older version [1/2]" (sgl-project#21922)

* test: add manual init test for mooncake transfer engine (sgl-project#21842)

Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com>

* Fix spec v2 + logprob when max_num_token is set (sgl-project#20799)

* Migrate ngram corpus from torch cpp_extension to TVM FFI jit_kernel (sgl-project#21920)

Co-authored-by: DarkSharpness <2040703891@qq.com>

* [NPU] Support GLM-4.7-Flash on NPU (sgl-project#21408)

* [CI] Fix gpu deps import in cpu test (sgl-project#21950)

* [Parallel State Refactor 1/n] Remove stream of PyNCCL (sgl-project#20866)

* [diffusion] chore: fix stage profiler for multi-stage denoising (sgl-project#21955)

* [CI] [Tracing] Add ci for tracing and fix bugs (sgl-project#21740)

* Remove logging for subprocess watchdog start (sgl-project#21968)

* [4/n] Support gpt oss 20b lora (sgl-project#21570)

* [MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) (sgl-project#17985)

Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>

* [Feature] Stronger transformers modeling backend with TP, PP, MoE, VLMs, and torch compile (sgl-project#19163)

* [CI] Remove stale Ascend suite entries from test/srt/run_suite.py (sgl-project#21978)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Skip broken AutoModel mapping entries when resolving Llava submodules (sgl-project#21892)

* [CI] Add timeouts to Slack upload urlopen and WebClient (sgl-project#21903)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [Diffusion][NPU] Add support for MOVA (sgl-project#21633)

Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com>

* Remove maxItems=1 restriction when tool_choice is specified (sgl-project#20208)

* [Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+) (sgl-project#19652)

* [PP] qwen3 vl skip layer id for pp (sgl-project#19135)

* [VLM] Enable per-image MM splitting by default and remove MULTI_IMAGES modality (sgl-project#21899)

* [Bugfix] Fix incorrect dp-attention parallel info in bench_one_batch (sgl-project#21519)

* Revert "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" (sgl-project#22002)

* [NPU] Optimized the wording in the npu docs (sgl-project#21998)

* [Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (sgl-project#20871)

* [AMD] Resolve the performance degradation when launching the server with "--enable-aiter-allreduce-fusion" (sgl-project#21947)

Co-authored-by: wunhuang <wunhuang@amd.com>

* chore: bump sgl-kernel version to 0.4.1 (sgl-project#21447)

Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>

* [Workflow] Avoid triggering nightly tests in kernel bump workflow (sgl-project#22010)

* [Workflow] Fix kernel release jobs skipped on push events (sgl-project#22011)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. (sgl-project#21591)

Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* [Misc] Update CI permission (sgl-project#22014)

* [ROCM][RL] Shuffle Weight In-Place to Preserve Parameter Attributes (sgl-project#21825)

* [CI] Fix duplicate job names that bypass branch protection (sgl-project#22001)

* fix: remove duplicate words in comments (sgl-project#22007)

* [PD] Tiny register info field cleanup for mooncake backend (sgl-project#22016)

* [NPU] optimize glm4.7 (sgl-project#19246)

* [AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend (sgl-project#21511)

* [AMD] Add MiniMax-M2.5 nightly perf benchmarks for MI30x and MI35x (sgl-project#21524)

---------

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Noa Neria <noa@run.ai>
Co-authored-by: Bingxu Chen <bingxche@amd.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: yang1002378395-cmyk <yang1002378395@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Bi Xue <bi@thinkingmachines.ai>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Muqi Li <muqi1029@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: narutolhy <582909902@qq.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: zhangxiaolei <zhangxiaolei.666@bytedance.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Fengyuan Yu <Yuandao151112@163.com>
Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com>
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jacob0226 <jacchang@amd.com>
Co-authored-by: Aditya Sharma <89210949+adityavaid@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Артем Савкин <58187114+OrangeRedeng@users.noreply.github.com>
Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
Co-authored-by: Avery Huang <averyh@nvidia.com>
Co-authored-by: jacky.cheng <yichiche@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com>
Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com>
Co-authored-by: psaab <ps@meta.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Hanlin Bi <52993433+wolfcomos@users.noreply.github.com>
Co-authored-by: wili <98001977+wili-65535@users.noreply.github.com>
Co-authored-by: saatwiknagpal <saatwiknagpal@gmail.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: wan4ch <wan4ch@gmail.com>
Co-authored-by: Feng Su <sufeng@linux.alibaba.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Co-authored-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Aishwarya Ramasethu <56765596+aramasethu@users.noreply.github.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
Co-authored-by: blzheng <beilei.zheng@intel.com>
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Michelle Wu <michellewu351@gmail.com>
Co-authored-by: wuxue (C) <w00964934@china.huawei.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: LiYomi <106872109+LiYomi@users.noreply.github.com>
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: GXIN <37653830+gxxx-hum@users.noreply.github.com>
Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>
Co-authored-by: heziiop <q_m_p@qq.com>
Co-authored-by: xieminghe1 <141820649+xieminghe1@users.noreply.github.com>
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
Co-authored-by: Makcum888e <79456407+Makcum888e@users.noreply.github.com>
Co-authored-by: yuefeng Wu <33725817+ChefWu551@users.noreply.github.com>
Co-authored-by: Yuxuan Zhang <2448370773@qq.com>
Co-authored-by: Vedant V Jhaveri <vedantjh2@gmail.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: Zhai Feiyue <80079571+ZhaiFeiyue@users.noreply.github.com>
Co-authored-by: jhchouuu <jiahzhou@amd.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
Co-authored-by: Lewis <63569348+TTThanos@users.noreply.github.com>
Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
Co-authored-by: Jincong Chen <jincong.cjc@ant-intl.com>
Co-authored-by: xiazhahe <86939755+xiazhahe@users.noreply.github.com>
Co-authored-by: Thomas Wang <thomawan@amd.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: xiaoqi <xq25478@qq.com>
Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
Co-authored-by: weireweire <weiliangl@nvidia.com>
Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal>
Co-authored-by: JD <jaedon.guo@gmail.com>
Co-authored-by: Zhangheng <hzh0425@apache.org>
Co-authored-by: Michael <13900043+michaelzhang-ai@users.noreply.github.com>
Co-authored-by: Yilong Zhao <74357408+happierpig@users.noreply.github.com>
Co-authored-by: Johnsonms <lizhaofu@gmail.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: KnightLTC <56717110+KnightLTC@users.noreply.github.com>
Co-authored-by: Douglas Yang <dyang@college.harvard.edu>
Co-authored-by: Karan Bansal <karanb192@users.noreply.github.com>
Co-authored-by: karanb192 <karan@example.com>
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: sbeurnier <sbeurnier@together.ai>
Co-authored-by: YC Yen-Ching Tseng <yctseng@amd.com>
Co-authored-by: Wenyao Gao <105094497+edwingao28@users.noreply.github.com>
Co-authored-by: Alex Nails <alex.nails@radixark.ai>
Co-authored-by: khalilzhk <khalilzhk@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: yunkchen <chenyunkuo.cyk@alibaba-inc.com>
Co-authored-by: wduan-hai <wduan@humansand.ai>
Co-authored-by: amote-i <49533125+amote-i@users.noreply.github.com>
Co-authored-by: Cherry_ming <136634645@qq.com>
Co-authored-by: Ratish P <114130421+Ratish1@users.noreply.github.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Alison Shao <alison.shao@mac.lan>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Derek Yu <81697272+DerekY2@users.noreply.github.com>
Co-authored-by: Noa Neria <noa@run.ai>
Co-authored-by: Hanlin Bi <hanlinbi@umich.edu>
Co-authored-by: Prozac614 <dwt614707404@163.com>
Co-authored-by: David Cheung <d7cheung@gmail.com>
Co-authored-by: Mook <68294499+Godmook@users.noreply.github.com>
Co-authored-by: Khoa Pham <khoa.pham@radixark.ai>
Co-authored-by: foraxe <73625538+foraxe@users.noreply.github.com>
Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com>
Co-authored-by: ori <39351881+froststeam@users.noreply.github.com>
Co-authored-by: Thomas <zs033@qq.com>
Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com>
Co-authored-by: lviy <142899752+lviy@users.noreply.github.com>
Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-authored-by: Ricardo-M-L <69202550+Ricardo-M-L@users.noreply.github.com>
Co-authored-by: Kelon <kelonlu@163.com>
Co-authored-by: cen121212 <luochen23@huawei.com>