
support qwen3-next-fp8 deepep#10622

Merged
zhyncs merged 1 commit into main from support-qwen3-next-deepep
Sep 18, 2025
Conversation

@yizhang2077
Collaborator

@yizhang2077 yizhang2077 commented Sep 18, 2025

Motivation

#10624 needs to be merged first.

Modifications

Accuracy Tests

python3 -m sglang.launch_server --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8/  --tp 4 --dp 2 --enable-dp-attention --enable-deepep-moe --cuda-graph-max-bs 128

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1000 
Accuracy: 0.942
Invalid: 0.000
Latency: 176.143 s
Output throughput: 943.616 token/s

python3 -m sglang.launch_server --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8/  --tp 8 --dp 8 --enable-dp-attention --enable-deepep-moe --cuda-graph-max-bs 128

Accuracy: 0.940
Invalid: 0.000
Latency: 145.734 s
Output throughput: 1140.712 token/s

Benchmarking and Profiling

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for Deepseek Expert Parallelism (DeepEP) within the Qwen3-Next-FP8 model architecture. The changes enable efficient distributed execution of Mixture-of-Experts (MoE) layers by integrating DeepEP-specific logic for expert management, load balancing, and tensor parallelism, aiming to improve performance and scalability for large language models.

Highlights

  • DeepEP Integration: Implemented DeepEP (Deepseek Expert Parallelism) support for Qwen3-Next-FP8 models, enabling specialized handling of Mixture-of-Experts (MoE) layers for improved distributed inference.
  • Expert Parallelism Configuration: Enhanced MoE layer initialization to incorporate redundant experts and pass tensor parallelism configurations to sub-layers, optimizing distributed execution and resource utilization.
  • Expert Weight Management: Introduced a mechanism to lazily retrieve and manage expert weights within the Qwen3-Next model, facilitating DeepEP's operational requirements and dynamic expert routing.
  • Expert Distribution Tracking: Integrated global expert distribution recording to monitor and potentially optimize expert allocation across layers, which is crucial for load balancing in DeepEP.
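The "lazily retrieve and manage expert weights" item above can be understood as a cache-on-first-access pattern: building the expert-weight view is deferred until a consumer (e.g. a load balancer) first asks for it. The sketch below is illustrative only; the class and function names are assumptions, not sglang's actual internals.

```python
# Hedged sketch of lazy expert-weight retrieval: defer the (potentially
# expensive) gathering of per-layer expert weights until first use.
# All names here are illustrative assumptions, not sglang's real API.
class LazyExpertWeights:
    def __init__(self, gather_fn):
        self._gather_fn = gather_fn  # e.g. walks model layers, collects weights
        self._cached = None

    def get(self):
        # Built at most once, on the first request; later calls hit the cache.
        if self._cached is None:
            self._cached = self._gather_fn()
        return self._cached


calls = []

def fake_gather():
    calls.append(1)
    return {"layer0.expert0": [0.1, 0.2]}

weights = LazyExpertWeights(fake_gather)
weights.get()
weights.get()
print(len(calls))  # → 1: the gather function ran only once
```

The same pattern keeps model startup cheap while still letting EPLB-style load balancing inspect expert weights on demand.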

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for qwen3-next-fp8 with DeepEP. The changes are mainly in qwen2_moe.py and qwen3_next.py. In qwen2_moe.py, a new forward path for DeepEP is added, along with configurations for redundant experts and shared experts. A method to retrieve expert weights is also included. In qwen3_next.py, this new method is used to expose expert weights for load balancing, and expert distribution recording is integrated. The changes are logical and well-implemented. I have one minor suggestion to remove some redundant code to improve clarity.

Comment on lines +191 to +197
if get_moe_a2a_backend().is_deepep():
    # TODO: we will support tp < ep in the future
    self.ep_size = get_moe_expert_parallel_world_size()
    self.num_experts = (
        config.num_experts + global_server_args_dict["ep_num_redundant_experts"]
    )
    self.top_k = config.num_experts_per_tok
Contributor


Severity: medium

This block of code appears to be redundant. The attributes self.ep_size, self.num_experts, and self.top_k are assigned but are not used within the class. The values for num_experts and top_k were already used during the initialization of self.experts and self.topk respectively. If this code is for future use as hinted by the TODO, it should be commented out. Otherwise, it can be removed to improve code clarity.
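As a sketch of the suggested cleanup (the class and config names below are hypothetical stand-ins for the snippet above, not the real sglang code), values consumed only during initialization can stay local instead of being stored on self:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the model config in the snippet above.
@dataclass
class MoEConfig:
    num_experts: int
    num_experts_per_tok: int

class SparseMoeBlock:
    def __init__(self, config: MoEConfig, ep_num_redundant_experts: int):
        # Per the review: these values are consumed only here, so keep them
        # local rather than assigning unused self.num_experts / self.top_k.
        num_experts = config.num_experts + ep_num_redundant_experts
        top_k = config.num_experts_per_tok
        self.experts = [f"expert_{i}" for i in range(num_experts)]  # placeholder modules
        self.router_top_k = top_k  # stands in for the TopK router init

block = SparseMoeBlock(MoEConfig(num_experts=512, num_experts_per_tok=10), 0)
print(len(block.experts), block.router_top_k)  # → 512 10
```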

@zhyncs zhyncs merged commit 1344ebc into main Sep 18, 2025
72 of 77 checks passed
@zhyncs zhyncs deleted the support-qwen3-next-deepep branch September 18, 2025 18:36
chenxu140 added a commit to ping1jing2/sglang that referenced this pull request Sep 20, 2025
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
lifuhuang pushed a commit that referenced this pull request Sep 20, 2025
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
@lmu97

lmu97 commented Oct 15, 2025

Hi @yizhang2077, I wonder whether the Triton MoE kernel tuning script needs to change. When I launch the qwen3-next model with

python3 -m sglang.launch_server --model-path /opt/nim/workspace/ --chunked-prefill-size 16384 --max-prefill-tokens 16384 --max-running-requests 1024 --tp-size 4 --ep-size 4 --mem-fraction-static 0.8 --random-seed 0 --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 --log-level info --host 0.0.0.0 --port 8001 --random-seed 0 --moe-runner-backend auto --attention-backend flashinfer --mamba-ssm-dtype bfloat16

it says:
Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=512,device_name=NVIDIA_H200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton

I wonder how "E=128" is calculated (it seems to be num_experts / ep_size). In the tuning script, E is always 512, taken from config.num_experts (https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob/main/config.json#L26).

Does the tuning script need to change?
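For what it's worth, the E in the config filename appears to be the per-rank expert count under expert parallelism. A minimal sketch of that arithmetic (the helper name is mine, not sglang's):

```python
def experts_per_rank(num_experts: int, ep_size: int, num_redundant: int = 0) -> int:
    """Experts hosted on each EP rank: total routed experts (plus any
    redundant ones) divided evenly across the expert-parallel group."""
    total = num_experts + num_redundant
    assert total % ep_size == 0, "experts must divide evenly across EP ranks"
    return total // ep_size

# Qwen3-Next has config.num_experts = 512; with --ep-size 4 each rank
# holds 128 experts, matching E=128 in the missing config filename.
print(experts_per_rank(512, 4))  # → 128
```

If that reading is right, a tuning run for this deployment would need to target the per-rank expert count rather than the raw config.num_experts.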

@lmu97

lmu97 commented Oct 15, 2025

Plus: will the log below cause a performance issue?
[image]

@vincentzed
Contributor

vincentzed commented Dec 27, 2025

@yizhang2077 On main: this setup produces gibberish output. Does this setup not work on Blackwell? Before I file a bug report, I'm asking here first.

Launch server:

export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024

model-path: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
tp-size: 4
deepep-mode: auto
dp-size: 4
enable-dp-attention: true
moe-a2a-backend: deepep

Running in the SGLang docker image.
Env:
python -m sglang.check_env             
Python: 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.82.07
PyTorch: 2.9.1+cu129
sglang: 0.5.6.post2
sgl_kernel: 0.3.20
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: 0.5.3+cu129
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.124.4
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.5
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-239   0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-239   0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-239   0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-239   0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    0-239   0               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    0-239   0               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    0-239   0               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      0-239   0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576

Startup logs:

❯ ./server.sh
>>> Launching SGLang server from server.yaml
[2025-12-26 23:00:58] WARNING model_config.py:803: DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:00:58] INFO server_args.py:1375: Use triton as attention backend on sm100 for Qwen3NextForCausalLM
[2025-12-26 23:00:58] WARNING server_args.py:1408: Disabling overlap schedule since MambaRadixCache no_buffer is not compatible with overlap schedule currently, try to use --mamba-scheduler-strategy extra_buffer to enable overlap schedule
[2025-12-26 23:00:58] WARNING server_args.py:1788: DP attention is enabled. The chunked prefill size is adjusted to 4096 to avoid MoE kernel issues. 
[2025-12-26 23:00:58] WARNING server_args.py:1841: DeepEP MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size[4].
[2025-12-26 23:00:58] server_args=ServerArgs(model_path='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', tokenizer_path='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{"enable_multithread_load": true, "num_threads": 8}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization='fp8', quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.805, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=4096, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=0.3, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=4, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=387663723, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', 
crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=4, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='triton', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, 
speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=4, moe_a2a_backend='deepep', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, 
offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=True, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=16384, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192, 8448, 8704, 8960, 9216, 9472, 9728, 9984, 10240, 10496, 10752, 11008, 11264, 11520, 11776, 12032, 12288, 12544, 12800, 13056, 13312, 13568, 13824, 14080, 14336, 14592, 14848, 15104, 15360, 15616, 15872, 16128, 16384], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, 
enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, 
sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-26 23:01:09] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:10] Using default HuggingFace chat template with detected content format: string
[2025-12-26 23:01:13 DP0 TP0 EP0] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:14 DP0 TP0 EP0] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:14 DP0 TP0 EP0] Init torch distributed begin.
[2025-12-26 23:01:18 DP1 TP1 EP1] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:19 DP1 TP1 EP1] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:19 DP1 TP1 EP1] Init torch distributed begin.
[2025-12-26 23:01:24 DP2 TP2 EP2] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:24 DP2 TP2 EP2] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:24 DP2 TP2 EP2] Init torch distributed begin.
[2025-12-26 23:01:29 DP3 TP3 EP3] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:30 DP3 TP3 EP3] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:30 DP3 TP3 EP3] Init torch distributed begin.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-12-26 23:01:30 DP0 TP0 EP0] sglang is using nccl==2.27.5
[2025-12-26 23:01:32 DP0 TP0 EP0] [AR] Using sglang CustomAllreduce
[2025-12-26 23:01:32 DP1 TP1 EP1] [AR] Using sglang CustomAllreduce
[2025-12-26 23:01:32 DP3 TP3 EP3] [AR] Using sglang CustomAllreduce
[2025-12-26 23:01:32 DP2 TP2 EP2] [AR] Using sglang CustomAllreduce
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-26 23:01:33 DP0 TP0 EP0] Init torch distributed ends. mem usage=1.46 GB
[2025-12-26 23:01:33 DP3 TP3 EP3] Init torch distributed ends. mem usage=1.33 GB
[2025-12-26 23:01:33 DP2 TP2 EP2] Init torch distributed ends. mem usage=1.49 GB
[2025-12-26 23:01:33 DP1 TP1 EP1] Init torch distributed ends. mem usage=1.49 GB
[2025-12-26 23:01:33 DP0 TP0 EP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:33 DP2 TP2 EP2] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:33 DP1 TP1 EP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:33 DP3 TP3 EP3] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:34 DP0 TP0 EP0] Load weight begin. avail mem=176.18 GB
[2025-12-26 23:01:34 DP0 TP0 EP0] Detected fp8 checkpoint.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-12-26 23:01:34 DP1 TP1 EP1] Load weight begin. avail mem=176.15 GB
[2025-12-26 23:01:34 DP3 TP3 EP3] Load weight begin. avail mem=176.31 GB
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-12-26 23:01:34 DP2 TP2 EP2] Load weight begin. avail mem=176.15 GB
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-12-26 23:01:34 DP0 TP0 EP0] using attn output gate!
[2025-12-26 23:01:34 DP1 TP1 EP1] using attn output gate!
[2025-12-26 23:01:34 DP2 TP2 EP2] using attn output gate!
[2025-12-26 23:01:34 DP3 TP3 EP3] using attn output gate!
[2025-12-26 23:01:35 DP0 TP0 EP0] Found local HF snapshot for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 at /root/.cache/huggingface/hub/models--Qwen--Qwen3-Next-80B-A3B-Instruct-FP8/snapshots/c5f5f263bdd5cc134092897864e8905d8fe7b928; skipping download.
Multi-thread loading shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Multi-thread loading shards:  12% Completed | 1/8 [00:11<01:21, 11.62s/it]
Multi-thread loading shards:  25% Completed | 2/8 [00:13<00:34,  5.82s/it]
Multi-thread loading shards:  38% Completed | 3/8 [00:14<00:19,  3.89s/it]
Multi-thread loading shards:  50% Completed | 4/8 [00:16<00:12,  3.06s/it]
Multi-thread loading shards:  62% Completed | 5/8 [00:18<00:07,  2.57s/it]
Multi-thread loading shards:  75% Completed | 6/8 [00:20<00:04,  2.28s/it]
Multi-thread loading shards:  88% Completed | 7/8 [00:21<00:02,  2.10s/it]
Multi-thread loading shards: 100% Completed | 8/8 [00:23<00:00,  1.96s/it]
Multi-thread loading shards: 100% Completed | 8/8 [00:23<00:00,  2.94s/it]

[2025-12-26 23:02:00 DP1 TP1 EP1] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.43 GB, mem usage=20.72 GB.
[2025-12-26 23:02:00 DP0 TP0 EP0] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.46 GB, mem usage=20.72 GB.
[2025-12-26 23:02:01 DP2 TP2 EP2] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.43 GB, mem usage=20.72 GB.
[2025-12-26 23:02:03 DP3 TP3 EP3] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.59 GB, mem usage=20.72 GB.
[2025-12-26 23:02:03 DP0 TP0 EP0] Using KV cache dtype: torch.bfloat16
[2025-12-26 23:02:03 DP0 TP0 EP0] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP3 TP3 EP3] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP2 TP2 EP2] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP1 TP1 EP1] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP3 TP3 EP3] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP0 TP0 EP0] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP2 TP2 EP2] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP1 TP1 EP1] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP2 TP2 EP2] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP2 TP2 EP2] Memory pool end. avail mem=34.12 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP1 TP1 EP1] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] Memory pool end. avail mem=34.15 GB
[2025-12-26 23:02:03 DP1 TP1 EP1] Memory pool end. avail mem=34.12 GB
[2025-12-26 23:02:03 DP3 TP3 EP3] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP3 TP3 EP3] Memory pool end. avail mem=34.27 GB
[2025-12-26 23:02:03 DP1 TP1 EP1] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP0 TP0 EP0] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP1 TP1 EP1] Capture cuda graph begin. This can take up to several minutes. avail mem=34.05 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] Capture cuda graph begin. This can take up to several minutes. avail mem=34.08 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 159]
[2025-12-26 23:02:03 DP2 TP2 EP2] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP2 TP2 EP2] Capture cuda graph begin. This can take up to several minutes. avail mem=34.05 GB
[2025-12-26 23:02:03 DP3 TP3 EP3] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP3 TP3 EP3] Capture cuda graph begin. This can take up to several minutes. avail mem=34.21 GB
Capturing batches (bs=159 avail_mem=33.68 GB):   0%|                                                                                                  | 0/24 [00:00<?, ?it/s]
[2025-12-26 23:02:04 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:04 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=12288, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:04 DP0 TP0 EP0] Required memory for warmup: 0.4296875GB, Available memory: 33.65899658203125GB
[2025-12-26 23:02:05 DP1 TP1 EP1] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
[2025-12-26 23:02:05 DP2 TP2 EP2] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
[2025-12-26 23:02:05 DP3 TP3 EP3] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10918.33it/s]
[2025-12-26 23:02:06 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:06 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=4096, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:06 DP0 TP0 EP0] Required memory for warmup: 0.1328125GB, Available memory: 33.63751220703125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10324.22it/s]
[2025-12-26 23:02:08 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:08 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=1024, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:08 DP0 TP0 EP0] Required memory for warmup: 0.064453125GB, Available memory: 33.63751220703125GB
DeepGEMM warmup: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 9778.82it/s]
[2025-12-26 23:02:09 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:09 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=512, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:09 DP0 TP0 EP0] Required memory for warmup: 0.0712890625GB, Available memory: 33.63555908203125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10187.07it/s]
[2025-12-26 23:02:11 DP0 TP0 EP0] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
[2025-12-26 23:02:17 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:17 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_MASKED> N=1024, K=2048, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:17 DP0 TP0 EP0] Required memory for warmup: 8.250000476837158GB, Available memory: 16.91876220703125GB
DeepGEMM warmup: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 9184.02it/s]
[2025-12-26 23:02:19 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:19 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_MASKED> N=2048, K=512, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:19 DP0 TP0 EP0] Required memory for warmup: 9.125000476837158GB, Available memory: 15.89923095703125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 13029.39it/s]
[2025-12-26 23:02:20 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:20 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=9216, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:20 DP0 TP0 EP0] Required memory for warmup: 0.330078125GB, Available memory: 18.91876220703125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10968.94it/s]
Capturing batches (bs=1 avail_mem=14.71 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:31<00:00,  1.33s/it]
[2025-12-26 23:02:36 DP0 TP0 EP0] Registering 24 cuda graph addresses
[2025-12-26 23:02:36 DP2 TP2 EP2] Capture cuda graph end. Time elapsed: 32.67 s. mem usage=19.53 GB. avail mem=14.52 GB.
[2025-12-26 23:02:36 DP0 TP0 EP0] Capture cuda graph end. Time elapsed: 32.68 s. mem usage=19.38 GB. avail mem=14.70 GB.
[2025-12-26 23:02:36 DP3 TP3 EP3] Capture cuda graph end. Time elapsed: 32.78 s. mem usage=18.75 GB. avail mem=15.46 GB.
[2025-12-26 23:02:36 DP1 TP1 EP1] Capture cuda graph end. Time elapsed: 32.79 s. mem usage=19.53 GB. avail mem=14.52 GB.
[2025-12-26 23:02:37 DP0 TP0 EP0] max_total_num_tokens=2786325, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=159, context_len=262144, available_gpu_mem=14.70 GB
[2025-12-26 23:02:38] INFO:     Started server process [406845]
[2025-12-26 23:02:38] INFO:     Waiting for application startup.
[2025-12-26 23:02:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-12-26 23:02:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-12-26 23:02:38] INFO:     Application startup complete.
[2025-12-26 23:02:38] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-12-26 23:02:39] INFO:     127.0.0.1:47852 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-26 23:02:39 DP1 TP1 EP1] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP0 TP0 EP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP3 TP3 EP3] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP2 TP2 EP2] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:39 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_CONTIG> N=1024, K=2048, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:39 DP0 TP0 EP0] Required memory for warmup: 0.31256103515625GB, Available memory: 14.69805908203125GB
DeepGEMM warmup: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 421.87it/s]
[2025-12-26 23:02:39 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:39 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_CONTIG> N=2048, K=512, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:39 DP0 TP0 EP0] Required memory for warmup: 0.19537353515625GB, Available memory: 14.69415283203125GB
DeepGEMM warmup: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 606.33it/s]
[2025-12-26 23:02:40] INFO:     127.0.0.1:47862 - "POST /generate HTTP/1.1" 200 OK
[2025-12-26 23:02:40] The server is fired up and ready to roll!
[2025-12-26 23:02:53] INFO:     127.0.0.1:56170 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-12-26 23:02:53 DP0 TP0 EP0] Prefill batch, #new-seq: 1, #new-token: 171, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:53 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 204, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 0.40, #queue-req: 0, 
[2025-12-26 23:02:54 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 244, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.96, #queue-req: 0, 
[2025-12-26 23:02:55 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 284, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.55, #queue-req: 0, 
[2025-12-26 23:02:55 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 324, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 61.17, #queue-req: 0, 
[2025-12-26 23:02:56 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 364, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.43, #queue-req: 0, 
[2025-12-27 01:14:04] INFO:     127.0.0.1:57614 - "POST /generate HTTP/1.1" 200 OK
[2025-12-27 01:14:04 DP1 TP1 EP1] Prefill batch, #new-seq: 1, #new-token: 18, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-27 01:14:05 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 51, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 0.01, #queue-req: 0, 
[2025-12-27 01:14:06 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 91, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 60.89, #queue-req: 0, 
[2025-12-27 01:14:06 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 131, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 63.90, #queue-req: 0, 
[2025-12-27 01:14:07 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 171, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.01, #queue-req: 0, 
[2025-12-27 01:14:07 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 211, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.64, #queue-req: 0, 
^CProcess Process-1:1:
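
The log above repeatedly suggests pre-compiling DeepGEMM kernels and warns that DeepEP communication is using only 20 SMs. A sketch of those two follow-ups, assuming a 4-GPU run matching the DP0-3/TP0-3 ranks in this log (the model path and flag combination mirror the PR description and are illustrative, not verified):

```shell
# Pre-compile DeepGEMM kernels with the same parallelism flags as the launch
# command, so the JIT warmup seen in the log above is skipped at server start.
python3 -m sglang.compile_deep_gemm \
    --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8 \
    --tp 4 --dp 4 --enable-dp-attention --enable-deepep-moe

# Then launch as before. Passing --deepep-config (as the warning suggests)
# lets you change the DeepEP SM allocation away from the 20-SM default.
python3 -m sglang.launch_server \
    --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8 \
    --tp 4 --dp 4 --enable-dp-attention --enable-deepep-moe \
    --cuda-graph-max-bs 128
```

These commands require the target GPUs and model weights, so they are command fragments rather than something runnable here.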
