
support qwen3-next-fp8 deepep#10622

Merged
zhyncs merged 1 commit into main from support-qwen3-next-deepep
Sep 18, 2025
Conversation

@yizhang2077
Collaborator

@yizhang2077 yizhang2077 commented Sep 18, 2025

Motivation

#10624 needs to be merged first.

Modifications

Accuracy Tests

python3 -m sglang.launch_server --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8/  --tp 4 --dp 2 --enable-dp-attention --enable-deepep-moe --cuda-graph-max-bs 128

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1000 
Accuracy: 0.942
Invalid: 0.000
Latency: 176.143 s
Output throughput: 943.616 token/s

python3 -m sglang.launch_server --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8/  --tp 8 --dp 8 --enable-dp-attention --enable-deepep-moe --cuda-graph-max-bs 128

Accuracy: 0.940
Invalid: 0.000
Latency: 145.734 s
Output throughput: 1140.712 token/s

Benchmarking and Profiling

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for Deepseek Expert Parallelism (DeepEP) within the Qwen3-Next-FP8 model architecture. The changes enable efficient distributed execution of Mixture-of-Experts (MoE) layers by integrating DeepEP-specific logic for expert management, load balancing, and tensor parallelism, aiming to improve performance and scalability for large language models.

Highlights

  • DeepEP Integration: Implemented DeepEP (Deepseek Expert Parallelism) support for Qwen3-Next-FP8 models, enabling specialized handling of Mixture-of-Experts (MoE) layers for improved distributed inference.
  • Expert Parallelism Configuration: Enhanced MoE layer initialization to incorporate redundant experts and pass tensor parallelism configurations to sub-layers, optimizing distributed execution and resource utilization.
  • Expert Weight Management: Introduced a mechanism to lazily retrieve and manage expert weights within the Qwen3-Next model, facilitating DeepEP's operational requirements and dynamic expert routing.
  • Expert Distribution Tracking: Integrated global expert distribution recording to monitor and potentially optimize expert allocation across layers, which is crucial for load balancing in DeepEP.
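The "lazily retrieve and manage expert weights" item above can be understood as a cache-on-first-access pattern: building the expert-weight view is deferred until a consumer (e.g. a load balancer) first asks for it. The sketch below is illustrative only; the class and function names are assumptions, not sglang's actual internals.

```python
# Hedged sketch of lazy expert-weight retrieval: defer the (potentially
# expensive) gathering of per-layer expert weights until first use.
# All names here are illustrative assumptions, not sglang's real API.
class LazyExpertWeights:
    def __init__(self, gather_fn):
        self._gather_fn = gather_fn  # e.g. walks model layers, collects weights
        self._cached = None

    def get(self):
        # Built at most once, on the first request; later calls hit the cache.
        if self._cached is None:
            self._cached = self._gather_fn()
        return self._cached


calls = []

def fake_gather():
    calls.append(1)
    return {"layer0.expert0": [0.1, 0.2]}

weights = LazyExpertWeights(fake_gather)
weights.get()
weights.get()
print(len(calls))  # → 1: the gather function ran only once
```

The same pattern keeps model startup cheap while still letting EPLB-style load balancing inspect expert weights on demand.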

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for qwen3-next-fp8 with DeepEP. The changes are mainly in qwen2_moe.py and qwen3_next.py. In qwen2_moe.py, a new forward path for DeepEP is added, along with configurations for redundant experts and shared experts. A method to retrieve expert weights is also included. In qwen3_next.py, this new method is used to expose expert weights for load balancing, and expert distribution recording is integrated. The changes are logical and well-implemented. I have one minor suggestion to remove some redundant code to improve clarity.

Comment on lines +191 to +197
if get_moe_a2a_backend().is_deepep():
    # TODO: we will support tp < ep in the future
    self.ep_size = get_moe_expert_parallel_world_size()
    self.num_experts = (
        config.num_experts + global_server_args_dict["ep_num_redundant_experts"]
    )
    self.top_k = config.num_experts_per_tok
Contributor


Severity: medium

This block of code appears to be redundant. The attributes self.ep_size, self.num_experts, and self.top_k are assigned but are not used within the class. The values for num_experts and top_k were already used during the initialization of self.experts and self.topk respectively. If this code is for future use as hinted by the TODO, it should be commented out. Otherwise, it can be removed to improve code clarity.
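As a sketch of the suggested cleanup (the class and config names below are hypothetical stand-ins for the snippet above, not the real sglang code), values consumed only during initialization can stay local instead of being stored on self:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the model config in the snippet above.
@dataclass
class MoEConfig:
    num_experts: int
    num_experts_per_tok: int

class SparseMoeBlock:
    def __init__(self, config: MoEConfig, ep_num_redundant_experts: int):
        # Per the review: these values are consumed only here, so keep them
        # local rather than assigning unused self.num_experts / self.top_k.
        num_experts = config.num_experts + ep_num_redundant_experts
        top_k = config.num_experts_per_tok
        self.experts = [f"expert_{i}" for i in range(num_experts)]  # placeholder modules
        self.router_top_k = top_k  # stands in for the TopK router init

block = SparseMoeBlock(MoEConfig(num_experts=512, num_experts_per_tok=10), 0)
print(len(block.experts), block.router_top_k)  # → 512 10
```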

@zhyncs zhyncs merged commit 1344ebc into main Sep 18, 2025
72 of 77 checks passed
@zhyncs zhyncs deleted the support-qwen3-next-deepep branch September 18, 2025 18:36
chenxu140 added a commit to ping1jing2/sglang that referenced this pull request Sep 20, 2025
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
lifuhuang pushed a commit that referenced this pull request Sep 20, 2025
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
@lmu97

lmu97 commented Oct 15, 2025

Hi @yizhang2077, I wonder whether the Triton MoE kernel tuning script needs to change. When I launch the qwen3-next model with

python3 -m sglang.launch_server --model-path /opt/nim/workspace/ --chunked-prefill-size 16384 --max-prefill-tokens 16384 --max-running-requests 1024 --tp-size 4 --ep-size 4 --mem-fraction-static 0.8 --random-seed 0 --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 --log-level info --host 0.0.0.0 --port 8001 --random-seed 0 --moe-runner-backend auto --attention-backend flashinfer --mamba-ssm-dtype bfloat16

it says:
Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=512,device_name=NVIDIA_H200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton

I wonder how "E=128" is calculated (it seems to be num_experts / ep_size). In the tuning script, E is always 512, taken from config.num_experts (https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob/main/config.json#L26).

Does the tuning script need to change?
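For what it's worth, the E in the config filename appears to be the per-rank expert count under expert parallelism. A minimal sketch of that arithmetic (the helper name is mine, not sglang's):

```python
def experts_per_rank(num_experts: int, ep_size: int, num_redundant: int = 0) -> int:
    """Experts hosted on each EP rank: total routed experts (plus any
    redundant ones) divided evenly across the expert-parallel group."""
    total = num_experts + num_redundant
    assert total % ep_size == 0, "experts must divide evenly across EP ranks"
    return total // ep_size

# Qwen3-Next has config.num_experts = 512; with --ep-size 4 each rank
# holds 128 experts, matching E=128 in the missing config filename.
print(experts_per_rank(512, 4))  # → 128
```

If that reading is right, a tuning run for this deployment would need to target the per-rank expert count rather than the raw config.num_experts.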

@lmu97

lmu97 commented Oct 15, 2025

Plus: will the log below cause a performance issue?
[image]

@vincentzed
Contributor

vincentzed commented Dec 27, 2025

@yizhang2077 On main: this setup produces gibberish output. Does this setup not work on Blackwell? Before I file a bug report, I'm asking here first.

Launch server:

export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024

model-path: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
tp-size: 4
deepep-mode: auto
dp-size: 4
enable-dp-attention: true
moe-a2a-backend: deepep

Running in the SGLang docker image.
Env:
python -m sglang.check_env             
Python: 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.82.07
PyTorch: 2.9.1+cu129
sglang: 0.5.6.post2
sgl_kernel: 0.3.20
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: 0.5.3+cu129
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.124.4
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.5
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-239   0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-239   0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-239   0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-239   0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    0-239   0               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    0-239   0               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    0-239   0               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      0-239   0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576

Startup logs:

❯ ./server.sh
>>> Launching SGLang server from server.yaml
[2025-12-26 23:00:58] WARNING model_config.py:803: DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:00:58] INFO server_args.py:1375: Use triton as attention backend on sm100 for Qwen3NextForCausalLM
[2025-12-26 23:00:58] WARNING server_args.py:1408: Disabling overlap schedule since MambaRadixCache no_buffer is not compatible with overlap schedule currently, try to use --mamba-scheduler-strategy extra_buffer to enable overlap schedule
[2025-12-26 23:00:58] WARNING server_args.py:1788: DP attention is enabled. The chunked prefill size is adjusted to 4096 to avoid MoE kernel issues. 
[2025-12-26 23:00:58] WARNING server_args.py:1841: DeepEP MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size[4].
[2025-12-26 23:00:58] server_args=ServerArgs(model_path='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', tokenizer_path='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{"enable_multithread_load": true, "num_threads": 8}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization='fp8', quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.805, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=4096, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=0.3, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=4, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=387663723, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', 
crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=4, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='triton', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, 
speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=4, moe_a2a_backend='deepep', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, 
offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=True, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=16384, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192, 8448, 8704, 8960, 9216, 9472, 9728, 9984, 10240, 10496, 10752, 11008, 11264, 11520, 11776, 12032, 12288, 12544, 12800, 13056, 13312, 13568, 13824, 14080, 14336, 14592, 14848, 15104, 15360, 15616, 15872, 16128, 16384], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, 
enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, 
sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-26 23:01:09] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:10] Using default HuggingFace chat template with detected content format: string
[2025-12-26 23:01:13 DP0 TP0 EP0] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:14 DP0 TP0 EP0] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:14 DP0 TP0 EP0] Init torch distributed begin.
[2025-12-26 23:01:18 DP1 TP1 EP1] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:19 DP1 TP1 EP1] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:19 DP1 TP1 EP1] Init torch distributed begin.
[2025-12-26 23:01:24 DP2 TP2 EP2] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:24 DP2 TP2 EP2] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:24 DP2 TP2 EP2] Init torch distributed begin.
[2025-12-26 23:01:29 DP3 TP3 EP3] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:30 DP3 TP3 EP3] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2025-12-26 23:01:30 DP3 TP3 EP3] Init torch distributed begin.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-12-26 23:01:30 DP0 TP0 EP0] sglang is using nccl==2.27.5
[2025-12-26 23:01:32 DP0 TP0 EP0] [AR] Using sglang CustomAllreduce
[2025-12-26 23:01:32 DP1 TP1 EP1] [AR] Using sglang CustomAllreduce
[2025-12-26 23:01:32 DP3 TP3 EP3] [AR] Using sglang CustomAllreduce
[2025-12-26 23:01:32 DP2 TP2 EP2] [AR] Using sglang CustomAllreduce
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-26 23:01:33 DP0 TP0 EP0] Init torch distributed ends. mem usage=1.46 GB
[2025-12-26 23:01:33 DP3 TP3 EP3] Init torch distributed ends. mem usage=1.33 GB
[2025-12-26 23:01:33 DP2 TP2 EP2] Init torch distributed ends. mem usage=1.49 GB
[2025-12-26 23:01:33 DP1 TP1 EP1] Init torch distributed ends. mem usage=1.49 GB
[2025-12-26 23:01:33 DP0 TP0 EP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:33 DP2 TP2 EP2] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:33 DP1 TP1 EP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:33 DP3 TP3 EP3] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-26 23:01:34 DP0 TP0 EP0] Load weight begin. avail mem=176.18 GB
[2025-12-26 23:01:34 DP0 TP0 EP0] Detected fp8 checkpoint.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-12-26 23:01:34 DP1 TP1 EP1] Load weight begin. avail mem=176.15 GB
[2025-12-26 23:01:34 DP3 TP3 EP3] Load weight begin. avail mem=176.31 GB
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-12-26 23:01:34 DP2 TP2 EP2] Load weight begin. avail mem=176.15 GB
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-12-26 23:01:34 DP0 TP0 EP0] using attn output gate!
[2025-12-26 23:01:34 DP1 TP1 EP1] using attn output gate!
[2025-12-26 23:01:34 DP2 TP2 EP2] using attn output gate!
[2025-12-26 23:01:34 DP3 TP3 EP3] using attn output gate!
[2025-12-26 23:01:35 DP0 TP0 EP0] Found local HF snapshot for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 at /root/.cache/huggingface/hub/models--Qwen--Qwen3-Next-80B-A3B-Instruct-FP8/snapshots/c5f5f263bdd5cc134092897864e8905d8fe7b928; skipping download.
Multi-thread loading shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Multi-thread loading shards:  12% Completed | 1/8 [00:11<01:21, 11.62s/it]
Multi-thread loading shards:  25% Completed | 2/8 [00:13<00:34,  5.82s/it]
Multi-thread loading shards:  38% Completed | 3/8 [00:14<00:19,  3.89s/it]
Multi-thread loading shards:  50% Completed | 4/8 [00:16<00:12,  3.06s/it]
Multi-thread loading shards:  62% Completed | 5/8 [00:18<00:07,  2.57s/it]
Multi-thread loading shards:  75% Completed | 6/8 [00:20<00:04,  2.28s/it]
Multi-thread loading shards:  88% Completed | 7/8 [00:21<00:02,  2.10s/it]
Multi-thread loading shards: 100% Completed | 8/8 [00:23<00:00,  1.96s/it]
Multi-thread loading shards: 100% Completed | 8/8 [00:23<00:00,  2.94s/it]

[2025-12-26 23:02:00 DP1 TP1 EP1] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.43 GB, mem usage=20.72 GB.
[2025-12-26 23:02:00 DP0 TP0 EP0] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.46 GB, mem usage=20.72 GB.
[2025-12-26 23:02:01 DP2 TP2 EP2] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.43 GB, mem usage=20.72 GB.
[2025-12-26 23:02:03 DP3 TP3 EP3] Load weight end. type=Qwen3NextForCausalLM, dtype=torch.bfloat16, avail mem=155.59 GB, mem usage=20.72 GB.
[2025-12-26 23:02:03 DP0 TP0 EP0] Using KV cache dtype: torch.bfloat16
[2025-12-26 23:02:03 DP0 TP0 EP0] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP3 TP3 EP3] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP2 TP2 EP2] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP1 TP1 EP1] The available memory for KV cache is 63.77 GB.
[2025-12-26 23:02:03 DP3 TP3 EP3] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP0 TP0 EP0] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP2 TP2 EP2] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP1 TP1 EP1] Mamba Cache is allocated. max_mamba_cache_size: 797, conv_state size: 1.32GB, ssm_state size: 56.11GB 
[2025-12-26 23:02:03 DP2 TP2 EP2] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP2 TP2 EP2] Memory pool end. avail mem=34.12 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP1 TP1 EP1] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] Memory pool end. avail mem=34.15 GB
[2025-12-26 23:02:03 DP1 TP1 EP1] Memory pool end. avail mem=34.12 GB
[2025-12-26 23:02:03 DP3 TP3 EP3] KV Cache is allocated. #tokens: 2786325, K size: 31.89 GB, V size: 31.89 GB
[2025-12-26 23:02:03 DP3 TP3 EP3] Memory pool end. avail mem=34.27 GB
[2025-12-26 23:02:03 DP1 TP1 EP1] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP0 TP0 EP0] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP1 TP1 EP1] Capture cuda graph begin. This can take up to several minutes. avail mem=34.05 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] Capture cuda graph begin. This can take up to several minutes. avail mem=34.08 GB
[2025-12-26 23:02:03 DP0 TP0 EP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 159]
[2025-12-26 23:02:03 DP2 TP2 EP2] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP2 TP2 EP2] Capture cuda graph begin. This can take up to several minutes. avail mem=34.05 GB
[2025-12-26 23:02:03 DP3 TP3 EP3] Using hybrid linear attention backend for hybrid GDN models.
[2025-12-26 23:02:03 DP3 TP3 EP3] Capture cuda graph begin. This can take up to several minutes. avail mem=34.21 GB
Capturing batches (bs=159 avail_mem=33.68 GB):   0%|                                                                                                  | 0/24 [00:00<?, ?it/s]
[2025-12-26 23:02:04 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:04 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=12288, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:04 DP0 TP0 EP0] Required memory for warmup: 0.4296875GB, Available memory: 33.65899658203125GB
[2025-12-26 23:02:05 DP1 TP1 EP1] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
[2025-12-26 23:02:05 DP2 TP2 EP2] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
[2025-12-26 23:02:05 DP3 TP3 EP3] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10918.33it/s]
[2025-12-26 23:02:06 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:06 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=4096, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:06 DP0 TP0 EP0] Required memory for warmup: 0.1328125GB, Available memory: 33.63751220703125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10324.22it/s]
[2025-12-26 23:02:08 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:08 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=1024, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:08 DP0 TP0 EP0] Required memory for warmup: 0.064453125GB, Available memory: 33.63751220703125GB
DeepGEMM warmup: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 9778.82it/s]
[2025-12-26 23:02:09 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:09 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=512, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:09 DP0 TP0 EP0] Required memory for warmup: 0.0712890625GB, Available memory: 33.63555908203125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10187.07it/s]
[2025-12-26 23:02:11 DP0 TP0 EP0] Only use 20 SMs for DeepEP communication. This may result in highly suboptimal performance. Consider using --deepep-config to change the behavior.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
[2025-12-26 23:02:17 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:17 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_MASKED> N=1024, K=2048, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:17 DP0 TP0 EP0] Required memory for warmup: 8.250000476837158GB, Available memory: 16.91876220703125GB
DeepGEMM warmup: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 9184.02it/s]
[2025-12-26 23:02:19 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:19 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_MASKED> N=2048, K=512, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:19 DP0 TP0 EP0] Required memory for warmup: 9.125000476837158GB, Available memory: 15.89923095703125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 13029.39it/s]
[2025-12-26 23:02:20 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:20 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=9216, K=2048, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:20 DP0 TP0 EP0] Required memory for warmup: 0.330078125GB, Available memory: 18.91876220703125GB
DeepGEMM warmup: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:01<00:00, 10968.94it/s]
Capturing batches (bs=1 avail_mem=14.71 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:31<00:00,  1.33s/it]
[2025-12-26 23:02:36 DP0 TP0 EP0] Registering 24 cuda graph addresses
[2025-12-26 23:02:36 DP2 TP2 EP2] Capture cuda graph end. Time elapsed: 32.67 s. mem usage=19.53 GB. avail mem=14.52 GB.
[2025-12-26 23:02:36 DP0 TP0 EP0] Capture cuda graph end. Time elapsed: 32.68 s. mem usage=19.38 GB. avail mem=14.70 GB.
[2025-12-26 23:02:36 DP3 TP3 EP3] Capture cuda graph end. Time elapsed: 32.78 s. mem usage=18.75 GB. avail mem=15.46 GB.
[2025-12-26 23:02:36 DP1 TP1 EP1] Capture cuda graph end. Time elapsed: 32.79 s. mem usage=19.53 GB. avail mem=14.52 GB.
[2025-12-26 23:02:37 DP0 TP0 EP0] max_total_num_tokens=2786325, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=159, context_len=262144, available_gpu_mem=14.70 GB
[2025-12-26 23:02:38] INFO:     Started server process [406845]
[2025-12-26 23:02:38] INFO:     Waiting for application startup.
[2025-12-26 23:02:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-12-26 23:02:38] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-12-26 23:02:38] INFO:     Application startup complete.
[2025-12-26 23:02:38] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-12-26 23:02:39] INFO:     127.0.0.1:47852 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-26 23:02:39 DP1 TP1 EP1] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP0 TP0 EP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP3 TP3 EP3] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP2 TP2 EP2] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:39 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:39 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_CONTIG> N=1024, K=2048, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:39 DP0 TP0 EP0] Required memory for warmup: 0.31256103515625GB, Available memory: 14.69805908203125GB
DeepGEMM warmup: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 421.87it/s]
[2025-12-26 23:02:39 DP0 TP0 EP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-12-26 23:02:39 DP0 TP0 EP0] Try DeepGEMM JIT Compiling for <GROUPED_GEMM_NT_F8F8BF16_CONTIG> N=2048, K=512, num_groups=128 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
[2025-12-26 23:02:39 DP0 TP0 EP0] Required memory for warmup: 0.19537353515625GB, Available memory: 14.69415283203125GB
DeepGEMM warmup: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 606.33it/s]
[2025-12-26 23:02:40] INFO:     127.0.0.1:47862 - "POST /generate HTTP/1.1" 200 OK
[2025-12-26 23:02:40] The server is fired up and ready to roll!
[2025-12-26 23:02:53] INFO:     127.0.0.1:56170 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-12-26 23:02:53 DP0 TP0 EP0] Prefill batch, #new-seq: 1, #new-token: 171, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-26 23:02:53 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 204, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 0.40, #queue-req: 0, 
[2025-12-26 23:02:54 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 244, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.96, #queue-req: 0, 
[2025-12-26 23:02:55 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 284, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.55, #queue-req: 0, 
[2025-12-26 23:02:55 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 324, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 61.17, #queue-req: 0, 
[2025-12-26 23:02:56 DP0 TP0 EP0] Decode batch, #running-req: 1, #full token: 364, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.43, #queue-req: 0, 
[2025-12-27 01:14:04] INFO:     127.0.0.1:57614 - "POST /generate HTTP/1.1" 200 OK
[2025-12-27 01:14:04 DP1 TP1 EP1] Prefill batch, #new-seq: 1, #new-token: 18, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-27 01:14:05 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 51, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 0.01, #queue-req: 0, 
[2025-12-27 01:14:06 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 91, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 60.89, #queue-req: 0, 
[2025-12-27 01:14:06 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 131, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 63.90, #queue-req: 0, 
[2025-12-27 01:14:07 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 171, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.01, #queue-req: 0, 
[2025-12-27 01:14:07 DP1 TP1 EP1] Decode batch, #running-req: 1, #full token: 211, full token usage: 0.00, mamba num: 2, mamba usage: 0.00, cuda graph: True, gen throughput (token/s): 64.64, #queue-req: 0, 
^CProcess Process-1:1:
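
The log above repeatedly suggests pre-compiling DeepGEMM kernels and warns that DeepEP communication is using only 20 SMs. A sketch of those two follow-ups, assuming a 4-GPU run matching the DP0-3/TP0-3 ranks in this log (the model path and flag combination mirror the PR description and are illustrative, not verified):

```shell
# Pre-compile DeepGEMM kernels with the same parallelism flags as the launch
# command, so the JIT warmup seen in the log above is skipped at server start.
python3 -m sglang.compile_deep_gemm \
    --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8 \
    --tp 4 --dp 4 --enable-dp-attention --enable-deepep-moe

# Then launch as before. Passing --deepep-config (as the warning suggests)
# lets you change the DeepEP SM allocation away from the 20-SM default.
python3 -m sglang.launch_server \
    --model Qwen/Qwen-Next-80B-A3B-Instruct-FP8 \
    --tp 4 --dp 4 --enable-dp-attention --enable-deepep-moe \
    --cuda-graph-max-bs 128
```

These commands require the target GPUs and model weights, so they are command fragments rather than something runnable here.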
