
[1/N] Support PassManager Framework and Fusion Pass #11830

Open
yuan-luo wants to merge 2 commits into sgl-project:main from antgroup:support_fusion_pass

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Oct 19, 2025

Motivation

A pass is a key component for code transformation, optimization, and analysis in compilers such as LLVM and TVM.

For example, LLVM introduces a Pass Manager as a core component of its compiler infrastructure. Its main goals are to:

  • orchestrate the execution of a sequence of passes over a specific unit of intermediate representation (IR), such as a module or a function.
  • pipeline the execution of passes for better performance, manage analysis results and their invalidation, and enforce a disciplined workflow for pass developers.

In SGLang, we introduce a similar Pass Manager framework to orchestrate the execution of passes. Furthermore, we add a fusion pass that fuses several complex operators; more passes, such as AsyncTPPass and SequenceParallelismPass, will be introduced in follow-up PRs.

With this PR, we no longer need to change model files to adopt certain fusion kernels; the compiler handles the underlying work whenever the pattern matches. We only need to add a pass once and benefit from the compiler optimization everywhere.

The Pass Manager framework and Fusion Pass are adapted from vLLM with significant SGLang-specific customization. We express our respect to the vLLM developers who worked on this area.
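The orchestration idea can be sketched in a few lines of plain Python (all names here, such as PassManager and fuse_allreduce_rmsnorm, are illustrative stand-ins operating on a toy op list, not the actual SGLang API):

```python
# Minimal sketch of the pass-manager pattern: each pass rewrites a
# "graph" (here just a list of op names standing in for a real IR),
# and the manager runs the registered passes in order.
from typing import Callable, List

Graph = List[str]
Pass = Callable[[Graph], Graph]

class PassManager:
    def __init__(self) -> None:
        self.passes: List[Pass] = []

    def add_pass(self, p: Pass) -> None:
        self.passes.append(p)

    def run(self, graph: Graph) -> Graph:
        # Orchestrate the passes in registration order.
        for p in self.passes:
            graph = p(graph)
        return graph

def fuse_allreduce_rmsnorm(graph: Graph) -> Graph:
    # Toy fusion pass: replace adjacent allreduce + rmsnorm ops
    # with a single fused op whenever the pattern matches.
    out: Graph = []
    i = 0
    while i < len(graph):
        if i + 1 < len(graph) and graph[i] == "allreduce" and graph[i + 1] == "rmsnorm":
            out.append("fused_allreduce_rmsnorm")
            i += 2
        else:
            out.append(graph[i])
            i += 1
    return out

pm = PassManager()
pm.add_pass(fuse_allreduce_rmsnorm)
print(pm.run(["matmul", "allreduce", "rmsnorm", "silu"]))
# → ['matmul', 'fused_allreduce_rmsnorm', 'silu']
```

Once a pass like this is registered, every graph that flows through the manager gets the rewrite, which is why model files no longer need per-kernel changes.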

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the SGLang compilation backend by introducing a new, unified configuration system (SGLangConfig) that centralizes model, device, and compilation settings. It also lays the groundwork for advanced graph optimizations by adding support for collective fusion passes, specifically targeting allreduce and RMSNorm operations using FlashInfer. The changes aim to improve the modularity, maintainability, and performance of the SGLang compilation pipeline, although some fusion passes are currently marked as work-in-progress.

Highlights

  • New SGLang Configuration System: Introduced a centralized SGLangConfig dataclass to manage model, device, and compilation settings, replacing the old CompilationConfig for a more unified approach.
  • Refactored Compilation Backend: The SGLangBackend and PiecewiseCompileInterpreter classes were updated to integrate seamlessly with the new SGLangConfig and CompilationConfig, streamlining how compilation settings are passed and utilized throughout the system.
  • Introduced Collective Fusion Pass: Added AllReduceFusionPass to optimize allreduce + RMSNorm operations, leveraging FlashInfer for potential performance gains. This includes new pattern matching utilities to identify and replace these operations.
  • Enhanced Compilation Cache Logic: Implemented cache loading for previously compiled graphs within the SGLangBackend, along with detailed timing mechanisms to record compilation and cache hit durations, aiming to reduce redundant compilation efforts.
  • New Inductor Pass Framework: Created SGLangInductorPass and SGLangPatternMatcherPass to provide a structured and extensible way for defining and managing custom Inductor passes, including built-in logging and debugging utilities.
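As a loose illustration of that kind of pass hierarchy (class and method names below are hypothetical, not the actual SGLangInductorPass code), a base class can wrap every pass with the built-in timing and logging the summary mentions:

```python
# Hypothetical sketch of a pass base class that adds logging and
# timing around each concrete pass's transform() method.
import logging
import time
from abc import ABC, abstractmethod

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("inductor_pass_sketch")

class BasePass(ABC):
    """Subclasses implement transform(); the base class times and
    logs each run so every pass gets debugging output for free."""

    @abstractmethod
    def transform(self, graph):
        ...

    def __call__(self, graph):
        start = time.perf_counter()
        result = self.transform(graph)
        elapsed = time.perf_counter() - start
        logger.info("%s finished in %.4fs", type(self).__name__, elapsed)
        return result

class DedupPass(BasePass):
    # Toy pass: drop consecutive duplicate ops from an op list.
    def transform(self, graph):
        out = []
        for op in graph:
            if not out or out[-1] != op:
                out.append(op)
        return out

print(DedupPass()(["relu", "relu", "matmul"]))
# → ['relu', 'matmul']
```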

@yuan-luo yuan-luo marked this pull request as draft October 19, 2025 13:47
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant refactoring of the configuration management by adopting SGLangConfig, which improves code clarity and maintainability. It also adds a new fusion pass for all_reduce and RMSNorm operations, which is a promising optimization. The changes are extensive and touch core compilation logic. I've identified one instance of unreachable code and a risky use of a private PyTorch API that should be addressed. Overall, the direction of these changes is positive, especially for a draft PR.


# WARNING: This is a hack to clear the pattern matcher cache
# and allow multiple values of epsilon.
torch._inductor.pattern_matcher._seen_patterns.clear()
Contributor


medium

Using torch._inductor.pattern_matcher._seen_patterns.clear() is a risky hack because it relies on a private PyTorch API, which could break with future PyTorch updates. It would be safer to use a public API if one exists, or at least guard this call so an API change fails gracefully instead of crashing.
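A hedged sketch of such a guard (the function name is hypothetical, and `_seen_patterns` is a PyTorch internal that may change or disappear):

```python
# Hypothetical defensive wrapper around the private Inductor cache
# clear. Returns True if the cache was found and cleared, False
# otherwise, so callers can log a warning instead of crashing
# after a PyTorch upgrade removes the attribute.
def clear_inductor_seen_patterns() -> bool:
    try:
        from torch._inductor import pattern_matcher
    except ImportError:
        return False  # torch not installed / layout changed
    seen = getattr(pattern_matcher, "_seen_patterns", None)
    if seen is None or not hasattr(seen, "clear"):
        return False  # private attribute gone or reshaped
    seen.clear()
    return True
```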

@yuan-luo yuan-luo force-pushed the support_fusion_pass branch 4 times, most recently from b888157 to 030c0bc Compare October 26, 2025 10:21
@yuan-luo yuan-luo changed the title [Draft][Placeholder] Support fusion pass and refactor pass manager [Draft][WIP] Support fusion pass and refactor pass manager Oct 26, 2025
@yuan-luo yuan-luo force-pushed the support_fusion_pass branch 2 times, most recently from efae87d to 0736534 Compare November 9, 2025 02:22
@yuan-luo yuan-luo marked this pull request as ready for review November 9, 2025 02:26
@yuan-luo yuan-luo requested a review from Fridge003 as a code owner November 9, 2025 02:26
@yuan-luo
Collaborator Author

yuan-luo commented Nov 9, 2025

Basic functionality passed; more verification is in progress.

@yuan-luo yuan-luo changed the title [Draft][WIP] Support fusion pass and refactor pass manager [WIP] Support fusion pass and refactor pass manager Nov 9, 2025
@yuan-luo yuan-luo force-pushed the support_fusion_pass branch 2 times, most recently from 6fc1a4f to 99b0cda Compare November 9, 2025 08:30
@yuan-luo yuan-luo changed the title [WIP] Support fusion pass and refactor pass manager Support pass manager framework and fusion pass Nov 9, 2025
@yuan-luo
Collaborator Author

yuan-luo commented Nov 9, 2025

Regression tests passed. Ready for review.

...
Capturing batches (bs=1 avail_mem=22.80 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:30<00:00,  1.71it/s]
[2025-11-09 00:47:24 TP0] Registering 5044 cuda graph addresses
[2025-11-09 00:47:25 TP3] Capture cuda graph end. Time elapsed: 31.21 s. mem usage=1.46 GB. avail mem=22.78 GB.
[2025-11-09 00:47:25 TP1] Capture cuda graph end. Time elapsed: 31.29 s. mem usage=1.46 GB. avail mem=22.75 GB.
[2025-11-09 00:47:25 TP0] Capture cuda graph end. Time elapsed: 31.29 s. mem usage=1.46 GB. avail mem=22.78 GB.
[2025-11-09 00:47:25 TP2] Capture cuda graph end. Time elapsed: 31.34 s. mem usage=1.46 GB. avail mem=22.75 GB.
[2025-11-09 00:47:26 TP0] max_total_num_tokens=5958788, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=22.78 GB
[2025-11-09 00:47:26] INFO:     Started server process [546876]
[2025-11-09 00:47:26] INFO:     Waiting for application startup.
[2025-11-09 00:47:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-11-09 00:47:26] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-11-09 00:47:26] INFO:     Application startup complete.
[2025-11-09 00:47:26] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-09 00:47:27] INFO:     127.0.0.1:59104 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-09 00:47:27 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,


[2025-11-09 00:48:56] INFO:     127.0.0.1:59106 - "POST /generate HTTP/1.1" 200 OK
[2025-11-09 00:48:56] The server is fired up and ready to roll!
[2025-11-09 00:49:27 TP0] Prefill batch, #new-seq: 1, #new-token: 33, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 66, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.33, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 106, token usage: 0.00, cuda graph: True, gen throughput (token/s): 239.64, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 146, token usage: 0.00, cuda graph: True, gen throughput (token/s): 238.96, #queue-req: 0,
[2025-11-09 00:49:27 TP0] Decode batch, #running-req: 1, #token: 186, token usage: 0.00, cuda graph: True, gen throughput (token/s): 237.71, #queue-req: 0,
[2025-11-09 00:49:28 TP0] Decode batch, #running-req: 1, #token: 226, token usage: 0.00, cuda graph: True, gen throughput (token/s): 237.62, #queue-req: 0,
[2025-11-09 00:49:28] INFO:     127.0.0.1:49302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
➜  /sgl-workspace python test_openai.py
ChatCompletion(id='b5d705da849a42f7b6e9716669d7e2e0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="<think>\nOkay, the user asked for three countries and their capitals, and then how I rank them. Let me start by picking three countries. Maybe the US, Japan, and Brazil. Their capitals are Washington, D.C., Tokyo, and Brasília. Now, how to rank them? The user didn't specify the criteria, so I need to think of possible ways. Maybe by population, economic size, or cultural influence. Let me check the population. The US has around 330 million, Japan about 125 million, Brazil 215 million. So US first, Brazil second, Japan third. But if I consider GDP, the US is the largest, then Japan, then Brazil. Alternatively, cultural influence: Japan has a strong cultural impact, maybe higher than Brazil. But the user might not have a specific criteria. I should mention that the ranking depends on the criteria and provide examples. Also, make sure the capitals are correct. Washington, D.C", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1762678168, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=33, total_tokens=233, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

@Oasis-Git
Collaborator

@yuan-luo

Hi, thanks for your contribution! Could you please add some benchmark results/scripts so that we can do some verification? Thanks!

@yuan-luo
Collaborator Author

@yuan-luo

Hi, thanks for your contribution! Could you please add some benchmark results/scripts so that we can do some verification? Thanks!

@Oasis-Git ok, I'll add it.

@BBuf
Collaborator

BBuf commented Dec 1, 2025

Is this pass independent of Inductor? I noticed the piecewise CUDA graph can run directly by enabling eager mode. Additionally, should we supplement comments for the pass? For example, by analyzing the compiled text to verify whether the graph modification was successful.

@DevashishLal-CB
Contributor

Is this pass independent of Inductor? I noticed the piecewise CUDA graph can run directly by enabling eager mode. Additionally, should we supplement comments for the pass? For example, by analyzing the compiled text to verify whether the graph modification was successful.

Not sure if it's related, but checking the graph modification should be possible. I was able to write unit tests for pattern matcher passes: https://github.com/sgl-project/sglang/pull/10549/files#diff-eb9a5b46a9d09bb4f2ba51eabe57a0b89b6ebd1106aeed0a55d42ea30b7a53e7R90

At runtime, for the pattern matcher we could just check the number of matches encountered; that should be cheaper than a string search (though probably not much of a concern).
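A hypothetical sketch of that idea (not the PR's actual code): have the pattern's replacement callback bump a counter, then assert on the count instead of searching the compiled text:

```python
# Hypothetical sketch: the replacement callback records every match,
# so a test can assert on the count rather than string-search the
# compiled output.
class MatchCounter:
    def __init__(self) -> None:
        self.count = 0

    def replacement(self, *args):
        # Stand-in for the real fused-op replacement; also records
        # that the pattern fired.
        self.count += 1
        return args

counter = MatchCounter()
# Simulate the pattern firing twice during compilation:
counter.replacement("allreduce", "rmsnorm")
counter.replacement("allreduce", "rmsnorm")
assert counter.count == 2, "expected the fusion pattern to fire twice"
```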

@Oasis-Git
Collaborator

I am thinking about the location and name of the config file. In SGLang, component-specific config usually lives under the component's own directory rather than a config/ directory (as with lora). So I think it would be better to put it under the compilation config.

Also, there is no global-config concept in SGLang, unlike vLLM. So I think it would be better to remove sglang_config and move all the logic under the compilation config.

@DevashishLal-CB
Contributor

Problem Resolved.

➜  sglang_dev git:(support_fusion_pass) python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager --disable-radix-cache --device cuda --host 127.0.0.1 --port 30000
[2025-11-30 12:49:52] WARNING server_args.py:1304: Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-11-30 12:49:52] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.7629746875, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=575006528, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, mm_process_config={}, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, 
bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', 
speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_block_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, 
enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=True, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_dynamic_batch_tokenizer=False, 
dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, decrypted_config_file=None, decrypted_draft_config_file=None, mm_enable_dp_encoder=False, forward_hooks=None)
[2025-11-30 12:49:56] Using default HuggingFace chat template with detected content format: openai
[2025-11-30 12:50:03] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-11-30 12:50:04] Init torch distributed ends. mem usage=0.00 GB
[2025-11-30 12:50:04] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-11-30 12:50:04] Ignore import error when loading sglang.srt.models.mindspore: name 'ms' is not defined
[2025-11-30 12:50:05] Load weight begin. avail mem=177.74 GB
[2025-11-30 12:50:05] Multimodal attention backend not set. Use triton_attn.
[2025-11-30 12:50:05] Using triton_attn as multimodal attention backend.
[2025-11-30 12:50:05] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.08it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.07it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.53it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.27it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.23it/s]

[2025-11-30 12:50:10] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=161.94 GB, mem usage=15.79 GB.
[2025-11-30 12:50:10] Using KV cache dtype: torch.bfloat16
[2025-11-30 12:50:10] KV Cache is allocated. #tokens: 2243474, K size: 59.91 GB, V size: 59.91 GB
[2025-11-30 12:50:10] Memory pool end. avail mem=40.03 GB
[2025-11-30 12:50:10] CUTLASS backend is disabled when piecewise cuda graph is enabled due to TMA descriptor initialization issues on B200. Using auto backend instead for stability.
[2025-11-30 12:50:10] Capture cuda graph begin. This can take up to several minutes. avail mem=39.48 GB
[2025-11-30 12:50:10] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=1 avail_mem=38.07 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:04<00:00, 10.83it/s]
[2025-11-30 12:50:15] Capture cuda graph end. Time elapsed: 5.37 s. mem usage=1.44 GB. avail mem=38.05 GB.
[2025-11-30 12:50:15] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096]
[2025-11-30 12:50:15] install_torch_compiled
/usr/local/lib/python3.12/dist-packages/torch/_dynamo/variables/functions.py:1575: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
[2025-11-30 12:50:18] Initializing SGLangBackend
[2025-11-30 12:50:18] SGLangBackend __call__
[2025-11-30 12:50:18] Compiling a graph for dynamic shape takes 0.25 s
[2025-11-30 12:50:18] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1764507018.5970824.py
Capturing num tokens (num_tokens=4096 avail_mem=38.02 GB):   0%|                                                                                                                                                            | 0/58 [00:00<?, ?it/s][2025-11-30 12:50:21] Initializing SGLangBackend
[2025-11-30 12:50:21] SGLangBackend __call__
[2025-11-30 12:50:21] Compiling a graph for dynamic shape takes 0.27 s
[2025-11-30 12:50:21] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1764507021.9251404.py
Capturing num tokens (num_tokens=3968 avail_mem=37.30 GB):   2%|██▌                                                                                                                                                 | 1/58 [00:03<03:02,  3.21s/it][2025-11-30 12:50:24] Initializing SGLangBackend
[2025-11-30 12:50:24] SGLangBackend __call__
[2025-11-30 12:50:25] Compiling a graph for dynamic shape takes 0.28 s
[2025-11-30 12:50:25] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1764507025.2558053.py
Capturing num tokens (num_tokens=4 avail_mem=37.10 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:14<00:00,  3.92it/s]
[2025-11-30 12:50:37] max_total_num_tokens=2243474, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=37.09 GB
[2025-11-30 12:50:37] INFO:     Started server process [189179]
[2025-11-30 12:50:37] INFO:     Waiting for application startup.
[2025-11-30 12:50:37] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-11-30 12:50:37] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-11-30 12:50:37] INFO:     Application startup complete.
[2025-11-30 12:50:37] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-30 12:50:38] Endpoint '/get_model_info' is deprecated and will be removed in a future version. Please use '/model_info' instead.
[2025-11-30 12:50:38] INFO:     127.0.0.1:34098 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-30 12:50:39] Prefill batch, #new-seq: 1, #new-token: 29, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-30 12:50:39] INFO:     127.0.0.1:34104 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-11-30 12:50:39] The server is fired up and ready to roll!

Considering this output, is it right to say that the fusions are applied to prefill only? Decode doesn't need to use piecewise graphs, but we should still apply fusions to the full decode graphs; that is where we will see the maximum performance gain, and the kernel launches per token will decrease.

Collaborator

@Oasis-Git Oasis-Git left a comment


In general, the compilation side looks good to me except for the config settings. I still suggest that we place the compilation config in its own directory.

For the addition of pass manager ops, torch profiling and benchmarks are strongly recommended. This also applies to later pass manager fusion ops.

@yuan-luo
Collaborator Author

yuan-luo commented Dec 2, 2025

In general the compilation side looks good to me, except for the config settings. I still suggest that we put the compilation config in its own directory.

For the addition of pass manager ops, torch profiling and benchmarking are strongly recommended. This also applies to later pass manager fusion ops.

Sure, I'll move the config settings to compilation folder.

Collaborator

@ispobock ispobock left a comment


I'm wondering whether this Pass Manager is intended only for piecewise CUDA graphs, or if it's also used in the decode phase?
Is there an option/server_arg to turn it on?

@DevashishLal-CB
Contributor

I'm wondering whether this Pass Manager is intended only for piecewise CUDA graphs, or if it's also used in the decode phase? Is there an option/server_arg to turn it on?

That would require changes in the cuda graph runner (this is how I did it https://github.com/sgl-project/sglang/pull/10549/files#diff-bd9ac3018495d7e6099bae6a48bfa242f997fc1d06b64fc37699d3f5d44efc86R191)

@yuan-luo yuan-luo force-pushed the support_fusion_pass branch 2 times, most recently from ba2c716 to ad63c41 Compare December 7, 2025 10:58
@yuan-luo
Collaborator Author

yuan-luo commented Dec 7, 2025

Is this pass independent of Inductor? I noticed the piecewise CUDA graph can run directly by enabling eager mode. Additionally, should we add comments for the pass? For example, by analyzing the compiled text to verify whether the graph modification was successful.

Once the pass is working correctly, we can dump the fused graph, and then analyze the compiled text to verify the graph modification result.
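The verification described above can be sketched as a scan over the FX node targets before and after a pass runs. This is a minimal illustration, not the SGLang implementation; the module and helper below are hypothetical, and a real fusion pass would replace the matched subgraph with a single fused op that the same scan would detect.

```python
import operator

import torch
import torch.fx as fx


class AddMul(torch.nn.Module):
    def forward(self, x):
        return (x + 1.0) * 2.0


def op_targets(gm: fx.GraphModule) -> list:
    # Collect the callable targets of every call_function node in the graph.
    return [n.target for n in gm.graph.nodes if n.op == "call_function"]


gm = fx.symbolic_trace(AddMul())
targets = op_targets(gm)
# Before any fusion pass, the traced graph contains the separate add and mul
# ops; after a (hypothetical) fused op is substituted, they would be gone and
# the fused target would appear instead.
print(operator.add in targets and operator.mul in targets)
```

Dumping `gm.graph.print_tabular()` (or the generated `.py` file, as the log paths above show) gives the same information in human-readable form.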

@yuan-luo yuan-luo force-pushed the support_fusion_pass branch 2 times, most recently from 3e0a103 to 4c60605 Compare December 7, 2025 12:06
@yuan-luo yuan-luo force-pushed the support_fusion_pass branch 3 times, most recently from 122be14 to 895e494 Compare December 21, 2025 12:36
@yuan-luo yuan-luo force-pushed the support_fusion_pass branch from 895e494 to 3e6cc2a Compare December 23, 2025 11:53
@yuan-luo yuan-luo changed the title Support pass manager framework and fusion pass [1/N] Support PassManager Framework and Fusion Pass Dec 24, 2025
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
@yuan-luo yuan-luo force-pushed the support_fusion_pass branch 2 times, most recently from 4ee09ab to 4d1f9ac Compare December 29, 2025 05:00
@yuan-luo yuan-luo marked this pull request as draft December 29, 2025 15:09
@yuan-luo
Collaborator Author

yuan-luo commented Dec 29, 2025

Fixing some bugs in fusion.

[2025-12-29 14:48:54] Received sigquit from a child process. It usually means the child failed.
[2025-12-29 14:48:54 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 2905, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 333, in __init__
    self.init_model_worker()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 471, in init_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker.py", line 253, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 355, in __init__
    self.initialize(min_per_gpu_memory)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 588, in initialize
    self.init_piecewise_cuda_graphs()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 2060, in init_piecewise_cuda_graphs
    self.piecewise_cuda_graph_runner = PiecewiseCudaGraphRunner(self)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 251, in __init__
    self.warmup_torch_compile(num_tokens=num_tokens)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 330, in warmup_torch_compile
    _ = self.model_runner.model.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/models/qwen2_5_vl.py", line 745, in forward
    hidden_states = general_mm_embed_routine(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/mm_utils.py", line 1092, in general_mm_embed_routine
    hidden_states = language_model(
                    ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/compile.py", line 206, in trampoline
    _ensure_compiled(self, *args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/compile.py", line 197, in _ensure_compiled
    compiled_callable(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
    compiled_fn = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2437, in __call__
    return self.compiler_fn(model_, inputs_, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/compile.py", line 144, in <lambda>
    backend_factory = lambda gm, ex: SGLangBackend(sglang_config, graph_pool)(
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/backend.py", line 487, in __call__
    self.configure_post_pass()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/backend.py", line 445, in configure_post_pass
    self.post_grad_pass_manager.configure(self.sglang_config)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/pass_manager.py", line 52, in configure
    self.passes += [AllReduceFusionPass(config)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/collective_fusion.py", line 416, in __init__
    self.register_patterns()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/inductor_pass.py", line 123, in fn_new
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/collective_fusion.py", line 431, in register_patterns
    ).register(self.patterns)
      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/collective_fusion.py", line 288, in register
    pm.register_replacement(
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
    pattern, gm = gen_pattern_and_search_gm(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
    search_gm = trace_fn(search_fn, flat_inputs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
    gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
    return make_fx_tracer.trace(f, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
    return self._trace_inner(f, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
    t = dispatch_trace(
        ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/_symbolic_trace.py", line 868, in trace
    (self.create_arg(fn(*args)),),
                     ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/collective_fusion.py", line 266, in pattern
    rms = self.rmsnorm_matcher(allreduce_output, weight)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/matcher_utils.py", line 32, in __call__
    return self.forward(*args, **kws)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/compilation/matcher_utils.py", line 64, in forward_custom
    _, result = auto_functionalized(
                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_higher_order_ops/auto_functionalize.py", line 357, in __call__
    assert can_auto_functionalize(_mutable_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='<lambda>' raised:
AssertionError: 

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

@yuan-luo yuan-luo force-pushed the support_fusion_pass branch from 4d1f9ac to cd10d83 Compare December 29, 2025 15:09
@yuan-luo
Collaborator Author

Problem resolved. Now the Fusion Pass for AllReduceFusion works as expected.

root@c7e9bb6a6789:/sgl-workspace/sglang# SGLANG_VLM_CACHE_SIZE_MB=0  python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --port 30000 --trust-remote-code --tp-size 2 --enable-cache-report --log-level info --max-running-requests 64 --mem-fraction-static 0.65 --chunked-prefill-size 8192 --attention-backend fa3 --chat-template qwen2-vl --mm-attention-backend fa3 --disable-radix-cache --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192  --piecewise-cuda-graph-compiler eager
[2025-12-30 02:27:12] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.65, max_running_requests=64, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=1070049921, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, 
enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', weight_version='default', chat_template='qwen2-vl', completion_template=None, file_storage_path='sglang_storage', enable_cache_report=True, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend='fa3', fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, 
speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', 
multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=True, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, 
flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-30 02:27:13] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-30 02:27:18] Loading chat template from argument: qwen2-vl
[2025-12-30 02:27:24 TP0] Init torch distributed begin.
[2025-12-30 02:27:25 TP1] Init torch distributed begin.
[rank1]:[W1230 02:27:25.552575147 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W1230 02:27:25.556913957 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-12-30 02:27:25 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-12-30 02:27:26 TP0] Init torch distributed ends. mem usage=0.79 GB
[2025-12-30 02:27:26 TP1] Init torch distributed ends. mem usage=0.79 GB
[2025-12-30 02:27:26 TP0] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-30 02:27:26 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-30 02:27:26 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2025-12-30 02:27:26 TP1] Load weight begin. avail mem=78.01 GB
[2025-12-30 02:27:26 TP0] Load weight begin. avail mem=78.01 GB
[2025-12-30 02:27:26 TP1] Using fa3 as multimodal attention backend.
[2025-12-30 02:27:26 TP0] Using fa3 as multimodal attention backend.
[2025-12-30 02:27:27 TP0] Found local HF snapshot for Qwen/Qwen2.5-VL-7B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/cc594898137f460bfe9f0759e9844b3ce807cfb5; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.85it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:01,  1.68it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:01,  1.63it/s]
[2025-12-30 02:27:29 TP1] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=70.08 GB, mem usage=7.93 GB.
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:02<00:00,  1.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  2.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:02<00:00,  1.88it/s]

[2025-12-30 02:27:30 TP0] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=70.08 GB, mem usage=7.93 GB.
[2025-12-30 02:27:30 TP0] Using KV cache dtype: torch.bfloat16
[2025-12-30 02:27:30 TP1] KV Cache is allocated. #tokens: 1601902, K size: 21.39 GB, V size: 21.39 GB
[2025-12-30 02:27:30 TP0] KV Cache is allocated. #tokens: 1601902, K size: 21.39 GB, V size: 21.39 GB
[2025-12-30 02:27:30 TP1] Memory pool end. avail mem=27.16 GB
[2025-12-30 02:27:30 TP0] Memory pool end. avail mem=27.16 GB
[2025-12-30 02:27:30 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=27.07 GB
[2025-12-30 02:27:30 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=27.07 GB
[2025-12-30 02:27:30 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64]
Capturing batches (bs=1 avail_mem=26.83 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:01<00:00,  6.96it/s]
[2025-12-30 02:27:32 TP0] Registering 684 cuda graph addresses
[2025-12-30 02:27:32 TP0] Capture cuda graph end. Time elapsed: 2.29 s. mem usage=0.24 GB. avail mem=26.83 GB.
[2025-12-30 02:27:32 TP0] Capture piecewise CUDA graph begin. avail mem=26.83 GB
[2025-12-30 02:27:32 TP0] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192]
[2025-12-30 02:27:32 TP0] install_torch_compiled
[2025-12-30 02:27:32 TP1] Capture cuda graph end. Time elapsed: 2.30 s. mem usage=0.24 GB. avail mem=26.83 GB.
[2025-12-30 02:27:32 TP1] Capture piecewise CUDA graph begin. avail mem=26.83 GB
Compiling num tokens (num_tokens=8192):   0%|                                                                                                                                           | 0/74 [00:00<?, ?it/s]/usr/local/lib/python3.12/dist-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.12/dist-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
[2025-12-30 02:27:35 TP0] Initializing SGLangBackend
[2025-12-30 02:27:35 TP0] SGLangBackend __call__
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
rank 0 allocated ipc_handles: [['0x7f821c000000', '0x7f8218000000'], ['0x7f84cfc00000', '0x7f84cfe00000'], ['0x7f820c000000', '0x7f8200000000']]
rank 1 allocated ipc_handles: [['0x7f2c5a000000', '0x7f2c5e000000'], ['0x7f2f0be00000', '0x7f2f0bc00000'], ['0x7f2c42000000', '0x7f2c4e000000']]
[2025-12-30 02:27:35.335] [info] lamportInitialize start: buffer: 0x7f2c4e000000, size: 100663296
[2025-12-30 02:27:35.335] [info] lamportInitialize start: buffer: 0x7f820c000000, size: 100663296
set flag_ptr[3] = lamport_comm_size:  67106816
Rank 1 workspace[0] 0x7f2c5a000000
Rank 1 workspace[1] 0x7f2c5e000000
Rank 1 workspace[2] 0x7f2f0be00000
Rank 1 workspace[3] 0x7f2f0bc00000
Rank 1 workspace[4] 0x7f2c42000000
Rank 1 workspace[5] 0x7f2c4e000000
Rank 1 workspace[6] 0x7f3afde40400
set flag_ptr[3] = lamport_comm_size:  67106816
Rank 0 workspace[0] 0x7f821c000000
Rank 0 workspace[1] 0x7f8218000000
Rank 0 workspace[2] 0x7f84cfc00000
Rank 0 workspace[3] 0x7f84cfe00000
Rank 0 workspace[4] 0x7f820c000000
Rank 0 workspace[5] 0x7f8200000000
Rank 0 workspace[6] 0x7f90bfe40400
[2025-12-30 02:27:35 TP0] Custom op RMSNorm was not registered, which means it won't appear in the op registry. It will be enabled/disabled based on the global settings.
[2025-12-30 02:27:35 TP1] Custom op RMSNorm was not registered, which means it won't appear in the op registry. It will be enabled/disabled based on the global settings.
[2025-12-30 02:27:35 TP0] Compiling a graph for dynamic shape takes 0.27 s
[2025-12-30 02:27:35 TP1] Compiling a graph for dynamic shape takes 0.27 s
[2025-12-30 02:27:35 TP0] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1767061655.851224.py
Compiling num tokens (num_tokens=7936):   1%|█▊                                                                                                                                 | 1/74 [00:04<05:09,  4.24s/it][2025-12-30 02:27:39 TP0] Initializing SGLangBackend
[2025-12-30 02:27:39 TP0] SGLangBackend __call__
rank 0 allocated ipc_handles: [['0x7f81d6000000', '0x7f81d2000000'], ['0x7f81e3400000', '0x7f81e3600000'], ['0x7f81c6000000', '0x7f81ba000000']]
rank 1 allocated ipc_handles: [['0x7f2c12000000', '0x7f2c16000000'], ['0x7f2c23600000', '0x7f2c23400000'], ['0x7f2bfa000000', '0x7f2c06000000']]
[2025-12-30 02:27:39.800] [info] lamportInitialize start: buffer: 0x7f2c06000000, size: 100663296
[2025-12-30 02:27:39.800] [info] lamportInitialize start: buffer: 0x7f81c6000000, size: 100663296
set flag_ptr[3] = lamport_comm_size:  67106816
Rank 0 workspace[0] 0x7f81d6000000
Rank 0 workspace[1] 0x7f81d2000000
Rank 0 workspace[2] 0x7f81e3400000
set flag_ptr[3] = lamport_comm_size:  67106816
Rank 0 workspace[3] 0x7f81e3600000
Rank 0 workspace[4] 0x7f81c6000000
Rank 0 workspace[5] 0x7f81ba000000
Rank 1 workspace[0] 0x7f2c12000000
Rank 0 workspace[6] 0x7f90bfe40600
Rank 1 workspace[1] 0x7f2c16000000
Rank 1 workspace[2] 0x7f2c23600000
Rank 1 workspace[3] 0x7f2c23400000
Rank 1 workspace[4] 0x7f2bfa000000
Rank 1 workspace[5] 0x7f2c06000000
Rank 1 workspace[6] 0x7f3afde40600
[2025-12-30 02:27:40 TP0] Compiling a graph for dynamic shape takes 0.30 s
[2025-12-30 02:27:40 TP1] Compiling a graph for dynamic shape takes 0.31 s
[2025-12-30 02:27:40 TP0] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1767061660.327471.py
Compiling num tokens (num_tokens=1152):  50%|█████████████████████████████████████████████████████████████████                                                                 | 37/74 [00:10<00:01, 22.80it/s][2025-12-30 02:27:46 TP0] Initializing SGLangBackend
[2025-12-30 02:27:46 TP0] SGLangBackend __call__
rank 0 allocated ipc_handles: [['0x7f81b6000000', '0x7f81b2000000'], ['0x7f81e3800000', '0x7f81e3a00000'], ['0x7f81a6000000', '0x7f819a000000']]
rank 1 allocated ipc_handles: [['0x7f2bf2000000', '0x7f2bf6000000'], ['0x7f2c23a00000', '0x7f2c23800000'], ['0x7f2bda000000', '0x7f2be6000000']]
[2025-12-30 02:27:46.030] [info] lamportInitialize start: buffer: 0x7f81a6000000, size: 100663296
[2025-12-30 02:27:46.030] [info] lamportInitialize start: buffer: 0x7f2be6000000, size: 100663296
set flag_ptr[3] = lamport_comm_size:  67106816
Rank 1 workspace[0] 0x7f2bf2000000
Rank 1 workspace[1] 0x7f2bf6000000
Rank 1 workspace[2] 0x7f2c23a00000
Rank 1 workspace[3] 0x7f2c23800000
set flag_ptr[3] = lamport_comm_size:  67106816
Rank 1 workspace[4] 0x7f2bda000000
Rank 0 workspace[0] 0x7f81b6000000
Rank 1 workspace[5] 0x7f2be6000000
Rank 0 workspace[1] 0x7f81b2000000
Rank 1 workspace[6] 0x7f3afde40800
Rank 0 workspace[2] 0x7f81e3800000
Rank 0 workspace[3] 0x7f81e3a00000
Rank 0 workspace[4] 0x7f81a6000000
Rank 0 workspace[5] 0x7f819a000000
Rank 0 workspace[6] 0x7f90bfe40800
[2025-12-30 02:27:46 TP0] Compiling a graph for dynamic shape takes 0.28 s
[2025-12-30 02:27:46 TP1] Compiling a graph for dynamic shape takes 0.28 s
[2025-12-30 02:27:46 TP0] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1767061666.4889896.py
Compiling num tokens (num_tokens=4): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 74/74 [00:15<00:00,  4.82it/s]
Capturing num tokens (num_tokens=4 avail_mem=24.44 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 74/74 [00:08<00:00,  9.13it/s]
[2025-12-30 02:27:56 TP0] Registering 1960 cuda graph addresses
[2025-12-30 02:27:56 TP1] Capture piecewise CUDA graph end. Time elapsed: 23.92 s. mem usage=2.39 GB. avail mem=24.43 GB.
[2025-12-30 02:27:56 TP0] Capture piecewise CUDA graph end. Time elapsed: 23.94 s. mem usage=2.39 GB. avail mem=24.43 GB.
[2025-12-30 02:28:01 TP0] max_total_num_tokens=1601902, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=64, context_len=128000, available_gpu_mem=24.43 GB
[2025-12-30 02:28:02] INFO:     Started server process [30775]
[2025-12-30 02:28:02] INFO:     Waiting for application startup.
[2025-12-30 02:28:02] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-30 02:28:02] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-12-30 02:28:02] INFO:     Application startup complete.
[2025-12-30 02:28:02] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-12-30 02:28:03] INFO:     127.0.0.1:61184 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-30 02:28:03 TP0] Prefill batch, #new-seq: 1, #new-token: 29, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-30 02:28:03 TP1] Multimodal embedding cache is full. This typically occurs when a single embedding exceeds the cache size limit. Consider increasing the `SGLANG_VLM_CACHE_SIZE_MB` environment variable or reducing the input embedding size.
[2025-12-30 02:28:03 TP0] Multimodal embedding cache is full. This typically occurs when a single embedding exceeds the cache size limit. Consider increasing the `SGLANG_VLM_CACHE_SIZE_MB` environment variable or reducing the input embedding size.
[2025-12-30 02:28:03] INFO:     127.0.0.1:61196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-12-30 02:28:03] The server is fired up and ready to roll!
[2025-12-30 02:28:08 TP0] Prefill batch, #new-seq: 1, #new-token: 33, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-12-30 02:28:08 TP0] Decode batch, #running-req: 1, #token: 66, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.82, #queue-req: 0, 
[2025-12-30 02:28:09 TP0] Decode batch, #running-req: 1, #token: 106, token usage: 0.00, cuda graph: True, gen throughput (token/s): 248.42, #queue-req: 0, 
[2025-12-30 02:28:09 TP0] Decode batch, #running-req: 1, #token: 146, token usage: 0.00, cuda graph: True, gen throughput (token/s): 245.27, #queue-req: 0, 
[2025-12-30 02:28:09 TP0] Decode batch, #running-req: 1, #token: 186, token usage: 0.00, cuda graph: True, gen throughput (token/s): 241.86, #queue-req: 0, 
[2025-12-30 02:28:09] INFO:     127.0.0.1:57772 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client:

root@c7e9bb6a6789:/sgl-workspace/bench_script# python test_openai.py 
ChatCompletion(id='25367dbf57584ac896c3bd9d421cd12f', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure, here are three countries and their capitals, along with a simple ranking based on the size of their populations as of 2023:\n\n1. **China - Beijing** - Population: Approximately 1.4 billion (2023 estimate)\n2. **India - New Delhi** - Population: Approximately 1.37 billion (2023 estimate)\n3. **United States - Washington, D.C.** - Population: Approximately 332 million (2023 estimate)\n\n**Ranking:**\n\n1. **China** - The most populous country in the world.\n2. **India** - The second most populous country in the world.\n3. **United States** - The third most populous country in the world.\n\nThese rankings are based on the population size of the countries, which is a common metric used to compare the size of populations across different nations.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=151645)], created=1767061689, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=185, prompt_tokens=33, total_tokens=218, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
root@c7e9bb6a6789:/sgl-workspace/bench_script# 
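
The client run above hits the server's OpenAI-compatible endpoint. The actual `test_openai.py` isn't shown in this PR, but a minimal sketch of such a client, using only the standard library and assuming the server from the log is listening on `127.0.0.1:30000` (the URL, model name, and prompt here are illustrative):

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:30000"):
    """Build an OpenAI-compatible /v1/chat/completions request for a
    locally running SGLang server. Field values are illustrative."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return req, payload


if __name__ == "__main__":
    req, _ = build_chat_request("List three countries and their capitals.")
    # Requires a running server; uncomment to actually send the request:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
```

The fused-pass path is transparent to the client: the request is identical whether or not the compiler applied the fusion, which is the point of doing the rewrite at the graph level.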

@yuan-luo force-pushed the support_fusion_pass branch from cd10d83 to 52662bb on December 30, 2025 02:32
@yuan-luo
Collaborator Author

/tag-and-rerun-ci

@yuan-luo
Collaborator Author

yuan-luo commented Jan 16, 2026

We had an in-depth conversation with @ispobock and @BBuf about whether to proceed with this PR. The decision is to keep this PR open but not pursue this direction further: the complexity of the implementation, the debugging approach, and the overhead it introduces are hard to keep under control.
I'll split this PR and adapt some of the configuration helpers into the infra.
cc: @DevashishLal-CB

@WhoisZihan WhoisZihan mentioned this pull request Jan 25, 2026

6 participants