
[diffusion]: align sglang diffusion AMD pyproject_other.toml diffusion dependency with pyproject.toml #16225

Merged
HaiShaw merged 2 commits into sgl-project:main from AMD-AIM:fix_rocm_diffusion on Jan 29, 2026
Conversation

Contributor

@ZiguanWang ZiguanWang commented Dec 31, 2025

Motivation

Align the AMD pyproject_other.toml diffusion dependencies with pyproject.toml.

Modifications

  1. Modify the AMD pyproject_other.toml diffusion dependencies to align with pyproject.toml.
  2. Directly use yunchang@git+https://github.com/feifeibear/long-context-attention.git@b192e97 to fix torch_cpp_ext._get_cuda_arch_flags() (fixes the flashinfer import error on AMD GPUs; see feifeibear/long-context-attention#148).
  3. Pin runai_model_streamer==0.15.3 and validate RUNAI_MODEL_STREAMER (the ROCm Docker default version, 0.11.0, causes problems).
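
For reference, the pinned dependencies described above could be expressed in pyproject_other.toml roughly as follows. This is an illustrative sketch only: the yunchang git pin and the runai_model_streamer version come from this PR, but the table name and surrounding layout are assumptions about the file's structure, not its actual contents.

```toml
# Hypothetical excerpt of the ROCm pyproject_other.toml diffusion extra.
# Only the yunchang commit pin and runai_model_streamer version are taken
# from this PR; the [project.optional-dependencies] layout is assumed.
[project.optional-dependencies]
diffusion = [
    # Pinned commit fixes torch_cpp_ext._get_cuda_arch_flags() on AMD GPUs
    "yunchang @ git+https://github.com/feifeibear/long-context-attention.git@b192e97",
    # ROCm Docker ships 0.11.0 by default, which is known to misbehave
    "runai_model_streamer==0.15.3",
]
```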

Accuracy Tests

sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --prompt "A curious raccoon" --save-output

sglang generate --model-path=Wan-AI/Wan2.1-I2V-14B-480P-Diffusers --prompt="Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." --image-path="https://github.com/Wan-Video/Wan2.2/blob/990af50de458c19590c245151197326e208d7191/examples/i2v_input.JPG?raw=true" --num-gpus 2 --enable-cfg-parallel --save-output

The output videos are the same before and after this change.

Log before this change:

[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:27:00 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:27:01] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 10:27:01] server_args: {"model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30059, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5630, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 10:27:01] Local mode: True
[12-31 10:27:01] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:27:08 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:27:09] Scheduler bind at endpoint: tcp://localhost:5630
[12-31 10:27:09] Initializing distributed environment with world_size=1, device=cuda:0
[12-31 10:27:13] Downloaded model_index.json for Wan-AI/Wan2.1-T2V-1.3B-Diffusers, pipeline: WanPipeline
[12-31 10:27:13] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_pipeline.WanPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanT2V_1_3B_SamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanT2V480PConfig'>)
[12-31 10:27:13] Loading pipeline modules...
[12-31 10:27:13] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-T2V-1.3B-Diffusers...
[12-31 10:27:13] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 10:27:13] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 10:27:13] Diffusers version: 0.33.0.dev0
[12-31 10:27:13] Loading pipeline modules from config: {'_class_name': 'WanPipeline', '_diffusers_version': '0.33.0.dev0', 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 10:27:13] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|          | 0/5 [00:00<?, ?it/s][12-31 10:27:13] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/text_encoder. avail mem: 184.91 GB
[12-31 10:27:13] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}

Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:04,  1.01s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:02<00:04,  1.53s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:05<00:03,  1.81s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:06<00:01,  1.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:09<00:00,  1.84s/it]

[12-31 10:28:22] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 183.84 GB
Loading required modules:  20%|██        | 1/5 [01:09<04:36, 69.11s/it][12-31 10:28:22] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/tokenizer. avail mem: 183.84 GB
[12-31 10:28:23] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 183.84 GB
Loading required modules:  40%|████      | 2/5 [01:09<01:25, 28.67s/it][12-31 10:28:23] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/vae. avail mem: 183.84 GB
[12-31 10:28:23] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 10:28:23] Loaded vae: AutoencoderKLWan from customized. model size: 0.27 GB, avail mem: 183.84 GB
[12-31 10:28:23] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/transformer. avail mem: 183.84 GB
[12-31 10:28:23] Loading WanTransformer3DModel from 2 safetensors files, default_dtype: torch.bfloat16
[12-31 10:28:23] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:28:23] Using AITer backend on ROCm.

Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 106.43it/s]

[12-31 10:28:24] Loaded model with 1.42B parameters
[12-31 10:28:24] Loaded transformer: WanTransformer3DModel from customized. model size: 2.64 GB, avail mem: 181.00 GB
Loading required modules:  80%|████████  | 4/5 [01:10<00:10, 10.93s/it][12-31 10:28:24] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/scheduler. avail mem: 181.00 GB
[12-31 10:28:24] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 181.00 GB
Loading required modules: 100%|██████████| 5/5 [01:10<00:00, 14.04s/it]
[12-31 10:28:24] Creating pipeline stages...
[12-31 10:28:24] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:28:24] Using AITer backend on ROCm.
[12-31 10:28:24] Pipeline instantiated
[12-31 10:28:24] Worker 0: Initialized device, model, and distributed environment.
[12-31 10:28:24] Worker 0: Scheduler loop started.
[12-31 10:28:24] Processing prompt 1/1: A curious raccoon
[12-31 10:28:24] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 81
                      prompt: A curious raccoon
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 3.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_curious_raccoon_20251231-102824_483bc3b0.mp4
        
[12-31 10:28:24] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 10:28:24] [InputValidationStage] started...
[12-31 10:28:24] [InputValidationStage] finished in 0.0001 seconds
[12-31 10:28:24] [TextEncodingStage] started...
[12-31 10:28:32] [TextEncodingStage] finished in 8.9242 seconds
[12-31 10:28:32] [ConditioningStage] started...
[12-31 10:28:32] [ConditioningStage] finished in 0.0000 seconds
[12-31 10:28:32] [TimestepPreparationStage] started...
[12-31 10:28:32] [TimestepPreparationStage] finished in 0.0004 seconds
[12-31 10:28:32] [LatentPreparationStage] started...
[12-31 10:28:32] [LatentPreparationStage] finished in 0.0042 seconds
[12-31 10:28:32] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s][aiter] start build [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/build/module_fmha_v3_fwd
[12-31 10:28:38] start build [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/build/module_fmha_v3_fwd
[aiter] finish build [module_fmha_v3_fwd], cost 38.7s 
[12-31 10:29:17] finish build [module_fmha_v3_fwd], cost 38.7s 
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 10:29:17] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 10:29:17] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [02:09<00:00,  2.59s/it]
[12-31 10:30:42] [DenoisingStage] average time per step: 2.5911 seconds
[12-31 10:30:42] [DenoisingStage] finished in 129.5573 seconds
[12-31 10:30:42] [DecodingStage] started...
[12-31 10:30:48] [DecodingStage] finished in 6.2848 seconds
[12-31 10:30:48] Peak GPU memory: 11.85 GB, Remaining GPU memory at peak: 180.14 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer']
[12-31 10:30:55] Output saved to outputs/A_curious_raccoon_20251231-102824_483bc3b0.mp4
[12-31 10:30:55] Pixel data generated successfully in 151.01 seconds
[12-31 10:30:55] Completed batch processing. Generated 1 outputs in 151.01 seconds.
[12-31 10:30:55] Memory usage - Max peak: 12132.01 MB, Avg peak: 12132.01 MB
[12-31 10:30:55] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:31:04 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:31:04] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 10:31:04] server_args: {"model_path": "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 2, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": true, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 2, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30105, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5655, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 10:31:04] Local mode: True
[12-31 10:31:04] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:31:12 [__init__.py:241] Automatically detected platform rocm.
INFO 12-31 10:31:12 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:31:12] Scheduler bind at endpoint: tcp://localhost:5655
[12-31 10:31:12] Initializing distributed environment with world_size=2, device=cuda:0
[12-31 10:31:18] Found nccl from library librccl.so.1
[12-31 10:31:18] sglang-diffusion is using nccl==2.26.6
[12-31 10:31:19] Found nccl from library librccl.so.1
[12-31 10:31:19] sglang-diffusion is using nccl==2.26.6
[12-31 10:31:24] Downloaded model_index.json for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, pipeline: WanImageToVideoPipeline
[12-31 10:31:24] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_i2v_pipeline.WanImageToVideoPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanI2V_14B_480P_SamplingParam'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanI2V480PConfig'>)
[12-31 10:31:24] Loading pipeline modules...
[12-31 10:31:24] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers...
[12-31 10:31:24] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 10:31:24] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 10:31:24] Diffusers version: 0.33.0.dev0
[12-31 10:31:24] Loading pipeline modules from config: {'_class_name': 'WanImageToVideoPipeline', '_diffusers_version': '0.33.0.dev0', 'image_encoder': ['transformers', 'CLIPVisionModelWithProjection'], 'image_processor': ['transformers', 'CLIPImageProcessor'], 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 10:31:24] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler', 'image_encoder', 'image_processor']
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s][12-31 10:31:24] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/text_encoder. avail mem: 183.63 GB
[12-31 10:31:24] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:06,  1.53s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:04<00:07,  2.65s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:09<00:07,  3.54s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:12<00:03,  3.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:15<00:00,  3.17s/it]

Loading required modules:  29%|██▊       | 2/7 [01:59<04:06, 49.33s/it] [12-31 10:33:33] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 182.56 GB
Loading required modules:  14%|█▍        | 1/7 [02:08<12:48, 128.14s/it][12-31 10:33:33] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/tokenizer. avail mem: 182.56 GB
[12-31 10:33:33] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 182.56 GB
Loading required modules:  29%|██▊       | 2/7 [02:08<04:25, 53.04s/it] [12-31 10:33:33] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/vae. avail mem: 182.56 GB
[12-31 10:33:33] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 10:33:33] Loaded vae: AutoencoderKLWan from customized. model size: 0.47 GB, avail mem: 182.56 GB
[12-31 10:33:33] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/transformer. avail mem: 182.56 GB
[12-31 10:33:33] Loading WanTransformer3DModel from 14 safetensors files, default_dtype: torch.bfloat16
[12-31 10:33:33] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:33:33] Using AITer backend on ROCm.

Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:00<00:00, 377.41it/s]

Loading required modules: 100%|██████████| 7/7 [02:18<00:00, 19.76s/it]
[12-31 10:33:45] Loaded model with 16.40B parameters
[12-31 10:33:45] Loaded transformer: WanTransformer3DModel from customized. model size: 30.54 GB, avail mem: 150.61 GB
Loading required modules:  57%|█████▋    | 4/7 [02:21<01:11, 23.68s/it][12-31 10:33:45] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/scheduler. avail mem: 150.61 GB
[12-31 10:33:45] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 150.61 GB
[12-31 10:33:45] Loading image_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_encoder. avail mem: 150.61 GB
[12-31 10:33:45] HF model config: {'architectures': ['CLIPVisionModelWithProjection'], 'attention_dropout': 0.0, 'dropout': 0.0, 'hidden_act': 'gelu', 'hidden_size': 1280, 'image_size': 224, 'initializer_factor': 1.0, 'initializer_range': 0.02, 'intermediate_size': 5120, 'layer_norm_eps': 1e-05, 'num_attention_heads': 16, 'num_channels': 3, 'num_hidden_layers': 32, 'patch_size': 14, 'projection_dim': 1024}
[12-31 10:33:45] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:33:45] Cannot use FlashAttention backend because the flash_attn package is not found. Make sure that flash_attn was built and installed (on by default).
[12-31 10:33:45] Using Torch SDPA backend.

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.66it/s]

[12-31 10:33:46] Loaded image_encoder: CLIPVisionModel from customized. model size: 2.35 GB, avail mem: 149.24 GB
Loading required modules:  86%|████████▌ | 6/7 [02:21<00:12, 12.65s/it][12-31 10:33:46] Loading image_processor from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_processor. avail mem: 149.24 GB
[12-31 10:33:46] Loaded image_processor: CLIPImageProcessorFast from customized. model size: 0.00 GB, avail mem: 149.24 GB
Loading required modules: 100%|██████████| 7/7 [02:21<00:00, 20.24s/it]
[12-31 10:33:46] Creating pipeline stages...
[12-31 10:33:46] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:33:46] Using AITer backend on ROCm.
[12-31 10:33:46] Pipeline instantiated
[12-31 10:33:46] Worker 0: Initialized device, model, and distributed environment.
[12-31 10:33:46] Worker 0: Scheduler loop started.
[12-31 10:33:46] Adjusting number of frames from 81 to 85 based on number of GPUs (2)
[12-31 10:33:46] Processing prompt 1/1: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred f
[12-31 10:33:46] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 85
                      prompt: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 5.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: ['https://github.com/Wan-Video/Wan2.2/blob/990af50de458c19590c245151197326e208d7191/examples/i2v_input.JPG?raw=true']
                 save_output: True
            output_file_path: outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-103346_d2038859.mp4
        
[12-31 10:33:46] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'image_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'image_latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 10:33:46] [InputValidationStage] started...
[12-31 10:33:47] [InputValidationStage] finished in 0.6357 seconds
[12-31 10:33:47] [TextEncodingStage] started...
[12-31 10:33:56] [TextEncodingStage] finished in 8.9319 seconds
[12-31 10:33:56] [ImageEncodingStage] started...
[12-31 10:33:57] [ImageEncodingStage] finished in 1.1375 seconds
[12-31 10:33:57] [ConditioningStage] started...
[12-31 10:33:57] [ConditioningStage] finished in 0.0001 seconds
[12-31 10:33:57] [TimestepPreparationStage] started...
[12-31 10:33:57] [TimestepPreparationStage] finished in 0.0007 seconds
[12-31 10:33:57] [LatentPreparationStage] started...
[12-31 10:33:57] [LatentPreparationStage] finished in 0.0003 seconds
[12-31 10:33:57] [ImageVAEEncodingStage] started...
[12-31 10:35:15] [ImageVAEEncodingStage] finished in 78.0605 seconds
[12-31 10:35:15] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s]MIOpen(HIP): Error [Init] Not found :30-DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<256, 128, 128, 64, Default, 32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3>
MIOpen(HIP): Error [Init] Not found :30-DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<256, 128, 128, 64, Default, 32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3>
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 10:35:19] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 10:35:19] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 10:35:20] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 10:35:20] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [04:00<00:00,  4.80s/it]
[12-31 10:39:15] [DenoisingStage] average time per step: 4.8006 seconds
[12-31 10:39:15] [DenoisingStage] finished in 240.0356 seconds
[12-31 10:39:15] [DecodingStage] started...
[12-31 10:40:15] [DecodingStage] finished in 60.2300 seconds
[12-31 10:40:15] Peak GPU memory: 42.32 GB, Remaining GPU memory at peak: 149.67 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer', 'image_encoder']
[12-31 10:40:19] Output saved to outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-103346_d2038859.mp4
[12-31 10:40:19] Pixel data generated successfully in 392.65 seconds
[12-31 10:40:19] Completed batch processing. Generated 1 outputs in 392.65 seconds.
[12-31 10:40:19] Memory usage - Max peak: 43333.40 MB, Avg peak: 43333.40 MB
[12-31 10:40:19] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Log after this change:

[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 11:58:38 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 11:58:38] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 11:58:38] server_args: {"model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30091, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5565, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 11:58:38] Local mode: True
[12-31 11:58:38] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 11:58:45 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 11:58:46] Scheduler bind at endpoint: tcp://localhost:5565
[12-31 11:58:46] Initializing distributed environment with world_size=1, device=cuda:0
[12-31 11:58:51] Downloaded model_index.json for Wan-AI/Wan2.1-T2V-1.3B-Diffusers, pipeline: WanPipeline
[12-31 11:58:51] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_pipeline.WanPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanT2V_1_3B_SamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanT2V480PConfig'>)
[12-31 11:58:51] Loading pipeline modules...
[12-31 11:58:51] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-T2V-1.3B-Diffusers...
[12-31 11:58:51] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 11:58:51] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 11:58:51] Diffusers version: 0.33.0.dev0
[12-31 11:58:51] Loading pipeline modules from config: {'_class_name': 'WanPipeline', '_diffusers_version': '0.33.0.dev0', 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 11:58:51] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|          | 0/5 [00:00<?, ?it/s][12-31 11:58:51] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/text_encoder. avail mem: 184.91 GB
[12-31 11:58:51] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}
[12-31 11:59:00] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 8.69s, 2.4 GiB/s
[12-31 12:00:28] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 183.84 GB
Loading required modules:  20%|██        | 1/5 [01:37<06:29, 97.46s/it][12-31 12:00:28] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/tokenizer. avail mem: 183.84 GB
[12-31 12:00:29] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 183.84 GB
Loading required modules:  40%|████      | 2/5 [01:37<02:01, 40.34s/it][12-31 12:00:29] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/vae. avail mem: 183.84 GB
[12-31 12:00:29] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 12:00:29] Loaded vae: AutoencoderKLWan from customized. model size: 0.27 GB, avail mem: 183.84 GB
[12-31 12:00:29] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/transformer. avail mem: 183.84 GB
[12-31 12:00:29] Loading WanTransformer3DModel from 2 safetensors files, default_dtype: torch.bfloat16
[12-31 12:00:29] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:00:29] Using AITer backend on ROCm.
[12-31 12:00:31] [RunAI Streamer] Overall time to stream 5.3 GiB of all files to cpu: 2.18s, 2.4 GiB/s
[12-31 12:00:32] Loaded model with 1.42B parameters
[12-31 12:00:32] Loaded transformer: WanTransformer3DModel from customized. model size: 2.64 GB, avail mem: 181.00 GB
Loading required modules:  80%|████████  | 4/5 [01:40<00:15, 15.94s/it][12-31 12:00:32] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/scheduler. avail mem: 181.00 GB
[12-31 12:00:32] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 181.00 GB
Loading required modules: 100%|██████████| 5/5 [01:40<00:00, 20.13s/it]
[12-31 12:00:32] Creating pipeline stages...
[12-31 12:00:32] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:00:32] Using AITer backend on ROCm.
[12-31 12:00:32] Pipeline instantiated
[12-31 12:00:32] Worker 0: Initialized device, model, and distributed environment.
[12-31 12:00:32] Worker 0: Scheduler loop started.
[12-31 12:00:32] Processing prompt 1/1: A curious raccoon
[12-31 12:00:32] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 81
                      prompt: A curious raccoon
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 3.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_curious_raccoon_20251231-120032_483bc3b0.mp4
        
[12-31 12:00:32] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 12:00:32] [InputValidationStage] started...
[12-31 12:00:32] [InputValidationStage] finished in 0.0001 seconds
[12-31 12:00:32] [TextEncodingStage] started...
[12-31 12:00:39] [TextEncodingStage] finished in 7.0431 seconds
[12-31 12:00:39] [ConditioningStage] started...
[12-31 12:00:39] [ConditioningStage] finished in 0.0000 seconds
[12-31 12:00:39] [TimestepPreparationStage] started...
[12-31 12:00:39] [TimestepPreparationStage] finished in 0.0004 seconds
[12-31 12:00:39] [LatentPreparationStage] started...
[12-31 12:00:39] [LatentPreparationStage] finished in 0.0041 seconds
[12-31 12:00:39] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s][aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 12:00:42] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 12:00:42] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [01:29<00:00,  1.79s/it]
[12-31 12:02:08] [DenoisingStage] average time per step: 1.7908 seconds
[12-31 12:02:08] [DenoisingStage] finished in 89.5441 seconds
[12-31 12:02:08] [DecodingStage] started...
[12-31 12:02:15] [DecodingStage] finished in 6.7211 seconds
[12-31 12:02:15] Peak GPU memory: 11.84 GB, Remaining GPU memory at peak: 180.14 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer']
[12-31 12:02:20] Output saved to outputs/A_curious_raccoon_20251231-120032_483bc3b0.mp4
[12-31 12:02:20] Pixel data generated successfully in 108.11 seconds
[12-31 12:02:20] Completed batch processing. Generated 1 outputs in 108.12 seconds.
[12-31 12:02:20] Memory usage - Max peak: 12129.25 MB, Avg peak: 12129.25 MB
[12-31 12:02:20] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 12:02:30 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 12:02:30] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 12:02:30] server_args: {"model_path": "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 2, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": true, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 2, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30066, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5626, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 12:02:30] Local mode: True
[12-31 12:02:30] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 12:02:38 [__init__.py:241] Automatically detected platform rocm.
INFO 12-31 12:02:38 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 12:02:38] Scheduler bind at endpoint: tcp://localhost:5626
[12-31 12:02:38] Initializing distributed environment with world_size=2, device=cuda:0
[12-31 12:02:42] Found nccl from library librccl.so.1
[12-31 12:02:42] sglang-diffusion is using nccl==2.26.6
[12-31 12:02:43] Found nccl from library librccl.so.1
[12-31 12:02:43] sglang-diffusion is using nccl==2.26.6
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s][12-31 12:02:46] Downloaded model_index.json for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, pipeline: WanImageToVideoPipeline
[12-31 12:02:46] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_i2v_pipeline.WanImageToVideoPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanI2V_14B_480P_SamplingParam'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanI2V480PConfig'>)
[12-31 12:02:46] Loading pipeline modules...
[12-31 12:02:46] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers...
[12-31 12:02:46] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 12:02:46] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 12:02:46] Diffusers version: 0.33.0.dev0
[12-31 12:02:46] Loading pipeline modules from config: {'_class_name': 'WanImageToVideoPipeline', '_diffusers_version': '0.33.0.dev0', 'image_encoder': ['transformers', 'CLIPVisionModelWithProjection'], 'image_processor': ['transformers', 'CLIPImageProcessor'], 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 12:02:46] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler', 'image_encoder', 'image_processor']
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s][12-31 12:02:46] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/text_encoder. avail mem: 183.63 GB
[12-31 12:02:46] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}
[12-31 12:02:58] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 12.0s, 1.8 GiB/s
[12-31 12:02:59] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 13.45s, 1.6 GiB/s
[12-31 12:05:02] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 182.34 GB
Loading required modules:  14%|█▍        | 1/7 [02:16<13:38, 136.47s/it][12-31 12:05:02] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/tokenizer. avail mem: 182.34 GB
[12-31 12:05:03] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 182.34 GB
Loading required modules:  29%|██▊       | 2/7 [02:17<04:42, 56.51s/it] [12-31 12:05:03] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/vae. avail mem: 182.34 GB
[12-31 12:05:03] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 12:05:03] Loaded vae: AutoencoderKLWan from customized. model size: 0.47 GB, avail mem: 182.34 GB
[12-31 12:05:03] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/transformer. avail mem: 182.34 GB
[12-31 12:05:03] Loading WanTransformer3DModel from 14 safetensors files, default_dtype: torch.bfloat16
[12-31 12:05:03] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:05:03] Using AITer backend on ROCm.
Loading required modules:  29%|██▊       | 2/7 [02:20<04:49, 57.87s/it] [12-31 12:06:22] [RunAI Streamer] Overall time to stream 61.1 GiB of all files to cpu: 79.09s, 790.8 MiB/s
[12-31 12:06:24] [RunAI Streamer] Overall time to stream 61.1 GiB of all files to cpu: 77.61s, 805.8 MiB/s
[12-31 12:06:39] Loaded model with 16.40B parameters
[12-31 12:06:39] Loaded transformer: WanTransformer3DModel from customized. model size: 30.54 GB, avail mem: 149.60 GB
Loading required modules:  57%|█████▋    | 4/7 [03:53<02:33, 51.19s/it][12-31 12:06:39] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/scheduler. avail mem: 149.60 GB
[12-31 12:06:39] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 149.60 GB
[12-31 12:06:39] Loading image_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_encoder. avail mem: 149.60 GB
[12-31 12:06:39] HF model config: {'architectures': ['CLIPVisionModelWithProjection'], 'attention_dropout': 0.0, 'dropout': 0.0, 'hidden_act': 'gelu', 'hidden_size': 1280, 'image_size': 224, 'initializer_factor': 1.0, 'initializer_range': 0.02, 'intermediate_size': 5120, 'layer_norm_eps': 1e-05, 'num_attention_heads': 16, 'num_channels': 3, 'num_hidden_layers': 32, 'patch_size': 14, 'projection_dim': 1024}
[12-31 12:06:39] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:06:39] Cannot use FlashAttention backend because the flash_attn package is not found. Make sure that flash_attn was built and installed (on by default).
[12-31 12:06:39] Using Torch SDPA backend.
Loading required modules:  57%|█████▋    | 4/7 [03:54<02:33, 51.19s/it][12-31 12:06:42] [RunAI Streamer] Overall time to stream 1.2 GiB of all files to cpu: 1.09s, 1.1 GiB/s
[12-31 12:06:42] [RunAI Streamer] Overall time to stream 1.2 GiB of all files to cpu: 2.86s, 422.0 MiB/s
[12-31 12:06:42] Loaded image_encoder: CLIPVisionModel from customized. model size: 2.35 GB, avail mem: 148.77 GB
Loading required modules:  86%|████████▌ | 6/7 [03:55<00:27, 27.69s/it][12-31 12:06:42] Loading image_processor from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_processor. avail mem: 148.77 GB
[12-31 12:06:42] Loaded image_processor: CLIPImageProcessorFast from customized. model size: 0.00 GB, avail mem: 148.77 GB
Loading required modules: 100%|██████████| 7/7 [03:55<00:00, 33.71s/it]
Loading required modules: 100%|██████████| 7/7 [03:55<00:00, 33.71s/it]
[12-31 12:06:42] Creating pipeline stages...
[12-31 12:06:42] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:06:42] Using AITer backend on ROCm.
[12-31 12:06:42] Pipeline instantiated
[12-31 12:06:42] Worker 0: Initialized device, model, and distributed environment.
[12-31 12:06:42] Worker 0: Scheduler loop started.
[12-31 12:06:42] Adjusting number of frames from 81 to 85 based on number of GPUs (2)
[12-31 12:06:42] Processing prompt 1/1: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred f
[12-31 12:06:42] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 85
                      prompt: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 5.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: ['https://github.com/Wan-Video/Wan2.2/blob/990af50de458c19590c245151197326e208d7191/examples/i2v_input.JPG?raw=true']
                 save_output: True
            output_file_path: outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-120642_d2038859.mp4
        
[12-31 12:06:42] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'image_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'image_latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 12:06:42] [InputValidationStage] started...
[12-31 12:06:42] [InputValidationStage] finished in 0.5580 seconds
[12-31 12:06:42] [TextEncodingStage] started...
[12-31 12:06:51] [TextEncodingStage] finished in 8.6503 seconds
[12-31 12:06:51] [ImageEncodingStage] started...
[12-31 12:06:52] [ImageEncodingStage] finished in 0.9449 seconds
[12-31 12:06:52] [ConditioningStage] started...
[12-31 12:06:52] [ConditioningStage] finished in 0.0000 seconds
[12-31 12:06:52] [TimestepPreparationStage] started...
[12-31 12:06:52] [TimestepPreparationStage] finished in 0.0007 seconds
[12-31 12:06:52] [LatentPreparationStage] started...
[12-31 12:06:52] [LatentPreparationStage] finished in 0.0002 seconds
[12-31 12:06:52] [ImageVAEEncodingStage] started...
[12-31 12:06:57] [ImageVAEEncodingStage] finished in 5.5479 seconds
[12-31 12:06:57] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s][aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 12:07:02] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 12:07:02] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 12:07:02] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 12:07:02] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [03:59<00:00,  4.79s/it]
[12-31 12:10:57] [DenoisingStage] average time per step: 4.7867 seconds
[12-31 12:10:57] [DenoisingStage] finished in 239.3369 seconds
[12-31 12:10:57] [DecodingStage] started...
[12-31 12:11:05] [DecodingStage] finished in 8.4089 seconds
[12-31 12:11:05] Peak GPU memory: 40.90 GB, Remaining GPU memory at peak: 151.08 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer', 'image_encoder']
[12-31 12:11:09] Output saved to outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-120642_d2038859.mp4
[12-31 12:11:09] Pixel data generated successfully in 266.92 seconds
[12-31 12:11:09] Completed batch processing. Generated 1 outputs in 266.92 seconds.
[12-31 12:11:09] Memory usage - Max peak: 41886.43 MB, Avg peak: 41886.43 MB
[12-31 12:11:09] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added dependencies Pull requests that update a dependency file diffusion SGLang Diffusion labels Dec 31, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @ZiguanWang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the compatibility and performance of diffusion models on AMD/ROCm platforms. It achieves this by updating and aligning critical dependencies, integrating an external fix for torch_cpp_ext issues, and enabling a newer, compatible version of runai_model_streamer that was previously problematic on ROCm. These changes aim to provide a more stable and efficient experience for users leveraging AMD hardware.

Highlights

  • AMD Dependency Alignment: The diffusion dependencies within pyproject_other.toml for AMD platforms have been updated and aligned with the main pyproject.toml to ensure consistency and compatibility.
  • Yunchang Integration for ROCm Fix: The local yunchang implementation has been replaced by an external dependency from a specific GitHub commit (feifeibear/long-context-attention), which specifically addresses and fixes issues related to torch_cpp_ext._get_cuda_arch_flags() on ROCm.
  • RunAI Model Streamer Update and Re-enablement: The runai_model_streamer dependency has been pinned to version 0.15.3. The previous explicit disablement of this streamer for AMD/ROCm platforms has been removed, indicating that the updated version resolves prior compatibility issues.
  • General Dependency Updates: The diffusers library has been updated to version 0.36.0, and cache-dit==1.1.8 has been added to the diffusion dependencies.
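
Taken together, the highlights above amount to pinning a small dependency group. A hypothetical sketch of what the aligned diffusion extras could look like in pyproject_other.toml (the version pins come from this PR's description; the group name and layout are illustrative, not the file's actual contents):

```toml
# Illustrative only: pins are from the PR description, layout is assumed.
[project.optional-dependencies]
diffusion_hip = [
    "diffusers==0.36.0",
    "cache-dit==1.1.8",
    "runai_model_streamer==0.15.3",
    # pinned commit fixes torch_cpp_ext._get_cuda_arch_flags() on ROCm
    "yunchang @ git+https://github.com/feifeibear/long-context-attention.git@b192e97",
]
```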


Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aligns the AMD diffusion dependencies in pyproject_other.toml with the main pyproject.toml, updates several packages, and removes previously vendored code. Specifically, it pins diffusers and runai_model_streamer, and uses a forked long-context-attention to address a specific issue. These changes also enable runai_model_streamer on ROCm, which was previously disabled, and appear to yield significant performance improvements based on the provided logs. The changes are well-motivated and look good. I have one suggestion to improve the maintainability of the dependency list.

@mickqian
Collaborator

mickqian commented Jan 1, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Jan 1, 2026
@ZiguanWang
Contributor Author

I found some errors in the CI tests:

[01-01 02:17:24] [denoising_step_0] Error during execution after 1273.8658 ms: apply_qk_norm: fused inplace QK-norm is not applicable (expected CUDA, contiguous q/k, matching eps, and supported head_dim)
Traceback (most recent call last):
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1011, in forward
    noise_pred = self._predict_noise_with_cfg(
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1262, in _predict_noise_with_cfg
    noise_pred_cond = self._predict_noise(
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1208, in _predict_noise
    return current_model(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 972, in forward
    encoder_hidden_states, hidden_states = block(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 757, in forward
    attn_output = self.attn(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 556, in forward
    img_query, img_key = apply_qk_norm(
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/layers/layernorm.py", line 461, in apply_qk_norm
    raise RuntimeError(
RuntimeError: apply_qk_norm: fused inplace QK-norm is not applicable (expected CUDA, contiguous q/k, matching eps, and supported head_dim)

I got the same errors in my local environment before this change. Maybe I can look into fixing these errors after the New Year holiday, @mickqian
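
The failing check above can be read as a set of eligibility conditions for the fused in-place kernel. A minimal, self-contained sketch of such a gate follows; the function name, the supported head-dim set, and the metadata-style arguments are all hypothetical, not SGLang's actual `apply_qk_norm` signature (the real check lives in `sglang/multimodal_gen/runtime/layers/layernorm.py`):

```python
# Hypothetical eligibility gate mirroring the conditions named in the error
# message above: CUDA device, contiguous q/k, matching eps, supported head_dim.
SUPPORTED_HEAD_DIMS = {64, 128}  # assumed set, for illustration only


def fused_qk_norm_applicable(device_type: str,
                             q_contiguous: bool,
                             k_contiguous: bool,
                             q_eps: float,
                             k_eps: float,
                             head_dim: int) -> bool:
    """Return True only if every precondition of the fused kernel holds."""
    return (device_type == "cuda"              # fused path is CUDA/HIP-only
            and q_contiguous and k_contiguous  # kernel normalizes in place
            and q_eps == k_eps                 # one eps shared by both norms
            and head_dim in SUPPORTED_HEAD_DIMS)
```

When any condition fails, a caller would fall back to the unfused path rather than raise, which is roughly what the error message suggests did not happen here.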

@ZiguanWang
Contributor Author

(quoting the CI traceback from the earlier comment)

Resolved by #16287

@mickqian
Collaborator

/rerun-failed-ci

@ZiguanWang
Contributor Author

/rerun-failed-ci

@ZiguanWang
Contributor Author

@mickqian, @yhyang201 Can you help check this?

@mickqian
Collaborator

/rerun-failed-ci

@ZiguanWang ZiguanWang force-pushed the fix_rocm_diffusion branch 2 times, most recently from 04d0ac8 to a488c89 on January 20, 2026 05:39
@ZiguanWang
Contributor Author

Changed dev_hip = ["sglang[all_hip]", "sglang[diffusion_hip]", "sglang[test]"] so that the ci_sglang docker container, which runs scripts/ci/amd_ci_install_dependency.sh to install python[dev_hip], also installs the diffusion_hip dependencies.
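
For context, the composed extra described above could be sketched like this; only the dev_hip line is taken verbatim from this PR, and the surrounding layout is illustrative:

```toml
[project.optional-dependencies]
# Only dev_hip below is quoted from this PR; the other groups are placeholders.
all_hip = ["..."]         # placeholder for the AMD runtime dependencies
diffusion_hip = ["..."]   # placeholder for the aligned diffusion pins
dev_hip = ["sglang[all_hip]", "sglang[diffusion_hip]", "sglang[test]"]
```

With this composition, installing the dev_hip extra transitively resolves the diffusion_hip group, which is why the CI container picks up the aligned diffusion dependencies.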

@ZiguanWang
Contributor Author

@mickqian only one CI test failed, because of a timeout; can you help me rerun it again?

@mickqian
Collaborator

/rerun-failed-ci

@hubertlu-tw
Collaborator

/rerun-failed-ci

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 27, 2026
@github-actions github-actions bot added the amd label Jan 27, 2026
Collaborator

@hubertlu-tw hubertlu-tw left a comment

LGTM

@hubertlu-tw
Collaborator

@yctseng0211 could you please help check the failed AMD tests in this PR? I think they are unrelated to the changes of this PR.

@yctseng0211
Collaborator

@yctseng0211 could you please help check the failed AMD tests in this PR? I think they are unrelated to the changes of this PR.

checking, cc: @bingxche

@yctseng0211
Collaborator

@hubertlu-tw I believe the failed AMD tests are unrelated to this PR; we will fix them in #17633

@ZiguanWang
Contributor Author

@HaiShaw Can you help me merge this PR? thanks

Collaborator

@yctseng0211 yctseng0211 left a comment

LGTM

@HaiShaw HaiShaw merged commit 30adf78 into sgl-project:main Jan 29, 2026
107 of 110 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
@ZiguanWang ZiguanWang deleted the fix_rocm_diffusion branch March 26, 2026 02:51

Labels

amd dependencies Pull requests that update a dependency file diffusion SGLang Diffusion documentation Improvements or additions to documentation run-ci
