
[diffusion]: align sglang diffusion AMD pyproject_other.toml diffusion dependency with pyproject.toml #16225

Merged
HaiShaw merged 2 commits into sgl-project:main from AMD-AIM:fix_rocm_diffusion on Jan 29, 2026
Conversation

Contributor

@ZiguanWang ZiguanWang commented Dec 31, 2025

Motivation

Align the AMD pyproject_other.toml diffusion dependencies with pyproject.toml.

Modifications

  1. Modify the AMD pyproject_other.toml diffusion dependencies to align with pyproject.toml.
  2. Directly use yunchang@git+https://github.com/feifeibear/long-context-attention.git@b192e97 to fix torch_cpp_ext._get_cuda_arch_flags() (fixes the flashinfer import error on AMD GPUs; see feifeibear/long-context-attention#148).
  3. Pin runai_model_streamer==0.15.3 and validate RUNAI_MODEL_STREAMER (the ROCm Docker default version, 0.11.0, causes problems).
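
For reference, the pinned dependencies described above could be expressed in pyproject_other.toml roughly as follows. This is an illustrative sketch only: the yunchang git pin and the runai_model_streamer version come from this PR, but the table name and surrounding layout are assumptions about the file's structure, not its actual contents.

```toml
# Hypothetical excerpt of the ROCm pyproject_other.toml diffusion extra.
# Only the yunchang commit pin and runai_model_streamer version are taken
# from this PR; the [project.optional-dependencies] layout is assumed.
[project.optional-dependencies]
diffusion = [
    # Pinned commit fixes torch_cpp_ext._get_cuda_arch_flags() on AMD GPUs
    "yunchang @ git+https://github.com/feifeibear/long-context-attention.git@b192e97",
    # ROCm Docker ships 0.11.0 by default, which is known to misbehave
    "runai_model_streamer==0.15.3",
]
```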

Accuracy Tests

sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --prompt "A curious raccoon" --save-output

sglang generate --model-path=Wan-AI/Wan2.1-I2V-14B-480P-Diffusers --prompt="Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." --image-path="https://github.com/Wan-Video/Wan2.2/blob/990af50de458c19590c245151197326e208d7191/examples/i2v_input.JPG?raw=true" --num-gpus 2 --enable-cfg-parallel --save-output

The output videos are the same before and after this change.

Log before this change:

[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:27:00 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:27:01] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 10:27:01] server_args: {"model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30059, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5630, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 10:27:01] Local mode: True
[12-31 10:27:01] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:27:08 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:27:09] Scheduler bind at endpoint: tcp://localhost:5630
[12-31 10:27:09] Initializing distributed environment with world_size=1, device=cuda:0
[12-31 10:27:13] Downloaded model_index.json for Wan-AI/Wan2.1-T2V-1.3B-Diffusers, pipeline: WanPipeline
[12-31 10:27:13] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_pipeline.WanPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanT2V_1_3B_SamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanT2V480PConfig'>)
[12-31 10:27:13] Loading pipeline modules...
[12-31 10:27:13] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-T2V-1.3B-Diffusers...
[12-31 10:27:13] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 10:27:13] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 10:27:13] Diffusers version: 0.33.0.dev0
[12-31 10:27:13] Loading pipeline modules from config: {'_class_name': 'WanPipeline', '_diffusers_version': '0.33.0.dev0', 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 10:27:13] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|          | 0/5 [00:00<?, ?it/s][12-31 10:27:13] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/text_encoder. avail mem: 184.91 GB
[12-31 10:27:13] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}

Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:04,  1.01s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:02<00:04,  1.53s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:05<00:03,  1.81s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:06<00:01,  1.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:09<00:00,  1.84s/it]

[12-31 10:28:22] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 183.84 GB
Loading required modules:  20%|██        | 1/5 [01:09<04:36, 69.11s/it][12-31 10:28:22] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/tokenizer. avail mem: 183.84 GB
[12-31 10:28:23] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 183.84 GB
Loading required modules:  40%|████      | 2/5 [01:09<01:25, 28.67s/it][12-31 10:28:23] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/vae. avail mem: 183.84 GB
[12-31 10:28:23] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 10:28:23] Loaded vae: AutoencoderKLWan from customized. model size: 0.27 GB, avail mem: 183.84 GB
[12-31 10:28:23] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/transformer. avail mem: 183.84 GB
[12-31 10:28:23] Loading WanTransformer3DModel from 2 safetensors files, default_dtype: torch.bfloat16
[12-31 10:28:23] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:28:23] Using AITer backend on ROCm.

Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 106.43it/s]

[12-31 10:28:24] Loaded model with 1.42B parameters
[12-31 10:28:24] Loaded transformer: WanTransformer3DModel from customized. model size: 2.64 GB, avail mem: 181.00 GB
Loading required modules:  80%|████████  | 4/5 [01:10<00:10, 10.93s/it][12-31 10:28:24] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/scheduler. avail mem: 181.00 GB
[12-31 10:28:24] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 181.00 GB
Loading required modules: 100%|██████████| 5/5 [01:10<00:00, 14.04s/it]
[12-31 10:28:24] Creating pipeline stages...
[12-31 10:28:24] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:28:24] Using AITer backend on ROCm.
[12-31 10:28:24] Pipeline instantiated
[12-31 10:28:24] Worker 0: Initialized device, model, and distributed environment.
[12-31 10:28:24] Worker 0: Scheduler loop started.
[12-31 10:28:24] Processing prompt 1/1: A curious raccoon
[12-31 10:28:24] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 81
                      prompt: A curious raccoon
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 3.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_curious_raccoon_20251231-102824_483bc3b0.mp4
        
[12-31 10:28:24] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 10:28:24] [InputValidationStage] started...
[12-31 10:28:24] [InputValidationStage] finished in 0.0001 seconds
[12-31 10:28:24] [TextEncodingStage] started...
[12-31 10:28:32] [TextEncodingStage] finished in 8.9242 seconds
[12-31 10:28:32] [ConditioningStage] started...
[12-31 10:28:32] [ConditioningStage] finished in 0.0000 seconds
[12-31 10:28:32] [TimestepPreparationStage] started...
[12-31 10:28:32] [TimestepPreparationStage] finished in 0.0004 seconds
[12-31 10:28:32] [LatentPreparationStage] started...
[12-31 10:28:32] [LatentPreparationStage] finished in 0.0042 seconds
[12-31 10:28:32] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s][aiter] start build [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/build/module_fmha_v3_fwd
[12-31 10:28:38] start build [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/build/module_fmha_v3_fwd
[aiter] finish build [module_fmha_v3_fwd], cost 38.7s 
[12-31 10:29:17] finish build [module_fmha_v3_fwd], cost 38.7s 
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 10:29:17] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 10:29:17] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [02:09<00:00,  2.59s/it]
[12-31 10:30:42] [DenoisingStage] average time per step: 2.5911 seconds
[12-31 10:30:42] [DenoisingStage] finished in 129.5573 seconds
[12-31 10:30:42] [DecodingStage] started...
[12-31 10:30:48] [DecodingStage] finished in 6.2848 seconds
[12-31 10:30:48] Peak GPU memory: 11.85 GB, Remaining GPU memory at peak: 180.14 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer']
[12-31 10:30:55] Output saved to outputs/A_curious_raccoon_20251231-102824_483bc3b0.mp4
[12-31 10:30:55] Pixel data generated successfully in 151.01 seconds
[12-31 10:30:55] Completed batch processing. Generated 1 outputs in 151.01 seconds.
[12-31 10:30:55] Memory usage - Max peak: 12132.01 MB, Avg peak: 12132.01 MB
[12-31 10:30:55] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:31:04 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:31:04] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 10:31:04] server_args: {"model_path": "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 2, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": true, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 2, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30105, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5655, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 10:31:04] Local mode: True
[12-31 10:31:04] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 10:31:12 [__init__.py:241] Automatically detected platform rocm.
INFO 12-31 10:31:12 [__init__.py:241] Automatically detected platform rocm.
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/sgl-workspace/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 10:31:12] Scheduler bind at endpoint: tcp://localhost:5655
[12-31 10:31:12] Initializing distributed environment with world_size=2, device=cuda:0
[12-31 10:31:18] Found nccl from library librccl.so.1
[12-31 10:31:18] sglang-diffusion is using nccl==2.26.6
[12-31 10:31:19] Found nccl from library librccl.so.1
[12-31 10:31:19] sglang-diffusion is using nccl==2.26.6
[12-31 10:31:24] Downloaded model_index.json for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, pipeline: WanImageToVideoPipeline
[12-31 10:31:24] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_i2v_pipeline.WanImageToVideoPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanI2V_14B_480P_SamplingParam'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanI2V480PConfig'>)
[12-31 10:31:24] Loading pipeline modules...
[12-31 10:31:24] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers...
[12-31 10:31:24] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 10:31:24] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 10:31:24] Diffusers version: 0.33.0.dev0
[12-31 10:31:24] Loading pipeline modules from config: {'_class_name': 'WanImageToVideoPipeline', '_diffusers_version': '0.33.0.dev0', 'image_encoder': ['transformers', 'CLIPVisionModelWithProjection'], 'image_processor': ['transformers', 'CLIPImageProcessor'], 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 10:31:24] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler', 'image_encoder', 'image_processor']
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s][12-31 10:31:24] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/text_encoder. avail mem: 183.63 GB
[12-31 10:31:24] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:06,  1.53s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:04<00:07,  2.65s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:09<00:07,  3.54s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:12<00:03,  3.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:15<00:00,  3.17s/it]

Loading required modules:  29%|██▊       | 2/7 [01:59<04:06, 49.33s/it] [12-31 10:33:33] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 182.56 GB
Loading required modules:  14%|█▍        | 1/7 [02:08<12:48, 128.14s/it][12-31 10:33:33] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/tokenizer. avail mem: 182.56 GB
[12-31 10:33:33] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 182.56 GB
Loading required modules:  29%|██▊       | 2/7 [02:08<04:25, 53.04s/it] [12-31 10:33:33] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/vae. avail mem: 182.56 GB
[12-31 10:33:33] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 10:33:33] Loaded vae: AutoencoderKLWan from customized. model size: 0.47 GB, avail mem: 182.56 GB
[12-31 10:33:33] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/transformer. avail mem: 182.56 GB
[12-31 10:33:33] Loading WanTransformer3DModel from 14 safetensors files, default_dtype: torch.bfloat16
[12-31 10:33:33] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:33:33] Using AITer backend on ROCm.

Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:00<00:00, 377.41it/s]

Loading required modules: 100%|██████████| 7/7 [02:18<00:00, 19.76s/it]
[12-31 10:33:45] Loaded model with 16.40B parameters
[12-31 10:33:45] Loaded transformer: WanTransformer3DModel from customized. model size: 30.54 GB, avail mem: 150.61 GB
Loading required modules:  57%|█████▋    | 4/7 [02:21<01:11, 23.68s/it][12-31 10:33:45] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/scheduler. avail mem: 150.61 GB
[12-31 10:33:45] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 150.61 GB
[12-31 10:33:45] Loading image_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_encoder. avail mem: 150.61 GB
[12-31 10:33:45] HF model config: {'architectures': ['CLIPVisionModelWithProjection'], 'attention_dropout': 0.0, 'dropout': 0.0, 'hidden_act': 'gelu', 'hidden_size': 1280, 'image_size': 224, 'initializer_factor': 1.0, 'initializer_range': 0.02, 'intermediate_size': 5120, 'layer_norm_eps': 1e-05, 'num_attention_heads': 16, 'num_channels': 3, 'num_hidden_layers': 32, 'patch_size': 14, 'projection_dim': 1024}
[12-31 10:33:45] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:33:45] Cannot use FlashAttention backend because the flash_attn package is not found. Make sure that flash_attn was built and installed (on by default).
[12-31 10:33:45] Using Torch SDPA backend.

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.66it/s]

[12-31 10:33:46] Loaded image_encoder: CLIPVisionModel from customized. model size: 2.35 GB, avail mem: 149.24 GB
Loading required modules:  86%|████████▌ | 6/7 [02:21<00:12, 12.65s/it][12-31 10:33:46] Loading image_processor from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_processor. avail mem: 149.24 GB
[12-31 10:33:46] Loaded image_processor: CLIPImageProcessorFast from customized. model size: 0.00 GB, avail mem: 149.24 GB
Loading required modules: 100%|██████████| 7/7 [02:21<00:00, 20.24s/it]
[12-31 10:33:46] Creating pipeline stages...
[12-31 10:33:46] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 10:33:46] Using AITer backend on ROCm.
[12-31 10:33:46] Pipeline instantiated
[12-31 10:33:46] Worker 0: Initialized device, model, and distributed environment.
[12-31 10:33:46] Worker 0: Scheduler loop started.
[12-31 10:33:46] Adjusting number of frames from 81 to 85 based on number of GPUs (2)
[12-31 10:33:46] Processing prompt 1/1: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred f
[12-31 10:33:46] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 85
                      prompt: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 5.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: ['https://github.com/Wan-Video/Wan2.2/blob/990af50de458c19590c245151197326e208d7191/examples/i2v_input.JPG?raw=true']
                 save_output: True
            output_file_path: outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-103346_d2038859.mp4
        
[12-31 10:33:46] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'image_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'image_latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 10:33:46] [InputValidationStage] started...
[12-31 10:33:47] [InputValidationStage] finished in 0.6357 seconds
[12-31 10:33:47] [TextEncodingStage] started...
[12-31 10:33:56] [TextEncodingStage] finished in 8.9319 seconds
[12-31 10:33:56] [ImageEncodingStage] started...
[12-31 10:33:57] [ImageEncodingStage] finished in 1.1375 seconds
[12-31 10:33:57] [ConditioningStage] started...
[12-31 10:33:57] [ConditioningStage] finished in 0.0001 seconds
[12-31 10:33:57] [TimestepPreparationStage] started...
[12-31 10:33:57] [TimestepPreparationStage] finished in 0.0007 seconds
[12-31 10:33:57] [LatentPreparationStage] started...
[12-31 10:33:57] [LatentPreparationStage] finished in 0.0003 seconds
[12-31 10:33:57] [ImageVAEEncodingStage] started...
[12-31 10:35:15] [ImageVAEEncodingStage] finished in 78.0605 seconds
[12-31 10:35:15] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s]MIOpen(HIP): Error [Init] Not found :30-DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<256, 128, 128, 64, Default, 32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3>
MIOpen(HIP): Error [Init] Not found :30-DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<256, 128, 128, 64, Default, 32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3>
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 10:35:19] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 10:35:19] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 10:35:20] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 10:35:20] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [04:00<00:00,  4.80s/it]
[12-31 10:39:15] [DenoisingStage] average time per step: 4.8006 seconds
[12-31 10:39:15] [DenoisingStage] finished in 240.0356 seconds
[12-31 10:39:15] [DecodingStage] started...
[12-31 10:40:15] [DecodingStage] finished in 60.2300 seconds
[12-31 10:40:15] Peak GPU memory: 42.32 GB, Remaining GPU memory at peak: 149.67 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer', 'image_encoder']
[12-31 10:40:19] Output saved to outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-103346_d2038859.mp4
[12-31 10:40:19] Pixel data generated successfully in 392.65 seconds
[12-31 10:40:19] Completed batch processing. Generated 1 outputs in 392.65 seconds.
[12-31 10:40:19] Memory usage - Max peak: 43333.40 MB, Avg peak: 43333.40 MB
[12-31 10:40:19] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Log after this change:

[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 11:58:38 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 11:58:38] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 11:58:38] server_args: {"model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30091, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5565, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 11:58:38] Local mode: True
[12-31 11:58:38] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 11:58:45 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 11:58:46] Scheduler bind at endpoint: tcp://localhost:5565
[12-31 11:58:46] Initializing distributed environment with world_size=1, device=cuda:0
[12-31 11:58:51] Downloaded model_index.json for Wan-AI/Wan2.1-T2V-1.3B-Diffusers, pipeline: WanPipeline
[12-31 11:58:51] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_pipeline.WanPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanT2V_1_3B_SamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanT2V480PConfig'>)
[12-31 11:58:51] Loading pipeline modules...
[12-31 11:58:51] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-T2V-1.3B-Diffusers...
[12-31 11:58:51] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 11:58:51] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b
[12-31 11:58:51] Diffusers version: 0.33.0.dev0
[12-31 11:58:51] Loading pipeline modules from config: {'_class_name': 'WanPipeline', '_diffusers_version': '0.33.0.dev0', 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 11:58:51] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|          | 0/5 [00:00<?, ?it/s][12-31 11:58:51] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/text_encoder. avail mem: 184.91 GB
[12-31 11:58:51] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}
[12-31 11:59:00] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 8.69s, 2.4 GiB/s
[12-31 12:00:28] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 183.84 GB
Loading required modules:  20%|██        | 1/5 [01:37<06:29, 97.46s/it][12-31 12:00:28] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/tokenizer. avail mem: 183.84 GB
[12-31 12:00:29] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 183.84 GB
Loading required modules:  40%|████      | 2/5 [01:37<02:01, 40.34s/it][12-31 12:00:29] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/vae. avail mem: 183.84 GB
[12-31 12:00:29] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 12:00:29] Loaded vae: AutoencoderKLWan from customized. model size: 0.27 GB, avail mem: 183.84 GB
[12-31 12:00:29] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/transformer. avail mem: 183.84 GB
[12-31 12:00:29] Loading WanTransformer3DModel from 2 safetensors files, default_dtype: torch.bfloat16
[12-31 12:00:29] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:00:29] Using AITer backend on ROCm.
[12-31 12:00:31] [RunAI Streamer] Overall time to stream 5.3 GiB of all files to cpu: 2.18s, 2.4 GiB/s
[12-31 12:00:32] Loaded model with 1.42B parameters
[12-31 12:00:32] Loaded transformer: WanTransformer3DModel from customized. model size: 2.64 GB, avail mem: 181.00 GB
Loading required modules:  80%|████████  | 4/5 [01:40<00:15, 15.94s/it][12-31 12:00:32] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-T2V-1.3B-Diffusers/snapshots/0fad780a534b6463e45facd96134c9f345acfa5b/scheduler. avail mem: 181.00 GB
[12-31 12:00:32] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 181.00 GB
Loading required modules: 100%|██████████| 5/5 [01:40<00:00, 20.13s/it]
[12-31 12:00:32] Creating pipeline stages...
[12-31 12:00:32] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:00:32] Using AITer backend on ROCm.
[12-31 12:00:32] Pipeline instantiated
[12-31 12:00:32] Worker 0: Initialized device, model, and distributed environment.
[12-31 12:00:32] Worker 0: Scheduler loop started.
[12-31 12:00:32] Processing prompt 1/1: A curious raccoon
[12-31 12:00:32] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 81
                      prompt: A curious raccoon
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 3.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_curious_raccoon_20251231-120032_483bc3b0.mp4
        
[12-31 12:00:32] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 12:00:32] [InputValidationStage] started...
[12-31 12:00:32] [InputValidationStage] finished in 0.0001 seconds
[12-31 12:00:32] [TextEncodingStage] started...
[12-31 12:00:39] [TextEncodingStage] finished in 7.0431 seconds
[12-31 12:00:39] [ConditioningStage] started...
[12-31 12:00:39] [ConditioningStage] finished in 0.0000 seconds
[12-31 12:00:39] [TimestepPreparationStage] started...
[12-31 12:00:39] [TimestepPreparationStage] finished in 0.0004 seconds
[12-31 12:00:39] [LatentPreparationStage] started...
[12-31 12:00:39] [LatentPreparationStage] finished in 0.0041 seconds
[12-31 12:00:39] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s][aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 12:00:42] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 12:00:42] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [01:29<00:00,  1.79s/it]
[12-31 12:02:08] [DenoisingStage] average time per step: 1.7908 seconds
[12-31 12:02:08] [DenoisingStage] finished in 89.5441 seconds
[12-31 12:02:08] [DecodingStage] started...
[12-31 12:02:15] [DecodingStage] finished in 6.7211 seconds
[12-31 12:02:15] Peak GPU memory: 11.84 GB, Remaining GPU memory at peak: 180.14 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer']
[12-31 12:02:20] Output saved to outputs/A_curious_raccoon_20251231-120032_483bc3b0.mp4
[12-31 12:02:20] Pixel data generated successfully in 108.11 seconds
[12-31 12:02:20] Completed batch processing. Generated 1 outputs in 108.12 seconds.
[12-31 12:02:20] Memory usage - Max peak: 12129.25 MB, Avg peak: 12129.25 MB
[12-31 12:02:20] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 12:02:30 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 12:02:30] Attention backend not specified. Using 'aiter' by default on ROCm to match SGLang SRT defaults.
[12-31 12:02:30] server_args: {"model_path": "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", "attention_backend": "aiter", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 2, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": true, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 2, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30066, "host": null, "port": null, "webui": false, "webui_port": 12312, "scheduler_port": 5626, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[12-31 12:02:30] Local mode: True
[12-31 12:02:30] Starting server...
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
[aiter] import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO:aiter:import [module_aiter_enum] under /sgl-workspace/aiter/aiter/jit/module_aiter_enum.so
INFO 12-31 12:02:38 [__init__.py:241] Automatically detected platform rocm.
INFO 12-31 12:02:38 [__init__.py:241] Automatically detected platform rocm.
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/awq.py:71: UserWarning: HIP does not support fused_marlin_moe currently.
  warnings.warn(f"HIP does not support fused_marlin_moe currently.")
/home/roywan/sglang/python/sglang/srt/layers/quantization/gguf.py:46: UserWarning: Only CUDA support GGUF quantization currently.
  warnings.warn(f"Only CUDA support GGUF quantization currently.")
[12-31 12:02:38] Scheduler bind at endpoint: tcp://localhost:5626
[12-31 12:02:38] Initializing distributed environment with world_size=2, device=cuda:0
[12-31 12:02:42] Found nccl from library librccl.so.1
[12-31 12:02:42] sglang-diffusion is using nccl==2.26.6
[12-31 12:02:43] Found nccl from library librccl.so.1
[12-31 12:02:43] sglang-diffusion is using nccl==2.26.6
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s][12-31 12:02:46] Downloaded model_index.json for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, pipeline: WanImageToVideoPipeline
[12-31 12:02:46] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_i2v_pipeline.WanImageToVideoPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanI2V_14B_480P_SamplingParam'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanI2V480PConfig'>)
[12-31 12:02:46] Loading pipeline modules...
[12-31 12:02:46] Checking for cached model in HF Hub cache for Wan-AI/Wan2.1-I2V-14B-480P-Diffusers...
[12-31 12:02:46] Found complete model in cache at /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 12:02:46] Model path: /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73
[12-31 12:02:46] Diffusers version: 0.33.0.dev0
[12-31 12:02:46] Loading pipeline modules from config: {'_class_name': 'WanImageToVideoPipeline', '_diffusers_version': '0.33.0.dev0', 'image_encoder': ['transformers', 'CLIPVisionModelWithProjection'], 'image_processor': ['transformers', 'CLIPImageProcessor'], 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[12-31 12:02:46] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler', 'image_encoder', 'image_processor']
Loading required modules:   0%|          | 0/7 [00:00<?, ?it/s][12-31 12:02:46] Loading text_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/text_encoder. avail mem: 183.63 GB
[12-31 12:02:46] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}
[12-31 12:02:58] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 12.0s, 1.8 GiB/s
[12-31 12:02:59] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 13.45s, 1.6 GiB/s
[12-31 12:05:02] Loaded text_encoder: FSDPUMT5EncoderModel from customized. model size: 21.16 GB, avail mem: 182.34 GB
Loading required modules:  14%|█▍        | 1/7 [02:16<13:38, 136.47s/it][12-31 12:05:02] Loading tokenizer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/tokenizer. avail mem: 182.34 GB
[12-31 12:05:03] Loaded tokenizer: T5TokenizerFast from customized. model size: 0.00 GB, avail mem: 182.34 GB
Loading required modules:  29%|██▊       | 2/7 [02:17<04:42, 56.51s/it] [12-31 12:05:03] Loading vae from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/vae. avail mem: 182.34 GB
[12-31 12:05:03] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[12-31 12:05:03] Loaded vae: AutoencoderKLWan from customized. model size: 0.47 GB, avail mem: 182.34 GB
[12-31 12:05:03] Loading transformer from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/transformer. avail mem: 182.34 GB
[12-31 12:05:03] Loading WanTransformer3DModel from 14 safetensors files, default_dtype: torch.bfloat16
[12-31 12:05:03] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:05:03] Using AITer backend on ROCm.
Loading required modules:  29%|██▊       | 2/7 [02:20<04:49, 57.87s/it] [12-31 12:06:22] [RunAI Streamer] Overall time to stream 61.1 GiB of all files to cpu: 79.09s, 790.8 MiB/s
[12-31 12:06:24] [RunAI Streamer] Overall time to stream 61.1 GiB of all files to cpu: 77.61s, 805.8 MiB/s
[12-31 12:06:39] Loaded model with 16.40B parameters
[12-31 12:06:39] Loaded transformer: WanTransformer3DModel from customized. model size: 30.54 GB, avail mem: 149.60 GB
Loading required modules:  57%|█████▋    | 4/7 [03:53<02:33, 51.19s/it][12-31 12:06:39] Loading scheduler from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/scheduler. avail mem: 149.60 GB
[12-31 12:06:39] Loaded scheduler: UniPCMultistepScheduler from customized. model size: 0.00 GB, avail mem: 149.60 GB
[12-31 12:06:39] Loading image_encoder from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_encoder. avail mem: 149.60 GB
[12-31 12:06:39] HF model config: {'architectures': ['CLIPVisionModelWithProjection'], 'attention_dropout': 0.0, 'dropout': 0.0, 'hidden_act': 'gelu', 'hidden_size': 1280, 'image_size': 224, 'initializer_factor': 1.0, 'initializer_range': 0.02, 'intermediate_size': 5120, 'layer_norm_eps': 1e-05, 'num_attention_heads': 16, 'num_channels': 3, 'num_hidden_layers': 32, 'patch_size': 14, 'projection_dim': 1024}
[12-31 12:06:39] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:06:39] Cannot use FlashAttention backend because the flash_attn package is not found. Make sure that flash_attn was built and installed (on by default).
[12-31 12:06:39] Using Torch SDPA backend.
Loading required modules:  57%|█████▋    | 4/7 [03:54<02:33, 51.19s/it][12-31 12:06:42] [RunAI Streamer] Overall time to stream 1.2 GiB of all files to cpu: 1.09s, 1.1 GiB/s
[12-31 12:06:42] [RunAI Streamer] Overall time to stream 1.2 GiB of all files to cpu: 2.86s, 422.0 MiB/s
[12-31 12:06:42] Loaded image_encoder: CLIPVisionModel from customized. model size: 2.35 GB, avail mem: 148.77 GB
Loading required modules:  86%|████████▌ | 6/7 [03:55<00:27, 27.69s/it][12-31 12:06:42] Loading image_processor from /home/roywan/model/models--Wan-AI--Wan2.1-I2V-14B-480P-Diffusers/snapshots/b184e23a8a16b20f108f727c902e769e873ffc73/image_processor. avail mem: 148.77 GB
[12-31 12:06:42] Loaded image_processor: CLIPImageProcessorFast from customized. model size: 0.00 GB, avail mem: 148.77 GB
Loading required modules: 100%|██████████| 7/7 [03:55<00:00, 33.71s/it]
Loading required modules: 100%|██████████| 7/7 [03:55<00:00, 33.71s/it]
[12-31 12:06:42] Creating pipeline stages...
[12-31 12:06:42] Trying SGLANG_DIFFUSION_ATTENTION_BACKEND=None
[12-31 12:06:42] Using AITer backend on ROCm.
[12-31 12:06:42] Pipeline instantiated
[12-31 12:06:42] Worker 0: Initialized device, model, and distributed environment.
[12-31 12:06:42] Worker 0: Scheduler loop started.
[12-31 12:06:42] Adjusting number of frames from 81 to 85 based on number of GPUs (2)
[12-31 12:06:42] Processing prompt 1/1: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred f
[12-31 12:06:42] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 85
                      prompt: Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 1024
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 5.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: ['https://github.com/Wan-Video/Wan2.2/blob/990af50de458c19590c245151197326e208d7191/examples/i2v_input.JPG?raw=true']
                 save_output: True
            output_file_path: outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-120642_d2038859.mp4
        
[12-31 12:06:42] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'image_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'image_latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[12-31 12:06:42] [InputValidationStage] started...
[12-31 12:06:42] [InputValidationStage] finished in 0.5580 seconds
[12-31 12:06:42] [TextEncodingStage] started...
[12-31 12:06:51] [TextEncodingStage] finished in 8.6503 seconds
[12-31 12:06:51] [ImageEncodingStage] started...
[12-31 12:06:52] [ImageEncodingStage] finished in 0.9449 seconds
[12-31 12:06:52] [ConditioningStage] started...
[12-31 12:06:52] [ConditioningStage] finished in 0.0000 seconds
[12-31 12:06:52] [TimestepPreparationStage] started...
[12-31 12:06:52] [TimestepPreparationStage] finished in 0.0007 seconds
[12-31 12:06:52] [LatentPreparationStage] started...
[12-31 12:06:52] [LatentPreparationStage] finished in 0.0002 seconds
[12-31 12:06:52] [ImageVAEEncodingStage] started...
[12-31 12:06:57] [ImageVAEEncodingStage] finished in 5.5479 seconds
[12-31 12:06:57] [DenoisingStage] started...
  0%|          | 0/50 [00:00<?, ?it/s][aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 12:07:02] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 12:07:02] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[aiter] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[12-31 12:07:02] import [module_fmha_v3_fwd] under /sgl-workspace/aiter/aiter/jit/module_fmha_v3_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
[12-31 12:07:02] type hints mismatch, override to --> fmha_v3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float, softmax_scale: float, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None) -> List[torch.Tensor]
100%|██████████| 50/50 [03:59<00:00,  4.79s/it]
[12-31 12:10:57] [DenoisingStage] average time per step: 4.7867 seconds
[12-31 12:10:57] [DenoisingStage] finished in 239.3369 seconds
[12-31 12:10:57] [DecodingStage] started...
[12-31 12:11:05] [DecodingStage] finished in 8.4089 seconds
[12-31 12:11:05] Peak GPU memory: 40.90 GB, Remaining GPU memory at peak: 151.08 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer', 'image_encoder']
[12-31 12:11:09] Output saved to outputs/Summer_beach_vacation_style_a_white_cat_wearing_sunglasses_sits_on_a_surfboard._The_fluffy-furred_f_20251231-120642_d2038859.mp4
[12-31 12:11:09] Pixel data generated successfully in 266.92 seconds
[12-31 12:11:09] Completed batch processing. Generated 1 outputs in 266.92 seconds.
[12-31 12:11:09] Memory usage - Max peak: 41886.43 MB, Avg peak: 41886.43 MB
[12-31 12:11:09] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added dependencies Pull requests that update a dependency file diffusion SGLang Diffusion labels Dec 31, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @ZiguanWang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the compatibility and performance of diffusion models on AMD/ROCm platforms. It achieves this by updating and aligning critical dependencies, integrating an external fix for torch_cpp_ext issues, and enabling a newer, compatible version of runai_model_streamer that was previously problematic on ROCm. These changes aim to provide a more stable and efficient experience for users leveraging AMD hardware.

Highlights

  • AMD Dependency Alignment: The diffusion dependencies within pyproject_other.toml for AMD platforms have been updated and aligned with the main pyproject.toml to ensure consistency and compatibility.
  • Yunchang Integration for ROCm Fix: The local yunchang implementation has been replaced by an external dependency from a specific GitHub commit (feifeibear/long-context-attention), which specifically addresses and fixes issues related to torch_cpp_ext._get_cuda_arch_flags() on ROCm.
  • RunAI Model Streamer Update and Re-enablement: The runai_model_streamer dependency has been pinned to version 0.15.3. The previous explicit disablement of this streamer for AMD/ROCm platforms has been removed, indicating that the updated version resolves prior compatibility issues.
  • General Dependency Updates: The diffusers library has been updated to version 0.36.0, and cache-dit==1.1.8 has been added to the diffusion dependencies.
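
Taken together, the highlights above amount to pinning a small dependency group. A hypothetical sketch of what the aligned diffusion extras could look like in pyproject_other.toml (the version pins come from this PR's description; the group name and layout are illustrative, not the file's actual contents):

```toml
# Illustrative only: pins are from the PR description, layout is assumed.
[project.optional-dependencies]
diffusion_hip = [
    "diffusers==0.36.0",
    "cache-dit==1.1.8",
    "runai_model_streamer==0.15.3",
    # pinned commit fixes torch_cpp_ext._get_cuda_arch_flags() on ROCm
    "yunchang @ git+https://github.com/feifeibear/long-context-attention.git@b192e97",
]
```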


Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aligns the AMD diffusion dependencies in pyproject_other.toml with the main pyproject.toml, updates several packages, and removes previously vendored code. Specifically, it pins diffusers and runai_model_streamer, and uses a forked long-context-attention to address a specific issue. These changes also enable runai_model_streamer on ROCm, which was previously disabled, and appear to yield significant performance improvements based on the provided logs. The changes are well-motivated and look good. I have one suggestion to improve the maintainability of the dependency list.

@mickqian
Collaborator

mickqian commented Jan 1, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Jan 1, 2026
@ZiguanWang
Contributor Author

I found some errors in the CI tests:

[01-01 02:17:24] [denoising_step_0] Error during execution after 1273.8658 ms: apply_qk_norm: fused inplace QK-norm is not applicable (expected CUDA, contiguous q/k, matching eps, and supported head_dim)
Traceback (most recent call last):
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1011, in forward
    noise_pred = self._predict_noise_with_cfg(
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1262, in _predict_noise_with_cfg
    noise_pred_cond = self._predict_noise(
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1208, in _predict_noise
    return current_model(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 972, in forward
    encoder_hidden_states, hidden_states = block(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 757, in forward
    attn_output = self.attn(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 556, in forward
    img_query, img_key = apply_qk_norm(
  File "/sglang-checkout/python/sglang/multimodal_gen/runtime/layers/layernorm.py", line 461, in apply_qk_norm
    raise RuntimeError(
RuntimeError: apply_qk_norm: fused inplace QK-norm is not applicable (expected CUDA, contiguous q/k, matching eps, and supported head_dim)

I got the same errors in my local environment before this change. Maybe I can look into fixing these errors after the New Year holiday, @mickqian
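
The failing check above can be read as a set of eligibility conditions for the fused in-place kernel. A minimal, self-contained sketch of such a gate follows; the function name, the supported head-dim set, and the metadata-style arguments are all hypothetical, not SGLang's actual `apply_qk_norm` signature (the real check lives in `sglang/multimodal_gen/runtime/layers/layernorm.py`):

```python
# Hypothetical eligibility gate mirroring the conditions named in the error
# message above: CUDA device, contiguous q/k, matching eps, supported head_dim.
SUPPORTED_HEAD_DIMS = {64, 128}  # assumed set, for illustration only


def fused_qk_norm_applicable(device_type: str,
                             q_contiguous: bool,
                             k_contiguous: bool,
                             q_eps: float,
                             k_eps: float,
                             head_dim: int) -> bool:
    """Return True only if every precondition of the fused kernel holds."""
    return (device_type == "cuda"              # fused path is CUDA/HIP-only
            and q_contiguous and k_contiguous  # kernel normalizes in place
            and q_eps == k_eps                 # one eps shared by both norms
            and head_dim in SUPPORTED_HEAD_DIMS)
```

When any condition fails, a caller would fall back to the unfused path rather than raise, which is roughly what the error message suggests did not happen here.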

@ZiguanWang
Contributor Author

(quoting the CI traceback from the earlier comment)

Resolved by #16287

@mickqian
Collaborator

/rerun-failed-ci

@ZiguanWang
Contributor Author

/rerun-failed-ci

@ZiguanWang
Contributor Author

@mickqian, @yhyang201 Can you help check this?

@mickqian
Collaborator

/rerun-failed-ci

@ZiguanWang ZiguanWang force-pushed the fix_rocm_diffusion branch 2 times, most recently from 04d0ac8 to a488c89 on January 20, 2026 05:39
@ZiguanWang
Contributor Author

Changed dev_hip = ["sglang[all_hip]", "sglang[diffusion_hip]", "sglang[test]"] so that the ci_sglang docker container, which runs scripts/ci/amd_ci_install_dependency.sh to install python[dev_hip], also installs the diffusion_hip dependencies.
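
For context, the composed extra described above could be sketched like this; only the dev_hip line is taken verbatim from this PR, and the surrounding layout is illustrative:

```toml
[project.optional-dependencies]
# Only dev_hip below is quoted from this PR; the other groups are placeholders.
all_hip = ["..."]         # placeholder for the AMD runtime dependencies
diffusion_hip = ["..."]   # placeholder for the aligned diffusion pins
dev_hip = ["sglang[all_hip]", "sglang[diffusion_hip]", "sglang[test]"]
```

With this composition, installing the dev_hip extra transitively resolves the diffusion_hip group, which is why the CI container picks up the aligned diffusion dependencies.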

@ZiguanWang
Contributor Author

@mickqian only one CI test failed, because of a timeout; can you help me rerun it again?

@mickqian
Collaborator

/rerun-failed-ci

@hubertlu-tw
Collaborator

/rerun-failed-ci

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 27, 2026
@github-actions github-actions bot added the amd label Jan 27, 2026
Collaborator

@hubertlu-tw hubertlu-tw left a comment

LGTM

@hubertlu-tw
Collaborator

@yctseng0211 could you please help check the failed AMD tests in this PR? I think they are unrelated to the changes of this PR.

@yctseng0211
Collaborator

@yctseng0211 could you please help check the failed AMD tests in this PR? I think they are unrelated to the changes of this PR.

checking, cc: @bingxche

@yctseng0211
Collaborator

@hubertlu-tw I believe the failed AMD tests are unrelated to this PR; we will fix them in #17633

@ZiguanWang
Contributor Author

@HaiShaw Can you help me merge this PR? thanks

Collaborator

@yctseng0211 yctseng0211 left a comment

LGTM

@HaiShaw HaiShaw merged commit 30adf78 into sgl-project:main Jan 29, 2026
107 of 110 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…n dependency with pyproject.toml (sgl-project#16225)

Co-authored-by: roywang <roywang@amd.com>
@ZiguanWang ZiguanWang deleted the fix_rocm_diffusion branch March 26, 2026 02:51

Labels

amd dependencies Pull requests that update a dependency file diffusion SGLang Diffusion documentation Improvements or additions to documentation run-ci
