[SGLang-Diffusion] Fix custom op fake impl missing eps default for torch.compile#19725
Conversation
Add a default value `eps=1e-5` to the `register_fake` implementations of the `fused_norm_scale_shift` and `fused_scale_residual_norm_scale_shift` custom ops, matching the defaults in the actual `@custom_op` signatures. Made-with: Cursor
BTW, I think our logging after enabling torch.compile looks like this:

(sglang) ➜ python git:(fix/fake-impl-eps-default-for-torch-compile) sglang serve --model-path zai-org/GLM-Image --enable-torch-compile
[2026-03-03 00:37:08] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO serve.py:87: Diffusion model detected
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:09] Disabling some offloading (except dit, text_encoder) for image generation model
[03-03 00:37:09] server_args: {"model_path": "zai-org/GLM-Image", "model_id": null, "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": true, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "master_port": 30075, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5649, "output_path": "outputs/", "input_save_path": "inputs/uploads", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[03-03 00:37:09] Starting server...
[03-03 00:37:16] Scheduler bind at endpoint: tcp://127.0.0.1:5649
[03-03 00:37:16] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
[03-03 00:37:16] Setting distributed timeout to 3600 seconds
[03-03 00:37:17] No pipeline_class_name specified, using model_index.json
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:18] Using pipeline from model_index.json: GlmImagePipeline
[03-03 00:37:18] Loading pipeline modules...
[03-03 00:37:18] Checking for cached model in HF Hub cache for zai-org/GLM-Image...
[03-03 00:37:18] Found complete model in cache at /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Model path: /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Diffusers version: 0.37.0.dev0
[03-03 00:37:18] Loading pipeline modules from config: {'_class_name': 'GlmImagePipeline', '_diffusers_version': '0.37.0.dev0', '_name_or_path': 'zai-org/GLM-Image-Decoder', 'text_encoder': ['transformers', 'T5EncoderModel'], 'vision_language_encoder': ['transformers', 'GlmImageForConditionalGeneration'], 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'tokenizer': ['transformers', 'ByT5Tokenizer'], 'processor': ['transformers', 'GlmImageProcessor'], 'transformer': ['diffusers', 'GlmImageTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[03-03 00:37:18] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'vision_language_encoder', 'processor', 'transformer', 'scheduler']
Loading required modules: 0%| | 0/7 [00:00<?, ?it/s][03-03 00:37:18] Loading text_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/text_encoder. avail mem: 134.48 GB
[03-03 00:37:18] [RunAI Streamer] Overall time to stream 830.3 MiB of all files to cpu: 0.2s, 4.0 GiB/s
[03-03 00:37:19] Loaded text_encoder: FSDPT5EncoderModel (sgl-diffusion version). model size: 0.81 GB, consumed GPU mem: 0.76 GB, avail GPU mem: 133.71 GB
Loading required modules: 14%|██▍ | 1/7 [00:01<00:08, 1.42s/it][03-03 00:37:19] Loading tokenizer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/tokenizer. avail mem: 133.71 GB
[03-03 00:37:19] Loaded tokenizer: ByT5Tokenizer (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 133.71 GB
[03-03 00:37:19] Loading vae from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vae. avail mem: 133.71 GB
[03-03 00:37:20] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.76 GB, consumed GPU mem: 0.77 GB, avail GPU mem: 132.94 GB
Loading required modules: 43%|███████▎ | 3/7 [00:02<00:02, 1.62it/s][03-03 00:37:20] Loading vision_language_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vision_language_encoder. avail mem: 132.94 GB
Loading weights: 100%|█| 1011/1011 [00:00<00:00, 4461.57it/s, Materializing pa
[03-03 00:37:25] Loaded vision_language_encoder: GlmImageForConditionalGeneration (sgl-diffusion version). model size: 18.84 GB, consumed GPU mem: 18.88 GB, avail GPU mem: 114.07 GB
Loading required modules: 57%|█████████▋ | 4/7 [00:06<00:06, 2.07s/it][03-03 00:37:25] Loading processor from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/processor. avail mem: 114.07 GB
The image processor of type `GlmImageImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
[03-03 00:37:26] Loaded processor: GlmImageProcessor (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 71%|████████████▏ | 5/7 [00:08<00:03, 1.88s/it][03-03 00:37:26] Loading transformer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/transformer. avail mem: 114.07 GB
[03-03 00:37:26] Loading GlmImageTransformer2DModel from 3 safetensors file(s) , param_dtype: torch.bfloat16
[03-03 00:37:26] flash_attn 3 package is not installed. It's recommended to install flash_attn3 on hopper, otherwise performance is sub-optimal
[03-03 00:37:26] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:27] [RunAI Streamer] Overall time to stream 12.9 GiB of all files to cpu: 0.98s, 13.2 GiB/s
[03-03 00:37:36] Loaded model with 6.93B parameters
[03-03 00:37:36] Loaded transformer: GlmImageTransformer2DModel (sgl-diffusion version). model size: 12.9 GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 86%|██████████████▌ | 6/7 [00:18<00:04, 4.37s/it][03-03 00:37:36] Loading scheduler from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/scheduler. avail mem: 114.07 GB
[03-03 00:37:36] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 100%|█████████████████| 7/7 [00:18<00:00, 2.58s/it]
[03-03 00:37:36] Creating pipeline stages...
[03-03 00:37:36] Compiling transformer with mode: max-autotune-no-cudagraphs
[03-03 00:37:36] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:36] Pipeline instantiated
[03-03 00:37:36] Worker 0: Initialized device, model, and distributed environment.
[03-03 00:37:36] Worker 0: Scheduler loop started.
[03-03 00:37:36] Starting FastAPI server.
[2026-03-03 00:37:36] INFO: Started server process [970914]
[2026-03-03 00:37:36] INFO: Waiting for application startup.
[03-03 00:37:36] ZMQ Broker is listening for offline jobs on tcp://*:30001
[2026-03-03 00:37:36] INFO: Application startup complete.
[2026-03-03 00:37:36] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:45] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:37:45] [GlmImageBeforeDenoisingStage] started...
[03-03 00:38:11] generate_prior_tokens time: 26.338573455810547
[03-03 00:38:13] [GlmImageBeforeDenoisingStage] finished in 27.4317 seconds
[03-03 00:38:13] [DenoisingStage] started...
0%| | 0/30 [00:00<?, ?it/s]/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `posix.lstat.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>.datetime.now.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=24576
final peak_memory=24576
final peak_memory=24576
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=65536
final peak_memory=65536
final peak_memory=65536
100%|█████████████████████████████████████████| 30/30 [00:23<00:00, 1.30it/s]
[03-03 00:38:36] [DenoisingStage] average time per step: 0.7684 seconds
[03-03 00:38:36] [DenoisingStage] finished in 23.0620 seconds
[03-03 00:38:36] [DecodingStage] started...
[03-03 00:38:36] [DecodingStage] finished in 0.4158 seconds
[03-03 00:38:36] Peak GPU memory: 41.14 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.81 GB (9.3%), Remaining GPU memory at peak: 99.26 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:38:36] Output saved to outputs/8d100b3e-d1ab-4df3-b3cd-f6da622ce1a3.jpg
[03-03 00:38:36] Pixel data generated successfully in 50.96 seconds
[03-03 00:38:36] Completed batch processing. Generated 1 outputs in 50.96 seconds
[03-03 00:38:36] Peak memory usage: 42126.00 MB
[2026-03-03 00:38:36] INFO: 127.0.0.1:33326 - "POST /v1/images/generations HTTP/1.1" 200 OK
[03-03 00:38:47] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:38:47] [GlmImageBeforeDenoisingStage] started...
[03-03 00:39:13] generate_prior_tokens time: 25.920583486557007
[03-03 00:39:13] [GlmImageBeforeDenoisingStage] finished in 25.9709 seconds
[03-03 00:39:13] [DenoisingStage] started...
100%|█████████████████████████████████████████| 30/30 [00:06<00:00, 4.43it/s]
[03-03 00:39:20] [DenoisingStage] average time per step: 0.2255 seconds
[03-03 00:39:20] [DenoisingStage] finished in 6.7674 seconds
[03-03 00:39:20] [DecodingStage] started...
[03-03 00:39:20] [DecodingStage] finished in 0.0743 seconds
[03-03 00:39:20] Peak GPU memory: 41.15 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.82 GB (9.3%), Remaining GPU memory at peak: 99.25 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:39:20] Output saved to outputs/3e918edf-9fd7-429d-b23b-1eb1d67fb208.jpg
[03-03 00:39:20] Pixel data generated successfully in 33.04 seconds
[03-03 00:39:20] Completed batch processing. Generated 1 outputs in 33.04 seconds
[03-03 00:39:20] Peak memory usage: 42134.00 MB
[2026-03-03 00:39:20] INFO: 127.0.0.1:37384 - "POST /v1/images/generations HTTP/1.1" 200 OK

Shall we remove these torch.compile warning logs?
BBuf left a comment:
Looks good. cc @yingluosanqian
Can you check whether there's an environment variable in torch.compile that can control this log?
As suggested by BBuf: we don't actively control the torch.compile logs. We could check whether there are torch.compile-specific environment variables that manage log output. My understanding is that these logs are emitted automatically as soon as `torch.compile(xxx_module)` is called.
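One possible angle, sketched below under assumptions: the repeated lines in the log are formatted like plain Python `UserWarning`s (file, line, `UserWarning:` prefix), which suggests they pass through Python's standard `warnings` machinery rather than a torch-specific logger. If so, an ordinary warnings filter could silence them. The message regexes here are guesses derived from the pasted log, not a documented torch.compile switch.

```python
# Hedged sketch (not a verified torch.compile switch): the repeated
# "Dynamo does not know how to trace..." lines appear to be ordinary
# Python UserWarnings, so the standard warnings filter may suppress
# them. The message patterns below are assumptions based on the log.
import warnings


def silence_dynamo_trace_warnings() -> None:
    """Install ignore-filters for the noisy Dynamo tracing warnings."""
    for pattern in (
        r"Dynamo does not know how to trace the builtin",
        r"Dynamo detected a call to a `functools\.lru_cache`",
    ):
        warnings.filterwarnings("ignore", message=pattern, category=UserWarning)


# Demonstration with a synthetic warning matching the log text:
with warnings.catch_warnings(record=True) as recorded:
    warnings.simplefilter("always")
    silence_dynamo_trace_warnings()
    warnings.warn(
        "Dynamo does not know how to trace the builtin `posix.lstat.`",
        UserWarning,
    )
print("suppressed:", len(recorded) == 0)  # → suppressed: True
```

This would only hide the symptom, of course; fixing the fake-impl signature (as this PR does) removes the underlying graph break.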
yingluosanqian left a comment:
Thanks for fixing it. We always passed eps explicitly before, so the issue didn't appear. In your test it might not have been passed, which triggered the problem. I think this change is correct.
…rch.compile (sgl-project#19725) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Motivation
While testing #18154 with `--enable-torch-compile`, torch.compile's fake-tensor tracing produces repeated `TypeError` traces. These don't cause hard failures (torch.compile falls back to eager), but they spam the log and hurt debuggability.
Modifications
Add an `eps=1e-5` default to both `register_fake` implementations, matching the `@custom_op` signatures:

- `_fused_norm_scale_shift_fake(..., eps)` → `(..., eps=1e-5)`
- `_fused_scale_residual_norm_scale_shift_fake(..., eps)` → `(..., eps=1e-5)`

Test Plan
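The bug class can be illustrated with a minimal, torch-free sketch. The function names mirror the ops in this PR but the bodies are illustrative stand-ins, not the real SGLang kernels: the "real" op gives `eps` a default, while the fake (meta) implementation registered for tracing does not, so a tracer that replays a call made without `eps` hits a `TypeError` in the fake impl only.

```python
# Illustrative sketch of the bug class fixed by this PR (names mirror
# the PR; bodies are stand-ins, not the real SGLang kernels).

def fused_norm_scale_shift(x, scale, shift, eps=1e-5):
    """'Real' op: eps has a default, so calls may omit it."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [((v - mean) / (var + eps) ** 0.5) * s + t
            for v, s, t in zip(x, scale, shift)]

def _fused_norm_scale_shift_fake_buggy(x, scale, shift, eps):
    """Before the fix: eps is required, so replaying a call that
    omitted eps raises TypeError during fake-tensor tracing."""
    return [0.0] * len(x)

def _fused_norm_scale_shift_fake_fixed(x, scale, shift, eps=1e-5):
    """After the fix: the default matches the real signature."""
    return [0.0] * len(x)

x, scale, shift = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

# The real op works without eps; the buggy fake impl does not.
fused_norm_scale_shift(x, scale, shift)
try:
    _fused_norm_scale_shift_fake_buggy(x, scale, shift)
except TypeError as e:
    print("buggy fake impl:", type(e).__name__)  # → buggy fake impl: TypeError

print("fixed fake impl:", _fused_norm_scale_shift_fake_fixed(x, scale, shift))
```

In PyTorch's actual custom-op API the same rule applies: the function decorated with `register_fake` must accept the same calls as the `@custom_op` target, defaults included.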
Running `sglang serve --model-path zai-org/GLM-Image --enable-torch-compile` previously ran into the repeated `TypeError` traces shown in the log above. torch.compile falls back to eager, so generation still succeeds, but the logs are noisy. After this PR, the error no longer appears.