[SGLang-Diffusion] Fix custom op fake impl missing eps default for torch.compile#19725

Merged
BBuf merged 2 commits into main from fix/fake-impl-eps-default-for-torch-compile on Mar 3, 2026

Conversation

@zhaochenyang20 (Collaborator) commented Mar 3, 2026

Motivation

While testing #18154 with --enable-torch-compile, torch.compile's fake tensor tracing produces repeated TypeError traces:

TypeError: fused_scale_residual_norm_scale_shift_fake() missing 1 required positional argument: 'eps'

These don't cause hard failures (torch.compile falls back to eager), but they spam the log and hurt debuggability.

Modifications

Add an eps=1e-5 default to both register_fake implementations, matching the @custom_op signatures:

  • _fused_norm_scale_shift_fake(..., eps) → (..., eps=1e-5)
  • _fused_scale_residual_norm_scale_shift_fake(..., eps) → (..., eps=1e-5)

Test Plan

  1. Before the PR, running:
cd python
uv pip install -e ".[diffusion]"
# This is required for GLM
pip install --upgrade transformers

sglang serve --model-path zai-org/GLM-Image --enable-torch-compile

# Then send a request:

curl -sS -X POST "http://localhost:30000/v1/images/generations" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A calico cat playing a piano on stage",
        "size": "1024x1024",
        "n": 1,
        "response_format": "b64_json"
      }'

This would run into:

SingleProcess AUTOTUNE benchmarking takes 0.6957 seconds and 0.3702 seconds precompiling for 20 choices
final peak_memory=33554432
final peak_memory=33554432
final peak_memory=33554432
final peak_memory=24576
final peak_memory=24576
final peak_memory=24576
final peak_memory=0
final peak_memory=0
final peak_memory=0
final peak_memory=0
final peak_memory=0
final peak_memory=0
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>.datetime.now.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
final peak_memory=0
final peak_memory=0
... (repeated)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0] fake tensor raised TypeError
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0] Traceback (most recent call last):
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]   File "/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1376, in __torch_dispatch__
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]     return self.dispatch(func, types, args, kwargs)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]   File "/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2096, in dispatch
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]     return self._cached_dispatch_impl(func, types, args, kwargs)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]   File "/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1511, in _cached_dispatch_impl
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]     output = self._dispatch_impl(func, types, args, kwargs)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]   File "/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2696, in _dispatch_impl
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]     result = maybe_fake_impl(*args, **kwargs)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]   File "/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_library/utils.py", line 22, in __call__
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]     return self.func(*args, **kwargs)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]   File "/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/library.py", line 1430, in inner
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]     return func(*args, **kwargs)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]   File "/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 627, in fake_impl
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]     return self._abstract_fn(*args, **kwargs)
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:E0303 00:33:59.254000 965504 torch/_subclasses/fake_tensor.py:1378] [48/0] TypeError: _fused_scale_residual_norm_scale_shift_fake() missing 1 required positional argument: 'eps'
... (the identical TypeError traceback repeats for subsequent graphs [49/0], [50/0], [51/0], ...)

Indeed, torch.compile falls back to eager here, but these repeated tracebacks clutter the log.

  2. After this PR, the same run completes without the TypeError traces:

[03-03 00:37:18] Found complete model in cache at /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Model path: /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Diffusers version: 0.37.0.dev0
[03-03 00:37:18] Loading pipeline modules from config: {'_class_name': 'GlmImagePipeline', '_diffusers_version': '0.37.0.dev0', '_name_or_path': 'zai-org/GLM-Image-Decoder', 'text_encoder': ['transformers', 'T5EncoderModel'], 'vision_language_encoder': ['transformers', 'GlmImageForConditionalGeneration'], 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'tokenizer': ['transformers', 'ByT5Tokenizer'], 'processor': ['transformers', 'GlmImageProcessor'], 'transformer': ['diffusers', 'GlmImageTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[03-03 00:37:18] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'vision_language_encoder', 'processor', 'transformer', 'scheduler']
Loading required modules:   0%|                         | 0/7 [00:00<?, ?it/s][03-03 00:37:18] Loading text_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/text_encoder. avail mem: 134.48 GB
[03-03 00:37:18] [RunAI Streamer] Overall time to stream 830.3 MiB of all files to cpu: 0.2s, 4.0 GiB/s
[03-03 00:37:19] Loaded text_encoder: FSDPT5EncoderModel (sgl-diffusion version). model size: 0.81 GB, consumed GPU mem: 0.76 GB, avail GPU mem: 133.71 GB
Loading required modules:  14%|██▍              | 1/7 [00:01<00:08,  1.42s/it][03-03 00:37:19] Loading tokenizer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/tokenizer. avail mem: 133.71 GB
[03-03 00:37:19] Loaded tokenizer: ByT5Tokenizer (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 133.71 GB
[03-03 00:37:19] Loading vae from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vae. avail mem: 133.71 GB
[03-03 00:37:20] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.76 GB, consumed GPU mem: 0.77 GB, avail GPU mem: 132.94 GB
Loading required modules:  43%|███████▎         | 3/7 [00:02<00:02,  1.62it/s][03-03 00:37:20] Loading vision_language_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vision_language_encoder. avail mem: 132.94 GB
Loading weights: 100%|| 1011/1011 [00:00<00:00, 4461.57it/s, Materializing pa
[03-03 00:37:25] Loaded vision_language_encoder: GlmImageForConditionalGeneration (sgl-diffusion version). model size: 18.84 GB, consumed GPU mem: 18.88 GB, avail GPU mem: 114.07 GB
Loading required modules:  57%|█████████▋       | 4/7 [00:06<00:06,  2.07s/it][03-03 00:37:25] Loading processor from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/processor. avail mem: 114.07 GB
The image processor of type `GlmImageImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. 
[03-03 00:37:26] Loaded processor: GlmImageProcessor (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules:  71%|████████████▏    | 5/7 [00:08<00:03,  1.88s/it][03-03 00:37:26] Loading transformer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/transformer. avail mem: 114.07 GB
[03-03 00:37:26] Loading GlmImageTransformer2DModel from 3 safetensors file(s) , param_dtype: torch.bfloat16
[03-03 00:37:26] flash_attn 3 package is not installed. It's recommended to install flash_attn3 on hopper, otherwise performance is sub-optimal
[03-03 00:37:26] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:27] [RunAI Streamer] Overall time to stream 12.9 GiB of all files to cpu: 0.98s, 13.2 GiB/s
[03-03 00:37:36] Loaded model with 6.93B parameters
[03-03 00:37:36] Loaded transformer: GlmImageTransformer2DModel (sgl-diffusion version). model size: 12.9 GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules:  86%|██████████████▌  | 6/7 [00:18<00:04,  4.37s/it][03-03 00:37:36] Loading scheduler from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/scheduler. avail mem: 114.07 GB
[03-03 00:37:36] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 100%|█████████████████| 7/7 [00:18<00:00,  2.58s/it]
[03-03 00:37:36] Creating pipeline stages...
[03-03 00:37:36] Compiling transformer with mode: max-autotune-no-cudagraphs
[03-03 00:37:36] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:36] Pipeline instantiated
[03-03 00:37:36] Worker 0: Initialized device, model, and distributed environment.
[03-03 00:37:36] Worker 0: Scheduler loop started.
[03-03 00:37:36] Starting FastAPI server.
[2026-03-03 00:37:36] INFO:     Started server process [970914]
[2026-03-03 00:37:36] INFO:     Waiting for application startup.
[03-03 00:37:36] ZMQ Broker is listening for offline jobs on tcp://*:30001
[2026-03-03 00:37:36] INFO:     Application startup complete.
[2026-03-03 00:37:36] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:45] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:37:45] [GlmImageBeforeDenoisingStage] started...
[03-03 00:38:11] generate_prior_tokens time: 26.338573455810547
[03-03 00:38:13] [GlmImageBeforeDenoisingStage] finished in 27.4317 seconds
[03-03 00:38:13] [DenoisingStage] started...
  0%|                                                  | 0/30 [00:00<?, ?it/s]/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `posix.lstat.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>.datetime.now.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=24576
final peak_memory=24576
final peak_memory=24576
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=65536
final peak_memory=65536
final peak_memory=65536
100%|█████████████████████████████████████████| 30/30 [00:23<00:00,  1.30it/s]
[03-03 00:38:36] [DenoisingStage] average time per step: 0.7684 seconds
[03-03 00:38:36] [DenoisingStage] finished in 23.0620 seconds
[03-03 00:38:36] [DecodingStage] started...
[03-03 00:38:36] [DecodingStage] finished in 0.4158 seconds
[03-03 00:38:36] Peak GPU memory: 41.14 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.81 GB (9.3%), Remaining GPU memory at peak: 99.26 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:38:36] Output saved to outputs/8d100b3e-d1ab-4df3-b3cd-f6da622ce1a3.jpg
[03-03 00:38:36] Pixel data generated successfully in 50.96 seconds
[03-03 00:38:36] Completed batch processing. Generated 1 outputs in 50.96 seconds
[03-03 00:38:36] Peak memory usage: 42126.00 MB
[2026-03-03 00:38:36] INFO:     127.0.0.1:33326 - "POST /v1/images/generations HTTP/1.1" 200 OK
[03-03 00:38:47] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:38:47] [GlmImageBeforeDenoisingStage] started...
[03-03 00:39:13] generate_prior_tokens time: 25.920583486557007
[03-03 00:39:13] [GlmImageBeforeDenoisingStage] finished in 25.9709 seconds
[03-03 00:39:13] [DenoisingStage] started...
100%|█████████████████████████████████████████| 30/30 [00:06<00:00,  4.43it/s]
[03-03 00:39:20] [DenoisingStage] average time per step: 0.2255 seconds
[03-03 00:39:20] [DenoisingStage] finished in 6.7674 seconds
[03-03 00:39:20] [DecodingStage] started...
[03-03 00:39:20] [DecodingStage] finished in 0.0743 seconds
[03-03 00:39:20] Peak GPU memory: 41.15 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.82 GB (9.3%), Remaining GPU memory at peak: 99.25 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:39:20] Output saved to outputs/3e918edf-9fd7-429d-b23b-1eb1d67fb208.jpg
[03-03 00:39:20] Pixel data generated successfully in 33.04 seconds
[03-03 00:39:20] Completed batch processing. Generated 1 outputs in 33.04 seconds
[03-03 00:39:20] Peak memory usage: 42134.00 MB
[2026-03-03 00:39:20] INFO:     127.0.0.1:37384 - "POST /v1/images/generations HTTP/1.1" 200 OK

Add default value `eps=1e-5` to `register_fake` implementations of
`fused_norm_scale_shift` and `fused_scale_residual_norm_scale_shift`
custom ops, matching the default in the actual custom_op signatures.

Made-with: Cursor

@zhaochenyang20 (Collaborator, Author) commented:

BTW, I think our logging with torch compile enabled is too messy.

(sglang) ➜  python git:(fix/fake-impl-eps-default-for-torch-compile) sglang serve --model-path zai-org/GLM-Image --enable-torch-compile
[2026-03-03 00:37:08] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO serve.py:87: Diffusion model detected
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:09] Disabling some offloading (except dit, text_encoder) for image generation model
[03-03 00:37:09] server_args: {"model_path": "zai-org/GLM-Image", "model_id": null, "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": true, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "master_port": 30075, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5649, "output_path": "outputs/", "input_save_path": "inputs/uploads", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[03-03 00:37:09] Starting server...
[03-03 00:37:16] Scheduler bind at endpoint: tcp://127.0.0.1:5649
[03-03 00:37:16] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
[03-03 00:37:16] Setting distributed timeout to 3600 seconds
[03-03 00:37:17] No pipeline_class_name specified, using model_index.json
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:18] Using pipeline from model_index.json: GlmImagePipeline
[03-03 00:37:18] Loading pipeline modules...
[03-03 00:37:18] Checking for cached model in HF Hub cache for zai-org/GLM-Image...
[03-03 00:37:18] Found complete model in cache at /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Model path: /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Diffusers version: 0.37.0.dev0
[03-03 00:37:18] Loading pipeline modules from config: {'_class_name': 'GlmImagePipeline', '_diffusers_version': '0.37.0.dev0', '_name_or_path': 'zai-org/GLM-Image-Decoder', 'text_encoder': ['transformers', 'T5EncoderModel'], 'vision_language_encoder': ['transformers', 'GlmImageForConditionalGeneration'], 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'tokenizer': ['transformers', 'ByT5Tokenizer'], 'processor': ['transformers', 'GlmImageProcessor'], 'transformer': ['diffusers', 'GlmImageTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[03-03 00:37:18] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'vision_language_encoder', 'processor', 'transformer', 'scheduler']
Loading required modules:   0%|                         | 0/7 [00:00<?, ?it/s][03-03 00:37:18] Loading text_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/text_encoder. avail mem: 134.48 GB
[03-03 00:37:18] [RunAI Streamer] Overall time to stream 830.3 MiB of all files to cpu: 0.2s, 4.0 GiB/s
[03-03 00:37:19] Loaded text_encoder: FSDPT5EncoderModel (sgl-diffusion version). model size: 0.81 GB, consumed GPU mem: 0.76 GB, avail GPU mem: 133.71 GB
Loading required modules:  14%|██▍              | 1/7 [00:01<00:08,  1.42s/it][03-03 00:37:19] Loading tokenizer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/tokenizer. avail mem: 133.71 GB
[03-03 00:37:19] Loaded tokenizer: ByT5Tokenizer (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 133.71 GB
[03-03 00:37:19] Loading vae from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vae. avail mem: 133.71 GB
[03-03 00:37:20] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.76 GB, consumed GPU mem: 0.77 GB, avail GPU mem: 132.94 GB
Loading required modules:  43%|███████▎         | 3/7 [00:02<00:02,  1.62it/s][03-03 00:37:20] Loading vision_language_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vision_language_encoder. avail mem: 132.94 GB
Loading weights: 100%|| 1011/1011 [00:00<00:00, 4461.57it/s, Materializing pa
[03-03 00:37:25] Loaded vision_language_encoder: GlmImageForConditionalGeneration (sgl-diffusion version). model size: 18.84 GB, consumed GPU mem: 18.88 GB, avail GPU mem: 114.07 GB
Loading required modules:  57%|█████████▋       | 4/7 [00:06<00:06,  2.07s/it][03-03 00:37:25] Loading processor from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/processor. avail mem: 114.07 GB
The image processor of type `GlmImageImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. 
[03-03 00:37:26] Loaded processor: GlmImageProcessor (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules:  71%|████████████▏    | 5/7 [00:08<00:03,  1.88s/it][03-03 00:37:26] Loading transformer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/transformer. avail mem: 114.07 GB
[03-03 00:37:26] Loading GlmImageTransformer2DModel from 3 safetensors file(s) , param_dtype: torch.bfloat16
[03-03 00:37:26] flash_attn 3 package is not installed. It's recommended to install flash_attn3 on hopper, otherwise performance is sub-optimal
[03-03 00:37:26] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:27] [RunAI Streamer] Overall time to stream 12.9 GiB of all files to cpu: 0.98s, 13.2 GiB/s
[03-03 00:37:36] Loaded model with 6.93B parameters
[03-03 00:37:36] Loaded transformer: GlmImageTransformer2DModel (sgl-diffusion version). model size: 12.9 GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules:  86%|██████████████▌  | 6/7 [00:18<00:04,  4.37s/it][03-03 00:37:36] Loading scheduler from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/scheduler. avail mem: 114.07 GB
[03-03 00:37:36] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 100%|█████████████████| 7/7 [00:18<00:00,  2.58s/it]
[03-03 00:37:36] Creating pipeline stages...
[03-03 00:37:36] Compiling transformer with mode: max-autotune-no-cudagraphs
[03-03 00:37:36] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:36] Pipeline instantiated
[03-03 00:37:36] Worker 0: Initialized device, model, and distributed environment.
[03-03 00:37:36] Worker 0: Scheduler loop started.
[03-03 00:37:36] Starting FastAPI server.
[2026-03-03 00:37:36] INFO:     Started server process [970914]
[2026-03-03 00:37:36] INFO:     Waiting for application startup.
[03-03 00:37:36] ZMQ Broker is listening for offline jobs on tcp://*:30001
[2026-03-03 00:37:36] INFO:     Application startup complete.
[2026-03-03 00:37:36] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:45] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:37:45] [GlmImageBeforeDenoisingStage] started...
[03-03 00:38:11] generate_prior_tokens time: 26.338573455810547
[03-03 00:38:13] [GlmImageBeforeDenoisingStage] finished in 27.4317 seconds
[03-03 00:38:13] [DenoisingStage] started...
  0%|                                                  | 0/30 [00:00<?, ?it/s]/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `posix.lstat.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>.datetime.now.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=24576
final peak_memory=24576
final peak_memory=24576
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=65536
final peak_memory=65536
final peak_memory=65536
100%|█████████████████████████████████████████| 30/30 [00:23<00:00,  1.30it/s]
[03-03 00:38:36] [DenoisingStage] average time per step: 0.7684 seconds
[03-03 00:38:36] [DenoisingStage] finished in 23.0620 seconds
[03-03 00:38:36] [DecodingStage] started...
[03-03 00:38:36] [DecodingStage] finished in 0.4158 seconds
[03-03 00:38:36] Peak GPU memory: 41.14 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.81 GB (9.3%), Remaining GPU memory at peak: 99.26 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:38:36] Output saved to outputs/8d100b3e-d1ab-4df3-b3cd-f6da622ce1a3.jpg
[03-03 00:38:36] Pixel data generated successfully in 50.96 seconds
[03-03 00:38:36] Completed batch processing. Generated 1 outputs in 50.96 seconds
[03-03 00:38:36] Peak memory usage: 42126.00 MB
[2026-03-03 00:38:36] INFO:     127.0.0.1:33326 - "POST /v1/images/generations HTTP/1.1" 200 OK
[03-03 00:38:47] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:38:47] [GlmImageBeforeDenoisingStage] started...
[03-03 00:39:13] generate_prior_tokens time: 25.920583486557007
[03-03 00:39:13] [GlmImageBeforeDenoisingStage] finished in 25.9709 seconds
[03-03 00:39:13] [DenoisingStage] started...
100%|█████████████████████████████████████████| 30/30 [00:06<00:00,  4.43it/s]
[03-03 00:39:20] [DenoisingStage] average time per step: 0.2255 seconds
[03-03 00:39:20] [DenoisingStage] finished in 6.7674 seconds
[03-03 00:39:20] [DecodingStage] started...
[03-03 00:39:20] [DecodingStage] finished in 0.0743 seconds
[03-03 00:39:20] Peak GPU memory: 41.15 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.82 GB (9.3%), Remaining GPU memory at peak: 99.25 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:39:20] Output saved to outputs/3e918edf-9fd7-429d-b23b-1eb1d67fb208.jpg
[03-03 00:39:20] Pixel data generated successfully in 33.04 seconds
[03-03 00:39:20] Completed batch processing. Generated 1 outputs in 33.04 seconds
[03-03 00:39:20] Peak memory usage: 42134.00 MB
[2026-03-03 00:39:20] INFO:     127.0.0.1:37384 - "POST /v1/images/generations HTTP/1.1" 200 OK

Shall we remove these `final peak_memory=65536` log lines?


@BBuf BBuf left a comment


Looks good. cc @yingluosanqian

@BBuf

BBuf commented Mar 3, 2026

BTW, I think our logging after enabling torch.compile is too messy.

(sglang) ➜  python git:(fix/fake-impl-eps-default-for-torch-compile) sglang serve --model-path zai-org/GLM-Image --enable-torch-compile
[2026-03-03 00:37:08] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO serve.py:87: Diffusion model detected
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:09] Disabling some offloading (except dit, text_encoder) for image generation model
[03-03 00:37:09] server_args: {"model_path": "zai-org/GLM-Image", "model_id": null, "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": true, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "master_port": 30075, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5649, "output_path": "outputs/", "input_save_path": "inputs/uploads", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[03-03 00:37:09] Starting server...
[03-03 00:37:16] Scheduler bind at endpoint: tcp://127.0.0.1:5649
[03-03 00:37:16] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=24576
final peak_memory=24576
final peak_memory=24576
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=65536
final peak_memory=65536
final peak_memory=65536

> Shall we remove these `final peak_memory=65536` log lines?

Can you search whether there's an environment variable in torch compile that can control this log?

@zhaochenyang20
Collaborator Author

As suggested by BBuf:

The torch.compile logs are not something we are actively controlling.

We could check if there are specific environment variables for torch.compile that manage log output.

My understanding is that these logs are emitted automatically as soon as `torch.compile(xxx_module)` is called.
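For reference, PyTorch does expose log controls: the `TORCH_LOGS` environment variable (a `-` prefix raises an artifact's log threshold to ERROR) and the `torch._logging.set_logs` API. Whether either of them actually silences the raw `final peak_memory=` prints depends on where Inductor emits them, so treat the sketch below as a starting point, not a confirmed fix:

```python
import os

# TORCH_LOGS must be set before torch triggers any compilation.
# "-dynamo,-inductor" raises both artifacts to ERROR verbosity;
# whether this covers the `final peak_memory=` prints is untested here.
os.environ["TORCH_LOGS"] = "-dynamo,-inductor"

# Equivalent programmatic control (requires torch; shown as comments so
# this snippet stays importable without it):
# import logging, torch
# torch._logging.set_logs(dynamo=logging.ERROR, inductor=logging.ERROR)

print(os.environ["TORCH_LOGS"])  # -> -dynamo,-inductor
```

The environment variable has to be in place before the first compile; `torch._logging.set_logs` can instead be called at runtime, before the compiled module runs.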


@yingluosanqian yingluosanqian left a comment


Thanks for fixing it. We always passed `eps` explicitly before, so the issue didn't appear. In your test it might not have been passed, which triggered the problem. I think this change is correct.
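The mismatch is easy to reproduce without torch: the fake-tensor tracer calls the fake impl with whatever arguments the traced call site supplies, so any default on the `@custom_op` signature must be mirrored on the `register_fake` function. A minimal pure-Python sketch (names and argument lists are simplified stand-ins, not the real SGLang-Diffusion signatures):

```python
# Stand-in for the @custom_op real implementation: eps has a default.
def fused_norm_scale_shift(x, scale, shift, eps: float = 1e-5):
    return x  # real kernel omitted

# Buggy fake impl: `eps` has no default, so a call that relies on the
# default (as torch.compile's fake tracing did) raises TypeError.
def fused_norm_scale_shift_fake_buggy(x, scale, shift, eps):
    return x

# Fixed fake impl: the default matches the real signature.
def fused_norm_scale_shift_fake(x, scale, shift, eps=1e-5):
    return x

try:
    fused_norm_scale_shift_fake_buggy(1.0, 2.0, 3.0)
except TypeError as e:
    print(f"TypeError: {e}")  # ... missing 1 required positional argument: 'eps'

print(fused_norm_scale_shift_fake(1.0, 2.0, 3.0))  # -> 1.0
```

In the real code the same one-line fix applies to both `_fused_norm_scale_shift_fake` and `_fused_scale_residual_norm_scale_shift_fake`.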

@BBuf BBuf merged commit 62480eb into main Mar 3, 2026
71 of 83 checks passed
@BBuf BBuf deleted the fix/fake-impl-eps-default-for-torch-compile branch March 3, 2026 07:24
AMD-yanfeiwang pushed a commit to AMD-yanfeiwang/sglang that referenced this pull request Mar 3, 2026
…rch.compile (sgl-project#19725)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
AMD-yanfeiwang pushed a commit to AMD-yanfeiwang/sglang that referenced this pull request Mar 3, 2026
…rch.compile (sgl-project#19725)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Kangyan-Zhou pushed a commit to Kangyan-Zhou/sglang that referenced this pull request Mar 4, 2026
…rch.compile (sgl-project#19725)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
…rch.compile (sgl-project#19725)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…rch.compile (sgl-project#19725)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>