[SGLang-Diffusion] Fix custom op fake impl missing eps default for torch.compile#19725
Conversation
Add a default value `eps=1e-5` to the `register_fake` implementations of the `fused_norm_scale_shift` and `fused_scale_residual_norm_scale_shift` custom ops, matching the defaults in the actual `@custom_op` signatures. Made-with: Cursor
BTW, I think our logging after enabling torch.compile looks like this:

(sglang) ➜ python git:(fix/fake-impl-eps-default-for-torch-compile) sglang serve --model-path zai-org/GLM-Image --enable-torch-compile
[2026-03-03 00:37:08] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO serve.py:87: Diffusion model detected
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[2026-03-03 00:37:09] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:09] Disabling some offloading (except dit, text_encoder) for image generation model
[03-03 00:37:09] server_args: {"model_path": "zai-org/GLM-Image", "model_id": null, "backend": "auto", "attention_backend": null, "attention_backend_config": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "enable_torch_compile": true, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "master_port": 30075, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5649, "output_path": "outputs/", "input_save_path": "inputs/uploads", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "log_level": "info"}
[03-03 00:37:09] Starting server...
[03-03 00:37:16] Scheduler bind at endpoint: tcp://127.0.0.1:5649
[03-03 00:37:16] Initializing distributed environment with world_size=1, device=cuda:0, timeout=3600
[03-03 00:37:16] Setting distributed timeout to 3600 seconds
[03-03 00:37:17] No pipeline_class_name specified, using model_index.json
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:18] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:18] Using pipeline from model_index.json: GlmImagePipeline
[03-03 00:37:18] Loading pipeline modules...
[03-03 00:37:18] Checking for cached model in HF Hub cache for zai-org/GLM-Image...
[03-03 00:37:18] Found complete model in cache at /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Model path: /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4
[03-03 00:37:18] Diffusers version: 0.37.0.dev0
[03-03 00:37:18] Loading pipeline modules from config: {'_class_name': 'GlmImagePipeline', '_diffusers_version': '0.37.0.dev0', '_name_or_path': 'zai-org/GLM-Image-Decoder', 'text_encoder': ['transformers', 'T5EncoderModel'], 'vision_language_encoder': ['transformers', 'GlmImageForConditionalGeneration'], 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'tokenizer': ['transformers', 'ByT5Tokenizer'], 'processor': ['transformers', 'GlmImageProcessor'], 'transformer': ['diffusers', 'GlmImageTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[03-03 00:37:18] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'vision_language_encoder', 'processor', 'transformer', 'scheduler']
Loading required modules: 0%| | 0/7 [00:00<?, ?it/s][03-03 00:37:18] Loading text_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/text_encoder. avail mem: 134.48 GB
[03-03 00:37:18] [RunAI Streamer] Overall time to stream 830.3 MiB of all files to cpu: 0.2s, 4.0 GiB/s
[03-03 00:37:19] Loaded text_encoder: FSDPT5EncoderModel (sgl-diffusion version). model size: 0.81 GB, consumed GPU mem: 0.76 GB, avail GPU mem: 133.71 GB
Loading required modules: 14%|██▍ | 1/7 [00:01<00:08, 1.42s/it][03-03 00:37:19] Loading tokenizer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/tokenizer. avail mem: 133.71 GB
[03-03 00:37:19] Loaded tokenizer: ByT5Tokenizer (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 133.71 GB
[03-03 00:37:19] Loading vae from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vae. avail mem: 133.71 GB
[03-03 00:37:20] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.76 GB, consumed GPU mem: 0.77 GB, avail GPU mem: 132.94 GB
Loading required modules: 43%|███████▎ | 3/7 [00:02<00:02, 1.62it/s][03-03 00:37:20] Loading vision_language_encoder from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/vision_language_encoder. avail mem: 132.94 GB
Loading weights: 100%|█| 1011/1011 [00:00<00:00, 4461.57it/s, Materializing pa
[03-03 00:37:25] Loaded vision_language_encoder: GlmImageForConditionalGeneration (sgl-diffusion version). model size: 18.84 GB, consumed GPU mem: 18.88 GB, avail GPU mem: 114.07 GB
Loading required modules: 57%|█████████▋ | 4/7 [00:06<00:06, 2.07s/it][03-03 00:37:25] Loading processor from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/processor. avail mem: 114.07 GB
The image processor of type `GlmImageImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
[03-03 00:37:26] Loaded processor: GlmImageProcessor (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 71%|████████████▏ | 5/7 [00:08<00:03, 1.88s/it][03-03 00:37:26] Loading transformer from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/transformer. avail mem: 114.07 GB
[03-03 00:37:26] Loading GlmImageTransformer2DModel from 3 safetensors file(s) , param_dtype: torch.bfloat16
[03-03 00:37:26] flash_attn 3 package is not installed. It's recommended to install flash_attn3 on hopper, otherwise performance is sub-optimal
[03-03 00:37:26] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:27] [RunAI Streamer] Overall time to stream 12.9 GiB of all files to cpu: 0.98s, 13.2 GiB/s
[03-03 00:37:36] Loaded model with 6.93B parameters
[03-03 00:37:36] Loaded transformer: GlmImageTransformer2DModel (sgl-diffusion version). model size: 12.9 GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 86%|██████████████▌ | 6/7 [00:18<00:04, 4.37s/it][03-03 00:37:36] Loading scheduler from /root/.cache/huggingface/hub/models--zai-org--GLM-Image/snapshots/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/scheduler. avail mem: 114.07 GB
[03-03 00:37:36] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 114.07 GB
Loading required modules: 100%|█████████████████| 7/7 [00:18<00:00, 2.58s/it]
[03-03 00:37:36] Creating pipeline stages...
[03-03 00:37:36] Compiling transformer with mode: max-autotune-no-cudagraphs
[03-03 00:37:36] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend
[03-03 00:37:36] Pipeline instantiated
[03-03 00:37:36] Worker 0: Initialized device, model, and distributed environment.
[03-03 00:37:36] Worker 0: Scheduler loop started.
[03-03 00:37:36] Starting FastAPI server.
[2026-03-03 00:37:36] INFO: Started server process [970914]
[2026-03-03 00:37:36] INFO: Waiting for application startup.
[03-03 00:37:36] ZMQ Broker is listening for offline jobs on tcp://*:30001
[2026-03-03 00:37:36] INFO: Application startup complete.
[2026-03-03 00:37:36] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/zai-org/GLM-Image/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
[03-03 00:37:45] HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/zai-org/GLM-Image/2c433cc0cbc293bde2ac8ca9624f279b5d23fcf4/model_index.json "HTTP/1.1 200 OK"
[03-03 00:37:45] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:37:45] [GlmImageBeforeDenoisingStage] started...
[03-03 00:38:11] generate_prior_tokens time: 26.338573455810547
[03-03 00:38:13] [GlmImageBeforeDenoisingStage] finished in 27.4317 seconds
[03-03 00:38:13] [DenoisingStage] started...
0%| | 0/30 [00:00<?, ?it/s]/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `posix.lstat.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
/data/chenyang/.python/sglang/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>.datetime.now.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=24576
final peak_memory=24576
final peak_memory=24576
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=268525568
final peak_memory=65536
final peak_memory=65536
final peak_memory=65536
100%|█████████████████████████████████████████| 30/30 [00:23<00:00, 1.30it/s]
[03-03 00:38:36] [DenoisingStage] average time per step: 0.7684 seconds
[03-03 00:38:36] [DenoisingStage] finished in 23.0620 seconds
[03-03 00:38:36] [DecodingStage] started...
[03-03 00:38:36] [DecodingStage] finished in 0.4158 seconds
[03-03 00:38:36] Peak GPU memory: 41.14 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.81 GB (9.3%), Remaining GPU memory at peak: 99.26 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:38:36] Output saved to outputs/8d100b3e-d1ab-4df3-b3cd-f6da622ce1a3.jpg
[03-03 00:38:36] Pixel data generated successfully in 50.96 seconds
[03-03 00:38:36] Completed batch processing. Generated 1 outputs in 50.96 seconds
[03-03 00:38:36] Peak memory usage: 42126.00 MB
[2026-03-03 00:38:36] INFO: 127.0.0.1:33326 - "POST /v1/images/generations HTTP/1.1" 200 OK
[03-03 00:38:47] Running pipeline stages: ['glm_image_before_denoising_stage', 'denoising_stage', 'decoding_stage']
[03-03 00:38:47] [GlmImageBeforeDenoisingStage] started...
[03-03 00:39:13] generate_prior_tokens time: 25.920583486557007
[03-03 00:39:13] [GlmImageBeforeDenoisingStage] finished in 25.9709 seconds
[03-03 00:39:13] [DenoisingStage] started...
100%|█████████████████████████████████████████| 30/30 [00:06<00:00, 4.43it/s]
[03-03 00:39:20] [DenoisingStage] average time per step: 0.2255 seconds
[03-03 00:39:20] [DenoisingStage] finished in 6.7674 seconds
[03-03 00:39:20] [DecodingStage] started...
[03-03 00:39:20] [DecodingStage] finished in 0.0743 seconds
[03-03 00:39:20] Peak GPU memory: 41.15 GB, Peak allocated: 37.33 GB, Memory pool overhead: 3.82 GB (9.3%), Remaining GPU memory at peak: 99.25 GB. Components that could stay resident (based on the last request workload): ['text_encoder', 'transformer']. Related offload server args to disable: --dit-cpu-offload, --text-encoder-cpu-offload
[03-03 00:39:20] Output saved to outputs/3e918edf-9fd7-429d-b23b-1eb1d67fb208.jpg
[03-03 00:39:20] Pixel data generated successfully in 33.04 seconds
[03-03 00:39:20] Completed batch processing. Generated 1 outputs in 33.04 seconds
[03-03 00:39:20] Peak memory usage: 42134.00 MB
[2026-03-03 00:39:20] INFO: 127.0.0.1:37384 - "POST /v1/images/generations HTTP/1.1" 200 OK

Shall we remove these torch.compile warning logs?
BBuf left a comment:
Looks good. cc @yingluosanqian
Can you check whether there's an environment variable in torch.compile that can control this log?
As suggested by BBuf: we don't actively control the torch.compile logs. We could check whether there are torch.compile-specific environment variables that manage log output. My understanding is that these logs are emitted automatically as soon as `torch.compile(xxx_module)` is called.
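One possible angle, sketched below under assumptions: the repeated lines in the log are formatted like plain Python `UserWarning`s (file, line, `UserWarning:` prefix), which suggests they pass through Python's standard `warnings` machinery rather than a torch-specific logger. If so, an ordinary warnings filter could silence them. The message regexes here are guesses derived from the pasted log, not a documented torch.compile switch.

```python
# Hedged sketch (not a verified torch.compile switch): the repeated
# "Dynamo does not know how to trace..." lines appear to be ordinary
# Python UserWarnings, so the standard warnings filter may suppress
# them. The message patterns below are assumptions based on the log.
import warnings


def silence_dynamo_trace_warnings() -> None:
    """Install ignore-filters for the noisy Dynamo tracing warnings."""
    for pattern in (
        r"Dynamo does not know how to trace the builtin",
        r"Dynamo detected a call to a `functools\.lru_cache`",
    ):
        warnings.filterwarnings("ignore", message=pattern, category=UserWarning)


# Demonstration with a synthetic warning matching the log text:
with warnings.catch_warnings(record=True) as recorded:
    warnings.simplefilter("always")
    silence_dynamo_trace_warnings()
    warnings.warn(
        "Dynamo does not know how to trace the builtin `posix.lstat.`",
        UserWarning,
    )
print("suppressed:", len(recorded) == 0)  # → suppressed: True
```

This would only hide the symptom, of course; fixing the fake-impl signature (as this PR does) removes the underlying graph break.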
yingluosanqian left a comment:
Thanks for fixing it. We always passed eps explicitly before, so the issue didn't appear. In your test it might not have been passed, which triggered the problem. I think this change is correct.
…rch.compile (sgl-project#19725) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Motivation
While testing #18154 with `--enable-torch-compile`, torch.compile's fake-tensor tracing produces repeated `TypeError` traces. These don't cause hard failures (torch.compile falls back to eager), but they spam the log and hurt debuggability.
Modifications
Add an `eps=1e-5` default to both `register_fake` implementations, matching the `@custom_op` signatures:

- `_fused_norm_scale_shift_fake(..., eps)` → `(..., eps=1e-5)`
- `_fused_scale_residual_norm_scale_shift_fake(..., eps)` → `(..., eps=1e-5)`

Test Plan
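The bug class can be illustrated with a minimal, torch-free sketch. The function names mirror the ops in this PR but the bodies are illustrative stand-ins, not the real SGLang kernels: the "real" op gives `eps` a default, while the fake (meta) implementation registered for tracing does not, so a tracer that replays a call made without `eps` hits a `TypeError` in the fake impl only.

```python
# Illustrative sketch of the bug class fixed by this PR (names mirror
# the PR; bodies are stand-ins, not the real SGLang kernels).

def fused_norm_scale_shift(x, scale, shift, eps=1e-5):
    """'Real' op: eps has a default, so calls may omit it."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [((v - mean) / (var + eps) ** 0.5) * s + t
            for v, s, t in zip(x, scale, shift)]

def _fused_norm_scale_shift_fake_buggy(x, scale, shift, eps):
    """Before the fix: eps is required, so replaying a call that
    omitted eps raises TypeError during fake-tensor tracing."""
    return [0.0] * len(x)

def _fused_norm_scale_shift_fake_fixed(x, scale, shift, eps=1e-5):
    """After the fix: the default matches the real signature."""
    return [0.0] * len(x)

x, scale, shift = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

# The real op works without eps; the buggy fake impl does not.
fused_norm_scale_shift(x, scale, shift)
try:
    _fused_norm_scale_shift_fake_buggy(x, scale, shift)
except TypeError as e:
    print("buggy fake impl:", type(e).__name__)  # → buggy fake impl: TypeError

print("fixed fake impl:", _fused_norm_scale_shift_fake_fixed(x, scale, shift))
```

In PyTorch's actual custom-op API the same rule applies: the function decorated with `register_fake` must accept the same calls as the `@custom_op` target, defaults included.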
Running `sglang serve --model-path zai-org/GLM-Image --enable-torch-compile` previously ran into the repeated `TypeError` traces shown in the log above. torch.compile falls back to eager, so generation still succeeds, but the logs are noisy. After this PR, the error no longer appears.