[VLM] remove AsyncMMDataProcessor wrapper by yhyang201 · Pull Request #21651 · sgl-project/sglang

yhyang201 · 2026-03-30T03:48:20Z

Motivation

Revert #12066

Remove AsyncMMDataProcessor as it provides no real value and has fundamental design flaws.

All multimodal processors inherit from BaseMultiModalProcessor, which defines process_mm_data_async as @abstractmethod. This means:

Every processor already implements process_mm_data_async, so AsyncMMDataProcessor always takes the "async" path. The sync fallback (ThreadPoolExecutor + run_in_executor) is dead code — no processor ever triggers it. Moreover, the processors are not thread-safe (HuggingFace processors and tokenizers carry mutable state), making this fallback path not just dead but also incorrect by design.
The remaining features provide limited value for most processors:

Feature	Reality
Async non-blocking	Most processors are "fake async" — synchronous CPU work inside `async def` with no real `await` points (e.g. `gemma3`, `pixtral`), blocking the event loop regardless. Only a few (e.g. `llava`) use genuine `await loop.run_in_executor()`.
Sync fallback (`run_in_executor`)	Dead code — all processors implement `process_mm_data_async`, no processor takes this path
Semaphore concurrency limit	Only effective for processors with real `await` points; for the majority that block synchronously, the event loop is never yielded so the semaphore has no opportunity to gate concurrency
Timeout (`wait_for`)	Only effective at `await` points; for processors that block synchronously, `asyncio.wait_for` cannot interrupt the blocking work
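
The timeout row above can be checked with a small experiment. The following is a minimal sketch, not code from the PR: `fake_async_work` and `real_async_work` are hypothetical stand-ins for a processor without and with a real `await` point, and `time.sleep` stands in for preprocessing work.

```python
import asyncio
import time

BLOCK_S = 0.3     # duration of the simulated preprocessing work
TIMEOUT_S = 0.05  # much shorter than the work itself


async def fake_async_work():
    # Synchronous work inside `async def`: no await, so the event loop
    # never regains control while this runs.
    time.sleep(BLOCK_S)
    return "done"


async def real_async_work():
    # Work offloaded to a worker thread: the coroutine suspends at a real await.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, BLOCK_S)
    return "done"


async def timed(work, timeout):
    """Return (elapsed_seconds, timed_out) for work() wrapped in asyncio.wait_for."""
    start = time.perf_counter()
    try:
        await asyncio.wait_for(work(), timeout=timeout)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return time.perf_counter() - start, timed_out


async def main():
    fake = await timed(fake_async_work, TIMEOUT_S)
    real = await timed(real_async_work, TIMEOUT_S)
    return fake, real


fake_result, real_result = asyncio.run(main())
print(f"fake async: returned after {fake_result[0]:.2f}s")
print(f"real await: returned after {real_result[0]:.2f}s, timed_out={real_result[1]}")
```

In the fake-async case the call returns only after the full blocking duration, whatever the timeout is set to. In the real-await case the timeout fires close to `TIMEOUT_S`.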

The two call sites in TokenizerManager now call mm_processor.process_mm_data_async() directly. The associated server args (--mm-max-concurrent-calls, --mm-per-request-timeout) are also removed.
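
The dead-code argument in the motivation rests on how `@abstractmethod` behaves. As a minimal sketch (the class names below are hypothetical stand-ins, not the real `BaseMultiModalProcessor` hierarchy), a subclass that only provides a sync method cannot even be instantiated, so a wrapper would never see a processor lacking `process_mm_data_async`:

```python
import inspect
from abc import ABC, abstractmethod


class BaseProcessor(ABC):  # hypothetical stand-in for BaseMultiModalProcessor
    @abstractmethod
    async def process_mm_data_async(self, **kwargs):
        ...


class SyncOnlyProcessor(BaseProcessor):
    # Provides only a sync method; the abstract async method stays unimplemented.
    def process_mm_data(self, **kwargs):
        return {}


class AsyncProcessor(BaseProcessor):
    async def process_mm_data_async(self, **kwargs):
        return {}


try:
    SyncOnlyProcessor()
    sync_only_instantiable = True
except TypeError:
    # ABC refuses to instantiate a class with unimplemented abstract methods.
    sync_only_instantiable = False

has_async = inspect.iscoroutinefunction(AsyncProcessor().process_mm_data_async)
print(sync_only_instantiable, has_async)
```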

gemini-code-assist

Code Review

This pull request removes the AsyncMMDataProcessor utility and its corresponding unit tests, opting instead to call process_mm_data_async directly within the TokenizerManager. This change simplifies the multi-modal data processing pipeline and results in the removal of the mm_max_concurrent_calls and mm_per_request_timeout configuration options from the server arguments and CLI. I have no feedback to provide.

yhyang201 · 2026-03-30T03:52:54Z

/tag-and-rerun-ci

yuan-luo · 2026-03-30T07:47:00Z

Thanks for the cleanup effort. I agree the sync fallback path is dead code and the ThreadPoolExecutor safety concern is valid. However, I think this PR throws out the baby with the bathwater — specifically the asyncio.wait_for timeout and Semaphore concurrency guard. I'd like to raise some concerns before we merge.

1. The analysis conflates two very different phases of `process_mm_data_async`

The PR description focuses on the CPU-bound preprocessing phase (HuggingFace image processor, tokenizer) being "fake async," and that's correct for that phase. But a multimodal request's lifecycle has two distinct phases:

Phase	Nature	Latency bound	`asyncio.wait_for` effective?
Remote resource loading (download image/video from URL)	Network I/O	Unbounded (server hang, network partition)	Depends on implementation
CPU preprocessing (image decode, tokenize)	CPU-bound	Bounded (typically <1s)	No (no yield points)

The remote loading phase is where timeout protection matters most — a hung remote server can block indefinitely, while CPU preprocessing has a predictable upper bound. The PR's argument that "most processors are fake async" only addresses phase 2 and does not consider phase 1.

2. `asyncio.wait_for` does work for processors with real `await` points

Several processors in this codebase use await loop.run_in_executor(...) for image loading. For example, LLaVA's processor offloads work to a thread/process pool:

# Evidence from stack traces (sglang/srt/managers/image_processor.py):
image_inputs = await self._process_single_image(images, input_text)
# which internally does:
image_inputs = await loop.run_in_executor(...)

For any processor that uses run_in_executor or genuine async I/O (aiohttp, etc.), the await points are real, and asyncio.wait_for can successfully cancel the coroutine at those points. The original PR #12066 was motivated by a real production issue with Qwen2.5-VL-72B where /metrics was blocked for 10+ seconds — this wasn't a hypothetical concern.

3. Even for "fake async" processors, timeout still prevents GPU resource waste

Consider this scenario without AsyncMMDataProcessor:

1. User sends a VLM request with an image URL pointing to a slow/hung server
2. process_mm_data_async blocks the event loop for 60 seconds downloading the image (synchronous requests.get / urllib.request.urlopen inside async def)
3. After 60 seconds, the download finally completes (or errors)
4. The request proceeds into the prefill queue and consumes GPU compute
5. But the HTTP client has long since timed out and disconnected

With asyncio.wait_for(timeout=10), step 4 would be skipped — the request would be discarded as a TimeoutError immediately after the blocking call returns. This "delayed timeout" doesn't save the event loop (that requires real async), but it does prevent wasting GPU cycles on already-abandoned requests. This matters at scale.

4. The Semaphore provides genuine backpressure for eligible processors

For processors that have even a single await point (for example, during image download), an asyncio.Semaphore sized by max_concurrent_calls (e.g. 32) provides real concurrency limiting:

Prevents flooding the outbound network with too many concurrent image downloads
Provides graceful degradation under load (queuing excess requests at the semaphore)
Prevents remote image servers from rate-limiting the sglang instance

After this PR, there is zero backpressure mechanism for multimodal preprocessing. All concurrent requests will attempt to download and process images simultaneously.
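
How much a semaphore actually gates depends on whether the guarded coroutine yields. The sketch below is illustrative only: `download_like` and `blocking_like` are hypothetical workers, and `run_batch` records the peak number of coroutines inside the guarded section.

```python
import asyncio
import time


async def run_batch(worker, n_tasks, limit):
    """Run n_tasks copies of worker under a Semaphore; return the peak number in flight."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded():
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await worker()
            finally:
                in_flight -= 1

    await asyncio.gather(*(guarded() for _ in range(n_tasks)))
    return peak


async def download_like():
    await asyncio.sleep(0.02)  # real await point: other tasks can run meanwhile


async def blocking_like():
    time.sleep(0.02)  # no await: the event loop is held for the whole call


async def main():
    capped = await run_batch(download_like, n_tasks=8, limit=3)
    uncapped = await run_batch(download_like, n_tasks=8, limit=8)
    serial = await run_batch(blocking_like, n_tasks=8, limit=3)
    return capped, uncapped, serial


capped, uncapped, serial = asyncio.run(main())
print(capped, uncapped, serial)
```

With a real await point the peak follows the semaphore limit. With blocking work the tasks already run one at a time, so the semaphore is never contended.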

5. Removing server args breaks existing deployments

Users who have tuned --mm-max-concurrent-calls and --mm-per-request-timeout for their production VLM workloads will have their configurations silently ignored after upgrading. This is a breaking change that deserves at minimum a deprecation period.
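
One way a deprecation period could look is sketched below with plain `argparse`. This is an assumption-laden illustration, not sglang's actual argument handling: the parser, the `--port` option, and the warning text are hypothetical; only the two flag names come from this PR.

```python
import argparse
import warnings

DEPRECATED_FLAGS = ("--mm-max-concurrent-calls", "--mm-per-request-timeout")


def build_parser():
    parser = argparse.ArgumentParser(prog="server")  # hypothetical parser
    parser.add_argument("--port", type=int, default=30000)
    for flag in DEPRECATED_FLAGS:
        # Still accepted so existing launch scripts keep working; hidden from --help.
        parser.add_argument(flag, default=None, help=argparse.SUPPRESS)
    return parser


def parse(argv):
    args = build_parser().parse_args(argv)
    for flag in DEPRECATED_FLAGS:
        dest = flag.lstrip("-").replace("-", "_")
        if getattr(args, dest) is not None:
            warnings.warn(
                f"{flag} is deprecated and has no effect",
                DeprecationWarning,
                stacklevel=2,
            )
    return args


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parsed = parse(["--mm-max-concurrent-calls", "32"])

print(parsed.port, [str(w.message) for w in caught])
```

The flags stay parseable for a release or two and emit a warning, after which they can be removed outright.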

yuan-luo · 2026-03-30T07:47:42Z

Given the above analysis, instead of a full revert of #12066, I'd suggest a more targeted cleanup:

Remove the dead code: Drop the sync fallback path (ThreadPoolExecutor, process_mm_data detection, shutdown(), __del__). This addresses the legitimate design-flaw concern.
Keep wait_for + Semaphore: These provide real value for processors with await points and "delayed timeout" value for others.

The simplified AsyncMMDataProcessor would look like the following:

import asyncio


class AsyncMMDataProcessor:
    def __init__(self, mm_processor, *, max_concurrent_calls=None, timeout_s=None):
        self.mm_processor = mm_processor
        self.timeout_s = timeout_s
        # No semaphore when no concurrency limit is configured.
        self.semaphore = (
            asyncio.Semaphore(max_concurrent_calls) if max_concurrent_calls else None
        )

    async def process(self, **kwargs):
        async def _invoke():
            return await self.mm_processor.process_mm_data_async(**kwargs)

        # Concurrency-limited path: the timeout applies while the slot is held.
        if self.semaphore is not None:
            async with self.semaphore:
                if self.timeout_s is not None:
                    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
                return await _invoke()

        # Unlimited path: only the optional timeout applies.
        if self.timeout_s is not None:
            return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
        return await _invoke()

This keeps the safety net while removing all the dead code. Removing a partially-effective safety mechanism in favor of no safety mechanism doesn't seem like the right trade-off for a system serving trillions of tokens daily in production.
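
A usage sketch of the proposed wrapper follows. The class is a condensed copy of the snippet above so the example runs on its own, and `StubProcessor` with its `delay_s` parameter is a hypothetical processor that has one real await point.

```python
import asyncio


class AsyncMMDataProcessor:
    """Condensed copy of the proposed wrapper, repeated so this example is self-contained."""

    def __init__(self, mm_processor, *, max_concurrent_calls=None, timeout_s=None):
        self.mm_processor = mm_processor
        self.timeout_s = timeout_s
        self.semaphore = (
            asyncio.Semaphore(max_concurrent_calls) if max_concurrent_calls else None
        )

    async def _invoke(self, **kwargs):
        coro = self.mm_processor.process_mm_data_async(**kwargs)
        if self.timeout_s is None:
            return await coro
        return await asyncio.wait_for(coro, timeout=self.timeout_s)

    async def process(self, **kwargs):
        if self.semaphore is None:
            return await self._invoke(**kwargs)
        async with self.semaphore:
            return await self._invoke(**kwargs)


class StubProcessor:  # hypothetical processor with one real await point
    async def process_mm_data_async(self, *, delay_s):
        await asyncio.sleep(delay_s)
        return {"delay_s": delay_s}


async def main():
    wrapper = AsyncMMDataProcessor(StubProcessor(), max_concurrent_calls=2, timeout_s=0.1)
    fast = await wrapper.process(delay_s=0.01)
    try:
        await wrapper.process(delay_s=1.0)
        slow = "completed"
    except asyncio.TimeoutError:
        slow = "timed out"
    return fast, slow


fast, slow = asyncio.run(main())
print(fast, slow)
```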

yuan-luo · 2026-03-30T08:16:47Z

Here's a concrete example demonstrating where AsyncMMDataProcessor's timeout provides real value.

Setup

Any processor that has a real await point — whether from run_in_executor, aiohttp, asyncio.gather, or even multiple await self._process_single(...) calls — is protected by asyncio.wait_for.

The Scenario

An sglang server is serving a VLM whose processor uses run_in_executor for image loading (e.g., the LLaVA family). A client sends a request with a remote image URL.

# Client request
curl http://sglang-server:30000/v1/chat/completions \
  -d '{
    "model": "llava-v1.6-34b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://slow-cdn.example.com/huge-image.jpg"}},
        {"type": "text", "text": "Describe this image"}
      ]
    }]
  }'

The remote CDN is experiencing issues — it accepts the TCP connection but responds extremely slowly (trickles data at ~1KB/s for a 5MB image, so ~80 minutes to complete).

What Happens Inside TokenizerManager

The request enters _tokenize_one_request, which calls process_mm_data_async. Inside the processor:

# Simplified view of what happens inside a processor using run_in_executor
async def process_mm_data_async(self, image_data, ...):
    loop = asyncio.get_running_loop()
    # This is a REAL await point - the event loop is free while the thread runs
    image = await loop.run_in_executor(
        self._executor,
        self._download_and_decode,  # calls requests.get(url) internally
        image_data[0]
    )
    # ... CPU preprocessing ...
    return result

Timeline: WITHOUT AsyncMMDataProcessor (after this PR)

T=0s      Client sends request. TokenizerManager creates async task.
T=0s      process_mm_data_async starts. run_in_executor spawns a thread
          to download the image. The event loop is FREE (this is a real await).
T=0s~     Event loop happily serves other requests. So far so good.

T=30s     Client's HTTP timeout fires. Client disconnects.
          But the download thread is still running. The coroutine is still
          awaiting the Future from run_in_executor. Nobody cancels it.

T=4800s   (~80 min later) Download thread finally finishes.
          The Future resolves. The coroutine resumes.
          CPU preprocessing runs. Result is returned.
          TokenizerManager sends the request to the Scheduler.
          Scheduler allocates KV cache, queues it for prefill.
          GPU runs prefill (1-3 seconds of H100 compute).
          GPU runs decode, generates full response.
          TokenizerManager tries to send response... but the client
          is long gone. All GPU compute wasted.

Impact: The request consumed GPU time and KV cache memory for a response that nobody will ever read. Under load, orphaned requests like this accumulate and degrade throughput for legitimate requests.

Timeline: WITH AsyncMMDataProcessor (before this PR)

T=0s      Client sends request. TokenizerManager creates async task.
T=0s      AsyncMMDataProcessor.process() wraps the call:
            await asyncio.wait_for(_invoke(), timeout=10.0)
T=0s      _invoke() calls process_mm_data_async.
          run_in_executor spawns a thread. Event loop is FREE.

T=10s     asyncio.wait_for timeout fires.
          Since the coroutine is suspended at a real `await` point
          (waiting on the run_in_executor Future), the event loop
          CAN cancel the Task right now.
          → asyncio.TimeoutError is raised.
          → The request is immediately rejected with an error.
          → The Semaphore is released.
          → NO GPU compute is wasted.
          → NO KV cache is allocated for this request.

          (The download thread continues in the background, but its
          result will be discarded when the Future is garbage collected.)

Impact: The system spends 0 GPU seconds on this doomed request. The error propagates back quickly (even though the client already disconnected at T=30s, the server-side cleanup happened at T=10s, freeing resources 4790 seconds earlier).
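
The parenthetical note in the timeline, that the download thread keeps running after the timeout, can be reproduced in miniature. In this sketch `slow_download` is a hypothetical stand-in for a blocking HTTP call; nothing here is taken from the sglang code.

```python
import asyncio
import threading
import time

download_finished = threading.Event()


def slow_download():
    # Stand-in for a blocking HTTP download; asyncio cannot interrupt it.
    time.sleep(0.3)
    download_finished.set()
    return b"image-bytes"


async def main():
    loop = asyncio.get_running_loop()
    try:
        await asyncio.wait_for(loop.run_in_executor(None, slow_download), timeout=0.05)
        outcome = "completed"
    except asyncio.TimeoutError:
        outcome = "timed out"
    finished_at_timeout = download_finished.is_set()
    # Wait (off the event loop) for the worker thread to finish on its own.
    await loop.run_in_executor(None, download_finished.wait, 2.0)
    return outcome, finished_at_timeout, download_finished.is_set()


outcome, finished_at_timeout, finished_later = asyncio.run(main())
print(outcome, finished_at_timeout, finished_later)
```

The awaiting coroutine is released at the timeout, but the worker thread runs to completion. Bounding the download itself would need a timeout on the blocking call inside the thread.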

yhyang201 · 2026-03-30T08:46:50Z

I have a question I’d like to ask: after simplification, is the AsyncMMDataProcessor only applicable to the LLaVA processor?

yuan-luo · 2026-03-30T09:40:16Z

Thanks for the digging. I did a code audit and want to share what I found.

Code Audit Results

I ran grep -rn "await " python/sglang/srt/multimodal/processors/*.py and examined all await expressions across every processor. Here's the breakdown:

Processors with real yield points (where asyncio.wait_for IS effective):

llava.py — uses await loop.run_in_executor(...) and await asyncio.gather(...). These are genuine yield points.

Processors with fake-async await (where asyncio.wait_for is NOT effective):

qwen_vl.py, ernie45_vl.py, interns1pro.py — all await preprocess_video(...), but preprocess_video is declared async def with zero internal await expressions. Pure synchronous CPU work (numpy, video decoding, torch ops) inside an async shell.

So your analysis is largely correct for the current codebase — the timeout/semaphore only provides real protection for llava.py.

The Deeper Problem

However, this audit also revealed something concerning: preprocess_video in qwen_vl.py is async def but performs heavy synchronous video decoding (vr.get_frames_as_tensor(...)) with no await — it blocks the event loop during the entire decode. The same pattern exists in ernie45_vl.py and interns1pro.py. For a long video, this can block the TokenizerManager event loop for seconds, causing the exact same /metrics stall you described in #12066's motivation.

This is the root cause that #12066 tried to address. The AsyncMMDataProcessor wrapper was a mitigation layer, but the real fix should be making these functions truly async — wrapping the heavy work in run_in_executor, like llava.py already does.
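
The difference between the two styles can be measured with a heartbeat task. The sketch below uses hypothetical names (`decode_video_sync`, `preprocess_blocking`, `preprocess_offloaded`) and `time.sleep` as a stand-in for decoding; it is not the real `preprocess_video`.

```python
import asyncio
import time


def decode_video_sync(seconds):
    # Stand-in for heavy synchronous decoding work.
    time.sleep(seconds)
    return "frames"


async def preprocess_blocking(seconds):
    # "Fake async": runs the sync work directly on the event loop thread.
    return decode_video_sync(seconds)


async def preprocess_offloaded(seconds):
    # Offloads the sync work to the default thread pool and awaits the result.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, decode_video_sync, seconds)


async def max_loop_stall(preprocess, seconds=0.3, tick=0.01):
    """Run preprocess while a heartbeat task records the longest gap between ticks."""
    worst = 0.0
    stop = False

    async def heartbeat():
        nonlocal worst
        last = time.perf_counter()
        while not stop:
            await asyncio.sleep(tick)
            now = time.perf_counter()
            worst = max(worst, now - last)
            last = now

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    await preprocess(seconds)
    stop = True
    await hb
    return worst


async def main():
    blocked = await max_loop_stall(preprocess_blocking)
    offloaded = await max_loop_stall(preprocess_offloaded)
    return blocked, offloaded


blocked, offloaded = asyncio.run(main())
print(f"worst stall, blocking: {blocked:.2f}s; offloaded: {offloaded:.2f}s")
```

One caveat: `time.sleep` releases the GIL, so this shows the best case. Real decoding in a thread only frees the event loop to the extent the underlying native code releases the GIL.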

My Concern

This PR removes the mitigation layer without fixing the root cause. After this lands:

llava.py loses its working timeout/semaphore protection
The fake-async processors continue to block the event loop (unchanged)
When someone later fixes the fake-async processors (wrapping video decode in run_in_executor), there will be no timeout/semaphore infrastructure to protect them

Would you consider one of these alternatives?

Option A: Keep a minimal wrapper (just wait_for + Semaphore, drop all the dead sync-fallback code), so llava.py stays protected and future processor fixes automatically get protection.

Option B: If you want to remove the wrapper entirely, could we first fix preprocess_video in qwen_vl.py/ernie45_vl.py/interns1pro.py to use run_in_executor (making them truly async like llava.py), and then add timeout protection at the call site in tokenizer_manager.py directly? That way we're not removing protection without also fixing the underlying event-loop-blocking issue.

Either way, the preprocess_video blocking issue should probably be tracked as a separate issue — Qwen-VL video requests are blocking the TokenizerManager event loop today, which is a latent production risk.

yhyang201 · 2026-03-30T09:55:33Z

Thanks for the thorough audit — the findings align with ours.

Two points:

1. LLaVA timeout protection is preserved in this PR. We added asyncio.wait_for directly inside LlavaImageProcessor._process_single_image, wrapping the run_in_executor call. This is more precise than the external wrapper: it applies the timeout exactly at the real await point, per image rather than per request. See the latest commit.
2. The preprocess_video blocking issue is orthogonal to this PR. AsyncMMDataProcessor cannot mitigate it either: qwen_vl.process_mm_data_async() blocks the event loop on self.load_mm_data() → future.result() before preprocess_video is even reached, and preprocess_video itself has zero await points. With or without the wrapper, asyncio.wait_for has no opportunity to fire. The real fix is to make preprocess_video (and load_mm_data) truly async via run_in_executor, which is a separate effort worth tracking as its own issue.
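
The per-image timeout described in the first point can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `PerImageTimeoutProcessor`, `_load_image`, and the use of `return_exceptions=True` are hypothetical choices made for the example.

```python
import asyncio
import time


class PerImageTimeoutProcessor:
    """Hypothetical processor; names and structure are illustrative only."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s

    @staticmethod
    def _load_image(delay_s):
        time.sleep(delay_s)  # stand-in for blocking download + decode
        return f"image({delay_s})"

    async def _process_single_image(self, delay_s):
        loop = asyncio.get_running_loop()
        # Timeout applied exactly at the real await point, once per image.
        return await asyncio.wait_for(
            loop.run_in_executor(None, self._load_image, delay_s),
            timeout=self.timeout_s,
        )

    async def process_mm_data_async(self, delays):
        # return_exceptions=True keeps one slow image from hiding the others' results.
        return await asyncio.gather(
            *(self._process_single_image(d) for d in delays),
            return_exceptions=True,
        )


async def main():
    proc = PerImageTimeoutProcessor(timeout_s=0.1)
    return await proc.process_mm_data_async([0.01, 0.4])


results = asyncio.run(main())
print(results[0], type(results[1]).__name__)
```

Here the fast image succeeds and only the slow one times out, which is the per-image granularity the external per-request wrapper could not provide.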

yuan-luo · 2026-03-30T10:20:05Z

Thanks. Opinions are aligned. Let's remove this wrapper.

LLaVA is the only processor with real await points (via run_in_executor), so add per-image timeout protection directly where it is effective.

mickqian · 2026-03-30T10:59:47Z

/tag-and-rerun-ci

(sgl-project#21772) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [CI] Remove more redundant PCG tests (sgl-project#21554) * [moe] add customized option to moe-a2a-backend (sgl-project#21786) * Add CompletionSampler for non-chat eval in run_eval (sgl-project#21785) * Remove redundant test_moe_eval_accuracy_large (sgl-project#21787) * Increase hicache eval to 200 examples (sgl-project#21791) * Switch MooncakeSpec to EAGLE3 + Llama-3.1 (sgl-project#21794) * Reduce redundant speculative decoding CI tests (sgl-project#21779) * Fix killall.py crash when sglang is not yet installed (sgl-project#21797) * Remove obsolete sgl-kernel legacy paths (sgl-project#21528) * [jit_kernel] Optimize fused_qknorm_rope: deduplicate sincosf for interleave RoPE (sgl-project#21654) * CUTLASS NVFP4 GEMM improvement of SM120 (sgl-project#21314) * [gRPC] Preserve original ImportError in grpc_server.py (sgl-project#21801) Signed-off-by: Chang Su <chang.s.su@oracle.com> * [Misc] Tiny: Add test network timeouts and dynamic max-parallel for 5090/2-gpu runners (sgl-project#21800) * Fix draft extend cuda graph when spec_step=1 (sgl-project#21709) * [Diffusion] Add `--uvicorn-access-log-exclude-prefixes` to suppress noisy access logs (sgl-project#20379) * Add latency and throughput metrics to run_eval (sgl-project#21793) * [diffusion] CI: improve ci reliability (sgl-project#21763) * [bugfix]GLM-4V model (sgl-project#17122) * Fix CVEs in Docker image: pillow, linux-libc-dev, and broken sgl-model-gateway build (sgl-project#21789) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: only showing recent runners from ci failure analysis (sgl-project#21015) * [MPS] Fix Triton stub sub-module imports on Python 3.12+ (sgl-project#21551) Co-authored-by: karanb192 <karan@example.com> Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com> * [KDA] Fuse scaled_dot_kkt + solve_tril + recompute_w_u for KDA 
(sgl-project#21604) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> * chore: bump flashinfer version to 0.6.7 (sgl-project#21422) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> * [3/n] lora moe - Support Qwen3-VL-30B-A3B-Instruct (sgl-project#21469) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> * [Feature Restoration] repetition_penalty is essential for GLM-V models (sgl-project#21258) Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> * VLM: change default mm-attention backend from triton_attn to fa4 (on blackwell) (sgl-project#21595) * Fix added tokens config with sensible filter (sgl-project#17905) * [AMD] Optimize Qwen3-VL decode - fuse QK-norm + 3D mRoPE + KV cache write (sgl-project#21458) Co-authored-by: Bingxu Chen <bingxche@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com> * [Bugfix] Fix PP tied embeddings weight loading for qwen3.5 4B dense model (sgl-project#21347) * [CI] Fix lint that was not applied in sgl-project#21458 (sgl-project#21818) * Bug fix for llama eagle3 (sgl-project#21397) * glm_interleave for GLM-V (sgl-project#21671) * style refinement for hisparse (sgl-project#21198) * [Bug][VLM] Fix shared memory race condition in ShmPointerMMData broadcast for multi-GPU VLM serving (sgl-project#21655) * [Bugfix] Fix effective_mamba_size over-allocation (sgl-project#20858) Co-authored-by: Shangming Cai <csmthu@gmail.com> * Fix in-place mode in pause generation (sgl-project#21705) * [diffusion] fix: respect --prompt-path (sgl-project#21756) * [NPU] update ascend docs (sgl-project#21807) * [VLM] remove AsyncMMDataProcessor wrapper (sgl-project#21651) * Use CustomTestCase for TestSessionControl to enable CI retry (sgl-project#21830) * [NPU]Add a full test pipeline on NPU, 
resolve issues in the NPU test architecture (sgl-project#20751) * [diffusion][CI]: Add individual component accuracy CI for diffusion models (sgl-project#18709) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * [Feature] JIT rmsnorm update (with claude) (sgl-project#21834) * [Diffusion][NPU] add ring sp performance benchmark page in npu (sgl-project#21811) * fix(MiMo-V2-Flash): add mimo reasoning parser (sgl-project#21414) * [diffusion] hardware: support FA3 attention backend on MUSA (attn backend, 14/N) (sgl-project#18648) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Mick <mickjagger19@icloud.com> * fix: pre-init tokenizer_manager to avoid AttributeError in shutdown (sgl-project#21824) * [FlashInver v0.6.7] Integrate flashinfer_trtllm mxfp8 gemm (sgl-project#21576) * [Misc] Add network timeout to eval dataset downloads (sgl-project#21873) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [refactor] Clean up duplicate flashinfer trtllm moe code (sgl-project#21233) * [DSA] Support trtllm sparse mla kernel for prefill batches (sgl-project#21783) * [Disagg] GPU staging buffer with dynamic ring allocator for heterogeneous TP KV transfer (sgl-project#19890) * Add merge prohibition policy during CI maintenance mode (sgl-project#21882) * [Misc] Fix comparator e2e tests: add polars dep + fix dp-attention test (sgl-project#21804) Co-authored-by: Alison Shao <alison.shao@mac.lan> * revert: remove TTL-based hard pin from HiRadixCache (sgl-project#21884) * Unify GSM8K eval path to Chat API for regression CI readiness (sgl-project#21667) * [HiCache] fix: Clone host indices to avoid memory leak (sgl-project#21624) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> * [HiCache & PD]Fixed detailed cache hit breakdown in PD scenarios. 
(sgl-project#21764) * [CI] Add Llama 3.1 8B Instruct FP4 CI test on SM120 (sgl-project#20648) * [CI] Add Per-Tensor, Blockwise FP8 Tests on SM120 (sgl-project#20717) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> * Allow /rerun-test to checkout fork PR branch for trusted users (sgl-project#21890) * Direct model loading from object storage with Runai Model Streamer (sgl-project#17948) Signed-off-by: Noa Neria <noa@run.ai> * fix pcg torch dynamo recompile in mxfp8 Triton path (sgl-project#21888) Co-authored-by: Hanlin Bi <hanlinbi@umich.edu> * chore: bump mooncake version to 0.3.10.post1 (sgl-project#21844) * [VLM] Add VLM TP=4 per-commit CI test and improve MMMU eval prompt/parser (sgl-project#21841) * fix(ci): update est_time for 57 tests based on runtime analysis (sgl-project#21896) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [CI] Increase multimodal server test timeout from 60 to 90 minutes (sgl-project#21897) * [CI] Remove crashing Kimi K2.5 EAGLE3/MTP variants, keep TP8 and TP8+DP8 (sgl-project#21898) * [diffusion] CI: add initial nvfp4 ci test for b200 (sgl-project#21767) Co-authored-by: Mick <mickjagger19@icloud.com> * Migrate all callers from /get_server_info to /server_info (sgl-project#21463) * Support PP key for file backend (sgl-project#21901) * Enable multi-thread weight loading by default (sgl-project#20289) * Skip Go stdlib and NVIDIA tool CVEs in Trivy scan (sgl-project#21905) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Fuse temperature + softmax in sampling for decode speedup (sgl-project#20501) * Multi tool streaming fix (sgl-project#20004) * Return HTTP 400 for streaming validation errors (sgl-project#21900) * [Spec][Ngram] 4/N: Remove `max_match_window_size` and `min_match_window_size`, matching all suffixes of the Trie (sgl-project#21225) * Fix ngram doc for speculative_num_draft_tokens default (sgl-project#21910) * [NVIDIA] Enable fp8 flashinfer_trtllm_routed MoE for 
MiniMax-M2.5 (sgl-project#20394) * scheduler: add prefill-only update in merge batch (sgl-project#21840) * [DSA] Set trtllm kernels as nsa default for Blackwell (sgl-project#21914) * Revert "Rollback flashmla to older version [1/2]" (sgl-project#21922) * test: add manual init test for mooncake transfer engine (sgl-project#21842) Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com> * Fix spec v2 + logprob when max_num_token is set (sgl-project#20799) * Migrate ngram corpus from torch cpp_extension to TVM FFI jit_kernel (sgl-project#21920) Co-authored-by: DarkSharpness <2040703891@qq.com> * [NPU] Support GLM-4.7-Flash on NPU (sgl-project#21408) * [CI] Fix gpu deps import in cpu test (sgl-project#21950) * [Parallel State Refactor 1/n] Remove stream of PyNCCL (sgl-project#20866) * [diffusion] chore: fix stage profiler for multi-stage denoising (sgl-project#21955) * [CI] [Tracing] Add ci for tracing and fix bugs (sgl-project#21740) * Remove logging for subprocess watchdog start (sgl-project#21968) * [4/n] Support gpt oss 20b lora (sgl-project#21570) * [MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) (sgl-project#17985) Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com> * [Feature] Stronger transformers modeling backend with TP, PP, MoE, VLMs, and torch compile (sgl-project#19163) * [CI] Remove stale Ascend suite entries from test/srt/run_suite.py (sgl-project#21978) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Skip broken AutoModel mapping entries when resolving Llava submodules (sgl-project#21892) * [CI] Add timeouts to Slack upload urlopen and WebClient (sgl-project#21903) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Diffusion][NPU] Add support for MOVA (sgl-project#21633) Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com> * Remove maxItems=1 restriction when tool_choice is specified (sgl-project#20208) * [Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+) 
(sgl-project#19652) * [PP] qwen3 vl skip layer id for pp (sgl-project#19135) * [VLM] Enable per-image MM splitting by default and remove MULTI_IMAGES modality (sgl-project#21899) * [Bugfix] Fix incorrect dp-attention parallel info in bench_one_batch (sgl-project#21519) * Revert "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" (sgl-project#22002) * [NPU] Optimized the wording in the npu docs (sgl-project#21998) * [Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (sgl-project#20871) * [AMD] Resolve the performance degression when launch server with "--enable-aiter-allreduce-fusion" (sgl-project#21947) Co-authored-by: wunhuang <wunhuang@amd.com> * chore: bump sgl-kernel version to 0.4.1 (sgl-project#21447) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> * [Workflow] Avoid triggering nightly tests in kernel bump workflow (sgl-project#22010) * [Workflow] Fix kernel release jobs skipped on push events (sgl-project#22011) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. 
(sgl-project#21591) Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> * [Misc] Update CI permission (sgl-project#22014) * [ROCM][RL] Shuffle Weight In-Place to Preserve Parameter Attributes (sgl-project#21825) * [CI] Fix duplicate job names that bypass branch protection (sgl-project#22001) * fix: remove duplicate words in comments (sgl-project#22007) * [PD] Tiny register info field cleanup for mooncake backend (sgl-project#22016) * [NPU] optimize glm4.7 (sgl-project#19246) * [AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend (sgl-project#21511) * [AMD] Add MiniMax-M2.5 nightly perf benchmarks for MI30x and MI35x (sgl-project#21524) --------- Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com> Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Signed-off-by: Shangming Cai <csmthu@gmail.com> Signed-off-by: Chang Su <chang.s.su@oracle.com> Signed-off-by: Noa Neria <noa@run.ai> Co-authored-by: Bingxu Chen <bingxche@amd.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: yang1002378395-cmyk <yang1002378395@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Bi Xue <bi@thinkingmachines.ai> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Muqi Li <muqi1029@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: narutolhy <582909902@qq.com> Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com> Co-authored-by: zhangxiaolei <zhangxiaolei.666@bytedance.com> Co-authored-by: Vladislav 
Nosivskoy <vladnosiv@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com> Co-authored-by: Fengyuan Yu <Yuandao151112@163.com> Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com> Co-authored-by: jianyingzhu <joeyzhu@nvidia.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jacob0226 <jacchang@amd.com> Co-authored-by: Aditya Sharma <89210949+adityavaid@users.noreply.github.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Артем Савкин <58187114+OrangeRedeng@users.noreply.github.com> Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com> Co-authored-by: Shu Wang <shuw@nvidia.com> Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com> Co-authored-by: Avery Huang <averyh@nvidia.com> Co-authored-by: jacky.cheng <yichiche@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com> Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com> Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com> Co-authored-by: psaab <ps@meta.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Hanlin Bi <52993433+wolfcomos@users.noreply.github.com> Co-authored-by: wili <98001977+wili-65535@users.noreply.github.com> Co-authored-by: saatwiknagpal 
<saatwiknagpal@gmail.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: wan4ch <wan4ch@gmail.com> Co-authored-by: Feng Su <sufeng@linux.alibaba.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: Polisetty V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com> Co-authored-by: Ziang Li <ziangli@umich.edu> Co-authored-by: Aishwarya Ramasethu <56765596+aramasethu@users.noreply.github.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com> Co-authored-by: blzheng <beilei.zheng@intel.com> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: Michelle Wu <michellewu351@gmail.com> Co-authored-by: wuxue (C) <w00964934@china.huawei.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: LiYomi <106872109+LiYomi@users.noreply.github.com> Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com> Co-authored-by: GXIN <37653830+gxxx-hum@users.noreply.github.com> Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local> Co-authored-by: heziiop <q_m_p@qq.com> Co-authored-by: xieminghe1 <141820649+xieminghe1@users.noreply.github.com> Co-authored-by: undefined <zhouchen.arrebol@jd.com> Co-authored-by: Makcum888e <79456407+Makcum888e@users.noreply.github.com> Co-authored-by: yuefeng Wu <33725817+ChefWu551@users.noreply.github.com> Co-authored-by: Yuxuan Zhang <2448370773@qq.com> Co-authored-by: Vedant V Jhaveri <vedantjh2@gmail.com> Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: Zhai Feiyue <80079571+ZhaiFeiyue@users.noreply.github.com> Co-authored-by: jhchouuu <jiahzhou@amd.com> Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com> Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local> Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com> Co-authored-by: 
Alison Shao <alison.shao@Mac.attlocal.net> Co-authored-by: Lewis <63569348+TTThanos@users.noreply.github.com> Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com> Co-authored-by: Jincong Chen <jincong.cjc@ant-intl.com> Co-authored-by: xiazhahe <86939755+xiazhahe@users.noreply.github.com> Co-authored-by: Thomas Wang <thomawan@amd.com> Co-authored-by: Ke Bao <ispobaoke@gmail.com> Co-authored-by: xiaoqi <xq25478@qq.com> Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com> Co-authored-by: weireweire <weiliangl@nvidia.com> Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal> Co-authored-by: JD <jaedon.guo@gmail.com> Co-authored-by: Zhangheng <hzh0425@apache.org> Co-authored-by: Michael <13900043+michaelzhang-ai@users.noreply.github.com> Co-authored-by: Yilong Zhao <74357408+happierpig@users.noreply.github.com> Co-authored-by: Johnsonms <lizhaofu@gmail.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: KnightLTC <56717110+KnightLTC@users.noreply.github.com> Co-authored-by: Douglas Yang <dyang@college.harvard.edu> Co-authored-by: Karan Bansal <karanb192@users.noreply.github.com> Co-authored-by: karanb192 <karan@example.com> Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com> Co-authored-by: sglang-bot <sglangbot@gmail.com> Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: sbeurnier <sbeurnier@together.ai> Co-authored-by: YC Yen-Ching Tseng <yctseng@amd.com> Co-authored-by: Wenyao Gao <105094497+edwingao28@users.noreply.github.com> Co-authored-by: Alex Nails <alex.nails@radixark.ai> Co-authored-by: khalilzhk <khalilzhk@gmail.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com> Co-authored-by: yunkchen <chenyunkuo.cyk@alibaba-inc.com> Co-authored-by: wduan-hai <wduan@humansand.ai> 
Co-authored-by: amote-i <49533125+amote-i@users.noreply.github.com> Co-authored-by: Cherry_ming <136634645@qq.com> Co-authored-by: Ratish P <114130421+Ratish1@users.noreply.github.com> Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com> Co-authored-by: Alison Shao <alison.shao@mac.lan> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: Derek Yu <81697272+DerekY2@users.noreply.github.com> Co-authored-by: Noa Neria <noa@run.ai> Co-authored-by: Hanlin Bi <hanlinbi@umich.edu> Co-authored-by: Prozac614 <dwt614707404@163.com> Co-authored-by: David Cheung <d7cheung@gmail.com> Co-authored-by: Mook <68294499+Godmook@users.noreply.github.com> Co-authored-by: Khoa Pham <khoa.pham@radixark.ai> Co-authored-by: foraxe <73625538+foraxe@users.noreply.github.com> Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com> Co-authored-by: DarkSharpness <2040703891@qq.com> Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com> Co-authored-by: ori <39351881+froststeam@users.noreply.github.com> Co-authored-by: Thomas <zs033@qq.com> Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com> Co-authored-by: lviy <142899752+lviy@users.noreply.github.com> Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com> Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com> Co-authored-by: Ricardo-M-L <69202550+Ricardo-M-L@users.noreply.github.com> Co-authored-by: Kelon <kelonlu@163.com> Co-authored-by: cen121212 <luochen23@huawei.com>

remove AsyncMMDataProcessor

49f63c9

yhyang201 requested review from Ying1123, hnyls2002, merrymercy and xiezhq-hermann as code owners March 30, 2026 03:48

gemini-code-assist bot reviewed Mar 30, 2026

github-actions bot added the run-ci label Mar 30, 2026

upd

7470800

yhyang201 requested review from JustinTong0323, mickqian and yuan-luo as code owners March 30, 2026 09:58

yuan-luo approved these changes Mar 30, 2026

Add asyncio.wait_for timeout to LLaVA processor

0dbae9b

LLaVA is the only processor with real await points (via run_in_executor), so add per-image timeout protection directly where it is effective.
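
The idea in this commit message can be illustrated with a minimal sketch. It is not the SGLang LLaVA code: `decode_image`, `process_images`, and `run_demo` are hypothetical names, and the blocking work is simulated with `time.sleep`. It only shows why a per-image `asyncio.wait_for` is effective when each image is awaited through `run_in_executor`.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def decode_image(path: str, delay: float) -> str:
    """Hypothetical stand-in for blocking image preprocessing."""
    time.sleep(delay)
    return f"decoded:{path}"


async def process_images(pool, items, timeout_s: float):
    """Await each image in a worker thread, with its own timeout."""
    loop = asyncio.get_running_loop()
    results = []
    for path, delay in items:
        # run_in_executor returns an awaitable future: a real await point.
        fut = loop.run_in_executor(pool, decode_image, path, delay)
        try:
            results.append(await asyncio.wait_for(fut, timeout=timeout_s))
        except asyncio.TimeoutError:
            # The await is abandoned; the worker thread itself keeps
            # running until decode_image returns.
            results.append(None)
    return results


def run_demo():
    with ThreadPoolExecutor(max_workers=2) as pool:
        items = [("a.png", 0.0), ("b.png", 0.6)]
        return asyncio.run(process_images(pool, items, timeout_s=0.2))
```

Note that the timeout only stops the caller from waiting. It does not kill the thread doing the work, so a timed-out image still occupies an executor slot until it finishes.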

yhyang201 force-pushed the remove_asyncmmdataprocessor branch from 7d4dd19 to 0dbae9b on March 30, 2026 10:27

yhyang201 merged commit 1aabe44 into sgl-project:main Apr 1, 2026
1061 of 1161 checks passed

satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026

[VLM] remove AsyncMMDataProcessor wrapper (sgl-project#21651)

ce7c939

Conversation

yhyang201 commented Mar 30, 2026

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

gemini-code-assist bot left a comment

Code Review

yhyang201 commented Mar 30, 2026

yuan-luo commented Mar 30, 2026

1. The analysis conflates two very different phases of `process_mm_data_async`

2. `asyncio.wait_for` does work for processors with real `await` points

3. Even for "fake async" processors, timeout still prevents GPU resource waste

4. The Semaphore provides genuine backpressure for eligible processors

5. Removing server args breaks existing deployments
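
The disagreement over timeouts and the semaphore turns on whether a processor has real `await` points. The following toy sketch makes that distinction concrete under stated assumptions: `fake_async`, `real_async`, `timed_out`, and `peak_in_flight` are hypothetical helpers, not SGLang APIs, and the work is simulated with sleeps.

```python
import asyncio
import time


async def fake_async(delay: float) -> str:
    time.sleep(delay)  # blocks the event loop; there is no await point
    return "done"


async def real_async(delay: float) -> str:
    await asyncio.sleep(delay)  # yields control back to the event loop
    return "done"


async def timed_out(coro_fn, delay: float, timeout_s: float) -> bool:
    """Return True if asyncio.wait_for actually interrupted the call."""
    try:
        await asyncio.wait_for(coro_fn(delay), timeout=timeout_s)
        return False
    except asyncio.TimeoutError:
        return True


async def peak_in_flight(coro_fn, n: int, limit=None) -> int:
    """Launch n calls at once and record the peak number in flight."""
    sem = asyncio.Semaphore(limit) if limit else None
    active = 0
    peak = 0

    async def worker():
        nonlocal active, peak
        if sem is not None:
            await sem.acquire()
        try:
            active += 1
            peak = max(peak, active)
            await coro_fn(0.02)
        finally:
            active -= 1
            if sem is not None:
                sem.release()

    await asyncio.gather(*(worker() for _ in range(n)))
    return peak
```

In this sketch, `timed_out(real_async, 0.3, 0.05)` reports a timeout while `timed_out(fake_async, 0.1, 0.02)` does not, because the blocking call finishes before the event loop can deliver the timeout. Likewise, a semaphore limit lowers the peak for `real_async`, whereas `fake_async` calls already run one at a time with or without it.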

yuan-luo commented Mar 30, 2026 (edited)

yuan-luo commented Mar 30, 2026 (edited)

Setup

The Scenario

What Happens Inside TokenizerManager

Timeline: WITHOUT AsyncMMDataProcessor (after this PR)

Timeline: WITH AsyncMMDataProcessor (before this PR)

yhyang201 commented Mar 30, 2026

Setup

The Scenario

What Happens Inside TokenizerManager

Timeline: WITHOUT AsyncMMDataProcessor (after this PR)

Timeline: WITH AsyncMMDataProcessor (before this PR)

yuan-luo commented Mar 30, 2026 (edited)

yhyang201 commented Mar 30, 2026

yuan-luo commented Mar 30, 2026

mickqian commented Mar 30, 2026

3 participants
