[docs] Instructions for bench_serving.py #9071
zhaochenyang20 merged 30 commits into sgl-project:main
Conversation
Summary of Changes
Hello @yhyang201, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've been working on enhancing our benchmarking utility, "bench_serving.py", to support multi-image and text inputs. This pull request introduces a new "random-image" dataset, allowing us to generate synthetic image data with configurable counts and resolutions for more comprehensive benchmarking of multi-modal models.
Highlights
- New "random-image" Dataset: A dedicated dataset has been added for generating random images alongside text prompts, specifically for benchmarking multi-modal models.
- Configurable Image Count: A new CLI option, "--random-image-num-images", has been introduced to allow users to specify the number of images to be included per request during benchmarking.
- Configurable Image Resolution: The "--random-image-resolution" argument now enables selection of image resolutions (1080p, 720p, 360p) for the dynamically generated images.
- Multi-Image Backend Support: Both the OpenAI Chat and SGLang backends have been updated to correctly process and handle requests that include multiple image inputs, ensuring compatibility with the new dataset.
- Dynamic Image Generation: A utility has been implemented to generate random image data URIs on the fly, facilitating the creation of diverse image inputs for benchmarking purposes.
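As a sketch of how such a data-URI generator might look (the helper name `random_image_data_uri` is hypothetical; the actual utility in bench_serving.py may differ in naming and encoding details):

```python
import base64
import io
import os

try:
    from PIL import Image
except ImportError:  # Pillow is optional; fail lazily with a clear message
    Image = None


def random_image_data_uri(width: int, height: int) -> str:
    """Build a data: URI for a random RGB JPEG of the given size.

    Hypothetical helper; bench_serving.py's actual implementation may differ.
    """
    if Image is None:
        raise ImportError(
            "Please install Pillow to generate random images: pip install pillow"
        )
    # Random bytes give an incompressible image, which stresses the
    # multimodal pipeline more than a flat-colored one would.
    img = Image.frombytes("RGB", (width, height), os.urandom(width * height * 3))
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"
```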
Code Review
This pull request introduces a random-image dataset for benchmarking multi-modal models, which is a great addition. The changes correctly update the request handling to support multiple images for both OpenAI Chat and SGLang backends. My review includes a few suggestions for the new sample_random_image_requests function to improve exception handling, adhere to Python's import conventions, and clarify the logic for the number of images generated.
python/sglang/bench_serving.py
Catching a specific ImportError is better than a generic Exception. This makes the code's intent clearer and avoids masking other potential errors during the import of the PIL library.
-    except Exception as e:
+    except ImportError as e:
         raise ImportError(
             "Please install Pillow to generate random images: pip install pillow"
         ) from e
python/sglang/bench_serving.py
Why is the seed changed at this point? For the purpose of reproducibility, it ought to be fixed.
A common pitfall occurs when bench_serving is run multiple times with the random or random-image dataset. Because the dataset name contains "random," users often assume that new data is generated on each run, and therefore leave the radix tree enabled by default.
In reality, the same seed is used by default, so identical requests are sent each time. This lets the radix tree cache and accelerate processing, which can skew benchmark results.
For reproducibility, the --seed option can be set manually. This changes only the default random seed and does not alter any other behavior.
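A minimal illustration of the point, using the plain `random` module (bench_serving.py has its own sampling logic; this only models the seeding behavior):

```python
import random


def sample_lengths(seed: int, n: int = 5, low: int = 256, high: int = 1024):
    # Model of bench_serving's seeded sampling: the "random" dataset is
    # fully deterministic for a given seed.
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]


# Two runs with the default seed send identical requests, so an enabled
# radix cache can serve the repeated prefixes and flatter the benchmark.
assert sample_lengths(0) == sample_lengths(0)
# Overriding --seed gives a different, but still reproducible, workload.
assert sample_lengths(0) != sample_lengths(1)
```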
    ) from e

    # Check for potentially problematic combinations and warn user
    if width * height >= 1920 * 1080 and num_images * num_requests >= 100:
The variables width/height appear to be undefined here.
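One plausible way to define width/height before that check (the helper and its preset table are illustrative assumptions, not the exact code in the PR):

```python
def parse_resolution(name: str) -> tuple[int, int]:
    # Map the --random-image-resolution choices to (width, height).
    presets = {
        "1080p": (1920, 1080),
        "720p": (1280, 720),
        "360p": (640, 360),
    }
    if name not in presets:
        raise ValueError(f"Unsupported resolution: {name}")
    return presets[name]


width, height = parse_resolution("720p")
```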
While doing bench serving, on the server side: [2025-08-22 18:32:29] Prefill batch. #new-seq: 4, #new-token: 14985, #cached-token: 0, token usage: 0.03, #running-req: 29, #queue-req: 85,
[2025-08-22 18:32:30] Memory allocated: 152414226944
[2025-08-22 18:32:30] Memory reserved: 153085804544
[2025-08-22 18:34:03] ERROR: Exception in ASGI application
+ Exception Group Traceback (most recent call last):
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 77, in collapse_excgroups
| yield
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 271, in __call__
| async with anyio.create_task_group() as task_group:
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
| await self.app(scope, receive, _send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
| await self.app(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
| await route.handle(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
| await self.app(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 78, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
| await response(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 270, in __call__
| with collapse_excgroups():
| ^^^^^^^^^^^^^^^^^^^^
| File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
| self.gen.throw(value)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 83, in collapse_excgroups
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 274, in wrap
| await func()
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 254, in stream_response
| async for chunk in self.body_iterator:
| File "/root/sglang/python/sglang/srt/entrypoints/openai/serving_chat.py", line 439, in _generate_chat_stream
| async for content in self.tokenizer_manager.generate_request(
| File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 493, in generate_request
| tokenized_obj = await self._tokenize_one_request(obj)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 547, in _tokenize_one_request
| mm_inputs: Dict = await self.mm_processor.process_mm_data_async(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 251, in process_mm_data_async
| mm_items, input_ids, ret = self.process_and_combine_mm_data(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 616, in process_and_combine_mm_data
| collected_items, input_ids, ret = self._process_and_collect_mm_items(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 565, in _process_and_collect_mm_items
| ret = self.process_mm_data(
| ^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 236, in process_mm_data
| result = processor.__call__(
| ^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
| num_image_tokens = image_grid_thw[index].prod() // merge_length
| ~~~~~~~~~~~~~~^^^^^^^
| IndexError: index 3 is out of bounds for dimension 0 with size 3
+------------------------------------
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.python/sglang/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
await route.handle(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
await self.app(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 78, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
await response(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 270, in __call__
with collapse_excgroups():
^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
self.gen.throw(value)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 83, in collapse_excgroups
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 274, in wrap
await func()
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 254, in stream_response
async for chunk in self.body_iterator:
File "/root/sglang/python/sglang/srt/entrypoints/openai/serving_chat.py", line 439, in _generate_chat_stream
async for content in self.tokenizer_manager.generate_request(
File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 493, in generate_request
tokenized_obj = await self._tokenize_one_request(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 547, in _tokenize_one_request
mm_inputs: Dict = await self.mm_processor.process_mm_data_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 251, in process_mm_data_async
mm_items, input_ids, ret = self.process_and_combine_mm_data(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 616, in process_and_combine_mm_data
collected_items, input_ids, ret = self._process_and_collect_mm_items(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 565, in _process_and_collect_mm_items
ret = self.process_mm_data(
^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 236, in process_mm_data
result = processor.__call__(
^^^^^^^^^^^^^^^^^^^
File "/root/.python/sglang/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
num_image_tokens = image_grid_thw[index].prod() // merge_length
Please also try to fix this: /root/sglang/python/sglang/bench_serving.py:1210: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
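Assuming the warning comes from passing an explicit `mode=` to `Image.fromarray` (the usual source of this deprecation), the fix is to drop the argument and let Pillow infer the mode from the array:

```python
try:
    import numpy as np
    from PIL import Image
except ImportError:  # both assumed available in the benchmarking environment
    np = Image = None

if Image is not None:
    arr = np.zeros((360, 640, 3), dtype=np.uint8)
    # Deprecated in Pillow 13: Image.fromarray(arr, mode="RGB")
    # For an (H, W, 3) uint8 array the mode is inferred as RGB anyway.
    img = Image.fromarray(arr)
    assert img.mode == "RGB"
```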
Refer to this: we should split this PR into two commits:
This is a code snippet for memory analysis:

#!/usr/bin/env python3
"""
Memory analysis script: read memory data from a log file and plot usage curves.
Usage: python memory_analyzer.py <log_file_path>
"""
import argparse
import sys
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def parse_memory_log(log_file):
    """Parse the memory log file."""
    try:
        # Read the CSV file
        df = pd.read_csv(log_file)
        # Convert timestamps to datetime objects
        df["timestamp"] = pd.to_datetime(df["timestamp"])
        # Memory values are already in MiB, so use them directly
        df["memory_allocated_mb"] = df["memory_allocated"]
        df["memory_reserved_mb"] = df["memory_reserved"]
        return df
    except Exception as e:
        print(f"Error: failed to parse log file {log_file}: {e}")
        return None


def create_memory_plots(df, output_prefix=None):
    """Create the memory usage plots."""
    if df is None or df.empty:
        print("Error: no valid data to plot")
        return
    # Configure fonts (SimHei kept for CJK labels, with a fallback)
    plt.rcParams["font.sans-serif"] = ["SimHei", "DejaVu Sans"]
    plt.rcParams["axes.unicode_minus"] = False
    # Create three subplots
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 10))
    fig.suptitle("GPU Memory Usage Over Time", fontsize=16, fontweight="bold")
    # First plot: memory allocated
    ax1.plot(
        df["timestamp"],
        df["memory_allocated_mb"],
        color="blue",
        linewidth=2,
        label="Allocated Memory",
    )
    ax1.set_ylabel("Memory Allocated (MiB)", fontsize=12)
    ax1.grid(True, alpha=0.3)
    ax1.legend()
    ax1.set_title("GPU Memory Allocated Over Time")
    # Second plot: memory reserved
    ax2.plot(
        df["timestamp"],
        df["memory_reserved_mb"],
        color="red",
        linewidth=2,
        label="Reserved Memory",
    )
    ax2.set_ylabel("Memory Reserved (MiB)", fontsize=12)
    ax2.grid(True, alpha=0.3)
    ax2.legend()
    ax2.set_title("GPU Memory Reserved Over Time")
    # Third plot: allocated vs reserved comparison
    ax3.plot(
        df["timestamp"],
        df["memory_allocated_mb"],
        color="blue",
        linewidth=2,
        label="Allocated",
        alpha=0.8,
    )
    ax3.plot(
        df["timestamp"],
        df["memory_reserved_mb"],
        color="red",
        linewidth=2,
        label="Reserved",
        alpha=0.8,
    )
    ax3.fill_between(
        df["timestamp"], df["memory_allocated_mb"], alpha=0.3, color="blue"
    )
    ax3.fill_between(df["timestamp"], df["memory_reserved_mb"], alpha=0.3, color="red")
    ax3.set_ylabel("Memory Usage (MiB)", fontsize=12)
    ax3.set_xlabel("Time", fontsize=12)
    ax3.grid(True, alpha=0.3)
    ax3.legend()
    ax3.set_title("GPU Memory Allocated vs Reserved Comparison")
    # Format the time axis
    for ax in [ax1, ax2, ax3]:
        ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S"))
        ax.xaxis.set_major_locator(mdates.SecondLocator(interval=30))
        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)
    plt.tight_layout()
    # Save the figure
    if output_prefix:
        output_file = f"{output_prefix}_memory_analysis.png"
    else:
        output_file = "memory_analysis.png"
    plt.savefig(output_file, dpi=300, bbox_inches="tight")
    print(f"Figure saved to: {output_file}")
    # Show the figure
    plt.show()


def print_memory_stats(df):
    """Print memory usage statistics."""
    if df is None or df.empty:
        return
    print("\n=== Memory Usage Statistics ===")
    print(f"Number of records: {len(df)}")
    print(
        f"Monitoring duration: {(df['timestamp'].iloc[-1] - df['timestamp'].iloc[0]).total_seconds():.1f} s"
    )
    print("\nAllocated memory (MiB):")
    print(f"  min:  {df['memory_allocated_mb'].min():.2f}")
    print(f"  max:  {df['memory_allocated_mb'].max():.2f}")
    print(f"  mean: {df['memory_allocated_mb'].mean():.2f}")
    print(f"  std:  {df['memory_allocated_mb'].std():.2f}")
    print("\nReserved memory (MiB):")
    print(f"  min:  {df['memory_reserved_mb'].min():.2f}")
    print(f"  max:  {df['memory_reserved_mb'].max():.2f}")
    print(f"  mean: {df['memory_reserved_mb'].mean():.2f}")
    print(f"  std:  {df['memory_reserved_mb'].std():.2f}")
    # Compute memory utilization
    utilization = (df["memory_allocated_mb"] / df["memory_reserved_mb"]) * 100
    print("\nMemory utilization (%):")
    print(f"  min:  {utilization.min():.2f}")
    print(f"  max:  {utilization.max():.2f}")
    print(f"  mean: {utilization.mean():.2f}")


def main():
    parser = argparse.ArgumentParser(
        description="Analyze a GPU memory usage log file and plot usage curves"
    )
    parser.add_argument("log_file", help="path to the memory log file")
    parser.add_argument("--output", "-o", help="output image file prefix")
    parser.add_argument("--stats", "-s", action="store_true", help="print statistics")
    args = parser.parse_args()
    if not args.log_file:
        print("Error: please provide a log file path")
        sys.exit(1)
    # Parse the log file
    print(f"Parsing log file: {args.log_file}")
    df = parse_memory_log(args.log_file)
    if df is None:
        sys.exit(1)
    print(f"Successfully read {len(df)} records")
    # Print statistics
    if args.stats:
        print_memory_stats(df)
    # Create the plots
    create_memory_plots(df, args.output)


if __name__ == "__main__":
    main()
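The analyzer above assumes a CSV log with timestamp, memory_allocated, and memory_reserved columns (values in MiB). A sketch of the expected file format, using only the standard library (the sample values are illustrative):

```python
import csv
import io

# Example rows in the format parse_memory_log() expects.
log = io.StringIO()
writer = csv.writer(log)
writer.writerow(["timestamp", "memory_allocated", "memory_reserved"])
writer.writerow(["2025-08-22 18:32:30", 145338.5, 145978.9])
writer.writerow(["2025-08-22 18:33:00", 146102.2, 146620.0])

log.seek(0)
rows = list(csv.DictReader(log))
assert rows[0]["memory_allocated"] == "145338.5"
```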
Results on B200:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
python -m sglang.bench_serving \
--backend sglang-oai-chat \
--dataset-name random-image \
--num-prompts 500 \
--random-image-num-images 3 \
--random-image-resolution 720p \
--random-input-len 512 \
--random-output-len 512

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 498
Benchmark duration (s): 411.47
Total input tokens: 132763
Total generated tokens: 123381
Total generated tokens (retokenized): 30426
Request throughput (req/s): 1.21
Input token throughput (tok/s): 322.65
Output token throughput (tok/s): 299.85
Total token throughput (tok/s): 622.51
Concurrency: 491.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 406167.27
Median E2E Latency (ms): 407130.40
---------------Time to First Token----------------
Mean TTFT (ms): 360920.24
Median TTFT (ms): 367521.13
P99 TTFT (ms): 401069.34
---------------Inter-Token Latency----------------
Mean ITL (ms): 747.29
Median ITL (ms): 34.15
P95 ITL (ms): 273.88
P99 ITL (ms): 28534.37
Max ITL (ms): 345368.19
==================================================
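As a quick sanity check, the headline throughput numbers above are consistent with the raw request and token counts:

```python
# Cross-check the reported throughputs against the raw counts above.
requests, duration = 498, 411.47
input_toks, output_toks = 132763, 123381

assert round(requests / duration, 2) == 1.21                      # req/s
assert round(output_toks / duration, 2) == 299.85                 # output tok/s
assert round((input_toks + output_toks) / duration, 2) == 622.51  # total tok/s
```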
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
python/sglang/bench_serving.py
python3 -m sglang.bench_serving --backend sglang --num-prompt 10

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5
Please refer to https://docs.sglang.ai/developer_guide/bench_serving.html for details.
this doc is 404
After the merge, this link will be valid. It currently returns a 404 because the docs are submitted as part of this PR; once merged, users will see the new docs.
Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
Motivation
This is a temporary PR to evaluate the memory leak and provide better benchmarking for multi-modal input.
Modifications
Example
Launch server:

Launch benchmarking. Note that num-prompts will first collect 500 requests, then send them at once. Using a large num-prompts would increase the picture creation time. Note that we do not need to wait for bench_serving to end; we can always use the log to analyse the memory. Find the log file, like 20250822_181618_memory_log.txt, then we can get the mem profiler.

Note that we modified the event_loop_overlap function to let it do gc.collect(), torch.cuda.empty_cache(). You can manually disable it.

With gc.collect(), torch.cuda.empty_cache() always turned on, during the benchmarking the image processor takes over 30 GB of memory (on B200) to process the 500 requests, each of which has 3 images. After processing, the image processor's memory is released. Note this is without --max-concurrency 1 during the bench serving period.

Using --max-concurrency 1, we have: the converged value is steady.
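The temporary modification, calling gc.collect() and torch.cuda.empty_cache() from the scheduler loop, can be sketched as follows (the function name is illustrative; the real change lives in sglang's event_loop_overlap):

```python
import gc

try:
    import torch
except ImportError:  # torch is assumed present on the serving host
    torch = None


def maybe_release_memory() -> None:
    """Periodic cleanup, as temporarily wired into event_loop_overlap."""
    gc.collect()  # reclaim Python-level cycles (e.g. image processor buffers)
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver


maybe_release_memory()  # safe to call even without a GPU
```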
Okay. After removing the gc.collect(), torch.cuda.empty_cache() in the scheduler, the reserved memory indeed leaked. Note that I first ran the benchmark with --max-concurrency 1, then slept for 5 minutes, and ran the benchmarking without --max-concurrency 1.

Accuracy Tests
Benchmarking and Profiling
Checklist