
[docs] Instructions for bench_serving.py#9071

Merged
zhaochenyang20 merged 30 commits into sgl-project:main from yhyang201:bench/image
Aug 27, 2025

Conversation

Collaborator

@yhyang201 yhyang201 commented Aug 11, 2025

Motivation

This is a temporary PR to evaluate memory leaks and provide better benchmarking for multi-modal inputs.

Modifications

  • Added a random-image dataset for benchmarking multi-image + text inputs.
  • Added CLI options:
    • --random-image-num-images to set the number of images per request.
    • --random-image-resolution to select the image resolution (1080p, 720p, or 360p).
  • Updated request handling to support multiple images in both the OpenAI Chat and SGLang backends.
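The core of the random-image dataset is synthesizing an image and shipping it as a data URI. The sketch below is a hypothetical stand-in, not the PR's code: the real bench_serving.py presumably uses Pillow and emits a compressed format, whereas this version stays stdlib-only by writing an uncompressed 24-bit BMP. The resolution table mirrors the --random-image-resolution choices above.

```python
import base64
import random
import struct

# Assumed mapping for the --random-image-resolution choices.
RESOLUTIONS = {"1080p": (1920, 1080), "720p": (1280, 720), "360p": (640, 360)}


def random_image_data_uri(resolution="360p", seed=0):
    """Build a random image as a base64 data URI (stdlib-only BMP sketch)."""
    width, height = RESOLUTIONS[resolution]
    rng = random.Random(seed)
    row_len = width * 3                      # 3 bytes per pixel (BGR)
    padding = (4 - row_len % 4) % 4          # BMP rows are 4-byte aligned
    pixel_data = b"".join(
        rng.randbytes(row_len) + b"\x00" * padding for _ in range(height)
    )
    # BITMAPFILEHEADER (14 bytes) + BITMAPINFOHEADER (40 bytes)
    header = struct.pack("<2sIHHI", b"BM", 54 + len(pixel_data), 0, 0, 54)
    info = struct.pack(
        "<IiiHHIIiiII", 40, width, height, 1, 24, 0, len(pixel_data),
        2835, 2835, 0, 0,
    )
    payload = base64.b64encode(header + info + pixel_data).decode()
    return f"data:image/bmp;base64,{payload}"
```

This only shows the data-URI plumbing; the memory behavior discussed below depends on the actual image encoder and processor used.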

Example

git clone -b bench/image https://github.com/yhyang201/sglang.git 
cd sglang

pip install -e "python[all]"

Launch server:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache

Launch benchmarking. Note that bench_serving first generates all num-prompts requests (here, 500) and then sends them at once, so a large num-prompts increases the up-front image creation time.

python -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name random-image \
    --num-prompts 500 \
    --random-image-num-images 3 \
    --random-image-resolution 720p \
    --random-input-len 512 \
    --random-output-len 512

Note that we do not need to wait for bench_serving to finish; we can analyze memory from the log at any time.

Find the log file, like 20250822_181618_memory_log.txt, then:

python memory_analyzer.py 20250822_181618_memory_log.txt

This produces the memory profile.
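The log is a small CSV. Here is a stdlib-only sketch of reading it, where the column names (timestamp, memory_allocated, memory_reserved) and the MiB units are assumptions taken from the analyzer script shared later in this thread:

```python
import csv
import io


def read_memory_log(text):
    """Parse the assumed memory-log CSV into a list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "timestamp": row["timestamp"],
                "allocated_mib": float(row["memory_allocated"]),
                "reserved_mib": float(row["memory_reserved"]),
            }
        )
    return rows


# Hypothetical two-sample log in the assumed format.
sample = """timestamp,memory_allocated,memory_reserved
2025-08-22 18:16:18,145000.0,146000.0
2025-08-22 18:16:19,145350.5,146000.0
"""
```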

Note that we modified the event_loop_overlap function to call gc.collect() and torch.cuda.empty_cache(). You can disable this manually.
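As a rough illustration of that cleanup hook (not the actual scheduler change; the exact placement inside event_loop_overlap is the PR's choice), something like:

```python
import gc


def release_memory():
    """Reclaim Python garbage and, when torch is present, flush the CUDA cache."""
    collected = gc.collect()  # reclaim unreachable Python objects
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
    except ImportError:
        pass  # torch not installed; nothing to flush
    return collected
```

Calling this on every scheduler iteration trades some latency for a tighter reserved-memory footprint, which is exactly the trade-off being measured below.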

With gc.collect() and torch.cuda.empty_cache() always on, the image processor takes over 30 GB of memory (on B200) during benchmarking to process the 500 requests, each with 3 images. After processing, the image processor's memory is released.

Note that this run is without --max-concurrency 1 during bench serving.

[figure: memory_analysis plot]

Using --max-concurrency 1, we have:

[figure: memory_analysis plot (copy)]

The converged value is steady.

Okay. After removing the gc.collect(), torch.cuda.empty_cache() in the scheduler, the reserved memory indeed leaked:

[figure: memory_analysis plot]

Note that I first ran the benchmark with --max-concurrency 1, then slept for 5 minutes, and ran the benchmarking without --max-concurrency 1.

[figure: memory_analysis plot]


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yhyang201, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've been working on enhancing our benchmarking utility, "bench_serving.py", to support multi-image and text inputs. This pull request introduces a new "random-image" dataset, allowing us to generate synthetic image data with configurable counts and resolutions for more comprehensive benchmarking of multi-modal models.

Highlights

  • New "random-image" Dataset: A dedicated dataset has been added for generating random images alongside text prompts, specifically for benchmarking multi-modal models.
  • Configurable Image Count: A new CLI option, "--random-image-num-images", has been introduced to allow users to specify the number of images to be included per request during benchmarking.
  • Configurable Image Resolution: The "--random-image-resolution" argument now enables selection of image resolutions (1080p, 720p, 360p) for the dynamically generated images.
  • Multi-Image Backend Support: Both the OpenAI Chat and SGLang backends have been updated to correctly process and handle requests that include multiple image inputs, ensuring compatibility with the new dataset.
  • Dynamic Image Generation: A utility has been implemented to generate random image data URIs on the fly, facilitating the creation of diverse image inputs for benchmarking purposes.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a random-image dataset for benchmarking multi-modal models, which is a great addition. The changes correctly update the request handling to support multiple images for both OpenAI Chat and SGLang backends. My review includes a few suggestions for the new sample_random_image_requests function to improve exception handling, adhere to Python's import conventions, and clarify the logic for the number of images generated.

Comment on lines 1151 to 1186
Contributor


Severity: medium

Catching a specific ImportError is better than a generic Exception. This makes the code's intent clearer and avoids masking other potential errors during the import of the PIL library.

Suggested change

Before:

except Exception as e:
    raise ImportError(
        "Please install Pillow to generate random images: pip install pillow"
    ) from e

After:

except ImportError as e:
    raise ImportError(
        "Please install Pillow to generate random images: pip install pillow"
    ) from e
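To see why the narrower except clause matters, here is a small self-contained illustration; require() is a hypothetical helper, not bench_serving code. Catching ImportError re-raises with a helpful hint, while any other failure during import would propagate unmasked:

```python
import importlib


def require(module_name, hint):
    """Import a module, replacing ImportError with a user-facing hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(hint) from e
```

With a bare `except Exception`, a SyntaxError or OSError inside the imported package would be misreported as a missing dependency.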

@yhyang201 yhyang201 changed the title [WIP] Add random-image dataset with configurable image count and resolution in bench_serving.py for benchmarking Add random-image dataset with configurable image count and resolution in bench_serving.py for benchmarking Aug 12, 2025
Collaborator


Why is the seed changed at this point? For the purpose of reproducibility, it ought to be fixed.

Collaborator Author


A common pitfall occurs when bench_serving is run multiple times with the random or random-image dataset. Because the dataset name contains "random," users often assume that new data is generated on each run, and therefore leave the radix tree enabled by default.
In reality, the same seed is used by default, so identical requests are sent each time. This allows the radix tree to cache and accelerate processing, which can skew benchmark results.
For reproducibility, the --seed option can be set manually. This changes only the default random seed and does not alter any other behavior.
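A minimal demonstration of the point: with a fixed seed, "random" payloads are identical across runs, which is exactly what lets a prefix cache hit. make_request_lengths is a hypothetical stand-in for the dataset sampler:

```python
import random


def make_request_lengths(n, seed=1):
    """Sample n request lengths deterministically from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.randint(1, 512) for _ in range(n)]
```

Two runs with the default seed produce byte-identical request streams, so a radix tree sees repeated prefixes rather than fresh data.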

Collaborator

@JustinTong0323 JustinTong0323 left a comment


LGTM, good job!

) from e

# Check for potentially problematic combinations and warn user
if width * height >= 1920 * 1080 and num_images * num_requests >= 100:
Collaborator


The variables width/height appear to be undefined here.

@zhaochenyang20 zhaochenyang20 changed the title Add random-image dataset with configurable image count and resolution in bench_serving.py for benchmarking [WIP] Add random-image dataset with configurable image count and resolution in bench_serving.py for benchmarking Aug 22, 2025
@zhaochenyang20
Collaborator

While doing bench serving, on the server side:

[2025-08-22 18:32:29] Prefill batch. #new-seq: 4, #new-token: 14985, #cached-token: 0, token usage: 0.03, #running-req: 29, #queue-req: 85, 
[2025-08-22 18:32:30] Memory allocated: 152414226944
[2025-08-22 18:32:30] Memory reserved: 153085804544
[2025-08-22 18:34:03] ERROR:    Exception in ASGI application
  + Exception Group Traceback (most recent call last):
  |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 77, in collapse_excgroups
  |     yield
  |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 271, in __call__
  |     async with anyio.create_task_group() as task_group:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/root/.python/sglang/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    |     result = await app(  # type: ignore[func-returns-value]
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    |     return await self.app(scope, receive, send)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/root/.python/sglang/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    |     await super().__call__(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
    |     raise exc
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
    |     await self.app(scope, receive, _send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
    |     await self.app(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     raise exc
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
    |     await route.handle(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
    |     await self.app(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 78, in app
    |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     raise exc
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
    |     await response(scope, receive, send)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 270, in __call__
    |     with collapse_excgroups():
    |          ^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    |     self.gen.throw(value)
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 83, in collapse_excgroups
    |     raise exc
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 274, in wrap
    |     await func()
    |   File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 254, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/root/sglang/python/sglang/srt/entrypoints/openai/serving_chat.py", line 439, in _generate_chat_stream
    |     async for content in self.tokenizer_manager.generate_request(
    |   File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 493, in generate_request
    |     tokenized_obj = await self._tokenize_one_request(obj)
    |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 547, in _tokenize_one_request
    |     mm_inputs: Dict = await self.mm_processor.process_mm_data_async(
    |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/root/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 251, in process_mm_data_async
    |     mm_items, input_ids, ret = self.process_and_combine_mm_data(
    |                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 616, in process_and_combine_mm_data
    |     collected_items, input_ids, ret = self._process_and_collect_mm_items(
    |                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 565, in _process_and_collect_mm_items
    |     ret = self.process_mm_data(
    |           ^^^^^^^^^^^^^^^^^^^^^
    |   File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 236, in process_mm_data
    |     result = processor.__call__(
    |              ^^^^^^^^^^^^^^^^^^^
    |   File "/root/.python/sglang/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    |     num_image_tokens = image_grid_thw[index].prod() // merge_length
    |                        ~~~~~~~~~~~~~~^^^^^^^
    | IndexError: index 3 is out of bounds for dimension 0 with size 3
    +------------------------------------

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.python/sglang/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 78, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
    await response(scope, receive, send)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 270, in __call__
    with collapse_excgroups():
         ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 83, in collapse_excgroups
    raise exc
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 274, in wrap
    await func()
  File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 254, in stream_response
    async for chunk in self.body_iterator:
  File "/root/sglang/python/sglang/srt/entrypoints/openai/serving_chat.py", line 439, in _generate_chat_stream
    async for content in self.tokenizer_manager.generate_request(
  File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 493, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 547, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_processor.process_mm_data_async(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 251, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 616, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 565, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 236, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/root/.python/sglang/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length

@zhaochenyang20
Collaborator

Try to fix this as well:

/root/sglang/python/sglang/bench_serving.py:1210: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)

@zhaochenyang20
Collaborator

Refer to this:

#9365 (comment)

We should split this PR into two commits:

  1. support the new benchmarking
  2. add gc collection in the scheduler

@yhyang201

@zhaochenyang20
Collaborator

  1. Chenyang: remove the memory analyzer and the gc/empty_cache changes, and review Yuhao's code on the benchmark.
  2. Yuhao: create a separate PR to commit gc/empty_cache into the sgl scheduler, using the API from "Support GC Freezing to improve latency & throughput" (#9241).

@zhaochenyang20
Collaborator

zhaochenyang20 commented Aug 26, 2025

This is a code snippet for memory analysis:

Details
#!/usr/bin/env python3
"""
Memory analysis script - reads memory data from a log file and plots usage curves.
Usage: python memory_analyzer.py <log_file_path>
"""

import argparse
import sys
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def parse_memory_log(log_file):
    """Parse the memory log file."""
    try:
        # Read the CSV file
        df = pd.read_csv(log_file)

        # Convert timestamps to datetime objects
        df["timestamp"] = pd.to_datetime(df["timestamp"])

        # Memory values are already in MiB; use them directly
        df["memory_allocated_mb"] = df["memory_allocated"]
        df["memory_reserved_mb"] = df["memory_reserved"]

        return df
    except Exception as e:
        print(f"Error: failed to parse log file {log_file}: {e}")
        return None


def create_memory_plots(df, output_prefix=None):
    """Create memory usage plots."""
    if df is None or df.empty:
        print("Error: no valid data to plot")
        return

    # Font setup (SimHei kept from the original for CJK labels, with a fallback)
    plt.rcParams["font.sans-serif"] = ["SimHei", "DejaVu Sans"]
    plt.rcParams["axes.unicode_minus"] = False

    # Create three subplots
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 10))
    fig.suptitle("GPU Memory Usage Over Time", fontsize=16, fontweight="bold")

    # First plot: Memory Allocated
    ax1.plot(
        df["timestamp"],
        df["memory_allocated_mb"],
        color="blue",
        linewidth=2,
        label="Allocated Memory",
    )
    ax1.set_ylabel("Memory Allocated (MiB)", fontsize=12)
    ax1.grid(True, alpha=0.3)
    ax1.legend()
    ax1.set_title("GPU Memory Allocated Over Time")

    # Second plot: Memory Reserved
    ax2.plot(
        df["timestamp"],
        df["memory_reserved_mb"],
        color="red",
        linewidth=2,
        label="Reserved Memory",
    )
    ax2.set_ylabel("Memory Reserved (MiB)", fontsize=12)
    ax2.grid(True, alpha=0.3)
    ax2.legend()
    ax2.set_title("GPU Memory Reserved Over Time")

    # Third plot: Memory Allocated vs Reserved comparison
    ax3.plot(
        df["timestamp"],
        df["memory_allocated_mb"],
        color="blue",
        linewidth=2,
        label="Allocated",
        alpha=0.8,
    )
    ax3.plot(
        df["timestamp"],
        df["memory_reserved_mb"],
        color="red",
        linewidth=2,
        label="Reserved",
        alpha=0.8,
    )
    ax3.fill_between(
        df["timestamp"], df["memory_allocated_mb"], alpha=0.3, color="blue"
    )
    ax3.fill_between(df["timestamp"], df["memory_reserved_mb"], alpha=0.3, color="red")
    ax3.set_ylabel("Memory Usage (MiB)", fontsize=12)
    ax3.set_xlabel("Time", fontsize=12)
    ax3.grid(True, alpha=0.3)
    ax3.legend()
    ax3.set_title("GPU Memory Allocated vs Reserved Comparison")

    # Format the time axis
    for ax in [ax1, ax2, ax3]:
        ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S"))
        ax.xaxis.set_major_locator(mdates.SecondLocator(interval=30))
        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)

    plt.tight_layout()

    # Save the figure
    if output_prefix:
        output_file = f"{output_prefix}_memory_analysis.png"
    else:
        output_file = "memory_analysis.png"

    plt.savefig(output_file, dpi=300, bbox_inches="tight")
    print(f"Figure saved to: {output_file}")

    # Display the figure
    plt.show()


def print_memory_stats(df):
    """Print memory usage statistics."""
    if df is None or df.empty:
        return

    print("\n=== Memory Usage Statistics ===")
    print(f"Number of records: {len(df)}")
    print(
        f"Monitoring duration: {(df['timestamp'].iloc[-1] - df['timestamp'].iloc[0]).total_seconds():.1f} s"
    )

    print("\nAllocated memory (MiB):")
    print(f"  min:  {df['memory_allocated_mb'].min():.2f}")
    print(f"  max:  {df['memory_allocated_mb'].max():.2f}")
    print(f"  mean: {df['memory_allocated_mb'].mean():.2f}")
    print(f"  std:  {df['memory_allocated_mb'].std():.2f}")

    print("\nReserved memory (MiB):")
    print(f"  min:  {df['memory_reserved_mb'].min():.2f}")
    print(f"  max:  {df['memory_reserved_mb'].max():.2f}")
    print(f"  mean: {df['memory_reserved_mb'].mean():.2f}")
    print(f"  std:  {df['memory_reserved_mb'].std():.2f}")

    # Compute memory utilization
    utilization = (df["memory_allocated_mb"] / df["memory_reserved_mb"]) * 100
    print("\nMemory utilization (%):")
    print(f"  min:  {utilization.min():.2f}")
    print(f"  max:  {utilization.max():.2f}")
    print(f"  mean: {utilization.mean():.2f}")


def main():
    parser = argparse.ArgumentParser(
        description="Analyze a GPU memory log file and plot usage curves"
    )
    parser.add_argument("log_file", help="path to the memory log file")
    parser.add_argument("--output", "-o", help="output image file prefix")
    parser.add_argument("--stats", "-s", action="store_true", help="print statistics")

    args = parser.parse_args()

    if not args.log_file:
        print("Error: please provide a log file path")
        sys.exit(1)

    # Parse the log file
    print(f"Parsing log file: {args.log_file}")
    df = parse_memory_log(args.log_file)

    if df is None:
        sys.exit(1)

    print(f"Successfully read {len(df)} records")

    # Print statistics
    if args.stats:
        print_memory_stats(df)

    # Create plots
    create_memory_plots(df, args.output)


if __name__ == "__main__":
    main()

@zhaochenyang20
Collaborator

Results on B200:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache

python -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name random-image \
    --num-prompts 500 \
    --random-image-num-images 3 \
    --random-image-resolution 720p \
    --random-input-len 512 \
    --random-output-len 512
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     498       
Benchmark duration (s):                  411.47    
Total input tokens:                      132763    
Total generated tokens:                  123381    
Total generated tokens (retokenized):    30426     
Request throughput (req/s):              1.21      
Input token throughput (tok/s):          322.65    
Output token throughput (tok/s):         299.85    
Total token throughput (tok/s):          622.51    
Concurrency:                             491.58    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   406167.27 
Median E2E Latency (ms):                 407130.40 
---------------Time to First Token----------------
Mean TTFT (ms):                          360920.24 
Median TTFT (ms):                        367521.13 
P99 TTFT (ms):                           401069.34 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           747.29    
Median ITL (ms):                         34.15     
P95 ITL (ms):                            273.88    
P99 ITL (ms):                            28534.37  
Max ITL (ms):                            345368.19 
==================================================
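As a quick sanity check on the table above, the throughput rows are just counts divided by the benchmark duration:

```python
# Values taken directly from the Serving Benchmark Result table.
successful_requests = 498
total_input_tokens = 132763
duration_s = 411.47

# Request throughput (req/s) and input token throughput (tok/s)
request_throughput = successful_requests / duration_s      # reported as 1.21
input_tok_throughput = total_input_tokens / duration_s     # reported as 322.65
```

Small discrepancies in the last digit are expected because the reported duration is itself rounded.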

@zhaochenyang20 zhaochenyang20 changed the title [WIP] Add random-image dataset with configurable image count and resolution in bench_serving.py for benchmarking Add random-image dataset with configurable image count, docs for bench_serving.py Aug 26, 2025
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
python3 -m sglang.bench_serving --backend sglang --num-prompt 10

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5
Please refer to https://docs.sglang.ai/developer_guide/bench_serving.html for details.
Collaborator


this doc is 404

Collaborator


this doc is 404

After the merge, this link will be valid. Right now it returns a 404, but the docs are submitted with this PR, so users will see them once it lands.

@zhaochenyang20 zhaochenyang20 changed the title Add random-image dataset with configurable image count, docs for bench_serving.py [Merge 9583 Fisrt] Add random-image dataset with configurable image count, docs for bench_serving.py Aug 26, 2025
@zhyncs zhyncs changed the title [Merge 9583 Fisrt] Add random-image dataset with configurable image count, docs for bench_serving.py Add random-image dataset with configurable image count, docs for bench_serving.py Aug 27, 2025
@zhaochenyang20 zhaochenyang20 changed the title Add random-image dataset with configurable image count, docs for bench_serving.py [docs] Instructions for bench_serving.py Aug 27, 2025
@zhaochenyang20 zhaochenyang20 merged commit a85363c into sgl-project:main Aug 27, 2025
19 of 20 checks passed
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>