[docs] Instructions for bench_serving.py #9071
zhaochenyang20 merged 30 commits into sgl-project:main
Conversation
Summary of Changes
Hello @yhyang201, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've been working on enhancing our benchmarking utility, "bench_serving.py", to support multi-image and text inputs. This pull request introduces a new "random-image" dataset, allowing us to generate synthetic image data with configurable counts and resolutions for more comprehensive benchmarking of multi-modal models.
Highlights
- New "random-image" Dataset: A dedicated dataset has been added for generating random images alongside text prompts, specifically for benchmarking multi-modal models.
- Configurable Image Count: A new CLI option, "--random-image-num-images", has been introduced to allow users to specify the number of images to be included per request during benchmarking.
- Configurable Image Resolution: The "--random-image-resolution" argument now enables selection of image resolutions (1080p, 720p, 360p) for the dynamically generated images.
- Multi-Image Backend Support: Both the OpenAI Chat and SGLang backends have been updated to correctly process and handle requests that include multiple image inputs, ensuring compatibility with the new dataset.
- Dynamic Image Generation: A utility has been implemented to generate random image data URIs on the fly, facilitating the creation of diverse image inputs for benchmarking purposes.
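As a sketch of how such a data-URI generator might look (the helper name `random_image_data_uri` is hypothetical; the actual utility in bench_serving.py may differ in naming and encoding details):

```python
import base64
import io
import os

try:
    from PIL import Image
except ImportError:  # Pillow is optional; fail lazily with a clear message
    Image = None


def random_image_data_uri(width: int, height: int) -> str:
    """Build a data: URI for a random RGB JPEG of the given size.

    Hypothetical helper; bench_serving.py's actual implementation may differ.
    """
    if Image is None:
        raise ImportError(
            "Please install Pillow to generate random images: pip install pillow"
        )
    # Random bytes give an incompressible image, which stresses the
    # multimodal pipeline more than a flat-colored one would.
    img = Image.frombytes("RGB", (width, height), os.urandom(width * height * 3))
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"
```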
Code Review
This pull request introduces a random-image dataset for benchmarking multi-modal models, which is a great addition. The changes correctly update the request handling to support multiple images for both OpenAI Chat and SGLang backends. My review includes a few suggestions for the new sample_random_image_requests function to improve exception handling, adhere to Python's import conventions, and clarify the logic for the number of images generated.
python/sglang/bench_serving.py
Catching a specific ImportError is better than a generic Exception. This makes the code's intent clearer and avoids masking other potential errors during the import of the PIL library.
-    except Exception as e:
+    except ImportError as e:
         raise ImportError(
             "Please install Pillow to generate random images: pip install pillow"
         ) from e
python/sglang/bench_serving.py
Why is the seed changed at this point? For the purpose of reproducibility, it ought to be fixed.
A common pitfall occurs when bench_serving is run multiple times with the random or random-image dataset. Because the dataset name contains "random," users often assume that new data is generated on each run, and therefore leave the radix tree enabled by default.
In reality, the same seed is used by default, so identical requests are sent each time. This lets the radix tree cache and accelerate processing, which can skew benchmark results.
For reproducibility, the --seed option can be set manually. This changes only the default random seed and does not alter any other behavior.
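A minimal illustration of the point, using the plain `random` module (bench_serving.py has its own sampling logic; this only models the seeding behavior):

```python
import random


def sample_lengths(seed: int, n: int = 5, low: int = 256, high: int = 1024):
    # Model of bench_serving's seeded sampling: the "random" dataset is
    # fully deterministic for a given seed.
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]


# Two runs with the default seed send identical requests, so an enabled
# radix cache can serve the repeated prefixes and flatter the benchmark.
assert sample_lengths(0) == sample_lengths(0)
# Overriding --seed gives a different, but still reproducible, workload.
assert sample_lengths(0) != sample_lengths(1)
```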
    ) from e

    # Check for potentially problematic combinations and warn user
    if width * height >= 1920 * 1080 and num_images * num_requests >= 100:
The variables width/height appear to be undefined here.
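One plausible way to define width/height before that check (the helper and its preset table are illustrative assumptions, not the exact code in the PR):

```python
def parse_resolution(name: str) -> tuple[int, int]:
    # Map the --random-image-resolution choices to (width, height).
    presets = {
        "1080p": (1920, 1080),
        "720p": (1280, 720),
        "360p": (640, 360),
    }
    if name not in presets:
        raise ValueError(f"Unsupported resolution: {name}")
    return presets[name]


width, height = parse_resolution("720p")
```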
While doing bench serving, on the server side: [2025-08-22 18:32:29] Prefill batch. #new-seq: 4, #new-token: 14985, #cached-token: 0, token usage: 0.03, #running-req: 29, #queue-req: 85,
[2025-08-22 18:32:30] Memory allocated: 152414226944
[2025-08-22 18:32:30] Memory reserved: 153085804544
[2025-08-22 18:34:03] ERROR: Exception in ASGI application
+ Exception Group Traceback (most recent call last):
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 77, in collapse_excgroups
| yield
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 271, in __call__
| async with anyio.create_task_group() as task_group:
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
| await self.app(scope, receive, _send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
| await self.app(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
| await route.handle(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
| await self.app(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 78, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
| await response(scope, receive, send)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 270, in __call__
| with collapse_excgroups():
| ^^^^^^^^^^^^^^^^^^^^
| File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
| self.gen.throw(value)
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 83, in collapse_excgroups
| raise exc
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 274, in wrap
| await func()
| File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 254, in stream_response
| async for chunk in self.body_iterator:
| File "/root/sglang/python/sglang/srt/entrypoints/openai/serving_chat.py", line 439, in _generate_chat_stream
| async for content in self.tokenizer_manager.generate_request(
| File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 493, in generate_request
| tokenized_obj = await self._tokenize_one_request(obj)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 547, in _tokenize_one_request
| mm_inputs: Dict = await self.mm_processor.process_mm_data_async(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 251, in process_mm_data_async
| mm_items, input_ids, ret = self.process_and_combine_mm_data(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 616, in process_and_combine_mm_data
| collected_items, input_ids, ret = self._process_and_collect_mm_items(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 565, in _process_and_collect_mm_items
| ret = self.process_mm_data(
| ^^^^^^^^^^^^^^^^^^^^^
| File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 236, in process_mm_data
| result = processor.__call__(
| ^^^^^^^^^^^^^^^^^^^
| File "/root/.python/sglang/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
| num_image_tokens = image_grid_thw[index].prod() // merge_length
| ~~~~~~~~~~~~~~^^^^^^^
| IndexError: index 3 is out of bounds for dimension 0 with size 3
+------------------------------------
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.python/sglang/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.python/sglang/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
await route.handle(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
await self.app(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 78, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
await response(scope, receive, send)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 270, in __call__
with collapse_excgroups():
^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
self.gen.throw(value)
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/_utils.py", line 83, in collapse_excgroups
raise exc
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 274, in wrap
await func()
File "/root/.python/sglang/lib/python3.12/site-packages/starlette/responses.py", line 254, in stream_response
async for chunk in self.body_iterator:
File "/root/sglang/python/sglang/srt/entrypoints/openai/serving_chat.py", line 439, in _generate_chat_stream
async for content in self.tokenizer_manager.generate_request(
File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 493, in generate_request
tokenized_obj = await self._tokenize_one_request(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 547, in _tokenize_one_request
mm_inputs: Dict = await self.mm_processor.process_mm_data_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 251, in process_mm_data_async
mm_items, input_ids, ret = self.process_and_combine_mm_data(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 616, in process_and_combine_mm_data
collected_items, input_ids, ret = self._process_and_collect_mm_items(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 565, in _process_and_collect_mm_items
ret = self.process_mm_data(
^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 236, in process_mm_data
result = processor.__call__(
^^^^^^^^^^^^^^^^^^^
File "/root/.python/sglang/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
num_image_tokens = image_grid_thw[index].prod() // merge_length
Please also try to fix this: /root/sglang/python/sglang/bench_serving.py:1210: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
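Assuming the warning comes from passing an explicit `mode=` to `Image.fromarray` (the usual source of this deprecation), the fix is to drop the argument and let Pillow infer the mode from the array:

```python
try:
    import numpy as np
    from PIL import Image
except ImportError:  # both assumed available in the benchmarking environment
    np = Image = None

if Image is not None:
    arr = np.zeros((360, 640, 3), dtype=np.uint8)
    # Deprecated in Pillow 13: Image.fromarray(arr, mode="RGB")
    # For an (H, W, 3) uint8 array the mode is inferred as RGB anyway.
    img = Image.fromarray(arr)
    assert img.mode == "RGB"
```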
Refer to this: we should split this PR into two commits:
This is a code snippet for memory analysis:

#!/usr/bin/env python3
"""
Memory analysis script: read memory data from a log file and plot usage curves.
Usage: python memory_analyzer.py <log_file_path>
"""
import argparse
import sys
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def parse_memory_log(log_file):
    """Parse the memory log file."""
    try:
        # Read the CSV file
        df = pd.read_csv(log_file)
        # Convert timestamps to datetime objects
        df["timestamp"] = pd.to_datetime(df["timestamp"])
        # Memory values are already in MiB, so use them directly
        df["memory_allocated_mb"] = df["memory_allocated"]
        df["memory_reserved_mb"] = df["memory_reserved"]
        return df
    except Exception as e:
        print(f"Error: failed to parse log file {log_file}: {e}")
        return None


def create_memory_plots(df, output_prefix=None):
    """Create the memory usage plots."""
    if df is None or df.empty:
        print("Error: no valid data to plot")
        return
    # Configure fonts (SimHei kept for CJK labels, with a fallback)
    plt.rcParams["font.sans-serif"] = ["SimHei", "DejaVu Sans"]
    plt.rcParams["axes.unicode_minus"] = False
    # Create three subplots
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 10))
    fig.suptitle("GPU Memory Usage Over Time", fontsize=16, fontweight="bold")
    # First plot: memory allocated
    ax1.plot(
        df["timestamp"],
        df["memory_allocated_mb"],
        color="blue",
        linewidth=2,
        label="Allocated Memory",
    )
    ax1.set_ylabel("Memory Allocated (MiB)", fontsize=12)
    ax1.grid(True, alpha=0.3)
    ax1.legend()
    ax1.set_title("GPU Memory Allocated Over Time")
    # Second plot: memory reserved
    ax2.plot(
        df["timestamp"],
        df["memory_reserved_mb"],
        color="red",
        linewidth=2,
        label="Reserved Memory",
    )
    ax2.set_ylabel("Memory Reserved (MiB)", fontsize=12)
    ax2.grid(True, alpha=0.3)
    ax2.legend()
    ax2.set_title("GPU Memory Reserved Over Time")
    # Third plot: allocated vs reserved comparison
    ax3.plot(
        df["timestamp"],
        df["memory_allocated_mb"],
        color="blue",
        linewidth=2,
        label="Allocated",
        alpha=0.8,
    )
    ax3.plot(
        df["timestamp"],
        df["memory_reserved_mb"],
        color="red",
        linewidth=2,
        label="Reserved",
        alpha=0.8,
    )
    ax3.fill_between(
        df["timestamp"], df["memory_allocated_mb"], alpha=0.3, color="blue"
    )
    ax3.fill_between(df["timestamp"], df["memory_reserved_mb"], alpha=0.3, color="red")
    ax3.set_ylabel("Memory Usage (MiB)", fontsize=12)
    ax3.set_xlabel("Time", fontsize=12)
    ax3.grid(True, alpha=0.3)
    ax3.legend()
    ax3.set_title("GPU Memory Allocated vs Reserved Comparison")
    # Format the time axis
    for ax in [ax1, ax2, ax3]:
        ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S"))
        ax.xaxis.set_major_locator(mdates.SecondLocator(interval=30))
        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)
    plt.tight_layout()
    # Save the figure
    if output_prefix:
        output_file = f"{output_prefix}_memory_analysis.png"
    else:
        output_file = "memory_analysis.png"
    plt.savefig(output_file, dpi=300, bbox_inches="tight")
    print(f"Figure saved to: {output_file}")
    # Show the figure
    plt.show()


def print_memory_stats(df):
    """Print memory usage statistics."""
    if df is None or df.empty:
        return
    print("\n=== Memory Usage Statistics ===")
    print(f"Number of records: {len(df)}")
    print(
        f"Monitoring duration: {(df['timestamp'].iloc[-1] - df['timestamp'].iloc[0]).total_seconds():.1f} s"
    )
    print("\nAllocated memory (MiB):")
    print(f"  min:  {df['memory_allocated_mb'].min():.2f}")
    print(f"  max:  {df['memory_allocated_mb'].max():.2f}")
    print(f"  mean: {df['memory_allocated_mb'].mean():.2f}")
    print(f"  std:  {df['memory_allocated_mb'].std():.2f}")
    print("\nReserved memory (MiB):")
    print(f"  min:  {df['memory_reserved_mb'].min():.2f}")
    print(f"  max:  {df['memory_reserved_mb'].max():.2f}")
    print(f"  mean: {df['memory_reserved_mb'].mean():.2f}")
    print(f"  std:  {df['memory_reserved_mb'].std():.2f}")
    # Compute memory utilization
    utilization = (df["memory_allocated_mb"] / df["memory_reserved_mb"]) * 100
    print("\nMemory utilization (%):")
    print(f"  min:  {utilization.min():.2f}")
    print(f"  max:  {utilization.max():.2f}")
    print(f"  mean: {utilization.mean():.2f}")


def main():
    parser = argparse.ArgumentParser(
        description="Analyze a GPU memory usage log file and plot usage curves"
    )
    parser.add_argument("log_file", help="path to the memory log file")
    parser.add_argument("--output", "-o", help="output image file prefix")
    parser.add_argument("--stats", "-s", action="store_true", help="print statistics")
    args = parser.parse_args()
    if not args.log_file:
        print("Error: please provide a log file path")
        sys.exit(1)
    # Parse the log file
    print(f"Parsing log file: {args.log_file}")
    df = parse_memory_log(args.log_file)
    if df is None:
        sys.exit(1)
    print(f"Successfully read {len(df)} records")
    # Print statistics
    if args.stats:
        print_memory_stats(df)
    # Create the plots
    create_memory_plots(df, args.output)


if __name__ == "__main__":
    main()
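The analyzer above assumes a CSV log with timestamp, memory_allocated, and memory_reserved columns (values in MiB). A sketch of the expected file format, using only the standard library (the sample values are illustrative):

```python
import csv
import io

# Example rows in the format parse_memory_log() expects.
log = io.StringIO()
writer = csv.writer(log)
writer.writerow(["timestamp", "memory_allocated", "memory_reserved"])
writer.writerow(["2025-08-22 18:32:30", 145338.5, 145978.9])
writer.writerow(["2025-08-22 18:33:00", 146102.2, 146620.0])

log.seek(0)
rows = list(csv.DictReader(log))
assert rows[0]["memory_allocated"] == "145338.5"
```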
Results on B200:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
python -m sglang.bench_serving \
--backend sglang-oai-chat \
--dataset-name random-image \
--num-prompts 500 \
--random-image-num-images 3 \
--random-image-resolution 720p \
--random-input-len 512 \
--random-output-len 512

============ Serving Benchmark Result ============
Backend: sglang-oai-chat
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 498
Benchmark duration (s): 411.47
Total input tokens: 132763
Total generated tokens: 123381
Total generated tokens (retokenized): 30426
Request throughput (req/s): 1.21
Input token throughput (tok/s): 322.65
Output token throughput (tok/s): 299.85
Total token throughput (tok/s): 622.51
Concurrency: 491.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 406167.27
Median E2E Latency (ms): 407130.40
---------------Time to First Token----------------
Mean TTFT (ms): 360920.24
Median TTFT (ms): 367521.13
P99 TTFT (ms): 401069.34
---------------Inter-Token Latency----------------
Mean ITL (ms): 747.29
Median ITL (ms): 34.15
P95 ITL (ms): 273.88
P99 ITL (ms): 28534.37
Max ITL (ms): 345368.19
==================================================
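As a quick sanity check, the headline throughput numbers above are consistent with the raw request and token counts:

```python
# Cross-check the reported throughputs against the raw counts above.
requests, duration = 498, 411.47
input_toks, output_toks = 132763, 123381

assert round(requests / duration, 2) == 1.21                      # req/s
assert round(output_toks / duration, 2) == 299.85                 # output tok/s
assert round((input_toks + output_toks) / duration, 2) == 622.51  # total tok/s
```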
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
python/sglang/bench_serving.py
python3 -m sglang.bench_serving --backend sglang --num-prompt 10

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5
Please refer to https://docs.sglang.ai/developer_guide/bench_serving.html for details.
this doc is 404
After the merge, this link will be valid. It currently returns a 404 because the docs are submitted as part of this PR; once merged, users will see the new docs.
Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
Motivation
This is a temporary PR to evaluate the memory leak and provide better benchmarking for multi-modal input.
Modifications
Example
Launch server:

Launch benchmarking. Note that num-prompts will first collect 500 requests, then send them at once. Using a large num-prompts would increase the picture creation time. Note that we do not need to wait for bench_serving to end; we can always use the log to analyse the memory. Find the log file, like 20250822_181618_memory_log.txt, then we can get the mem profiler.

Note that we modified the event_loop_overlap function to let it do gc.collect(), torch.cuda.empty_cache(). You can manually disable it.

With gc.collect(), torch.cuda.empty_cache() always turned on, during the benchmarking the image processor takes over 30 GB of memory (on B200) to process the 500 requests, each of which has 3 images. After processing, the image processor's memory is released. Note this is without --max-concurrency 1 during the bench serving period.

Using --max-concurrency 1, we have: the converged value is steady.
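The temporary modification, calling gc.collect() and torch.cuda.empty_cache() from the scheduler loop, can be sketched as follows (the function name is illustrative; the real change lives in sglang's event_loop_overlap):

```python
import gc

try:
    import torch
except ImportError:  # torch is assumed present on the serving host
    torch = None


def maybe_release_memory() -> None:
    """Periodic cleanup, as temporarily wired into event_loop_overlap."""
    gc.collect()  # reclaim Python-level cycles (e.g. image processor buffers)
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver


maybe_release_memory()  # safe to call even without a GPU
```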
Okay. After removing the gc.collect(), torch.cuda.empty_cache() in the scheduler, the reserved memory indeed leaked. Note that I first ran the benchmark with --max-concurrency 1, then slept for 5 minutes, and ran the benchmarking without --max-concurrency 1.

Accuracy Tests
Benchmarking and Profiling
Checklist