
[Test] add stability test case for wan2.2, qwen-tts, qwen3-omni and qwen-image model and modified conftest.py in test/dfx/#2817

Merged
hsliuustc0106 merged 39 commits into
vllm-project:main from
zhumingjue138:main-longterm-wan22
Apr 23, 2026

Conversation

@zhumingjue138
Contributor

@zhumingjue138 zhumingjue138 commented Apr 15, 2026


Purpose

Add stability test cases for the Wan2.2, Qwen3-TTS, Qwen3-Omni, and Qwen-Image models, and modify conftest.py under tests/dfx/.

Test Plan

1. Modify conftest.py under tests/dfx/.

2. Split tests/dfx/stability/scripts/test_benchmark_stability.py by model name into tests/dfx/stability/scripts/test_stability_qwen3_omni.py and tests/dfx/stability/scripts/test_stability_wan22.py.

pytest -s -v tests/dfx/perf/scripts/run_benchmark.py
pytest -s -v tests/dfx/stability/scripts/test_stability_qwen3_omni.py
pytest -s -v tests/dfx/stability/scripts/test_stability_wan22.py
pytest -s -v tests/dfx/stability/scripts/test_stability_qwen_image.py
pytest -s -v tests/dfx/stability/scripts/test_stability_qwen3_tts.py

Test Result

pytest -s -v tests/dfx/perf/scripts/run_benchmark.py
image

pytest -s -v tests/dfx/stability/scripts/test_stability_qwen3_omni.py

[
    {
        "test_name": "test_qwen3_omni_stability_async_chunk",
        "server_params": {
            "model": "/home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct",
            "stage_overrides": {
                "2": {
                    "max_num_batched_tokens": 1000000
                }
            },
            "extra_cli_args": ["--async-chunk"]
        },
        "benchmark_params": [
            {
                "dataset_name": "random-mm",
                "backend": "openai-chat-omni",
                "endpoint": "/v1/chat/completions",
                "duration_sec": 200,
                "request_rate": 0.3,
                "num_prompts_per_batch": 10,
                "random_input_len": {
                    "min": 0,
                    "max": 8000
                },
                "random_output_len": {
                    "min": 0,
                    "max": 1000
                },
                "random_range_ratio": 0.0,
                "random_mm_base_items_per_request": {
                    "min": 0,
                    "max": 6
                },
                "random_mm_num_mm_items_range_ratio": 0.0,
                "random_mm_limit_mm_per_prompt": {
                    "image": 2,
                    "video": 2,
                    "audio": 2
                },
                "random_mm_bucket_config": {
                    "(128-1024, 128-1024, 1)": 0.34,
                    "(256-1080, 256-1920, 2-16)": 0.33,
                    "(0, 1-60, 1-3)": 0.33
                },
                "ignore_eos": true,
                "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration"
            }
        ]
    }
]
image

pytest -s -v tests/dfx/stability/scripts/test_stability_qwen_image.py

[
    {
        "test_name": "test_qwen_image_stability",
        "server_params": {
            "model": "/nvme1n1p1/models/Qwen/Qwen-Image"
        },
        "benchmark_params": [
            {
                "dataset": "random",
                "task": "t2i",
                "backend": "vllm-omni",
                "duration_sec": 200,
                "max_concurrency": 1,
                "num_prompts_per_batch": 2,
                "width": {
                    "min": 512,
                    "max": 2048
                },
                "height": {
                    "min": 512,
                    "max": 2048
                },
                "num_inference_steps": 50,
                "enable_negative_prompt": true
            }
        ]
    }
]
image

pytest -s -v tests/dfx/stability/scripts/test_stability_qwen3_tts.py

[
    {
        "test_name": "test_qwen3_tts_stability",
        "server_params": {
            "model": "/nvme1n1p1/models/Qwen3-TTS-12Hz-1.7B-CustomVoice"
        },
        "benchmark_params": [
            {
                "dataset_name": "random",
                "backend": "openai-audio-speech",
                "endpoint": "/v1/audio/speech",
                "duration_sec": 120,
                "request_rate": 0.3,
                "num_prompts_per_batch": 3,
                "random_input_len": {
                    "min": 0,
                    "max": 1000
                },
                "random_output_len": {
                    "min": 0,
                    "max": 1000
                },
                "random_range_ratio": 0.0,
                "extra_body": {
                    "voice": "Vivian",
                    "language": "English"
                },
                "ignore_eos": true,
                "percentile-metrics": "ttft,e2el,audio_rtf,audio_ttfp,audio_duration"
            }
        ]
    }
]
image

pytest -s -v tests/dfx/stability/scripts/test_stability_wan22.py

[
    {
        "test_name": "test_wan22_t2v_stability_v1_videos",
        "server_params": {
            "model": "/home/models/Wan-AI/Wan2.2-T2V-A14B-Diffusers",
            "serve_args": {
                "ulysses-degree": 1,
                "vae-patch-parallel-size": 2,
                "cfg-parallel-size": 1,
                "tensor-parallel-size": 1,
                "use-hsdp": true,
                "hsdp_shard_size": 2,
                "hsdp_replicate_size": 1,
                "vae-use-slicing": true,
                "vae-use-tiling": true
            }
        },
        "benchmark_params": [
            {
                "dataset": "random",
                "task": "t2v",
                "backend": "v1/videos",
                "duration_sec": 300,
                "max_concurrency": 1,
                "num_prompts_per_batch": 3,
                "enable_negative_prompt": true,
                "random_request_config": [
                    {"width": 854, "height": 480, "num_inference_steps": 3, "num_frames": 80, "fps": 16, "weight": 0.65},
                    {"width": 854, "height": 480, "num_inference_steps": 4, "num_frames": 120, "fps": 24, "weight": 0.25},
                    {"width": 1280, "height": 720, "num_inference_steps": 6, "num_frames": 80, "fps": 16, "weight": 0.1}
                ]
            }
        ]
    }
]
image
================= Serving Benchmark Result =================
Backend:                                 v1/videos
Model:                                   /home/models/Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset:                                 random
Task:                                    t2v
--------------------------------------------------
Benchmark duration (s):                  170.36
Request rate:                            inf
Max request concurrency:                 1
Successful requests:                     3/3
--------------------------------------------------
Request throughput (req/s):              0.02
Latency Mean (s):                        56.7875
Latency Median (s):                      56.1269
Latency P99 (s):                         58.0920
Latency P95 (s):                         57.9316
--------------------------------------------------
Peak Memory Max (MB):                    51132.00
Peak Memory Mean (MB):                   50554.00
Peak Memory Median (MB):                 51040.00

============================================================
Metrics saved to /tmp/stability_diffusion_q30pi7rf.json
================= Serving Benchmark Result =================
Backend:                                 v1/videos
Model:                                   /home/models/Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset:                                 random
Task:                                    t2v
--------------------------------------------------
Benchmark duration (s):                  173.02
Request rate:                            inf
Max request concurrency:                 1
Successful requests:                     3/3
--------------------------------------------------
Request throughput (req/s):              0.02
Latency Mean (s):                        57.6723
Latency Median (s):                      58.1305
Latency P99 (s):                         58.2935
Latency P95 (s):                         58.2802
--------------------------------------------------
Peak Memory Max (MB):                    51186.00
Peak Memory Mean (MB):                   50572.00
Peak Memory Median (MB):                 51040.00

============================================================
Metrics saved to /tmp/stability_diffusion_wj55pjs7.json
============ Stability Benchmark Summary ============
Successful requests:                     6
Failed requests:                         0
Total duration (s):                      352.24
==================================================
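
The two batch reports and the summary above (3 + 3 = 6 successful requests, 352.24 s total against duration_sec = 300) are consistent with a loop that keeps starting batches until the time budget is spent and lets the batch in flight finish. A minimal sketch of that pattern follows. The names `run_until_deadline`, `run_batch`, and `clock` are hypothetical and chosen for illustration; they are not the helpers added in this PR.

```python
import time
from typing import Callable, Dict, Tuple


def run_until_deadline(
    run_batch: Callable[[], Tuple[int, int]],
    duration_sec: float,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, float]:
    """Repeat run_batch() until duration_sec has elapsed, then summarize.

    run_batch is assumed to return (successful_requests, failed_requests).
    """
    start = clock()
    successful = failed = batches = 0
    # A new batch starts only while time remains. A batch that is already
    # running is allowed to finish, so the total can exceed duration_sec.
    while clock() - start < duration_sec:
        ok, bad = run_batch()
        successful += ok
        failed += bad
        batches += 1
    return {
        "successful": successful,
        "failed": failed,
        "batches": batches,
        "total_duration_sec": clock() - start,
    }
```

Passing the clock as a parameter keeps the loop testable without waiting for real time to pass.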

nightly CI
image

24h test:

============ Stability Benchmark Summary ============
Successful requests:                     1752      
Failed requests:                         68        
Total duration (s):                      93786.44  
==================================================

related issue: #2928
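
For reference, the 24h summary above can be reduced to a failure rate and a wall-clock duration in hours. This is plain arithmetic on the three reported numbers; the `summarize` helper is a hypothetical name used only for this illustration.

```python
from typing import Dict


def summarize(successful: int, failed: int, total_sec: float) -> Dict[str, float]:
    """Derive simple ratios from a stability summary."""
    total = successful + failed
    return {
        "total_requests": total,
        # Share of requests that failed, as a percentage.
        "failure_rate_pct": 100.0 * failed / total,
        # Wall-clock duration of the run in hours.
        "duration_hours": total_sec / 3600.0,
    }


stats = summarize(successful=1752, failed=68, total_sec=93786.44)
print(f"{stats['total_requests']} requests, "
      f"{stats['failure_rate_pct']:.2f}% failed, "
      f"{stats['duration_hours']:.2f} h")
# prints "1820 requests, 3.74% failed, 26.05 h"
```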

12h test:

[
    {
        "test_name": "test_wan22_t2v_stability_v1_videos",
        "server_params": {
            "model": "/nvme1n1p1/models/Wan2.2-I2V-A14B-Diffusers/snapshots/596658fd9ca6b7b71d5057529bbf319ecbc61d74",
            "serve_args": {
                "ulysses-degree": 2,
                "vae-patch-parallel-size": 2,
                "tensor-parallel-size": 1,
                "use-hsdp": true,
                "vae-use-slicing": true,
                "vae-use-tiling": true
            }
        },
        "benchmark_params": [
            {
                "dataset": "random",
                "task": "i2v",
                "backend": "v1/videos",
                "duration_sec": 43200,
                "max_concurrency": 1,
                "num_prompts_per_batch": 10,
                "enable_negative_prompt": true,
                "random_request_config": [
                    {"width": 832, "height": 480, "num_inference_steps": 2, "num_frames": 81, "fps": 16, "weight": 0.5},
                    {"width": 1280, "height": 720, "num_inference_steps": 2, "num_frames": 121, "fps": 16, "weight": 0.5}
                ]
            }
        ]
    }
]

result
============ Stability Benchmark Summary ============
Successful requests:                     490
Failed requests:                         0
Total duration (s):                      43387.18
==================================================
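
The `weight` fields in `random_request_config` suggest that each request shape is drawn with probability proportional to its weight. A minimal sketch of such a draw, assuming that interpretation, is shown here. `pick_request_config` is a hypothetical name, and the sampler in this PR may work differently.

```python
import random
from typing import Any, Dict, List, Optional


def pick_request_config(
    configs: List[Dict[str, Any]],
    rng: Optional[random.Random] = None,
) -> Dict[str, Any]:
    """Pick one entry with probability proportional to its "weight".

    Entries without a "weight" key are assumed to have weight 1.0.
    The returned dict omits "weight" so it can be used as request parameters.
    """
    rng = rng or random.Random()
    weights = [float(c.get("weight", 1.0)) for c in configs]
    chosen = rng.choices(configs, weights=weights, k=1)[0]
    return {key: value for key, value in chosen.items() if key != "weight"}


# The two shapes from the 12h config above.
shapes = [
    {"width": 832, "height": 480, "num_inference_steps": 2,
     "num_frames": 81, "fps": 16, "weight": 0.5},
    {"width": 1280, "height": 720, "num_inference_steps": 2,
     "num_frames": 121, "fps": 16, "weight": 0.5},
]
request = pick_request_config(shapes, random.Random(0))
```

Using a seeded `random.Random` makes a long run reproducible, which helps when a failure has to be replayed.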

another 12h test

============ Stability Benchmark Summary ============
Successful requests:                     440
Failed requests:                         0
Total duration (s):                      40191.27
==================================================



…ions and scripts

- Introduced new stability test scripts for Qwen3-Omni and Wan2.2 models, including `test_stability_qwen3_omni.py` and `test_stability_wan22.py`.
- Added corresponding JSON configuration files for both models to define benchmark parameters.
- Updated existing documentation to reflect changes in stability testing configurations and methods.
- Enhanced the `conftest.py` files to support new test structures and parameters.

These additions aim to improve the stability testing framework and provide comprehensive benchmarks for the new models.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…mands and benchmark execution

- Added L5 stability testing commands for Qwen3-Omni and Wan2.2 models in the test guide.
- Introduced a new `run_benchmark` function in `conftest.py` to streamline benchmark execution and result handling.
- Refactored existing stability test scripts to utilize the new benchmark execution method, improving code organization and maintainability.

These updates aim to enhance the stability testing capabilities and provide clearer guidance for executing benchmarks.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>

@yenuo26 added the nightly-test label (triggers the Buildkite nightly test CI) on Apr 15, 2026
@hsliuustc0106
Collaborator

The test_wan22.json config only runs 3 prompts (num_prompts_per_batch=3) over 300 seconds with max_concurrency=1. Is this sufficient for a stability test? Consider increasing num_prompts_per_batch to get better coverage.

Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>
…string in conftest.py

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…server params creation and update OmniServer fixture to accommodate stage config paths.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
… improve performance stability.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
@yenuo26 added the omni-test label (triggers the Buildkite omni model test in nightly CI) and the ready label (triggers Buildkite CI), and removed the nightly-test and omni-test labels, on Apr 16, 2026
…s; introduce serve_args support for OmniServer fixture and streamline unique server params creation.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…iptions

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
… gracefully. Updated paths to use 'deploy' directory instead of 'stage_configs'.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…bility_qwen3_omni.py to allow for longer initialization periods.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…e stability test functionality.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…t.py and update test configurations

- Introduced functions to sample integer values from specified ranges and to handle bucket key sampling.
- Updated benchmark parameters in test_qwen3_omni.json to use range specifications for input and output lengths, and adjusted request rates.
- Changed dataset names from "random" to "random-mm" for clarity.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…pecifications for bucket keys

- Modified the bucket configuration from "(0, 60, 3)" to "(0, 1-60, 1-3)" for improved clarity and consistency with recent changes in sampling functionality.
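
The range specifications mentioned in these commits appear in two forms in the configs: `{"min": a, "max": b}` objects and bucket keys such as "(0, 1-60, 1-3)". A minimal sketch of how both could be turned into concrete integers follows. It assumes uniform sampling with inclusive bounds and non-negative values; `sample_int` and `sample_bucket_key` are hypothetical names, not the functions added to conftest.py.

```python
import random
from collections.abc import Mapping
from typing import Tuple, Union

RangeSpec = Union[int, Mapping]


def sample_int(spec: RangeSpec, rng: random.Random) -> int:
    """Return a fixed int as-is, or draw from {"min": a, "max": b} inclusively."""
    if isinstance(spec, Mapping):
        return rng.randint(int(spec["min"]), int(spec["max"]))
    return int(spec)


def sample_bucket_key(key: str, rng: random.Random) -> Tuple[int, ...]:
    """Turn a key like "(0, 1-60, 1-3)" into a concrete tuple of ints.

    Each "a-b" field is sampled inclusively; plain numbers are kept.
    Negative numbers are not supported by this sketch.
    """
    values = []
    for field in key.strip().strip("()").split(","):
        field = field.strip()
        if "-" in field:
            low, high = field.split("-", 1)
            values.append(rng.randint(int(low), int(high)))
        else:
            values.append(int(field))
    return tuple(values)


rng = random.Random(0)
input_len = sample_int({"min": 0, "max": 8000}, rng)
bucket = sample_bucket_key("(0, 1-60, 1-3)", rng)
```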

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…y testing adjustments

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…event ffmpeg encoding failures

- Added logic to ensure that height and width are even numbers when the number of frames is greater than one, addressing potential encoding/decoding issues.
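
The even-dimension rule described above can be sketched as follows. Common video encoders with 4:2:0 chroma subsampling reject odd frame sizes, which is the likely reason for the ffmpeg failures. This sketch rounds odd values down; whether the PR rounds up or down is not stated, and `make_even_for_video` is a hypothetical name.

```python
from typing import Tuple


def make_even_for_video(width: int, height: int, num_frames: int) -> Tuple[int, int]:
    """Round width and height down to even numbers when producing video.

    Single-frame outputs (images) are left unchanged.
    """
    if num_frames > 1:
        width -= width % 2    # 853 -> 852, 854 stays 854
        height -= height % 2  # 481 -> 480, 480 stays 480
    return width, height
```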

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
…onfigurations

- Introduced new test scripts for Qwen-Image and Qwen3-TTS stability benchmarks, utilizing parameterized test cases to handle various server configurations and benchmark parameters.
- Updated the `_sample_stability_batch_params` function in `conftest.py` to include additional fields for width and height, enhancing the sampling capabilities for stability tests.
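
As an illustration of the parameterized approach, the JSON files shown in the PR description (a list of entries with `test_name`, `server_params`, and `benchmark_params`) could be flattened into one case per benchmark entry. This is a sketch under that assumption; `parse_stability_cases` is a hypothetical name and the real scripts may build their cases differently.

```python
import json
from typing import Any, Dict, List, Tuple

Case = Tuple[str, Dict[str, Any], Dict[str, Any]]


def parse_stability_cases(config_text: str) -> List[Case]:
    """Flatten a stability config into (test_name, server_params, benchmark) cases."""
    cases: List[Case] = []
    for entry in json.loads(config_text):
        # One case per benchmark entry, sharing the entry's server settings.
        for bench in entry["benchmark_params"]:
            cases.append((entry["test_name"], entry["server_params"], bench))
    return cases
```

A list built this way could be passed to `pytest.mark.parametrize("test_name,server_params,benchmark_params", cases)` so that each benchmark entry runs as its own test.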

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
- Introduced JSON test files for Qwen-Image and Qwen3-TTS, defining server and benchmark parameters for stability testing.
- Each test includes detailed configurations such as model specifications, dataset names, and various performance metrics.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
- Removed redundant parameters and adjusted the random_range_ratio to 0.0 for improved stability testing.
- Updated random_mm_bucket_config to use range specifications for clarity.
- Cleaned up the JSON structure by eliminating unnecessary dataset entries.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
@zhumingjue138 changed the title from "[Test] add stability test case for wan2.2 model and modified conftest.py in test/dfx/" to "[Test] add stability test case for wan2.2, qwen-tts, qwen3-omni and qwen-image model and modified conftest.py in test/dfx/" on Apr 20, 2026
Comment thread tests/dfx/stability/conftest.py Outdated
os.environ.pop("BENCHMARK_DIR")


def _run_one_diffusion_batch(
Collaborator

I think these can be moved to helpers.py.

Contributor Author

done

- Moved benchmark helper functions from `conftest.py` to `helpers.py` for better organization and clarity.
- Updated test scripts to import benchmark functions from `helpers.py`, ensuring a cleaner structure.
- Enhanced the documentation in `conftest.py` to reflect the new organization of helper functions.

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
@zhumingjue138
Contributor Author

@Gaohan123 @hsliuustc0106 this pr is ready, can it be merged?

@hsliuustc0106 hsliuustc0106 merged commit 47edee1 into vllm-project:main Apr 23, 2026
7 of 8 checks passed
hongzhi-gao pushed a commit to hongzhi-gao/vllm-omni that referenced this pull request Apr 23, 2026
…wen-image model and modified conftest.py in test/dfx/ (vllm-project#2817)

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>
Signed-off-by: hongzhigao <761417898@qq.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…wen-image model and modified conftest.py in test/dfx/ (vllm-project#2817)

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…wen-image model and modified conftest.py in test/dfx/ (vllm-project#2817)

Signed-off-by: zhumingjue <zhumingjue@huawei.com>
Signed-off-by: zhumingjue138 <zhumingjue@huawei.com>
Labels

ready label to trigger buildkite CI

3 participants