diff --git a/docs/contributing/ci/CI_5levels.md b/docs/contributing/ci/CI_5levels.md index b78d3a0efb1..81392b201da 100644 --- a/docs/contributing/ci/CI_5levels.md +++ b/docs/contributing/ci/CI_5levels.md @@ -548,103 +548,19 @@ L4 level testing is a comprehensive quality audit before a version release. It e - ***Execution Environment***: ***GPU*** server clusters to meet the resource demands of performance testing. - ***Script Example***: -???+ example "Test Examples" - - When adding L4-level ***documentation example Tests***, please pay attention to the following guides. - - --8<-- "docs/contributing/ci/test_examples/doc_example_tests.inc.md" - - When you want to add L4-level ***performance test*** cases, you can refer to the following format for case addition in tests/dfx/perf/tests/test.json: - - ```JSON - { - "test_name": "test_qwen3_omni", - "server_params": { - "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct", - "stage_config_name": "qwen3_omni.yaml" - }, - "benchmark_params": [ - { - "dataset_name": "random", - "num_prompts": [10, 20], - "max_concurrency": [1, 4], - "random_input_len": 2500, - "random_output_len": 900, - "ignore_eos": true, - "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration", - "baseline": { - "mean_ttft_ms": [500, 800], - "mean_audio_ttfp_ms": [2000, 3500], - "mean_audio_rtf": [0.25, 0.35] - } - } - ] - } - ``` - - **Parameter Explanation** - - *Overview* - - | Field | Required | Description | - | ---------------- | -------- | --------------------------------------------------------------- | - | test_name | Yes | Unique identifier for the test case | - | server_params | Yes | Server-side configuration parameters | - | benchmark_params | Yes | Benchmark running parameters (supports multiple configurations) | - - **server_params Configuration** +??? 
example "Test Examples: Documentation Example Tests" - *Basic Parameters* - - | Parameter | Required | Example | Description | - | ----------------- | -------- | ---------------------------------- | ----------------------------- | - | model | Yes | "Qwen/Qwen3-Omni-30B-A3B-Instruct" | Model name or path | - | stage_config_name | Yes | "qwen3_omni.yaml" | Stage configuration file name | - - *Dynamic Configuration (update/delete)* - - Supports incremental modifications based on the basic configuration: - - | Operation | Description | - | --------- | ------------------------------------ | - | update | Update or add configuration items | - | delete | Delete specified configuration items | - - ***Example***: - - ``` - "update": { - "async_chunk": true, // Enable asynchronous chunk processing - "stage_args": { - "0": { - "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk" - } - } - }, - "delete": { - "stage_args": { - "2": ["custom_process_input_func"] // Delete this configuration for stage 2 - } - } - ``` + --8<-- "docs/contributing/ci/test_examples/l4_doc_example_tests.inc.md" - **benchmark_params Configuration** +??? example "Test Examples: Performance Tests" - You can add any benchmark running parameters you need here. For all optional parameters, refer to the [benchmark documentation](https://github.com/vllm-project/vllm-omni/blob/main/docs/cli/bench/serve.md). General modifications are as follows: + --8<-- "docs/contributing/ci/test_examples/l4_performance_tests.inc.md" - 1. Change the ---xxx-xx-xx running parameters to xxx_xx_xx format and fill them as keys in the JSON file. - 2. For boolean variables in the running parameters, modify them to forms such as ignore_eos: true/false and fill them into the JSON file. - 3. Optionally add a `baseline` object (see **Baseline thresholds** below). 
If you omit `baseline` or leave it empty, the performance test still runs but does not assert metric thresholds from this field. - 4. The qps and concurrency modes are mutually exclusive. For detailed explanations, see the table below: +??? example "Test Examples: Functionality Tests" - | Parameter | Type | Required | Example/Values | Description | - | --------------- | ----------- | -------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | - | num_prompts | int / array | Yes | 10,[10, 20, 30] | Number of requests. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of qps or max_concurrency, e.g., [10,10,10]. If an array is used, its length must match the number of qps or max_concurrency. | - | request_rate | int / array | No | 1, [1, 2, 3] | Queries per second. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of num_prompts, e.g., [1,1,1]. If an array is used, its length must match the number of num_prompts. | - | max_concurrency | int / array | No | 1, [1, 2, 3] | Maximum concurrent in-flight requests. Same array / expansion rules as `request_rate` (mutually exclusive with QPS mode). | - | baseline | object | No | see above | Optional per-metric thresholds; keys must match benchmark output fields. Scalar, list (per sweep step), or object (keyed by concurrency or QPS string). | + --8<-- "docs/contributing/ci/test_examples/l4_functionality_tests.inc.md" - - - ***Run Command***: (Specific commands would depend on the performance testing tool and configuration defined in `nightly.json`). 
+- ***Run Command***: (Specific commands would depend on the performance testing tool and configuration defined in `nightly.json`). ## Chapter 4: L5 Level Testing - Stability and Reliability Testing diff --git a/docs/contributing/ci/test_examples/doc_example_tests.inc.md b/docs/contributing/ci/test_examples/l4_doc_example_tests.inc.md similarity index 100% rename from docs/contributing/ci/test_examples/doc_example_tests.inc.md rename to docs/contributing/ci/test_examples/l4_doc_example_tests.inc.md diff --git a/docs/contributing/ci/test_examples/l4_functionality_tests.inc.md b/docs/contributing/ci/test_examples/l4_functionality_tests.inc.md new file mode 100644 index 00000000000..69d6ad82871 --- /dev/null +++ b/docs/contributing/ci/test_examples/l4_functionality_tests.inc.md @@ -0,0 +1,46 @@ +**Scope** + +For diffusion models, the L4 functionality test covers all or common *diffusion features* that are supported by this model, including several [parallelism acceleration methods](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/parallelism_acceleration/), [CPU offloading](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/cpu_offload_diffusion/), [TeaCache](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/teacache/) and [Cache-DiT](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/cache_dit_acceleration/) cache backends, [quantization methods](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/quantization/overview/). + +**Test Case Design** + +For a *high priority* model (currently listed in [issue #1832](https://github.com/vllm-project/vllm-omni/issues/1832)), we use several test cases, each with multiple features turned on, so that each supported feature is tested in at least one test case. This is to relieve the GPU workload on the CI machine. 
The suggested test case combination is as follows: + +- If the model can fit into 4 L4 GPUs (20GB RAM each), with quantization and tensor parallel always on + - (1 GPU) TeaCache + Layerwise CPU offloading + GGUF + - (4 GPUs) Cache-DiT + Ulysses=2 + TP=2 + VAE=2 + FP8 + - (4 GPUs) Cache-DiT + Ring=2 + HSDP=2 + VAE=2 + GGUF + - (4 GPUs) TeaCache + CFG=2 + TP=2 + VAE=2 + FP8 +- Otherwise, consider a 2 H100 GPU environment (80GB RAM each) with the following tests + - (1 GPU) TeaCache + Layerwise CPU offloading + GGUF + - (2 GPUs) Cache-DiT + Ulysses=2 + FP8 + - (2 GPUs) Cache-DiT + Ring=2 + GGUF + - (2 GPUs) TeaCache + CFG=2 + FP8 + - (2 GPUs) Cache-DiT + TP=2 + VAE=2 + FP8 + - (2 GPUs) Cache-DiT + HSDP=2 + VAE=2 + GGUF +- If 2 H100 GPUs cannot handle the model either (e.g., HunyuanImage 3.0) + - Still design tests and feature combinations that best fit real-world scenarios. + - Do not include them in CI (or exclude them from the CI's filtering criteria). Instead, relevant PR authors are encouraged to run these tests locally. +- Fallback plan + - If the model does not support layerwise CPU offloading, use module-wise offloading in the corresponding test case instead. + - If the model supports only specific caching features, use those in all test cases; if it supports none, remove the caching option from all test cases. + - If the model supports only specific quantization features, use those in all test cases; if it supports none, remove the quantization option from all test cases. + - If the model does not support some other feature, remove that option from the affected test case. If the coverage of the modified test case then completely overlaps with the others, remove that test case. + +For a *normal priority* model, further reduce the number of test cases. + +- Only write one or two test cases for the most common feature combinations. +- The author can experiment to find which feature combination best balances output quality and performance. 
Alternatively, the author can refer to any example code in the PR that adds the model, or the example code in a PR that adds a feature (if that code involves the model in question). + +Currently all the features are available in online serving mode. Hence, you only need to add `tests/e2e/online_serving/test_{model}_expansion.py`. + +**Code Style** + +- Validation: test that the multimodal output files of your model have the correct shapes. `OpenAIClientHandler.send_diffusion_request` should already take care of this. +- Test marks: always add `advanced_model` and `diffusion`. Add GPU-related marks if needed. Ref: [Markers for Tests](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_markers/). +- To maximize code reuse, you may refer to + - `tests/conftest.py` for the `omni_server` (running the server in a subprocess) and `openai_client` (sending requests and validating output) fixtures, and the `generate_synthetic_image` and `assert_XXX_valid` helpers. + - `tests/utils.py` for `@hardware_test(...)` and `hardware_marks`. + - [Parametrizing tests (pytest doc)](https://docs.pytest.org/en/stable/example/parametrize.html) to reuse a test function implementation for different cases. +- Doc: add a concise docstring for each test function. +- Reference L4 test implementation: [tests/e2e/online_serving/test_qwen_image_edit_expansion.py](https://github.com/vllm-project/vllm-omni/blob/main/tests/e2e/online_serving/test_qwen_image_edit_expansion.py). 
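Reviewer sketch (not part of the patch): the test case design rule in `l4_functionality_tests.inc.md`, that every supported feature appears in at least one test case and that a case whose coverage fully overlaps with the others is dropped, can be checked mechanically. The snippet below is a minimal illustration under stated assumptions. The feature labels, the case names, and the helpers `uncovered_features` and `redundant_cases` are all hypothetical and do not exist in the vllm-omni repository; the matrix mirrors the suggested 2 H100 combination.

```python
# Hypothetical feature labels for a model that supports every diffusion feature.
SUPPORTED_FEATURES = {
    "teacache", "cache_dit", "layerwise_offload", "gguf", "fp8",
    "ulysses", "ring", "cfg_parallel", "tp", "hsdp", "vae_patch_parallel",
}

# Hypothetical case names mapped to the features each case turns on
# (mirrors the suggested 2 H100 GPU combination).
TEST_CASES = {
    "1gpu_teacache_offload_gguf": {"teacache", "layerwise_offload", "gguf"},
    "2gpu_cachedit_ulysses_fp8": {"cache_dit", "ulysses", "fp8"},
    "2gpu_cachedit_ring_gguf": {"cache_dit", "ring", "gguf"},
    "2gpu_teacache_cfg_fp8": {"teacache", "cfg_parallel", "fp8"},
    "2gpu_cachedit_tp_vae_fp8": {"cache_dit", "tp", "vae_patch_parallel", "fp8"},
    "2gpu_cachedit_hsdp_vae_gguf": {"cache_dit", "hsdp", "vae_patch_parallel", "gguf"},
}


def uncovered_features(supported, cases):
    """Return the supported features that no test case exercises."""
    covered = set().union(*cases.values())
    return supported - covered


def redundant_cases(cases):
    """Return names of cases whose features are all covered by the other cases."""
    redundant = []
    for name, features in cases.items():
        others = set().union(*(f for n, f in cases.items() if n != name))
        if features <= others:
            redundant.append(name)
    return redundant


if __name__ == "__main__":
    print("uncovered:", sorted(uncovered_features(SUPPORTED_FEATURES, TEST_CASES)))
    print("redundant:", redundant_cases(TEST_CASES))
```

If a model lacked Ulysses support, removing `"ulysses"` from the second case would leave only `cache_dit` and `fp8`, both covered elsewhere, so `redundant_cases` would flag that case for removal, matching the fallback plan.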
diff --git a/docs/contributing/ci/test_examples/l4_performance_tests.inc.md b/docs/contributing/ci/test_examples/l4_performance_tests.inc.md new file mode 100644 index 00000000000..8093e1459f5 --- /dev/null +++ b/docs/contributing/ci/test_examples/l4_performance_tests.inc.md @@ -0,0 +1,89 @@ +When you want to add L4-level ***performance test*** cases, you can refer to the following format for case addition in tests/dfx/perf/tests/test.json: + +```JSON +{ + "test_name": "test_qwen3_omni", + "server_params": { + "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct", + "stage_config_name": "qwen3_omni.yaml" + }, + "benchmark_params": [ + { + "dataset_name": "random", + "num_prompts": [10, 20], + "max_concurrency": [1, 4], + "random_input_len": 2500, + "random_output_len": 900, + "ignore_eos": true, + "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration", + "baseline": { + "mean_ttft_ms": [500, 800], + "mean_audio_ttfp_ms": [2000, 3500], + "mean_audio_rtf": [0.25, 0.35] + } + } + ] +} +``` + +**Parameter Explanation** + +*Overview* + +| Field | Required | Description | +| ---------------- | -------- | --------------------------------------------------------------- | +| test_name | Yes | Unique identifier for the test case | +| server_params | Yes | Server-side configuration parameters | +| benchmark_params | Yes | Benchmark running parameters (supports multiple configurations) | + +**`server_params` Configuration** + +*Basic Parameters* + +| Parameter | Required | Example | Description | +| ----------------- | -------- | ---------------------------------- | ----------------------------- | +| model | Yes | "Qwen/Qwen3-Omni-30B-A3B-Instruct" | Model name or path | +| stage_config_name | Yes | "qwen3_omni.yaml" | Stage configuration file name | + +*Dynamic Configuration (update/delete)* + +Supports incremental modifications based on the basic configuration: + +| Operation | Description | +| --------- | ------------------------------------ | +| update | 
Update or add configuration items | +| delete | Delete specified configuration items | + +**Example**: + +``` +"update": { + "async_chunk": true, // Enable asynchronous chunk processing + "stage_args": { + "0": { + "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk" + } + } +}, +"delete": { + "stage_args": { + "2": ["custom_process_input_func"] // Delete this configuration for stage 2 + } +} +``` + +**`benchmark_params` Configuration** + +You can add any benchmark parameters you need here. For the full list of optional parameters, refer to the [benchmark documentation](https://github.com/vllm-project/vllm-omni/blob/main/docs/cli/bench/serve.md). Apply the following conventions: + +1. Convert each `--xxx-xx-xx` command-line option to the `xxx_xx_xx` format and use it as a key in the JSON file. +2. For boolean options, set the key to `true` or `false` in the JSON file, e.g. `"ignore_eos": true`. +3. Optionally add a `baseline` object (see the `baseline` row in the table below). If you omit `baseline` or leave it empty, the performance test still runs but does not assert metric thresholds from this field. +4. It is recommended to treat the QPS and concurrency modes as mutually exclusive. For detailed explanations, see the table below: + +| Parameter | Type | Required | Example/Values | Description | +| --------------- | ----------- | -------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| num_prompts | int / array | Yes | 10, [10, 20, 30] | Number of requests. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of qps or max_concurrency, e.g., [10,10,10]. 
If an array is used, its length must match the number of qps or max_concurrency. | +| request_rate | float / array | No | 0.5, [0.5, 1, inf] | Queries per second. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of num_prompts, e.g., [1,1,1]. If an array is used, its length must match the number of num_prompts. | +| max_concurrency | int / array | No | 1, [1, 2, 3] | Maximum concurrent in-flight requests. Same array / expansion rules as `request_rate` (mutually exclusive with QPS mode). | +| baseline | object | No | see above | Optional per-metric thresholds; keys must match benchmark output fields. Scalar, list (per sweep step), or object (keyed by concurrency or QPS string). diff --git a/docs/contributing/model/adding_diffusion_model.md b/docs/contributing/model/adding_diffusion_model.md index 366903433e1..dfa550173cf 100644 --- a/docs/contributing/model/adding_diffusion_model.md +++ b/docs/contributing/model/adding_diffusion_model.md @@ -653,25 +653,7 @@ For a fair comparison, keep the same **prompt**, **seed**, **resolution**, **num To ensure project maintainability and sustainable development, please submit test code (unit tests, system tests, or end-to-end tests) alongside their code changes. -For comprehensive testing guidelines and the definition of test levels (L1-L5), please refer to the [Test File Structure and Style Guide](../ci/tests_style.md). 
-The following tests are required to add: - -- L4 test of the model's full *functionality* (i.e., all the *diffusion features* that are supported by this model), including several [parallelism acceleration methods](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/parallelism_acceleration/), [CPU offloading](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/cpu_offload_diffusion/), [TeaCache](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/teacache/) and [Cache-DiT](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/cache_dit_acceleration/) cache backends, [quantization methods](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/quantization/overview/). - - Test cases: Currently all the features are available in online serving mode. Hence, only need to add `tests/e2e/online_serving/test_{model}_expansion.py`. The following test cases shall cover all features: - - 1 GPU: TeaCache & GGUF (or fallback to FP8, or disable it) & Layer-wise CPU offloading (or fallback to Module-wise) - - 2 GPUs: Cache-DiT & FP8 (or fallback to GGUF, or disable it) & Ulysses = 2 - - 2 GPUs: Cache-DiT & GGUF (or fallback to FP8, or disable it) & Ring = 2 - - 2 GPUs: TeaCache & FP8 (or fallback to GGUF, or disable it) & CFG Parallel = 2 - - 2 GPUs: Cache-DiT & FP8 (or fallback to GGUF, or disable it) & Tensor Parallel = 2 & VAE Patch Parallel = 2 - - 2 GPUs: Cache-DiT & GGUF (or fallback to FP8, or disable it) & HSDP = 2 & VAE Patch Parallel = 2 - - Validation: test that the multimodal output files of your model have the correct shapes. - - Test marks: always add `advanced_model` and `diffusion`. Add `parallel` and GPU-related marks if needed. 
Ref: [Markers for Tests](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_markers/) - - To maximize code reuse, you may refer to - - `tests/conftest.py` for `omni_server` and `openai_client` fixtures, `generate_synthetic_image` and `assert_XXX_valid` helper. - - `tests/utils.py` for `@hardware_test(...)` and `hardware_marks`. - - [Parametrizing tests (pytest doc)](https://docs.pytest.org/en/stable/example/parametrize.html) to reuse test function implementation for different cases. - - Doc: add a concise dostring for each test function. - - Reference L4 test implementation: [tests/e2e/online_serving/test_qwen_image_edit_expansion.py](https://github.com/vllm-project/vllm-omni/blob/main/tests/e2e/online_serving/test_qwen_image_edit_expansion.py). +For comprehensive testing guidelines and the definition of test levels (L1-L5), please refer to the [Multi-Level Automated Testing System Documentation](../ci/CI_5levels.md). You are at least required to add an L4 *functionality* test described in that document. ---