Merged
2 changes: 1 addition & 1 deletion tests/e2e/conftest.py
@@ -280,7 +280,7 @@ def __init__(
disable_log_stats: bool = True,
tensor_parallel_size: int = 1,
block_size: int = 16,
-enable_chunked_prefill: bool = False,
+enable_chunked_prefill: bool = True,
Contributor (severity: high)

Enabling enable_chunked_prefill by default will affect MLA (Multi-head Latent Attention) models in the tests. The _build_attn_state function in vllm_ascend/worker/model_runner_v1.py will now set the attention state to ChunkedPrefill for all models, including MLA models.

However, I found a chunked_prefill_for_mla option in vllm_ascend/ascend_config.py that appears intended to control this behavior for MLA models, but it is currently unused. This is confusing and could lead to unexpected behavior for developers trying to configure the feature.

If chunked prefill is now fully supported for MLA models and should be on by default, please consider removing the unused chunked_prefill_for_mla configuration to avoid confusion.

If chunked prefill for MLA is experimental or should be opt-in, the logic in _build_attn_state should be updated to respect this configuration. For example:

# In vllm_ascend/worker/model_runner_v1.py, _build_attn_state
...
        elif self.scheduler_config.enable_chunked_prefill:
            if self.vllm_config.model_config.use_mla and not self.ascend_config.chunked_prefill_for_mla:
                # Keep MLA models on the prefill-cache-hit path unless the
                # opt-in flag is enabled.
                attn_state = AscendAttentionState.PrefillCacheHit
            else:
                attn_state = AscendAttentionState.ChunkedPrefill
...

Given this PR changes the default for all tests, clarifying the intended behavior for MLA models is important for maintainability.
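
For reference, here is a minimal sketch of how the unused option could be surfaced from AscendConfig; the attribute layout, the additional_config plumbing, and the constructor signature below are assumptions for illustration, not verified against ascend_config.py.

```python
# Hypothetical sketch only: how chunked_prefill_for_mla might be exposed in
# vllm_ascend/ascend_config.py. Everything except the option name is assumed.
class AscendConfig:

    def __init__(self, vllm_config):
        additional_config = getattr(vllm_config, "additional_config", None) or {}
        # Opt-in switch: keep chunked prefill off for MLA models unless the
        # deployment explicitly enables it.
        self.chunked_prefill_for_mla = bool(
            additional_config.get("chunked_prefill_for_mla", False))
```

With a flag like this wired through, the _build_attn_state branch above could honor it as suggested.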

swap_space: int = 4,
enforce_eager: Optional[bool] = False,
quantization: Optional[str] = None,
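As a usage note on the flipped default, the sketch below shows how an individual e2e test could still opt out explicitly. The model id is a placeholder, and the direct import of VllmRunner and the generate_greedy call are assumptions modeled on the existing test helpers, not part of this PR.

```python
# Hypothetical test sketch: opting out of the new chunked-prefill default.
from tests.e2e.conftest import VllmRunner


def test_mla_model_without_chunked_prefill() -> None:
    # "example/mla-model" is a placeholder model id, not one used in this PR.
    with VllmRunner(
            "example/mla-model",
            enable_chunked_prefill=False,  # override the new default of True
            max_num_seqs=1,
            gpu_memory_utilization=0.6,
    ) as runner:
        # generate_greedy is assumed to mirror the upstream vLLM test helper.
        runner.generate_greedy(["Hello, my name is"], max_tokens=16)
```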
1 change: 0 additions & 1 deletion tests/e2e/multicard/test_prefix_caching.py
@@ -58,7 +58,6 @@
]


-@pytest.mark.skip(reason="Fix me, the accuracy is not correct")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("max_tokens", [50])
def test_prefix_cache_with_v1_scheduler(model: str, max_tokens: int) -> None:
@@ -118,7 +118,6 @@ def test_eagle_correctness(
spec_model_name = eagle3_model_name() if use_eagle3 else eagle_model_name()
with VllmRunner(
model_name,
-enable_chunked_prefill=True,
max_num_seqs=1,
max_num_batched_tokens=2048,
gpu_memory_utilization=0.6,