[ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm#31820
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces a workaround for numerical precision issues on ROCm that are causing test flakiness. The change involves modifying PyTorch's Scaled Dot-Product Attention (SDP) and matrix multiplication precision settings. While the fix is necessary, the current implementation using pytest_sessionstart alters global state without reverting it, which could unintentionally affect other tests in the suite. I have provided a suggestion to refactor this into a module-scoped autouse fixture. This is a safer, more idiomatic approach in pytest for managing test-specific setup and teardown, ensuring the changes are properly isolated.
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Pytest configuration for vLLM language generation tests."""

import warnings

import torch

from vllm.platforms import current_platform


def pytest_sessionstart(session):
    """Configure ROCm-specific settings before test session starts."""
    if not current_platform.is_rocm():
        return

    # Disable Flash/MemEfficient SDP on ROCm to avoid HF Transformers
    # accuracy issues: https://github.com/vllm-project/vllm/issues/30167
    # TODO: Remove once ROCm SDP accuracy issues are resolved on HuggingFace
    torch.backends.cuda.enable_flash_sdp(False)
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_math_sdp(True)
    torch.set_float32_matmul_precision("highest")
    warnings.warn(
        "ROCm: Disabled flash_sdp and mem_efficient_sdp, enabled math_sdp "
        "to avoid HuggingFace Transformers accuracy issues",
        UserWarning,
        stacklevel=1,
    )
```
Using pytest_sessionstart to modify global state like torch settings can have unintended side effects on other tests that run in the same session, as these settings are not reverted. This can lead to slower execution or unexpected behavior in unrelated tests.
A more robust and idiomatic pytest approach is to use a fixture with autouse=True and an appropriate scope (e.g., module). This ensures that the settings are applied only for the relevant tests and, crucially, that the original settings are restored after the tests in the module have completed, preventing any impact on other parts of the test suite.
I've suggested a refactoring to use a module-scoped autouse fixture which encapsulates the setup and teardown logic cleanly.
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Pytest configuration for vLLM language pooling tests."""

import warnings

import pytest
import torch

from vllm.platforms import current_platform


@pytest.fixture(scope="module", autouse=True)
def rocm_precision_workaround():
    """Workaround for numerical precision issues on ROCm for pooling tests."""
    if not current_platform.is_rocm():
        yield
        return

    # Save original settings
    orig_flash = torch.backends.cuda.flash_sdp_enabled()
    orig_mem_eff = torch.backends.cuda.mem_efficient_sdp_enabled()
    orig_math = torch.backends.cuda.math_sdp_enabled()
    orig_matmul_precision = torch.get_float32_matmul_precision()

    try:
        # Disable Flash/MemEfficient SDP on ROCm to avoid HF Transformers
        # accuracy issues: https://github.com/vllm-project/vllm/issues/30167
        # TODO: Remove once ROCm SDP accuracy issues are resolved on HuggingFace
        torch.backends.cuda.enable_flash_sdp(False)
        torch.backends.cuda.enable_mem_efficient_sdp(False)
        torch.backends.cuda.enable_math_sdp(True)
        torch.set_float32_matmul_precision("highest")
        warnings.warn(
            "ROCm: Disabled flash_sdp and mem_efficient_sdp, enabled math_sdp "
            "to avoid HuggingFace Transformers accuracy issues for pooling tests.",
            UserWarning,
            stacklevel=2,
        )
        yield
    finally:
        # Restore original settings
        torch.backends.cuda.enable_flash_sdp(orig_flash)
        torch.backends.cuda.enable_mem_efficient_sdp(orig_mem_eff)
        torch.backends.cuda.enable_math_sdp(orig_math)
        torch.set_float32_matmul_precision(orig_matmul_precision)
```
That is a bit too overengineered, and might not even be functional. The problem is in one specific test inside the Language Models Test (Extended Pooling) group.
Just to be sure, is this still needed after #31776?
@DarkLight1337 Thank you for pointing this PR out to me; I was not aware of it. I'm going to check the recent changes and then likely close this PR. I already see that Flex Attention has been completely removed from the ROCm attention dispatch mechanism.
… fp acc Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@DarkLight1337 there was still a small error on ROCm; this PR addresses it.
…y on ROCm (vllm-project#31820) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
`test_modernbert_models` fails sometimes on ROCm due to numerical precision differences between vLLM's custom kernels and HuggingFace eager attention, with a max diff of ~0.03 exceeding the 0.01 threshold in only 2 floats.

Root Cause
ROCm's default matmul precision settings produce slightly different numerical results.
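For illustration only (the values below are hypothetical, not taken from the actual test), this is the kind of max-abs-diff tolerance check that a couple of outlier elements can trip:

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical vLLM vs. HF eager outputs: most elements agree to ~0.002,
# but a single ~0.03 outlier exceeds the 0.01 threshold and fails the test.
vllm_logits = [0.120, -0.450, 0.981]
hf_logits = [0.118, -0.448, 1.011]

print(max_abs_diff(vllm_logits, hf_logits) > 0.01)  # prints True
```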
Testing
Ran the test 100+ times in a loop, clearing the cache between runs so the model recompiles from scratch.