[Spyre-Next] RMSNorm tests and upstream tests framework#837

Merged
joerunde merged 24 commits into torch-spyre:main from romitjain:tests/rms-norm
Mar 19, 2026

Conversation

@romitjain
Collaborator

@romitjain romitjain commented Mar 13, 2026

Description

This PR does 2 things:

  1. Adds tests for SpyreRMSNorm that run from vllm-spyre/vllm_spyre_next

There are 2 tests added - a unit test verifying the correctness of the layer on CPU/Spyre and an integration test to ensure forward_oot gets called when it is installed as a vLLM plugin.
While writing these tests, I found a couple of issues in the SpyreRMSNorm implementation, which I have attempted to fix; please correct me if I am wrong.

  2. Adds a framework for running upstream vLLM tests and runs RMSNorm upstream tests

Building on: vllm-project/vllm#36246, this PR also adds a framework that can be used to filter and update upstream tests and run them from the vllm repo.

  1. We clone vllm separately and run tests from the vllm repo (copied over from [Spyre-Next] Run upstream vLLM tests with pytest #800, not the contribution of this PR)
  2. We manage the whitelist/filtering logic via a declarative YAML
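For illustration, an entry in such a declarative whitelist YAML could look like the sketch below (the exact schema may differ; the field names shown mirror the `allow_list`/`test`/`mode`/`tags` shape discussed in review and are otherwise assumptions):

```yaml
# Hypothetical upstream_tests.yaml entry: whitelist one upstream test,
# require it to pass, and tag it for marker-based selection.
allow_list:
  - test: "test_rms_norm"
    mode: mandatory_pass
    tags: [rmsnorm, llama, granite]
```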

Related Issues

#805

Test Plan

To test both features of this PR:

  1. Tests for SpyreRMSNorm that run from vllm-spyre/vllm_spyre_next
cd vllm-spyre/vllm_spyre_next
# Installs the pytest plugin
uv pip install -e .

VLLM_PLUGINS=spyre_next_ops pytest -rA tests/test_rms_norm.py -m spyre

This is expected to produce:

================================================================================= short test summary info =================================================================================
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-64-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-128-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-256-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-512-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-64-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-128-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-256-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-512-1]
PASSED tests/test_rms_norm.py::test_rmsnorm_oot_dispatch[False]
PASSED tests/test_rms_norm.py::test_rmsnorm_oot_dispatch[True]
SKIPPED [1] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_permute_cols.py:11: permute_cols is not supported on ROCm
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-63-1] - <redacted>
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-65-1] - <redacted>
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-127-1] - <redacted>
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-129-1] - <redacted>
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-63-1] - <redacted>
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-65-1] - <redacted>
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-127-1] - <redacted>
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-129-1] - <redacted>
========================================================== 8 failed, 10 passed, 1 skipped, 5836 deselected, 9 warnings in 24.42s ==========================================================

The tests fail at the boundaries of the hidden dim, which is expected for now, since the hidden dim is not being padded to a multiple of 64. (I can raise a separate PR to fix that, but I did not want to overload this PR.)
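For illustration, rounding the hidden dim up to the next multiple of 64 could look like this sketch (`pad_to_multiple` is a hypothetical helper, not the actual fix in the repo):

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple, e.g. 63 -> 64, 65 -> 128."""
    return ((n + multiple - 1) // multiple) * multiple

# The failing boundary cases above (63, 65, 127, 129) would all round up:
padded = [pad_to_multiple(h) for h in (63, 65, 127, 129)]  # -> [64, 128, 128, 192]
```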

  2. Upstream tests that run from vLLM

This makes use of my PR on vLLM, vllm-project/vllm#36246, which enables the RMSNorm test to run for OOT devices.

# For demonstration purposes, I am testing on my fork and commit
export VLLM_COMMIT=a3b591a09545403114885ac7fbd94b63fbac1696
export VLLM_REPO_URL=https://github.com/romitjain/vllm

VLLM_PLUGINS=spyre_next_ops,spyre_next_test python -m pytest -rA -m upstream

This is expected to produce:

================================================================================= short test summary info =================================================================================
PASSED test_layernorm.py::test_rms_norm[False-cpu-0-dtype0-False-64-1]
PASSED test_layernorm.py::test_rms_norm[False-cpu-0-dtype0-False-64-16]
PASSED test_layernorm.py::test_rms_norm[False-cpu-0-dtype1-False-64-1]
PASSED test_layernorm.py::test_rms_norm[False-cpu-0-dtype1-False-64-16]
PASSED test_layernorm.py::test_rms_norm[False-cpu-0-dtype2-False-64-1]
PASSED test_layernorm.py::test_rms_norm[False-cpu-0-dtype2-False-64-16]
SKIPPED [1] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_permute_cols.py:11: permute_cols is not supported on ROCm
SKIPPED [126] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_activation.py:34: not in upstream_tests.yaml
SKIPPED [54] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_activation.py:119: not in upstream_tests.yaml
SKIPPED [4] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_apply_rotary_emb.py:188: Skipping CUDA/ROCm only tests.
SKIPPED [24] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_fused_qk_norm_rope.py:48: fused_qk_norm_rope custom op requires cuda and rocm platform
SKIPPED [4256] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_fused_quant_layernorm.py:156: not in upstream_tests.yaml
SKIPPED [16] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_fused_rms_norm_gated.py:21: not in upstream_tests.yaml
SKIPPED [8] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_fused_rms_norm_gated.py:61: not in upstream_tests.yaml
SKIPPED [12] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_layernorm.py:24: param skipped
SKIPPED [648] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_layernorm.py:85: blocked by upstream_tests.yaml
SKIPPED [24] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_mrope.py:59: Skipping CUDA/ROCm only tests.
SKIPPED [24] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_mrope.py:129: Skipping CUDA/ROCm only tests.
SKIPPED [1] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_opcheck.py: not in upstream_tests.yaml
SKIPPED [384] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_pos_encoding.py:54: not in upstream_tests.yaml
SKIPPED [1] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_pos_encoding.py: not in upstream_tests.yaml
SKIPPED [96] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_rotary_embedding.py:30: not in upstream_tests.yaml
SKIPPED [144] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_rotary_embedding_mla_cache_fused.py:20: not in upstream_tests.yaml
SKIPPED [1] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_uva.py:14: UVA is not available.
SKIPPED [1] ../../.cache/vllm-upstream-tests/worktree-a3b591a09545/tests/kernels/core/test_uva.py:36: UVA is not available.
FAILED test_layernorm.py::test_rms_norm[False-cpu-0-dtype0-True-64-1] - AssertionError: Tensor-likes are not close!
FAILED test_layernorm.py::test_rms_norm[False-cpu-0-dtype0-True-64-16] - AssertionError: Tensor-likes are not close!
FAILED test_layernorm.py::test_rms_norm[False-cpu-0-dtype1-True-64-1] - AssertionError: Tensor-likes are not close!
FAILED test_layernorm.py::test_rms_norm[False-cpu-0-dtype1-True-64-16] - AssertionError: Tensor-likes are not close!
FAILED test_layernorm.py::test_rms_norm[False-cpu-0-dtype2-True-64-1] - AssertionError: Tensor-likes are not close!
FAILED test_layernorm.py::test_rms_norm[False-cpu-0-dtype2-True-64-16] - AssertionError: Tensor-likes are not close!
========================================================= 6 failed, 6 passed, 5825 skipped, 19 deselected, 13 warnings in 23.26s ==========================================================

We can see that our YAML is being respected and:

  1. Most of the upstream tests are skipped due to not being in our YAML
  2. 12 tests are getting skipped for parameters not being supported (param skipped)
  3. 12 tests are selected to run, out of which 6 pass and 6 fail

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

Comment thread vllm_spyre_next/tests/conftest.py Outdated
Collaborator Author


This is largely for demonstration purposes. Can go back based on the feedback


@staticmethod
def forward_static(
def forward_spyre(
Collaborator Author


This is renamed to keep one upstream method as a reference.

I propose forward_static as the upstream golden reference implementation against which our custom implementation can be compared.

return


_REGISTERED = False
Collaborator Author


While running tests, we need to register these ops multiple times and run into issues with multiple registration.

Collaborator


@functools.lru_cache(maxsize=1) would be a clean way to do this that prevents accidentally mutating this global variable and re-registering plugins.

Alternatively, a closure in a decorator would also work without global variables, and be easy to re-use across all the custom ops that we'll need to do this for.

from functools import wraps

def run_once(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        if not wrapper.has_run:
            wrapper.has_run = True
            return f(*args, **kwargs)
    wrapper.has_run = False
    return wrapper

@run_once
def register():

Collaborator Author


Makes sense @joerunde, I'll update it to use lru_cache instead.

add_residual: [true]
strided_input: [true]
override:
  num_tokens: [1, 16]
Collaborator Author


Only testing for known pass conditions for now

Comment thread vllm_spyre_next/pyproject.toml
raise NotImplementedError("TODO: variance_size_override not yet implemented")

batch_padding = x.shape[0]
orig_batch_size = x.shape[0]
Collaborator Author


Padding was being done incorrectly here

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

dev = [
"pytest"
"pytest",
"pyyaml",
Collaborator

@joerunde joerunde Mar 13, 2026


what's this for?

edit- I always thought yaml was a builtin package because it's almost always installed by some other dependency 😆

@@ -0,0 +1,76 @@
"""Data models for the vllm-spyre-next test infrastructure."""
Collaborator


I really like this encapsulation within a plugin!

I don't have any experience writing pytest plugins, but I do think that we'd want to keep it in a separate package by itself instead of packaging it within vllm-spyre-next

@joerunde
Collaborator

The reason (1) changes is purely for my testing of vllm-project/vllm#36246 - based on the feedback, we can actually go back to cloning vllm into cache and launching tests from there.

@romitjain I imagine we'll run into this situation a lot as we edit upstream tests :/
But the original goal of cloning into cache with conftest.py was so that a simple pytest invocation would run the upstream tests as well. This way we can easily test changes during development, and CI jobs don't require any extra setup to clone vllm.

Would it be possible to have the pytest plugin clone vllm into cache like conftest.py does currently, and also allow a CLI arg like --vllm-commit so that we can test against WIP test changes in vllm?

# Load plugins early to register custom ops before test modules import RMSNorm
from vllm.plugins import load_general_plugins

load_general_plugins()
Collaborator


very nice 👍

Comment thread vllm_spyre_next/pyproject.toml Outdated
dependencies = [
"torch-spyre",
"vllm==0.15.1+cpu",
# "vllm==0.15.1+cpu",
Collaborator


we for sure depend on vllm!

Why did this need to be removed for your testing?

Collaborator Author


I was testing using a custom fork and I was trying to install that custom fork in an editable manner

cd vllm-custom
pip install -e .

While installing the vllm-spyre-next plugin, I did not want to override my custom fork.

But this change would be reversed once I use the method you used to run upstream tests (cloning vllm into cache and using that)

Comment thread vllm_spyre_next/pyproject.toml Outdated
[tool.uv.sources]
# This is the unreleased v0.16.0 tag with 2.10 support
vllm = { git = "https://github.com/vllm-project/vllm", rev = "2d5be1dd5ce2e44dfea53ea03ff61143da5137eb" }
# vllm = { git = "https://github.com/vllm-project/vllm", rev = "2d5be1dd5ce2e44dfea53ea03ff61143da5137eb" }
Collaborator


Just FYI that this is here because we require building vllm from source so that we can compile cpu kernels which run the ops that aren't enabled on spyre yet

allow_list:
  - test: "test_rms_norm"
    mode: mandatory_pass
    tags: [rmsnorm, llama, granite]
Collaborator


It would be nice to have markers to use to select certain tests, since pretty quickly the full set of tests here will take far too long to run in one go and for local development we'll want to be able to run a set of upstream tests that cover what we're changing in vllm-spyre-next.

Maybe the right thing to do is to add markers to the upstream tests in vllm so that everybody benefits? But if we could mark tests with these tags that would allow us to run

pytest -m rmsnorm

to select the tests that we want
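A sketch of how the YAML tags could feed pytest markers (`tags_by_test` is a hypothetical helper; in practice a `pytest_collection_modifyitems` hook would attach each tag via `item.add_marker`):

```python
# Hypothetical helper: map each allowed upstream test to its YAML tags so a
# collection hook can attach them as pytest markers.
def tags_by_test(allow_list: list[dict]) -> dict[str, list[str]]:
    return {entry["test"]: entry.get("tags", []) for entry in allow_list}

allow_list = [
    {"test": "test_rms_norm", "mode": "mandatory_pass",
     "tags": ["rmsnorm", "llama", "granite"]},
]
```

With markers attached at collection time, `pytest -m rmsnorm` would then select exactly the tagged tests.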

Collaborator Author


Agreed, will make the change.

Comment on lines +9 to +20
def reference_rms_norm(
    x: torch.Tensor,
    weight: torch.Tensor | None,
    eps: float,
) -> torch.Tensor:
    """Golden reference: standard RMSNorm in PyTorch."""
    x_float = x.float()
    variance = x_float.pow(2).mean(dim=-1, keepdim=True)
    x_normed = x_float * torch.rsqrt(variance + eps)
    if weight is not None:
        x_normed = x_normed * weight.float()
    return x_normed
Collaborator


Should we directly use a reference method instead of reimplementing it, such as RMSNorm.forward_static()?

Collaborator Author


Agreed. I had added it to make sure we are not blocked by the upstream vLLM merge, but let me check if we can replace this without the merge too.

Comment thread vllm_spyre_next/tests/test_rms_norm.py Outdated
return x_normed


@pytest.mark.cpu
Collaborator


why is this marked as a cpu test? It should also run on spyre, from my understanding

Collaborator Author


Yes, this is a spyre test. This is an artifact of my previous work, where SpyreRMSNorm had a CPU path as well. I will merge it with the test below.

Comment thread vllm_spyre_next/tests/test_rms_norm.py Outdated

@pytest.mark.cpu
@pytest.mark.parametrize("batch_size", [1])
@pytest.mark.parametrize("hidden_size", [128, 512])
Collaborator


I might be missing something, but I am also wondering why we are not testing the shapes that require padding here. Why not use the same shapes as in line 44?

Comment on lines +93 to +104
# Mock forward_native (called by forward_oot) with a known transform
if residual is not None:
    monkeypatch.setattr(layer, "forward_native", mock_forward_native_with_residual)
    out_x, out_residual = layer.forward_oot(dummy_tensor, residual)

    assert torch.allclose(out_x, 2 * dummy_tensor)
    assert torch.allclose(out_residual, 2 * residual)
else:
    monkeypatch.setattr(layer, "forward_native", mock_forward_native_no_residual)
    out_x = layer.forward_oot(dummy_tensor, residual)

    assert torch.allclose(out_x, dummy_tensor + 1)
Collaborator


I love that, it's a really nice and interesting method to test that forward_oot is using forward_native

Comment thread vllm_spyre_next/tests/test_rms_norm.py Outdated
@pytest.mark.cpu
@pytest.mark.parametrize("batch_size", [1])
@pytest.mark.parametrize("hidden_size", [128, 512])
def test_spyre_rmsnorm_matches_reference(default_vllm_config, batch_size, hidden_size):
Collaborator


Can we also test the use of residual in these two tests? With your fixes to SpyreRMSNorm, my test that didn't use residual is passing, but the test using residual is failing.

@sducouedic
Collaborator

sducouedic commented Mar 13, 2026

thanks for the fix and the tests implementation @romitjain. I only reviewed the parts I was familiar with for now, but it looks like really good work to me. I closed my PR #830 as I think it has nothing to add

@romitjain
Collaborator Author

romitjain commented Mar 16, 2026

@joerunde
Yes, I 100% agree with this: #837 (comment)

My next commit will merge the cache+single pytest command that was already implemented (#800) with the one that I am proposing.

@romitjain
Collaborator Author

@joerunde @sducouedic

I have updated my PR to address your comments

  1. Fixed the CPU marker on the local test and added the residual path for the test. I expect this local test (test_spyre_rmsnorm_matches_reference) to be removed once my upstream PR gets merged
  2. Added LRU cache to the RMS op registration
  3. Made the upstream test framework similar to [Spyre-Next] Run upstream vLLM tests with pytest #800. The helper functions were copied over almost verbatim, but I added the YAML-based filtering

This means we can run both upstream and local tests from the vllm-spyre-next folder itself. It also allows for testing against a custom fork/PR, so that we don't need to wait for upstream PRs to test things locally.

export VLLM_COMMIT=a3b591a09545403114885ac7fbd94b63fbac1696
export VLLM_REPO_URL=https://github.com/romitjain/vllm

pytest -m upstream

I have updated the description, too.

@joerunde
Collaborator

bot:next-test

@joerunde
Collaborator

bot:next-test
export VLLM_COMMIT=a3b591a09545403114885ac7fbd94b63fbac1696
export VLLM_REPO_URL=https://github.com/romitjain/vllm

@joerunde
Collaborator

Digging in on final things to fix before merging:

  1. We should probably edit the test yaml to only specify tests that work with vllm 0.17.1, so that we don't have to always specify your fork and commit to test PRs. We can leave the existing config commented out with a link to your PR so that we know when it can be enabled
  2. I'm still seeing some test failures on the rms norm tests in this repo, and it looks like they're for parametrizations where the hidden size is not a multiple of 64. I'll see if I can figure out why that is today; this is failing on the latest base image:
=========================== short test summary info ============================
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-64-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-128-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-256-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-512-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-64-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-128-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-256-1]
PASSED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-512-1]
PASSED tests/test_rms_norm.py::test_rmsnorm_oot_dispatch[False]
PASSED tests/test_rms_norm.py::test_rmsnorm_oot_dispatch[True]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-63-1]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-65-1]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-127-1]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[False-129-1]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-63-1]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-65-1]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-127-1]
FAILED tests/test_rms_norm.py::test_spyre_rmsnorm_matches_reference[True-129-1]
FAILED tests/test_vllm_spyre_next.py::test_basic_model_load - RuntimeError: E...
==== 9 failed, 10 passed, 1344 deselected, 10 warnings in 79.72s (0:01:19) =====

@joerunde
Collaborator

Ah, additionally the extra failure of the basic model load test only seems to happen when it is run alongside other tests:

FAILED tests/test_vllm_spyre_next.py::test_basic_model_load - RuntimeError: E...

and this fails with a device busy error:

{"Device":"/dev/vfio/89","action":"information","category":"configuration","code":"0x332b","description":"The specified PCIe address was not found in the /dev/vfio subsystem.  This is likely due to a software misconfiguration.","errno":"Device or resource busy","message":"Failed to open the IBM Spyre VFIO device.","name":"RAS::VFIO::DeviceOpenFail","severity":"ERROR","step":"Validate user environment configuration","type":"runtime_error"}

I think this is likely because vllm will load the model from worker process(es), which will fail if the main pytest process is still holding references to memory on the spyre devices that hasn't been cleaned up. I'd guess we need some global cleanup fixtures to ensure that we don't hold onto device-side data between tests.

Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <joe@joerun.de>
@joerunde
Collaborator

bot:next-test

@joerunde
Collaborator

@romitjain I'm not sure what's up with the size errors on those test cases; I tried running on dev images from the last week, but they all hit the same error. If you have the full environment, including spyre runtime stack component versions used to run those tests, it'd be good to figure out how to get them to pass.

In the meantime I've pushed changes here to:

  1. Disable the upstream rms tests while we wait for support to be merged
  2. Add a couple upstream model tests that currently pass
  3. Add a uses_subprocess marker that forces tests to run last so that we can get around the device busy error while we figure out the correct way to unload the spyre device from tests that run directly in the main pytest process
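The run-last behaviour of such a marker can be sketched as a stable sort over collected tests (items are modeled here as simple `(name, uses_subprocess)` pairs; in a real `pytest_collection_modifyitems` hook the flag would come from `item.get_closest_marker("uses_subprocess")`):

```python
# Stable sort: tests without the marker keep their order and run first,
# tests carrying uses_subprocess are pushed to the end.
def order_run_last(items: list[tuple[str, bool]]) -> list[tuple[str, bool]]:
    return sorted(items, key=lambda item: item[1])  # False sorts before True

collected = [("test_basic_model_load", True), ("test_rms_norm", False)]
ordered = order_run_last(collected)  # the subprocess test now runs last
```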

This currently passes a simple bot:next-test, and I've added you to the allowlist so that you can trigger these tests with comments as well.

WDYT about merging?

Collaborator

@joerunde joerunde left a comment


Tests are passing, we can consider merging this

Signed-off-by: Joe Runde <joe@joerun.de>
@joerunde
Collaborator

Update: Added an allowlist for test params. This will be useful so that we can select only a set of parametrizations to run in cases where there are more things to block than to allow. We also think this will be required to avoid pulling in new test cases without vetting them first. (See the huge diff in the yaml for the language/generation tests on 0e9c0be.)

@joerunde
Collaborator

bot:next-test

@joerunde
Collaborator

@tjohnson31415 has some staged changes to move the testing plugin out of vllm_spyre_next, but that could go in a followup

@joerunde joerunde moved this to In progress in Torch-Spyre Device Enablement Mar 19, 2026
@joerunde joerunde added the vllm-spyre-next Related to enabling the next-generation of the Spyre stack built on top of torch-spyre label Mar 19, 2026
@joerunde joerunde moved this from In progress to In review in Torch-Spyre Device Enablement Mar 19, 2026
Signed-off-by: Joe Runde <joe@joerun.de>
@joerunde
Collaborator

bot:next-test

@joerunde
Collaborator

bot:next-test

tjohnson31415 and others added 2 commits March 19, 2026 10:48
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Joe Runde <joe@joerun.de>
@joerunde
Collaborator

bot:next-test

@joerunde joerunde enabled auto-merge (squash) March 19, 2026 17:10
Collaborator

@tjohnson31415 tjohnson31415 left a comment


LGTM.
I'll do a follow up to move the pytest plugin out of the vllm_spyre_next module.

@github-actions github-actions bot added the ready Runs the full CI test suite. Only add to PRs once ready to merge to limit public GHA usage label Mar 19, 2026
@joerunde joerunde merged commit b7ce7b6 into torch-spyre:main Mar 19, 2026
16 of 24 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Torch-Spyre Device Enablement Mar 19, 2026
@romitjain romitjain deleted the tests/rms-norm branch March 23, 2026 07:51
@romitjain
Collaborator Author

@joerunde @tjohnson31415.
Thanks for the review and merging. I was OOO, so I could not look into the trailing commits in detail at the time. The changes look good to me.

@joerunde re:

@romit I'm not sure what's up with the size errors on those test cases, I tried running on dev images from the last week but they all hit the same error. If you have the full environment including spyre runtime stack component versions used to run those tests, it'd be good to figure out how to get them to pass.

These were failing for me, too. Let me debug this.
