Skip to content

tests: skip MoE LoRA extractor coverage when discovery finds zero classes#628

Merged
danielhanchen merged 2 commits into
mainfrom
tests/skipif-moe-coverage-old-transformers
May 7, 2026
Merged

tests: skip MoE LoRA extractor coverage when discovery finds zero classes#628
danielhanchen merged 2 commits into
mainfrom
tests/skipif-moe-coverage-old-transformers

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Summary

test_every_patched_moe_experts_class_has_lora_extractor (PR #624) is
designed to prevent a regression where a class flagged
_unsloth_already_patched=True is missing _unsloth_lora_extractor_fn.
Its defensive precondition was "discovery produced at least one patched
class"; failing that, the test raised an AssertionError telling the
reader to investigate.

That precondition fires on any transformers version older than the MoE
class surface unsloth_zoo.temporary_patches targets. With
transformers==4.57.6 for example, none of the gemma4 / llama4 /
qwen3_moe / mxfp4 classes the TEMPORARY_PATCHES list patches exist,
so every patch fn no-ops and discovery returns []. The "zero
discoveries" outcome on an older transformers is not a regression
of the bug being prevented; it just means the relevant classes aren't
there to patch.

Replace assert patched, ... with pytest.skip(...) and a comment
explaining the two indistinguishable cases. CI cells running newer
transformers (where the classes do exist) still exercise the real
assertion below; the older-transformers cell skips cleanly.

Test plan

Why this isn't pytest.mark.skipif(transformers_version < X.Y)

A version-string skipif would have to encode every minimum version
across the TEMPORARY_PATCHES list (gemma3, gemma4, llama4, qwen3_moe,
mxfp4 each landed in a different transformers release) AND track
upstream renames. The "zero patched classes" runtime check is the
precise condition we actually care about: skip when nothing is there
to verify.

Surfaced by the consolidated CPU-CI matrix on unslothai/unsloth#5312.

…sses

test_every_patched_moe_experts_class_has_lora_extractor asserts that
every class flagged `_unsloth_already_patched=True` whose source defines
3D `gate_up_proj`/`down_proj` parameters also has
`_unsloth_lora_extractor_fn` registered. The defensive precondition was
"discovery produced at least one patched class"; failing that, the
test raised AssertionError("Discovery produced zero patched MoE classes
... Investigate.").

That precondition fires on any transformers version older than the MoE
surface unsloth_zoo patches. Specifically, with transformers==4.57.6
none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the
TEMPORARY_PATCHES list targets exist, so every patch fn no-ops and
discovery returns []. The "zero discoveries" outcome on an older
transformers is not a regression of the bug being prevented (a class
patched-but-missing-extractor); it just means the relevant classes
aren't there to patch.

Switch the precondition from `assert patched, ...` to `pytest.skip(...)`
with a comment explaining the two indistinguishable cases. CI cells
running newer transformers (where classes do exist) will still exercise
the real assertion below; the older-transformers cell skips cleanly.

Surfaced by the consolidated CPU-CI matrix on unslothai/unsloth#5312
in the cell pinned to `transformers==4.57.6`.
danielhanchen added a commit to unslothai/unsloth that referenced this pull request May 7, 2026
… job-level continue-on-error

Two cleanups derived from review of the matrix output:

1. Skip false-positive zero-arg patches in the runtime ledger.
   Three patches have all-defaulted signatures but require either
   runtime args or real CUDA, so calling them in isolation produces
   a meaningless failure:
     - patch_linear_scaling: defaults are None placeholders;
       body starts with `assert rope_module is not None` etc.
     - patch_llama_rope_scaling: same shape.
     - patch_unsloth_smart_gradient_checkpointing: legitimately
       allocates CUDA tensors via aten::empty.memory_format inside
       initialize_unsloth_gradient_checkpointing(); the torch.cuda.*
       Python spoof can't intercept that at the dispatcher level.
   Add NEEDS_PRECONDITION = {...} to the shim and skip those by name.
   Symbol presence is still verified via REQUIRED.

2. Drop the job-level `continue-on-error: true`.
   Previously the cell reported SUCCESS even when steps failed, which
   made the PR check UI lie. Real failures now turn the cell red.
   Per-step `continue-on-error: true` stays so a single failed step
   does not cascade and skip the rest of the ledger.

Three other failures the matrix surfaced are addressed by separate PRs
to source:
  - #5319 (patch_fast_lora missing import,
    patch_sft_trainer_tokenizer Union NameError, openenv OSError)
  - unslothai/unsloth-zoo#628 (skip MoE coverage on older transformers)
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9f9c68598

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# surface as a red regression here. CI cells running the
# transformers version the PR actually pins (and any newer one)
# will discover patched classes and run the real assertion.
pytest.skip(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve failure on discovery-regression cases

On a supported/newer transformers install that does contain the MoE classes, if the patch marker convention changes (or all relevant patch functions fail before recording the _original_*_forward marker), _discover_patched_moe_classes() still returns [] and this branch reports the test as skipped. That is the same discovery-regression case the previous assertion caught, so this can make the extractor-registration coverage silently disappear from the CI matrix instead of failing when coverage is no longer being exercised.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies tests/test_moe_lora_extractor_coverage.py to replace a hard assertion with a pytest.skip call when no patched MoE classes are discovered. This change prevents test failures on older versions of the transformers library that do not include the targeted MoE classes, while still allowing the test to run on newer versions. I have no feedback to provide.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the MoE LoRA extractor coverage test to skip instead of fail when no patched classes are discovered, which prevents false negatives in CI environments running older versions of the 'transformers' library. The review feedback suggests updating the terminology in the skip message and comments to accurately reflect the patching marker convention used in the discovery logic.

Comment on lines +305 to +325
if not patched:
# Two situations produce zero discoveries:
# 1. The transformers version installed predates the MoE class
# surface unsloth_zoo patches (e.g. transformers 4.57.6 ships
# none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the
# TEMPORARY_PATCHES list targets, so every patch fn no-ops).
# 2. The `_unsloth_already_patched` marker convention drifted
# (the regression this test guards against).
# We can't tell (1) and (2) apart by discovery alone. Skip rather
# than fail: a CI cell pinned to an older transformers should not
# surface as a red regression here. CI cells running the
# transformers version the PR actually pins (and any newer one)
# will discover patched classes and run the real assertion.
pytest.skip(
"No patched MoE classes discovered on this transformers version. "
"Either transformers is older than the MoE class surface "
"unsloth_zoo patches, or the _unsloth_already_patched marker "
"convention changed. The latter is a real regression but "
"indistinguishable from the former here; skip and let cells "
"running newer transformers exercise the real assertion."
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment and skip message refer to the _unsloth_already_patched marker convention. However, the discovery logic in _discover_patched_moe_classes (specifically _has_unsloth_patched_forward) actually looks for the _original_..._forward attribute. Using the correct terminology ensures that if the patching mechanism changes, the test's failure/skip messages remain accurate and helpful for debugging.

    if not patched:
        # Two situations produce zero discoveries:
        #   1. The transformers version installed predates the MoE class
        #      surface unsloth_zoo patches (e.g. transformers 4.57.6 ships
        #      none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the
        #      TEMPORARY_PATCHES list targets, so every patch fn no-ops).
        #   2. The patching marker convention (e.g. _original_..._forward)
        #      drifted (the regression this test guards against).
        # We can't tell (1) and (2) apart by discovery alone. Skip rather
        # than fail: a CI cell pinned to an older transformers should not
        # surface as a red regression here. CI cells running the
        # transformers version the PR actually pins (and any newer one)
        # will discover patched classes and run the real assertion.
        pytest.skip(
            "No patched MoE classes discovered on this transformers version. "
            "Either transformers is older than the MoE class surface "
            "unsloth_zoo patches, or the patching marker convention "
            "changed. The latter is a real regression but "
            "indistinguishable from the former here; skip and let cells "
            "running newer transformers exercise the real assertion."
        )

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the MoE LoRA extractor coverage test to skip execution when no patched MoE classes are discovered, preventing failures in CI environments with older transformers versions. Feedback suggests that this unconditional skip could lead to silent failures if the discovery mechanism itself breaks. It is recommended to implement a check for a known MoE module to distinguish between an unsupported environment and a regression in the discovery logic.

Comment on lines +305 to +325
if not patched:
# Two situations produce zero discoveries:
# 1. The transformers version installed predates the MoE class
# surface unsloth_zoo patches (e.g. transformers 4.57.6 ships
# none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the
# TEMPORARY_PATCHES list targets, so every patch fn no-ops).
# 2. The `_unsloth_already_patched` marker convention drifted
# (the regression this test guards against).
# We can't tell (1) and (2) apart by discovery alone. Skip rather
# than fail: a CI cell pinned to an older transformers should not
# surface as a red regression here. CI cells running the
# transformers version the PR actually pins (and any newer one)
# will discover patched classes and run the real assertion.
pytest.skip(
"No patched MoE classes discovered on this transformers version. "
"Either transformers is older than the MoE class surface "
"unsloth_zoo patches, or the _unsloth_already_patched marker "
"convention changed. The latter is a real regression but "
"indistinguishable from the former here; skip and let cells "
"running newer transformers exercise the real assertion."
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Replacing the assertion with an unconditional skip when no patched classes are found introduces a risk of silent test failure. If the discovery mechanism (e.g., the _original_ attribute naming convention) breaks, the test will skip on all platforms, including those where MoE classes exist, thus failing to detect the regression it was designed to prevent.

To distinguish between an old transformers version (where skipping is appropriate) and a broken discovery mechanism (where the test should fail), you can check for the existence of a known MoE module. If the module exists but discovery found nothing, it's a real failure.

    if not patched:
        # To distinguish between "old transformers" (skip) and "broken discovery" (fail),
        # check if a known MoE module exists. If it does, discovery should have found it.
        try:
            importlib.import_module("transformers.models.gemma4.modeling_gemma4")
            assert patched, (
                "Discovery produced zero patched MoE classes, but Gemma-4 was found. "
                "This indicates the discovery logic or marker convention is broken."
            )
        except ImportError:
            pass

        pytest.skip(
            "No patched MoE classes discovered. This is expected on older "
            "transformers versions where the target MoE classes do not exist."
        )

Per codex/gemini review on PR #628: replacing the original assert with
an unconditional skip created a silent-failure path -- a real
discovery regression on a transformers that DOES ship the grouped-MoE
surface would now skip rather than fail, defeating the test.

Disambiguate the two zero-discovery causes by walking transformers
once and asking whether ANY class anywhere in transformers.models.*
declares `gate_up_proj` + `down_proj` as 3D `nn.Parameter`s -- the
exact filter `_looks_like_grouped_moe_experts` runs against discovered
classes. If at least one such class exists AND discovery returned
zero, that is a real regression: fail loud. If no such class exists,
the patch surface this test targets is genuinely absent on this
transformers version: skip. This avoids hard-coding model names that
change as transformers grows new MoE families.

Also update terminology in the comment + skip / fail messages: the
load-bearing marker `_has_unsloth_patched_forward` reads is
`_original_<modeling_tail>_<ClassName>_forward`, not the per-family
`_unsloth_already_patched` flag.
@danielhanchen
Copy link
Copy Markdown
Member Author

Pushed 4c464c8 addressing the codex P2 / gemini medium feedback.

The previous unconditional skip created a silent-failure path: a real discovery regression on a transformers that DOES ship the grouped-MoE surface would have skipped rather than failed, which is exactly the regression this test was written to catch.

The new logic walks transformers.models.* once and asks whether ANY class declares gate_up_proj + down_proj as 3D nn.Parameters — the exact filter _looks_like_grouped_moe_experts runs against discovered classes. If at least one such class exists AND discovery returned zero, that's a real regression: fail loud with a message naming up to 5 example classes. If no such class exists anywhere in transformers, the patch surface this test targets is genuinely absent on this transformers version: skip.

This is the structural canary the gemini reviewer suggested, generalized so it doesn't hard-code model names that change as transformers grows new MoE families.

Also corrected the comment + messages to refer to the actual marker convention the discovery logic reads (_original_<modeling_tail>_<ClassName>_forward) rather than the per-family _unsloth_already_patched flag.

Verified locally on transformers 4.57.6: discovery finds Qwen3VLMoeTextExperts (which shipped grouped-MoE 3D params before the rest of the family) and the real assertion runs.

@danielhanchen danielhanchen merged commit 7502595 into main May 7, 2026
3 checks passed
@danielhanchen danielhanchen deleted the tests/skipif-moe-coverage-old-transformers branch May 7, 2026 07:12
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4c464c8bba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +330 to +337
importable = [
f"{cls.__module__}.{cls.__name__}"
for modeling in _iter_modeling_modules()
for _, cls in inspect.getmembers(modeling, inspect.isclass)
if isinstance(cls, type)
and getattr(cls, "__module__", None) == modeling.__name__
and _looks_like_grouped_moe_experts(cls)
]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict the canary to patch targets

In older transformers cells where none of the current TEMPORARY_PATCHES targets import, this canary still treats any grouped-MoE-looking class as proof that the patch surface exists. For example, transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeExperts has the same gate_up_proj/down_proj 3D nn.Parameter declarations in 4.57.x, but this repo has no qwen2_moe temporary patch target, so patched remains empty and this branch raises instead of performing the intended skip. Please limit the canary to classes/modules actually targeted by TEMPORARY_PATCHES (or otherwise distinguish untargeted MoE families) so old-transformers CI cells do not fail falsely.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant