tests: skip MoE LoRA extractor coverage when discovery finds zero classes#628
Conversation
…sses
test_every_patched_moe_experts_class_has_lora_extractor asserts that
every class flagged `_unsloth_already_patched=True` whose source defines
3D `gate_up_proj`/`down_proj` parameters also has
`_unsloth_lora_extractor_fn` registered. The defensive precondition was
"discovery produced at least one patched class"; failing that, the
test raised AssertionError("Discovery produced zero patched MoE classes
... Investigate.").
That precondition fires on any transformers version older than the MoE
surface unsloth_zoo patches. Specifically, with transformers==4.57.6
none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the
TEMPORARY_PATCHES list targets exist, so every patch fn no-ops and
discovery returns []. The "zero discoveries" outcome on an older
transformers is not a regression of the bug being prevented (a class
patched-but-missing-extractor); it just means the relevant classes
aren't there to patch.
Switch the precondition from `assert patched, ...` to `pytest.skip(...)`
with a comment explaining the two indistinguishable cases. CI cells
running newer transformers (where classes do exist) will still exercise
the real assertion below; the older-transformers cell skips cleanly.
Surfaced by the consolidated CPU-CI matrix on unslothai/unsloth#5312
in the cell pinned to `transformers==4.57.6`.
… job-level continue-on-error
Two cleanups derived from review of the matrix output:
1. Skip false-positive zero-arg patches in the runtime ledger.
Three patches have all-defaulted signatures but require either
runtime args or real CUDA, so calling them in isolation produces
a meaningless failure:
- patch_linear_scaling: defaults are None placeholders;
body starts with `assert rope_module is not None` etc.
- patch_llama_rope_scaling: same shape.
- patch_unsloth_smart_gradient_checkpointing: legitimately
allocates CUDA tensors via aten::empty.memory_format inside
initialize_unsloth_gradient_checkpointing(); the torch.cuda.*
Python spoof can't intercept that at the dispatcher level.
Add NEEDS_PRECONDITION = {...} to the shim and skip those by name.
Symbol presence is still verified via REQUIRED.
2. Drop the job-level `continue-on-error: true`.
Previously the cell reported SUCCESS even when steps failed, which
made the PR check UI lie. Real failures now turn the cell red.
Per-step `continue-on-error: true` stays so a single failed step
does not cascade and skip the rest of the ledger.
Three other failures the matrix surfaced are addressed by separate PRs
to source:
- #5319 (patch_fast_lora missing import,
patch_sft_trainer_tokenizer Union NameError, openenv OSError)
- unslothai/unsloth-zoo#628 (skip MoE coverage on older transformers)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9f9c68598
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # surface as a red regression here. CI cells running the | ||
| # transformers version the PR actually pins (and any newer one) | ||
| # will discover patched classes and run the real assertion. | ||
| pytest.skip( |
There was a problem hiding this comment.
Preserve failure on discovery-regression cases
On a supported/newer transformers install that does contain the MoE classes, if the patch marker convention changes (or all relevant patch functions fail before recording the _original_*_forward marker), _discover_patched_moe_classes() still returns [] and this branch reports the test as skipped. That is the same discovery-regression case the previous assertion caught, so this can make the extractor-registration coverage silently disappear from the CI matrix instead of failing when coverage is no longer being exercised.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request modifies tests/test_moe_lora_extractor_coverage.py to replace a hard assertion with a pytest.skip call when no patched MoE classes are discovered. This change prevents test failures on older versions of the transformers library that do not include the targeted MoE classes, while still allowing the test to run on newer versions. I have no feedback to provide.
There was a problem hiding this comment.
Code Review
This pull request modifies the MoE LoRA extractor coverage test to skip instead of fail when no patched classes are discovered, which prevents false negatives in CI environments running older versions of the 'transformers' library. The review feedback suggests updating the terminology in the skip message and comments to accurately reflect the patching marker convention used in the discovery logic.
| if not patched: | ||
| # Two situations produce zero discoveries: | ||
| # 1. The transformers version installed predates the MoE class | ||
| # surface unsloth_zoo patches (e.g. transformers 4.57.6 ships | ||
| # none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the | ||
| # TEMPORARY_PATCHES list targets, so every patch fn no-ops). | ||
| # 2. The `_unsloth_already_patched` marker convention drifted | ||
| # (the regression this test guards against). | ||
| # We can't tell (1) and (2) apart by discovery alone. Skip rather | ||
| # than fail: a CI cell pinned to an older transformers should not | ||
| # surface as a red regression here. CI cells running the | ||
| # transformers version the PR actually pins (and any newer one) | ||
| # will discover patched classes and run the real assertion. | ||
| pytest.skip( | ||
| "No patched MoE classes discovered on this transformers version. " | ||
| "Either transformers is older than the MoE class surface " | ||
| "unsloth_zoo patches, or the _unsloth_already_patched marker " | ||
| "convention changed. The latter is a real regression but " | ||
| "indistinguishable from the former here; skip and let cells " | ||
| "running newer transformers exercise the real assertion." | ||
| ) |
There was a problem hiding this comment.
The comment and skip message refer to the _unsloth_already_patched marker convention. However, the discovery logic in _discover_patched_moe_classes (specifically _has_unsloth_patched_forward) actually looks for the _original_..._forward attribute. Using the correct terminology ensures that if the patching mechanism changes, the test's failure/skip messages remain accurate and helpful for debugging.
if not patched:
# Two situations produce zero discoveries:
# 1. The transformers version installed predates the MoE class
# surface unsloth_zoo patches (e.g. transformers 4.57.6 ships
# none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the
# TEMPORARY_PATCHES list targets, so every patch fn no-ops).
# 2. The patching marker convention (e.g. _original_..._forward)
# drifted (the regression this test guards against).
# We can't tell (1) and (2) apart by discovery alone. Skip rather
# than fail: a CI cell pinned to an older transformers should not
# surface as a red regression here. CI cells running the
# transformers version the PR actually pins (and any newer one)
# will discover patched classes and run the real assertion.
pytest.skip(
"No patched MoE classes discovered on this transformers version. "
"Either transformers is older than the MoE class surface "
"unsloth_zoo patches, or the patching marker convention "
"changed. The latter is a real regression but "
"indistinguishable from the former here; skip and let cells "
"running newer transformers exercise the real assertion."
)There was a problem hiding this comment.
Code Review
This pull request modifies the MoE LoRA extractor coverage test to skip execution when no patched MoE classes are discovered, preventing failures in CI environments with older transformers versions. Feedback suggests that this unconditional skip could lead to silent failures if the discovery mechanism itself breaks. It is recommended to implement a check for a known MoE module to distinguish between an unsupported environment and a regression in the discovery logic.
| if not patched: | ||
| # Two situations produce zero discoveries: | ||
| # 1. The transformers version installed predates the MoE class | ||
| # surface unsloth_zoo patches (e.g. transformers 4.57.6 ships | ||
| # none of the gemma4 / llama4 / qwen3_moe / mxfp4 classes the | ||
| # TEMPORARY_PATCHES list targets, so every patch fn no-ops). | ||
| # 2. The `_unsloth_already_patched` marker convention drifted | ||
| # (the regression this test guards against). | ||
| # We can't tell (1) and (2) apart by discovery alone. Skip rather | ||
| # than fail: a CI cell pinned to an older transformers should not | ||
| # surface as a red regression here. CI cells running the | ||
| # transformers version the PR actually pins (and any newer one) | ||
| # will discover patched classes and run the real assertion. | ||
| pytest.skip( | ||
| "No patched MoE classes discovered on this transformers version. " | ||
| "Either transformers is older than the MoE class surface " | ||
| "unsloth_zoo patches, or the _unsloth_already_patched marker " | ||
| "convention changed. The latter is a real regression but " | ||
| "indistinguishable from the former here; skip and let cells " | ||
| "running newer transformers exercise the real assertion." | ||
| ) |
There was a problem hiding this comment.
Replacing the assertion with an unconditional skip when no patched classes are found introduces a risk of silent test failure. If the discovery mechanism (e.g., the _original_ attribute naming convention) breaks, the test will skip on all platforms, including those where MoE classes exist, thus failing to detect the regression it was designed to prevent.
To distinguish between an old transformers version (where skipping is appropriate) and a broken discovery mechanism (where the test should fail), you can check for the existence of a known MoE module. If the module exists but discovery found nothing, it's a real failure.
if not patched:
# To distinguish between "old transformers" (skip) and "broken discovery" (fail),
# check if a known MoE module exists. If it does, discovery should have found it.
try:
importlib.import_module("transformers.models.gemma4.modeling_gemma4")
assert patched, (
"Discovery produced zero patched MoE classes, but Gemma-4 was found. "
"This indicates the discovery logic or marker convention is broken."
)
except ImportError:
pass
pytest.skip(
"No patched MoE classes discovered. This is expected on older "
"transformers versions where the target MoE classes do not exist."
)Per codex/gemini review on PR #628: replacing the original assert with an unconditional skip created a silent-failure path -- a real discovery regression on a transformers that DOES ship the grouped-MoE surface would now skip rather than fail, defeating the test. Disambiguate the two zero-discovery causes by walking transformers once and asking whether ANY class anywhere in transformers.models.* declares `gate_up_proj` + `down_proj` as 3D `nn.Parameter`s -- the exact filter `_looks_like_grouped_moe_experts` runs against discovered classes. If at least one such class exists AND discovery returned zero, that is a real regression: fail loud. If no such class exists, the patch surface this test targets is genuinely absent on this transformers version: skip. This avoids hard-coding model names that change as transformers grows new MoE families. Also update terminology in the comment + skip / fail messages: the load-bearing marker `_has_unsloth_patched_forward` reads is `_original_<modeling_tail>_<ClassName>_forward`, not the per-family `_unsloth_already_patched` flag.
|
Pushed The previous unconditional skip created a silent-failure path: a real discovery regression on a transformers that DOES ship the grouped-MoE surface would have skipped rather than failed, which is exactly the regression this test was written to catch. The new logic walks This is the structural canary the gemini reviewer suggested, generalized so it doesn't hard-code model names that change as transformers grows new MoE families. Also corrected the comment + messages to refer to the actual marker convention the discovery logic reads ( Verified locally on transformers 4.57.6: discovery finds |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4c464c8bba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| importable = [ | ||
| f"{cls.__module__}.{cls.__name__}" | ||
| for modeling in _iter_modeling_modules() | ||
| for _, cls in inspect.getmembers(modeling, inspect.isclass) | ||
| if isinstance(cls, type) | ||
| and getattr(cls, "__module__", None) == modeling.__name__ | ||
| and _looks_like_grouped_moe_experts(cls) | ||
| ] |
There was a problem hiding this comment.
Restrict the canary to patch targets
In older transformers cells where none of the current TEMPORARY_PATCHES targets import, this canary still treats any grouped-MoE-looking class as proof that the patch surface exists. For example, transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeExperts has the same gate_up_proj/down_proj 3D nn.Parameter declarations in 4.57.x, but this repo has no qwen2_moe temporary patch target, so patched remains empty and this branch raises instead of performing the intended skip. Please limit the canary to classes/modules actually targeted by TEMPORARY_PATCHES (or otherwise distinguish untargeted MoE families) so old-transformers CI cells do not fail falsely.
Useful? React with 👍 / 👎.
Summary
test_every_patched_moe_experts_class_has_lora_extractor(PR #624) isdesigned to prevent a regression where a class flagged
_unsloth_already_patched=Trueis missing_unsloth_lora_extractor_fn.Its defensive precondition was "discovery produced at least one patched
class"; failing that, the test raised an AssertionError telling the
reader to investigate.
That precondition fires on any transformers version older than the MoE
class surface
unsloth_zoo.temporary_patchestargets. Withtransformers==4.57.6for example, none of the gemma4 / llama4 /qwen3_moe / mxfp4 classes the
TEMPORARY_PATCHESlist patches exist,so every patch fn no-ops and discovery returns
[]. The "zerodiscoveries" outcome on an older transformers is not a regression
of the bug being prevented; it just means the relevant classes aren't
there to patch.
Replace
assert patched, ...withpytest.skip(...)and a commentexplaining the two indistinguishable cases. CI cells running newer
transformers (where the classes do exist) still exercise the real
assertion below; the older-transformers cell skips cleanly.
Test plan
cap), discovery still produces patched classes and the real
assertion runs.
transformers==4.57.6 + trl<1should drop the failure on this test on the next run.
Why this isn't
pytest.mark.skipif(transformers_version < X.Y)A version-string skipif would have to encode every minimum version
across the
TEMPORARY_PATCHESlist (gemma3, gemma4, llama4, qwen3_moe,mxfp4 each landed in a different transformers release) AND track
upstream renames. The "zero patched classes" runtime check is the
precise condition we actually care about: skip when nothing is there
to verify.
Surfaced by the consolidated CPU-CI matrix on unslothai/unsloth#5312.