Skip to content

tests: pinned-symbol canary for unsloth-zoo save_pretrained_merged guards (#5410)#5433

Merged
danielhanchen merged 23 commits into
mainfrom
ci-save-merged-smoke-5410
May 17, 2026
Merged

tests: pinned-symbol canary for unsloth-zoo save_pretrained_merged guards (#5410)#5433
danielhanchen merged 23 commits into
mainfrom
ci-save-merged-smoke-5410

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Summary

#5410 was a silent-write bug in the save_pretrained_merged path that the existing CI matrix could not detect because the merge-helper tests were not wired through the upstream-drift suite. The full fix lives in unslothai/unsloth-zoo#647 (layout-aware MoE merge helpers, authoritative num_experts resolver, loud-fail counter, generation_config.json save). This PR adds the unsloth-side canary that watches for the four guard families staying in place in unsloth-zoo so a future refactor cannot silently regress them.

The companion CI work in unslothai/unsloth-zoo#649 wires the unsloth-zoo CI matrix to exercise the same guards.

What this PR adds

A single new test file at tests/version_compat/test_unsloth_zoo_save_merged_pinned_symbols.py and one workflow line in .github/workflows/version-compat-ci.yml.

The test fetches unsloth_zoo/saving_utils.py and tests/test_unsloth_zoo_lora_merge.py from unslothai/unsloth-zoo:main (no pip install, no GPU, no model download) and asserts:

guard what regresses if removed
_MOE_MERGE_STATE + _reset_moe_merge_state + _record_moe_merge_fallback + the raise RuntimeError(...MoE...) silent fallback returns; partial merges save as 'success'
_detect_moe_lora_layout + both \"swapped\" and \"standard\" branch labels helpers re-hardcode one PEFT layout; PEFT 0.19+ breaks again
_resolve_num_experts_from_lora_stats + the bounded for _ in range(N): walk shard-local num_experts non-divisors silently fall back; cyclic module.base_layer hangs
generation_config.save_pretrained(...) call in merge_and_overwrite_lora chat-tuned models reload past EOS
Six PEFT 0.19+ standard-layout regression tests in tests/test_unsloth_zoo_lora_merge.py layout coverage gap returns
Local unsloth/save.py still names save_pretrained_merged and references merge_and_overwrite_lora the entry point bypasses the upstream fix

Behaviour while unslothai/unsloth-zoo#647 is open

The four symbol tests detect that the guard markers are not yet in unsloth-zoo:main and SKIP cleanly with a message naming #647. The local entry-point test passes today.

1 passed, 5 skipped in 0.09s

The skip condition is purely: "none of _MOE_MERGE_STATE, _detect_moe_lora_layout, or _resolve_num_experts_from_lora_stats are in the fetched source". The moment #647 merges into unsloth-zoo main, the skip stops triggering and the same tests become hard gates. No follow-up PR or manual flip needed.

Cost

Pure CPU static fetch, ~0.1s, wired into the existing peft-pinned-symbols job. Runs on every PR that touches unsloth/** plus the daily 06:43 UTC schedule.

Related

Test plan

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6860ec9143

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +50 to +51
if src is None:
pytest.skip("unsloth_zoo/saving_utils.py not fetchable")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail when the upstream guard file disappears

If unsloth-zoo later renames or removes unsloth_zoo/saving_utils.py, fetch_text returns None for the 404 and this helper skips every saving-utils guard test, so the version-compat job stays green exactly when the canary has stopped checking the module that unsloth/save.py imports for merge_and_overwrite_lora. For this pinned-symbol test, a missing upstream file should be a failure (or checked via an explicit alternate path), not a skip.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new compatibility test file, test_unsloth_zoo_save_merged_pinned_symbols.py, which serves as a canary to prevent regressions of MoE lora merging fixes in the unsloth-zoo repository. The tests perform static analysis on fetched source code to verify the presence of specific guard symbols, layout detectors, and entry points. Feedback suggests improving the robustness of the static checks by using more flexible regex for bounded loop detection and more specific regex patterns for identifying function definitions and calls to avoid false positives or negatives.

"Shard-local num_experts will produce silent fallbacks again."
)
# The walk must be bounded (PR #647 hardening). A bare `while module
# is not None` without a visited / depth bound regresses to an
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regex for detecting a bounded loop is overly restrictive. It specifically requires the loop variable to be _ and the range limit to be a literal integer. This will cause the test to fail if the upstream code uses a different variable name (e.g., for i in range(...)) or a constant/variable for the limit (e.g., range(MAX_DEPTH)). A more robust check would simply look for the presence of a for ... in range( construct to ensure the loop is bounded.

Suggested change
# is not None` without a visited / depth bound regresses to an
bounded_re = re.compile(r"for\s+\w+\s+in\s+range\s*\(")

Comment on lines +192 to +201
if not save_py.is_file():
pytest.skip(f"{save_py} not present (sparse checkout?)")
text = save_py.read_text(encoding = "utf-8", errors = "replace")
assert (
"save_pretrained_merged" in text
), "save_pretrained_merged entry point removed from unsloth/save.py."
# One of the two dispatch helpers must still call into unsloth_zoo's
# merge_and_overwrite_lora; renaming both would silently bypass #647.
assert "merge_and_overwrite_lora" in text, (
"Neither unsloth_save_pretrained_merged nor "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The assertions for checking the entry point and dispatch call are too permissive. Using "string" in text can result in false positives if the strings appear in comments, docstrings, or imports without the actual logic being present. It is better to use regex to ensure these are actual function definitions or calls.

Suggested change
if not save_py.is_file():
pytest.skip(f"{save_py} not present (sparse checkout?)")
text = save_py.read_text(encoding = "utf-8", errors = "replace")
assert (
"save_pretrained_merged" in text
), "save_pretrained_merged entry point removed from unsloth/save.py."
# One of the two dispatch helpers must still call into unsloth_zoo's
# merge_and_overwrite_lora; renaming both would silently bypass #647.
assert "merge_and_overwrite_lora" in text, (
"Neither unsloth_save_pretrained_merged nor "
assert re.search(r"def\s+\w*save_pretrained_merged\b", text), (
"save_pretrained_merged entry point removed from unsloth/save.py."
)
# One of the two dispatch helpers must still call into unsloth_zoo's
# merge_and_overwrite_lora; renaming both would silently bypass #647.
assert re.search(r"merge_and_overwrite_lora\s*\(", text), (
"Neither unsloth_save_pretrained_merged nor "
"unsloth_generic_save_pretrained_merged calls "
"merge_and_overwrite_lora; the unsloth-zoo fix is bypassed."
)

@danielhanchen danielhanchen force-pushed the ci-save-merged-smoke-5410 branch from 6f35f75 to 0d08260 Compare May 15, 2026 05:01
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 995a31ab9f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +97 to +99
assert re.search(
r"for\s+_\s+in\s+range\s*\(\s*\d+\s*\)", src
), "resolver walk is no longer bounded by `for _ in range(N):`."
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope bounded-walk check to resolver implementation

This assertion only checks for any for _ in range(<digits>) anywhere in saving_utils.py, so it can pass even if _resolve_num_experts_from_lora_stats regresses to an unbounded walk (for example, while traversal) as long as an unrelated bounded loop exists elsewhere in the file. Because this test is meant to guard against hangs from cyclic base_layer chains, the check should be tied to the resolver function body (or a more specific pattern) to avoid false negatives.

Useful? React with 👍 / 👎.

Comment on lines +72 to +74
assert re.search(
r"raise\s+RuntimeError\b[^\n]*MoE", src, re.IGNORECASE
), "no `raise RuntimeError(...MoE...)`; post-loop guard weakened."
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Anchor MoE RuntimeError check to the merge guard

The regex currently succeeds if any raise RuntimeError(...MoE...) appears anywhere in saving_utils.py, even if the post-merge fallback guard in the relevant code path is removed or weakened. That means this canary can report green while the specific safety check it claims to protect has regressed. Narrowing the match to the intended function/section would prevent this false-pass behavior.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a313ede3a9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"_resolve_num_experts_from_lora_stats",
)
):
pytest.skip(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail once all upstream guard markers disappear

If unsloth-zoo has already landed #647 and a later refactor removes or renames all three sentinel symbols at once, this condition routes every saving-utils guard test back through the “#647 has not yet merged” skip path, so the canary goes green on the exact regression it is meant to catch. The temporary skip needs a separate, time-bounded/explicit merge-state check, or a missing post-merge guard set should fail instead of skipping.

Useful? React with 👍 / 👎.

Comment on lines +114 to +115
if "test_merge_moe_gate_expert_standard_layout" not in src:
pytest.skip("unslothai/unsloth-zoo#647 not yet merged; coverage appears later.")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail when the sentinel regression test is removed

After #647 has landed, deleting or renaming test_merge_moe_gate_expert_standard_layout is one of the regressions this canary claims to detect, but this branch treats that exact absence as “not yet merged” and skips the entire upstream test-coverage check. That lets the standard-layout coverage disappear while CI remains green; the pre-merge skip should not be keyed to a test name that later must be enforced.

Useful? React with 👍 / 👎.

…ards (#5410)

unsloth#5410 was a class of silent-write bug in the
save_pretrained_merged path that the existing CI matrix could not
detect because the merge-helper tests were not wired through the
upstream-drift suite. The full fix lives in unslothai/unsloth-zoo#647
(layout-aware MoE merge helpers, authoritative num_experts resolver,
loud-fail counter, generation_config.json save). This PR adds the
unsloth-side canary that watches for the four guards staying in place
in unsloth-zoo so a future refactor cannot silently regress them.

tests/version_compat/test_unsloth_zoo_save_merged_pinned_symbols.py
fetches unsloth_zoo/saving_utils.py + tests/test_unsloth_zoo_lora_merge.py
from unslothai/unsloth-zoo:main and asserts:

- _MOE_MERGE_STATE / _reset_moe_merge_state / _record_moe_merge_fallback
  are still defined and a `raise RuntimeError(...MoE...)` still fires
  when fallback > 0.
- _detect_moe_lora_layout exists and both "swapped" / "standard" branch
  labels are reachable in the source.
- _resolve_num_experts_from_lora_stats is present AND its base_layer
  walk is bounded by `for _ in range(N):` (a cyclic ParamWrapper chain
  must not hang the merge).
- merge_and_overwrite_lora still calls
  model.generation_config.save_pretrained(...).
- tests/test_unsloth_zoo_lora_merge.py keeps the six PEFT 0.19+
  standard-layout regression tests added in #647.
- Local unsloth/save.py still names save_pretrained_merged and
  routes through merge_and_overwrite_lora (i.e. the entry point still
  reaches the upstream fix).

While #647 is still open, the four symbol tests SKIP cleanly with a
message naming #647. When #647 merges into unsloth-zoo main, the same
tests automatically become hard gates and catch any future regression.
The sixth test (local entry-point grep) passes today.

CPU-only static fetch, ~0.1s. Wired into the existing peft-pinned-symbols
job in .github/workflows/version-compat-ci.yml so it runs on every PR
that touches unsloth/** and on the daily schedule.

Local run: 1 passed, 5 skipped (expected; #647 open).
@danielhanchen danielhanchen force-pushed the ci-save-merged-smoke-5410 branch from a313ede to fa20985 Compare May 15, 2026 07:21
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9564c6cd70

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +32 to +33
if src is None:
pytest.skip("tests/test_unsloth_zoo_lora_merge.py not fetchable")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail instead of skip when upstream merge test file vanishes

When tests/test_unsloth_zoo_lora_merge.py is missing (rename/remove/404), this test currently calls pytest.skip, which leaves CI green even though the canary has lost visibility into the exact regression tests it is supposed to enforce. In practice, any upstream move of that file after #647 would silently disable this guard rather than reporting a breakage, so this branch should fail (or explicitly resolve alternate paths) once the sentinel coverage is expected to exist.

Useful? React with 👍 / 👎.

Comment on lines +89 to +90
assert re.search(
r"generation_config\.save_pretrained\s*\(", src
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope generation-config assertion to merge code path

This regex passes if generation_config.save_pretrained( appears anywhere in saving_utils.py, including unrelated helpers or dead/commented code, so the canary can stay green even if the call is removed from the merge_and_overwrite_lora path that actually fixes the EOS-regression risk. Anchoring the check to the relevant function/section is needed to avoid false passes on refactors.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21318a88ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"_detect_moe_lora_layout" in src
), "_detect_moe_lora_layout removed (issue #5410)."
assert (
'"swapped"' in src and '"standard"' in src
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept either quote style for layout labels

If unsloth-zoo keeps the same semantic layout labels but formats them as Python single-quoted strings, e.g. 'swapped'/'standard', this hard gate fails even though the detector still supports both conventions. Because this test is meant to guard the labels rather than a source-formatting choice, it should match either quote style or parse the string literals.

Useful? React with 👍 / 👎.

Comment on lines +117 to +119
assert (
"merge_and_overwrite_lora" in text
), "no dispatch into unsloth_zoo merge; #647 bypassed."
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope the merge dispatch check to the entry point

This assertion passes whenever merge_and_overwrite_lora appears anywhere in unsloth/save.py, such as an unused import, a comment, or a different helper, so a future change can remove the save_pretrained_merged dispatch while this canary still stays green. Since the stated regression is the entry point bypassing the upstream fix, the check should verify a call inside the relevant save_pretrained_merged implementation rather than a file-wide substring.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e356bac61a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +61 to +63
assert re.search(
r"raise\s+RuntimeError\b[^\n]*MoE", src, re.IGNORECASE
), "no `raise RuntimeError(...MoE...)`; post-loop guard weakened."
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow multiline RuntimeError messages

If the upstream MoE guard raises RuntimeError with the message on the next line — a common Black-wrapped form for long error strings, e.g. raise RuntimeError(\n "MoE ..."\n) — this regex does not match because [^ ]*MoE only scans the raise RuntimeError line. Once #647 lands, that equivalent guard would make this canary fail CI even though the protection is still present; the check should either allow newlines inside the call or inspect the call body more structurally.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b337038eff

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"test_moe_merge_fallback_counter_records_bad_layout",
"test_resolve_num_experts_walks_base_layer_chain",
):
assert name in src, f"regression test `{name}` removed."
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require regression test names to be actual definitions

Once the upstream #647 tests exist, this canary still passes if one of these regression tests is deleted or disabled but its name remains in a comment, skip reason, parameter string, or helper list in tests/test_unsloth_zoo_lora_merge.py. Since this check is meant to enforce executable upstream coverage, it should verify a real def/async def for each expected test (for example via the existing has_def helper) rather than a file-wide substring.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 953f38b9e6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +89 to +90
assert re.search(
r"generation_config\.save_pretrained\s*\(", src
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match the actual generation config save call

Now that unsloth-zoo#647 has merged, this hard gate will fail even though the fix is present: the merged upstream code assigns the generation config to gen_cfg and calls gen_cfg.save_pretrained(save_directory), so this literal generation_config.save_pretrained( regex never matches. In the post-#647 CI context this makes the newly added version-compat job red for a formatting/aliasing difference rather than a real regression.

Useful? React with 👍 / 👎.

zoo#647 landed two layout changes that broke the pinned-symbol
canary's exact-string regex matches but kept the underlying
guarantees intact:

- The post-loop MoE LoRA fallback `raise RuntimeError(...)` wraps
  the "MoE" wording onto a second line; the old `[^\n]*` did not
  cross newlines. Switch to `.*?` + re.DOTALL.

- The generation_config save now binds the attr to a local var
  `gen_cfg = getattr(model, "generation_config", ...)` and calls
  `gen_cfg.save_pretrained(save_directory)`, so a literal
  `generation_config.save_pretrained(` substring no longer matches.
  Anchor on the conceptual operation: a `generation_config` mention
  followed (within a small char window) by a `.save_pretrained(`
  call. That is what the canary actually cares about.

Verified locally:
  pytest tests/version_compat/test_unsloth_zoo_save_merged_pinned_symbols.py
    -> 2 passed (4 deselected)
@danielhanchen danielhanchen merged commit 79a431c into main May 17, 2026
24 checks passed
@danielhanchen danielhanchen deleted the ci-save-merged-smoke-5410 branch May 17, 2026 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant