fix CI fallout from MLX subpackage refactor (#634)#646
Conversation
Three small fixes triggered by tests on PRs #645 / unsloth#5426: 1) tests/test_upstream_pinned_symbols_accelerator.py read the legacy flat unsloth_zoo/mlx_trainer.py and mlx_loader.py paths directly. #634 moved those to unsloth_zoo/mlx/trainer.py and mlx/loader.py; point the tests at the new paths so they exercise the post-refactor sources. 2) unsloth_zoo/vision_utils.py:68 did an unconditional top-level import torchvision while torchvision is not a declared dependency. On a CPU-only install that lacks torchvision (e.g. zoo's Core CI matrix), every import of unsloth_zoo.vision_utils raises ModuleNotFoundError. Guard the import so the module loads; existing _read_video_torchvision call sites already only fire when video reading is requested. 3) tests/test_zoo_source_upstream_refs.py tries to resolve transformers qwen2_vl / qwen2_5_vl image-processing modules, both of which transitively `import torchvision`. Skip via pytest.importorskip when torchvision is not installed instead of failing the run. Each fix is independent and minimal; no behavior change for installs that have torchvision (the eager import path is identical), and no behavior change for the MLX trainer / loader (the tests still pin the exact same source-level invariants on the new file paths).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3fe5aa8fca
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| import torchvision | ||
| from torchvision import io, transforms | ||
| from torchvision.transforms import InterpolationMode | ||
| HAS_TORCHVISION = True | ||
| except Exception: | ||
| torchvision = None |
There was a problem hiding this comment.
Keep video resizing usable without torchvision
When torchvision is absent but decord or torchcodec is installed, this guard lets the module import and get_video_reader_backend() can select one of those backends, but fetch_video() still calls transforms.functional.resize and InterpolationMode after decoding. In that environment transforms is never defined because the guarded import failed, so video processing raises NameError despite decord/torchcodec being advertised alternatives. Please either avoid torchvision for resize or require/import it before the video path.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request updates MLX-related file paths in tests and transitions torchvision to an optional dependency in vision_utils.py by wrapping its import in a try-except block. It also adds pytest.importorskip for torchvision in relevant tests. Feedback was provided to initialize sub-modules like io and transforms to None in the exception handler to avoid NameError and to refine the fallback logic in the fetch_video function.
| except Exception: | ||
| torchvision = None | ||
| HAS_TORCHVISION = False |
There was a problem hiding this comment.
The except block should also initialize io, transforms, and InterpolationMode to None. If torchvision fails to import, these names remain undefined, which will lead to a NameError in fetch_video (line 517) even if an alternative backend like decord is used.
Furthermore, fetch_video currently relies on torchvision.transforms for resizing the video tensor. If torchvision is intended to be an optional dependency, the resizing logic in fetch_video should be updated to use a fallback (e.g., torch.nn.functional.interpolate) when HAS_TORCHVISION is False. Also, the fallback to torchvision in fetch_video (line 489) should be guarded by HAS_TORCHVISION to avoid a crash when no video backends are functional.
| except Exception: | |
| torchvision = None | |
| HAS_TORCHVISION = False | |
| except Exception: | |
| torchvision = io = transforms = InterpolationMode = None | |
| HAS_TORCHVISION = False |
The post-#634 conftest stubs ``unsloth_zoo.device_type`` so importing the package on a CPU-only CI runner doesn't trip ``get_device_type()``. ``test_unsloth_trainer_exec_marker`` (and any future test that does ``import unsloth.*``) goes one level up: it imports ``unsloth.trainer``, which traverses ``unsloth/__init__.py`` → ``_gpu_init`` → ``unsloth/ device_type.py:get_device_type()``, raising the same NotImplementedError zoo's stub was meant to suppress. Parameterise ``_preload_real_device_type(package, prereqs)`` so the same harness works for both packages, factor the fallback into ``_install_device_type_stub(name)``, and call both. unsloth's ``get_device_type()`` consumes ``unsloth_zoo`` helpers but has no intra-unsloth prereqs, hence ``prereqs=()`` there.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 196a78bc7a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # unsloth/device_type.py's own NotImplementedError. The real preload | ||
| # uses unsloth's own get_device_type(), which consults unsloth_zoo's | ||
| # Version helper, so no extra prereqs are needed here. | ||
| if not _preload_real_device_type("unsloth", prereqs=()): |
There was a problem hiding this comment.
Handle missing torch before preloading unsloth device type
When unsloth is importable in a no-torch environment (for example the security/MLX jobs described in this conftest), this new prereqs=() call bypasses the only ModuleNotFoundError('torch') catch in _preload_real_device_type; the later import torch then propagates and aborts pytest collection before _install_device_type_stub("unsloth.device_type") can run. I reproduced this by putting a minimal unsloth package on PYTHONPATH and blocking torch imports: conftest fails at this call instead of falling back to the stub.
Useful? React with 👍 / 👎.
The previous commit (196a78b) stubbed unsloth.device_type so test_unsloth_trainer_exec_marker could complete its ``import unsloth.trainer`` smoke. That makes ``import unsloth`` succeed on CPU-only CI, which then runs ``unsloth/_gpu_init.py:_patch_trl_trainer()`` and rebinds ``trl.trainer.sft_trainer.SFTTrainer`` / ``transformers.models.ministral.MinistralAttention`` to Unsloth's compiled wrappers. ``inspect.getsource(...)`` on those classes follows the wrapper's ``__init__.__code__.co_filename`` and returns the rewritten source -- which doesn't contain the literals the drift detectors anchor on (``self._signature_columns``, ``self._prepare_dataset(``, ``class SFTTrainer``, named ``hidden_states``/``attention_mask``). The HF=4.57.6 job went from -1 pre-existing failure (``test_unsloth_trainer_exec_marker``, also failing on main) to -4 new failures across ``test_MinistralAttention_forward_signature`` and the three ``test_unsloth_rl_trainer_*`` checks. That's a regression. Keep parameterised ``_preload_real_device_type`` and ``_install_device_type_stub`` -- they're harmless infrastructure and useful if a future test needs the unsloth.device_type stub specifically. Just don't fire the second preload at session setup. ``test_unsloth_trainer_exec_marker`` will continue to fail on CPU-only CI as it does on main; that's a separate pre-existing issue with ``unsloth/device_type.py`` raising on accelerator-less hosts and should be fixed upstream in unsloth itself, not papered over in zoo's conftest.
unsloth/_gpu_init.py:125 imports DEVICE_TYPE from unsloth_zoo, not local unsloth/device_type.py, so the env-var gate has to live here too for `import unsloth.trainer` to succeed on CPU-only CI. The companion unsloth PR landed as cb15a7a (#5429). Pairs with the conftest change in this branch that sets UNSLOTH_ALLOW_CPU=1 before triggering `import unsloth` from _apply_upstream_import_fixes_for_tests. @functools.cache on get_device_type plus the module-level DEVICE_TYPE = get_device_type() assignment mean the os.environ.get runs exactly once per process. Production hosts with a real accelerator hit the `cuda`/`xpu`/`hip` branches above and never reach the env-var short-circuit.
Two upstream changes broke zoo at runtime (not just CI drift detection):
1. transformers 5.x removed transformers.cache_utils.HybridCache. zoo's
utils.py falls back to `HybridCache = typing.Any`, which made
`gemma.py:260 isinstance(past_key_values, HybridCache)` raise
TypeError ("isinstance() arg 2 must be a type, a tuple of types,
or a union") at runtime on the gemma mask path.
Fix: expose HAS_HYBRID_CACHE alongside the existing HybridCache
fallback in utils.py, and gate the isinstance call in gemma.py:260
on the flag. When HybridCache is genuinely missing, the elif chain
falls through to the dynamic-cache branch -- same observable
behavior as the pre-5.x path that never matched the type.
2. transformers 5.x rewrote PreTrainedModel.save_pretrained. The
source-string anchors zoo's LoRA-merge-on-save optimization patches
in `merge_and_dequantize_lora` (state_dict_split assignment,
`state_dict[tensor].contiguous()`, `def save_pretrained`, and the
filename_to_tensors for-loop on the push_to_hub path) are all gone.
Each downstream `raise RuntimeError("Failed to find ...")` would
fire, killing `merge_and_dequantize_lora(push_to_hub=True)` on 5.x.
Fix: scan all required anchors upfront in saving_utils.py. If any
are missing on installed transformers, emit a one-time warnings.warn
and fall back to vanilla `model.save_pretrained(...)`. The end-user
sees the warning, no crash, and is told to call
`model.merge_and_unload()` (or equivalent) before saving if they
want merged LoRA weights on disk. The per-anchor RuntimeError
raises downstream are kept as defense-in-depth for partial drift
(one anchor renamed, others intact) that the upfront check might
miss.
The two corresponding drift detectors in tests/ are rewritten in the
companion commit as positive assertions: 4.57.6 keeps the strict
existence check; 5.x asserts the anchor is gone AND zoo's fallback
covers it (`HAS_HYBRID_CACHE` flag / `_required_anchors` list).
Sixteen drift detectors in the Core HF=default / HF=latest matrix
slices were pointing at upstream symbols / source strings that
transformers 5.x intentionally removed:
* 9 `temporary_patches/*_for_causal_lm_forward_named_params` --
cache_position moved from named forward param into
**kwargs: Unpack[TransformersKwargs] (DeepseekV3, GptOss, CsmDepth,
Csm, Qwen3Moe, Qwen3Next, Qwen3VLMoe, MinistralModel, GraniteMoeHybrid).
* 4 compiler/source rewriter probes -- output_attentions branching
removed, MoE routing_weights cast refactored, Trainer
is_torch_tpu_available/is_torch_xla_available both removed,
_update_causal_mask hasattr probe all-False.
* 1 routing_weights.to substring probe in compiler_rewriter_exhaustive.
* 1 GptOssConfig dedent-compare -- rope_theta/initial_context_length/
rope_scaling collapsed into rope_parameters dict.
* 1 GptOssConfig kwarg signature -- same rope_theta rename.
For all 16, the zoo runtime patch already gracefully no-ops on 5.x
(verified by the Explore audit: try/except + relaxed patch_function +
hasattr guards + str.replace's silent no-op-on-miss). Skip the
detector on 5.x with a clear "<symbol> removed on transformers
{version}" message; keep it strict on 4.57.6 where real drift could
still surface.
Two more drift detectors -- test_hybrid_cache_class_present and the
two saving_utils.py pinned-string tests -- become positive assertions:
on 5.x they confirm zoo's prod fix correctly identifies the missing
anchor (HAS_HYBRID_CACHE is False / `_required_anchors` covers the
missing strings) and the graceful fallback is wired up.
Conftest also sets UNSLOTH_ALLOW_CPU=1 before the existing
`import unsloth` trigger, which unblocks `test_unsloth_trainer_exec_marker`
on CPU-only CI runners (the companion unsloth PR #5429 lets that
import succeed without an accelerator).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18c9f62ed7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ) | ||
| model.save_pretrained( | ||
| save_directory = save_directory, | ||
| push_to_hub = push_to_hub, |
There was a problem hiding this comment.
Preserve the prepared Hub upload path in fallback
When the new missing-anchor fallback runs with push_to_hub=True (e.g. transformers 5.x), delegating the upload to vanilla model.save_pretrained here bypasses the repo setup already done by prepare_saving: repo_id is not passed, so HF defaults to the basename of save_directory (or a temp directory if prepare_saving switched directories), and the tokenizer is saved only after the model upload has already completed. This can publish to the wrong repo and omit tokenizer files; keep the fallback as a local save and then upload with the prepared repo_id after saving the tokenizer, or at least pass repo_id and upload after tokenizer save.
Useful? React with 👍 / 👎.
Missed this one in 18c9f62. Same root cause as MinistralModel: transformers 5.x reflowed MinistralAttention.forward, zoo's ``patch_function(..., match_level='relaxed')`` falls back to a ``(self, *args, **kwargs)`` wrapper, the named-param probe sees the wrapper signature and drift-fails. Runtime call still works because the wrapper forwards via kwargs. Skip on 5.x, keep strict on 4.57.6.
| try: | ||
| from transformers.cache_utils import HybridCache | ||
| HAS_HYBRID_CACHE = True | ||
| except Exception: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 11324ad0de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| revision = revision, | ||
| ) |
There was a problem hiding this comment.
Forward save_pretrained kwargs in the fallback
When the transformers 5.x anchor fallback is taken, this direct model.save_pretrained(...) call drops the wrapper's **kwargs, even though the normal generated save path forwards them at the existing call site. That means callers who pass supported save_pretrained options such as variant, is_main_process, save_peft_format, or Hub-related kwargs get different behavior only on the new fallback path; include **kwargs in this call so the public wrapper contract stays consistent.
Useful? React with 👍 / 👎.
Zoo's ministral.py:94-96 wraps the actual implementation in a generic ``def forward(self, *args, **kwargs): return _full_forward(...)`` adapter before calling patch_function -- this lets `check_args_kwargs` accept params transformers 5.x removed (cache_position) by routing them through **kwargs. The side effect: `inspect.signature( MinistralAttention.forward)` after patching shows the wrapper, not the real named params. This test used to pass only because the test runner couldn't fully trigger zoo's TEMPORARY_PATCHES loop (`unsloth/models/_utils.py:580` runs at import time but unsloth import was being silently aborted). With UNSLOTH_ALLOW_CPU=1 now making `import unsloth` succeed, the patch fires correctly and the live signature is the wrapper. Fix: probe the `_original_modeling_ministral_MinistralAttention_forward` stash that `patch_function(store_original=True)` writes, falling back to the live attribute. If the live attr is the relaxed wrapper and there's no stash, the named-param probe isn't meaningful -- the runtime kwargs contract is enforced elsewhere. Skip with a clear message. 4.57.6: stash is present and the named-param probe still runs strict; real drift in the upstream signature still surfaces.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e74043de57
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| from packaging.version import Version # noqa: E402 | ||
|
|
||
| _TX_VERSION = getattr(transformers, "__version__", "0.0.0") | ||
| _TX_IS_5X = Version(_TX_VERSION) >= Version("5.0.0") |
There was a problem hiding this comment.
Treat Transformers 5 dev builds as 5.x
When CI installs Transformers from the 5.x development line, transformers.__version__ is commonly 5.0.0.dev0, and packaging.version.Version('5.0.0.dev0') < Version('5.0.0'). In that environment this flag stays false, so the new _skip_if_transformers_5x guards do not skip the drift detectors for anchors already removed on 5.x, causing the same CI failures this patch is trying to avoid; note the runtime patch code in temporary_patches/gpt_oss.py already treats >= Version('5.0.0.dev0') as v5. Please gate on 5.0.0.dev0 or the parsed major version here and in the sibling helpers added by this commit.
Useful? React with 👍 / 👎.
…-ci-cleanup # Conflicts: # tests/test_upstream_pinned_symbols_accelerator.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb1c6ce322
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| except (OSError, TypeError): | ||
| pytest.skip("save_pretrained source unavailable") | ||
| needle = "state_dict[tensor].contiguous()" | ||
| if _TX_IS_5X: |
There was a problem hiding this comment.
Mirror the 5.x guard for the first save anchor
When the latest-HF CI runs on Transformers 5.x where PreTrainedModel.save_pretrained has also dropped state_dict_split = split_torch_state_dict_into_shards, the new production fallback treats that missing anchor as recoverable, but test_saving_utils_save_pretrained_state_dict_split_pinned_string above this block still calls _drift unconditionally before the 5.x handling added here for the later anchors. That leaves the 5.x job failing with DRIFT DETECTED despite the intended fallback; add the same _TX_IS_5X positive-assertion/return for the state_dict_split test.
Useful? React with 👍 / 👎.
transformers 5.x dropped the slow image processors entirely
(image_processing_qwen2_5_vl.py and image_processing_qwen2_5_vl_fast.py
are both gone in v5.5.0). Qwen2.5-VL now reuses Qwen2VLImageProcessor
directly. zoo's misc.py:1500-1506 patch site for Qwen2_5_VLImageProcessor
is try/except ImportError-wrapped, so it silently no-ops on 5.x and the
runtime shim still fires via the Qwen2VLImageProcessor patch at
misc.py:1485-1498 (the class Qwen2.5-VL inherits at runtime).
The Qwen2.5-VL image-processor path was previously hidden by
``pytest.importorskip("torchvision")`` -- PR #648 added torchvision to
the upstream-regression matrix install, which unmasked the drift.
Runtime is unaffected. Add the same 5.x skip pattern used by the other
16 gated detectors so the dead-code patch site doesn't block CI.
4.57.6 keeps the strict path-existence check.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4eae0a2915
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| f"transformers {transformers.__version__} (zoo patch is " | ||
| "version-gated to >= 5.0.0)" | ||
| ) | ||
| if Version(transformers.__version__) >= Version("5.0.0"): |
There was a problem hiding this comment.
Don't skip the qwen2_5_vl resolver for every version
For every Transformers version this test returns before reaching _resolve: versions < 5.0.0 already skip above, and this newly added >= 5.0.0 branch skips the rest. That makes the drift detector permanently unreachable, so it no longer pins the misc.py:1501 import path despite the surrounding comment saying the 4.57.6 path should stay strict; either delete the dead detector or keep a version range where the resolver actually runs.
Useful? React with 👍 / 👎.
Summary
Three small fixes triggered by tests on PRs #645 and unsloth#5426:
tests/test_upstream_pinned_symbols_accelerator.pyread the legacy flatunsloth_zoo/mlx_trainer.pyandmlx_loader.pypaths directly. PR fix mlx: Adds the MLX training path used by Studio on Apple Silicon #634 moved those tounsloth_zoo/mlx/trainer.pyandmlx/loader.py; point the tests at the new paths so they exercise the post-refactor sources.unsloth_zoo/vision_utils.py:68did an unconditional top-levelimport torchvisionwhiletorchvisionis not a declared dependency. On a CPU-only install that lacks torchvision (e.g. zoo's Core CI matrix), every import ofunsloth_zoo.vision_utilsraisedModuleNotFoundError. Guard the import so the module loads; existing_read_video_torchvisioncall sites only fire when video reading is requested.tests/test_zoo_source_upstream_refs.pytries to resolve transformers qwen2_vl / qwen2_5_vl image-processing modules, both of which transitivelyimport torchvision. Skip viapytest.importorskipwhen torchvision is not installed instead of failing the run.Test plan
python -m pytest tests/test_upstream_pinned_symbols_accelerator.py tests/test_zoo_source_upstream_refs.py tests/test_compiler_rewriter_exhaustive.py::test_unsloth_trainer_exec_marker -q-> 73 pass, 2 skipped locally (qwen2_5_vl gated on transformers >= 5.0).FileNotFoundError: ... mlx_trainer.py,mlx_loader.py, andModuleNotFoundError: No module named 'torchvision'failures observed on fix embedding matrix size mismatch bug #645.Out of scope
tests/test_temporary_patches_exhaustive.py(cache_position kwarg moved into**kwargs, GptOssConfigrope_thetaremoved,Trainer._inner_training_looprewritten,ModernBertModel._update_attention_maskremoved,HybridCachemissing). Those require updating the patches themselves inunsloth_zoo/temporary_patches/*andunsloth_zoo/compiler.pyand are tracked separately.