fix: 3 patch_* helpers — fast_lora import, sft_trainer Union, openenv OSError by danielhanchen · Pull Request #5319 · unslothai/unsloth

danielhanchen · 2026-05-07T06:17:05Z

Summary

Three independent fixes for patch_* helpers, all surfaced by the new
consolidated CPU-CI matrix on #5312 that invokes every zero-arg
patch_* across unsloth + unsloth_zoo@main against three
(transformers, trl) combinations.

1. `patch_fast_lora` — `NameError: name 'fast_lora_forward' is not defined`

unsloth/models/_utils.py:patch_fast_lora references fast_lora_forward
as if it were imported, but the import was never added. Bug present
since ddf118a8f (2024-11-21). Function is defined at
unsloth/kernels/fast_lora.py:652 and re-exported through
unsloth/kernels/__init__.py:45.

Fix: from ..kernels.fast_lora import fast_lora_forward inside the
function body. Importing inside (rather than at module top) keeps the
import surface narrow and avoids a circular-import risk.

2. `patch_sft_trainer_tokenizer` — `NameError: name 'Union' is not defined`

unsloth/tokenizer_utils.py:patch_sft_trainer_tokenizer rewrites the
source of TRL's SFTTrainer._prepare_non_packed_dataloader and
_prepare_dataset methods and re-execs them. With TRL 1.x those
methods carry Union[...] type hints in their signatures. The current
rewrite only injects identifiers found by dir(trl.trainer.sft_trainer),
which does not include Union, so exec(function, ...) raises
NameError.

Fix: import Union, Optional, List, Any, Callable, Tuple, Dict, Iterator
inside the function. exec receives locals() as its globals, so
those names are visible to the executed source body. Same pattern as
unsloth/models/_utils.py:patch_linear_scaling's exec_code.

3. `openenv_vllm_reload_weights` — `OSError: could not get source code`

unsloth/models/rl_replacements.py:openenv_vllm_reload_weights calls
inspect.getsource(patch_target) on a TRL openenv helper. TRL 0.29.1+
and the 1.x line ship some openenv helpers as compiled bytecode without
accessible source on disk; inspect.getsource raises OSError. The
unguarded call surfaces as a hard failure inside patch_trl_openenv()
and aborts the rest of the RL_ADDITIONAL_FUNCTIONS["openenv"]
iteration.

Fix: wrap the getsource call in a try/except OSError and log a
warning. Only the wake_up(tags=...) rewrite is skipped; the core
weight-reload patch path stays functional. With TRL 0.18.2-0.24.0 (the
range pyproject.toml currently pins) source is still accessible, so the
original code path runs unchanged.

Test plan

Each fix locally re-validated by directly invoking the patcher.
- patch_fast_lora() → returns without exception (import resolves).
- patch_sft_trainer_tokenizer() → returns without NameError.
- openenv_vllm_reload_weights() on TRL 0.29.1 → logs the warning
  and returns gracefully instead of raising.
CI green on this PR.
After merge, Consolidated CPU tests matrix cells on CI: scope GITHUB_TOKEN permissions, add MLX CI, unblock ~60 skipped tests #5312 should
drop these three from the failure ledger.

patch_fast_lora has referenced an unbound `fast_lora_forward` since ddf118a (2024-11-21). The function is defined at unsloth/kernels/fast_lora.py:652 and re-exported through unsloth/kernels/__init__.py:45, but it was never imported into unsloth/models/_utils.py, so calling patch_fast_lora() raises NameError: name 'fast_lora_forward' is not defined. The bug went unnoticed because no production code path calls patch_fast_lora() unconditionally. Surfaced by a new CPU-CI check that invokes every zero-arg patch_* helper across unsloth + unsloth_zoo (consolidated-tests-ci.yml on PR #5312). Importing inside the function (rather than at module top) keeps the import surface narrow and avoids a circular-import risk if unsloth.kernels.fast_lora ever needs to import from unsloth.models._utils.

…mespace patch_sft_trainer_tokenizer rewrites the source of TRL's SFTTrainer methods (_prepare_non_packed_dataloader, _prepare_dataset) and re-execs them. With TRL 1.x, those methods carry `Union[...]` type hints in their signatures. The current rewrite only injects identifiers found by `dir(trl.trainer.sft_trainer)` into the exec namespace, which does not include `Union`, so exec(function, ...) raises NameError: name 'Union' is not defined. Fix: import Union, Optional, List, Any, Callable, Tuple, Dict, Iterator inside the function. exec receives `locals()` as its globals dict, so those names are visible to the executed source body. Same pattern as unsloth/models/_utils.py:patch_linear_scaling, which already injects `from typing import Union, Optional, List, Any, Callable, Tuple` into its own exec_code. Surfaced by the consolidated CPU-CI runtime patch_* check on PR #5312 in the matrix cell `transformers>=5,<6 + trl>=1,<2`.

…etsource TRL 0.29.1 and the 1.x line ship some openenv helpers as compiled bytecode without accessible source on disk. inspect.getsource(patch_target) raises OSError("could not get source code") in that case, which surfaces as a hard failure in patch_trl_openenv() and aborts the rest of the RL_ADDITIONAL_FUNCTIONS["openenv"] iteration. Wrap the getsource call in a try/except OSError and log a warning instead. The wake_up(tags=...) rewrite is the only thing skipped; the core weight-reload patch path stays functional. Surfaced by the consolidated CPU-CI runtime patch_* check on PR #5312 in matrix cells running TRL 0.29.1 (latest <1.0.0) and TRL 1.3.0 (latest 1.x). The pyproject pin (TRL 0.18.2-0.24.0) still gets source for this function so the original code path runs unchanged there.

gemini-code-assist

Code Review

This pull request introduces several fixes to improve compatibility with newer versions of TRL. Key changes include adding a missing import for fast_lora_forward, implementing error handling for inspect.getsource to prevent crashes when source code is unavailable in compiled bytecode, and including necessary typing imports to avoid NameError during dynamic code execution in patch_sft_trainer_tokenizer. I have no feedback to provide.

… job-level continue-on-error Two cleanups derived from review of the matrix output: 1. Skip false-positive zero-arg patches in the runtime ledger. Three patches have all-defaulted signatures but require either runtime args or real CUDA, so calling them in isolation produces a meaningless failure: - patch_linear_scaling: defaults are None placeholders; body starts with `assert rope_module is not None` etc. - patch_llama_rope_scaling: same shape. - patch_unsloth_smart_gradient_checkpointing: legitimately allocates CUDA tensors via aten::empty.memory_format inside initialize_unsloth_gradient_checkpointing(); the torch.cuda.* Python spoof can't intercept that at the dispatcher level. Add NEEDS_PRECONDITION = {...} to the shim and skip those by name. Symbol presence is still verified via REQUIRED. 2. Drop the job-level `continue-on-error: true`. Previously the cell reported SUCCESS even when steps failed, which made the PR check UI lie. Real failures now turn the cell red. Per-step `continue-on-error: true` stays so a single failed step does not cascade and skip the rest of the ledger. Three other failures the matrix surfaced are addressed by separate PRs to source: - #5319 (patch_fast_lora missing import, patch_sft_trainer_tokenizer Union NameError, openenv OSError) - unslothai/unsloth-zoo#628 (skip MoE coverage on older transformers)

gemini-code-assist

Code Review

This pull request improves compatibility with newer TRL versions by adding a missing import in the fast LoRA patch, handling cases where source code is unavailable for inspection in RL replacements, and including necessary typing imports to prevent NameErrors during dynamic execution in tokenizer utilities. I have no feedback to provide.

…dated CI cells exercise the fixed patches under continue-on-error=false.

Now that the upstream patch fixes have landed (#5319 for the three patch_* helpers, unsloth-zoo#628 for the MoE coverage canary), every observed cell-level red was one of those two things. Both are fixed, so re-run the matrix in strict mode: - Removed every per-step `continue-on-error: true`. A failing test step fails the cell. The previous green-with-fail-prints lie is gone. - Runtime patch ledger: was `assert REQUIRED helpers exist by name` (an inventory walk). Now also `assert len(fail) == 0` -- any zero-arg patch that raises is a real regression. NEEDS_PRECONDITION still skips the three patches that legitimately need real CUDA / runtime args. - patch_tiled_mlp shim: bumped seq_len from 4 to 192 with hidden=64 so divmod(192, 64) = (3, 0) and the tiled path actually runs 3 shards instead of degenerating to n_shards=1 (which is bit-exact and only confirms patching installed something). Added an explicit pre-assertion that we are exercising multi-shard. - openenv graceful-skip warning: previous text said "Weight reload still functional" which over-promised. Replaced with the literal consequence: duplicate `collective_rpc("reload_weights")` is not stripped and `wake_up(tags=["kv_cache"])` is not retagged. Most users are unaffected; openenv GRPO users on this TRL build may see redundant reload_weights or partial wake_up. Includes a merge of main into this branch so the consolidated cells pip-install the post-#5319 unsloth tree.

danielhanchen added 3 commits May 7, 2026 06:14

danielhanchen requested review from Datta0, mmathew23 and pluesclues as code owners May 7, 2026 06:17

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

danielhanchen merged commit 948ce43 into main May 7, 2026
14 checks passed

danielhanchen deleted the fix/patch-helpers-and-coverage branch May 7, 2026 07:12

danielhanchen added a commit that referenced this pull request May 7, 2026

Merge main (with #5319 patch_* fixes) into PR #5312 branch so consoli…

b79ae26

…dated CI cells exercise the fixed patches under continue-on-error=false.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: 3 patch_* helpers — fast_lora import, sft_trainer Union, openenv OSError#5319

fix: 3 patch_* helpers — fast_lora import, sft_trainer Union, openenv OSError#5319
danielhanchen merged 3 commits into
mainfrom
fix/patch-helpers-and-coverage

danielhanchen commented May 7, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

danielhanchen commented May 7, 2026

Summary

1. patch_fast_lora — NameError: name 'fast_lora_forward' is not defined

2. patch_sft_trainer_tokenizer — NameError: name 'Union' is not defined

3. openenv_vllm_reload_weights — OSError: could not get source code

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `patch_fast_lora` — `NameError: name 'fast_lora_forward' is not defined`

2. `patch_sft_trainer_tokenizer` — `NameError: name 'Union' is not defined`

3. `openenv_vllm_reload_weights` — `OSError: could not get source code`