Bug fixes #11

Merged
danielhanchen merged 23 commits into main from nightly on Nov 5, 2024

@danielhanchen
Member

No description provided.

@danielhanchen danielhanchen merged commit e035111 into main Nov 5, 2024
amrothemich added a commit to amrothemich/unsloth-zoo that referenced this pull request Nov 6, 2025
KEY DISCOVERY: cache_position corruption starts early with 1156-element
tensors instead of single positions. This spreads and corrupts GPU memory.

- Detect corrupted cache_position shapes early (call unslothai#11)
- Fix by extracting last position and creating proper tensor
- Robust flex_attention fallbacks when GPU is corrupted
- CPU tensor fallbacks when even torch.ones() fails on GPU

This should prevent the CUDA illegal memory access by fixing the
root cause before it spreads.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
mmathew23 added a commit to mmathew23/unsloth-zoo that referenced this pull request May 22, 2026
Multi-reviewer pass on the autocast wrapper / norm-upcast path:

- Instance-level forward (#2): an instance attribute `model.forward`
  (Unsloth runtime forward patching) shadows class-method overrides, so
  mutating __class__ silently bypassed the wrapper -> fp32 norm met a bf16
  linear with no autocast and crashed. Now wrap the instance attribute when
  present; otherwise subclass as before.
- Wrapper gating (unslothai#5, unslothai#7): install the wrapper iff fp32 norm params actually
  exist (from our upcast, the legacy env upcast, or an external
  _pre_set_compute_dtype policy) -- not on the upcast DECISION. Fixes the
  rollback path leaving external fp32 norms exposed, and stops wrapping models
  with no fp32 norm. Add _unwrap_forward_in_bf16_autocast for re-prepare (unslothai#10).
- config.architectures leak (unslothai#8/unslothai#9): keep the original __name__ on the
  generated subclass (unique __qualname__ for registration) so save_pretrained
  records the base architecture.
- Device detection (unslothai#11): recurse into mapping/list/tuple batches and fall
  back to the model's parameter device instead of defaulting to "cuda".
- Legacy UNSLOTH_UPCAST_LAYERNORM (#1/#3/unslothai#4): route through the shared
  _cast_named_module + union matcher and honour the external-policy deferral.
- Recursive external-ownership guard (unslothai#6): record descendants of tagged
  modules (the external policy casts recursively).
- Fresh-interpreter pickle test (unslothai#12): real subprocess load.
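
The instance-level forward item above rests on ordinary Python attribute lookup: an attribute stored on the instance wins over a method defined on its class, so swapping `__class__` cannot override it. The following is a minimal pure-Python sketch of that behaviour and of the "wrap the instance attribute when present; otherwise subclass" rule. `Model`, `Wrapped`, and `wrap_forward` are hypothetical names for illustration, not the helpers from this commit, and string tagging stands in for the autocast context.

```python
import types


class Model:
    def forward(self, x):
        return f"base({x})"


class Wrapped(Model):
    # Class-level override, standing in for a generated wrapper subclass.
    def forward(self, x):
        return f"wrapped({super().forward(x)})"


def wrap_forward(model):
    """Wrap the instance attribute if one exists; otherwise swap the class."""
    if "forward" in vars(model):
        inner = vars(model)["forward"]

        def wrapper(x):
            return f"wrapped({inner(x)})"

        model.forward = wrapper
    else:
        model.__class__ = Wrapped
    return model


def runtime_patch(model):
    """Simulate a runtime patch that stores `forward` on the instance."""
    model.forward = types.MethodType(lambda self, x: f"patched({x})", model)
    return model


# No instance attribute: the class swap alone is enough.
plain_out = wrap_forward(Model()).forward(1)

# Instance attribute present: a class swap alone is silently bypassed.
bypassed = runtime_patch(Model())
bypassed.__class__ = Wrapped
bypassed_out = bypassed.forward(1)

# Wrapping the instance attribute itself restores the wrapper.
patched_out = wrap_forward(runtime_patch(Model())).forward(1)
```

In this sketch `bypassed_out` never passes through `Wrapped.forward`, which mirrors the failure mode described in the item: the wrapper is installed but never runs.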

Shared helpers: _find_tensor_device_type, _call_forward_with_bf16_autocast,
_canonical_module_name, _cast_named_module. Unit suite: 25 passed.
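
The device-detection change described above (recurse into mapping/list/tuple batches, then fall back to the model's parameter device) can be sketched roughly as follows. This is an illustrative approximation, not the `_find_tensor_device_type` helper itself: `find_device_type` and `resolve_device_type` are hypothetical names, and the code duck-types on any object exposing `.device.type` so it runs without a tensor library.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def find_device_type(batch):
    """Return the first `.device.type` found in a nested batch, else None."""
    device = getattr(batch, "device", None)
    if device is not None and hasattr(device, "type"):
        return device.type
    if isinstance(batch, Mapping):
        children = batch.values()
    elif isinstance(batch, (list, tuple)):
        children = batch
    else:
        return None
    for child in children:
        found = find_device_type(child)
        if found is not None:
            return found
    return None


def resolve_device_type(batch, model_params):
    """Prefer the batch's device; fall back to the first model parameter."""
    found = find_device_type(batch)
    if found is not None:
        return found
    for param in model_params:
        return param.device.type
    return None  # no tensors and no parameters: nothing to infer from


def fake_tensor(device_type):
    # Stand-in for a tensor or parameter: only `.device.type` is modelled.
    return SimpleNamespace(device=SimpleNamespace(type=device_type))


nested_batch = {"inputs": [("ignored", fake_tensor("cpu"))]}
from_batch = resolve_device_type(nested_batch, [fake_tensor("cuda")])
from_params = resolve_device_type({"labels": [1, 2]}, [fake_tensor("cpu")])
nothing = resolve_device_type([], [])
```

Returning `None` when nothing can be inferred, rather than a hard-coded `"cuda"`, leaves the decision to the caller; whether the actual helper does the same is not stated in the commit message.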