Add Optional Attention Residuals (AttnRes) Integration for Unsloth Fast Paths#9
Add Optional Attention Residuals (AttnRes) Integration for Unsloth Fast Paths#9danielhanchen wants to merge 11 commits into
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request introduces 'Attention Residuals' (AttnRes), a feature designed to accumulate and mix residual states across transformer layers to potentially improve model performance or stability. The implementation includes a core state management module in unsloth/models/attnres.py, integration into several model architectures (Llama, Gemma2, Cohere, Falcon, and Granite), and hooks within the central attention dispatch utility to support various backends. The review feedback suggests minor logic simplifications in the new modules, specifically regarding how the number of layers is determined, how model boundary conditions are checked, and the redundancy of state initialization logic.
| num_layers = int(getattr(config, "num_hidden_layers", 0)) | ||
| if num_layers <= 0 and hasattr(model, "layers"): | ||
| num_layers = len(model.layers) |
There was a problem hiding this comment.
The logic for determining num_layers is redundant. getattr(config, "num_hidden_layers", 0) already returns 0 if the attribute is missing, and the subsequent check if num_layers <= 0 and hasattr(model, "layers") is sufficient. The explicit int() cast is also unnecessary as num_hidden_layers is typically an integer in Hugging Face configs.
| num_layers = int(getattr(config, "num_hidden_layers", 0)) | |
| if num_layers <= 0 and hasattr(model, "layers"): | |
| num_layers = len(model.layers) | |
| num_layers = getattr(config, "num_hidden_layers", 0) | |
| if num_layers <= 0 and hasattr(model, "layers"): |
| at_block_end = ((layer_idx + 1) % attnres_state.block_size) == 0 | ||
| at_model_end = (attnres_state.num_layers > 0) and ( | ||
| (layer_idx + 1) >= attnres_state.num_layers | ||
| ) |
There was a problem hiding this comment.
The condition at_model_end can be simplified. Since layer_idx is 0-indexed, layer_idx + 1 is the 1-based layer count. Comparing layer_idx + 1 == attnres_state.num_layers is sufficient.
| at_block_end = ((layer_idx + 1) % attnres_state.block_size) == 0 | |
| at_model_end = (attnres_state.num_layers > 0) and ( | |
| (layer_idx + 1) >= attnres_state.num_layers | |
| ) | |
| at_model_end = (attnres_state.num_layers > 0) and ((layer_idx + 1) == attnres_state.num_layers) |
| self.enabled = bool( | ||
| self.enabled or self.config.enabled or self.kernel_hook is not None | ||
| ) |
There was a problem hiding this comment.
The enabled property logic in __post_init__ is redundant because begin() performs the same check. Consider simplifying the state initialization.
| self.enabled = bool( | |
| self.enabled or self.config.enabled or self.kernel_hook is not None | |
| ) | |
| def __post_init__(self): | |
| if self.kernel_hook is None: | |
| self.kernel_hook = self.config.kernel_hook | |
| self.enabled = self.enabled or self.config.enabled or self.kernel_hook is not None |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 513888151c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return torch.zeros_like(query) | ||
|
|
||
| # (bsz, seqlen, n_states, dim) | ||
| stacked = torch.stack(candidates, dim = 2) |
There was a problem hiding this comment.
Normalize candidate devices before stacking AttnRes states
AttnRes currently stacks completed_block_summaries and current_block_states as-is, but multi-GPU pipeline inference moves activations per layer (move_to_device in the inference loops) while reusing one shared attnres_state. As soon as layers run on different _per_layer_device_index values, this list contains tensors from different GPUs and torch.stack fails with a cross-device error. The candidates should be moved to query.device (or tracked per device) before mixing.
Useful? React with 👍 / 👎.
| candidates.extend(attnres_state.current_block_states) | ||
| if len(candidates) != 0: | ||
| mixed = _compute_residual_mix(query, candidates) | ||
| attention_output = attention_output + (attnres_state.alpha * mixed) |
There was a problem hiding this comment.
Preserve attention output dtype when applying AttnRes mix
attention_output can be fp16/bf16, but the mix is computed from query and may be fp32 (for example, fast inference paths pass a float32 residual buffer as query). The expression attention_output + (alpha * mixed) then promotes the result to fp32, which pushes subsequent layers off the intended low-precision fast path and increases memory/latency whenever AttnRes is enabled. Cast mixed (or query) to attention_output.dtype before the add.
Useful? React with 👍 / 👎.
… MPS + base namespace for PR unslothai#5754 Round 12 reviewer findings. Backend correctness (P1) * core/inference/diffusion.py load_model: GGUF branch now handles an absolute local directory passed as repo_id by joining Path(repo_id) / gguf_filename directly instead of handing the path to hf_hub_download (which raises HFValidationError because the path is not 'namespace/repo'). Closes round 12 review #1 -- the load request advertised 'local path' support but actually only worked for Hub repo ids. Delete guard precision (P1) * routes/models.py /delete-finetuned + /delete-cached: diffusion guard now consults gguf_filename from status() and ALLOWS per-variant deletes that target a different quant than the one the loaded pipeline is reading. Loading 'Q4_K_S' no longer blocks deleting 'Q8_0' from the same repo / export directory (round 12 reviews #3 and #4). Accelerator (P2) * core/inference/diffusion.py _drain_cuda_cache: also calls torch.mps.empty_cache() when the MPS backend is the active accelerator. Apple Silicon swaps now actually return held VRAM instead of leaving it pinned in the Metal allocator (round 12 review #10). Smart base repo (P2) * core/inference/diffusion.py _smart_base_repo: only inspects the LAST segment of the repo id / path for the 'base' / '9b' tokens. A namespace like baseorg/FLUX.2-klein-4B-GGUF or a parent directory like /home/me/.cache/base/... no longer falsely selects the Base variant (round 12 review #9).
P1: route-layer chat/diffusion/export releases were still asymmetric. Training start and export load called ``diff_backend.unload_model`` inside a best-effort try/except so a wedged diffusion backend let the next workload allocate over the top of the resident pipeline and OOM. Both now use the strict ``_release_diffusion_for`` helper from routes.inference, which raises HTTPException 503 on status/unload failure or post-check mismatch. P2 #9: diffusion load exceptions can include the absolute local repo / base / gguf path verbatim (FileNotFoundError, OSError from diffusers / safetensors). The path flows into ``_last_error``, which ``status()`` returns to every authenticated session. Collapse the known repo_id / effective_base / gguf_filename paths to their leaf name before storing the error, mirroring the ``_display_repo_id`` convention used for the public repo label. P2 #10: when ``repo_id`` is an absolute local path, ``detect_family`` matched _FAMILY_EXCLUDE deny lists against the full path, so models stored under a parent directory containing ``qwen-image-edit`` or ``3.5`` were misclassified as None. Reduce the family-detection needle to the leaf directory when the input looks like a filesystem path; Hub-style ``owner/repo`` ids continue to use the original needle so existing detection rules keep working. P2 #12: ``gguf_filename`` was missing from the ``_reject_embedded_hf_token`` validator. A URL-form quant path like ``https://hf_xxxxx@huggingface.co/.../flux.gguf`` would be stored on ``DiffusionBackend._gguf_filename`` and surface in status() / log lines. Extend the validator to gguf_filename so the token is dropped before it can leak. All 85 diffusion-relevant backend tests pass locally.
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model`` did not return False AND that ``is_loaded`` / ``is_active`` / ``loading_model_identifier`` are all cleared after the call. The previous version only treated raised exceptions as failure, so a subprocess refusing to terminate or an in-flight GGUF download let the next workload allocate on top. P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` now raises RuntimeError when ``exp._shutdown_subprocess`` fails on a settled checkpoint. Direct backend callers used to log at debug level and proceed toward diffusion allocation while the export checkpoint still owned VRAM. P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export before the cheap backend validation runs. ``DiffusionBackend.load_model`` already calls the strict ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` helpers AFTER family inference and GGUF filename checks pass, so the GPU is still freed before allocation and a malformed payload no longer silently unloads the user's chat / chat-export pair. P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a post-unload state where ``loading_model_identifier`` is still set, matching the route-level ``_release_llama_for`` strictness. A GGUF download mid-flight before the diffusion handoff used to slip through and end up double-owning VRAM after diffusion allocated. P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload ``status()`` failure as ``after = {}``. Training / chat / export handoffs need proof that the diffusion pipeline released VRAM; the helper now raises HTTP 503 when the verification status call itself raises, so the caller retries. P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` raises RuntimeError when ``get_export_backend()`` itself raises. Direct backend callers used to silently ``return`` here and proceed to GPU allocation without being able to verify export ownership. P1 #8: ``/training/start`` releases settled export BEFORE chat, matching the chat-load helpers. If idle export shutdown fails the user's chat model is preserved instead of being dropped for a training run that never starts. P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``, the resolved HF cache path passed to ``transformer_cls.from_single_file()``. Without this an exception like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf`` would leak the operator's filesystem layout through ``last_error`` and ``/images/status``. All 85 diffusion-relevant backend tests pass locally.
P1 #1: ``_gpu_workload_busy_for_helper`` in ``utils/datasets/llm_assist.py`` now also gates on the GGUF chat backend (llama-server) AND the safetensors chat backend. Round 23 extended it to training + export but missed Chat, so a helper / advisor GGUF could still race a loaded chat model for VRAM. Both checks fail closed when status is unverifiable. P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff unloads so the diffusion release runs BEFORE the chat releases. A wedged diffusion unload used to fire AFTER chat was already gone, so the user lost both on a single failure. Drop chat last so an earlier failure preserves it. Applied to ``/training/start`` (training.py), ``/export/load`` (export.py), ``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch (routes/inference.py). P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens ``model_path`` and ``gguf_variant`` via the shared ``_validate_logged_identifier`` helper, so control characters and URL-form HF tokens can no longer log-line-smuggle. P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and ``variant`` the same way. P1 #9: ``/download-progress`` ``repo_id`` query parameter is also hardened; the value flows into log lines deep inside ``_get_repo_size_cached`` on lookup failure. P1 #11: ``CheckFormatRequest.dataset_name`` and ``AiAssistMappingRequest.{dataset_name, model_name}`` in ``models/datasets.py`` now apply the same control-char + embedded-HF-token validators, matching every other public request-body model. All 115 diffusion + training-validation + cached_gguf + export + inference model-validation tests pass locally. (P1 #6 native-path-lease enforcement for diffusion local paths and P1 #12 React Compiler frontend lint deferred -- both need focused design / frontend touchups separate from this batch.)
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model`` did not return False AND that ``is_loaded`` / ``is_active`` / ``loading_model_identifier`` are all cleared after the call. The previous version only treated raised exceptions as failure, so a subprocess refusing to terminate or an in-flight GGUF download let the next workload allocate on top. P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` now raises RuntimeError when ``exp._shutdown_subprocess`` fails on a settled checkpoint. Direct backend callers used to log at debug level and proceed toward diffusion allocation while the export checkpoint still owned VRAM. P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export before the cheap backend validation runs. ``DiffusionBackend.load_model`` already calls the strict ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` helpers AFTER family inference and GGUF filename checks pass, so the GPU is still freed before allocation and a malformed payload no longer silently unloads the user's chat / chat-export pair. P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a post-unload state where ``loading_model_identifier`` is still set, matching the route-level ``_release_llama_for`` strictness. A GGUF download mid-flight before the diffusion handoff used to slip through and end up double-owning VRAM after diffusion allocated. P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload ``status()`` failure as ``after = {}``. Training / chat / export handoffs need proof that the diffusion pipeline released VRAM; the helper now raises HTTP 503 when the verification status call itself raises, so the caller retries. P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` raises RuntimeError when ``get_export_backend()`` itself raises. Direct backend callers used to silently ``return`` here and proceed to GPU allocation without being able to verify export ownership. P1 #8: ``/training/start`` releases settled export BEFORE chat, matching the chat-load helpers. If idle export shutdown fails the user's chat model is preserved instead of being dropped for a training run that never starts. P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``, the resolved HF cache path passed to ``transformer_cls.from_single_file()``. Without this an exception like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf`` would leak the operator's filesystem layout through ``last_error`` and ``/images/status``. All 85 diffusion-relevant backend tests pass locally.
P1 #1: ``_gpu_workload_busy_for_helper`` in ``utils/datasets/llm_assist.py`` now also gates on the GGUF chat backend (llama-server) AND the safetensors chat backend. Round 23 extended it to training + export but missed Chat, so a helper / advisor GGUF could still race a loaded chat model for VRAM. Both checks fail closed when status is unverifiable. P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff unloads so the diffusion release runs BEFORE the chat releases. A wedged diffusion unload used to fire AFTER chat was already gone, so the user lost both on a single failure. Drop chat last so an earlier failure preserves it. Applied to ``/training/start`` (training.py), ``/export/load`` (export.py), ``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch (routes/inference.py). P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens ``model_path`` and ``gguf_variant`` via the shared ``_validate_logged_identifier`` helper, so control characters and URL-form HF tokens can no longer log-line-smuggle. P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and ``variant`` the same way. P1 #9: ``/download-progress`` ``repo_id`` query parameter is also hardened; the value flows into log lines deep inside ``_get_repo_size_cached`` on lookup failure. P1 #11: ``CheckFormatRequest.dataset_name`` and ``AiAssistMappingRequest.{dataset_name, model_name}`` in ``models/datasets.py`` now apply the same control-char + embedded-HF-token validators, matching every other public request-body model. All 115 diffusion + training-validation + cached_gguf + export + inference model-validation tests pass locally. (P1 #6 native-path-lease enforcement for diffusion local paths and P1 #12 React Compiler frontend lint deferred -- both need focused design / frontend touchups separate from this batch.)
Two round-33 reviewer findings: hub-floor consistency and the multipart upload filename validator gap. Dependencies: reverted the round-26 huggingface_hub>=1.3.0 floor in no-torch-runtime.txt and pyproject.toml (round 33 P1 #1-#5, 4/12 vote consensus). studio.txt forces huggingface_hub==0.36.2 to match the transformers==4.57.6 pin in extras-no-deps.txt, so the 1.3.0 floor was internally inconsistent. Reviewers reproduced the resolver conflict on a fresh install. Empirical justification (re-verified on the live B200 host before the revert): huggingface_hub 0.36.2 + transformers 4.57.6 + diffusers 0.37.1 imports Flux2KleinPipeline cleanly and runs end-to-end image generation. transformers 4.57.6 carries its own transformers.utils.hub.is_offline_mode and does not actually need huggingface_hub.is_offline_mode at import time. The original bump was guarding against the (never-realised) transformers 5.x path, which extras-no-deps explicitly pins away. Validation: multipart /seed/upload-unstructured-file now applies the same _no_control_chars and _reject_embedded_hf_token checks to file.filename that SeedInspectUploadRequest.filename already applies in the JSON variant (round 33 P1 #7). The filename is reflected back to the client, persisted in the per-file meta JSON, and echoed by error responses, so the JSON-side hardening must not be asymmetric with the multipart path. Skipped (consistent with prior rounds): * Find_spec vs full import (R33 P1 #6): preserves test compatibility with the huggingface_hub stub fixture. * React hooks set-state-in-effect lint (R33 P1 #8): codebase has 146 pre-existing violations of the same rule; studio-frontend-ci does not gate on lint. * Direct DiffusionBackend.load_model bypass (R33 P1 #9): the route is the only production entry point, and the backend helper now publishes its own diffusion-backend pending tag (round 32 P1 #3). Direct-caller hardening would require duplicating the lease check into load_model itself, which is out of scope for the route-layer security boundary. * One-segment Hub IDs (R33 P2 #10): strict 2-segment Hub id check is intentional; one-segment names are not valid Hub ids. * Cwd-relative shadow of Hub IDs (R33 P2 #11): documented side-channel tradeoff accepted in round 31 commit msg. 97 targeted backend tests pass.
Staging mirror of unslothai#4863
Original PR: unslothai#4863
Author: @kleeedolinux
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description
1. Why this PR exists
Unsloth fast paths currently use standard residual updates around attention blocks, e.g.:
This is stable and efficient, but all prior layer information is mixed with fixed weights (implicit weight = 1 for the direct residual stream at each step). The AttnRes idea is to make residual mixing content-adaptive while preserving the same forward structure and default behavior.
Reference paper:
This PR introduces an opt-in, low-risk integration that keeps Unsloth architecture and coding style intact.
2. High-level design
The integration goal is:
Implemented approach:
So if current attention output is (a_l), the transformed output is:
with:
where in this practical implementation:
3. Mathematical mapping to implementation
The paper discusses full and block residual attention. This PR implements a practical block-style variant:
4. What was changed
New files
unsloth/models/attnres.pyunsloth/utils/attnres.pyUpdated files
unsloth/utils/attention_dispatch.pyunsloth/utils/__init__.pyunsloth/kernels/__init__.pyModel integrations (attention branch only)
unsloth/models/llama.pyunsloth/models/gemma2.pyunsloth/models/cohere.py