Skip to content

Add Optional Attention Residuals (AttnRes) Integration for Unsloth Fast Paths#9

Closed
danielhanchen wants to merge 11 commits into
mainfrom
pr-4863-head
Closed

Add Optional Attention Residuals (AttnRes) Integration for Unsloth Fast Paths#9
danielhanchen wants to merge 11 commits into
mainfrom
pr-4863-head

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Staging mirror of unslothai#4863

Original PR: unslothai#4863
Author: @kleeedolinux

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

1. Why this PR exists

Unsloth fast paths currently use standard residual updates around attention blocks, e.g.:

$$x_{l+1} = x_l + f_l(x_l)$$

This is stable and efficient, but all prior layer information is mixed with fixed weights (implicit weight = 1 for the direct residual stream at each step). The AttnRes idea is to make residual mixing content-adaptive while preserving the same forward structure and default behavior.

Reference paper:

This PR introduces an opt-in, low-risk integration that keeps Unsloth architecture and coding style intact.

2. High-level design

The integration goal is:

  • Do not break existing kernels / fast paths.
  • Keep default behavior identical when AttnRes is disabled.
  • Touch only attention-branch outputs before existing residual merges.

Implemented approach:

  1. Build lightweight AttnRes state per forward pass.
  2. Track block-local states and block summaries.
  3. Before residual add, compute an AttnRes contribution from prior states.
  4. Add that contribution to the attention output, then continue existing residual code.

So if current attention output is (a_l), the transformed output is:

$$\tilde{a}_l = a_l + \alpha \cdot \sum_i w_{l,i} v_i$$

with:

$$w_{l,i} = \mathrm{softmax}\left(\frac{q_l \cdot k_i}{\sqrt{d}}\right)$$

where in this practical implementation:

  • query (q_l): derived from current residual/query tensor,
  • keys/values ((k_i, v_i)): prior block summaries + in-block previous states,
  • (\alpha): configurable residual-mixing scale.

3. Mathematical mapping to implementation

The paper discusses full and block residual attention. This PR implements a practical block-style variant:

  • Maintain two sets:
    • completed block summaries (S_1, \dots, S_{m-1})
    • in-progress block states in current block (h_{m,1}, \dots, h_{m,t-1})
  • Candidate set for layer (l):
$$\mathcal{C}_l = \{S_1, \dots, S_{m-1}\} \cup \{h_{m,1}, \dots, h_{m,t-1}\}$$
  • Residual attention mix:
$$r_l = \sum_{c \in \mathcal{C}_l} \mathrm{softmax}\left(\frac{q_l \cdot c}{\sqrt{d}}\right) c$$
  • Output transform:
$$\tilde{a}_l = a_l + \alpha r_l$$
  • Block summary update at boundary:
$$S_m = \sum_{j \in B_m} h_j$$

4. What was changed

New files

  • unsloth/models/attnres.py
    • forward-state init
    • block summary accumulation
    • attention-output transform
  • unsloth/utils/attnres.py
    • utility-level AttnRes state/config/hook helpers

Updated files

  • unsloth/utils/attention_dispatch.py
    • optional AttnRes-aware post-attention finalization hook path
  • unsloth/utils/__init__.py
  • unsloth/kernels/__init__.py
    • export AttnRes helpers

Model integrations (attention branch only)

  • unsloth/models/llama.py
  • unsloth/models/gemma2.py
  • unsloth/models/cohere.py
  • `unsloth/models

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces 'Attention Residuals' (AttnRes), a feature designed to accumulate and mix residual states across transformer layers to potentially improve model performance or stability. The implementation includes a core state management module in unsloth/models/attnres.py, integration into several model architectures (Llama, Gemma2, Cohere, Falcon, and Granite), and hooks within the central attention dispatch utility to support various backends. The review feedback suggests minor logic simplifications in the new modules, specifically regarding how the number of layers is determined, how model boundary conditions are checked, and the redundancy of state initialization logic.

Comment thread unsloth/models/attnres.py
Comment on lines +120 to +122
num_layers = int(getattr(config, "num_hidden_layers", 0))
if num_layers <= 0 and hasattr(model, "layers"):
num_layers = len(model.layers)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The logic for determining num_layers is redundant. getattr(config, "num_hidden_layers", 0) already returns 0 if the attribute is missing, and the subsequent check if num_layers <= 0 and hasattr(model, "layers") is sufficient. The explicit int() cast is also unnecessary as num_hidden_layers is typically an integer in Hugging Face configs.

Suggested change
num_layers = int(getattr(config, "num_hidden_layers", 0))
if num_layers <= 0 and hasattr(model, "layers"):
num_layers = len(model.layers)
num_layers = getattr(config, "num_hidden_layers", 0)
if num_layers <= 0 and hasattr(model, "layers"):

Comment thread unsloth/models/attnres.py
Comment on lines +189 to +192
at_block_end = ((layer_idx + 1) % attnres_state.block_size) == 0
at_model_end = (attnres_state.num_layers > 0) and (
(layer_idx + 1) >= attnres_state.num_layers
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The condition at_model_end can be simplified. Since layer_idx is 0-indexed, layer_idx + 1 is the 1-based layer count. Comparing layer_idx + 1 == attnres_state.num_layers is sufficient.

Suggested change
at_block_end = ((layer_idx + 1) % attnres_state.block_size) == 0
at_model_end = (attnres_state.num_layers > 0) and (
(layer_idx + 1) >= attnres_state.num_layers
)
at_model_end = (attnres_state.num_layers > 0) and ((layer_idx + 1) == attnres_state.num_layers)

Comment thread unsloth/utils/attnres.py
Comment on lines +43 to +45
self.enabled = bool(
self.enabled or self.config.enabled or self.kernel_hook is not None
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The enabled property logic in __post_init__ is redundant because begin() performs the same check. Consider simplifying the state initialization.

Suggested change
self.enabled = bool(
self.enabled or self.config.enabled or self.kernel_hook is not None
)
def __post_init__(self):
if self.kernel_hook is None:
self.kernel_hook = self.config.kernel_hook
self.enabled = self.enabled or self.config.enabled or self.kernel_hook is not None

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 513888151c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth/models/attnres.py
return torch.zeros_like(query)

# (bsz, seqlen, n_states, dim)
stacked = torch.stack(candidates, dim = 2)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Normalize candidate devices before stacking AttnRes states

AttnRes currently stacks completed_block_summaries and current_block_states as-is, but multi-GPU pipeline inference moves activations per layer (move_to_device in the inference loops) while reusing one shared attnres_state. As soon as layers run on different _per_layer_device_index values, this list contains tensors from different GPUs and torch.stack fails with a cross-device error. The candidates should be moved to query.device (or tracked per device) before mixing.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/attnres.py
candidates.extend(attnres_state.current_block_states)
if len(candidates) != 0:
mixed = _compute_residual_mix(query, candidates)
attention_output = attention_output + (attnres_state.alpha * mixed)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve attention output dtype when applying AttnRes mix

attention_output can be fp16/bf16, but the mix is computed from query and may be fp32 (for example, fast inference paths pass a float32 residual buffer as query). The expression attention_output + (alpha * mixed) then promotes the result to fp32, which pushes subsequent layers off the intended low-precision fast path and increases memory/latency whenever AttnRes is enabled. Cast mixed (or query) to attention_output.dtype before the add.

Useful? React with 👍 / 👎.

danielhanchen added a commit that referenced this pull request May 25, 2026
… MPS + base namespace for PR unslothai#5754

Round 12 reviewer findings.

Backend correctness (P1)
  * core/inference/diffusion.py load_model: GGUF branch now
    handles an absolute local directory passed as repo_id by
    joining Path(repo_id) / gguf_filename directly instead of
    handing the path to hf_hub_download (which raises
    HFValidationError because the path is not 'namespace/repo').
    Closes round 12 review #1 -- the load request advertised
    'local path' support but actually only worked for Hub repo ids.

Delete guard precision (P1)
  * routes/models.py /delete-finetuned + /delete-cached:
    diffusion guard now consults gguf_filename from status()
    and ALLOWS per-variant deletes that target a different quant
    than the one the loaded pipeline is reading. Loading
    'Q4_K_S' no longer blocks deleting 'Q8_0' from the same
    repo / export directory (round 12 reviews #3 and #4).

Accelerator (P2)
  * core/inference/diffusion.py _drain_cuda_cache: also calls
    torch.mps.empty_cache() when the MPS backend is the
    active accelerator. Apple Silicon swaps now actually return
    held VRAM instead of leaving it pinned in the Metal
    allocator (round 12 review #10).

Smart base repo (P2)
  * core/inference/diffusion.py _smart_base_repo: only inspects
    the LAST segment of the repo id / path for the 'base' / '9b'
    tokens. A namespace like baseorg/FLUX.2-klein-4B-GGUF or
    a parent directory like /home/me/.cache/base/... no
    longer falsely selects the Base variant (round 12 review #9).
danielhanchen added a commit that referenced this pull request May 25, 2026
P1: route-layer chat/diffusion/export releases were still
asymmetric. Training start and export load called
``diff_backend.unload_model`` inside a best-effort try/except so a
wedged diffusion backend let the next workload allocate over the
top of the resident pipeline and OOM. Both now use the strict
``_release_diffusion_for`` helper from routes.inference, which
raises HTTPException 503 on status/unload failure or post-check
mismatch.

P2 #9: diffusion load exceptions can include the absolute local
repo / base / gguf path verbatim (FileNotFoundError, OSError from
diffusers / safetensors). The path flows into ``_last_error``,
which ``status()`` returns to every authenticated session. Collapse
the known repo_id / effective_base / gguf_filename paths to their
leaf name before storing the error, mirroring the
``_display_repo_id`` convention used for the public repo label.

P2 #10: when ``repo_id`` is an absolute local path,
``detect_family`` matched _FAMILY_EXCLUDE deny lists against the
full path, so models stored under a parent directory containing
``qwen-image-edit`` or ``3.5`` were misclassified as None. Reduce
the family-detection needle to the leaf directory when the input
looks like a filesystem path; Hub-style ``owner/repo`` ids
continue to use the original needle so existing detection rules
keep working.

P2 #12: ``gguf_filename`` was missing from the
``_reject_embedded_hf_token`` validator. A URL-form quant path
like ``https://hf_xxxxx@huggingface.co/.../flux.gguf`` would be
stored on ``DiffusionBackend._gguf_filename`` and surface in
status() / log lines. Extend the validator to gguf_filename so the
token is dropped before it can leak.

All 85 diffusion-relevant backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model``
did not return False AND that ``is_loaded`` / ``is_active`` /
``loading_model_identifier`` are all cleared after the call. The
previous version only treated raised exceptions as failure, so a
subprocess refusing to terminate or an in-flight GGUF download
let the next workload allocate on top.

P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
now raises RuntimeError when ``exp._shutdown_subprocess`` fails on
a settled checkpoint. Direct backend callers used to log at debug
level and proceed toward diffusion allocation while the export
checkpoint still owned VRAM.

P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export
before the cheap backend validation runs. ``DiffusionBackend.load_model``
already calls the strict ``_release_other_gpu_owners_for_diffusion``
and ``_release_chat_backend_for_diffusion`` helpers AFTER family
inference and GGUF filename checks pass, so the GPU is still
freed before allocation and a malformed payload no longer
silently unloads the user's chat / chat-export pair.

P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a
post-unload state where ``loading_model_identifier`` is still set,
matching the route-level ``_release_llama_for`` strictness. A GGUF
download mid-flight before the diffusion handoff used to slip
through and end up double-owning VRAM after diffusion allocated.

P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload
``status()`` failure as ``after = {}``. Training / chat / export
handoffs need proof that the diffusion pipeline released VRAM;
the helper now raises HTTP 503 when the verification status call
itself raises, so the caller retries.

P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
raises RuntimeError when ``get_export_backend()`` itself raises.
Direct backend callers used to silently ``return`` here and
proceed to GPU allocation without being able to verify export
ownership.

P1 #8: ``/training/start`` releases settled export BEFORE chat,
matching the chat-load helpers. If idle export shutdown fails the
user's chat model is preserved instead of being dropped for a
training run that never starts.

P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``,
the resolved HF cache path passed to
``transformer_cls.from_single_file()``. Without this an exception
like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf``
would leak the operator's filesystem layout through ``last_error``
and ``/images/status``.

All 85 diffusion-relevant backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_gpu_workload_busy_for_helper`` in
``utils/datasets/llm_assist.py`` now also gates on the GGUF chat
backend (llama-server) AND the safetensors chat backend. Round 23
extended it to training + export but missed Chat, so a helper /
advisor GGUF could still race a loaded chat model for VRAM.
Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff
unloads so the diffusion release runs BEFORE the chat releases.
A wedged diffusion unload used to fire AFTER chat was already
gone, so the user lost both on a single failure. Drop chat last
so an earlier failure preserves it. Applied to
``/training/start`` (training.py), ``/export/load`` (export.py),
``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch
(routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens
``model_path`` and ``gguf_variant`` via the shared
``_validate_logged_identifier`` helper, so control characters
and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and
``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is
also hardened; the value flows into log lines deep inside
``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and
``AiAssistMappingRequest.{dataset_name, model_name}`` in
``models/datasets.py`` now apply the same control-char +
embedded-HF-token validators, matching every other public
request-body model.

All 115 diffusion + training-validation + cached_gguf + export
+ inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths
and P1 #12 React Compiler frontend lint deferred -- both need
focused design / frontend touchups separate from this batch.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model``
did not return False AND that ``is_loaded`` / ``is_active`` /
``loading_model_identifier`` are all cleared after the call. The
previous version only treated raised exceptions as failure, so a
subprocess refusing to terminate or an in-flight GGUF download
let the next workload allocate on top.

P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
now raises RuntimeError when ``exp._shutdown_subprocess`` fails on
a settled checkpoint. Direct backend callers used to log at debug
level and proceed toward diffusion allocation while the export
checkpoint still owned VRAM.

P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export
before the cheap backend validation runs. ``DiffusionBackend.load_model``
already calls the strict ``_release_other_gpu_owners_for_diffusion``
and ``_release_chat_backend_for_diffusion`` helpers AFTER family
inference and GGUF filename checks pass, so the GPU is still
freed before allocation and a malformed payload no longer
silently unloads the user's chat / chat-export pair.

P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a
post-unload state where ``loading_model_identifier`` is still set,
matching the route-level ``_release_llama_for`` strictness. A GGUF
download mid-flight before the diffusion handoff used to slip
through and end up double-owning VRAM after diffusion allocated.

P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload
``status()`` failure as ``after = {}``. Training / chat / export
handoffs need proof that the diffusion pipeline released VRAM;
the helper now raises HTTP 503 when the verification status call
itself raises, so the caller retries.

P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
raises RuntimeError when ``get_export_backend()`` itself raises.
Direct backend callers used to silently ``return`` here and
proceed to GPU allocation without being able to verify export
ownership.

P1 #8: ``/training/start`` releases settled export BEFORE chat,
matching the chat-load helpers. If idle export shutdown fails the
user's chat model is preserved instead of being dropped for a
training run that never starts.

P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``,
the resolved HF cache path passed to
``transformer_cls.from_single_file()``. Without this an exception
like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf``
would leak the operator's filesystem layout through ``last_error``
and ``/images/status``.

All 85 diffusion-relevant backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_gpu_workload_busy_for_helper`` in
``utils/datasets/llm_assist.py`` now also gates on the GGUF chat
backend (llama-server) AND the safetensors chat backend. Round 23
extended it to training + export but missed Chat, so a helper /
advisor GGUF could still race a loaded chat model for VRAM.
Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff
unloads so the diffusion release runs BEFORE the chat releases.
A wedged diffusion unload used to fire AFTER chat was already
gone, so the user lost both on a single failure. Drop chat last
so an earlier failure preserves it. Applied to
``/training/start`` (training.py), ``/export/load`` (export.py),
``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch
(routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens
``model_path`` and ``gguf_variant`` via the shared
``_validate_logged_identifier`` helper, so control characters
and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and
``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is
also hardened; the value flows into log lines deep inside
``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and
``AiAssistMappingRequest.{dataset_name, model_name}`` in
``models/datasets.py`` now apply the same control-char +
embedded-HF-token validators, matching every other public
request-body model.

All 115 diffusion + training-validation + cached_gguf + export
+ inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths
and P1 #12 React Compiler frontend lint deferred -- both need
focused design / frontend touchups separate from this batch.)
danielhanchen added a commit that referenced this pull request May 25, 2026
Two round-33 reviewer findings: hub-floor consistency and the
multipart upload filename validator gap.

Dependencies: reverted the round-26 huggingface_hub>=1.3.0 floor
in no-torch-runtime.txt and pyproject.toml (round 33 P1 #1-#5,
4/12 vote consensus). studio.txt forces huggingface_hub==0.36.2
to match the transformers==4.57.6 pin in extras-no-deps.txt, so
the 1.3.0 floor was internally inconsistent. Reviewers
reproduced the resolver conflict on a fresh install.

Empirical justification (re-verified on the live B200 host before
the revert): huggingface_hub 0.36.2 + transformers 4.57.6 +
diffusers 0.37.1 imports Flux2KleinPipeline cleanly and runs
end-to-end image generation. transformers 4.57.6 carries its own
transformers.utils.hub.is_offline_mode and does not actually need
huggingface_hub.is_offline_mode at import time. The original bump
was guarding against the (never-realised) transformers 5.x path,
which extras-no-deps explicitly pins away.

Validation: multipart /seed/upload-unstructured-file now applies
the same _no_control_chars and _reject_embedded_hf_token checks
to file.filename that SeedInspectUploadRequest.filename already
applies in the JSON variant (round 33 P1 #7). The filename is
reflected back to the client, persisted in the per-file meta
JSON, and echoed by error responses, so the JSON-side hardening
must not be asymmetric with the multipart path.

Skipped (consistent with prior rounds):
  * Find_spec vs full import (R33 P1 #6): preserves test
    compatibility with the huggingface_hub stub fixture.
  * React hooks set-state-in-effect lint (R33 P1 #8): codebase
    has 146 pre-existing violations of the same rule;
    studio-frontend-ci does not gate on lint.
  * Direct DiffusionBackend.load_model bypass (R33 P1 #9): the
    route is the only production entry point, and the backend
    helper now publishes its own diffusion-backend pending tag
    (round 32 P1 #3). Direct-caller hardening would require
    duplicating the lease check into load_model itself, which
    is out of scope for the route-layer security boundary.
  * One-segment Hub IDs (R33 P2 #10): strict 2-segment Hub id
    check is intentional; one-segment names are not valid Hub
    ids.
  * Cwd-relative shadow of Hub IDs (R33 P2 #11): documented
    side-channel tradeoff accepted in round 31 commit msg.

97 targeted backend tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants