Skip to content

feat(compile): traceable GatedDeltaNet decoder loop (Qwen3.5 + Qwen3-Next) + model-agnostic FA2 varlen collator#49

Open
thad0ctor wants to merge 20 commits into
mainfrom
feat/gdn-compiled-decoder-loop
Open

feat(compile): traceable GatedDeltaNet decoder loop (Qwen3.5 + Qwen3-Next) + model-agnostic FA2 varlen collator#49
thad0ctor wants to merge 20 commits into
mainfrom
feat/gdn-compiled-decoder-loop

Conversation

@thad0ctor

@thad0ctor thad0ctor commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Description

Under torch_compile, one graph break inside the GatedDeltaNet decoder for-loop made dynamo skip the whole text-model forward frame (a break inside a loop is unresumable), so every layer (attention included) ran eagerly.

This makes the loop traceable for every GatedDeltaNet model (Qwen3.5 dense + MoE, Qwen3-Next), plus a model-agnostic collator path that fixes the same break for any flash-attention + packed model under compile:

  • gated_delta_net_ops.py (new, shared) — opaque torch.library ops (axolotl_gdn::gdn_conv/gdn_chunk + backward) wrapping the same FLA kernels with identical saved tensors. They derive cu_seqlens from position_ids inside the op, so no aten.nonzero enters the graph. A cast_g flag matches each architecture's eager call (qwen3_5 casts g; qwen3_next keeps f32) for bitwise parity; packed input enforces batch-1 with an explicit ValueError. Also hosts the shared FusedRMSNormGated compile boundary.
  • qwen3_5/ + qwen3_next/modeling.py — route the GDN training path through the ops under compile, and install the FusedRMSNormGated opaque wrapper (its backward recomputes via autograd.grad, since the FLA backward isn't meta-traceable). Self-attention stays in-graph under GC+training, behind a dynamo.disable boundary otherwise (guards an Inductor FA2-backward fusion that corrupts packed grads; pinned by test_fa2_compiled_matches_eager_grads). The qwen3_next wiring also fixes a pre-existing crash (its decoder passed cache_position to a GatedDeltaNet.forward that rejects it).
  • collators/batching.py + builders/causal.py — under torch_compile, the multipack collator precomputes the FA2 varlen kwargs (cu_seq_lens_q/k, max_length_q/k) once per batch, so transformers skips its per-layer nonzero derivation. Model-agnostic — gated on SUPPORTED_MULTIPACK_MODEL_TYPES, so any packed FA2 model compiles clean.
  • patch_manager.py — threads torch_compile into the qwen3_5/qwen3_5_moe/qwen3_next packing patches.

Opt-in via torch_compile; inert otherwise.

Version coupling — the ==0.4.1 pin is load-bearing. The ops wrap FLA internal functions by fixed arity and the fakes hardcode 0.4.1 shapes/dtypes; FLA 0.4.2/0.5.0 already changed chunk_gated_delta_rule_fwd's arity (crash at first forward). So a bump needs revalidating the ops — guarded by test_opcheck_custom_ops + test_eager_parity_ops_vs_legacy_bitwise (the call-time crash isn't caught by the build-time warning). The upstream fix fla-org/flash-linear-attention#909 removes only the index-helper breaks, not the nonzero/@torch.compiler.disable ones these ops target, so the wrapper can't just be dropped on a newer FLA.

Known limit: auto-enabled LoRA triton kernels + a GDN packing patch + torch_compile + GC NaN gradients on torch ≤ 2.10 (pre-existing on main; clean on 2.11, and with any one ingredient removed). PatchManager force-disables those kernels with a warning on torch < 2.11.

Motivation and Context

The decoder loop is the bulk of training compute. Skipping its compilation forfeits the largest fusion payoff: the eager RMSNorm fp32 round-trips (~870 GB/step of traffic at 9B/32k, interposer-measured) + per-kernel Python launch overhead. Once the loop traces, Inductor fuses those.

Scope: the GDN ops cover Qwen3.5/Qwen3.6 (model_type qwen3_5/qwen3_5_moe, dense + MoE, text + VL — every shipped checkpoint is *ForConditionalGeneration) and Qwen3-Next (qwen3_next, same FLA GatedDeltaNet kernels). The collator change is model-agnostic. Dense Qwen3/Qwen2.5/Qwen3-VL have no GatedDeltaNet, so the loop-internal breaks don't exist for them — but they still benefit from the collator path under FA2+packing+compile.

How has this been tested?

Stack: torch 2.11.0+cu130, transformers 5.9.0, FLA 0.4.1, liger-kernel 0.8.0, real flash_attn 2.8.3 (the validated stack). RTX 3090 / 3090 Ti / 5090.

Functional — 12 tests, all passing:

tests/monkeypatch/test_qwen3_5_compiled_loop.py (10): zero-breaks-with-GC (1 graph), FA2-loop-with-varlen-kwargs, VL ForConditionalGeneration 3-D MRoPE, aten.nonzero-break-gone, bitwise eager parity (ops vs legacy), compiled-vs-eager, FA2+GC compiled grad parity, toy MoE (0 breaks), B>1-packed-raises, opcheck (packed/MRoPE/dense × bias × both cast_g).
tests/monkeypatch/test_qwen3_next_compiled_loop.py (2): qwen3_next decoder loop 0 breaks + bitwise eager parity (cast_g=False, g stays f32).
Existing test_qwen3_5_fused_attn.py: 12/12 (no regression). Suite also passes one major version down (torch 2.9.1 / 2.10.0 + transformers 5.5.4).

Multimodal: Multimodal training is non-packed in axolotl (SFT vision examples + MM-CPT (pending PR axolotl-ai-cloud#3629) ship sample_packing: false), so the GDN ops are inert there by construction this was validated with real images (llava-instruct through the Qwen3.5-2B vision tower: packing patches absent from logs, eager-vs-compiled step-1 loss Δ0.004, eval clean).

Multiple test training runs and benches (below).


Benchmarks (all on the validated stack above)

Qwen3.5-2B/9B = Qwen3_5ForConditionalGeneration, LoRA r=16 q/k/v/o, sample packing, bf16, gradient checkpointing, attn_implementation: flash_attention_2. 60 steps, steady-state window after warmup, TORCH_LOGS=recompiles. Throughput = tok/s; VRAM = reserved (GiB). Zero NaN and step-1/final-10 loss parity in every cell.

A. Core compile speedup On real FA2, torch_compile without this PR is near-neutral (the loop graph-breaks); this PR is what makes compile worth it (RTX 3090):

model eager compile (main, no PR) compile + PR PR vs eager PR vs compile-main
Qwen3.5-2B (seq 4096) 3593 tok/s 3788 (+5.4%) 4617 +28.5% +21.9%
Qwen3.5-9B (seq 4096, CCE) 2724 tok/s 2732 (+0.3%) 3330 +22.2% +21.9%

B. Feature compatibility / composition (Qwen3.5-2B, compile + PR, varying feature): (RTX 3090)

config tok/s reserved VRAM note
compile + PR (LoRA kernels on — default) 4617 14.66 GiB baseline
+ LoRA kernels off 4585 14.40 GiB kernels ≈ neutral under compile
+ fused_attn_kernel (axolotl-ai-cloud#3680) 4394 14.67 GiB −4.8% — redundant under compile
+ axolotl CCE (cut_cross_entropy) 4781 6.03 GiB fastest and ~2.4× less memory
+ liger fused_linear_cross_entropy 4593 14.66 GiB no-op on Qwen3.5 (= no memory benefit) — requires overlooked gating bug fix following liger 0.8.0 bump
+ liger full stack (rms_norm/gated/swiglu/rope) 4137 10.60 GiB −10% — kernels redundant under compile
(eager + fused_attn_kernel) 3648 18.07 GiB eager: fused_attn only +1.5%

LoRA triton kernels and fused_attn_kernel are subsumed by Inductor once the loop compiles (consistent with axolotl-ai-cloud#3680's own "fused+compile −18%" dense-Qwen3 numbers); liger composes cleanly but adds no value under compile.

C. Generalization — the model-agnostic collator (compile on/off, FA2 + packing): (RTX 5090)

model eager compile + PR speedup reserved (eager → compile)
Llama-3.2-1B (seq 2048) 5181 tok/s 7572 +46.1% 6.09 → 4.54 GiB
Qwen3-0.6B (seq 2048) 5445 tok/s 11375 +108.9% 5.54 → 5.01 GiB

Neither model has a GatedDeltaNet so speedups come from the collator removing the FA2 per-layer break. Isolation (toy Llama, FA2 + packed, under compile): 3 graph breaks (4 graphs) without the varlen kwargs → 0 breaks (1 graph) with them. 0 recompiles in steady state; loss parity holds.


Convergence: loss is bit-stable across every cell above — step-1 identical to eager and final-10 within float noise (e.g. 2B eager 1.1275 vs compile+PR 1.1275; 9B 0.9264 vs 0.9266). The opaque-op path is bitwise identical to legacy eager (regression-tested) for both qwen3_5 and qwen3_next.

Composition with axolotl-ai-cloud#3732 (fused-LoRA GDN routing): merges cleanly, both touch the GDN forward, the projection call sites route through _la_proj_fwd while the compiled-ops gating stays intact, and the fused LoRA_O autograd Function traces inside the compiled loop with 0 graph breaks (earlier 2B bench: both-merged+compile +44% vs eager, axolotl-ai-cloud#3732's +10% marginal surviving compilation because it substitutes bf16 adapter GEMMs Inductor can't infer).

(Orthogonal note: any LoRA + GC + compile run carries one pre-loop graph break from transformers' enable_input_require_grads embedding hook; on torch 2.11 it fires before the decoder loop and doesn't affect it.)

AI Usage Disclaimer

Opus 4.8 / Fable 5 used throughout.

Types of changes

  • New feature (torch_compile support for the GatedDeltaNet decoder loop — Qwen3.5 dense/MoE + Qwen3-Next)
  • New feature (model-agnostic FA2 varlen collator kwargs for any packed model under compile)
  • Performance improvement (non-breaking; numerics-neutral, bitwise eager fallback)
  • New tests

Summary by CodeRabbit

  • New Features

    • Added torch.compile optimization support for Qwen3.5 and Qwen3Next models with improved kernel handling.
    • Enhanced Flash Attention 2 compatibility with varlen metadata precomputation for better performance.
  • Bug Fixes

    • Fixed Qwen3.5 LoRA training compatibility when using sample packing with gradient checkpointing on PyTorch < 2.11.
  • Tests

    • Added regression test suites validating torch.compile compatibility and performance parity for Qwen model architectures.

thad0ctor added 10 commits June 12, 2026 12:43
…eable packing

With torch_compile, a single in-loop graph break made dynamo skip the entire
Qwen3_5TextModel.forward frame, so every decoder layer ran eagerly (no
norm/rope/gating/residual fusion, full per-kernel launch overhead). The
in-loop blockers were: axolotl's get_cu_seqlens() nonzero in the packing
patch, FLA's @torch.compiler.disable'd chunk_gated_delta_rule, causal_conv1d's
.item(), and FusedRMSNormGated's untraceable device-property probe.

- New fla_ops.py: opaque torch.library custom ops (axolotl_qwen3_5::gdn_conv /
  gdn_chunk + backward) wrapping the same FLA host functions with identical
  saved tensors (no recompute, no kernel changes). They take position_ids and
  derive cu_seqlens eagerly inside the op, so no data-dependent op enters the
  graph and every fake impl is static.
- modeling.py: the GDN training/no-cache path routes through the ops; a
  FusedRMSNormGated custom-op wrapper installs whenever torch_compile is on;
  decoder self-attention is traced when gradient_checkpointing and training.
- patch_manager.py threads torch_compile=bool(cfg.torch_compile) into the
  packing patches.

Bit-exactness: eager FLA's dg lands on the bf16 grid (reproduced via an
explicit f32->bf16->f32 round-trip); v reaches the kernels as a non-contiguous
split/reshape view (the bwd op mirrors FLA's input_guard contiguization).

Verified (tiny 4-layer hybrid): 0 graph breaks / 1 graph for the full text
model under GC; eager ops-vs-legacy bitwise identical; compiled-vs-eager
within compile noise. monkeypatch compiled-loop suite 4/4, qwen3_5 fused-attn 12/12.

Note: flash_attention_2 configs still don't get the compiled loop — the
remaining breaker is transformers' per-layer varlen derivation on the FA2
path, fixable purely axolotl-side by emitting precomputed cu_seq_lens/max_length
from the multipack collator (follow-up, not in this change).
… GDN ops

_FLA_COMPILED_OPS now follows the torch_compile flag, so it is set to False when torch_compile=False instead of always enabling the opaque GDN ops.
… the compiled loop

The compiled decoder loop (the preceding commit) works under sdpa but not
flash_attention_2: with packed sequences signalled only via position_ids,
transformers re-derives the varlen metadata per layer inside
_flash_attention_forward (a data-dependent _is_packed_sequence branch +
(position_ids==0).nonzero()), which graph-breaks inside the decoder loop and
makes dynamo skip the whole frame.

transformers already ships the escape hatch: if the caller passes the
FlashAttentionKwargs (cu_seq_lens_q/k + max_length_q/k), is_fa_with_varlen_kwargs
short-circuits before the data-dependent branch and the per-layer derivation is
skipped. This emits exactly those kwargs from the multipack collator, computed
once via transformers' own prepare_fa_kwargs_from_position_ids (so the metadata
is bit-identical to what it would derive per layer). Gated to FA2 + sample
packing + qwen3_5/qwen3_5_moe collators; max_length stays a python int so no
capture_scalar_outputs is needed.

Verified: FA2 + torch_compile now traces the loop with 0 graph breaks (was 1
break / 4 fragmented graphs); eager FA2 loss with-vs-without the kwargs is
bitwise identical. compiled-loop suite 6/6.
…acks, formatting

- build_collator: gate emit_fa_varlen_kwargs on the eval-aware packing state
  (training_args.eval_sample_packing for eval loaders) instead of
  cfg.sample_packing, so eval-only packed mode also gets the precomputed kwargs.
- collator + FusedRMSNormGated boundary: replace silent exception swallows with
  warnings so a disabled compiled-loop optimization is visible.
- ruff-format the new test assert; collapse a two-line comment.
…lback, B>1 guard, weight-None norm fallback, saved-v contiguity; add MoE/FA2-parity/opcheck/B>1 tests
…undary (torch 2.11-only, unreproduced on 2.9/2.10)
- collator: emit FA2 varlen kwargs for ALL multipack models under torch_compile
  (was gated to qwen3_5/qwen3_5_moe); model-agnostic, transformers consumes them natively
- share the GatedDeltaNet opaque ops: move fla_ops -> gated_delta_net_ops (neutral
  axolotl_gdn namespace), add cast_g flag so qwen3_5 (casts g) and qwen3_next (f32 g)
  both stay bit-exact; move the FusedRMSNormGated compile boundary into the shared module
- wire qwen3_next: route conv+chunk through the shared ops under compile, install the
  rmsnorm boundary + self-attn dynamo boundary, thread torch_compile; also fix a
  pre-existing crash (decoder passed cache_position to GatedDeltaNet.forward)
- extend the lora-kernel NaN guard to qwen3_next
- tests: qwen3_next compiled-loop (0 breaks + bitwise eager parity); opcheck both cast_g
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9b1e5889-d874-4ad8-9550-f1c4f59c52b8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds opaque torch.library.custom_op wrappers (axolotl_gdn.gdn_conv, gdn_chunk) around FLA GatedDeltaNet kernels to prevent aten.nonzero from entering torch.compile graphs. Threads torch_compile flags through Qwen3.5 and Qwen3-Next monkeypatches to route through these ops. Precomputes FlashAttention varlen metadata in DataCollatorForSeq2Seq when enabled. Wires everything via PatchManager and adds CUDA regression tests for zero graph breaks and eager/compiled parity.

Changes

torch.compile GatedDeltaNet Opaque Op Integration

Layer / File(s) Summary
GatedDeltaNet opaque custom-op library
src/axolotl/monkeypatch/models/gated_delta_net_ops.py
New module: lazily registers torch.library.custom_op wrappers (gdn_conv, gdn_chunk) that move cu_seqlens computation from position_ids inside the op boundary, preventing aten.nonzero from entering compile graphs. Also registers a FusedRMSNormGated compile boundary with recompute-based backward, and exposes fla_ops_available() / fla_ops_build_error() with cached build state.
Qwen3.5 modeling patches for compiled-op routing
src/axolotl/monkeypatch/models/qwen3_5/modeling.py
Adds _FLA_COMPILED_OPS module state and _call_self_attn Dynamo-disabled wrapper. Adds compile_boundary flag to _inject_fla_kernels to install FusedRMSNormGated boundary. Updates decoder forward to choose between torch.ops.axolotl_gdn.gdn_conv/gdn_chunk (compiled path) vs cu_seqlens+chunk_gated_delta_rule (eager path). Extends _apply_packing_patches, patch_qwen3_5_modeling_packing, and patch_qwen3_5_moe_modeling_packing with torch_compile parameter.
Qwen3-Next modeling patches for compiled-op routing
src/axolotl/monkeypatch/models/qwen3_next/modeling.py
Mirrors Qwen3.5 changes: adds _FLA_COMPILED_OPS, Dynamo-disabled self-attn wrapper, and torch_compile parameter to all four public patch functions. Decoder layer threads position_ids into linear_attention and conditionally wraps self_attn with the Dynamo-disable boundary. GatedDeltaNet layer selects axolotl_gdn.gdn_conv/gdn_chunk or falls back to FLA eager paths.
FA varlen kwargs precomputation in collator and builder
src/axolotl/utils/collators/batching.py, src/axolotl/core/builders/causal.py
DataCollatorForSeq2Seq gains emit_fa_varlen_kwargs: bool = False; when set, precomputes cu_seq_lens_q/k and max_length_q/k from position_ids once per batch with a warning_once fallback. HFCausalTrainerBuilder.build_collator sets this flag when flash_attention_2, packed mode, torch_compile, and a supported multipack model type are all active.
PatchManager wiring and LoRA kernel guard
src/axolotl/loaders/patch_manager.py
Passes torch_compile=bool(self.cfg.torch_compile) to all three Qwen3-family packing patch calls. Adds a pre-model-load guard that disables Qwen3.5 LoRA triton kernels on PyTorch < 2.11 when sample packing + torch_compile + gradient checkpointing + adapter training are all enabled.
CUDA regression tests
tests/monkeypatch/test_qwen3_5_compiled_loop.py, tests/monkeypatch/test_qwen3_next_compiled_loop.py
CUDA-only test suites with packing_patched fixtures that snapshot/restore patched methods. Qwen3.5 tests cover zero Dynamo graph breaks across gradient-checkpointing, FA2 varlen, VL, and MoE scenarios; bitwise and tolerance parity between compiled-op and eager paths; batch-gt-1 raise guard; and torch.library.opcheck for gdn_chunk/gdn_conv. Qwen3-Next tests cover zero graph breaks with gradient checkpointing and bitwise parity toggling _FLA_COMPILED_OPS.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.79% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and accurately summarizes the main changes: adding torch.compile support for GatedDeltaNet decoder loops in Qwen3.5/Qwen3-Next models and a model-agnostic FA2 varlen collator.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/gdn-compiled-decoder-loop

Comment @coderabbitai help to get the list of available commands and usage tips.

thad0ctor and others added 10 commits June 14, 2026 08:41
- batching.py: narrow the FA varlen fallback catch to the realistic exception
  set (ImportError/AttributeError/RuntimeError/TypeError/ValueError/IndexError)
  instead of bare Exception (Ruff BLE001)
- tests: presence-aware restore of FusedRMSNormGated._axolotl_compile_boundary
  (snapshot hasattr + value) in both compiled-loop fixtures
- tests: narrow B>1 pytest.raises to ValueError (verified: the op raises a clean
  ValueError through the model forward)
- qwen3_next: route non-GC self-attn through the dynamo.disable boundary
  unconditionally (a no-op when not compiling), matching qwen3_5
* feat: update cce for new models

* feat: update transformers

* chore: remove dead code

* feat: add liger for gemma4 unified

* fix: hybrid attn for latest transformers

* feat: add gemma4 unified following gemma4

* chore: refactor old logger to axolotl get_logger

* fix: update legacy env to new xet env

* feat: add missed files

* fix: handling of lora kernels

* feat: update vision and readme yaml

* chore: update numbers from latest run

* feat: add text config

* chore: update correct number

* fix: update cce commit

* fix: packing leak

* use transformers patch release

* 2 parallel jobs for pytests

* fix: gate attention_mask for gemma4_unified

* fix: restore prior gemma4 e2b shared kv layer helper

* chore: refactor gemma4 hybrid attn

* feat: update gemma4 results and config

* chore: simplify config

* fix: update unified results and docs

* chore: swap to hybrid attn

* feat: add tests

* fix: swap to FA2 text

* fix: ci logging

* fix: generalize rotary patch

* fix: deleted file for docs

* fix: fsdp defaulted to v2

* fix: support simplenamespace for test

* fix: update quarto to include all current scripts

* fix: drop quarto doc entry for untracked deepseek_v4 module

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
* fix version mismatch 4 pirate

* add whitlist for collab

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
…tent in qwen3_5 template (axolotl-ai-cloud#3725) [skip ci]

The inline-<think> assistant branch reassigned content (stripping it to
the post-</think> answer) before reasoning_content was extracted from it.
Since reasoning_content reads from the already-truncated content, the
reasoning trace was dropped and the answer leaked into the <think> block.
Swap the two set statements to match the official Qwen3.5 template order.
…onfig (axolotl-ai-cloud#3730) [skip ci]

* fix: KTO user_defined dataset transform crashes on every documented config

The user_defined.default KTO strategy was broken in all configurations:

- when completion_format was provided, the default was assigned to a
  misnamed chosen_format variable only in the fallback branch, so the
  transform raised NameError: chosen_format
- when completion_format was omitted, the generated placeholder name did
  not match the .format() keyword (chosen= vs {completion}), raising
  KeyError
- prompt formatting read sample['prompt'] instead of sample[field_prompt],
  breaking custom field_prompt configs

Also surface the underlying exception when a prompt strategy fails to
load instead of silently returning None, which previously crashed later
with the unhelpful 'TypeError: None is not a callable object'.

Fixes axolotl-ai-cloud#2757

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: add docstrings to KTO user_defined tests for coverage check

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: wrong pydantic type

* feat: updated test to handle e2e validation

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
…ip ci]

* fix: fail early for CI if meet CUDA error

* fix: switch to clean abort
* feat: add interactive multi-turn chat mode (--chat) to inference CLI

* fix: apply_chat_template returns BatchEncoding in transformers v5

* docs: document interactive chat mode for inference

* feat: diffusion turn generation for chat mode; fix fp8 probe on CPU-only torch

* feat: suggest command aliases on typo; save chat sessions in multimodal parts format

* feat: collapse thinking blocks in chat with /expand, /think template toggle, thinking token stats

* fix: suppress unauthenticated HF Hub nag warning in logging config

* fix: harden chat REPL against interrupts and command errors; store assistant turns in parse_response format

- Ctrl+C during a (diffusion) turn no longer crashes the REPL; the session
  survives and the user message is kept
- exceptions in slash-command handlers no longer kill the session
- consecutive user messages merge so strict templates never see two user
  turns after a failed generation
- assistant turns are stored without special tokens, with thinking under
  reasoning_content (tokenizer parse_response schema when available,
  think-marker split otherwise); EOS markers no longer leak into the
  streamed display

* perf(chat): lighten the turn loop

- /new now drops the cross-turn KV cache instead of leaving it on
  device until the next generation
- throttle live thinking-tail rerenders to the 12 Hz repaint rate
  (was O(n^2) splitlines over the full think text per chunk)
- split think markers once per turn and reuse for counts and the
  stored message, dropping the redundant full decode

* refactor(chat): share the live thinking-tail FPS as a class constant

* fix: interrupt cache race condition and parse edge case
@github-actions

Copy link
Copy Markdown

📖 Documentation Preview:

Deployed on Netlify from commit 9953bce

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants