Skip to content

refactor: address CodeRabbit review comments for mixed content JSON p…#2

Closed
thad0ctor wants to merge 2 commits into
mainfrom
fix/mixed-content-qwen3vl
Closed

refactor: address CodeRabbit review comments for mixed content JSON p…#2
thad0ctor wants to merge 2 commits into
mainfrom
fix/mixed-content-qwen3vl

Conversation

@thad0ctor

Copy link
Copy Markdown
Owner

…arsing

  • Improve JSON detection logic to check for proper structure (matching brackets/braces)
  • Rename parameter from 'is_qwen2_vl' to 'deserialize_json_content' for clarity
  • Use axolotl's logging system instead of standard logging module
  • Add more specific exception handling (KeyError, TypeError, ValueError)
  • Preserve all message fields during normalization, not just role and content
  • Add proper spacing in warning messages for readability
  • Add warnings for unsupported loader options (split, streaming)
  • Remove unused kwargs parameter from load_mixed_content_jsonl function

These changes improve code maintainability, error handling, and user experience when working with mixed content datasets for multimodal models.

🤖 Generated with Claude Code

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

thad0ctor and others added 2 commits November 28, 2025 14:18
- Add custom JSON loader to handle mixed content types (string/array)
- Implement multi-image support for Qwen2VLProcessingStrategy
- Normalize content during loading, deserialize during processing
- Add configuration gating via chat_template or mixed_content_messages flag
- Update documentation with mixed content dataset examples
- Add schema support for mixed_content_messages configuration

Fixes PyArrow schema errors when training Qwen3-VL with multimodal datasets
containing both text-only and multi-image messages.
…arsing

- Improve JSON detection logic to check for proper structure (matching brackets/braces)
- Rename parameter from 'is_qwen2_vl' to 'deserialize_json_content' for clarity
- Use axolotl's logging system instead of standard logging module
- Add more specific exception handling (KeyError, TypeError, ValueError)
- Preserve all message fields during normalization, not just role and content
- Add proper spacing in warning messages for readability
- Add warnings for unsupported loader options (split, streaming)
- Remove unused kwargs parameter from load_mixed_content_jsonl function

These changes improve code maintainability, error handling, and user experience
when working with mixed content datasets for multimodal models.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@thad0ctor thad0ctor closed this Apr 24, 2026
thad0ctor added a commit that referenced this pull request May 6, 2026
Two related fixes landing together — both Codex-flagged after the
previous round closed gaps #1 (persistent peak), #2 (T_reduce), and #5
(per-chunk contention). Bundled because both touch the cost-model
search path and add regression tests to the same file.

## Fix 1: searcher fast-path applies cap layering (closes #1 fully)

909fc9e fixed cost/memory.py::estimate_peak to layer the
hot_iter_peak_cap (cap only the activation portion, preserve
model_state_present). The same bug existed in search/exhaustive.py:649
inline fast path — it still applied the cap as a flat clamp, erasing
the Adam state contribution.

Codex synthetic repro: search() returned predicted_peak_bytes=78MB
while estimate_peak() for the same config returned 7.09GB — a 90×
divergence on full-FT shapes.

Extracted shared helper apply_hot_iter_cap(raw_peak, model_state_present,
measured_cap, layout) -> int into cost/memory.py. Used at three sites
to eliminate drift risk:
  - cost/memory.py::estimate_peak (replaces inline 909fc9e code)
  - search/exhaustive.py inline fast path (the Codex bug site)
  - search/exhaustive.py _cap_dominates probe (also was naive)

The _cap_dominates shortcut at line 522 was widening max_sum based on
the false claim "predicted_peak collapses to alpha * hot_cap
independent of (n_persist + n_buffer)." Tightened to use the layered
cap with n_persist=N_chunk worst-case probe — shortcut now activates
only when the layered worst-case cap fits in capacity. LoRA-shape
efficiency win preserved; full-FT shapes correctly exit the shortcut.

New regression test test_search_fast_path_cap_preserves_full_ft_model_state
asserts search()'s predicted_peak_bytes agrees with estimate_peak()'s
output and clears the alpha * model_state_present floor. Falsification
verified: temporarily reverting the inline fix reproduces the exact
74MB-vs-7GB Codex pattern.

## Fix 2: phase-2 measured-wall override gates on n_swap == 0

e8f45fd added the per-chunk timeline-overlap bandwidth model on the
analytical path. But cost/runtime.py:703 (forward) and :820 (backward)
have a phase-2 measured-chunked-wall override that bypasses the
chunk_bw_fwd / chunk_bw_bwd vectors entirely. The phase-2 measurement
was captured at n_swap=0 (verified at profiler/phase2.py:117), so when
an n_swap > 0 candidate is evaluated using the measured wall, the
per-chunk SWAP contention is missed — only the swap-transfer term is
charged.

Fix: AND both override conditions with cfg.n_swap == 0. Phase-2 stays
applicable in the dominant case (no SWAP, hot-iter measured walls);
n_swap > 0 candidates fall through to the analytical path that
already handles per-chunk contention correctly.

Numerics for n_swap == 0 candidates are bit-identical (the analytical
path's per-chunk derate collapses to full bandwidth when n_swap == 0).

New regression test test_phase2_override_routes_n_swap_through_per_chunk_contention
constructs n_swap=0 vs n_swap=N_block configs under a fully-populated
phase-2 trace, asserts the gap exceeds the swap-transfer-only lower
bound (the exact pre-fix value). Falsification verified: temporarily
reverting both gates reproduces gap == swap-transfer-only.

## Test results

41/41 pass in tests/protrain/test_cost_search.py (39 prior + 2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 12, 2026
Five test-quality refinements from CodeRabbit's third-round review.

**R3-#2 — deterministic teardown in test_dora.**

Wrap the DoRA smoke's wrap → train → assert sequence in
``try/finally`` so ``wrapped.close()`` runs even when the
loss-descent assertion fails mid-test. Without this, an early
assertion failure leaves hooks, pinned-host borrows, and CPU
adapter threads alive into subsequent GPU tests on the same
pytest session.

**R3-#3 — distinguish hook edges in test_lora_offload_mode
recording stub.**

The pre-fix ``_RecordingScheduler.ensure_chunks_resident``
recorded every container callback under the same
``"ensure_chunks_resident"`` label. The per-hook tests
(pre_forward / post_forward / post_backward fires
``ensure_chunks_resident``) then asserted only call COUNT — so a
regression that deleted the pre-forward hook factory while
post-forward still fired would still pass the count gates.

Tag each call with its originating hook edge via frame
inspection on the caller's ``co_qualname`` (Python 3.11+
guarantees the qualname captures the enclosing
``_make_lora_container_<edge>_hook`` factory). The four LoRA
container hooks all funnel through the same
``ensure_chunks_resident`` entry point but their closures live
in distinct factory functions, so the qualname uniquely
identifies the edge.

Update each per-hook test to filter on the edge-tagged label so
a regression in any single edge fails the corresponding test:

* pre_forward test: asserts ``ensure_chunks_resident:pre_forward``
  fires ≥ n_blocks times.
* post_forward test: asserts BOTH ``:pre_forward`` AND
  ``:post_forward`` fire ≥ n_containers times each (the previous
  bare ≥ 2*n_containers count was satisfied by either edge alone).
* post_backward test: asserts all four edges (pre/post fwd, pre/
  post bwd) fire ≥ n_containers times each.

The production hook factory layout is unchanged — the stub
recovers the edge from the existing closure's frame, no new
arguments thread through ``install_hooks``.

**R3-#4 — narrow protrain_model_wrapper exception scope in
test_lora_offload_mode:1117.**

The bare ``except (ValueError, RuntimeError)`` was treating any
wrapper failure as "offload setup unavailable" and skipping. A
broken ``protrain_model_wrapper`` runtime path could leave this
smoke green. Restrict the suppression to known env-failure
substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb
load, ``No module named``, and capacity/searcher gates) — same
canonical tuple D8 used at the optimizer-step site below — and
re-raise anything else. Real wrapper regressions now surface.

**R3-#5 — fail-safe CUDA teardown in
test_param_data_shape_preservation.**

Eight test functions in this module construct ``mgr / layout /
pool / host`` via ``_build_chunk_manager`` and tear them down at
the happy-path tail (``mgr.uninstall()`` / ``host.close()`` /
``del pool``). Any earlier assertion failure skipped the
teardown, leaking pinned-host borrows + CUDA buffer-pool state
into subsequent GPU tests.

Add a top-level ``_teardown_chunk_manager(mgr, host, pool)``
helper that does the best-effort 3-call teardown (each call
wrapped in its own try/except so a failure in ``uninstall``
doesn't block the ``host.close``), and wrap each test body in
``try: ... finally: _teardown_chunk_manager(...)``. Done
programmatically across all 8 tests via a one-shot Python
rewrite to keep the diff mechanical and the new structure
consistent.

**R3-#8 — replace hard-coded n_chunk_estimate=1 in
test_trace_skip_on_override.**

The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based
on the assumption that the tiny GPT-2 fixture produces a single
chunk. If the layout heuristics (``pick_S_chunk`` default,
block-discovery rules) shift such that ``N_chunk > 1``,
``min_n_buffer_for(layout, n_persist=1)`` rejects
``n_buffer_override=0`` BEFORE the wrapper reaches the
trace-skip gate the test is supposed to validate — converting
this into a flaky non-target failure.

Compute ``n_chunk_estimate`` dynamically by running the same
``discover_blocks`` → ``flatten_block_trees`` → ``build_layout``
pipeline the wrapper itself uses (with the wrapper's default
S_chunk), and pass the resulting ``layout.N_chunk`` through.
``n_persist_override = n_chunk_estimate`` then keeps the
all-persistent invariant the test relies on regardless of any
future layout-heuristic shift.

``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped
/ 0 failed. GPU-marker sweep on touched files: 40 passed /
2 skipped (single-process Mode-C downgrade for shape-preserving
placeholder paths) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 12, 2026
…inor)

Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak
prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting).

**R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.**

The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The
block wrappers (``block/checkpoint.py``, ``block/swap.py``,
``block/offload.py``) all do ``self.block = block``, which means
PyTorch's ``named_parameters()`` traversal inserts a ``.block.``
infix into the parameter namespace
(``layers.0.attn.q_proj.weight`` ⇒
``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured
in ``model._ddp_params_and_buffers_to_ignore`` no longer match the
namespace DDP's ``__init__`` walks at construction time.

The init-time broadcast is irrelevant (M6C-fix-8's
``init_sync=False`` monkey-patch bypasses it wholesale on
chunk-managed models), but DDP's BACKWARD-pass allreduce still
consults the ignore list. A stale ignore set means DDP's backward
allreduce would attempt to all-reduce chunk-managed LoRA factor
gradients, conflicting with ProTrain's per-chunk ``reduce_scatter``
drain.

Add a post-wrap re-registration step after ``install_hooks`` in
``_construct_runtime``: walk the WRAPPED ``model.named_parameters()``
and identify chunk-managed params by OBJECT identity against
``chunk_manager._params_by_id.values()``. Build the post-wrap name
set, merge with the pre-protrain snapshot
(``_protrain_ddp_original_ignore``), overwrite the attribute.
Gated on ``_shape_preserving`` so the single-GPU / replicated path
remains a no-op.

**R4-#2 (Major) — reuse bootstrap init-transient peak instead of
recomputing post-offload.**

``predict_init_transient_peak_bytes(layout, hw, chunk_manager)``
walks ``chunk_manager.model.named_parameters()`` to sum chunk
bytes. By the time the phase-2 post-measurement calibration runs,
``materialize_offload`` has already executed and ``param.data``
points at the zero-size placeholders (replicated path) or
``scratch.expand(slot.shape)`` views (sharded path), so the
byte accounting drifts away from the bootstrap-time full-residence
prediction.

Replace the recompute call at the phase-2 post-measurement
calibration site with ``boot_result.predicted_init_transient_peak_bytes``
— the bootstrap-time value captured at line 1614 before
materialize_offload ran. The downstream consumers (SearchResult
publish, LOG.info diagnostic) get the same authoritative value
without re-walking a now-stale chunk_manager.

**R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid
CI OOM.**

``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager``
allocated full CPU tensors sized to model 15–60 GiB chunk
totals. ``predict_init_transient_peak_bytes`` only reads
``param.numel() * param.element_size()``, so meta-device tensors
preserve the byte-accounting metadata without consuming RAM.
Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))``
construction to ``nn.Parameter(torch.empty(numel, dtype,
device='meta'), requires_grad=False)``.

**R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.**

``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring
said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses
0.35. Update the two docstring mentions to "±35%" so text matches
intent.

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker sweep: 313 passed / 4
  skipped / 162 deselected / 0 failed.
- GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped
  (single-process Mode-C downgrade — expected) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 12, 2026
…est fixes

Seven Minor items from the CodeRabbit full-diff re-scan on
commit ``55377e5d``.

**F-#2 — Clarify Mode-A guidance in ``protrain_optimizer_wrapper``
8-bit warning (``api/optim_wrapper.py:802-815``).**

The warning told users to set ``protrain_force_all_persistent: true``
to get end-to-end 8-bit AdamW on CPU-resident chunks, but didn't
mention that ``protrain_force_all_persistent`` is ignored while
``protrain_auto_mode`` is on (the auto-mode selector picks the mode
itself based on capacity). Expanded the warning to instruct users
to set ``protrain_auto_mode: false`` AND
``protrain_force_all_persistent: true`` together.

**F-#4 — Unify fragmentation-alpha docs in DESIGN.md.**

Module summaries at lines 49 (``cost/memory.py``) and 118
(``memory.py`` module spec) still described a fixed ``alpha=1.10``
while Design Decision 1 documents the per-dtype lookup
(``ALPHA_FRAGMENTATION_4BIT = 0.75`` for bnb-4-bit). Aligned both
summaries to reference the per-dtype helper
(``alpha_fragmentation_for_dtype``) and the design decision section.

**F-#5 — Resolve ``use_reentrant`` contradiction in DESIGN.md.**

Line 109 (``block/checkpoint.py`` module spec) said
``use_reentrant=False``, which matches the actual implementation
(verified via ``grep`` against ``block/checkpoint.py:99``). Line 290
(audit Block G analysis) claimed ``use_reentrant=True, the
production wrap`` — stale and incorrect. Updated the analysis text
to acknowledge ``use_reentrant=False`` is the production wrap and
re-stated the per-block-input residual mechanism in a form
compatible with the non-reentrant variant (each CKPT block's
saved-tensors-hooks recompute frame holds the block input, which
is what produces the linear-in-N_block activation footprint the
audit data exposes).

**F-#8 — Centralized CUDA-availability guard in
``tests/protrain/test_adamw8bit_adapter.py::_gpu_device``.**

The helper unconditionally returned ``torch.device("cuda:0")``,
so a custom marker filter or conftest override that lands the
module in a CPU-only context would surface as a torch error
before any test body. Added a
``pytest.skip("CUDA not available; ...")`` early-return so every
gpu-marked test in the module gets a clean skip.

**F-#9 — Replace silent ``try/except: pass`` with
``contextlib.suppress(Exception)`` in
``tests/protrain/test_lora_offload_mode.py``.**

Five sites — lines 742-746, 839-843, 906-910, 981-985, 1040-1044
— each had the same ``for h in handles: try: h.remove() except
Exception: pass`` pattern that Ruff S110 flags. Replaced with
``contextlib.suppress(Exception)`` over the loop. Semantics
unchanged (best-effort cleanup, tolerate already-removed handles
or torch shutting down mid-test); intent now documented by the
context manager.

**F-#10 — ASCII ``x`` in ``test_lora_offload_mode.py:1062`` docstring.**

Missed in the R5 unicode sweep — ``4×3090`` ⇒ ``4x3090``.

**F-#11 — ``try/finally`` for ``wrapped.close()`` in 3 sites of
``test_trace_skip_on_override.py``.**

``test_run_trace_skipped_on_override_full_path`` (L255-282),
``test_run_trace_invoked_without_override`` (L319-337), and
``test_partial_overrides_do_not_skip_trace`` (L381-400) each
called ``wrapped.close()`` only on the success path — assertion
failures earlier in the test body would skip the close and leak
CUDA + chunk resources into subsequent GPU tests. Wrapped each
test body in ``try/finally`` so ``wrapped.close()`` always
runs. Done programmatically via a one-shot Python rewrite
(8 lines of new indent + 2 lines of try/finally per site) to
keep the diff mechanical.

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker: 313 passed / 4 skipped /
  162 deselected / 0 failed.
- GPU sanity on F-touched files (GPU 5): 43 passed / 2 skipped /
  0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
Three real issues from the review of 5ce0c15:

1. (HIGH) ProTrain shard save was silently skipped after Accelerate.prepare.
   HF Trainer replaces self.optimizer with AcceleratedOptimizer once
   prepare runs (transformers/trainer.py:1600), and every subsequent
   callback receives the wrapped form. The wrapper exposes the raw
   optimizer at .optimizer (accelerate/optimizer.py:56) but does not
   forward ProTrain's _gpu_optim/_cpu_optim/_chunk_manager attrs. The
   callback's duck-type check failed on the wrapper, returned control
   without writing anything, and protrain_optim/ was never produced
   in real Trainer runs even though direct unit tests passed.

   Fix: add _unwrap_protrain_optim that returns the raw _ProTrainOptimizer
   from either a raw or AcceleratedOptimizer-wrapped input. Use it in
   the callback's on_save and (defensively) in install_load_hook.

2. (HIGH) The save-size guard undercounted offloaded optimizer state.
   _estimate_optim_state_bytes walked the user-facing optim.param_groups
   and summed p.numel() per param. But materialize_offload (manager.py:706,
   :1494) replaces every offloaded param's .data with an empty placeholder
   between training steps, so p.numel() == 0 for offloaded params. For
   7B full-FT this meant the actual ~84 GB state was estimated as ~0
   bytes and the 2 GiB cap was bypassed — the exact silent-large-write
   the gate was meant to prevent.

   Fix: walk each INNER adapter's state dict (_gpu_optim._optim.state and
   every entry in _cpu_optim._optims values) and sum tensor bytes
   directly. Counts exactly what gets pickled, regardless of the
   user-facing param.data state. Pre-first-step inner state is empty
   so the estimate is 0 — correct: there's nothing to save yet.

3. (MEDIUM) Tests didn't exercise the real Trainer path or prove
   restoration. Direct callback tests passed an unwrapped optimizer,
   missing the Accelerate wrapper issue (#1). The "load succeeds"
   test only asserted the function returned True, not that any state
   was actually restored.

   Fixes: three new tests.
   * test_unwrap_protrain_optim_handles_raw_and_wrapped — covers the
     wrap/unwrap matrix with a fake wrapper.
   * test_unwrap_real_accelerated_optimizer — constructs the actual
     accelerate.AcceleratedOptimizer and verifies our unwrap finds
     the raw form. (Initializes Accelerator() to set the singleton.)
   * test_callback_unwraps_accelerated_optimizer — the regression
     test that would have caught #1: hand the callback an
     AcceleratedOptimizer wrapping a real ProTrain optim, assert
     protrain_optim/metadata.json is actually written.
   * test_load_actually_restores_inner_state — snapshot inner state,
     mutate it, load, assert state matches the snapshot bit-identical.
     Strictly stronger than the previous "load returned True" check.
   * test_save_skipped_when_offloaded_state_exceeds_threshold — the
     regression test for #2: outer param_groups have empty placeholders
     (would have summed to 0 under the old estimator) but inner state
     is large; verify the gate trips correctly.

Estimator unit tests rewritten for the new semantic (mock _gpu_optim /
_cpu_optim adapter shape + inner state dicts), replacing the previous
mocks of param_groups.

Fast suite: 146 passed, 2 skipped, 12 deselected (+6 new tests on
top of the prior 140).
7B-LoRA regression guard: 1 passed in 71s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
Two related fixes landing together — both Codex-flagged after the
previous round closed gaps #1 (persistent peak), #2 (T_reduce), and #5
(per-chunk contention). Bundled because both touch the cost-model
search path and add regression tests to the same file.

## Fix 1: searcher fast-path applies cap layering (closes #1 fully)

909fc9e fixed cost/memory.py::estimate_peak to layer the
hot_iter_peak_cap (cap only the activation portion, preserve
model_state_present). The same bug existed in search/exhaustive.py:649
inline fast path — it still applied the cap as a flat clamp, erasing
the Adam state contribution.

Codex synthetic repro: search() returned predicted_peak_bytes=78MB
while estimate_peak() for the same config returned 7.09GB — a 90×
divergence on full-FT shapes.

Extracted shared helper apply_hot_iter_cap(raw_peak, model_state_present,
measured_cap, layout) -> int into cost/memory.py. Used at three sites
to eliminate drift risk:
  - cost/memory.py::estimate_peak (replaces inline 909fc9e code)
  - search/exhaustive.py inline fast path (the Codex bug site)
  - search/exhaustive.py _cap_dominates probe (also was naive)

The _cap_dominates shortcut at line 522 was widening max_sum based on
the false claim "predicted_peak collapses to alpha * hot_cap
independent of (n_persist + n_buffer)." Tightened to use the layered
cap with n_persist=N_chunk worst-case probe — shortcut now activates
only when the layered worst-case cap fits in capacity. LoRA-shape
efficiency win preserved; full-FT shapes correctly exit the shortcut.

New regression test test_search_fast_path_cap_preserves_full_ft_model_state
asserts search()'s predicted_peak_bytes agrees with estimate_peak()'s
output and clears the alpha * model_state_present floor. Falsification
verified: temporarily reverting the inline fix reproduces the exact
74MB-vs-7GB Codex pattern.

## Fix 2: phase-2 measured-wall override gates on n_swap == 0

e8f45fd added the per-chunk timeline-overlap bandwidth model on the
analytical path. But cost/runtime.py:703 (forward) and :820 (backward)
have a phase-2 measured-chunked-wall override that bypasses the
chunk_bw_fwd / chunk_bw_bwd vectors entirely. The phase-2 measurement
was captured at n_swap=0 (verified at profiler/phase2.py:117), so when
an n_swap > 0 candidate is evaluated using the measured wall, the
per-chunk SWAP contention is missed — only the swap-transfer term is
charged.

Fix: AND both override conditions with cfg.n_swap == 0. Phase-2 stays
applicable in the dominant case (no SWAP, hot-iter measured walls);
n_swap > 0 candidates fall through to the analytical path that
already handles per-chunk contention correctly.

Numerics for n_swap == 0 candidates are bit-identical (the analytical
path's per-chunk derate collapses to full bandwidth when n_swap == 0).

New regression test test_phase2_override_routes_n_swap_through_per_chunk_contention
constructs n_swap=0 vs n_swap=N_block configs under a fully-populated
phase-2 trace, asserts the gap exceeds the swap-transfer-only lower
bound (the exact pre-fix value). Falsification verified: temporarily
reverting both gates reproduces gap == swap-transfer-only.

## Test results

41/41 pass in tests/protrain/test_cost_search.py (39 prior + 2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
Five test-quality refinements from CodeRabbit's third-round review.

**R3-#2 — deterministic teardown in test_dora.**

Wrap the DoRA smoke's wrap → train → assert sequence in
``try/finally`` so ``wrapped.close()`` runs even when the
loss-descent assertion fails mid-test. Without this, an early
assertion failure leaves hooks, pinned-host borrows, and CPU
adapter threads alive into subsequent GPU tests on the same
pytest session.

**R3-#3 — distinguish hook edges in test_lora_offload_mode
recording stub.**

The pre-fix ``_RecordingScheduler.ensure_chunks_resident``
recorded every container callback under the same
``"ensure_chunks_resident"`` label. The per-hook tests
(pre_forward / post_forward / post_backward fires
``ensure_chunks_resident``) then asserted only call COUNT — so a
regression that deleted the pre-forward hook factory while
post-forward still fired would still pass the count gates.

Tag each call with its originating hook edge via frame
inspection on the caller's ``co_qualname`` (Python 3.11+
guarantees the qualname captures the enclosing
``_make_lora_container_<edge>_hook`` factory). The four LoRA
container hooks all funnel through the same
``ensure_chunks_resident`` entry point but their closures live
in distinct factory functions, so the qualname uniquely
identifies the edge.

Update each per-hook test to filter on the edge-tagged label so
a regression in any single edge fails the corresponding test:

* pre_forward test: asserts ``ensure_chunks_resident:pre_forward``
  fires ≥ n_blocks times.
* post_forward test: asserts BOTH ``:pre_forward`` AND
  ``:post_forward`` fire ≥ n_containers times each (the previous
  bare ≥ 2*n_containers count was satisfied by either edge alone).
* post_backward test: asserts all four edges (pre/post fwd, pre/
  post bwd) fire ≥ n_containers times each.

The production hook factory layout is unchanged — the stub
recovers the edge from the existing closure's frame, no new
arguments thread through ``install_hooks``.

**R3-#4 — narrow protrain_model_wrapper exception scope in
test_lora_offload_mode:1117.**

The bare ``except (ValueError, RuntimeError)`` was treating any
wrapper failure as "offload setup unavailable" and skipping. A
broken ``protrain_model_wrapper`` runtime path could leave this
smoke green. Restrict the suppression to known env-failure
substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb
load, ``No module named``, and capacity/searcher gates) — same
canonical tuple D8 used at the optimizer-step site below — and
re-raise anything else. Real wrapper regressions now surface.

**R3-#5 — fail-safe CUDA teardown in
test_param_data_shape_preservation.**

Eight test functions in this module construct ``mgr / layout /
pool / host`` via ``_build_chunk_manager`` and tear them down at
the happy-path tail (``mgr.uninstall()`` / ``host.close()`` /
``del pool``). Any earlier assertion failure skipped the
teardown, leaking pinned-host borrows + CUDA buffer-pool state
into subsequent GPU tests.

Add a top-level ``_teardown_chunk_manager(mgr, host, pool)``
helper that does the best-effort 3-call teardown (each call
wrapped in its own try/except so a failure in ``uninstall``
doesn't block the ``host.close``), and wrap each test body in
``try: ... finally: _teardown_chunk_manager(...)``. Done
programmatically across all 8 tests via a one-shot Python
rewrite to keep the diff mechanical and the new structure
consistent.

**R3-#8 — replace hard-coded n_chunk_estimate=1 in
test_trace_skip_on_override.**

The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based
on the assumption that the tiny GPT-2 fixture produces a single
chunk. If the layout heuristics (``pick_S_chunk`` default,
block-discovery rules) shift such that ``N_chunk > 1``,
``min_n_buffer_for(layout, n_persist=1)`` rejects
``n_buffer_override=0`` BEFORE the wrapper reaches the
trace-skip gate the test is supposed to validate — converting
this into a flaky non-target failure.

Compute ``n_chunk_estimate`` dynamically by running the same
``discover_blocks`` → ``flatten_block_trees`` → ``build_layout``
pipeline the wrapper itself uses (with the wrapper's default
S_chunk), and pass the resulting ``layout.N_chunk`` through.
``n_persist_override = n_chunk_estimate`` then keeps the
all-persistent invariant the test relies on regardless of any
future layout-heuristic shift.

``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped
/ 0 failed. GPU-marker sweep on touched files: 40 passed /
2 skipped (single-process Mode-C downgrade for shape-preserving
placeholder paths) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
…inor)

Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak
prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting).

**R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.**

The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The
block wrappers (``block/checkpoint.py``, ``block/swap.py``,
``block/offload.py``) all do ``self.block = block``, which means
PyTorch's ``named_parameters()`` traversal inserts a ``.block.``
infix into the parameter namespace
(``layers.0.attn.q_proj.weight`` ⇒
``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured
in ``model._ddp_params_and_buffers_to_ignore`` no longer match the
namespace DDP's ``__init__`` walks at construction time.

The init-time broadcast is irrelevant (M6C-fix-8's
``init_sync=False`` monkey-patch bypasses it wholesale on
chunk-managed models), but DDP's BACKWARD-pass allreduce still
consults the ignore list. A stale ignore set means DDP's backward
allreduce would attempt to all-reduce chunk-managed LoRA factor
gradients, conflicting with ProTrain's per-chunk ``reduce_scatter``
drain.

Add a post-wrap re-registration step after ``install_hooks`` in
``_construct_runtime``: walk the WRAPPED ``model.named_parameters()``
and identify chunk-managed params by OBJECT identity against
``chunk_manager._params_by_id.values()``. Build the post-wrap name
set, merge with the pre-protrain snapshot
(``_protrain_ddp_original_ignore``), overwrite the attribute.
Gated on ``_shape_preserving`` so the single-GPU / replicated path
remains a no-op.

**R4-#2 (Major) — reuse bootstrap init-transient peak instead of
recomputing post-offload.**

``predict_init_transient_peak_bytes(layout, hw, chunk_manager)``
walks ``chunk_manager.model.named_parameters()`` to sum chunk
bytes. By the time the phase-2 post-measurement calibration runs,
``materialize_offload`` has already executed and ``param.data``
points at the zero-size placeholders (replicated path) or
``scratch.expand(slot.shape)`` views (sharded path), so the
byte accounting drifts away from the bootstrap-time full-residence
prediction.

Replace the recompute call at the phase-2 post-measurement
calibration site with ``boot_result.predicted_init_transient_peak_bytes``
— the bootstrap-time value captured at line 1614 before
materialize_offload ran. The downstream consumers (SearchResult
publish, LOG.info diagnostic) get the same authoritative value
without re-walking a now-stale chunk_manager.

**R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid
CI OOM.**

``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager``
allocated full CPU tensors sized to model 15–60 GiB chunk
totals. ``predict_init_transient_peak_bytes`` only reads
``param.numel() * param.element_size()``, so meta-device tensors
preserve the byte-accounting metadata without consuming RAM.
Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))``
construction to ``nn.Parameter(torch.empty(numel, dtype,
device='meta'), requires_grad=False)``.

**R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.**

``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring
said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses
0.35. Update the two docstring mentions to "±35%" so text matches
intent.

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker sweep: 313 passed / 4
  skipped / 162 deselected / 0 failed.
- GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped
  (single-process Mode-C downgrade — expected) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
…est fixes

Seven Minor items from the CodeRabbit full-diff re-scan on
commit ``55377e5d``.

**F-#2 — Clarify Mode-A guidance in ``protrain_optimizer_wrapper``
8-bit warning (``api/optim_wrapper.py:802-815``).**

The warning told users to set ``protrain_force_all_persistent: true``
to get end-to-end 8-bit AdamW on CPU-resident chunks, but didn't
mention that ``protrain_force_all_persistent`` is ignored while
``protrain_auto_mode`` is on (the auto-mode selector picks the mode
itself based on capacity). Expanded the warning to instruct users
to set ``protrain_auto_mode: false`` AND
``protrain_force_all_persistent: true`` together.

**F-#4 — Unify fragmentation-alpha docs in DESIGN.md.**

Module summaries at lines 49 (``cost/memory.py``) and 118
(``memory.py`` module spec) still described a fixed ``alpha=1.10``
while Design Decision 1 documents the per-dtype lookup
(``ALPHA_FRAGMENTATION_4BIT = 0.75`` for bnb-4-bit). Aligned both
summaries to reference the per-dtype helper
(``alpha_fragmentation_for_dtype``) and the design decision section.

**F-#5 — Resolve ``use_reentrant`` contradiction in DESIGN.md.**

Line 109 (``block/checkpoint.py`` module spec) said
``use_reentrant=False``, which matches the actual implementation
(verified via ``grep`` against ``block/checkpoint.py:99``). Line 290
(audit Block G analysis) claimed ``use_reentrant=True, the
production wrap`` — stale and incorrect. Updated the analysis text
to acknowledge ``use_reentrant=False`` is the production wrap and
re-stated the per-block-input residual mechanism in a form
compatible with the non-reentrant variant (each CKPT block's
saved-tensors-hooks recompute frame holds the block input, which
is what produces the linear-in-N_block activation footprint the
audit data exposes).

**F-#8 — Centralized CUDA-availability guard in
``tests/protrain/test_adamw8bit_adapter.py::_gpu_device``.**

The helper unconditionally returned ``torch.device("cuda:0")``,
so a custom marker filter or conftest override that lands the
module in a CPU-only context would surface as a torch error
before any test body. Added a
``pytest.skip("CUDA not available; ...")`` early-return so every
gpu-marked test in the module gets a clean skip.

**F-#9 — Replace silent ``try/except: pass`` with
``contextlib.suppress(Exception)`` in
``tests/protrain/test_lora_offload_mode.py``.**

Five sites — lines 742-746, 839-843, 906-910, 981-985, 1040-1044
— each had the same ``for h in handles: try: h.remove() except
Exception: pass`` pattern that Ruff S110 flags. Replaced with
``contextlib.suppress(Exception)`` over the loop. Semantics
unchanged (best-effort cleanup, tolerate already-removed handles
or torch shutting down mid-test); intent now documented by the
context manager.

**F-#10 — ASCII ``x`` in ``test_lora_offload_mode.py:1062`` docstring.**

Missed in the R5 unicode sweep — ``4×3090`` ⇒ ``4x3090``.

**F-#11 — ``try/finally`` for ``wrapped.close()`` in 3 sites of
``test_trace_skip_on_override.py``.**

``test_run_trace_skipped_on_override_full_path`` (L255-282),
``test_run_trace_invoked_without_override`` (L319-337), and
``test_partial_overrides_do_not_skip_trace`` (L381-400) each
called ``wrapped.close()`` only on the success path — assertion
failures earlier in the test body would skip the close and leak
CUDA + chunk resources into subsequent GPU tests. Wrapped each
test body in ``try/finally`` so ``wrapped.close()`` always
runs. Done programmatically via a one-shot Python rewrite
(8 lines of new indent + 2 lines of try/finally per site) to
keep the diff mechanical.

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker: 313 passed / 4 skipped /
  162 deselected / 0 failed.
- GPU sanity on F-touched files (GPU 5): 43 passed / 2 skipped /
  0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant