refactor: address CodeRabbit review comments for mixed content JSON p… by thad0ctor · Pull Request #2 · thad0ctor/axolotl

thad0ctor · 2025-12-03T03:58:49Z

…arsing

Improve JSON detection logic to check for proper structure (matching brackets/braces)
Rename parameter from 'is_qwen2_vl' to 'deserialize_json_content' for clarity
Use axolotl's logging system instead of standard logging module
Add more specific exception handling (KeyError, TypeError, ValueError)
Preserve all message fields during normalization, not just role and content
Add proper spacing in warning messages for readability
Add warnings for unsupported loader options (split, streaming)
Remove unused kwargs parameter from load_mixed_content_jsonl function

These changes improve code maintainability, error handling, and user experience when working with mixed content datasets for multimodal models.

🤖 Generated with Claude Code

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

- Add custom JSON loader to handle mixed content types (string/array) - Implement multi-image support for Qwen2VLProcessingStrategy - Normalize content during loading, deserialize during processing - Add configuration gating via chat_template or mixed_content_messages flag - Update documentation with mixed content dataset examples - Add schema support for mixed_content_messages configuration Fixes PyArrow schema errors when training Qwen3-VL with multimodal datasets containing both text-only and multi-image messages.

…arsing - Improve JSON detection logic to check for proper structure (matching brackets/braces) - Rename parameter from 'is_qwen2_vl' to 'deserialize_json_content' for clarity - Use axolotl's logging system instead of standard logging module - Add more specific exception handling (KeyError, TypeError, ValueError) - Preserve all message fields during normalization, not just role and content - Add proper spacing in warning messages for readability - Add warnings for unsupported loader options (split, streaming) - Remove unused kwargs parameter from load_mixed_content_jsonl function These changes improve code maintainability, error handling, and user experience when working with mixed content datasets for multimodal models. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Two related fixes landing together — both Codex-flagged after the previous round closed gaps #1 (persistent peak), #2 (T_reduce), and #5 (per-chunk contention). Bundled because both touch the cost-model search path and add regression tests to the same file. ## Fix 1: searcher fast-path applies cap layering (closes #1 fully) 909fc9e fixed cost/memory.py::estimate_peak to layer the hot_iter_peak_cap (cap only the activation portion, preserve model_state_present). The same bug existed in search/exhaustive.py:649 inline fast path — it still applied the cap as a flat clamp, erasing the Adam state contribution. Codex synthetic repro: search() returned predicted_peak_bytes=78MB while estimate_peak() for the same config returned 7.09GB — a 90× divergence on full-FT shapes. Extracted shared helper apply_hot_iter_cap(raw_peak, model_state_present, measured_cap, layout) -> int into cost/memory.py. Used at three sites to eliminate drift risk: - cost/memory.py::estimate_peak (replaces inline 909fc9e code) - search/exhaustive.py inline fast path (the Codex bug site) - search/exhaustive.py _cap_dominates probe (also was naive) The _cap_dominates shortcut at line 522 was widening max_sum based on the false claim "predicted_peak collapses to alpha * hot_cap independent of (n_persist + n_buffer)." Tightened to use the layered cap with n_persist=N_chunk worst-case probe — shortcut now activates only when the layered worst-case cap fits in capacity. LoRA-shape efficiency win preserved; full-FT shapes correctly exit the shortcut. New regression test test_search_fast_path_cap_preserves_full_ft_model_state asserts search()'s predicted_peak_bytes agrees with estimate_peak()'s output and clears the alpha * model_state_present floor. Falsification verified: temporarily reverting the inline fix reproduces the exact 74MB-vs-7GB Codex pattern. ## Fix 2: phase-2 measured-wall override gates on n_swap == 0 e8f45fd added the per-chunk timeline-overlap bandwidth model on the analytical path. But cost/runtime.py:703 (forward) and :820 (backward) have a phase-2 measured-chunked-wall override that bypasses the chunk_bw_fwd / chunk_bw_bwd vectors entirely. The phase-2 measurement was captured at n_swap=0 (verified at profiler/phase2.py:117), so when an n_swap > 0 candidate is evaluated using the measured wall, the per-chunk SWAP contention is missed — only the swap-transfer term is charged. Fix: AND both override conditions with cfg.n_swap == 0. Phase-2 stays applicable in the dominant case (no SWAP, hot-iter measured walls); n_swap > 0 candidates fall through to the analytical path that already handles per-chunk contention correctly. Numerics for n_swap == 0 candidates are bit-identical (the analytical path's per-chunk derate collapses to full bandwidth when n_swap == 0). New regression test test_phase2_override_routes_n_swap_through_per_chunk_contention constructs n_swap=0 vs n_swap=N_block configs under a fully-populated phase-2 trace, asserts the gap exceeds the swap-transfer-only lower bound (the exact pre-fix value). Falsification verified: temporarily reverting both gates reproduces gap == swap-transfer-only. ## Test results 41/41 pass in tests/protrain/test_cost_search.py (39 prior + 2 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five test-quality refinements from CodeRabbit's third-round review. **R3-#2 — deterministic teardown in test_dora.** Wrap the DoRA smoke's wrap → train → assert sequence in ``try/finally`` so ``wrapped.close()`` runs even when the loss-descent assertion fails mid-test. Without this, an early assertion failure leaves hooks, pinned-host borrows, and CPU adapter threads alive into subsequent GPU tests on the same pytest session. **R3-#3 — distinguish hook edges in test_lora_offload_mode recording stub.** The pre-fix ``_RecordingScheduler.ensure_chunks_resident`` recorded every container callback under the same ``"ensure_chunks_resident"`` label. The per-hook tests (pre_forward / post_forward / post_backward fires ``ensure_chunks_resident``) then asserted only call COUNT — so a regression that deleted the pre-forward hook factory while post-forward still fired would still pass the count gates. Tag each call with its originating hook edge via frame inspection on the caller's ``co_qualname`` (Python 3.11+ guarantees the qualname captures the enclosing ``_make_lora_container_<edge>_hook`` factory). The four LoRA container hooks all funnel through the same ``ensure_chunks_resident`` entry point but their closures live in distinct factory functions, so the qualname uniquely identifies the edge. Update each per-hook test to filter on the edge-tagged label so a regression in any single edge fails the corresponding test: * pre_forward test: asserts ``ensure_chunks_resident:pre_forward`` fires ≥ n_blocks times. * post_forward test: asserts BOTH ``:pre_forward`` AND ``:post_forward`` fire ≥ n_containers times each (the previous bare ≥ 2*n_containers count was satisfied by either edge alone). * post_backward test: asserts all four edges (pre/post fwd, pre/ post bwd) fire ≥ n_containers times each. The production hook factory layout is unchanged — the stub recovers the edge from the existing closure's frame, no new arguments thread through ``install_hooks``. **R3-#4 — narrow protrain_model_wrapper exception scope in test_lora_offload_mode:1117.** The bare ``except (ValueError, RuntimeError)`` was treating any wrapper failure as "offload setup unavailable" and skipping. A broken ``protrain_model_wrapper`` runtime path could leave this smoke green. Restrict the suppression to known env-failure substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb load, ``No module named``, and capacity/searcher gates) — same canonical tuple D8 used at the optimizer-step site below — and re-raise anything else. Real wrapper regressions now surface. **R3-#5 — fail-safe CUDA teardown in test_param_data_shape_preservation.** Eight test functions in this module construct ``mgr / layout / pool / host`` via ``_build_chunk_manager`` and tear them down at the happy-path tail (``mgr.uninstall()`` / ``host.close()`` / ``del pool``). Any earlier assertion failure skipped the teardown, leaking pinned-host borrows + CUDA buffer-pool state into subsequent GPU tests. Add a top-level ``_teardown_chunk_manager(mgr, host, pool)`` helper that does the best-effort 3-call teardown (each call wrapped in its own try/except so a failure in ``uninstall`` doesn't block the ``host.close``), and wrap each test body in ``try: ... finally: _teardown_chunk_manager(...)``. Done programmatically across all 8 tests via a one-shot Python rewrite to keep the diff mechanical and the new structure consistent. **R3-#8 — replace hard-coded n_chunk_estimate=1 in test_trace_skip_on_override.** The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based on the assumption that the tiny GPT-2 fixture produces a single chunk. If the layout heuristics (``pick_S_chunk`` default, block-discovery rules) shift such that ``N_chunk > 1``, ``min_n_buffer_for(layout, n_persist=1)`` rejects ``n_buffer_override=0`` BEFORE the wrapper reaches the trace-skip gate the test is supposed to validate — converting this into a flaky non-target failure. Compute ``n_chunk_estimate`` dynamically by running the same ``discover_blocks`` → ``flatten_block_trees`` → ``build_layout`` pipeline the wrapper itself uses (with the wrapper's default S_chunk), and pass the resulting ``layout.N_chunk`` through. ``n_persist_override = n_chunk_estimate`` then keeps the all-persistent invariant the test relies on regardless of any future layout-heuristic shift. ``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped / 0 failed. GPU-marker sweep on touched files: 40 passed / 2 skipped (single-process Mode-C downgrade for shape-preserving placeholder paths) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inor) Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting). **R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.** The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The block wrappers (``block/checkpoint.py``, ``block/swap.py``, ``block/offload.py``) all do ``self.block = block``, which means PyTorch's ``named_parameters()`` traversal inserts a ``.block.`` infix into the parameter namespace (``layers.0.attn.q_proj.weight`` ⇒ ``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured in ``model._ddp_params_and_buffers_to_ignore`` no longer match the namespace DDP's ``__init__`` walks at construction time. The init-time broadcast is irrelevant (M6C-fix-8's ``init_sync=False`` monkey-patch bypasses it wholesale on chunk-managed models), but DDP's BACKWARD-pass allreduce still consults the ignore list. A stale ignore set means DDP's backward allreduce would attempt to all-reduce chunk-managed LoRA factor gradients, conflicting with ProTrain's per-chunk ``reduce_scatter`` drain. Add a post-wrap re-registration step after ``install_hooks`` in ``_construct_runtime``: walk the WRAPPED ``model.named_parameters()`` and identify chunk-managed params by OBJECT identity against ``chunk_manager._params_by_id.values()``. Build the post-wrap name set, merge with the pre-protrain snapshot (``_protrain_ddp_original_ignore``), overwrite the attribute. Gated on ``_shape_preserving`` so the single-GPU / replicated path remains a no-op. **R4-#2 (Major) — reuse bootstrap init-transient peak instead of recomputing post-offload.** ``predict_init_transient_peak_bytes(layout, hw, chunk_manager)`` walks ``chunk_manager.model.named_parameters()`` to sum chunk bytes. By the time the phase-2 post-measurement calibration runs, ``materialize_offload`` has already executed and ``param.data`` points at the zero-size placeholders (replicated path) or ``scratch.expand(slot.shape)`` views (sharded path), so the byte accounting drifts away from the bootstrap-time full-residence prediction. Replace the recompute call at the phase-2 post-measurement calibration site with ``boot_result.predicted_init_transient_peak_bytes`` — the bootstrap-time value captured at line 1614 before materialize_offload ran. The downstream consumers (SearchResult publish, LOG.info diagnostic) get the same authoritative value without re-walking a now-stale chunk_manager. **R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid CI OOM.** ``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager`` allocated full CPU tensors sized to model 15–60 GiB chunk totals. ``predict_init_transient_peak_bytes`` only reads ``param.numel() * param.element_size()``, so meta-device tensors preserve the byte-accounting metadata without consuming RAM. Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))`` construction to ``nn.Parameter(torch.empty(numel, dtype, device='meta'), requires_grad=False)``. **R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.** ``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses 0.35. Update the two docstring mentions to "±35%" so text matches intent. ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped / 162 deselected / 0 failed. - GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped (single-process Mode-C downgrade — expected) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…est fixes Seven Minor items from the CodeRabbit full-diff re-scan on commit ``55377e5d``. **F-#2 — Clarify Mode-A guidance in ``protrain_optimizer_wrapper`` 8-bit warning (``api/optim_wrapper.py:802-815``).** The warning told users to set ``protrain_force_all_persistent: true`` to get end-to-end 8-bit AdamW on CPU-resident chunks, but didn't mention that ``protrain_force_all_persistent`` is ignored while ``protrain_auto_mode`` is on (the auto-mode selector picks the mode itself based on capacity). Expanded the warning to instruct users to set ``protrain_auto_mode: false`` AND ``protrain_force_all_persistent: true`` together. **F-#4 — Unify fragmentation-alpha docs in DESIGN.md.** Module summaries at lines 49 (``cost/memory.py``) and 118 (``memory.py`` module spec) still described a fixed ``alpha=1.10`` while Design Decision 1 documents the per-dtype lookup (``ALPHA_FRAGMENTATION_4BIT = 0.75`` for bnb-4-bit). Aligned both summaries to reference the per-dtype helper (``alpha_fragmentation_for_dtype``) and the design decision section. **F-#5 — Resolve ``use_reentrant`` contradiction in DESIGN.md.** Line 109 (``block/checkpoint.py`` module spec) said ``use_reentrant=False``, which matches the actual implementation (verified via ``grep`` against ``block/checkpoint.py:99``). Line 290 (audit Block G analysis) claimed ``use_reentrant=True, the production wrap`` — stale and incorrect. Updated the analysis text to acknowledge ``use_reentrant=False`` is the production wrap and re-stated the per-block-input residual mechanism in a form compatible with the non-reentrant variant (each CKPT block's saved-tensors-hooks recompute frame holds the block input, which is what produces the linear-in-N_block activation footprint the audit data exposes). **F-#8 — Centralized CUDA-availability guard in ``tests/protrain/test_adamw8bit_adapter.py::_gpu_device``.** The helper unconditionally returned ``torch.device("cuda:0")``, so a custom marker filter or conftest override that lands the module in a CPU-only context would surface as a torch error before any test body. Added a ``pytest.skip("CUDA not available; ...")`` early-return so every gpu-marked test in the module gets a clean skip. **F-#9 — Replace silent ``try/except: pass`` with ``contextlib.suppress(Exception)`` in ``tests/protrain/test_lora_offload_mode.py``.** Five sites — lines 742-746, 839-843, 906-910, 981-985, 1040-1044 — each had the same ``for h in handles: try: h.remove() except Exception: pass`` pattern that Ruff S110 flags. Replaced with ``contextlib.suppress(Exception)`` over the loop. Semantics unchanged (best-effort cleanup, tolerate already-removed handles or torch shutting down mid-test); intent now documented by the context manager. **F-#10 — ASCII ``x`` in ``test_lora_offload_mode.py:1062`` docstring.** Missed in the R5 unicode sweep — ``4×3090`` ⇒ ``4x3090``. **F-#11 — ``try/finally`` for ``wrapped.close()`` in 3 sites of ``test_trace_skip_on_override.py``.** ``test_run_trace_skipped_on_override_full_path`` (L255-282), ``test_run_trace_invoked_without_override`` (L319-337), and ``test_partial_overrides_do_not_skip_trace`` (L381-400) each called ``wrapped.close()`` only on the success path — assertion failures earlier in the test body would skip the close and leak CUDA + chunk resources into subsequent GPU tests. Wrapped each test body in ``try/finally`` so ``wrapped.close()`` always runs. Done programmatically via a one-shot Python rewrite (8 lines of new indent + 2 lines of try/finally per site) to keep the diff mechanical. ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker: 313 passed / 4 skipped / 162 deselected / 0 failed. - GPU sanity on F-touched files (GPU 5): 43 passed / 2 skipped / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three real issues from the review of 5ce0c15: 1. (HIGH) ProTrain shard save was silently skipped after Accelerate.prepare. HF Trainer replaces self.optimizer with AcceleratedOptimizer once prepare runs (transformers/trainer.py:1600), and every subsequent callback receives the wrapped form. The wrapper exposes the raw optimizer at .optimizer (accelerate/optimizer.py:56) but does not forward ProTrain's _gpu_optim/_cpu_optim/_chunk_manager attrs. The callback's duck-type check failed on the wrapper, returned control without writing anything, and protrain_optim/ was never produced in real Trainer runs even though direct unit tests passed. Fix: add _unwrap_protrain_optim that returns the raw _ProTrainOptimizer from either a raw or AcceleratedOptimizer-wrapped input. Use it in the callback's on_save and (defensively) in install_load_hook. 2. (HIGH) The save-size guard undercounted offloaded optimizer state. _estimate_optim_state_bytes walked the user-facing optim.param_groups and summed p.numel() per param. But materialize_offload (manager.py:706, :1494) replaces every offloaded param's .data with an empty placeholder between training steps, so p.numel() == 0 for offloaded params. For 7B full-FT this meant the actual ~84 GB state was estimated as ~0 bytes and the 2 GiB cap was bypassed — the exact silent-large-write the gate was meant to prevent. Fix: walk each INNER adapter's state dict (_gpu_optim._optim.state and every entry in _cpu_optim._optims values) and sum tensor bytes directly. Counts exactly what gets pickled, regardless of the user-facing param.data state. Pre-first-step inner state is empty so the estimate is 0 — correct: there's nothing to save yet. 3. (MEDIUM) Tests didn't exercise the real Trainer path or prove restoration. Direct callback tests passed an unwrapped optimizer, missing the Accelerate wrapper issue (#1). The "load succeeds" test only asserted the function returned True, not that any state was actually restored. Fixes: three new tests. * test_unwrap_protrain_optim_handles_raw_and_wrapped — covers the wrap/unwrap matrix with a fake wrapper. * test_unwrap_real_accelerated_optimizer — constructs the actual accelerate.AcceleratedOptimizer and verifies our unwrap finds the raw form. (Initializes Accelerator() to set the singleton.) * test_callback_unwraps_accelerated_optimizer — the regression test that would have caught #1: hand the callback an AcceleratedOptimizer wrapping a real ProTrain optim, assert protrain_optim/metadata.json is actually written. * test_load_actually_restores_inner_state — snapshot inner state, mutate it, load, assert state matches the snapshot bit-identical. Strictly stronger than the previous "load returned True" check. * test_save_skipped_when_offloaded_state_exceeds_threshold — the regression test for #2: outer param_groups have empty placeholders (would have summed to 0 under the old estimator) but inner state is large; verify the gate trips correctly. Estimator unit tests rewritten for the new semantic (mock _gpu_optim / _cpu_optim adapter shape + inner state dicts), replacing the previous mocks of param_groups. Fast suite: 146 passed, 2 skipped, 12 deselected (+6 new tests on top of the prior 140). 7B-LoRA regression guard: 1 passed in 71s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related fixes landing together — both Codex-flagged after the previous round closed gaps #1 (persistent peak), #2 (T_reduce), and #5 (per-chunk contention). Bundled because both touch the cost-model search path and add regression tests to the same file. ## Fix 1: searcher fast-path applies cap layering (closes #1 fully) 909fc9e fixed cost/memory.py::estimate_peak to layer the hot_iter_peak_cap (cap only the activation portion, preserve model_state_present). The same bug existed in search/exhaustive.py:649 inline fast path — it still applied the cap as a flat clamp, erasing the Adam state contribution. Codex synthetic repro: search() returned predicted_peak_bytes=78MB while estimate_peak() for the same config returned 7.09GB — a 90× divergence on full-FT shapes. Extracted shared helper apply_hot_iter_cap(raw_peak, model_state_present, measured_cap, layout) -> int into cost/memory.py. Used at three sites to eliminate drift risk: - cost/memory.py::estimate_peak (replaces inline 909fc9e code) - search/exhaustive.py inline fast path (the Codex bug site) - search/exhaustive.py _cap_dominates probe (also was naive) The _cap_dominates shortcut at line 522 was widening max_sum based on the false claim "predicted_peak collapses to alpha * hot_cap independent of (n_persist + n_buffer)." Tightened to use the layered cap with n_persist=N_chunk worst-case probe — shortcut now activates only when the layered worst-case cap fits in capacity. LoRA-shape efficiency win preserved; full-FT shapes correctly exit the shortcut. New regression test test_search_fast_path_cap_preserves_full_ft_model_state asserts search()'s predicted_peak_bytes agrees with estimate_peak()'s output and clears the alpha * model_state_present floor. Falsification verified: temporarily reverting the inline fix reproduces the exact 74MB-vs-7GB Codex pattern. ## Fix 2: phase-2 measured-wall override gates on n_swap == 0 e8f45fd added the per-chunk timeline-overlap bandwidth model on the analytical path. But cost/runtime.py:703 (forward) and :820 (backward) have a phase-2 measured-chunked-wall override that bypasses the chunk_bw_fwd / chunk_bw_bwd vectors entirely. The phase-2 measurement was captured at n_swap=0 (verified at profiler/phase2.py:117), so when an n_swap > 0 candidate is evaluated using the measured wall, the per-chunk SWAP contention is missed — only the swap-transfer term is charged. Fix: AND both override conditions with cfg.n_swap == 0. Phase-2 stays applicable in the dominant case (no SWAP, hot-iter measured walls); n_swap > 0 candidates fall through to the analytical path that already handles per-chunk contention correctly. Numerics for n_swap == 0 candidates are bit-identical (the analytical path's per-chunk derate collapses to full bandwidth when n_swap == 0). New regression test test_phase2_override_routes_n_swap_through_per_chunk_contention constructs n_swap=0 vs n_swap=N_block configs under a fully-populated phase-2 trace, asserts the gap exceeds the swap-transfer-only lower bound (the exact pre-fix value). Falsification verified: temporarily reverting both gates reproduces gap == swap-transfer-only. ## Test results 41/41 pass in tests/protrain/test_cost_search.py (39 prior + 2 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five test-quality refinements from CodeRabbit's third-round review. **R3-#2 — deterministic teardown in test_dora.** Wrap the DoRA smoke's wrap → train → assert sequence in ``try/finally`` so ``wrapped.close()`` runs even when the loss-descent assertion fails mid-test. Without this, an early assertion failure leaves hooks, pinned-host borrows, and CPU adapter threads alive into subsequent GPU tests on the same pytest session. **R3-#3 — distinguish hook edges in test_lora_offload_mode recording stub.** The pre-fix ``_RecordingScheduler.ensure_chunks_resident`` recorded every container callback under the same ``"ensure_chunks_resident"`` label. The per-hook tests (pre_forward / post_forward / post_backward fires ``ensure_chunks_resident``) then asserted only call COUNT — so a regression that deleted the pre-forward hook factory while post-forward still fired would still pass the count gates. Tag each call with its originating hook edge via frame inspection on the caller's ``co_qualname`` (Python 3.11+ guarantees the qualname captures the enclosing ``_make_lora_container_<edge>_hook`` factory). The four LoRA container hooks all funnel through the same ``ensure_chunks_resident`` entry point but their closures live in distinct factory functions, so the qualname uniquely identifies the edge. Update each per-hook test to filter on the edge-tagged label so a regression in any single edge fails the corresponding test: * pre_forward test: asserts ``ensure_chunks_resident:pre_forward`` fires ≥ n_blocks times. * post_forward test: asserts BOTH ``:pre_forward`` AND ``:post_forward`` fire ≥ n_containers times each (the previous bare ≥ 2*n_containers count was satisfied by either edge alone). * post_backward test: asserts all four edges (pre/post fwd, pre/ post bwd) fire ≥ n_containers times each. The production hook factory layout is unchanged — the stub recovers the edge from the existing closure's frame, no new arguments thread through ``install_hooks``. **R3-#4 — narrow protrain_model_wrapper exception scope in test_lora_offload_mode:1117.** The bare ``except (ValueError, RuntimeError)`` was treating any wrapper failure as "offload setup unavailable" and skipping. A broken ``protrain_model_wrapper`` runtime path could leave this smoke green. Restrict the suppression to known env-failure substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb load, ``No module named``, and capacity/searcher gates) — same canonical tuple D8 used at the optimizer-step site below — and re-raise anything else. Real wrapper regressions now surface. **R3-#5 — fail-safe CUDA teardown in test_param_data_shape_preservation.** Eight test functions in this module construct ``mgr / layout / pool / host`` via ``_build_chunk_manager`` and tear them down at the happy-path tail (``mgr.uninstall()`` / ``host.close()`` / ``del pool``). Any earlier assertion failure skipped the teardown, leaking pinned-host borrows + CUDA buffer-pool state into subsequent GPU tests. Add a top-level ``_teardown_chunk_manager(mgr, host, pool)`` helper that does the best-effort 3-call teardown (each call wrapped in its own try/except so a failure in ``uninstall`` doesn't block the ``host.close``), and wrap each test body in ``try: ... finally: _teardown_chunk_manager(...)``. Done programmatically across all 8 tests via a one-shot Python rewrite to keep the diff mechanical and the new structure consistent. **R3-#8 — replace hard-coded n_chunk_estimate=1 in test_trace_skip_on_override.** The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based on the assumption that the tiny GPT-2 fixture produces a single chunk. If the layout heuristics (``pick_S_chunk`` default, block-discovery rules) shift such that ``N_chunk > 1``, ``min_n_buffer_for(layout, n_persist=1)`` rejects ``n_buffer_override=0`` BEFORE the wrapper reaches the trace-skip gate the test is supposed to validate — converting this into a flaky non-target failure. Compute ``n_chunk_estimate`` dynamically by running the same ``discover_blocks`` → ``flatten_block_trees`` → ``build_layout`` pipeline the wrapper itself uses (with the wrapper's default S_chunk), and pass the resulting ``layout.N_chunk`` through. ``n_persist_override = n_chunk_estimate`` then keeps the all-persistent invariant the test relies on regardless of any future layout-heuristic shift. ``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped / 0 failed. GPU-marker sweep on touched files: 40 passed / 2 skipped (single-process Mode-C downgrade for shape-preserving placeholder paths) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inor) Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting). **R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.** The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The block wrappers (``block/checkpoint.py``, ``block/swap.py``, ``block/offload.py``) all do ``self.block = block``, which means PyTorch's ``named_parameters()`` traversal inserts a ``.block.`` infix into the parameter namespace (``layers.0.attn.q_proj.weight`` ⇒ ``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured in ``model._ddp_params_and_buffers_to_ignore`` no longer match the namespace DDP's ``__init__`` walks at construction time. The init-time broadcast is irrelevant (M6C-fix-8's ``init_sync=False`` monkey-patch bypasses it wholesale on chunk-managed models), but DDP's BACKWARD-pass allreduce still consults the ignore list. A stale ignore set means DDP's backward allreduce would attempt to all-reduce chunk-managed LoRA factor gradients, conflicting with ProTrain's per-chunk ``reduce_scatter`` drain. Add a post-wrap re-registration step after ``install_hooks`` in ``_construct_runtime``: walk the WRAPPED ``model.named_parameters()`` and identify chunk-managed params by OBJECT identity against ``chunk_manager._params_by_id.values()``. Build the post-wrap name set, merge with the pre-protrain snapshot (``_protrain_ddp_original_ignore``), overwrite the attribute. Gated on ``_shape_preserving`` so the single-GPU / replicated path remains a no-op. **R4-#2 (Major) — reuse bootstrap init-transient peak instead of recomputing post-offload.** ``predict_init_transient_peak_bytes(layout, hw, chunk_manager)`` walks ``chunk_manager.model.named_parameters()`` to sum chunk bytes. By the time the phase-2 post-measurement calibration runs, ``materialize_offload`` has already executed and ``param.data`` points at the zero-size placeholders (replicated path) or ``scratch.expand(slot.shape)`` views (sharded path), so the byte accounting drifts away from the bootstrap-time full-residence prediction. Replace the recompute call at the phase-2 post-measurement calibration site with ``boot_result.predicted_init_transient_peak_bytes`` — the bootstrap-time value captured at line 1614 before materialize_offload ran. The downstream consumers (SearchResult publish, LOG.info diagnostic) get the same authoritative value without re-walking a now-stale chunk_manager. **R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid CI OOM.** ``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager`` allocated full CPU tensors sized to model 15–60 GiB chunk totals. ``predict_init_transient_peak_bytes`` only reads ``param.numel() * param.element_size()``, so meta-device tensors preserve the byte-accounting metadata without consuming RAM. Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))`` construction to ``nn.Parameter(torch.empty(numel, dtype, device='meta'), requires_grad=False)``. **R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.** ``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses 0.35. Update the two docstring mentions to "±35%" so text matches intent. ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped / 162 deselected / 0 failed. - GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped (single-process Mode-C downgrade — expected) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…est fixes Seven Minor items from the CodeRabbit full-diff re-scan on commit ``55377e5d``. **F-#2 — Clarify Mode-A guidance in ``protrain_optimizer_wrapper`` 8-bit warning (``api/optim_wrapper.py:802-815``).** The warning told users to set ``protrain_force_all_persistent: true`` to get end-to-end 8-bit AdamW on CPU-resident chunks, but didn't mention that ``protrain_force_all_persistent`` is ignored while ``protrain_auto_mode`` is on (the auto-mode selector picks the mode itself based on capacity). Expanded the warning to instruct users to set ``protrain_auto_mode: false`` AND ``protrain_force_all_persistent: true`` together. **F-#4 — Unify fragmentation-alpha docs in DESIGN.md.** Module summaries at lines 49 (``cost/memory.py``) and 118 (``memory.py`` module spec) still described a fixed ``alpha=1.10`` while Design Decision 1 documents the per-dtype lookup (``ALPHA_FRAGMENTATION_4BIT = 0.75`` for bnb-4-bit). Aligned both summaries to reference the per-dtype helper (``alpha_fragmentation_for_dtype``) and the design decision section. **F-#5 — Resolve ``use_reentrant`` contradiction in DESIGN.md.** Line 109 (``block/checkpoint.py`` module spec) said ``use_reentrant=False``, which matches the actual implementation (verified via ``grep`` against ``block/checkpoint.py:99``). Line 290 (audit Block G analysis) claimed ``use_reentrant=True, the production wrap`` — stale and incorrect. Updated the analysis text to acknowledge ``use_reentrant=False`` is the production wrap and re-stated the per-block-input residual mechanism in a form compatible with the non-reentrant variant (each CKPT block's saved-tensors-hooks recompute frame holds the block input, which is what produces the linear-in-N_block activation footprint the audit data exposes). **F-#8 — Centralized CUDA-availability guard in ``tests/protrain/test_adamw8bit_adapter.py::_gpu_device``.** The helper unconditionally returned ``torch.device("cuda:0")``, so a custom marker filter or conftest override that lands the module in a CPU-only context would surface as a torch error before any test body. Added a ``pytest.skip("CUDA not available; ...")`` early-return so every gpu-marked test in the module gets a clean skip. **F-#9 — Replace silent ``try/except: pass`` with ``contextlib.suppress(Exception)`` in ``tests/protrain/test_lora_offload_mode.py``.** Five sites — lines 742-746, 839-843, 906-910, 981-985, 1040-1044 — each had the same ``for h in handles: try: h.remove() except Exception: pass`` pattern that Ruff S110 flags. Replaced with ``contextlib.suppress(Exception)`` over the loop. Semantics unchanged (best-effort cleanup, tolerate already-removed handles or torch shutting down mid-test); intent now documented by the context manager. **F-#10 — ASCII ``x`` in ``test_lora_offload_mode.py:1062`` docstring.** Missed in the R5 unicode sweep — ``4×3090`` ⇒ ``4x3090``. **F-#11 — ``try/finally`` for ``wrapped.close()`` in 3 sites of ``test_trace_skip_on_override.py``.** ``test_run_trace_skipped_on_override_full_path`` (L255-282), ``test_run_trace_invoked_without_override`` (L319-337), and ``test_partial_overrides_do_not_skip_trace`` (L381-400) each called ``wrapped.close()`` only on the success path — assertion failures earlier in the test body would skip the close and leak CUDA + chunk resources into subsequent GPU tests. Wrapped each test body in ``try/finally`` so ``wrapped.close()`` always runs. Done programmatically via a one-shot Python rewrite (8 lines of new indent + 2 lines of try/finally per site) to keep the diff mechanical. ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker: 313 passed / 4 skipped / 162 deselected / 0 failed. - GPU sanity on F-touched files (GPU 5): 43 passed / 2 skipped / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

thad0ctor and others added 2 commits November 28, 2025 14:18

thad0ctor force-pushed the main branch from 48b98ee to 7420fd4 Compare April 22, 2026 22:42

thad0ctor closed this Apr 24, 2026

thad0ctor mentioned this pull request May 6, 2026

feat(protrain): close paper-fidelity gaps from Codex audit (15 commits) #19

Closed

6 tasks

thad0ctor mentioned this pull request May 12, 2026

Phase 2: ProTrain integration with Axolotl perf features (M0–M6C closed) #21

Closed

5 tasks

This was referenced Jun 1, 2026

fix(protrain): auto-offload memory-bound VL models; compose lora_mlp_kernel with offload #39

Merged

NVFP4: attention perf wins + flag consolidation #42

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: address CodeRabbit review comments for mixed content JSON p…#2

refactor: address CodeRabbit review comments for mixed content JSON p…#2
thad0ctor wants to merge 2 commits into
mainfrom
fix/mixed-content-qwen3vl

thad0ctor commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thad0ctor commented Dec 3, 2025

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant