fix: add support for SIZE_BASED_WRAP with min_num_params in FSDP2 by thad0ctor · Pull Request #3 · thad0ctor/axolotl

thad0ctor · 2025-12-03T03:59:17Z

Add min_num_params field to FSDPConfig schema
Set FSDP_MIN_NUM_PARAMS environment variable for SIZE_BASED_WRAP
Add patch for fsdp2_apply_ac to handle None auto_wrap_policy
Default to 100M parameters when min_num_params not specified

This fixes TypeError when using SIZE_BASED_WRAP without proper configuration

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

- Add min_num_params field to FSDPConfig schema - Set FSDP_MIN_NUM_PARAMS environment variable for SIZE_BASED_WRAP - Add patch for fsdp2_apply_ac to handle None auto_wrap_policy - Default to 100M parameters when min_num_params not specified This fixes TypeError when using SIZE_BASED_WRAP without proper configuration

Five test-quality refinements from CodeRabbit's third-round review. **R3-#2 — deterministic teardown in test_dora.** Wrap the DoRA smoke's wrap → train → assert sequence in ``try/finally`` so ``wrapped.close()`` runs even when the loss-descent assertion fails mid-test. Without this, an early assertion failure leaves hooks, pinned-host borrows, and CPU adapter threads alive into subsequent GPU tests on the same pytest session. **R3-#3 — distinguish hook edges in test_lora_offload_mode recording stub.** The pre-fix ``_RecordingScheduler.ensure_chunks_resident`` recorded every container callback under the same ``"ensure_chunks_resident"`` label. The per-hook tests (pre_forward / post_forward / post_backward fires ``ensure_chunks_resident``) then asserted only call COUNT — so a regression that deleted the pre-forward hook factory while post-forward still fired would still pass the count gates. Tag each call with its originating hook edge via frame inspection on the caller's ``co_qualname`` (Python 3.11+ guarantees the qualname captures the enclosing ``_make_lora_container_<edge>_hook`` factory). The four LoRA container hooks all funnel through the same ``ensure_chunks_resident`` entry point but their closures live in distinct factory functions, so the qualname uniquely identifies the edge. Update each per-hook test to filter on the edge-tagged label so a regression in any single edge fails the corresponding test: * pre_forward test: asserts ``ensure_chunks_resident:pre_forward`` fires ≥ n_blocks times. * post_forward test: asserts BOTH ``:pre_forward`` AND ``:post_forward`` fire ≥ n_containers times each (the previous bare ≥ 2*n_containers count was satisfied by either edge alone). * post_backward test: asserts all four edges (pre/post fwd, pre/ post bwd) fire ≥ n_containers times each. The production hook factory layout is unchanged — the stub recovers the edge from the existing closure's frame, no new arguments thread through ``install_hooks``. **R3-#4 — narrow protrain_model_wrapper exception scope in test_lora_offload_mode:1117.** The bare ``except (ValueError, RuntimeError)`` was treating any wrapper failure as "offload setup unavailable" and skipping. A broken ``protrain_model_wrapper`` runtime path could leave this smoke green. Restrict the suppression to known env-failure substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb load, ``No module named``, and capacity/searcher gates) — same canonical tuple D8 used at the optimizer-step site below — and re-raise anything else. Real wrapper regressions now surface. **R3-#5 — fail-safe CUDA teardown in test_param_data_shape_preservation.** Eight test functions in this module construct ``mgr / layout / pool / host`` via ``_build_chunk_manager`` and tear them down at the happy-path tail (``mgr.uninstall()`` / ``host.close()`` / ``del pool``). Any earlier assertion failure skipped the teardown, leaking pinned-host borrows + CUDA buffer-pool state into subsequent GPU tests. Add a top-level ``_teardown_chunk_manager(mgr, host, pool)`` helper that does the best-effort 3-call teardown (each call wrapped in its own try/except so a failure in ``uninstall`` doesn't block the ``host.close``), and wrap each test body in ``try: ... finally: _teardown_chunk_manager(...)``. Done programmatically across all 8 tests via a one-shot Python rewrite to keep the diff mechanical and the new structure consistent. **R3-#8 — replace hard-coded n_chunk_estimate=1 in test_trace_skip_on_override.** The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based on the assumption that the tiny GPT-2 fixture produces a single chunk. If the layout heuristics (``pick_S_chunk`` default, block-discovery rules) shift such that ``N_chunk > 1``, ``min_n_buffer_for(layout, n_persist=1)`` rejects ``n_buffer_override=0`` BEFORE the wrapper reaches the trace-skip gate the test is supposed to validate — converting this into a flaky non-target failure. Compute ``n_chunk_estimate`` dynamically by running the same ``discover_blocks`` → ``flatten_block_trees`` → ``build_layout`` pipeline the wrapper itself uses (with the wrapper's default S_chunk), and pass the resulting ``layout.N_chunk`` through. ``n_persist_override = n_chunk_estimate`` then keeps the all-persistent invariant the test relies on regardless of any future layout-heuristic shift. ``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped / 0 failed. GPU-marker sweep on touched files: 40 passed / 2 skipped (single-process Mode-C downgrade for shape-preserving placeholder paths) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inor) Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting). **R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.** The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The block wrappers (``block/checkpoint.py``, ``block/swap.py``, ``block/offload.py``) all do ``self.block = block``, which means PyTorch's ``named_parameters()`` traversal inserts a ``.block.`` infix into the parameter namespace (``layers.0.attn.q_proj.weight`` ⇒ ``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured in ``model._ddp_params_and_buffers_to_ignore`` no longer match the namespace DDP's ``__init__`` walks at construction time. The init-time broadcast is irrelevant (M6C-fix-8's ``init_sync=False`` monkey-patch bypasses it wholesale on chunk-managed models), but DDP's BACKWARD-pass allreduce still consults the ignore list. A stale ignore set means DDP's backward allreduce would attempt to all-reduce chunk-managed LoRA factor gradients, conflicting with ProTrain's per-chunk ``reduce_scatter`` drain. Add a post-wrap re-registration step after ``install_hooks`` in ``_construct_runtime``: walk the WRAPPED ``model.named_parameters()`` and identify chunk-managed params by OBJECT identity against ``chunk_manager._params_by_id.values()``. Build the post-wrap name set, merge with the pre-protrain snapshot (``_protrain_ddp_original_ignore``), overwrite the attribute. Gated on ``_shape_preserving`` so the single-GPU / replicated path remains a no-op. **R4-#2 (Major) — reuse bootstrap init-transient peak instead of recomputing post-offload.** ``predict_init_transient_peak_bytes(layout, hw, chunk_manager)`` walks ``chunk_manager.model.named_parameters()`` to sum chunk bytes. By the time the phase-2 post-measurement calibration runs, ``materialize_offload`` has already executed and ``param.data`` points at the zero-size placeholders (replicated path) or ``scratch.expand(slot.shape)`` views (sharded path), so the byte accounting drifts away from the bootstrap-time full-residence prediction. Replace the recompute call at the phase-2 post-measurement calibration site with ``boot_result.predicted_init_transient_peak_bytes`` — the bootstrap-time value captured at line 1614 before materialize_offload ran. The downstream consumers (SearchResult publish, LOG.info diagnostic) get the same authoritative value without re-walking a now-stale chunk_manager. **R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid CI OOM.** ``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager`` allocated full CPU tensors sized to model 15–60 GiB chunk totals. ``predict_init_transient_peak_bytes`` only reads ``param.numel() * param.element_size()``, so meta-device tensors preserve the byte-accounting metadata without consuming RAM. Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))`` construction to ``nn.Parameter(torch.empty(numel, dtype, device='meta'), requires_grad=False)``. **R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.** ``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses 0.35. Update the two docstring mentions to "±35%" so text matches intent. ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped / 162 deselected / 0 failed. - GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped (single-process Mode-C downgrade — expected) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s (R5) CodeRabbit R5 review (final pass on c996ce9 + f09be09) flagged Ruff RUF002/RUF003 warnings for confusable unicode glyphs across the new audit-Block-G commentary added by 2fcc1fc / b61f04e / aa0c6ba / the per-dtype alpha lookup work. Same lint family R1-#5 and R2-#3 addressed in narrow scope before; this is the broader pass that sweeps the rest of the protrain subtree. Replacements (234 substitutions across 7 files): | File | alpha | x | ∪ | Total | |---------------------------------------------------|------:|----:|--:|------:| | src/axolotl/integrations/protrain/cost/memory.py | 23 | 1 | 0 | 24 | | src/axolotl/integrations/protrain/api/model_wrapper.py | 39 | 14 | 4 | 57 | | src/axolotl/integrations/protrain/types.py | 23 | 6 | 2 | 31 | | src/axolotl/integrations/protrain/DESIGN.md | 19 | 17 | 0 | 36 | | tests/protrain/test_modec_steady_peak_accuracy.py | 8 | 5 | 1 | 14 | | tests/protrain/test_init_transient_peak.py | 6 | 7 | 0 | 13 | | tests/protrain/test_alpha_per_dtype.py | 38 | 0 | 0 | 38 | Substitution rules: - Greek small letter alpha (U+03B1) → ``alpha`` - Multiplication sign (U+00D7) → ``x`` - Union operator (U+222A) → ``|`` (also the Python set-union operator, so doubly appropriate) All replacements are in docstrings, comments, and pytest-parametrize ID strings — zero changes to function names, type names, control flow, or assertion text. ``param_to_chunk`` typed dict keys, set literals, and any Python operator usage of ``|`` are unaffected. Test parametrize IDs change cosmetically (e.g. ``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → α=1.10]`` ⇒ ``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → alpha=1.10]``) — the ``→`` arrow remains unchanged (Ruff doesn't flag it; CodeRabbit flagged only ``α``/``×``/``∪`` explicitly). ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped / 162 deselected / 0 failed. - Ruff RUF002/RUF003 warnings across the seven touched protrain files: 234 → 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s in prior fixes CodeRabbit's full-diff re-scan on commit 55377e5 surfaced four Major correctness gaps in prior triage commits that the incremental reviews missed. **F-#1 — Filter post-wrap ignore set to non-persistent chunks only.** My R4-#1 fix at ``api/model_wrapper.py:2227-2246`` built ``chunk_managed_param_ids`` from ALL ``chunk_manager._params_by_id.values()``, but persistent chunks should NEVER be in ``_ddp_params_and_buffers_to_ignore`` — they need normal DDP broadcast and backward allreduce (see ``ChunkManager.chunk_managed_param_names``'s docstring: "Persistent chunks are excluded — their params stay GPU-resident, do not pass through the released-state placeholder, and DO need the standard DDP broadcast for correctness."). The over-broad filter silently swept persistent params into the ignore set, breaking gradient sync on the chunks DDP IS supposed to handle. Restrict the OBJECT-identity set to params backed by ``_non_persistent_ids`` only — iterate ``_cpu_slots[cid]`` for each non-persistent ``cid`` and pull the param ref from ``_params_by_id``. Renamed the local loop vars (``_cpu_slot`` instead of ``slot``) to avoid shadowing the earlier ``for slot, child in enumerate(parent)`` block-wrap site that binds ``slot`` to ``int``. **F-#3 — Abort optim swap on CPU-adapter teardown failure.** My D3 fix at ``api/optim_wrapper.py:951`` wrapped ``_old_cpu_optim.shutdown()`` in a try/except that warned and continued. The whole point of D3 is the deterministic-cleanup invariant — masking a real teardown failure (``ThreadPoolExecutor`` hung, DeepSpeed C-state corrupted) puts the failed adapter back on the GC path AND silently accepts an inconsistent state-machine on the rebuild side. Removed the try/except so a shutdown failure aborts the swap rather than papering over it. **F-#6 — Also fence compute stream against ``_prefetch_stream``.** My R3-#1 fix at ``runtime/scheduler.py::ensure_chunks_resident`` added ``compute.wait_stream(_swap_stream)`` before the synchronous gather loop to close the SWAP D2H race. CodeRabbit caught that the symmetric prefetch race is still open: if a chunk is being prefetched and ``ChunkManager.gather()`` hits the ``_active_chunks`` resident fast path, ``param.data`` rebinds while the prefetch's H2D / ``all_gather_into_tensor`` is still running on ``_prefetch_stream`` — gather returns BEFORE the chunk is compute-stream-safe, and a LoRA forward consuming ``param.data`` reads stale / not-yet-written bytes. Add ``compute.wait_stream(_prefetch_stream)`` alongside the existing ``compute.wait_stream(_swap_stream)`` so both cross-stream barriers fire when present. Cost: one extra event-record / event-wait per LoRA container hook fire; no-op when ``_prefetch_stream`` isn't running anything. **F-#7 — Broaden exception scope in ``check_cuda_p2p_support``.** My D9 fix at ``utils/environment.py:96`` caught only ``AssertionError`` from ``torch.cuda.can_device_access_peer``. Per the PyTorch 2.6 docs path, the Python wrapper validates device indices with ``AssertionError``, but the C++ binding ``_cuda_canDeviceAccessPeer`` it delegates to can surface exceptions from the CUDA runtime (``RuntimeError`` wrapping ``cudaErrorInvalidDevice``, peer-access machinery errors) that ``AssertionError`` wouldn't catch. An unhandled exception would propagate out of the helper and break the fail-closed contract — ranks would disagree about ``NCCL_P2P_DISABLE`` which is exactly the SIGSEGV class commit ``91e0912e`` set out to prevent. Widened to ``except Exception`` (with ``noqa: BLE001`` annotation explicitly documenting the fail-closed rationale). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five test-quality refinements from CodeRabbit's third-round review. **R3-#2 — deterministic teardown in test_dora.** Wrap the DoRA smoke's wrap → train → assert sequence in ``try/finally`` so ``wrapped.close()`` runs even when the loss-descent assertion fails mid-test. Without this, an early assertion failure leaves hooks, pinned-host borrows, and CPU adapter threads alive into subsequent GPU tests on the same pytest session. **R3-#3 — distinguish hook edges in test_lora_offload_mode recording stub.** The pre-fix ``_RecordingScheduler.ensure_chunks_resident`` recorded every container callback under the same ``"ensure_chunks_resident"`` label. The per-hook tests (pre_forward / post_forward / post_backward fires ``ensure_chunks_resident``) then asserted only call COUNT — so a regression that deleted the pre-forward hook factory while post-forward still fired would still pass the count gates. Tag each call with its originating hook edge via frame inspection on the caller's ``co_qualname`` (Python 3.11+ guarantees the qualname captures the enclosing ``_make_lora_container_<edge>_hook`` factory). The four LoRA container hooks all funnel through the same ``ensure_chunks_resident`` entry point but their closures live in distinct factory functions, so the qualname uniquely identifies the edge. Update each per-hook test to filter on the edge-tagged label so a regression in any single edge fails the corresponding test: * pre_forward test: asserts ``ensure_chunks_resident:pre_forward`` fires ≥ n_blocks times. * post_forward test: asserts BOTH ``:pre_forward`` AND ``:post_forward`` fire ≥ n_containers times each (the previous bare ≥ 2*n_containers count was satisfied by either edge alone). * post_backward test: asserts all four edges (pre/post fwd, pre/ post bwd) fire ≥ n_containers times each. The production hook factory layout is unchanged — the stub recovers the edge from the existing closure's frame, no new arguments thread through ``install_hooks``. **R3-#4 — narrow protrain_model_wrapper exception scope in test_lora_offload_mode:1117.** The bare ``except (ValueError, RuntimeError)`` was treating any wrapper failure as "offload setup unavailable" and skipping. A broken ``protrain_model_wrapper`` runtime path could leave this smoke green. Restrict the suppression to known env-failure substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb load, ``No module named``, and capacity/searcher gates) — same canonical tuple D8 used at the optimizer-step site below — and re-raise anything else. Real wrapper regressions now surface. **R3-#5 — fail-safe CUDA teardown in test_param_data_shape_preservation.** Eight test functions in this module construct ``mgr / layout / pool / host`` via ``_build_chunk_manager`` and tear them down at the happy-path tail (``mgr.uninstall()`` / ``host.close()`` / ``del pool``). Any earlier assertion failure skipped the teardown, leaking pinned-host borrows + CUDA buffer-pool state into subsequent GPU tests. Add a top-level ``_teardown_chunk_manager(mgr, host, pool)`` helper that does the best-effort 3-call teardown (each call wrapped in its own try/except so a failure in ``uninstall`` doesn't block the ``host.close``), and wrap each test body in ``try: ... finally: _teardown_chunk_manager(...)``. Done programmatically across all 8 tests via a one-shot Python rewrite to keep the diff mechanical and the new structure consistent. **R3-#8 — replace hard-coded n_chunk_estimate=1 in test_trace_skip_on_override.** The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based on the assumption that the tiny GPT-2 fixture produces a single chunk. If the layout heuristics (``pick_S_chunk`` default, block-discovery rules) shift such that ``N_chunk > 1``, ``min_n_buffer_for(layout, n_persist=1)`` rejects ``n_buffer_override=0`` BEFORE the wrapper reaches the trace-skip gate the test is supposed to validate — converting this into a flaky non-target failure. Compute ``n_chunk_estimate`` dynamically by running the same ``discover_blocks`` → ``flatten_block_trees`` → ``build_layout`` pipeline the wrapper itself uses (with the wrapper's default S_chunk), and pass the resulting ``layout.N_chunk`` through. ``n_persist_override = n_chunk_estimate`` then keeps the all-persistent invariant the test relies on regardless of any future layout-heuristic shift. ``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped / 0 failed. GPU-marker sweep on touched files: 40 passed / 2 skipped (single-process Mode-C downgrade for shape-preserving placeholder paths) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inor) Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting). **R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.** The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The block wrappers (``block/checkpoint.py``, ``block/swap.py``, ``block/offload.py``) all do ``self.block = block``, which means PyTorch's ``named_parameters()`` traversal inserts a ``.block.`` infix into the parameter namespace (``layers.0.attn.q_proj.weight`` ⇒ ``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured in ``model._ddp_params_and_buffers_to_ignore`` no longer match the namespace DDP's ``__init__`` walks at construction time. The init-time broadcast is irrelevant (M6C-fix-8's ``init_sync=False`` monkey-patch bypasses it wholesale on chunk-managed models), but DDP's BACKWARD-pass allreduce still consults the ignore list. A stale ignore set means DDP's backward allreduce would attempt to all-reduce chunk-managed LoRA factor gradients, conflicting with ProTrain's per-chunk ``reduce_scatter`` drain. Add a post-wrap re-registration step after ``install_hooks`` in ``_construct_runtime``: walk the WRAPPED ``model.named_parameters()`` and identify chunk-managed params by OBJECT identity against ``chunk_manager._params_by_id.values()``. Build the post-wrap name set, merge with the pre-protrain snapshot (``_protrain_ddp_original_ignore``), overwrite the attribute. Gated on ``_shape_preserving`` so the single-GPU / replicated path remains a no-op. **R4-#2 (Major) — reuse bootstrap init-transient peak instead of recomputing post-offload.** ``predict_init_transient_peak_bytes(layout, hw, chunk_manager)`` walks ``chunk_manager.model.named_parameters()`` to sum chunk bytes. By the time the phase-2 post-measurement calibration runs, ``materialize_offload`` has already executed and ``param.data`` points at the zero-size placeholders (replicated path) or ``scratch.expand(slot.shape)`` views (sharded path), so the byte accounting drifts away from the bootstrap-time full-residence prediction. Replace the recompute call at the phase-2 post-measurement calibration site with ``boot_result.predicted_init_transient_peak_bytes`` — the bootstrap-time value captured at line 1614 before materialize_offload ran. The downstream consumers (SearchResult publish, LOG.info diagnostic) get the same authoritative value without re-walking a now-stale chunk_manager. **R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid CI OOM.** ``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager`` allocated full CPU tensors sized to model 15–60 GiB chunk totals. ``predict_init_transient_peak_bytes`` only reads ``param.numel() * param.element_size()``, so meta-device tensors preserve the byte-accounting metadata without consuming RAM. Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))`` construction to ``nn.Parameter(torch.empty(numel, dtype, device='meta'), requires_grad=False)``. **R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.** ``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses 0.35. Update the two docstring mentions to "±35%" so text matches intent. ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped / 162 deselected / 0 failed. - GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped (single-process Mode-C downgrade — expected) / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s (R5) CodeRabbit R5 review (final pass on c996ce9 + f09be09) flagged Ruff RUF002/RUF003 warnings for confusable unicode glyphs across the new audit-Block-G commentary added by 2fcc1fc / b61f04e / aa0c6ba / the per-dtype alpha lookup work. Same lint family R1-#5 and R2-#3 addressed in narrow scope before; this is the broader pass that sweeps the rest of the protrain subtree. Replacements (234 substitutions across 7 files): | File | alpha | x | ∪ | Total | |---------------------------------------------------|------:|----:|--:|------:| | src/axolotl/integrations/protrain/cost/memory.py | 23 | 1 | 0 | 24 | | src/axolotl/integrations/protrain/api/model_wrapper.py | 39 | 14 | 4 | 57 | | src/axolotl/integrations/protrain/types.py | 23 | 6 | 2 | 31 | | src/axolotl/integrations/protrain/DESIGN.md | 19 | 17 | 0 | 36 | | tests/protrain/test_modec_steady_peak_accuracy.py | 8 | 5 | 1 | 14 | | tests/protrain/test_init_transient_peak.py | 6 | 7 | 0 | 13 | | tests/protrain/test_alpha_per_dtype.py | 38 | 0 | 0 | 38 | Substitution rules: - Greek small letter alpha (U+03B1) → ``alpha`` - Multiplication sign (U+00D7) → ``x`` - Union operator (U+222A) → ``|`` (also the Python set-union operator, so doubly appropriate) All replacements are in docstrings, comments, and pytest-parametrize ID strings — zero changes to function names, type names, control flow, or assertion text. ``param_to_chunk`` typed dict keys, set literals, and any Python operator usage of ``|`` are unaffected. Test parametrize IDs change cosmetically (e.g. ``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → α=1.10]`` ⇒ ``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → alpha=1.10]``) — the ``→`` arrow remains unchanged (Ruff doesn't flag it; CodeRabbit flagged only ``α``/``×``/``∪`` explicitly). ### Test gates - ``pre-commit run --all-files`` ALL green (ruff / ruff-format / mypy / bandit / yaml / eol / whitespace). - ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped / 162 deselected / 0 failed. - Ruff RUF002/RUF003 warnings across the seven touched protrain files: 234 → 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s in prior fixes CodeRabbit's full-diff re-scan on commit 55377e5 surfaced four Major correctness gaps in prior triage commits that the incremental reviews missed. **F-#1 — Filter post-wrap ignore set to non-persistent chunks only.** My R4-#1 fix at ``api/model_wrapper.py:2227-2246`` built ``chunk_managed_param_ids`` from ALL ``chunk_manager._params_by_id.values()``, but persistent chunks should NEVER be in ``_ddp_params_and_buffers_to_ignore`` — they need normal DDP broadcast and backward allreduce (see ``ChunkManager.chunk_managed_param_names``'s docstring: "Persistent chunks are excluded — their params stay GPU-resident, do not pass through the released-state placeholder, and DO need the standard DDP broadcast for correctness."). The over-broad filter silently swept persistent params into the ignore set, breaking gradient sync on the chunks DDP IS supposed to handle. Restrict the OBJECT-identity set to params backed by ``_non_persistent_ids`` only — iterate ``_cpu_slots[cid]`` for each non-persistent ``cid`` and pull the param ref from ``_params_by_id``. Renamed the local loop vars (``_cpu_slot`` instead of ``slot``) to avoid shadowing the earlier ``for slot, child in enumerate(parent)`` block-wrap site that binds ``slot`` to ``int``. **F-#3 — Abort optim swap on CPU-adapter teardown failure.** My D3 fix at ``api/optim_wrapper.py:951`` wrapped ``_old_cpu_optim.shutdown()`` in a try/except that warned and continued. The whole point of D3 is the deterministic-cleanup invariant — masking a real teardown failure (``ThreadPoolExecutor`` hung, DeepSpeed C-state corrupted) puts the failed adapter back on the GC path AND silently accepts an inconsistent state-machine on the rebuild side. Removed the try/except so a shutdown failure aborts the swap rather than papering over it. **F-#6 — Also fence compute stream against ``_prefetch_stream``.** My R3-#1 fix at ``runtime/scheduler.py::ensure_chunks_resident`` added ``compute.wait_stream(_swap_stream)`` before the synchronous gather loop to close the SWAP D2H race. CodeRabbit caught that the symmetric prefetch race is still open: if a chunk is being prefetched and ``ChunkManager.gather()`` hits the ``_active_chunks`` resident fast path, ``param.data`` rebinds while the prefetch's H2D / ``all_gather_into_tensor`` is still running on ``_prefetch_stream`` — gather returns BEFORE the chunk is compute-stream-safe, and a LoRA forward consuming ``param.data`` reads stale / not-yet-written bytes. Add ``compute.wait_stream(_prefetch_stream)`` alongside the existing ``compute.wait_stream(_swap_stream)`` so both cross-stream barriers fire when present. Cost: one extra event-record / event-wait per LoRA container hook fire; no-op when ``_prefetch_stream`` isn't running anything. **F-#7 — Broaden exception scope in ``check_cuda_p2p_support``.** My D9 fix at ``utils/environment.py:96`` caught only ``AssertionError`` from ``torch.cuda.can_device_access_peer``. Per the PyTorch 2.6 docs path, the Python wrapper validates device indices with ``AssertionError``, but the C++ binding ``_cuda_canDeviceAccessPeer`` it delegates to can surface exceptions from the CUDA runtime (``RuntimeError`` wrapping ``cudaErrorInvalidDevice``, peer-access machinery errors) that ``AssertionError`` wouldn't catch. An unhandled exception would propagate out of the helper and break the fail-closed contract — ranks would disagree about ``NCCL_P2P_DISABLE`` which is exactly the SIGSEGV class commit ``91e0912e`` set out to prevent. Widened to ``except Exception`` (with ``noqa: BLE001`` annotation explicitly documenting the fail-closed rationale). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

thad0ctor merged commit 2ab4c10 into main Dec 3, 2025

thad0ctor mentioned this pull request May 12, 2026

Phase 2: ProTrain integration with Axolotl perf features (M0–M6C closed) #21

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add support for SIZE_BASED_WRAP with min_num_params in FSDP2#3

fix: add support for SIZE_BASED_WRAP with min_num_params in FSDP2#3
thad0ctor merged 1 commit into
mainfrom
fix-fsdp2-size-based-wrap-clean

thad0ctor commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thad0ctor commented Dec 3, 2025

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant