Skip to content

fix: add support for SIZE_BASED_WRAP with min_num_params in FSDP2#3

Merged
thad0ctor merged 1 commit into
mainfrom
fix-fsdp2-size-based-wrap-clean
Dec 3, 2025
Merged

fix: add support for SIZE_BASED_WRAP with min_num_params in FSDP2#3
thad0ctor merged 1 commit into
mainfrom
fix-fsdp2-size-based-wrap-clean

Conversation

@thad0ctor

Copy link
Copy Markdown
Owner
  • Add min_num_params field to FSDPConfig schema
  • Set FSDP_MIN_NUM_PARAMS environment variable for SIZE_BASED_WRAP
  • Add patch for fsdp2_apply_ac to handle None auto_wrap_policy
  • Default to 100M parameters when min_num_params not specified

This fixes TypeError when using SIZE_BASED_WRAP without proper configuration

Description

Motivation and Context

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

- Add min_num_params field to FSDPConfig schema
- Set FSDP_MIN_NUM_PARAMS environment variable for SIZE_BASED_WRAP
- Add patch for fsdp2_apply_ac to handle None auto_wrap_policy
- Default to 100M parameters when min_num_params not specified

This fixes TypeError when using SIZE_BASED_WRAP without proper configuration
@thad0ctor thad0ctor merged commit 2ab4c10 into main Dec 3, 2025
thad0ctor added a commit that referenced this pull request May 12, 2026
Five test-quality refinements from CodeRabbit's third-round review.

**R3-#2 — deterministic teardown in test_dora.**

Wrap the DoRA smoke's wrap → train → assert sequence in
``try/finally`` so ``wrapped.close()`` runs even when the
loss-descent assertion fails mid-test. Without this, an early
assertion failure leaves hooks, pinned-host borrows, and CPU
adapter threads alive into subsequent GPU tests on the same
pytest session.

**R3-#3 — distinguish hook edges in test_lora_offload_mode
recording stub.**

The pre-fix ``_RecordingScheduler.ensure_chunks_resident``
recorded every container callback under the same
``"ensure_chunks_resident"`` label. The per-hook tests
(pre_forward / post_forward / post_backward fires
``ensure_chunks_resident``) then asserted only call COUNT — so a
regression that deleted the pre-forward hook factory while
post-forward still fired would still pass the count gates.

Tag each call with its originating hook edge via frame
inspection on the caller's ``co_qualname`` (Python 3.11+
guarantees the qualname captures the enclosing
``_make_lora_container_<edge>_hook`` factory). The four LoRA
container hooks all funnel through the same
``ensure_chunks_resident`` entry point but their closures live
in distinct factory functions, so the qualname uniquely
identifies the edge.

Update each per-hook test to filter on the edge-tagged label so
a regression in any single edge fails the corresponding test:

* pre_forward test: asserts ``ensure_chunks_resident:pre_forward``
  fires ≥ n_blocks times.
* post_forward test: asserts BOTH ``:pre_forward`` AND
  ``:post_forward`` fire ≥ n_containers times each (the previous
  bare ≥ 2*n_containers count was satisfied by either edge alone).
* post_backward test: asserts all four edges (pre/post fwd, pre/
  post bwd) fire ≥ n_containers times each.

The production hook factory layout is unchanged — the stub
recovers the edge from the existing closure's frame, no new
arguments thread through ``install_hooks``.

**R3-#4 — narrow protrain_model_wrapper exception scope in
test_lora_offload_mode:1117.**

The bare ``except (ValueError, RuntimeError)`` was treating any
wrapper failure as "offload setup unavailable" and skipping. A
broken ``protrain_model_wrapper`` runtime path could leave this
smoke green. Restrict the suppression to known env-failure
substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb
load, ``No module named``, and capacity/searcher gates) — same
canonical tuple D8 used at the optimizer-step site below — and
re-raise anything else. Real wrapper regressions now surface.

**R3-#5 — fail-safe CUDA teardown in
test_param_data_shape_preservation.**

Eight test functions in this module construct ``mgr / layout /
pool / host`` via ``_build_chunk_manager`` and tear them down at
the happy-path tail (``mgr.uninstall()`` / ``host.close()`` /
``del pool``). Any earlier assertion failure skipped the
teardown, leaking pinned-host borrows + CUDA buffer-pool state
into subsequent GPU tests.

Add a top-level ``_teardown_chunk_manager(mgr, host, pool)``
helper that does the best-effort 3-call teardown (each call
wrapped in its own try/except so a failure in ``uninstall``
doesn't block the ``host.close``), and wrap each test body in
``try: ... finally: _teardown_chunk_manager(...)``. Done
programmatically across all 8 tests via a one-shot Python
rewrite to keep the diff mechanical and the new structure
consistent.

**R3-#8 — replace hard-coded n_chunk_estimate=1 in
test_trace_skip_on_override.**

The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based
on the assumption that the tiny GPT-2 fixture produces a single
chunk. If the layout heuristics (``pick_S_chunk`` default,
block-discovery rules) shift such that ``N_chunk > 1``,
``min_n_buffer_for(layout, n_persist=1)`` rejects
``n_buffer_override=0`` BEFORE the wrapper reaches the
trace-skip gate the test is supposed to validate — converting
this into a flaky non-target failure.

Compute ``n_chunk_estimate`` dynamically by running the same
``discover_blocks`` → ``flatten_block_trees`` → ``build_layout``
pipeline the wrapper itself uses (with the wrapper's default
S_chunk), and pass the resulting ``layout.N_chunk`` through.
``n_persist_override = n_chunk_estimate`` then keeps the
all-persistent invariant the test relies on regardless of any
future layout-heuristic shift.

``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped
/ 0 failed. GPU-marker sweep on touched files: 40 passed /
2 skipped (single-process Mode-C downgrade for shape-preserving
placeholder paths) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 12, 2026
…inor)

Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak
prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting).

**R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.**

The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The
block wrappers (``block/checkpoint.py``, ``block/swap.py``,
``block/offload.py``) all do ``self.block = block``, which means
PyTorch's ``named_parameters()`` traversal inserts a ``.block.``
infix into the parameter namespace
(``layers.0.attn.q_proj.weight`` ⇒
``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured
in ``model._ddp_params_and_buffers_to_ignore`` no longer match the
namespace DDP's ``__init__`` walks at construction time.

The init-time broadcast is irrelevant (M6C-fix-8's
``init_sync=False`` monkey-patch bypasses it wholesale on
chunk-managed models), but DDP's BACKWARD-pass allreduce still
consults the ignore list. A stale ignore set means DDP's backward
allreduce would attempt to all-reduce chunk-managed LoRA factor
gradients, conflicting with ProTrain's per-chunk ``reduce_scatter``
drain.

Add a post-wrap re-registration step after ``install_hooks`` in
``_construct_runtime``: walk the WRAPPED ``model.named_parameters()``
and identify chunk-managed params by OBJECT identity against
``chunk_manager._params_by_id.values()``. Build the post-wrap name
set, merge with the pre-protrain snapshot
(``_protrain_ddp_original_ignore``), overwrite the attribute.
Gated on ``_shape_preserving`` so the single-GPU / replicated path
remains a no-op.

**R4-#2 (Major) — reuse bootstrap init-transient peak instead of
recomputing post-offload.**

``predict_init_transient_peak_bytes(layout, hw, chunk_manager)``
walks ``chunk_manager.model.named_parameters()`` to sum chunk
bytes. By the time the phase-2 post-measurement calibration runs,
``materialize_offload`` has already executed and ``param.data``
points at the zero-size placeholders (replicated path) or
``scratch.expand(slot.shape)`` views (sharded path), so the
byte accounting drifts away from the bootstrap-time full-residence
prediction.

Replace the recompute call at the phase-2 post-measurement
calibration site with ``boot_result.predicted_init_transient_peak_bytes``
— the bootstrap-time value captured at line 1614 before
materialize_offload ran. The downstream consumers (SearchResult
publish, LOG.info diagnostic) get the same authoritative value
without re-walking a now-stale chunk_manager.

**R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid
CI OOM.**

``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager``
allocated full CPU tensors sized to model 15–60 GiB chunk
totals. ``predict_init_transient_peak_bytes`` only reads
``param.numel() * param.element_size()``, so meta-device tensors
preserve the byte-accounting metadata without consuming RAM.
Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))``
construction to ``nn.Parameter(torch.empty(numel, dtype,
device='meta'), requires_grad=False)``.

**R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.**

``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring
said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses
0.35. Update the two docstring mentions to "±35%" so text matches
intent.

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker sweep: 313 passed / 4
  skipped / 162 deselected / 0 failed.
- GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped
  (single-process Mode-C downgrade — expected) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 12, 2026
…s (R5)

CodeRabbit R5 review (final pass on c996ce9 + f09be09) flagged Ruff
RUF002/RUF003 warnings for confusable unicode glyphs across the new
audit-Block-G commentary added by 2fcc1fc / b61f04e / aa0c6ba /
the per-dtype alpha lookup work. Same lint family R1-#5 and R2-#3
addressed in narrow scope before; this is the broader pass that
sweeps the rest of the protrain subtree.

Replacements (234 substitutions across 7 files):

| File                                              | alpha | x   | ∪ | Total |
|---------------------------------------------------|------:|----:|--:|------:|
| src/axolotl/integrations/protrain/cost/memory.py  |   23  |   1 | 0 |    24 |
| src/axolotl/integrations/protrain/api/model_wrapper.py | 39 | 14 | 4 |    57 |
| src/axolotl/integrations/protrain/types.py        |   23  |   6 | 2 |    31 |
| src/axolotl/integrations/protrain/DESIGN.md       |   19  |  17 | 0 |    36 |
| tests/protrain/test_modec_steady_peak_accuracy.py |    8  |   5 | 1 |    14 |
| tests/protrain/test_init_transient_peak.py        |    6  |   7 | 0 |    13 |
| tests/protrain/test_alpha_per_dtype.py            |   38  |   0 | 0 |    38 |

Substitution rules:
- Greek small letter alpha (U+03B1) → ``alpha``
- Multiplication sign (U+00D7) → ``x``
- Union operator (U+222A) → ``|`` (also the Python set-union operator,
  so doubly appropriate)

All replacements are in docstrings, comments, and pytest-parametrize
ID strings — zero changes to function names, type names, control
flow, or assertion text. ``param_to_chunk`` typed dict keys, set
literals, and any Python operator usage of ``|`` are unaffected.

Test parametrize IDs change cosmetically (e.g.
``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → α=1.10]`` ⇒
``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → alpha=1.10]``)
— the ``→`` arrow remains unchanged (Ruff doesn't flag it; CodeRabbit
flagged only ``α``/``×``/``∪`` explicitly).

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped /
  162 deselected / 0 failed.
- Ruff RUF002/RUF003 warnings across the seven touched protrain
  files: 234 → 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 12, 2026
…s in prior fixes

CodeRabbit's full-diff re-scan on commit 55377e5 surfaced four
Major correctness gaps in prior triage commits that the incremental
reviews missed.

**F-#1 — Filter post-wrap ignore set to non-persistent chunks only.**

My R4-#1 fix at ``api/model_wrapper.py:2227-2246`` built
``chunk_managed_param_ids`` from ALL
``chunk_manager._params_by_id.values()``, but persistent chunks
should NEVER be in ``_ddp_params_and_buffers_to_ignore`` — they
need normal DDP broadcast and backward allreduce (see
``ChunkManager.chunk_managed_param_names``'s docstring: "Persistent
chunks are excluded — their params stay GPU-resident, do not pass
through the released-state placeholder, and DO need the standard
DDP broadcast for correctness."). The over-broad filter silently
swept persistent params into the ignore set, breaking gradient
sync on the chunks DDP IS supposed to handle.

Restrict the OBJECT-identity set to params backed by
``_non_persistent_ids`` only — iterate ``_cpu_slots[cid]`` for
each non-persistent ``cid`` and pull the param ref from
``_params_by_id``. Renamed the local loop vars (``_cpu_slot``
instead of ``slot``) to avoid shadowing the earlier
``for slot, child in enumerate(parent)`` block-wrap site that
binds ``slot`` to ``int``.

**F-#3 — Abort optim swap on CPU-adapter teardown failure.**

My D3 fix at ``api/optim_wrapper.py:951`` wrapped
``_old_cpu_optim.shutdown()`` in a try/except that warned and
continued. The whole point of D3 is the deterministic-cleanup
invariant — masking a real teardown failure (``ThreadPoolExecutor``
hung, DeepSpeed C-state corrupted) puts the failed adapter back on
the GC path AND silently accepts an inconsistent state-machine on
the rebuild side. Removed the try/except so a shutdown failure
aborts the swap rather than papering over it.

**F-#6 — Also fence compute stream against ``_prefetch_stream``.**

My R3-#1 fix at ``runtime/scheduler.py::ensure_chunks_resident``
added ``compute.wait_stream(_swap_stream)`` before the
synchronous gather loop to close the SWAP D2H race. CodeRabbit
caught that the symmetric prefetch race is still open: if a chunk
is being prefetched and ``ChunkManager.gather()`` hits the
``_active_chunks`` resident fast path, ``param.data`` rebinds
while the prefetch's H2D / ``all_gather_into_tensor`` is still
running on ``_prefetch_stream`` — gather returns BEFORE the chunk
is compute-stream-safe, and a LoRA forward consuming
``param.data`` reads stale / not-yet-written bytes.

Add ``compute.wait_stream(_prefetch_stream)`` alongside the
existing ``compute.wait_stream(_swap_stream)`` so both
cross-stream barriers fire when present. Cost: one extra
event-record / event-wait per LoRA container hook fire; no-op when
``_prefetch_stream`` isn't running anything.

**F-#7 — Broaden exception scope in ``check_cuda_p2p_support``.**

My D9 fix at ``utils/environment.py:96`` caught only
``AssertionError`` from ``torch.cuda.can_device_access_peer``.
Per the PyTorch 2.6 docs path, the Python wrapper validates
device indices with ``AssertionError``, but the C++ binding
``_cuda_canDeviceAccessPeer`` it delegates to can surface
exceptions from the CUDA runtime (``RuntimeError`` wrapping
``cudaErrorInvalidDevice``, peer-access machinery errors) that
``AssertionError`` wouldn't catch. An unhandled exception would
propagate out of the helper and break the fail-closed contract —
ranks would disagree about ``NCCL_P2P_DISABLE`` which is exactly
the SIGSEGV class commit ``91e0912e`` set out to prevent.

Widened to ``except Exception`` (with ``noqa: BLE001`` annotation
explicitly documenting the fail-closed rationale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
Five test-quality refinements from CodeRabbit's third-round review.

**R3-#2 — deterministic teardown in test_dora.**

Wrap the DoRA smoke's wrap → train → assert sequence in
``try/finally`` so ``wrapped.close()`` runs even when the
loss-descent assertion fails mid-test. Without this, an early
assertion failure leaves hooks, pinned-host borrows, and CPU
adapter threads alive into subsequent GPU tests on the same
pytest session.

**R3-#3 — distinguish hook edges in test_lora_offload_mode
recording stub.**

The pre-fix ``_RecordingScheduler.ensure_chunks_resident``
recorded every container callback under the same
``"ensure_chunks_resident"`` label. The per-hook tests
(pre_forward / post_forward / post_backward fires
``ensure_chunks_resident``) then asserted only call COUNT — so a
regression that deleted the pre-forward hook factory while
post-forward still fired would still pass the count gates.

Tag each call with its originating hook edge via frame
inspection on the caller's ``co_qualname`` (Python 3.11+
guarantees the qualname captures the enclosing
``_make_lora_container_<edge>_hook`` factory). The four LoRA
container hooks all funnel through the same
``ensure_chunks_resident`` entry point but their closures live
in distinct factory functions, so the qualname uniquely
identifies the edge.

Update each per-hook test to filter on the edge-tagged label so
a regression in any single edge fails the corresponding test:

* pre_forward test: asserts ``ensure_chunks_resident:pre_forward``
  fires ≥ n_blocks times.
* post_forward test: asserts BOTH ``:pre_forward`` AND
  ``:post_forward`` fire ≥ n_containers times each (the previous
  bare ≥ 2*n_containers count was satisfied by either edge alone).
* post_backward test: asserts all four edges (pre/post fwd, pre/
  post bwd) fire ≥ n_containers times each.

The production hook factory layout is unchanged — the stub
recovers the edge from the existing closure's frame, no new
arguments thread through ``install_hooks``.

**R3-#4 — narrow protrain_model_wrapper exception scope in
test_lora_offload_mode:1117.**

The bare ``except (ValueError, RuntimeError)`` was treating any
wrapper failure as "offload setup unavailable" and skipping. A
broken ``protrain_model_wrapper`` runtime path could leave this
smoke green. Restrict the suppression to known env-failure
substrings (DeepSpeedCPUAdam JIT, CUDA version mismatch, bnb
load, ``No module named``, and capacity/searcher gates) — same
canonical tuple D8 used at the optimizer-step site below — and
re-raise anything else. Real wrapper regressions now surface.

**R3-#5 — fail-safe CUDA teardown in
test_param_data_shape_preservation.**

Eight test functions in this module construct ``mgr / layout /
pool / host`` via ``_build_chunk_manager`` and tear them down at
the happy-path tail (``mgr.uninstall()`` / ``host.close()`` /
``del pool``). Any earlier assertion failure skipped the
teardown, leaking pinned-host borrows + CUDA buffer-pool state
into subsequent GPU tests.

Add a top-level ``_teardown_chunk_manager(mgr, host, pool)``
helper that does the best-effort 3-call teardown (each call
wrapped in its own try/except so a failure in ``uninstall``
doesn't block the ``host.close``), and wrap each test body in
``try: ... finally: _teardown_chunk_manager(...)``. Done
programmatically across all 8 tests via a one-shot Python
rewrite to keep the diff mechanical and the new structure
consistent.

**R3-#8 — replace hard-coded n_chunk_estimate=1 in
test_trace_skip_on_override.**

The trace-skip e2e test hard-coded ``n_chunk_estimate = 1`` based
on the assumption that the tiny GPT-2 fixture produces a single
chunk. If the layout heuristics (``pick_S_chunk`` default,
block-discovery rules) shift such that ``N_chunk > 1``,
``min_n_buffer_for(layout, n_persist=1)`` rejects
``n_buffer_override=0`` BEFORE the wrapper reaches the
trace-skip gate the test is supposed to validate — converting
this into a flaky non-target failure.

Compute ``n_chunk_estimate`` dynamically by running the same
``discover_blocks`` → ``flatten_block_trees`` → ``build_layout``
pipeline the wrapper itself uses (with the wrapper's default
S_chunk), and pass the resulting ``layout.N_chunk`` through.
``n_persist_override = n_chunk_estimate`` then keeps the
all-persistent invariant the test relies on regardless of any
future layout-heuristic shift.

``tests/protrain/`` default-marker sweep: 303 passed / 4 skipped
/ 0 failed. GPU-marker sweep on touched files: 40 passed /
2 skipped (single-process Mode-C downgrade for shape-preserving
placeholder paths) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
…inor)

Four CodeRabbit findings on commits ``b61f04e0`` (init-transient peak
prediction) + ``aa0c6ba9`` (Mode-C steady CKPT-chain accounting).

**R4-#1 (Critical) — post-block-wrap DDP ignore-set re-registration.**

The M6C-fix-8 ignore-set registration ran BEFORE block-wrap. The
block wrappers (``block/checkpoint.py``, ``block/swap.py``,
``block/offload.py``) all do ``self.block = block``, which means
PyTorch's ``named_parameters()`` traversal inserts a ``.block.``
infix into the parameter namespace
(``layers.0.attn.q_proj.weight`` ⇒
``layers.0.block.attn.q_proj.weight``). The pre-wrap names captured
in ``model._ddp_params_and_buffers_to_ignore`` no longer match the
namespace DDP's ``__init__`` walks at construction time.

The init-time broadcast is irrelevant (M6C-fix-8's
``init_sync=False`` monkey-patch bypasses it wholesale on
chunk-managed models), but DDP's BACKWARD-pass allreduce still
consults the ignore list. A stale ignore set means DDP's backward
allreduce would attempt to all-reduce chunk-managed LoRA factor
gradients, conflicting with ProTrain's per-chunk ``reduce_scatter``
drain.

Add a post-wrap re-registration step after ``install_hooks`` in
``_construct_runtime``: walk the WRAPPED ``model.named_parameters()``
and identify chunk-managed params by OBJECT identity against
``chunk_manager._params_by_id.values()``. Build the post-wrap name
set, merge with the pre-protrain snapshot
(``_protrain_ddp_original_ignore``), overwrite the attribute.
Gated on ``_shape_preserving`` so the single-GPU / replicated path
remains a no-op.

**R4-#2 (Major) — reuse bootstrap init-transient peak instead of
recomputing post-offload.**

``predict_init_transient_peak_bytes(layout, hw, chunk_manager)``
walks ``chunk_manager.model.named_parameters()`` to sum chunk
bytes. By the time the phase-2 post-measurement calibration runs,
``materialize_offload`` has already executed and ``param.data``
points at the zero-size placeholders (replicated path) or
``scratch.expand(slot.shape)`` views (sharded path), so the
byte accounting drifts away from the bootstrap-time full-residence
prediction.

Replace the recompute call at the phase-2 post-measurement
calibration site with ``boot_result.predicted_init_transient_peak_bytes``
— the bootstrap-time value captured at line 1614 before
materialize_offload ran. The downstream consumers (SearchResult
publish, LOG.info diagnostic) get the same authoritative value
without re-walking a now-stale chunk_manager.

**R4-#3 (Major) — meta tensors in ``_stub_chunk_manager`` to avoid
CI OOM.**

``tests/protrain/test_init_transient_peak.py::_stub_chunk_manager``
allocated full CPU tensors sized to model 15–60 GiB chunk
totals. ``predict_init_transient_peak_bytes`` only reads
``param.numel() * param.element_size()``, so meta-device tensors
preserve the byte-accounting metadata without consuming RAM.
Switch the ``nn.Parameter(torch.zeros(numel, dtype, device='cpu'))``
construction to ``nn.Parameter(torch.empty(numel, dtype,
device='meta'), requires_grad=False)``.

**R4-#4 (Minor) — align docstring tolerance with ``TOLERANCE_FRAC = 0.35``.**

``tests/protrain/test_modec_steady_peak_accuracy.py`` docstring
said "±25%" but ``TOLERANCE_FRAC = 0.35`` and the assertion uses
0.35. Update the two docstring mentions to "±35%" so text matches
intent.

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker sweep: 313 passed / 4
  skipped / 162 deselected / 0 failed.
- GPU sanity on touched test files (GPU 5): 24 passed / 2 skipped
  (single-process Mode-C downgrade — expected) / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
…s (R5)

CodeRabbit R5 review (final pass on c996ce9 + f09be09) flagged Ruff
RUF002/RUF003 warnings for confusable unicode glyphs across the new
audit-Block-G commentary added by 2fcc1fc / b61f04e / aa0c6ba /
the per-dtype alpha lookup work. Same lint family R1-#5 and R2-#3
addressed in narrow scope before; this is the broader pass that
sweeps the rest of the protrain subtree.

Replacements (234 substitutions across 7 files):

| File                                              | alpha | x   | ∪ | Total |
|---------------------------------------------------|------:|----:|--:|------:|
| src/axolotl/integrations/protrain/cost/memory.py  |   23  |   1 | 0 |    24 |
| src/axolotl/integrations/protrain/api/model_wrapper.py | 39 | 14 | 4 |    57 |
| src/axolotl/integrations/protrain/types.py        |   23  |   6 | 2 |    31 |
| src/axolotl/integrations/protrain/DESIGN.md       |   19  |  17 | 0 |    36 |
| tests/protrain/test_modec_steady_peak_accuracy.py |    8  |   5 | 1 |    14 |
| tests/protrain/test_init_transient_peak.py        |    6  |   7 | 0 |    13 |
| tests/protrain/test_alpha_per_dtype.py            |   38  |   0 | 0 |    38 |

Substitution rules:
- Greek small letter alpha (U+03B1) → ``alpha``
- Multiplication sign (U+00D7) → ``x``
- Union operator (U+222A) → ``|`` (also the Python set-union operator,
  so doubly appropriate)

All replacements are in docstrings, comments, and pytest-parametrize
ID strings — zero changes to function names, type names, control
flow, or assertion text. ``param_to_chunk`` typed dict keys, set
literals, and any Python operator usage of ``|`` are unaffected.

Test parametrize IDs change cosmetically (e.g.
``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → α=1.10]`` ⇒
``test_alpha_lookup_by_dtype[2.0-1.1-fp16/bf16 weights → alpha=1.10]``)
— the ``→`` arrow remains unchanged (Ruff doesn't flag it; CodeRabbit
flagged only ``α``/``×``/``∪`` explicitly).

### Test gates

- ``pre-commit run --all-files`` ALL green (ruff / ruff-format /
  mypy / bandit / yaml / eol / whitespace).
- ``tests/protrain/`` default-marker sweep: 313 passed / 4 skipped /
  162 deselected / 0 failed.
- Ruff RUF002/RUF003 warnings across the seven touched protrain
  files: 234 → 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thad0ctor added a commit that referenced this pull request May 28, 2026
…s in prior fixes

CodeRabbit's full-diff re-scan on commit 55377e5 surfaced four
Major correctness gaps in prior triage commits that the incremental
reviews missed.

**F-#1 — Filter post-wrap ignore set to non-persistent chunks only.**

My R4-#1 fix at ``api/model_wrapper.py:2227-2246`` built
``chunk_managed_param_ids`` from ALL
``chunk_manager._params_by_id.values()``, but persistent chunks
should NEVER be in ``_ddp_params_and_buffers_to_ignore`` — they
need normal DDP broadcast and backward allreduce (see
``ChunkManager.chunk_managed_param_names``'s docstring: "Persistent
chunks are excluded — their params stay GPU-resident, do not pass
through the released-state placeholder, and DO need the standard
DDP broadcast for correctness."). The over-broad filter silently
swept persistent params into the ignore set, breaking gradient
sync on the chunks DDP IS supposed to handle.

Restrict the OBJECT-identity set to params backed by
``_non_persistent_ids`` only — iterate ``_cpu_slots[cid]`` for
each non-persistent ``cid`` and pull the param ref from
``_params_by_id``. Renamed the local loop vars (``_cpu_slot``
instead of ``slot``) to avoid shadowing the earlier
``for slot, child in enumerate(parent)`` block-wrap site that
binds ``slot`` to ``int``.

**F-#3 — Abort optim swap on CPU-adapter teardown failure.**

My D3 fix at ``api/optim_wrapper.py:951`` wrapped
``_old_cpu_optim.shutdown()`` in a try/except that warned and
continued. The whole point of D3 is the deterministic-cleanup
invariant — masking a real teardown failure (``ThreadPoolExecutor``
hung, DeepSpeed C-state corrupted) puts the failed adapter back on
the GC path AND silently accepts an inconsistent state-machine on
the rebuild side. Removed the try/except so a shutdown failure
aborts the swap rather than papering over it.

**F-#6 — Also fence compute stream against ``_prefetch_stream``.**

My R3-#1 fix at ``runtime/scheduler.py::ensure_chunks_resident``
added ``compute.wait_stream(_swap_stream)`` before the
synchronous gather loop to close the SWAP D2H race. CodeRabbit
caught that the symmetric prefetch race is still open: if a chunk
is being prefetched and ``ChunkManager.gather()`` hits the
``_active_chunks`` resident fast path, ``param.data`` rebinds
while the prefetch's H2D / ``all_gather_into_tensor`` is still
running on ``_prefetch_stream`` — gather returns BEFORE the chunk
is compute-stream-safe, and a LoRA forward consuming
``param.data`` reads stale / not-yet-written bytes.

Add ``compute.wait_stream(_prefetch_stream)`` alongside the
existing ``compute.wait_stream(_swap_stream)`` so both
cross-stream barriers fire when present. Cost: one extra
event-record / event-wait per LoRA container hook fire; no-op when
``_prefetch_stream`` isn't running anything.

**F-#7 — Broaden exception scope in ``check_cuda_p2p_support``.**

My D9 fix at ``utils/environment.py:96`` caught only
``AssertionError`` from ``torch.cuda.can_device_access_peer``.
Per the PyTorch 2.6 docs path, the Python wrapper validates
device indices with ``AssertionError``, but the C++ binding
``_cuda_canDeviceAccessPeer`` it delegates to can surface
exceptions from the CUDA runtime (``RuntimeError`` wrapping
``cudaErrorInvalidDevice``, peer-access machinery errors) that
``AssertionError`` wouldn't catch. An unhandled exception would
propagate out of the helper and break the fail-closed contract —
ranks would disagree about ``NCCL_P2P_DISABLE`` which is exactly
the SIGSEGV class commit ``91e0912e`` set out to prevent.

Widened to ``except Exception`` (with ``noqa: BLE001`` annotation
explicitly documenting the fail-closed rationale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant