Support directory detection in dump comparator#19680
Support directory detection in dump comparator#19680fzyzcjy merged 573 commits intosgl-project:mainfrom
Conversation
The set excludes DP (non-shardable), so the old name was misleading.
Previously, load failures were silently filtered out. Now each failure emits a GeneralWarning via warning_sink so it surfaces to the user through the record's warnings list.
Tests cover: all success (no warnings), partial failure (one warning emitted, good files kept), and all corrupted (empty result, one warning per file).
- Add _compute_exit_code() that returns 1 when any comparison fails - run() now returns int exit code, main() calls sys.exit() - _consume_comparison_records() returns SummaryRecord - Add --forbid-skip flag to also fail on skipped comparisons
…rn type - _run_and_parse now returns tuple[list[AnyRecord], int] - Add forbid_skip=False to _make_args defaults - Update all call sites to unpack the tuple - Add TestExitCode class with 7 unit tests + 2 integration tests
TestExitCodeSubprocess invokes comparator via subprocess.run and verifies exit code for passed, failed, skipped, and --forbid-skip.
Add ParallelModifier dataclass supporting axis:qualifier colon syntax. DimSpec now uses parallel_modifiers tuple to support multiple parallel axes on a single dimension (e.g. t(cp:zigzag,sp)).
Iterate over parallel_modifiers tuple instead of single parallel field. Unshard in reverse order (innermost shard first). _resolve_unshard_params now takes modifier + dim_name instead of full DimSpec.
Iterate over parallel_modifiers to find zigzag ordering instead of checking single spec.ordering field.
Update h(tp,partial) -> h(tp:partial), s(cp,zigzag) -> s(cp:zigzag), t(cp,zigzag) -> t(cp:zigzag), etc. across all source and test files. Also fix unsharder planner to reverse modifiers within each dim spec (not globally) for correct innermost-first unshard order.
- ParallelModifier and DimSpec now extend _FrozenBase (frozen=True, extra='forbid') instead of frozen dataclass - parallel_modifiers uses list instead of tuple - Parser uses axis:qual+qual syntax (+ separates qualifiers within one modifier, comma separates modifiers on a dim) - Rename sharded_modifiers → reversed_sharded_modifiers in planner - Collapse modifier collection to one-line comprehension
Replace `if x not in dict` + `dict[x]` pattern with `dict.get(x)` + `if ... is None` for both axis and qualifier lookups.
Add unsharder planner tests verifying SP unshards first (inner) then CP (outer) for same-dim multi-axis. Add E2E round-trip test with CP=2 zigzag + SP=2 on token dim covering full unshard + reorder pipeline.
Add test_cp_zigzag_sp_same_dim_unshard covering full pipeline: dump creation → metadata loading → aligner plan → SP unshard + CP unshard + zigzag reorder → tensor comparison. Also add helper _create_cp_zigzag_sp_sharded_dumps.
The 't' dim zigzag reorder requires thd_global_seq_lens (derived from cu_seqlens_q). Reworked helper to do per-sequence zigzag split and include cu_seqlens_q dumps.
THD zigzag reorder requires thd_global_seq_lens derived from cu_seqlens_q, which only gets loaded via the smart token aligner path. Add input_ids dumps alongside cu_seqlens_q and switch to token_aligner="smart".
The t dim requires THD aux handling (cu_seqlens_q, input_ids) which adds complexity. Use b s(cp:zigzag,sp) h to test the same multi-axis same-dim unshard + reorder flow without THD aux dependencies.
Collect moe_dp_rank/size and attn_cp_rank/size in _SGLangPlugin.collect_parallel_info() using existing SGLang APIs that were previously not being dumped.
In dp_attn mode, dp_size > 1 but MLP tensors have data on all ranks. The existing dp filter would incorrectly drop valid ranks. This adds a `// dp:=<group_name>` syntax in dims strings that redirects dp filtering to use `<group>_rank`/`<group>_size` fields from metadata instead of the default `dp_rank`/`dp_size`. - dims.py: parse_dims() strips `//` section; new extract_dp_group_alias() - dp_utils.py: filter/extract accept dp_group_alias parameter - bundle_comparator.py: swap dims override before dp filter, pass alias
- test_dims.py: extract_dp_group_alias, parse_dims with //, resolve_dim_names with // - test_dp_utils.py: _extract_dp_info and filter with dp_group_alias param - test_entrypoint.py: E2E tests for dp alias noop, override-dims alias, real alias filtering
Assert that sglang_parallel_info in dump metadata contains the newly added moe_dp_rank/size and attn_cp_rank/size fields alongside the existing tp_rank/size.
Check every group (tp, pp, moe_ep, moe_tp, moe_dp, attn_tp, attn_dp, local_attn_dp, attn_cp, enable_dp_attention) rather than only the newly added ones.
…xxx directly Only keep local vars for derived values (dir_pair, viz_output_dir, etc.) that need Path() conversion or conditional logic.
- >=2 subdirs with .pt → ValueError with clear message - 0 subdirs with .pt (and no .pt at root) → ValueError - Add test for single non-empty subdir among empty siblings
auto_descend_dir emits log via log_sink → report_sink, so the output format must be set before paths are resolved. Call configure() early with format only, then again with the resolved report_path.
The test was checking stderr for a [comparator] prefix that never existed. Now checks for auto-descend info in LogRecord output.
axis_swapper uses warning_sink to report dim name mismatches, but the module and type were missing.
The new dims_spec parser supports modifier syntax like h(tp).
dims_spec parser uses square brackets for modifiers.
h(tp:partial) -> h[tp:partial], s(cp) -> s[cp]
…se_dims().dims in reduce tests
… fix parse_dims().dims in reduce tests" This reverts commit d5480c8.
…tor" This reverts commit 28afed6.
…r test" This reverts commit 3a8205d.
This reverts commit 44da934.
This reverts commit 220c379.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors and enhances the dump comparator's ability to handle complex tensor layouts and distributed training configurations. Key changes include a new dimension specification system that supports fused dimensions and explicit replicated axes, improved axis alignment, and a more robust logging mechanism. These updates aim to make the comparator more flexible, accurate, and user-friendly for debugging diverse model architectures and parallelization strategies. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and high-quality refactoring of the dump comparator utility. The changes enhance its functionality, robustness, and user experience. Key improvements include a new, more powerful dims_spec parser that supports fused dimensions and explicit replication declarations, automatic discovery of dump directories, a more flexible command-line interface with presets, and improved logging and error handling. The code quality is excellent, and the changes are well-tested. I have not found any issues or bugs in this pull request.
# Conflicts: # python/sglang/srt/debug_utils/comparator/aligner/axis_swapper.py
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci