Enhance replication check, matching pattern, logging in dump comparator#19677
fzyzcjy merged 485 commits into sgl-project:main from
Conversation
…ns with spaces" This reverts commit d7dd57c. # Conflicts: # test/registered/debug_utils/test_dumper.py
extra_imports (e.g. the dumper import) must execute before the user preamble code so that the preamble can reference the imported names.
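A minimal illustration of that ordering, using `import json` as a stand-in for the real dumper import (the actual module names come from the PR's dumper utilities and are not shown here):

```python
namespace: dict = {}

# Stand-ins: the real strings would be the dumper import and the user's preamble.
extra_imports = "import json"
user_preamble = "data = json.dumps({'ok': True})"

# extra_imports must execute first ...
exec(extra_imports, namespace)
# ... so the preamble can reference the imported name.
exec(user_preamble, namespace)

print(namespace["data"])
```

If the two `exec` calls were swapped, the preamble would raise `NameError: name 'json' is not defined`.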
- Add dp_rank field to PositionalSeqId (default=0 for backward compat)
- Pass dp_rank through aux_plugins compute_step_aux to distinguish sequences from different DP ranks
- Make aux_loader DP-aware: detect dp_rank from metadata, group rows by dp_rank, independently unshard each group, then concatenate step_auxs across DP groups
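A sketch of a backward-compatible `dp_rank` field on `PositionalSeqId`; only `dp_rank` and its default come from the commit message, the other field is an assumption for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionalSeqId:
    seq_index: int    # assumed existing field
    dp_rank: int = 0  # default=0 keeps old single-DP call sites working


# Pre-DP construction still works and equals the explicit dp_rank=0 form.
assert PositionalSeqId(seq_index=3) == PositionalSeqId(seq_index=3, dp_rank=0)
```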
When dp_size > 1, group ValueWithMeta items by dp_rank, unshard each group independently, concat results along token dim across dp_ranks, then proceed with cross-side token alignment and comparison.
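A rough, self-contained sketch of that flow, with plain lists standing in for tensors; `ValueWithMeta` here is a simplified stand-in for the real class, and the unshard step is reduced to a concatenation:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ValueWithMeta:
    value: list  # stand-in for a tensor shard
    meta: dict   # carries dp_rank / tp_rank from the dump metadata


def unshard(items):
    # Stand-in unsharder: concatenate TP shards in rank order.
    merged = []
    for it in sorted(items, key=lambda x: x.meta.get("tp_rank", 0)):
        merged.extend(it.value)
    return merged


def merge_across_dp(items, dp_size):
    if dp_size <= 1:
        return unshard(items)
    # Group by dp_rank, unshard each group independently,
    # then concatenate results along the token dim across dp_ranks.
    groups = defaultdict(list)
    for it in items:
        groups[it.meta.get("dp_rank", 0)].append(it)
    out = []
    for rank in sorted(groups):
        out.extend(unshard(groups[rank]))
    return out
```

With dp_size=2, the two ranks' shards merge in dp_rank order before the cross-side token alignment step described above.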
Tests cover:
- PositionalSeqId dp_rank field backward compatibility
- dp_rank extraction from megatron/sglang metadata
- dp_size detection for bundle_comparator DP routing
- group_by_dp_rank grouping logic
- aux_loader DP-aware loading with DP=2 megatron and sglang
- seq_info_builder with DP-distinct PositionalSeqIds
- dp_utils.py: shared dp_rank/dp_size extraction from metadata
- dp_grouping.py: aux_loader DP grouping/merging logic
- dp_bundle_comparator.py: DP-aware tensor bundle comparison
- Slim down aux_loader.py and bundle_comparator.py with re-exports for backward compatibility
This reverts commit 5fe0ae4.
This reverts commit 02d55d9.
This reverts commit 86317b8.
This reverts commit 562994b.
This reverts commit fe43d07.
…pes" This reverts commit d1b828e.
In DP training, only 1 dp_rank has non-empty tensors while others dump empty (numel=0) tensors. This utility filters out the empty dp_rank items so downstream code can process the data unchanged.
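A sketch of such a filter; `filter_to_non_empty_dp_rank` is the name used in the tests described below, while the dict-shaped items and the `numel` key are simplifications of the real ValueWithMeta items:

```python
def filter_to_non_empty_dp_rank(items):
    """Keep only the single dp_rank group whose tensors carry data (numel > 0)."""
    groups = {}
    for it in items:
        groups.setdefault(it["dp_rank"], []).append(it)
    if len(groups) <= 1:
        return items  # dp_size == 1: nothing to filter
    non_empty = {
        rank: group
        for rank, group in groups.items()
        if any(it["numel"] > 0 for it in group)
    }
    if len(non_empty) != 1:
        raise ValueError(
            f"expected exactly one non-empty dp_rank, got {sorted(non_empty)}"
        )
    (kept,) = non_empty.values()
    return kept
```

Downstream code then processes the surviving dp_rank's items exactly as in the single-DP case.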
Filter out empty dp_rank items before the tensor/non-tensor routing so that downstream comparison sees only the meaningful data.
Filter out empty dp_rank items in both _load_non_tensor_aux and _load_and_align_aux_tensor so that multi-rank aux loading works correctly when DP > 1.
Tests cover:
- _extract_dp_info for sglang/megatron parallel info formats
- _group_has_data for empty/non-empty tensor detection
- filter_to_non_empty_dp_rank: dp_size=1, dp=2 one-empty, both-nonempty error, DP×TP
- aux_loader integration: _load_non_tensor_aux and _load_and_align_aux_tensor with DP=2
Non-tensor aux values (e.g. rids) are identical across all dp_ranks, so there's no empty/non-empty distinction. Return unchanged when all items are non-tensors.
- Rename test_dp_filter.py -> test_dp_utils.py (unit tests for dp_utils.py)
- Move aux_loader DP integration tests into test_aux_loader.py
- Add DP E2E tests to test_entrypoint.py (dp2 sglang, dp2 megatron, dp2×tp2)
ReplicatedMismatchWarning is informational — it indicates EP/TP replicas have numerical differences, which is expected for bfloat16 tensors after MoE alltoall. Only non-ReplicatedMismatchWarning warnings should cause category to be "failed".
Now that ReplicatedMismatchWarning doesn't cause category="failed", update the test to assert category="passed" and summary.passed=1.
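The resulting categorization rule could look roughly like this; the class name follows the discussion above, but the function signature and record structure are assumptions:

```python
class ReplicatedMismatchWarning(UserWarning):
    """Informational: EP/TP replicas differ numerically,
    which is expected for bfloat16 tensors after MoE alltoall."""


def categorize(warnings: list) -> str:
    # Only warnings other than ReplicatedMismatchWarning mark the record failed.
    if any(not isinstance(w, ReplicatedMismatchWarning) for w in warnings):
        return "failed"
    return "passed"


assert categorize([ReplicatedMismatchWarning()]) == "passed"
assert categorize([UserWarning("general")]) == "failed"
```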
In grouping=raw mode, each rank is a separate bundle. DP filtering needs all ranks in the same bundle, which requires grouping=logical.
- replicated_mismatch alone → passed (not failed)
- add test for GeneralWarning → still failed
- SkipRecord with replicated_mismatch → skipped (not failed)
The previous replace_all accidentally changed grouping in TestEntrypointGroupingRaw tests. Revert those, keep only the 3 DP test changes to grouping=logical.
# Conflicts: # python/sglang/srt/debug_utils/comparator/aligner/token_aligner/executor.py
- Add missing `warning_sink` import in unsharder test_executor.py
- Remove unused `parse_dim_names` import in test_dims.py
- Add CI registry to source_patcher test files
- Fix axis_swapper.py to use parse_dims().dims instead of parse_dims()
# Conflicts: # test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
# Conflicts:
# python/sglang/srt/debug_utils/comparator/bundle_comparator.py
# python/sglang/srt/debug_utils/comparator/dims.py
# python/sglang/srt/debug_utils/comparator/entrypoint.py
# test/registered/debug_utils/comparator/aligner/reorderer/test_planner.py
# test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
# test/registered/debug_utils/comparator/test_dims.py
# test/registered/debug_utils/comparator/test_entrypoint.py
# test/registered/debug_utils/comparator/test_model_validation.py
# test/registered/debug_utils/source_patcher/test_code_patcher.py
# test/registered/debug_utils/source_patcher/test_dumper_integration.py
# test/registered/debug_utils/source_patcher/test_source_editor.py
# Conflicts: # test/registered/debug_utils/comparator/test_entrypoint.py
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the dump comparator's capabilities by introducing a more robust and flexible logging system, refining how tensor dimensions and parallel groups are parsed, and improving the accuracy of replication checks. It also streamlines the command-line interface with configurable presets and offers finer control over exit codes based on comparison outcomes. These changes collectively aim to make the comparator a more powerful and user-friendly tool for debugging parallel execution in deep learning models.
Code Review
This pull request introduces a significant and valuable refactoring of the dump comparator utility. Key improvements include a more structured logging system with log_sink that differentiates between errors and informational messages, clearer naming for output record types, and enhanced dimension string parsing to support data parallelism group aliases. The introduction of presets for command-line arguments is a great addition for usability. The overall changes enhance the maintainability, flexibility, and robustness of the comparator. I've identified two instances of duplicated code that should be addressed, which I've commented on directly.
```python
def _maybe_load_tokenizer(args: argparse.Namespace) -> Any:
    tokenizer_path: Optional[str] = getattr(args, "tokenizer", None)

    if tokenizer_path is None:
        for directory in [Path(args.baseline_path), Path(args.target_path)]:
            tokenizer_path = read_tokenizer_path(directory)
            if tokenizer_path is not None:
                break

    if tokenizer_path is None:
        return None

    try:
        from transformers import AutoTokenizer

        return AutoTokenizer.from_pretrained(tokenizer_path)
    except Exception:
        return None
```

```python
class TestReduceSum:
    def test_basic_tp2_reduce(self) -> None:
        """2 partial tensors sum to full tensor."""
```
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci