Enhance replication check, matching pattern, logging in dump comparator#19677

Merged
fzyzcjy merged 485 commits into sgl-project:main from fzyzcjy:ac8420/30
Mar 2, 2026

Collaborator

@fzyzcjy fzyzcjy commented Mar 2, 2026

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

…ns with spaces"

This reverts commit d7dd57c.

# Conflicts:
#	test/registered/debug_utils/test_dumper.py
extra_imports (e.g. dumper import) must execute before user preamble
code so that preamble can reference the imported names.
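The ordering requirement can be illustrated with a small sketch. `assemble_source` and its parameters are hypothetical names chosen for this example, not the patcher's actual API; the sketch only shows why the imports have to be emitted before the user preamble.

```python
def assemble_source(extra_imports: list[str], preamble: str, body: str) -> str:
    """Join the pieces so that imports precede the user preamble (hypothetical helper)."""
    return "\n".join([*extra_imports, preamble, body])


# The preamble refers to a name that only exists once the extra import has run.
source = assemble_source(
    extra_imports=["import math"],
    preamble="TAU = 2 * math.pi",
    body="result = round(TAU, 2)",
)

namespace: dict = {}
exec(source, namespace)
print(namespace["result"])  # 6.28
```

If the preamble were placed first, the same code would raise `NameError` because `math` would not yet be bound.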
- Add dp_rank field to PositionalSeqId (default=0 for backward compat)
- Pass dp_rank through aux_plugins compute_step_aux to distinguish
  sequences from different DP ranks
- Make aux_loader DP-aware: detect dp_rank from metadata, group rows
  by dp_rank, independently unshard each group, then concatenate
  step_auxs across DP groups
When dp_size > 1, group ValueWithMeta items by dp_rank, unshard each
group independently, concat results along token dim across dp_ranks,
then proceed with cross-side token alignment and comparison.
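The group, unshard, and concatenate flow described above might look roughly like the following. This is a minimal sketch under stated assumptions: `Item`, `unshard_group`, and `unshard_dp_aware` are illustrative names, nested lists stand in for tensors, and the per-group unshard is assumed to be a simple TP concatenation along the hidden dim. The real comparator's types and unsharding rules are not shown in this PR text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a dumped value plus its parallel metadata."""

    dp_rank: int
    tp_rank: int
    tokens: list[list[float]]  # shape [num_tokens, hidden_shard]


def group_by_dp_rank(items: list[Item]) -> dict[int, list[Item]]:
    groups: dict[int, list[Item]] = defaultdict(list)
    for item in items:
        groups[item.dp_rank].append(item)
    return dict(groups)


def unshard_group(group: list[Item]) -> list[list[float]]:
    """Assumed unshard rule: concatenate TP shards along the hidden dim, per token."""
    shards = sorted(group, key=lambda item: item.tp_rank)
    num_tokens = len(shards[0].tokens)
    return [
        [x for shard in shards for x in shard.tokens[t]]
        for t in range(num_tokens)
    ]


def unshard_dp_aware(items: list[Item]) -> list[list[float]]:
    groups = group_by_dp_rank(items)
    merged: list[list[float]] = []
    # Each dp_rank is unsharded on its own, then appended along the token dim.
    for dp_rank in sorted(groups):
        merged.extend(unshard_group(groups[dp_rank]))
    return merged


items = [
    Item(dp_rank=0, tp_rank=0, tokens=[[1.0], [2.0]]),
    Item(dp_rank=0, tp_rank=1, tokens=[[10.0], [20.0]]),
    Item(dp_rank=1, tp_rank=0, tokens=[[3.0]]),
    Item(dp_rank=1, tp_rank=1, tokens=[[30.0]]),
]
print(unshard_dp_aware(items))  # [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
```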
Tests cover:
- PositionalSeqId dp_rank field backward compatibility
- dp_rank extraction from megatron/sglang metadata
- dp_size detection for bundle_comparator DP routing
- group_by_dp_rank grouping logic
- aux_loader DP-aware loading with DP=2 megatron and sglang
- seq_info_builder with DP-distinct PositionalSeqIds
- dp_utils.py: shared dp_rank/dp_size extraction from metadata
- dp_grouping.py: aux_loader DP grouping/merging logic
- dp_bundle_comparator.py: DP-aware tensor bundle comparison
- Slim down aux_loader.py and bundle_comparator.py with re-exports
  for backward compatibility
In DP training, only 1 dp_rank has non-empty tensors while others
dump empty (numel=0) tensors. This utility filters out the empty
dp_rank items so downstream code can process the data unchanged.
Filter out empty dp_rank items before the tensor/non-tensor routing
so that downstream comparison sees only the meaningful data.
Filter out empty dp_rank items in both _load_non_tensor_aux and
_load_and_align_aux_tensor so that multi-rank aux loading works
correctly when DP > 1.
Tests cover:
- _extract_dp_info for sglang/megatron parallel info formats
- _group_has_data for empty/non-empty tensor detection
- filter_to_non_empty_dp_rank: dp_size=1, dp=2 one-empty, both-nonempty error, DP×TP
- aux_loader integration: _load_non_tensor_aux and _load_and_align_aux_tensor with DP=2
Non-tensor aux values (e.g. rids) are identical across all dp_ranks,
so there's no empty/non-empty distinction. Return unchanged when all
items are non-tensors.
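The filtering rules listed in these commits can be sketched as follows. `Dumped` and `filter_to_non_empty_dp_rank_sketch` are hypothetical stand-ins, with a Python list playing the role of a tensor. Returning the input unchanged when no rank has data is an assumption of this sketch; the commits only state the dp_size=1, one-empty, both-non-empty, and all-non-tensor cases.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Dumped:
    """Hypothetical stand-in for one dumped value and its DP metadata."""

    dp_rank: int
    dp_size: int
    value: Any  # a list stands in for a tensor; anything else is a non-tensor


def _is_tensor(value: Any) -> bool:
    return isinstance(value, list)


def filter_to_non_empty_dp_rank_sketch(items: list[Dumped]) -> list[Dumped]:
    # Nothing to filter without data parallelism.
    if not items or items[0].dp_size == 1:
        return items
    # Non-tensor values are identical on every dp_rank, so keep them all.
    if not any(_is_tensor(item.value) for item in items):
        return items
    non_empty_ranks = {
        item.dp_rank
        for item in items
        if _is_tensor(item.value) and len(item.value) > 0
    }
    if len(non_empty_ranks) > 1:
        raise ValueError(f"expected one non-empty dp_rank, got {sorted(non_empty_ranks)}")
    if not non_empty_ranks:
        return items  # assumption: leave all-empty input untouched
    return [item for item in items if item.dp_rank in non_empty_ranks]


kept = filter_to_non_empty_dp_rank_sketch(
    [Dumped(0, 2, []), Dumped(1, 2, [0.5, 0.25])]
)
print([item.dp_rank for item in kept])  # [1]
```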
- Rename test_dp_filter.py -> test_dp_utils.py (unit tests for dp_utils.py)
- Move aux_loader DP integration tests into test_aux_loader.py
- Add DP E2E tests to test_entrypoint.py (dp2 sglang, dp2 megatron, dp2×tp2)
ReplicatedMismatchWarning is informational — it indicates EP/TP
replicas have numerical differences, which is expected for bfloat16
tensors after MoE alltoall. Only non-ReplicatedMismatchWarning
warnings should cause category to be "failed".
Now that ReplicatedMismatchWarning doesn't cause category="failed",
update the test to assert category="passed" and summary.passed=1.
In grouping=raw mode, each rank is a separate bundle. DP filtering
needs all ranks in the same bundle, which requires grouping=logical.
- replicated_mismatch alone → passed (not failed)
- add test for GeneralWarning → still failed
- SkipRecord with replicated_mismatch → skipped (not failed)
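The three cases above can be captured in a small decision function. This is only a sketch: `RecordSketch`, the `warning_kinds` strings, and the rule that a blocking warning takes precedence over a skip are assumptions made for illustration, not the comparator's actual record types.

```python
from dataclasses import dataclass, field

# Warning kinds treated as informational only (assumed string tags).
INFORMATIONAL_KINDS = {"replicated_mismatch"}


@dataclass
class RecordSketch:
    """Hypothetical record: comparison outcome plus attached warning kinds."""

    diff_passed: bool = True
    skipped: bool = False
    warning_kinds: list[str] = field(default_factory=list)


def category(record: RecordSketch) -> str:
    # Any warning outside the informational set forces a failure.
    if any(kind not in INFORMATIONAL_KINDS for kind in record.warning_kinds):
        return "failed"
    if record.skipped:
        return "skipped"
    return "passed" if record.diff_passed else "failed"


print(category(RecordSketch(warning_kinds=["replicated_mismatch"])))  # passed
print(category(RecordSketch(warning_kinds=["general"])))  # failed
print(category(RecordSketch(skipped=True, warning_kinds=["replicated_mismatch"])))  # skipped
```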
The previous replace_all accidentally changed grouping in
TestEntrypointGroupingRaw tests. Revert those, keep only the 3 DP
test changes to grouping=logical.
fzyzcjy added 23 commits March 1, 2026 09:56
# Conflicts:
#	python/sglang/srt/debug_utils/comparator/aligner/token_aligner/executor.py
- Add missing `warning_sink` import in unsharder test_executor.py
- Remove unused `parse_dim_names` import in test_dims.py
- Add CI registry to source_patcher test files
- Fix axis_swapper.py to use parse_dims().dims instead of parse_dims()
# Conflicts:
#	test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
# Conflicts:
#	python/sglang/srt/debug_utils/comparator/bundle_comparator.py
#	python/sglang/srt/debug_utils/comparator/dims.py
#	python/sglang/srt/debug_utils/comparator/entrypoint.py
#	test/registered/debug_utils/comparator/aligner/reorderer/test_planner.py
#	test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
#	test/registered/debug_utils/comparator/test_dims.py
#	test/registered/debug_utils/comparator/test_entrypoint.py
#	test/registered/debug_utils/comparator/test_model_validation.py
#	test/registered/debug_utils/source_patcher/test_code_patcher.py
#	test/registered/debug_utils/source_patcher/test_dumper_integration.py
#	test/registered/debug_utils/source_patcher/test_source_editor.py
# Conflicts:
#	test/registered/debug_utils/comparator/test_entrypoint.py
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the dump comparator's capabilities by introducing a more robust and flexible logging system, refining how tensor dimensions and parallel groups are parsed, and improving the accuracy of replication checks. It also streamlines the command-line interface with configurable presets and offers finer control over exit codes based on comparison outcomes. These changes collectively aim to make the comparator a more powerful and user-friendly tool for debugging parallel execution in deep learning models.

Highlights

  • Refactored Output Types and Logging: Renamed comparison-related output types (e.g., ComparisonRecord to TensorComparisonRecord, SkipRecord to SkipComparisonRecord) for clarity and introduced a new structured logging system (log_sink with ErrorLog and InfoLog) to replace the previous warning_sink.
  • Enhanced Dimension Parsing and Data Parallel Grouping: Modified the parse_dims function to return a DimsSpec object, which now includes support for extracting data parallel (DP) group aliases (e.g., dp:=moe_dp) from dimension strings. This allows for more flexible and accurate filtering of data parallel ranks.
  • Improved Replication Check Logic: Refactored the _verify_replicated_group function into a new helper _check_replicated_pair to improve modularity and added explicit handling for shape mismatches during replication checks.
  • Configurable CLI Presets: Introduced a new preset.py module and expand_preset function to allow users to define and use CLI argument presets, simplifying common comparator configurations.
  • Granular Exit Code Control: Added a new --allow-failed-pattern CLI argument and updated the compute_exit_code logic to provide more granular control over when failed comparisons should result in a non-zero exit code.
  • Extended Parallel Information Collection: Expanded the Dumper.collect_parallel_info method to include additional MoE data parallel (moe_dp_rank, moe_dp_size) and attention context parallel (attn_cp_rank, attn_cp_size) information, providing richer metadata for debugging.
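As an illustration of the DP group alias highlight, the sketch below pulls a `dp:=<name>` token out of a dimension string and returns the remaining dim names alongside it. The whitespace-separated syntax, the regex, and the names `DimsSpecSketch` and `parse_dims_sketch` are assumptions for this example; the real `parse_dims` grammar is not reproduced in this summary.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed token shape: "dp:=" followed by an identifier, e.g. "dp:=moe_dp".
_DP_ALIAS_RE = re.compile(r"\bdp:=(\w+)")


@dataclass(frozen=True)
class DimsSpecSketch:
    """Hypothetical counterpart of DimsSpec: dim names plus an optional alias."""

    dims: list[str]
    dp_group_alias: Optional[str] = None


def parse_dims_sketch(text: str) -> DimsSpecSketch:
    match = _DP_ALIAS_RE.search(text)
    alias = match.group(1) if match else None
    remaining = _DP_ALIAS_RE.sub("", text)  # drop the alias token before splitting
    return DimsSpecSketch(dims=remaining.split(), dp_group_alias=alias)


spec = parse_dims_sketch("t h dp:=moe_dp")
print(spec.dims, spec.dp_group_alias)  # ['t', 'h'] moe_dp
```

Callers that previously consumed a plain list would read `spec.dims`, which mirrors the "access the dims attribute" changes listed in the changelog.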

Changelog
  • python/sglang/srt/debug_utils/comparator/__init__.py
    • Renamed ComparisonRecord to TensorComparisonRecord.
  • python/sglang/srt/debug_utils/comparator/aligner/axis_aligner.py
    • Replaced warning_sink with log_sink and GeneralWarning with ErrorLog.
    • Updated parse_dims calls to access the dims attribute of the returned DimsSpec object.
  • python/sglang/srt/debug_utils/comparator/aligner/axis_swapper.py
    • Updated parse_dims calls to access the dims attribute of the returned DimsSpec object.
  • python/sglang/srt/debug_utils/comparator/aligner/entrypoint/planner.py
    • Updated parse_dims calls to access the dims attribute of the returned DimsSpec object.
  • python/sglang/srt/debug_utils/comparator/aligner/token_aligner/entrypoint.py
    • Replaced warning_sink with log_sink and GeneralWarning with InfoLog.
    • Adjusted token aligner mode logic to default to None instead of concat_steps.
  • python/sglang/srt/debug_utils/comparator/aligner/token_aligner/smart/aux_loader.py
    • Replaced warning_sink with log_sink and GeneralWarning with ErrorLog or InfoLog.
  • python/sglang/srt/debug_utils/comparator/aligner/token_aligner/smart/aux_plugins.py
    • Replaced warning_sink with log_sink and GeneralWarning with InfoLog.
  • python/sglang/srt/debug_utils/comparator/aligner/unsharder/executor.py
    • Refactored _verify_replicated_group to use a new helper function _check_replicated_pair.
    • Added shape mismatch handling within _check_replicated_pair for replicated tensors.
  • python/sglang/srt/debug_utils/comparator/bundle_comparator.py
    • Replaced warning_sink with log_sink and updated output type imports (e.g., ComparisonRecord to TensorComparisonRecord).
    • Adjusted the order of dims override and DP filter application.
    • Modified filter_to_non_empty_dp_rank call to pass dp_group_alias.
    • Added _extract_dp_alias_from_items helper function.
  • python/sglang/srt/debug_utils/comparator/dims.py
    • Introduced DimsSpec class to hold parsed dimension specifications and an optional dp_group_alias.
    • Modified parse_dims to return a DimsSpec object instead of a list of DimSpec.
    • Updated resolve_dim_names to extract names from DimsSpec.
    • Added a shape mismatch check in apply_dim_names.
    • Added _extract_dp_group_alias function to parse DP group aliases from dimension strings.
  • python/sglang/srt/debug_utils/comparator/dp_utils.py
    • Modified filter_to_non_empty_dp_rank and _extract_dp_info to accept an optional dp_group_alias argument, allowing filtering based on specific data parallel groups.
  • python/sglang/srt/debug_utils/comparator/entrypoint.py
    • Removed unused re import.
    • Updated imports for renamed output types (e.g., ComparisonRecord to TensorComparisonRecord).
    • Replaced direct call to _compute_exit_code with compute_exit_code from utils.
    • Updated _compute_skip_keys logic to use grouping_skip_keys from args.
    • Introduced parse_args function to handle CLI argument parsing and preset expansion.
    • Added --preset, --grouping-skip-keys, and --allow-failed-pattern CLI arguments.
  • python/sglang/srt/debug_utils/comparator/log_sink.py
    • Added new file: Implemented LogSink class for structured collection and reporting of ErrorLog and InfoLog messages.
  • python/sglang/srt/debug_utils/comparator/output_types.py
    • Introduced BaseLog, ErrorLog, and InfoLog classes for structured logging.
    • Added _split_logs helper function to separate errors and infos.
    • Introduced RecordLocation to include step information in comparison records.
    • Renamed ComparisonRecord to TensorComparisonRecord, SkipRecord to SkipComparisonRecord, and NonTensorRecord to NonTensorComparisonRecord.
    • Updated _OutputRecord to store errors and infos instead of warnings.
    • Made diff optional in ReplicatedCheckResult.
    • Renamed WarningRecord to LogRecord.
  • python/sglang/srt/debug_utils/comparator/per_token_visualizer.py
    • Updated ComparisonRecord to TensorComparisonRecord in function signatures.
  • python/sglang/srt/debug_utils/comparator/preset.py
    • Added new file: Defined CLI argument presets (raw, sglang_dev, sglang_megatron) and expand_preset logic.
  • python/sglang/srt/debug_utils/comparator/tensor_comparator/formatter.py
    • Updated format_replicated_checks to handle cases where diff might be None.
  • python/sglang/srt/debug_utils/comparator/utils.py
    • Added compute_exit_code function to determine the process exit code based on comparison summary and allowance patterns.
    • Added _is_all_match_pattern helper for regex matching.
  • python/sglang/srt/debug_utils/comparator/warning_sink.py
    • Removed file: Replaced by log_sink.py.
  • python/sglang/srt/debug_utils/dumper.py
    • Added moe_dp_rank, moe_dp_size, attn_cp_rank, and attn_cp_size to the collected parallel information.
  • test/registered/debug_utils/comparator/aligner/reorderer/test_planner.py
    • Updated parse_dims calls to access the dims attribute of the returned DimsSpec object.
  • test/registered/debug_utils/comparator/aligner/test_axis_aligner.py
    • Replaced warning_sink with log_sink in tests.
  • test/registered/debug_utils/comparator/aligner/token_aligner/test_aux_loader.py
    • Replaced WarningSink with LogSink and GeneralWarning with ErrorLog or InfoLog in tests.
  • test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
    • Updated parse_dims calls to access the dims attribute of the returned DimsSpec object.
    • Added new test cases for TestReduceSum and TestThdCpConcat.
  • test/registered/debug_utils/comparator/aligner/unsharder/test_planner.py
    • Updated parse_dims calls to access the dims attribute of the returned DimsSpec object.
  • test/registered/debug_utils/comparator/tensor_comparator/test_types.py
    • Updated ComparisonRecord to TensorComparisonRecord and SkipRecord to SkipComparisonRecord.
    • Replaced GeneralWarning with ErrorLog and WarningRecord with LogRecord in tests.
    • Added InfoLog to discriminated union parsing tests.
  • test/registered/debug_utils/comparator/test_bundle_comparator.py
    • Replaced WarningSink with LogSink and GeneralWarning with ErrorLog in tests.
  • test/registered/debug_utils/comparator/test_dims.py
    • Added DimsSpec import.
    • Removed parse_dim_names import.
    • Updated parse_dims calls to access the dims attribute of the returned DimsSpec object.
    • Added new test cases for TestParseDimsWithHash, TestDpGroupAlias, and TestResolveDimNamesWithHash.
    • Added a test for apply_dim_names with ndim mismatch.
  • test/registered/debug_utils/comparator/test_dp_utils.py
    • Added new test cases for TestExtractDpInfoWithAlias and TestFilterToNonEmptyDpRankWithAlias to verify dp_group_alias functionality.
  • test/registered/debug_utils/comparator/test_entrypoint.py
    • Updated imports for renamed output types and parse_args.
    • Replaced _make_args with _make_argv to construct argument lists for parse_args.
    • Updated tests to use preset and grouping_skip_keys arguments.
    • Added new test cases for TestEntrypointPerStepMode and TestEntrypointDpGroupAlias.
    • Updated exit code tests to use allow_failed_pattern.
  • test/registered/debug_utils/comparator/test_log_sink.py
    • Renamed from test_warning_sink.py.
    • Updated tests to use LogSink, ErrorLog, and InfoLog.
  • test/registered/debug_utils/comparator/test_manually_verify.py
    • Updated ComparisonRecord to TensorComparisonRecord in tests.
  • test/registered/debug_utils/comparator/test_model_validation.py
    • Updated ComparisonRecord to TensorComparisonRecord, SkipRecord to SkipComparisonRecord, NonTensorRecord to NonTensorComparisonRecord.
    • Replaced GeneralWarning with ErrorLog in tests.
  • test/registered/debug_utils/comparator/test_output_types.py
    • Added new file: Contains tests for _split_logs and LogRecord.to_text().
  • test/registered/debug_utils/comparator/test_per_token_visualizer.py
    • Updated ComparisonRecord to TensorComparisonRecord in tests.
  • test/registered/debug_utils/comparator/test_preset.py
    • Added new file: Contains tests for expand_preset functionality.
  • test/registered/debug_utils/comparator/test_utils.py
    • Added SummaryRecord import.
    • Added new test cases for TestComputeExitCode.
  • test/registered/debug_utils/source_patcher/test_code_patcher.py
    • Updated register_cpu_ci to include nightly=True.
  • test/registered/debug_utils/source_patcher/test_dumper_integration.py
    • Updated register_cpu_ci to include nightly=True.
  • test/registered/debug_utils/source_patcher/test_source_editor.py
    • Updated register_cpu_ci to include nightly=True.
  • test/registered/debug_utils/test_dumper.py
    • Added assertions to check for the presence of new parallel info keys (moe_dp_rank, moe_dp_size, attn_cp_rank, attn_cp_size) in sglang_parallel_info.
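The exit-code behaviour described for `compute_exit_code` and `--allow-failed-pattern` could work along these lines. This is a hedged sketch: the function signatures, the use of `re.fullmatch`, and the choice of exit code 1 are assumptions, since the changelog only states that the exit code depends on the comparison summary and an allowance pattern.

```python
import re
from typing import Optional


def all_match_pattern(names: list[str], pattern: str) -> bool:
    """True when every name fully matches the regex (assumed matching mode)."""
    return all(re.fullmatch(pattern, name) is not None for name in names)


def compute_exit_code_sketch(
    failed_names: list[str], allow_failed_pattern: Optional[str] = None
) -> int:
    if not failed_names:
        return 0
    # Failures are tolerated only if all of them are covered by the pattern.
    if allow_failed_pattern is not None and all_match_pattern(
        failed_names, allow_failed_pattern
    ):
        return 0
    return 1


print(compute_exit_code_sketch(["moe_out"], allow_failed_pattern=r"moe_.*"))  # 0
print(compute_exit_code_sketch(["moe_out", "logits"], allow_failed_pattern=r"moe_.*"))  # 1
```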
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and valuable refactoring of the dump comparator utility. Key improvements include a more structured logging system with log_sink that differentiates between errors and informational messages, clearer naming for output record types, and enhanced dimension string parsing to support data parallelism group aliases. The introduction of presets for command-line arguments is a great addition for usability. The overall changes enhance the maintainability, flexibility, and robustness of the comparator. I've identified two instances of duplicated code that should be addressed, which I've commented on directly.
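To make the error/info split concrete, here is a minimal sketch of a sink that collects both kinds of log and separates them on demand. The class names carry a `Sketch` suffix because the fields and methods are assumptions; the real `LogSink`, `ErrorLog`, and `InfoLog` definitions are not quoted in this review.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ErrorLogSketch:
    message: str


@dataclass
class InfoLogSketch:
    message: str


LogSketch = Union[ErrorLogSketch, InfoLogSketch]


class LogSinkSketch:
    """Collects logs in order and splits them by severity (hypothetical API)."""

    def __init__(self) -> None:
        self._logs: list[LogSketch] = []

    def add(self, log: LogSketch) -> None:
        self._logs.append(log)

    def split(self) -> tuple[list[ErrorLogSketch], list[InfoLogSketch]]:
        errors = [log for log in self._logs if isinstance(log, ErrorLogSketch)]
        infos = [log for log in self._logs if isinstance(log, InfoLogSketch)]
        return errors, infos


sink = LogSinkSketch()
sink.add(ErrorLogSketch("shape mismatch between replicas"))
sink.add(InfoLogSketch("token aligner disabled"))
errors, infos = sink.split()
print(len(errors), len(infos))  # 1 1
```

Keeping the two kinds apart lets a record be marked as failed on errors alone, while informational messages are still reported.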

Comment on lines +157 to +175
def _maybe_load_tokenizer(args: argparse.Namespace) -> Any:
    tokenizer_path: Optional[str] = getattr(args, "tokenizer", None)

    if tokenizer_path is None:
        for directory in [Path(args.baseline_path), Path(args.target_path)]:
            tokenizer_path = read_tokenizer_path(directory)
            if tokenizer_path is not None:
                break

    if tokenizer_path is None:
        return None

    try:
        from transformers import AutoTokenizer

        return AutoTokenizer.from_pretrained(tokenizer_path)
    except Exception:
        return None

high

This function _maybe_load_tokenizer is a duplicate of the one defined just above on lines 137-155. This duplication should be removed to avoid confusion and potential issues.

Comment on lines 904 to 906
class TestReduceSum:
    def test_basic_tp2_reduce(self) -> None:
        """2 partial tensors sum to full tensor."""
high

The test class TestReduceSum is defined twice in this file. The first definition is on line 649. This second definition appears to be a copy-paste error and should be removed.
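Duplicates like the two flagged in this review are easy to miss because Python silently rebinds a name when a second top-level `def` or `class` with the same name is executed, so the earlier definition is simply replaced. A small check built on the standard `ast` module can surface them; `find_duplicate_definitions` is a helper written for this example, not something in the repository.

```python
import ast
from collections import Counter


def find_duplicate_definitions(source: str) -> list[str]:
    """Return names of top-level functions/classes defined more than once."""
    tree = ast.parse(source)
    names = [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


sample = '''
class TestReduceSum:
    def test_a(self): ...

def helper(): ...

class TestReduceSum:
    def test_b(self): ...
'''
print(find_duplicate_definitions(sample))  # ['TestReduceSum']
```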

@fzyzcjy fzyzcjy merged commit 15e83ee into sgl-project:main Mar 2, 2026
52 of 61 checks passed
Kangyan-Zhou pushed a commit to Kangyan-Zhou/sglang that referenced this pull request Mar 4, 2026
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026