
Support presets and arbitrary skipping keys in dump comparator #19676

Merged
fzyzcjy merged 461 commits into sgl-project:main from fzyzcjy:ac8420/29
Mar 2, 2026

Conversation

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Mar 2, 2026

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Follow existing pattern: viz output is silent on success. The user
already knows the output path since they specified it via CLI.
Tensor already has named dims applied by _apply_dim_names_from_meta
before _resolve_seq_dim is called, so the metadata fallback path was
redundant.
- calc_per_token_rel_diff tests → test_utils.py (matches utils.py)
- _compute_diff seq_dim tests → test_comparator.py (matches comparator.py)
- delete standalone test_per_token_diff.py
- split TestManuallyVerify into TestBundleDetailsManualVerify and
  TestPerTokenHeatmapManualVerify
- remove duplicate test_visualize_per_token_with_dims from test_entrypoint.py
New params type for element-wise summation of partial tensors
(reduction=partial). No dim_name needed since output shape == input shape.
Replace NotImplementedError with ReduceSumParams() when a dim spec
has reduction=partial. The grouping logic is unchanged — sum is
commutative so rank order within a group doesn't matter.
Strip named dims, stack, sum along dim=0, then restore names.
Same pattern as _thd_concat for name handling.
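
The partial-reduction steps described above (strip names, stack, sum along the stacking axis, restore names) can be sketched without any tensor library. This is a minimal illustration, not the PR's implementation: `NamedShard` and `reduce_sum` are hypothetical names, and a flat tuple of integers stands in for a named tensor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NamedShard:
    """Hypothetical stand-in for a named 1-D tensor: raw values plus dim names."""

    values: Tuple[int, ...]
    names: Tuple[Optional[str], ...]


def reduce_sum(shards: List[NamedShard]) -> NamedShard:
    """Element-wise sum of partial shards; output shape equals input shape."""
    if not shards:
        raise ValueError("need at least one shard")

    # Remember the names, then work on the raw values only ("strip named dims").
    names = shards[0].names
    raw = [shard.values for shard in shards]
    if any(len(values) != len(raw[0]) for values in raw):
        raise ValueError("partial shards must share one shape")

    # zip(*raw) plays the role of stacking; sum() reduces along that new axis.
    summed = tuple(sum(column) for column in zip(*raw))

    # Restore the names on the reduced result.
    return NamedShard(values=summed, names=names)
```

Because addition is commutative, `reduce_sum([a, b])` and `reduce_sum([b, a])` give the same result, which is why rank order within a group does not need to be normalized first.
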
Replace test_reduction_not_implemented_raises with tests verifying
ReduceSumParams is returned for partial reduction dims: basic TP=2,
TP=4, mixed CP+TP, and scrambled rank order.
Tests: basic TP=2 reduce, TP=4 reduce, multi-axis concat+reduce,
scrambled rank order, and named dimension preservation.
Three scenarios: both-sides TP partial, single-rank vs TP partial,
and mixed CP concat + TP partial reduction.
Verify that pipeline parallelism works correctly with the existing
bundle matching logic: different world ranks with same layer_id match,
non-layer tensors match across PP ranks, different PP sizes form
correct bundles, and unmatched layer_ids don't create false matches.
After match_bundles, emit a GeneralWarning listing tensors that exist
in target but have no matching baseline. Useful for diagnosing cases
where some PP ranks didn't dump correctly.
When loading auxiliary tensors (input_ids, positions, etc.), check
embedded meta for pp_rank. If items span multiple PP ranks, keep
only pp_rank=0 and emit a warning. This guards against the unlikely
case where non-first PP stages accidentally dump aux tensors.
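
A rough sketch of the guard described above, assuming each loaded item exposes its embedded metadata as a plain dict with an optional `pp_rank` key. The function name and the item layout are hypothetical; the comparator's own warning sink is replaced here by the standard `warnings` module.

```python
import warnings
from typing import Any, Dict, List


def keep_first_pp_stage(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only pp_rank=0 items when aux dumps span several PP ranks."""
    # Items without a pp_rank entry are treated as rank 0 (an assumption).
    pp_ranks = {item.get("pp_rank", 0) for item in items}
    if len(pp_ranks) <= 1:
        # Everything comes from a single PP stage: nothing to filter.
        return items

    warnings.warn(
        f"aux tensors found on PP ranks {sorted(pp_ranks)}; keeping pp_rank=0 only"
    )
    return [item for item in items if item.get("pp_rank", 0) == 0]
```
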
Replace 5 redundant PP tests with one that covers the actual net-new
behavior: nullable layer_id column participates in grouping alongside
rows without layer_id.
Cover: unclosed quotes, empty quotes, equals inside multi-token quotes,
consecutive quoted values, user real-world e2e scenario (shell not
expanding quotes), and bare token without equals.
Flatten the main loop with continue-first pattern and introduce
_QuoteParseResult NamedTuple to make the two-state machine
(normal vs collecting) explicit and easier to follow.
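
The actual parser is not shown on this page, so the following is only a guess at the shape of such a two-state machine: `key="value with spaces"` arguments that arrive split into several tokens are re-joined. `QuoteParseResult`, its fields, and `join_quoted_tokens` are hypothetical names, and the exact quoting rules are assumptions.

```python
from typing import List, NamedTuple, Optional


class QuoteParseResult(NamedTuple):
    """Outcome of inspecting one token while in the normal state."""

    finished: Optional[str]  # complete token, if it was self-contained
    pending: Optional[str]  # partial text, if the token opened a quote


def _start_token(token: str) -> QuoteParseResult:
    key, sep, value = token.partition("=")
    if not sep or not value.startswith('"'):
        # Bare token without '=' or an unquoted value: pass through unchanged.
        return QuoteParseResult(finished=token, pending=None)
    body = value[1:]
    if body.endswith('"'):
        # Quote opens and closes in the same token (covers empty quotes).
        return QuoteParseResult(finished=f"{key}={body[:-1]}", pending=None)
    return QuoteParseResult(finished=None, pending=f"{key}={body}")


def join_quoted_tokens(tokens: List[str]) -> List[str]:
    out: List[str] = []
    pending: Optional[str] = None
    for token in tokens:
        if pending is not None:  # collecting state
            if token.endswith('"'):
                out.append(f"{pending} {token[:-1]}")
                pending = None
            else:
                pending = f"{pending} {token}"
            continue
        result = _start_token(token)  # normal state
        if result.pending is not None:
            pending = result.pending
            continue
        out.append(result.finished)
    if pending is not None:
        raise ValueError(f"unclosed quote in: {pending!r}")
    return out
```

Handling the collecting state first and leaving it with `continue` keeps the loop body flat, which is the readability point the commit message makes.
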
…ns with spaces"

This reverts commit d7dd57c.

# Conflicts:
#	test/registered/debug_utils/test_dumper.py
extra_imports (e.g. dumper import) must execute before user preamble
code so that preamble can reference the imported names.
- Add dp_rank field to PositionalSeqId (default=0 for backward compat)
- Pass dp_rank through aux_plugins compute_step_aux to distinguish
  sequences from different DP ranks
- Make aux_loader DP-aware: detect dp_rank from metadata, group rows
  by dp_rank, independently unshard each group, then concatenate
  step_auxs across DP groups
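
The DP-aware loading flow in the last bullet can be sketched as group, unshard, concatenate. This is an illustration under assumptions: rows are plain dicts, `unshard` is a caller-supplied callable standing in for the real unsharder, and `load_aux_dp_aware` is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def load_aux_dp_aware(
    rows: List[Row], unshard: Callable[[List[Row]], List[Any]]
) -> List[Any]:
    """Group rows by dp_rank, unshard each group alone, then concatenate."""
    groups: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        # Rows without dp_rank fall back to 0, mirroring the stated default.
        groups[row.get("dp_rank", 0)].append(row)

    step_auxs: List[Any] = []
    for dp_rank in sorted(groups):
        # Each DP group holds different sequences, so groups are never mixed.
        step_auxs.extend(unshard(groups[dp_rank]))
    return step_auxs
```
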
fzyzcjy added 23 commits March 1, 2026 09:45
# Conflicts:
#	python/sglang/srt/debug_utils/comparator/aligner/token_aligner/executor.py
- Add missing `warning_sink` import in unsharder test_executor.py
- Remove unused `parse_dim_names` import in test_dims.py
- Add CI registry to source_patcher test files
- Fix axis_swapper.py to use parse_dims().dims instead of parse_dims()
# Conflicts:
#	test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
# Conflicts:
#	python/sglang/srt/debug_utils/comparator/bundle_comparator.py
#	python/sglang/srt/debug_utils/comparator/dims.py
#	python/sglang/srt/debug_utils/comparator/entrypoint.py
#	test/registered/debug_utils/comparator/aligner/reorderer/test_planner.py
#	test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
#	test/registered/debug_utils/comparator/test_dims.py
#	test/registered/debug_utils/comparator/test_entrypoint.py
#	test/registered/debug_utils/comparator/test_model_validation.py
#	test/registered/debug_utils/source_patcher/test_code_patcher.py
#	test/registered/debug_utils/source_patcher/test_dumper_integration.py
#	test/registered/debug_utils/source_patcher/test_source_editor.py
# Conflicts:
#	test/registered/debug_utils/comparator/test_entrypoint.py
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the dump comparator's flexibility and usability by introducing CLI presets and arbitrary metadata key skipping. It refactors dimension parsing to better support complex parallelization schemes, particularly for data parallel group aliases, and improves the clarity of comparison output records. These changes streamline the process of debugging and comparing tensor dumps from distributed training environments.

Highlights

  • Enhanced Comparator CLI with Presets: Introduced a new --preset command-line argument to the dump comparator, allowing users to quickly apply predefined configurations for grouping-skip-keys and token-aligner settings. This simplifies common comparison scenarios like 'raw', 'sglang_dev', and 'sglang_megatron'.
  • Flexible Metadata Key Skipping: Added a --grouping-skip-keys argument, enabling users to specify arbitrary metadata keys to ignore when grouping bundles for comparison. This provides greater control over how tensors are matched and compared across different ranks or steps.
  • Refactored Dimension Parsing: The parse_dims function was refactored to return a DimsSpec object, which now explicitly separates dimension specifications from declaration parts (e.g., # dp:=moe_dp). This improves clarity and allows for more robust handling of data parallel group aliases.
  • Support for Data Parallel Group Aliases: The data parallel utility functions (dp_utils.py) were updated to recognize and utilize data parallel group aliases specified in the dims metadata (e.g., moe_dp). This ensures correct filtering of non-empty data parallel ranks in complex parallelization setups.
  • Improved Comparison Record Types: Renamed core comparison record types for better semantic clarity (e.g., ComparisonRecord to TensorComparisonRecord, SkipRecord to SkipComparisonRecord, NonTensorRecord to NonTensorComparisonRecord). A new RecordLocation class was also introduced to associate comparison results with specific steps.
  • Expanded Parallel Info Collection: The dumper now collects additional parallelization information, specifically moe_dp_rank, moe_dp_size, attn_cp_rank, and attn_cp_size, providing more comprehensive metadata for debugging distributed models.
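
To make the preset behavior in the first two highlights concrete, here is a minimal sketch of argv-level preset expansion. The preset contents, the choice of default preset, and the `--token-aligner` and `--baseline-path` flag spellings are assumptions for illustration; only the preset names and `--grouping-skip-keys` come from the summary above.

```python
from typing import Dict, List

# Hypothetical preset contents; the real values are not shown on this page.
PRESETS: Dict[str, List[str]] = {
    "raw": ["--grouping-skip-keys", "rank"],
    "sglang_dev": [
        "--grouping-skip-keys", "rank", "step",
        "--token-aligner", "concat_steps",
    ],
}
DEFAULT_PRESET = "sglang_dev"  # assumed default


def expand_preset(argv: List[str]) -> List[str]:
    """Replace each `--preset NAME` pair with that preset's argument list."""
    if "--preset" not in argv and "--grouping-skip-keys" not in argv:
        # Neither a preset nor explicit skip keys: fall back to the default.
        argv = ["--preset", DEFAULT_PRESET, *argv]

    out: List[str] = []
    i = 0
    while i < len(argv):
        if argv[i] != "--preset":
            out.append(argv[i])
            i += 1
            continue
        if i + 1 >= len(argv):
            raise ValueError("--preset requires a name")
        name = argv[i + 1]
        if name not in PRESETS:
            raise ValueError(f"unknown preset: {name!r}")
        out.extend(PRESETS[name])
        i += 2
    return out
```

Expanding before the argument parser runs means the parser itself needs no knowledge of presets. This sketch does not handle the `--preset=NAME` spelling.
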

Changelog
  • python/sglang/srt/debug_utils/comparator/__init__.py
    • Updated import and usage of ComparisonRecord to TensorComparisonRecord.
  • python/sglang/srt/debug_utils/comparator/aligner/axis_aligner.py
    • Modified calls to parse_dims to access the .dims attribute of the returned DimsSpec object.
  • python/sglang/srt/debug_utils/comparator/aligner/axis_swapper.py
    • Modified calls to parse_dims to access the .dims attribute of the returned DimsSpec object.
  • python/sglang/srt/debug_utils/comparator/aligner/entrypoint/planner.py
    • Modified calls to parse_dims to access the .dims attribute of the returned DimsSpec object.
  • python/sglang/srt/debug_utils/comparator/aligner/token_aligner/entrypoint.py
    • Changed the default token_aligner mode from 'concat_steps' to None.
    • Removed the conditional check for args.grouping != 'logical' when determining token aligner mode.
  • python/sglang/srt/debug_utils/comparator/bundle_comparator.py
    • Imported parse_dims for new dimension parsing logic.
    • Renamed ComparisonRecord, NonTensorRecord, and SkipRecord to TensorComparisonRecord, NonTensorComparisonRecord, and SkipComparisonRecord respectively.
    • Updated function return type hints to use the new comparison record names.
    • Reordered the data parallel (DP) filter and dimension override steps to ensure correct processing.
    • Added a new helper function _extract_dp_alias_from_items to retrieve data parallel group aliases from metadata.
    • Modified filter_to_non_empty_dp_rank calls to pass the extracted dp_group_alias.
  • python/sglang/srt/debug_utils/comparator/dims.py
    • Added a new DimsSpec class to encapsulate parsed dimension specifications and data parallel group aliases.
    • Refactored parse_dims to return a DimsSpec object, handling the separation of dimension tokens from declaration parts (e.g., # dp:=moe_dp).
    • Modified resolve_dim_names to extract dimension names from the DimsSpec.dims attribute.
    • Introduced _DP_ALIAS_PATTERN and _extract_dp_group_alias to parse data parallel group aliases from dimension strings.
  • python/sglang/srt/debug_utils/comparator/dp_utils.py
    • Updated filter_to_non_empty_dp_rank to accept an optional dp_group_alias argument.
    • Modified _extract_dp_info to dynamically determine rank and size field names based on the provided dp_group_alias.
  • python/sglang/srt/debug_utils/comparator/entrypoint.py
    • Imported PRESETS and expand_preset for CLI argument handling.
    • Defined _DEFAULT_SKIP_KEYS for baseline metadata keys to always skip.
    • Changed main function to use parse_args for argument parsing, which includes preset expansion.
    • Updated _compute_skip_keys to incorporate args.grouping_skip_keys and remove conditional logic based on grouping and token_aligner.
    • Updated type hints for comparison records to use the new TensorComparisonRecord, SkipComparisonRecord, and NonTensorComparisonRecord.
    • Added RecordLocation to comparison records to store step information.
    • Refactored argument parsing into a new parse_args function that applies preset expansion.
  • python/sglang/srt/debug_utils/comparator/output_types.py
    • Introduced RecordLocation class to store location details like step.
    • Added _BaseComparisonRecord as a base class for comparison records, including RecordLocation and methods for formatting location prefixes/suffixes.
    • Renamed SkipRecord to SkipComparisonRecord and made it inherit from _BaseComparisonRecord.
    • Renamed ComparisonRecord to TensorComparisonRecord and made it inherit from _BaseComparisonRecord.
    • Renamed NonTensorRecord to NonTensorComparisonRecord and made it inherit from _BaseComparisonRecord.
    • Updated text formatting for SkipComparisonRecord and NonTensorComparisonRecord to include location suffixes.
    • Updated the AllOutputRecords type alias to reflect the new record names.
  • python/sglang/srt/debug_utils/comparator/per_token_visualizer.py
    • Updated type hints from ComparisonRecord to TensorComparisonRecord.
  • python/sglang/srt/debug_utils/comparator/preset.py
    • Added new file preset.py.
    • Defined PRESETS dictionary containing predefined argument lists for common comparison scenarios.
    • Implemented expand_preset function to replace --preset arguments with their corresponding argument lists, and to apply a default preset if no explicit preset or grouping-skip-keys are provided.
  • python/sglang/srt/debug_utils/dumper.py
    • Added moe_dp_rank and moe_dp_size to the collected parallel information.
    • Added attn_cp_rank and attn_cp_size to the collected parallel information.
  • test/registered/debug_utils/comparator/aligner/reorderer/test_planner.py
    • Updated calls to parse_dims to access the .dims attribute of the returned DimsSpec object in test cases.
  • test/registered/debug_utils/comparator/aligner/unsharder/test_executor.py
    • Updated calls to parse_dims to access the .dims attribute of the returned DimsSpec object in test cases.
    • Added new test cases for TestReduceSum covering basic TP reduction, multi-axis concat then reduce, scrambled ranks, and preservation of named dimensions.
    • Added new test cases for TestThdCpConcat covering single sequence, multi-sequence, hidden dimension, and leading batch dimension scenarios.
  • test/registered/debug_utils/comparator/aligner/unsharder/test_planner.py
    • Updated calls to parse_dims to access the .dims attribute of the returned DimsSpec object in test cases.
  • test/registered/debug_utils/comparator/tensor_comparator/test_types.py
    • Updated type hints and instantiations from ComparisonRecord to TensorComparisonRecord and SkipRecord to SkipComparisonRecord.
  • test/registered/debug_utils/comparator/test_dims.py
    • Imported DimsSpec.
    • Removed parse_dim_names as it's no longer needed.
    • Updated test cases for parse_dims to expect a DimsSpec object and access its .dims attribute.
    • Added new test class TestParseDimsWithHash to verify parsing of declaration sections (e.g., # dp:=moe_dp).
    • Added new test class TestDpGroupAlias to specifically test extraction of data parallel group aliases.
    • Added TestResolveDimNamesWithHash to confirm hash declarations are stripped when resolving dim names.
  • test/registered/debug_utils/comparator/test_dp_utils.py
    • Added new test classes TestExtractDpInfoWithAlias and TestFilterToNonEmptyDpRankWithAlias to verify the functionality of data parallel group aliases in dp_utils.
  • test/registered/debug_utils/comparator/test_entrypoint.py
    • Imported parse_args from entrypoint.
    • Updated type hints for comparison records to use the new TensorComparisonRecord, SkipComparisonRecord, and NonTensorComparisonRecord.
    • Replaced _make_args with _make_argv to align with the new argument parsing structure.
    • Added new test class TestEntrypointPerStepMode to verify per-step comparison behavior, including step field population and TP unshard integration.
    • Added new test class TestEntrypointDpGroupAlias to test end-to-end functionality of data parallel group aliases, including override mechanisms.
  • test/registered/debug_utils/comparator/test_manually_verify.py
    • Updated type hints from ComparisonRecord to TensorComparisonRecord.
  • test/registered/debug_utils/comparator/test_model_validation.py
    • Updated type hints and instantiations from ComparisonRecord to TensorComparisonRecord, SkipRecord to SkipComparisonRecord, and NonTensorRecord to NonTensorComparisonRecord.
  • test/registered/debug_utils/comparator/test_per_token_visualizer.py
    • Updated type hints from ComparisonRecord to TensorComparisonRecord.
  • test/registered/debug_utils/comparator/test_preset.py
    • Added new file test_preset.py.
    • Added test cases for TestExpandPreset to verify explicit preset expansion, default preset application, and prevention of default preset when skip keys are explicit.
  • test/registered/debug_utils/source_patcher/test_code_patcher.py
    • Updated register_cpu_ci call to include nightly=True.
  • test/registered/debug_utils/source_patcher/test_dumper_integration.py
    • Updated register_cpu_ci call to include nightly=True.
  • test/registered/debug_utils/source_patcher/test_source_editor.py
    • Updated register_cpu_ci call to include nightly=True.
  • test/registered/debug_utils/test_dumper.py
    • Added assertions to verify the presence of new parallel info keys (moe_dp_rank, moe_dp_size, attn_cp_rank, attn_cp_size) in the sglang_parallel_info metadata.
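
The `dims.py` and `dp_utils.py` entries above describe splitting a dims string into dimension tokens and a `#` declaration section, then deriving rank and size field names from the alias. A minimal sketch of that idea follows; the names carry a `Sketch` suffix because they are hypothetical, the regex is an assumption, and dimension tokens are kept as raw strings rather than fully parsed.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed shape of an alias declaration such as "dp:=moe_dp".
_DP_ALIAS = re.compile(r"^dp:=(\w+)$")


@dataclass(frozen=True)
class DimsSpecSketch:
    dims: Tuple[str, ...]
    dp_group_alias: Optional[str] = None


def parse_dims_sketch(text: str) -> DimsSpecSketch:
    """Split 'dims # declarations' and pull out an optional DP group alias."""
    dims_part, _, decl_part = text.partition("#")
    alias: Optional[str] = None
    for decl in decl_part.split():
        match = _DP_ALIAS.match(decl)
        if match:
            alias = match.group(1)
    return DimsSpecSketch(dims=tuple(dims_part.split()), dp_group_alias=alias)


def dp_field_names(alias: Optional[str]) -> Tuple[str, str]:
    """Metadata keys to read for DP rank and size, given an optional alias."""
    prefix = alias or "dp"
    return f"{prefix}_rank", f"{prefix}_size"
```

With an alias of `moe_dp`, the lookup targets `moe_dp_rank` and `moe_dp_size`, which are among the keys the dumper now records.
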

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces presets and arbitrary skipping keys to the dump comparator, which is a significant improvement for usability. The refactoring of argument parsing and the introduction of DimsSpec and dp_group_alias are well-executed. The code changes are extensive but seem correct and are accompanied by corresponding test updates.

I found two minor issues related to code duplication, one in the application code and one in the tests, which should be addressed. Overall, this is a solid contribution that enhances the debugging capabilities.

Comment on lines +173 to +191
def _maybe_load_tokenizer(args: argparse.Namespace) -> Any:
    tokenizer_path: Optional[str] = getattr(args, "tokenizer", None)

    if tokenizer_path is None:
        for directory in [Path(args.baseline_path), Path(args.target_path)]:
            tokenizer_path = read_tokenizer_path(directory)
            if tokenizer_path is not None:
                break

    if tokenizer_path is None:
        return None

    try:
        from transformers import AutoTokenizer

        return AutoTokenizer.from_pretrained(tokenizer_path)
    except Exception:
        return None

Contributor


high

This function _maybe_load_tokenizer appears to be a duplicate of the function with the same name defined just above (lines 153-171 in the full file). This is likely a copy-paste error. Please remove the duplicated definition to avoid confusion and potential bugs.

Comment on lines +116 to +121
    def test_squeeze_dim(self) -> None:
        assert parse_dim("1") == DimSpec(name="1")

    def test_squeeze_dim_rejects_modifiers(self) -> None:
        with pytest.raises(ValueError, match="Invalid dim token"):
            parse_dim("1(tp)")
Contributor


medium

The tests test_squeeze_dim and test_squeeze_dim_rejects_modifiers are duplicated. This block seems to be a copy of the one just above it (lines 109-114 in the full file). Please remove these duplicated tests.

@fzyzcjy fzyzcjy merged commit ec44bc8 into sgl-project:main Mar 2, 2026
57 of 66 checks passed
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026