fix: resolve tensor file overwrite between target and draft models #21694

Open
yaya159456 wants to merge 8 commits into sgl-project:main from yaya159456:fix_eagle_fileoverwrite

@yaya159456 yaya159456 commented Mar 30, 2026

In Eagle mode, tensor files generated by the target and draft models share the same output paths, leading to unintended overwriting.

This change separates the file outputs to prevent conflicts.
This change only affects file output paths and does not impact model computation or performance.
Fixes #21721

Motivation

Fix the issue where tensor dump files from the target and draft models overwrite each other in Eagle mode due to sharing the same output directory.

Modifications

  • Add role-based subdirectories for tensor dump outputs in Eagle mode:
    • Append "draft" for draft workers
    • Append "target" for target workers
  • Keep the original behavior unchanged for non-Eagle modes
  • Ensure compatibility with existing debug tensor dump configurations
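
A minimal sketch of the path selection listed above, not the code in this PR. The helper name `resolve_dump_folder` and the parameters `is_eagle` and `is_draft_worker` are hypothetical; only the "draft" and "target" subdirectory names come from the description.

```python
import os
from typing import Optional


def resolve_dump_folder(
    base_folder: Optional[str],
    is_eagle: bool,
    is_draft_worker: bool,
) -> Optional[str]:
    """Return the folder tensor dumps should go to (hypothetical helper)."""
    if base_folder is None:
        # Tensor dumping is disabled, so there is nothing to resolve.
        return None
    if not is_eagle:
        # Non-Eagle modes keep the configured folder unchanged.
        return base_folder
    # In Eagle mode, each worker role gets its own subdirectory.
    role = "draft" if is_draft_worker else "target"
    return os.path.join(base_folder, role)
```

On a POSIX system with a configured folder of `/tmp/dump`, this sketch would send draft-worker dumps to `/tmp/dump/draft` and target-worker dumps to `/tmp/dump/target`, so the two sets of files no longer share a path.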

Accuracy Tests

N/A (no impact on model outputs)

Speed Tests and Profiling

N/A (no impact on performance)

Checklist

  • Format your code according to "Format code with pre-commit".
  • Add unit tests (not applicable, as this change only affects file output paths).
  • Update documentation (not applicable, no user-facing changes).
  • Provide accuracy and speed benchmark results (not applicable, no impact on model outputs or performance).
  • Follow the SGLang code style guidance.

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the model runner to support separate debug tensor dump directories for draft and target workers when the Eagle speculative algorithm is active. The review feedback suggests refactoring the implementation to reduce code duplication by determining the specific dump path before calling the hook registration function.
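
The refactor suggested here can be sketched as follows. This is an illustration under stated assumptions, not the reviewed code: `register_dump_hook` is a stand-in for the real hook registration function, whose name and signature are not shown in this thread, and `setup_tensor_dump` with its parameters is likewise hypothetical.

```python
import os
from typing import List, Tuple

# Stand-in for the real hook registration function (assumption):
# it only records the arguments it was called with.
registered: List[Tuple[str, str]] = []


def register_dump_hook(model_name: str, dump_folder: str) -> None:
    registered.append((model_name, dump_folder))


def setup_tensor_dump(
    model_name: str,
    base_folder: str,
    is_eagle: bool,
    is_draft_worker: bool,
) -> None:
    # Decide the dump path first ...
    dump_folder = base_folder
    if is_eagle:
        role = "draft" if is_draft_worker else "target"
        dump_folder = os.path.join(base_folder, role)
    # ... so that registration happens at a single call site
    # instead of once per branch.
    register_dump_hook(model_name, dump_folder)
```

Keeping the branching limited to the path means any later change to the registration call only has to be made in one place.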

…share the same output paths, leading to unintended overwriting.

This change separates the file outputs to prevent conflicts. Specifically, it appends distinct subdirectories ("draft" and "target") to the configured dump path based on the worker role.

This change only affects file output paths and does not impact model computation or performance.
@yaya159456
Author

/tag-and-rerun-ci

@yaya159456
Author

Hi, CI is currently blocked due to missing run-ci label.
Could someone help trigger it? Thanks!

@yaya159456
Author

@google-gemini review

@yaya159456 yaya159456 force-pushed the fix_eagle_fileoverwrite branch from 7ccb83a to 761039d on March 31, 2026 02:40
@kpham-sgl
Collaborator

/tag-and-rerun-ci

    f"mem usage={self.weight_load_mem_usage:.2f} GB."
)
if self.server_args.debug_tensor_dump_output_folder is not None:
    dump_folder = self.server_args.debug_tensor_dump_output_folder
Collaborator

nit: it may be worth documenting this behavior in the help docstring for self.server_args.debug_tensor_dump_output_folder

Author

Good suggestion, thanks!
I've added clarification to the help docstring regarding the behavior in Eagle mode.
In addition, I submitted a PR to update the documentation in sgl-project.github.io to make this behavior more explicit:
sgl-project/sgl-project.github.io#26

@kpham-sgl kpham-sgl self-assigned this Mar 31, 2026
@yaya159456
Author

Hi, just a gentle follow-up on this PR 😊
It’s linked to the issue above. When you have a chance, could you please take a look?
Thanks a lot for your time and help!

Collaborator

@kpham-sgl kpham-sgl left a comment

LGTM. @yaya159456, can you wait for CI to pass before merging?

@yaya159456
Author

Thanks for the review! 🙏

I have a quick question — when you mentioned waiting for CI to pass, are those CI checks visible to me on this PR? I do see some failing checks, so I’m not sure if I’m expected to fix them, or if I should just wait for CI to complete and further reviews before merging.

Also, I’m not entirely sure if there’s anything else I should take care of before this PR can be closed.

Thanks for the clarification!

@kpham-sgl
Collaborator

No action needed from your side. Just hang tight until CI is fully green (we are working on it).

Development

Successfully merging this pull request may close these issues.

[Bug] In Eagle mode, tensor files generated by the target and draft models share the same output paths, leading to unintended overwriting.