fix: resolve tensor file overwrite between target and draft models#21694
fix: resolve tensor file overwrite between target and draft models#21694yaya159456 wants to merge 8 commits intosgl-project:mainfrom
Conversation
In Eagle mode, tensor files generated by the target and draft models share the same output paths, leading to unintended overwriting. This change separates the file outputs to prevent conflicts.
There was a problem hiding this comment.
Code Review
This pull request updates the model runner to support separate debug tensor dump directories for draft and target workers when the Eagle speculative algorithm is active. The review feedback suggests refactoring the implementation to reduce code duplication by determining the specific dump path before calling the hook registration function.
…share the same output paths, leading to unintended overwriting.
This change separates the file outputs to prevent conflicts. Specifically, it appends distinct subdirectories ("draft" and "target") to the configured dump path based on the worker role.
This change only affects file output paths and does not impact model computation or performance.
|
/tag-and-rerun-ci |
|
Hi, CI is currently blocked due to missing |
|
@google-gemini review |
7ccb83a to
761039d
Compare
|
/tag-and-rerun-ci |
| f"mem usage={self.weight_load_mem_usage:.2f} GB." | ||
| ) | ||
| if self.server_args.debug_tensor_dump_output_folder is not None: | ||
| dump_folder = self.server_args.debug_tensor_dump_output_folder |
There was a problem hiding this comment.
nit: possibly worth to document this behavior in self.server_args.debug_tensor_dump_output_folder help docstring
There was a problem hiding this comment.
Good suggestion, thanks!
I've added clarification to the help docstring regarding the behavior in Eagle mode.
In addition, I submitted a PR to update the documentation in sgl-project.github.io to make this behavior more explicit:
sgl-project/sgl-project.github.io#26
|
Hi, just a gentle follow-up on this PR 😊 |
kpham-sgl
left a comment
There was a problem hiding this comment.
LGTM. Can you wait for CI to pass before merging in @yaya159456 ?
|
Thanks for the review! 🙏 I have a quick question — when you mentioned waiting for CI to pass, are those CI checks visible to me on this PR? I do see some failing checks, so I’m not sure if I’m expected to fix them, or if I should just wait for CI to complete and further reviews before merging. Also, I’m not entirely sure if there’s anything else I should take care of before this PR can be closed. Thanks for the clarification! |
No action needed from your side. Just hang tight until the CI is fully green (we are working on it) |
In Eagle mode, tensor files generated by the target and draft models share the same output paths, leading to unintended overwriting.
This change separates the file outputs to prevent conflicts.
This change only affects file output paths and does not impact model computation or performance.
Fixes #21721
Motivation
Fix the issue where tensor dump files from the target and draft models overwrite each other in Eagle mode due to sharing the same output directory.
Modifications
Accuracy Tests
N/A (no impact on model outputs)
Speed Tests and Profiling
N/A (no impact on performance)
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci