[MoE Refactor] Add sequence parallel tests to test_moe_layer.py #41299
Signed-off-by: Bill Nell <bnell@redhat.com>
Code Review
This pull request introduces support for sequence parallelism (SP) in MoE layer tests. Key changes include the addition of an sp_wrapper to handle sequence chunking and gathering, updates to test configurations and validation logic to support SP, and adjustments to weight chunking behavior when SP is enabled. Review feedback highlights a potential issue in the calculation of num_tokens_across_dp for sequence parallel configurations, which could lead to incorrect communication or out-of-bounds access. Additionally, there is a concern regarding the significant increase in FP8 quantization tolerance, which may reduce the effectiveness of correctness checks.
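For readers unfamiliar with the chunk-and-gather pattern mentioned in the review, here is a minimal single-process sketch. It is not the PR's `sp_wrapper`: the function names (`sp_chunk`, `sp_gather`, `tokens_per_rank`) are hypothetical, plain lists stand in for tensors, and padding the last chunk is an assumption about how uneven token counts might be handled. It also shows why the per-rank token counts matter: they must sum to the real token count, otherwise the gather would read padding or drop tokens.

```python
# Single-process sketch of sequence-parallel chunking and gathering.
# All names are hypothetical; this is not the PR's sp_wrapper.

def sp_chunk(tokens, sp_size, pad_value=0):
    """Split tokens into sp_size equal chunks, padding the tail if needed."""
    chunk_len = -(-len(tokens) // sp_size)  # ceiling division
    padded = tokens + [pad_value] * (chunk_len * sp_size - len(tokens))
    return [padded[r * chunk_len:(r + 1) * chunk_len] for r in range(sp_size)]


def sp_gather(chunks, num_tokens):
    """Concatenate per-rank chunks in rank order and drop the padding."""
    flat = [t for chunk in chunks for t in chunk]
    return flat[:num_tokens]


def tokens_per_rank(num_tokens, sp_size):
    """Number of real (unpadded) tokens held by each rank."""
    chunk_len = -(-num_tokens // sp_size)
    return [max(0, min(chunk_len, num_tokens - r * chunk_len))
            for r in range(sp_size)]


tokens = list(range(10))                      # 10 tokens, not divisible by 4
chunks = sp_chunk(tokens, sp_size=4)          # last chunk carries 2 pad values
counts = tokens_per_rank(len(tokens), sp_size=4)
restored = sp_gather(chunks, len(tokens))     # round trip recovers the input
```

With 10 tokens on 4 ranks, the last rank holds one real token and two padding values, so a count vector that assumed 3 tokens on every rank would overstate the total by 2.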
Hi @bnellnm, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
yzong-rh left a comment
Also passed for me on 4xB200 on 0.20.2rc1.dev56+gf65376125.precompiled after merging with commit f6537612521df1156d8ac13f524e427aca908322 on main.
      atol, rtol = 3.5e-2, 3.5e-2
  elif quantization in ("fp8", "fp8_blocked", "modelopt_fp8"):
-     atol, rtol = 6e-2, 6e-2
+     atol, rtol = 6.5e-2, 6.5e-2
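To make the effect of this tolerance change concrete, here is a small sketch of the usual combined absolute/relative closeness rule, `|actual - expected| <= atol + rtol * |expected|`, which is the rule `torch.testing.assert_close` and `numpy.isclose` apply. Whether the test uses exactly this comparison is an assumption, and the `actual` value below is illustrative, not a measured result.

```python
def is_close(actual, expected, atol, rtol):
    """Elementwise rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


expected = 1.0
actual = 1.125  # illustrative error of 0.125, not a measured value

# Old tolerance: allowed error is 0.06 + 0.06 * 1.0 = 0.12
old_ok = is_close(actual, expected, atol=6e-2, rtol=6e-2)

# New tolerance: allowed error is 0.065 + 0.065 * 1.0 = 0.13
new_ok = is_close(actual, expected, atol=6.5e-2, rtol=6.5e-2)
```

For an expected value near 1.0, the allowed error grows from 0.12 to 0.13, so a deviation of 0.125 fails under the old bound and passes under the new one.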
Head branch was pushed to by a user without write access
…-project#41299) Signed-off-by: Bill Nell <bnell@redhat.com>
Purpose
Add sequence parallel tests to test_moe_layer.py. Unfortunately, they require 4 GPUs, so I'm not sure we can run them in automation. At the very least, they can be run locally when validating changes to MoE.
cc @yzong-rh
Test Plan
Ran by hand on 4 GPUs.
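One way tests like these can stay in the suite without breaking smaller machines is to skip them when fewer than 4 GPUs are visible. The sketch below is an assumption, not what this PR does: `available_gpus`, `requires_gpus`, and the test class are hypothetical names, and the standard-library `unittest` is used as a stand-in for whatever gating the project's own test infrastructure provides.

```python
import unittest


def available_gpus():
    """Return the visible CUDA device count, or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def requires_gpus(n):
    """Skip the decorated test unless at least n GPUs are visible."""
    return unittest.skipUnless(available_gpus() >= n, f"needs {n} GPUs")


class SequenceParallelSmokeTest(unittest.TestCase):  # hypothetical test case
    @requires_gpus(4)
    def test_placeholder(self):
        # A real test would launch the 4-rank sequence parallel run here.
        self.assertGreaterEqual(available_gpus(), 4)
```

On a machine with fewer than 4 GPUs the test is reported as skipped rather than failed, so the suite stays green in automation while still running in full on a local 4-GPU box.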
Test Result