[mcore] moonlight (small model with deepseekv3 arch) #1284
vermouth1992 merged 32 commits into verl-project:main
Conversation
Thank you for the great work. Could you provide a dpsk-v3 script in the example for testing?
Just added a script.
@ISEEKYAN I ran dpsk-v3 with this patch, but it failed. Does this patch support training for dpsk-v3?
What is the error info? It is supposed to support dpsk-v3 except for the MTP part for now. I will update the PR to support the full version.
Thank you for your reply. Python Dependencies and Versions: Error Log: @ISEEKYAN could you please take a look at this error and help me fix it?
The mcore version was a pre-release 0.12 when I initially made this PR, and there was a known bug at
It might be because of a TE version mismatch; I used
@ISEEKYAN |
Please check the latest curve at the url; the ppo_kl curve is now much better than the April version.
OOM error: |
@ISEEKYAN Thank you for your reply. Regarding the increasing ppo_kl curve and the declining reward curve, may I ask if you have any new findings or identified bugs?
With the latest updates, ppo_kl increases very slowly and the reward no longer declines.
    dtype=dtype,
    use_cpu_initialization=False,
    add_bias_linear=False,
    attention_backend=AttnBackend.fused,
Is `AttnBackend.fused` specific to the DeepSeek-V3 model? Is `AttnBackend.auto` enough here?
When fed `AttnBackend.auto`, TE would use flash, but flash attention is not implemented for MLA. The error info is:
ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.
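A minimal self-contained sketch of why `auto` fails here: `auto` resolves to flash attention first, and flash has no MLA kernel, so every backend ends up disabled. The enum and `resolve_backend` function below are illustrative stand-ins, not Megatron-core or TE internals.

```python
from enum import Enum

class AttnBackend(Enum):
    auto = "auto"
    flash = "flash"
    fused = "fused"

def resolve_backend(requested: AttnBackend, use_mla: bool) -> AttnBackend:
    """Resolve 'auto' the way the preference order would (flash first),
    then reject combinations with no kernel support."""
    if requested is AttnBackend.auto:
        requested = AttnBackend.flash  # auto prefers flash attention
    if use_mla and requested is AttnBackend.flash:
        # flash attention has no MLA kernel, so all backends are disabled
        raise ValueError(
            "No dot product attention backend is available for the provided inputs."
        )
    return requested
```

Forcing `AttnBackend.fused` skips the unsupported flash path entirely, which is why the config sets it explicitly.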
    data.max_response_length=512 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    +data.trust_remote_code=True \
Should `trust_remote_code` be set in the model section, per `ppo_megatron_trainer.yaml`?
Oh, data preprocessing might need this as well. Please ignore this if I misunderstand.
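A hedged sketch of how a `+data.trust_remote_code=True` override could be threaded into data preprocessing; `tokenizer_kwargs_from_data_cfg` is a hypothetical helper, and in practice the flag would be forwarded to loaders such as `AutoTokenizer.from_pretrained(..., trust_remote_code=True)`.

```python
def tokenizer_kwargs_from_data_cfg(data_cfg: dict) -> dict:
    """Collect tokenizer kwargs from the data config section.

    Defaults to False so a model's custom modeling code is only
    executed when the user explicitly opts in.
    """
    return {"trust_remote_code": bool(data_cfg.get("trust_remote_code", False))}
```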
Is this another topic beyond supporting Moonlight? Would it be better to submit another small PR for the config file modification?
Agree. We should track any change in the config and keep it consistent.
    # limitations under the License.

    # there is some bug in mcore 0.12, so we need to patch it
    # 1. `get_query_key_value_tensors` in `multi_latent_attention.py` behaves incorrectly when packed_seq_params is not None
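The workaround follows the usual monkey-patch pattern: replace the buggy function on the imported module before any model is built. The sketch below is self-contained and illustrative; `multi_latent_attention` here is a stand-in namespace, not the real mcore module, and the string return values only model the buggy-vs-fixed behavior.

```python
import types

# Stand-in for the mcore module being patched; the real target would be
# megatron.core's multi_latent_attention module.
multi_latent_attention = types.SimpleNamespace()

def buggy_get_query_key_value_tensors(hidden_states, packed_seq_params=None):
    # Models the mcore 0.12 behavior: packed_seq_params is silently ignored.
    return "unpacked"

multi_latent_attention.get_query_key_value_tensors = buggy_get_query_key_value_tensors

def patched_get_query_key_value_tensors(hidden_states, packed_seq_params=None):
    # Fixed version that actually honors packed_seq_params.
    return "packed" if packed_seq_params is not None else "unpacked"

# Apply the patch once, at import time, before any model is constructed.
multi_latent_attention.get_query_key_value_tensors = patched_get_query_key_value_tensors
```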
Just curious whether there is an issue open in the Megatron-LM repo so that we can track the patch fix accordingly.
It is supposed to be fixed in 0.13.
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Support training with DeepSeek-V3 671B; support MTP on top of #1284. It is now functionally ready for 671B, but still lacks large-scale practice.

> Add one-line overview of what this PR aims to achieve or accomplish.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.
Achieves 74.3 on GSM8K, while Moonlight reported 77.4; still investigating the performance diff.

