feat: GRPO example for Qwen3 32b context length=128k #957
Conversation
Are there any theoretical differences between this and doing TP8+CP4+actckpt on DTensor, in which case DTensor couldn't work?
Needs profiling to see why. BTW, DTensor can run with seqlen=64k.
wangshangsam
left a comment
Actually, by existing convention, the hardware setup should come before the runtime config, and ideally the max supported context length should come with the model & task, e.g., grpo-math-qwen3-32b-128k-4n8g-megatrontp8cp4.yaml.
Could you also rename the other grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml to grpo-math-qwen3-30ba3b-32k-4n8g-megatrontp4ep8.yaml? We forgot about this aspect when we reviewed #918 last week (cc @pjin-nvidia).
Let's get this PR merged first. The key objective of the issue this PR addresses is to unblock the Nemotron folks, so if the MCore path already works, let's leave it at that. We can dig into the memory profile for very long sequence lengths when addressing #885.
Btw, since you are adding a new recipe, you need to add a bash script that tests this new recipe in the CI too (there's a unit test that checks whether you have done that). Refer to #926.
Fixed. But overall the recipe naming still lacks consistency; it'd be better if we had a naming convention.
Naming convention: https://github.com/NVIDIA-NeMo/RL/tree/main/tests/test_suites#naming |
Signed-off-by: Qidong Su <[email protected]>
```bash
# Only run metrics if the target step is reached
if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
    uv run tests/check_metrics.py $JSON_METRICS \
        'mean(data["train/token_mult_prob_error"]) < 1.1' \
```
Wouldn't this fail according to the wandb link you shared?
I think with longer generations, we'll probably run into outliers that skew the mean. This run covers so few steps that it's probably hard to write something robust. Maybe:
```diff
- 'mean(data["train/token_mult_prob_error"]) < 1.1' \
+ 'min(data["train/token_mult_prob_error"]) < 1.1' \
```
and then add a comment above explaining why min is used for this particular test.
I made an issue to track this: #1039
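To make the outlier-skew point concrete, here is a minimal sketch with hypothetical values (not taken from the actual run):

```python
# Hypothetical per-step values of train/token_mult_prob_error.
# A single outlier step is enough to push the mean past the 1.1
# threshold, while the min remains robust to it.
errors = [1.02, 1.01, 1.03, 2.50, 1.02]

mean_err = sum(errors) / len(errors)  # 1.316 -> fails 'mean(...) < 1.1'
min_err = min(errors)                 # 1.01  -> passes 'min(...) < 1.1'

print(f"mean={mean_err:.3f}, min={min_err:.3f}")
```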
```yaml
rope_scaling:
  type: "yarn"
  factor: 4.0
  original_max_position_embeddings: 32768
```
Which models need this? Is it possible to handle it in code?
In the past when we had stuff like this, the consensus was to handle it in code, since we knew which model types needed it, e.g. fdb565c.
Regardless of whether this is handled in code or YAML, it should probably have an entry in model-quirks.md so we have documentation.
It is used by Qwen3 with long context lengths: https://huggingface.co/Qwen/Qwen3-32B#processing-long-texts.
Since it's an optional configuration that can be changed by the user, I tend to put it explicitly in the YAML.
Will update model-quirks.md to reflect this.
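For reference, the arithmetic behind the 128k figure: YaRN rescales the context window multiplicatively, so factor 4.0 × original_max_position_embeddings 32768 = 131072 tokens (~128k), matching the recipe name. A minimal sketch of how this block might sit in the recipe YAML, assuming it is forwarded to the HF config via the `model_kwargs` plumbing in the diff below:

```yaml
# Hypothetical nesting; the exact location under the policy config is an assumption.
model_kwargs:
  rope_scaling:
    type: "yarn"                             # YaRN RoPE rescaling
    factor: 4.0                              # 4.0 * 32768 = 131072 tokens (~128k)
    original_max_position_embeddings: 32768  # Qwen3-32B's native window
```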
Signed-off-by: Qidong Su <[email protected]>
This PR is not merged because it depends on the support of …
```diff
 self.dp_size = worker_sharding_annotations.get_axis_size("data_parallel")
 self.megatron_bridge = AutoBridge.from_hf_pretrained(
-    hf_model_name, trust_remote_code=True
+    hf_model_name, trust_remote_code=True, **self.cfg.get("model_kwargs", {})
 )
```
I think the MCore path cannot parse/handle rope_scaling.type="yarn". Even if you pass these arguments, the call might succeed, but your model arch is still plain RoPE, which seems confusing.
Maybe we should error out in the MCore path if we see rope_scaling.type == "yarn", until the necessary support is added.
@yaoyu-33 to confirm whether the Qwen and Llama bridges in mbridge can parse and handle this field. I believe this is only supported in the DeepSeek model type.
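If the MCore path were to fail fast, a minimal sketch of such a guard might look like this (the function name, placement, and config access are assumptions, not the actual worker code):

```python
def assert_rope_scaling_supported(model_kwargs: dict) -> None:
    """Raise early if an unsupported YaRN rope_scaling config reaches the MCore path.

    Hypothetical guard; the real worker wiring may differ.
    """
    rope_scaling = (model_kwargs or {}).get("rope_scaling") or {}
    if rope_scaling.get("type") == "yarn":
        raise NotImplementedError(
            "rope_scaling.type='yarn' is not yet supported on the MCore path; "
            "the bridge would silently fall back to plain RoPE. "
            "Use the DTensor path or drop rope_scaling from model_kwargs."
        )
```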
What does this PR do?
Addresses #883.
Issues
Closes #883.
Usage
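A hedged sketch of launching this recipe, following the repo's existing GRPO examples (the entry point and final config filename are assumptions based on the naming discussion above):

```bash
# Launch the 128k-context GRPO recipe on 4 nodes x 8 GPUs.
# Config name follows the convention agreed in the review; assumed entry point.
uv run examples/run_grpo_math.py \
    --config examples/configs/recipes/llm/grpo-math-qwen3-32b-128k-4n8g-megatrontp8cp4.yaml
```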
Before your PR is "Ready for review"
Pre checks:
Additional Information