
Conversation

@soodoshll commented Aug 20, 2025

What does this PR do?

Address #883.

  • Qwen3 32B with a 128k context is runnable using TP8+CP4+activation checkpointing (actckpt) on 4x8 H100 nodes. Example run: https://wandb.ai/nvidia/nemo-rl/runs/5qhgqcqf?nw=nwuserqidongs
  • Rollout is extremely slow in this case, so I use a rather low batch size in this recipe.
  • I use YaRN to extend the context length to 128k in vLLM. Not 100% sure whether Megatron is handling this part correctly.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@wangshangsam

Are there any theoretical differences between this and doing TP8+CP4+actckpt on DTensor, given that DTensor couldn't work in this case?

@soodoshll

Are there any theoretical differences between this and doing TP8+CP4+actckpt on DTensor, given that DTensor couldn't work in this case?

Needs profiling to see why. BTW, DTensor can run with seqlen=64k.

@wangshangsam left a comment


Actually, by existing convention, the hardware setup should come before the runtime config, and ideally the max supported context length should come with the model & task, e.g., grpo-math-qwen3-32b-128k-4n8g-megatrontp8cp4.yaml.

Could you also rename the other grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml into grpo-math-qwen3-30ba3b-32k-4n8g-megatrontp4ep8.yaml? We forgot about this aspect when we reviewed #918 last week (cc @pjin-nvidia)

@wangshangsam

Are there any theoretical differences between this and doing TP8+CP4+actckpt on DTensor, given that DTensor couldn't work in this case?

need a profiling to see why. btw, dtensor can run with seqlen=64k

Let's get this PR merged first. The key objective of the issue this PR addresses is to unblock the Nemotron folks, so if the MCore path already works, let's leave it at that. We can dig into the memory profile for very long seq lens when addressing #885.

@wangshangsam

BTW, since you are adding a new recipe, you need to add a bash script that tests this new recipe in the CI too (there's a unit test that checks whether you have done that). Refer to #926, tests/test_suites/llm, and tests/test_suites/nightly for examples.

@soodoshll

Actually, by existing convention, the hardware setup should come before the runtime config, and ideally the max supported context length should come with the model & task, e.g., grpo-math-qwen3-32b-128k-4n8g-megatrontp8cp4.yaml.

Could you also rename the other grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml into grpo-math-qwen3-30ba3b-32k-4n8g-megatrontp4ep8.yaml? We forgot about this aspect when we reviewed #918 last week (cc @pjin-nvidia)

Fixed. But overall the recipe naming still lacks consistency; it'd be better if we had a naming convention.

@wangshangsam

Actually, by existing convention, the hardware setup should come before the runtime config, and ideally the max supported context length should come with the model & task, e.g., grpo-math-qwen3-32b-128k-4n8g-megatrontp8cp4.yaml.
Could you also rename the other grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml into grpo-math-qwen3-30ba3b-32k-4n8g-megatrontp4ep8.yaml? We forgot about this aspect when we reviewed #918 last week (cc @pjin-nvidia)

Fixed. But overall the recipe naming still lacks consistency; it'd be better if we had a naming convention.

Naming convention: https://github.com/NVIDIA-NeMo/RL/tree/main/tests/test_suites#naming
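To illustrate that convention, here is a small hypothetical helper (not part of the repo; the pattern is inferred from the two recipe names discussed in this thread, so treat the field order as an assumption and defer to the linked doc):

```python
# Hypothetical helper illustrating the recipe naming convention discussed
# above. Not NeMo-RL code; the pattern is inferred from the two example
# recipe names in this thread.
def make_recipe_name(algo, task, model, ctx, nodes, gpus, backend, parallelism):
    """Build '{algo}-{task}-{model}-{ctx}-{N}n{G}g-{backend}{parallelism}.yaml'."""
    # parallelism is an ordered list of (dim, size) pairs, e.g. [("tp", 8), ("cp", 4)]
    par = "".join(f"{dim}{size}" for dim, size in parallelism)
    return f"{algo}-{task}-{model}-{ctx}-{nodes}n{gpus}g-{backend}{par}.yaml"

print(make_recipe_name("grpo", "math", "qwen3-32b", "128k", 4, 8,
                       "megatron", [("tp", 8), ("cp", 4)]))
# grpo-math-qwen3-32b-128k-4n8g-megatrontp8cp4.yaml
```

The same helper reproduces the renamed MoE recipe with `[("tp", 4), ("ep", 8)]`.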

@github-actions github-actions bot added documentation Improvements or additions to documentation CI Relating to CI labels Aug 28, 2025
@github-actions github-actions bot removed documentation Improvements or additions to documentation CI Relating to CI labels Aug 28, 2025
Signed-off-by: Qidong Su <[email protected]>
@soodoshll soodoshll requested a review from wangshangsam August 28, 2025 21:06
@wangshangsam previously approved these changes Aug 28, 2025
@wangshangsam wangshangsam added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 28, 2025
Signed-off-by: Qidong Su <[email protected]>
@wangshangsam wangshangsam added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 30, 2025
@wangshangsam wangshangsam added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Sep 2, 2025
@wangshangsam previously approved these changes Sep 2, 2025
# Only run metrics if the target step is reached
if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
uv run tests/check_metrics.py $JSON_METRICS \
'mean(data["train/token_mult_prob_error"]) < 1.1' \


wouldn't this fail according to the wandb link you shared?

[image: W&B plot of train/token_mult_prob_error]

I think with longer generations we'll probably run into outliers that skew the mean. This run covers so few steps that it's probably hard to write something robust. Maybe:

Suggested change:
- 'mean(data["train/token_mult_prob_error"]) < 1.1' \
+ 'min(data["train/token_mult_prob_error"]) < 1.1' \

and then add a comment above explaining why min is used for this particular test

I made an issue to track this: #1039
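The rationale behind the suggestion above can be shown with a toy example (values made up for illustration): a single outlier step pushes the mean over the 1.1 threshold even when most steps are healthy, while the min stays robust.

```python
# Toy illustration of why `min` is more robust than `mean` for this check
# on short runs: one outlier step skews the mean past the 1.1 threshold.
# The values below are invented for illustration, not taken from the run.
from statistics import mean

token_mult_prob_error = [1.01, 1.02, 1.03, 5.0]  # one outlier step

print(mean(token_mult_prob_error) < 1.1)  # False: mean is ~2.02, skewed by the outlier
print(min(token_mult_prob_error) < 1.1)   # True: at least one healthy step exists
```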

Comment on lines +36 to +39
rope_scaling:
type: "yarn"
factor: 4.0
original_max_position_embeddings: 32768


Which models need this? Is it possible to handle this in code?

In the past when we had stuff like this, the consensus was to handle it in code since we knew which model types needed it, e.g., fdb565c.

Regardless of whether this is handled in code or YAML, it should probably have an entry in model-quirks.md so we have documentation.



It is used by Qwen3 with long context lengths: https://huggingface.co/Qwen/Qwen3-32B#processing-long-texts.

Since it's an optional configuration that can be changed by the user, I tend to put it explicitly in YAML.

Will update model-quirks.md to reflect this.
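For context, the `factor: 4.0` in the YAML snippet above follows directly from the ratio of the target context to the model's native window, which is how YaRN's scaling factor is chosen per the Qwen3 long-text docs. A quick arithmetic check (illustration only, not NeMo-RL code):

```python
# Sanity check of the YaRN factor used in the rope_scaling config above:
# factor = target context length / native max position embeddings.
# Arithmetic illustration only; not NeMo-RL code.
original_max_position_embeddings = 32768   # Qwen3-32B native window
target_context_length = 128 * 1024        # 128k tokens

factor = target_context_length / original_max_position_embeddings
print(factor)  # 4.0, matching the YAML above
```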

Signed-off-by: Qidong Su <[email protected]>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 3, 2025
@soodoshll

This PR is not merged yet because it depends on YaRN support in the Megatron backend being ready.
Related MR in Megatron-LM: #3854

@wangshangsam wangshangsam linked an issue Sep 4, 2025 that may be closed by this pull request
self.dp_size = worker_sharding_annotations.get_axis_size("data_parallel")
self.megatron_bridge = AutoBridge.from_hf_pretrained(
-    hf_model_name, trust_remote_code=True
+    hf_model_name, trust_remote_code=True, **self.cfg.get("model_kwargs", {})


I think the mcore path cannot parse/handle rope_scaling.type="yarn". Even if you pass these arguments, this might succeed, but your model arch is still plain RoPE, which seems confusing.



Maybe we should error out in the mcore path if we see rope_scaling.type == "yarn" until the necessary support is added.

@yaoyu-33 to confirm whether the Qwen and Llama bridges in mbridge can parse and handle this field. I believe this is only supported in the DeepSeek model type.
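The fail-fast guard suggested in this review might look like the following sketch. The function name and `cfg` layout are hypothetical (the dict mirrors the YAML snippet in this PR); this is not the repo's actual validation code.

```python
# Hypothetical sketch of the guard suggested above: fail fast when the
# Megatron path sees a YaRN rope_scaling config it cannot yet honor,
# instead of silently falling back to plain RoPE. Names are illustrative.
def check_rope_scaling_supported(cfg: dict, backend: str) -> None:
    rope_scaling = cfg.get("rope_scaling") or {}
    if backend == "megatron" and rope_scaling.get("type") == "yarn":
        raise ValueError(
            "rope_scaling.type='yarn' is not yet supported on the Megatron "
            "path; the model would silently run with plain RoPE."
        )

cfg = {"rope_scaling": {"type": "yarn", "factor": 4.0,
                        "original_max_position_embeddings": 32768}}
check_rope_scaling_supported(cfg, backend="vllm")  # OK: vLLM handles YaRN
```

With `backend="megatron"` the same config raises, surfacing the unsupported combination early rather than at training time.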


Labels

CI:L1 Run doctests, unit tests, and functional tests
documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable GRPO Qwen 3 32B with 128k context length

4 participants