
test: add nanov3 prefill decode test #2141

Merged
yuki-97 merged 2 commits into hemil/automodel-transformers-v5 from zhiyul/add-nanov3-prefill-decode-test
Mar 24, 2026


@ZhiyuLi-Nvidia
Contributor

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia commented Mar 23, 2026

What does this PR do?

Add a nanov3 prefill/decode test.

vLLM prefill and decode should produce consistent logprobs. They did not in vllm<0.17.0, and this was fixed in vllm==0.17.0. This test guards vLLM prefill/decode logprob consistency.

Issues

The inconsistency was reported in #2100.

It is fixed after the vLLM version bump.

Background:
I expected that bumping vLLM to 0.17.0 would resolve the issue.
Some findings are as follows:
With the old vLLM version, vLLM is self-conflicting while Megatron is good:

vLLM decode  vs vLLM prefill:     TME = 322.490753   DIVERGED   <== vLLM is self-conflicting
vLLM decode  vs Megatron prefill: TME = 313.127106   DIVERGED   <== what we saw
vLLM prefill vs Megatron prefill: TME = 1.030557     HEALTHY    <== Megatron is good and aligns well with vLLM

vLLM prefill here means passing all prompt and generated tokens to vLLM and letting it compute the logprobs; it is similar to a single forward pass in training.
After bumping vLLM to 0.17.0 using the container /lustre/fsw/portfolios/coreai/users/terryk/enroot-images/gitlab-master.nvidia.com/terryk/images/nemo-rl:hemil-automodel-transformers-v5-9db945aa4.squashfs:

  vLLM decode  vs vLLM prefill:     TME = 1.032336
  vLLM decode  vs Megatron prefill: TME = 1.032111
  vLLM prefill vs Megatron prefill: TME = 1.031372

All 64 healthy.

I think the root cause is related to the prefill/decode kernels with Mamba and the KV cache, which were fixed by the vLLM bump.
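
For readers reproducing the comparison above, here is a minimal sketch of how such a number could be computed. The PR does not define TME; this assumes it is a token multiplicative error, i.e. the mean over tokens of exp(|logprob_a - logprob_b|), which equals 1.0 when two backends agree exactly. The function names, the 1.05 threshold, and the toy logprobs are all hypothetical and are not taken from the test added in this PR.

```python
import math
from typing import Sequence


def token_mult_error(logprobs_a: Sequence[float], logprobs_b: Sequence[float]) -> float:
    """Mean over tokens of exp(|a - b|); 1.0 means the logprobs match exactly.

    Assumed definition of TME, not confirmed by the PR.
    """
    if len(logprobs_a) != len(logprobs_b):
        raise ValueError("logprob sequences must cover the same tokens")
    if not logprobs_a:
        raise ValueError("need at least one token")
    total = sum(math.exp(abs(a - b)) for a, b in zip(logprobs_a, logprobs_b))
    return total / len(logprobs_a)


def classify(tme: float, threshold: float = 1.05) -> str:
    """Label a comparison; the threshold is a hypothetical choice."""
    return "HEALTHY" if tme <= threshold else "DIVERGED"


# Toy per-token logprobs for the same generated tokens (not data from the PR).
decode_logprobs = [-0.10, -1.20, -0.55]   # e.g. collected while decoding
prefill_logprobs = [-0.12, -1.17, -0.50]  # e.g. recomputed in one prefill pass

tme = token_mult_error(decode_logprobs, prefill_logprobs)
print(f"decode vs prefill: TME = {tme:.6f}  {classify(tme)}")
```

Under this assumed definition, values near 1.03 correspond to small per-token logprob differences, while values in the hundreds mean at least some tokens disagree by several nats.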

Usage

See tests.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • [x] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia requested review from a team and terrykong as code owners March 23, 2026 12:49
@copy-pr-bot

copy-pr-bot bot commented Mar 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added documentation Improvements or additions to documentation CI Relating to CI labels Mar 23, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/add-nanov3-prefill-decode-test branch from 61485e3 to 30f2253 Compare March 23, 2026 12:57
@github-actions github-actions bot removed the CI Relating to CI label Mar 23, 2026
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia changed the title Zhiyul/add nanov3 prefill decode test test: add nanov3 prefill decode test Mar 23, 2026
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia added CI:L1 Run doctests, unit tests, and functional tests CI Relating to CI and removed CI:L1 Run doctests, unit tests, and functional tests CI Relating to CI labels Mar 23, 2026
@ZhiyuLi-Nvidia
Contributor Author

/ok to test 30f2253

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia
Contributor Author

/ok to test d255ec9

@terrykong
Collaborator

terrykong commented Mar 23, 2026

approved, but let's see where the v5 PR is. if it's almost done by the time this CI finishes, let's just do this in a follow-up PR so we don't slow down that PR. check with @yuki-97 on her preference for merging this one

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Mar 23, 2026
@yuki-97
Contributor

yuki-97 commented Mar 24, 2026

thanks @ZhiyuLi-Nvidia @terrykong !
since CI passes and the PR is independent of the other code, it should be safe. I'll merge it directly into the bump PR.

@yuki-97 yuki-97 merged commit 1179183 into hemil/automodel-transformers-v5 Mar 24, 2026
47 of 49 checks passed
@yuki-97 yuki-97 deleted the zhiyul/add-nanov3-prefill-decode-test branch March 24, 2026 02:35
yuki-97 pushed a commit that referenced this pull request Mar 24, 2026
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
terrykong pushed a commit that referenced this pull request Mar 24, 2026
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

Labels

  • CI:L1: Run doctests, unit tests, and functional tests
  • documentation: Improvements or additions to documentation


3 participants