Fix test_kv_sharing_fast_prefill flakiness #22038
Conversation
Code Review (Gemini)
This pull request effectively addresses the flakiness of test_kv_sharing_fast_prefill by enforcing deterministic execution. The method of disabling multiprocessing and setting seeds is appropriate for achieving reproducibility.
My primary feedback concerns the new cleanup function. While it solves the immediate memory release issue, its approach of directly manipulating deep internal state of the LLM engine is fragile and creates a maintenance risk. I've added a comment recommending the creation of a proper teardown API on the LLM class as a more robust long-term solution. The current implementation is acceptable as a temporary measure to fix the test, but this technical debt should be addressed.
@DarkLight1337 is there a recommended way to release GPU memory between non-multiprocessing LLM instances? I tried the solutions in: but none of them worked, and the only thing that worked was a hacky solution (see the Gemini comment above) that involves deleting references to the model and KV caches from the model runner.
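For reference, the hacky cleanup looks roughly like the sketch below; the attribute path into the engine internals is hypothetical and version-dependent, and the point is only that the model and KV cache references must be dropped before garbage collection can release the GPU memory:

```python
import gc

import torch


def release_gpu_memory(llm) -> None:
    """Best-effort release of GPU memory held by an in-process LLM instance.

    Note: the attribute path below is a hypothetical example; the real path
    depends on the vLLM version and engine architecture.
    """
    model_runner = llm.llm_engine.model_executor.driver_worker.model_runner

    # Drop the references keeping the weights and KV caches alive so the
    # CUDA caching allocator can actually return the memory.
    del model_runner.model
    del model_runner.kv_caches

    gc.collect()
    torch.cuda.empty_cache()


# Usage: the caller must also drop its own reference before collecting.
# release_gpu_memory(llm)
# del llm
# gc.collect()
# torch.cuda.empty_cache()
```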
CI test failure seems unrelated: |
Can you merge from main to fix CI? |
Purpose
Fixes #22104
This unit test, added in #21590, checks for exact outputs from two `LLM` instances. However, due to inherent randomness in vLLM (e.g. scheduling), the test sometimes fails, making it flaky. Following vLLM's reproducibility guide, this PR makes it less flaky.

The key was to turn off multiprocessing with `VLLM_ENABLE_V1_MULTIPROCESSING=0` so that scheduling is deterministic. However, this requires further changes to release GPU memory for the second `LLM` instance, since unlike with multiprocessing there is no separate worker process that gets killed to free the memory.

Follow-up required: the `enforce_eager=False` test case had to be disabled, as deleting the model and KV cache references doesn't seem to remove all references, so the memory doesn't get freed.
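For illustration, a minimal sketch of the determinism setup described above (the model name, prompt, and sampling settings are placeholders, not the ones used in the actual test):

```python
import os

# Set before importing vllm so the V1 engine runs in-process instead of in a
# separate worker, which makes scheduling deterministic between instances.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams

# A fixed seed plus greedy (temperature=0) sampling keeps outputs reproducible
# across runs. "<model-under-test>" is a placeholder model name.
llm = LLM(model="<model-under-test>", seed=0, enforce_eager=True)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```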
Test Plan

Ran `pytest tests/v1/e2e/test_kv_sharing_fast_prefill.py::test_kv_sharing_fast_prefill` 10 times.
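A sketch of how the repetition can be scripted (assuming pytest is installed and this is run from the vLLM repo root):

```python
import subprocess

TEST = "tests/v1/e2e/test_kv_sharing_fast_prefill.py::test_kv_sharing_fast_prefill"

# Run each iteration in a fresh process so GPU state from one run cannot leak
# into the next, then count how many runs failed.
failures = sum(
    subprocess.run(["pytest", "-q", TEST]).returncode != 0 for _ in range(10)
)
print(f"{failures}/10 runs failed")
```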
Test Result

Before this PR, some runs would fail. After this PR, all 10 runs pass.