[Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing#37775
noooop merged 3 commits into vllm-project:main from
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request correctly addresses a critical non-determinism bug in the pooling logic, which was caused by aliasing a pinned memory buffer that could be mutated concurrently. The proposed fix of using .copy() to create a snapshot of the data is valid. I have provided one suggestion to further refine the fix by operating directly on the underlying torch tensor, which is a cleaner and more direct approach, avoiding unnecessary conversions between numpy and torch.
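The reviewer's suggested refinement can be sketched in isolation. This is a hypothetical reconstruction, not the actual vLLM code: a plain CPU tensor stands in for the pinned buffer (the real code uses `pin_memory=True`, which requires CUDA), and the variable names are illustrative.

```python
import torch

# Stand-in for the pinned buffer backing num_prompt_tokens
# (the real code pins it with torch.zeros(..., pin_memory=True)).
backing = torch.zeros(8, dtype=torch.int64)
num_prompt_tokens = backing.numpy()  # numpy view sharing the same storage
num_prompt_tokens[:4] = [3, 5, 2, 7]

# Buggy pattern: torch.from_numpy() returns a view into the shared buffer.
aliased = torch.from_numpy(num_prompt_tokens[:4])

# Suggested refinement: slice and clone the torch tensor directly,
# snapshotting the data without a numpy round-trip.
snapshot = backing[:4].clone()

num_prompt_tokens[0] = 99  # simulates a later batch mutation
print(aliased[0].item(), snapshot[0].item())  # 99 3
```

The `.clone()` variant and the `.copy()` variant are equivalent for correctness; the former just avoids converting between numpy and torch.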
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
I added a regression test that deterministically catches the aliasing bug. I don't think this falls under the question of "should we add a test for every regression": regressions in this mechanism are genuinely hard to find, and this is a core file, so its importance for integrity and correctness justifies the test.
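Such a regression test can be quite small. A minimal sketch (not the PR's actual test: `get_prompt_lens` is a hypothetical helper standing in for `get_pooling_metadata()`, and a plain array stands in for the pinned-backed view):

```python
import numpy as np
import torch

def get_prompt_lens(num_prompt_tokens: np.ndarray, num_reqs: int) -> torch.Tensor:
    # Fixed behavior: .copy() snapshots the slice before wrapping it in a tensor.
    return torch.from_numpy(num_prompt_tokens[:num_reqs].copy())

def test_prompt_lens_not_aliased():
    buf = np.zeros(8, dtype=np.int64)  # stands in for the pinned-backed view
    buf[:3] = [4, 6, 2]
    prompt_lens = get_prompt_lens(buf, 3)
    buf[:3] = 0  # simulate request additions / condensation reusing the buffer
    assert prompt_lens.tolist() == [4, 6, 2], "prompt_lens aliased the live buffer"

test_prompt_lens_not_aliased()
print("ok")
```

Mutating the buffer *after* fetching `prompt_lens` makes the test deterministic, unlike the original race, which only surfaced with unlucky async timing.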
Please confirm that Language Models Test (MTEB) has been fixed.
@noooop Rebased and added
Our scheduler is becoming increasingly async, and more and more race-condition-prone variables need to be taken care of. Maybe we should make these variables private, or flag them in their names, to prevent misuse.
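One way to express that convention, as a hypothetical sketch (class and method names invented here, not vLLM's actual API): keep the asynchronously mutated buffer private and only ever hand out copies.

```python
import numpy as np

class InputBatchSketch:
    def __init__(self, max_num_reqs: int) -> None:
        # Leading underscore flags the buffer as shared, asynchronously
        # mutated state: never hand out views into it.
        self._num_prompt_tokens = np.zeros(max_num_reqs, dtype=np.int64)
        self.num_reqs = 0

    def add_request(self, prompt_len: int) -> None:
        self._num_prompt_tokens[self.num_reqs] = prompt_len
        self.num_reqs += 1

    def prompt_lens_snapshot(self) -> np.ndarray:
        # Callers only ever receive an independent copy of the live slice,
        # so later batch mutations cannot reach already-issued metadata.
        return self._num_prompt_tokens[: self.num_reqs].copy()
```

This makes the safe path the default: aliasing the live buffer then requires deliberately reaching past the underscore.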
For this one I also created a PR in that library, but long term I think this is the right approach to avoid such issues. It's also a good opportunity to rethink whether some of our tests would actually catch failures like this.
Race condition issues are very difficult to detect, test, and debug. We should avoid misusing these variables at the design phase. Thanks to Andreas Karatzas for finding and fixing this issue.
`num_prompt_tokens` in `InputBatch` was changed from a plain `np.zeros()` array to a pinned-memory-backed numpy view (`torch.zeros(..., pin_memory=True).numpy()`). `get_pooling_metadata()` calls `torch.from_numpy(self.num_prompt_tokens[:self.num_reqs])`, which creates a tensor that shares the underlying pinned buffer rather than copying the data. The buffer can be mutated between when `prompt_lens` is created and when it's consumed by the pooling pipeline, causing non-deterministic scores across runs. The fix adds `.copy()` to the numpy slice so `prompt_lens` gets its own independent memory.

Bisected to commit e1d85e5 (#37303), which introduced the pinned tensor backing for `num_prompt_tokens`. The `torch.from_numpy()` call in `get_pooling_metadata()` returns a view into the pinned buffer rather than a copy. Subsequent batch operations (request additions, condensation) mutate the same pinned storage that `prompt_lens` references, creating a race with in-flight async CUDA operations.

Test plan

- `test_rerank_models_mteb[model_info0]` passes 5/5

cc @kenroche
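The described bug and the one-line fix can be reproduced in isolation. In this sketch a plain numpy array stands in for the pinned-backed view (pinning itself requires CUDA and is incidental to the aliasing):

```python
import numpy as np
import torch

num_prompt_tokens = np.zeros(8, dtype=np.int64)  # stands in for the pinned-backed view
num_prompt_tokens[:2] = [10, 20]
num_reqs = 2

# Before: torch.from_numpy() shares the underlying storage with the buffer.
prompt_lens_view = torch.from_numpy(num_prompt_tokens[:num_reqs])
# After: .copy() gives prompt_lens its own independent memory.
prompt_lens_copy = torch.from_numpy(num_prompt_tokens[:num_reqs].copy())

num_prompt_tokens[0] = 0  # a later request addition / condensation step
print(prompt_lens_view[0].item())  # 0  -- mutated underneath
print(prompt_lens_copy[0].item())  # 10 -- snapshot unaffected
```

In the real code the mutation races with in-flight async CUDA work, so the observed value, and hence the pooling score, depends on timing; the copy removes the shared storage entirely.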