Open
Labels: bug (Something isn't working)
Description
The test_umap_trustworthiness_on_batch_nnd test fails when run on multiple GPUs with the following error:
CUDA error encountered at: file=cuml/cpp/src/umap/simpl_set_embed/algo.cuh line=338: call='cudaPeekAtLastError()', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered
This invalidates the CUDA context and results in additional errors:
MemoryError: std::bad_alloc: CUDA error (failed to allocate ...
The issue happens specifically when do_snmg = True and num_clusters == 5.
Setting CUDA_VISIBLE_DEVICES=0 prevents the issue from appearing, which explains why the test does not fail in CI (which only has a single GPU).
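For anyone trying to compare the single-GPU and multi-GPU behavior, here is a minimal sketch of how the single-GPU run could be set up from Python. The helper name single_gpu_pytest_invocation is hypothetical, and selecting the test with pytest's -k option is an assumption about how the test is invoked. The sketch only builds the command and environment; it does not launch anything by itself.

```python
import os
import sys


def single_gpu_pytest_invocation(test_name, device="0"):
    """Build (command, env) to run one pytest test with a single visible GPU.

    Hypothetical helper for illustration; it does not start the process.
    """
    # Copy the current environment so the parent process is left untouched.
    env = dict(os.environ)
    # Expose only the chosen device to CUDA in the child process.
    env["CUDA_VISIBLE_DEVICES"] = device
    # Select the test by name with pytest's -k expression filter.
    cmd = [sys.executable, "-m", "pytest", "-k", test_name]
    return cmd, env


cmd, env = single_gpu_pytest_invocation("test_umap_trustworthiness_on_batch_nnd")
# To actually launch it: subprocess.run(cmd, env=env, check=False)
```

Dropping the CUDA_VISIBLE_DEVICES entry from env (or listing both devices, e.g. "0,1") should give the multi-GPU configuration in which the failure shows up.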
Further investigation revealed that the KNN step sets all the nearest-neighbor indices to -1, which causes the illegal memory accesses.
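A check along these lines could catch the bad KNN output before it reaches the embedding step. This is a sketch only: rows_with_invalid_neighbors is a hypothetical helper, not part of cuML, and it assumes the indices have already been copied to the host as a list of rows.

```python
def rows_with_invalid_neighbors(knn_indices, n_samples):
    """Return the row numbers whose neighbor list contains an out-of-range index.

    Hypothetical helper: knn_indices is assumed to be a host-side sequence of
    rows, each row holding the neighbor indices for one sample.
    """
    return [
        row
        for row, neighbors in enumerate(knn_indices)
        # A valid neighbor index must lie in [0, n_samples).
        if any(idx < 0 or idx >= n_samples for idx in neighbors)
    ]


# Toy data: the second row mimics the all -1 output described above.
example = [[1, 2], [-1, -1], [0, 3]]
bad_rows = rows_with_invalid_neighbors(example, n_samples=4)
```

Any row reported here would be used as an out-of-bounds offset if it were passed on unchecked, which is consistent with the cudaErrorIllegalAddress above.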
For context, the test ran on my workstation, which has 2 GPUs. Maybe running 5 clusters requires more?