[BUG] test_umap_trustworthiness_on_batch_nnd test failure with multiple GPUs #7210

@viclafargue

Description

The test_umap_trustworthiness_on_batch_nnd test fails when run on multiple GPUs with the following error:

CUDA error encountered at: file=cuml/cpp/src/umap/simpl_set_embed/algo.cuh line=338: call='cudaPeekAtLastError()', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered

This invalidates the CUDA context and results in additional errors:

MemoryError: std::bad_alloc: CUDA error (failed to allocate ...

The issue happens specifically when do_snmg = True and num_clusters == 5.
Setting CUDA_VISIBLE_DEVICES=0 prevents the issue from appearing, which explains why the test does not fail in CI (which only has a single GPU).
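
To compare the single-GPU and multi-GPU behavior from a script, something like the following could be used. This is only a sketch: `single_gpu_env` and `run_test_on_single_gpu` are hypothetical helper names, and the pytest invocation assumes it is run from a directory where the cuML tests can be collected.

```python
import os
import subprocess


def single_gpu_env(base_env=None, device="0"):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable when the process initializes the driver,
    # so it has to be set before the test process starts.
    env["CUDA_VISIBLE_DEVICES"] = device
    return env


def run_test_on_single_gpu():
    # Hypothetical invocation: selects the failing test by name with -k.
    cmd = ["pytest", "-k", "test_umap_trustworthiness_on_batch_nnd"]
    return subprocess.run(cmd, env=single_gpu_env(), check=False)
```

Running the same command without the modified environment (on a machine with 2 or more GPUs) should then reproduce the multi-GPU case.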

Further investigation revealed that the KNN step sets all the nearest-neighbor indices to -1, which causes the illegal memory accesses.
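
A check along these lines could catch the invalid indices before they reach the embedding step. It is a minimal sketch, not part of cuML: `invalid_neighbor_rows` is a hypothetical helper, and it assumes the KNN indices have been copied to the host as nested lists (for example via `.tolist()` on an array).

```python
def invalid_neighbor_rows(knn_indices, n_samples):
    """Return the row numbers whose neighbor list has an out-of-range index."""
    bad_rows = []
    for row_id, neighbors in enumerate(knn_indices):
        # A valid neighbor index must address an existing sample:
        # 0 <= idx < n_samples. A value of -1 fails this check.
        if any(idx < 0 or idx >= n_samples for idx in neighbors):
            bad_rows.append(row_id)
    return bad_rows


# Toy data: row 1 mimics the failure, with every neighbor set to -1.
toy_indices = [[1, 2], [-1, -1], [0, 1]]
print(invalid_neighbor_rows(toy_indices, n_samples=3))
```

If every row is reported, as described above, the problem lies in the KNN step itself rather than in a few isolated samples.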

For context, I ran the test on my workstation, which has 2 GPUs. Maybe running 5 clusters requires more?
