
[BUG] error in training examples/water/se_e3: failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE #1062


Closed
njzjz opened this issue Aug 29, 2021 · 13 comments
Assignees: njzjz
Labels: bug, reproduced (This bug has been reproduced by developers), upstream, wontfix

Comments


njzjz commented Aug 29, 2021

See #1061.

I can reproduce the error.

CPU works fine.

@njzjz njzjz added the bug label Aug 29, 2021

njzjz commented Aug 29, 2021

CUDA 10.1 works fine. Maybe a bug in CUDA 11.3?


njzjz commented Aug 30, 2021

Update: it's related to memory. Reducing `sel` to a proper value avoids the error. However, I haven't found the root cause yet.
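
For reference, a minimal sketch (not the official fix) of shrinking `sel` and `rcut` in the se_e3 descriptor of `examples/water/se_e3/input.json`; the numbers below are illustrative placeholders, not the values shipped with the example:

```python
import json

# Load the se_e3 training input and shrink the descriptor to reduce memory use.
with open("input.json") as f:
    config = json.load(f)

descriptor = config["model"]["descriptor"]
descriptor["sel"] = [20, 40]   # illustrative: fewer selected neighbors per type
descriptor["rcut"] = 4.0       # illustrative: a smaller cutoff also shrinks the neighbor list

with open("input_small.json", "w") as f:
    json.dump(config, f, indent=4)
```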

@njzjz njzjz added the reproduced This bug has been reproduced by developers label Aug 30, 2021

amcadmus commented Sep 1, 2021

The memory consumption of se3 is much larger than that of se2, so it is recommended to use a smaller rcut and sel for se3. A hybrid of se2 (standard rcut) and se3 (small rcut) is good practice.
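
For illustration, a rough sketch of such a hybrid `model.descriptor` section written as a Python dict; all cutoffs, `sel` values, and network sizes below are placeholders rather than recommended settings:

```python
# Placeholder hybrid descriptor: se_e2_a with the standard cutoff plus
# se_e3 with a smaller cutoff, as suggested above. All numbers are illustrative.
hybrid_descriptor = {
    "type": "hybrid",
    "list": [
        {
            "type": "se_e2_a",        # two-body embedding, standard rcut
            "sel": [46, 92],
            "rcut": 6.0,
            "rcut_smth": 0.5,
            "neuron": [25, 50, 100],
        },
        {
            "type": "se_e3",          # three-body embedding, small rcut
            "sel": [10, 20],
            "rcut": 4.0,
            "rcut_smth": 0.5,
            "neuron": [2, 4, 8],
        },
    ],
}
```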


njzjz commented Sep 1, 2021

> The memory consumption of se3 is much larger than that of se2, so it is recommended to use a smaller rcut and sel for se3. A hybrid of se2 (standard rcut) and se3 (small rcut) is good practice.

Maybe we can reduce rcut in examples.

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Sep 1, 2021
amcadmus pushed a commit that referenced this issue Sep 1, 2021
@Dankomaister

Hi!

I'm running parallel training using DeepMD 2.0.0 on a compute node with 8x NVIDIA A100 GPUs (A100-SXM4-40GB).
These GPUs have 40 GB of memory each. However, I can't seem to use more than around 3 GB of memory per GPU without running into the `failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE` error.

For the dataset I'm using, running with a batch size of 16 works fine and shows a memory utilization of around 3 GB per GPU.
But increasing the batch size to 17 causes the `failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE` error.
It doesn't seem reasonable that increasing the batch size from 16 to 17 would cause the memory usage to grow to more than 40 GB per GPU.

Especially since the memory usage increase with batch size is not that big:
- batch size = 5 uses ~1400 MB of memory per GPU
- batch size = 10 uses ~1930 MB of memory per GPU
- batch size = 15 uses ~2950 MB of memory per GPU
- batch size = 16 uses ~2950 MB of memory per GPU

Could this be related to something other than memory?

I attached some logfiles of the error and screenshots of the memory usage.
batch size = 17.txt
batch size = 20.txt

[Screenshots of per-GPU memory usage for batch sizes 5, 10, 15, and 16]

/Daniel

@Dankomaister

Hi,

Is this issue now solved in the new version (v2.0.2) of deepmd-kit?

/Daniel

@Dankomaister

Hi @njzjz,
It seems like this problem is related to the fact that TensorFlow v2.5 is incompatible with CUDA 11.3.
Is there any chance we could get a conda version of DeePMD compiled with the correct CUDA version (11.2)?
This problem is really hindering efficient usage of DeePMD on our cluster, so I would appreciate it if it could be solved :)

/Daniel


njzjz commented Oct 12, 2021

@Dankomaister

Hi,

OK, maybe I will try that!
Have you tested and confirmed that this problem is solved when using CUDA 11.2 together with DeePMD 2.0.2?

@njzjz njzjz self-assigned this Nov 1, 2021
@njzjz njzjz moved this from Todo to In Progress in Bugfixes for DeePMD-kit Dec 1, 2021
@njzjz njzjz moved this from In Progress to Done in Bugfixes for DeePMD-kit Dec 1, 2021
@njzjz njzjz moved this from Done to Todo in Bugfixes for DeePMD-kit Dec 1, 2021

njzjz commented Jan 16, 2022

Note: this may be a bug in cuBLAS, which we cannot fix ourselves.
See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-11.4.0
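
For anyone affected, a quick (unofficial) way to check which CUDA and cuDNN versions the installed TensorFlow build was compiled against; the builds hitting this issue were compiled against CUDA 11.3:

```python
import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow build was compiled against.
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"))
print("cuDNN:", build.get("cudnn_version"))
```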

@njzjz njzjz added wontfix and removed help wanted labels Jan 16, 2022
@njzjz njzjz closed this as completed Jan 16, 2022
Repository owner moved this from Todo to Done in Bugfixes for DeePMD-kit Jan 16, 2022

njzjz commented Mar 10, 2022

> OK, maybe I will try that! Have you tested and confirmed that this problem is solved when using CUDA 11.2 together with DeePMD 2.0.2?

As reported by tensorflow/tensorflow#54463 (comment), CUDA 11.2 also has this bug.

The next time I build TensorFlow, I will bump the CUDA version to 11.4.


njzjz commented May 22, 2022

The next release will bump the CUDA version from 11.3 to 11.6.


njzjz commented May 23, 2022

The v2.1.1 installer, built with CUDA 11.6, has been released.

@njzjz njzjz pinned this issue Dec 10, 2022
@njzjz njzjz unpinned this issue Jul 6, 2023
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Sep 21, 2023