
[BUG] error in training examples/water/se_e3: failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE #1062


Closed
njzjz opened this issue Aug 29, 2021 · 13 comments
Assignees: njzjz
Labels: bug, reproduced (This bug has been reproduced by developers), upstream, wontfix

Comments


njzjz commented Aug 29, 2021

See #1061.

I can reproduce the error.

CPU works fine.

@njzjz njzjz added the bug label Aug 29, 2021

njzjz commented Aug 29, 2021

CUDA 10.1 works fine. Maybe a bug in CUDA 11.3?


njzjz commented Aug 30, 2021

Update: it's related to memory. Reducing `sel` to a proper value avoids the error. However, I haven't found the root cause yet.
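
For reference, a minimal sketch (not the official fix) of shrinking `sel` and `rcut` in the se_e3 descriptor of `examples/water/se_e3/input.json`; the numbers below are illustrative placeholders, not the values shipped with the example:

```python
import json

# Load the se_e3 training input and shrink the descriptor to reduce memory use.
with open("input.json") as f:
    config = json.load(f)

descriptor = config["model"]["descriptor"]
descriptor["sel"] = [20, 40]   # illustrative: fewer selected neighbors per type
descriptor["rcut"] = 4.0       # illustrative: a smaller cutoff also shrinks the neighbor list

with open("input_small.json", "w") as f:
    json.dump(config, f, indent=4)
```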

@njzjz njzjz added the reproduced This bug has been reproduced by developers label Aug 30, 2021

amcadmus commented Sep 1, 2021

The memory consumption of se3 is much larger than that of se2, so it is recommended to use a smaller rcut and sel for se3. A hybrid of se2 (standard rcut) and se3 (small rcut) is good practice.
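
For illustration, a rough sketch of such a hybrid `model.descriptor` section written as a Python dict; all cutoffs, `sel` values, and network sizes below are placeholders rather than recommended settings:

```python
# Placeholder hybrid descriptor: se_e2_a with the standard cutoff plus
# se_e3 with a smaller cutoff, as suggested above. All numbers are illustrative.
hybrid_descriptor = {
    "type": "hybrid",
    "list": [
        {
            "type": "se_e2_a",        # two-body embedding, standard rcut
            "sel": [46, 92],
            "rcut": 6.0,
            "rcut_smth": 0.5,
            "neuron": [25, 50, 100],
        },
        {
            "type": "se_e3",          # three-body embedding, small rcut
            "sel": [10, 20],
            "rcut": 4.0,
            "rcut_smth": 0.5,
            "neuron": [2, 4, 8],
        },
    ],
}
```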


njzjz commented Sep 1, 2021

> The memory consumption of se3 is much larger than that of se2, so it is recommended to use a smaller rcut and sel for se3. A hybrid of se2 (standard rcut) and se3 (small rcut) is good practice.

Maybe we can reduce rcut in examples.

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Sep 1, 2021
amcadmus pushed a commit that referenced this issue Sep 1, 2021
@Dankomaister

Hi!

I'm running parallel training using DeepMD 2.0.0 on a compute node with 8x NVIDIA A100 GPUs (A100-SXM4-40GB).
These GPUs have 40 GB of memory each. However, I can't seem to use more than around 3 GB of memory per GPU without running into the `failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE` error.

For the dataset I'm using, running with a batch size of 16 works fine and shows a memory utilization of around 3 GB per GPU.
But increasing the batch size to 17 causes the `failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE` error.
It doesn't seem reasonable that increasing the batch size from 16 to 17 would cause the memory usage to grow to more than 40 GB per GPU.

Especially since the memory usage increase with batch size is not that big:
- batch size = 5 uses ~1400 MB of memory per GPU
- batch size = 10 uses ~1930 MB of memory per GPU
- batch size = 15 uses ~2950 MB of memory per GPU
- batch size = 16 uses ~2950 MB of memory per GPU

Could this be related to something other than memory?

I attached some logfiles of the error and screenshots of the memory usage.
batch size = 17.txt
batch size = 20.txt

[Screenshots of per-GPU memory usage for batch sizes 5, 10, 15, and 16]

/Daniel

@Dankomaister

Hi,

Is this issue now solved in the new version (v2.0.2) of deepmd-kit?

/Daniel

@Dankomaister

Hi @njzjz,
It seems like this problem is related to the fact that TensorFlow v2.5 is incompatible with CUDA 11.3.
Is there any chance we could get a conda version of DeePMD compiled with the correct CUDA version (11.2)?
This problem is really hindering efficient usage of DeePMD on our cluster, so I would appreciate it if it could be solved :)

/Daniel


njzjz commented Oct 12, 2021

@Dankomaister

Hi,

OK, maybe I will try that!
Have you tested and confirmed that this problem is solved when using CUDA 11.2 together with DeePMD 2.0.2?

@njzjz njzjz self-assigned this Nov 1, 2021
@njzjz njzjz moved this from Todo to In Progress in Bugfixes for DeePMD-kit Dec 1, 2021
@njzjz njzjz moved this from In Progress to Done in Bugfixes for DeePMD-kit Dec 1, 2021
@njzjz njzjz moved this from Done to Todo in Bugfixes for DeePMD-kit Dec 1, 2021

njzjz commented Jan 16, 2022

Note: this may be a bug in cuBLAS, which we cannot fix ourselves.
See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-11.4.0
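
For anyone affected, a quick (unofficial) way to check which CUDA and cuDNN versions the installed TensorFlow build was compiled against; the builds hitting this issue were compiled against CUDA 11.3:

```python
import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow build was compiled against.
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"))
print("cuDNN:", build.get("cudnn_version"))
```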

@njzjz njzjz added wontfix and removed help wanted labels Jan 16, 2022
@njzjz njzjz closed this as completed Jan 16, 2022
Repository owner moved this from Todo to Done in Bugfixes for DeePMD-kit Jan 16, 2022

njzjz commented Mar 10, 2022

> OK, maybe I will try that! Have you tested and confirmed that this problem is solved when using CUDA 11.2 together with DeePMD 2.0.2?

As reported by tensorflow/tensorflow#54463 (comment), CUDA 11.2 also has this bug.

The next time I build TensorFlow, I will bump the CUDA version to 11.4.


njzjz commented May 22, 2022

The next release will bump the CUDA version from 11.3 to 11.6.


njzjz commented May 23, 2022

The v2.1.1 installer, built with CUDA 11.6, has been released.

@njzjz njzjz pinned this issue Dec 10, 2022
@njzjz njzjz unpinned this issue Jul 6, 2023
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Sep 21, 2023