[BUG] error in training examples/water/se_e3:failed to run cuBLAS routine: CUBLAS_STATUS_INVALID_VALUE #1062
Comments
CUDA 10.1 works fine. Maybe a bug in CUDA 11.3?
Update: it's related to memory. Reducing
The memory consumption of se3 is much larger than that of se2. Therefore, it is recommended to use a smaller rcut and sel for se3. A hybridization of se2 (standard rc) and se3 (small rc) is a good practice.
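A minimal sketch of the hybrid setup suggested above, written as a Python script that prints the `descriptor` fragment of a DeePMD-kit `input.json`. All numeric values (cutoffs, `sel`, network sizes) are illustrative assumptions, not settings taken from this thread. The `neighbor_pairs` helper rests on the rough assumption that a three-body descriptor works on pairs of neighbors, so its cost grows about quadratically with the total `sel`.

```python
import json

# Two-body descriptor with a standard cutoff (values are assumptions).
se2 = {
    "type": "se_e2_a",
    "rcut": 6.0,
    "rcut_smth": 0.5,
    "sel": [46, 92],
    "neuron": [25, 50, 100],
    "axis_neuron": 16,
    "resnet_dt": False,
}

# Three-body descriptor with a smaller cutoff and fewer selected
# neighbors to limit memory use (values are assumptions).
se3 = {
    "type": "se_e3",
    "rcut": 4.0,
    "rcut_smth": 0.5,
    "sel": [12, 24],
    "neuron": [2, 4, 8],
    "resnet_dt": False,
}

# The hybrid descriptor combines both sub-descriptors in a list.
hybrid_descriptor = {"type": "hybrid", "list": [se2, se3]}


def neighbor_pairs(sel):
    """Rough per-atom count of neighbor pairs for a three-body descriptor."""
    total = sum(sel)
    return total * total


# Compare a large se3 neighbor selection with the reduced one above.
print("pairs with sel [40, 80]:", neighbor_pairs([40, 80]))
print("pairs with sel [12, 24]:", neighbor_pairs(se3["sel"]))

# Fragment to place under "model" in input.json.
print(json.dumps({"descriptor": hybrid_descriptor}, indent=2))
```

The fragment is only the descriptor part of the model section; the rest of the training input stays as in the original example.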
Maybe we can reduce
Hi! I'm running parallel training using DeepMD 2.0.0 on a compute node with 8x NVIDIA A100 GPUs (A100-SXM4-40GB). For the dataset I'm using, running with a batch size of 16 works fine and shows memory utilization of around 3GB per GPU. Since the memory usage does not increase much with the batch size, could this be related to something other than memory? I attached some log files of the error and screenshots of the memory usage. /Daniel
Hi, is this issue now solved with the new version (v2.0.2) of deepmd-kit? /Daniel
Hi @njzjz, /Daniel
Hi @Dankomaister, you can try the pip packages instead by following https://docs.deepmodeling.org/projects/deepmd/en/v2.0.2/install/install-from-source.html#install-the-python-interface
Hi, OK, maybe I will try that!
Note: this may be a bug in cuBLAS, which we cannot fix ourselves.
As reported by tensorflow/tensorflow#54463 (comment), CUDA 11.2 also has this bug. The next time I build TensorFlow, I will bump the CUDA version to 11.4.
The next release will bump the CUDA version from 11.3 to 11.6.
The installer v2.1.1 with CUDA 11.6 has been released.
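To see which CUDA version an installed TensorFlow build was compiled against, something like the following sketch may help. It assumes TensorFlow 2.3 or later, where `tf.sysconfig.get_build_info()` is available, and returns `None` otherwise. The set `REPORTED_FAILING` and the helper names are hypothetical and only collect the versions reported or suspected in this thread; it is not an exhaustive list of affected versions.

```python
# CUDA versions reported or suspected in this thread as hitting the error.
# This is an assumption based on the comments above, not an official list.
REPORTED_FAILING = {"11.2", "11.3"}


def tf_cuda_build_version():
    """Return the CUDA version TensorFlow was built with, or None."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed in this environment
    # Older TensorFlow releases do not provide get_build_info().
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is None:
        return None
    # CPU-only builds have no "cuda_version" entry, so .get() yields None.
    return get_build_info().get("cuda_version")


def is_reported_failing(version):
    """True if the version string matches one mentioned in this thread."""
    return version in REPORTED_FAILING


version = tf_cuda_build_version()
print("TensorFlow CUDA build version:", version)
print("Mentioned as failing in this thread:", is_reported_failing(version))
```

A `False` result only means the version was not mentioned here; it does not guarantee that the build is unaffected.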
See #1061.
I can reproduce the error.
CPU works fine.