PS: Is the ROCm version of DeePMD-kit no longer maintained? Is there an updated Docker image that includes newer DeePMD-kit and ROCm versions?
Main Topic:
Envs: Docker image pulled with docker pull deepmodeling/dpmdkit-rocm:dp2.0.3-rocm4.5.2-tf2.6-lmp29Sep2021; the physical host has 8 MI GPUs.
Envs test: Inside the DeePMD-kit container, rocm-smi runs correctly and reports all eight MI GPUs, and tensorflow/benchmarks also runs correctly.
Problem: In the examples/water/se_e2_a folder, running dp train input.json fails with the error below.
Error message:
Instructions for updating:
non-resource variables are not supported in the long term
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
[hpe01:00397] *** Process received signal ***
[hpe01:00397] Signal: Aborted (6)
[hpe01:00397] Signal code: (-6)
[hpe01:00397] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f3100f7e040]
[hpe01:00397] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f3100f7dfb7]
[hpe01:00397] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f3100f7f921]
[hpe01:00397] [ 3] /opt/rocm/lib/libamdhip64.so.4(+0x1cfdf9)[0x7f2e8badadf9]
[hpe01:00397] [ 4] /opt/rocm/lib/libamdhip64.so.4(+0x623a0)[0x7f2e8b96d3a0]
[hpe01:00397] [ 5] /opt/rocm/lib/libamdhip64.so.4(+0x97aa9)[0x7f2e8b9a2aa9]
[hpe01:00397] [ 6] /opt/rocm/lib/libamdhip64.so.4(+0x5fc44)[0x7f2e8b96ac44]
[hpe01:00397] [ 7] /opt/rocm/lib/libamdhip64.so.4(+0x13ec98)[0x7f2e8ba49c98]
[hpe01:00397] [ 8] /opt/rocm/lib/libamdhip64.so.4(+0x6e159)[0x7f2e8b979159]
[hpe01:00397] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf907)[0x7f3100d2f907]
[hpe01:00397] [10] /opt/rocm/lib/libamdhip64.so.4(+0x65acc)[0x7f2e8b970acc]
[hpe01:00397] [11] /opt/rocm/lib/libamdhip64.so.4(hipInit+0x65)[0x7f2e8b970ea5]
[hpe01:00397] [12] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x10bbd36a)[0x7f2ebe45f36a]
[hpe01:00397] [13] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor3gpu9GpuDriver4InitEv+0x1dd)[0x7f2ebe45f60d]
[hpe01:00397] [14] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZNK15stream_executor3gpu12ROCmPlatform18VisibleDeviceCountEv+0x18)[0x7f2ebe446278]
[hpe01:00397] [15] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISE_EESaISH_EE+0x7f)[0x7f2e8d73256f]
[hpe01:00397] [16] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISE_EESaISH_EE+0xc6)[0x7f2e8d4df4a6]
[hpe01:00397] [17] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow20DirectSessionFactory10NewSessionERKNS_14SessionOptionsEPPNS_7SessionE+0x2e7)[0x7f2eb2b77cc7]
[hpe01:00397] [18] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow10NewSessionERKNS_14SessionOptionsEPPNS_7SessionE+0xe3)[0x7f2e8d8a6613]
[hpe01:00397] [19] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TF_NewSession+0x3b)[0x7f2eb23aa0db]
[hpe01:00397] [20] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16TF_NewSessionRefEP8TF_GraphPK17TF_SessionOptionsP9TF_Status+0x12)[0x7f2eb22cb0b2]
[hpe01:00397] [21] /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/_pywrap_tf_session.so(+0x6eaf7)[0x7f2e8a482af7]
[hpe01:00397] [22] /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/_pywrap_tf_session.so(+0x5c1e7)[0x7f2e8a4701e7]
[hpe01:00397] [23] /usr/bin/python3[0x50a865]
[hpe01:00397] [24] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[hpe01:00397] [25] /usr/bin/python3[0x507f94]
[hpe01:00397] [26] /usr/bin/python3[0x509cc0]
[hpe01:00397] [27] /usr/bin/python3[0x50a6bd]
[hpe01:00397] [28] /usr/bin/python3(_PyEval_EvalFrameDefault+0x1225)[0x50d055]
[hpe01:00397] [29] /usr/bin/python3[0x507f94]
[hpe01:00397] *** End of error message ***
Aborted (core dumped)
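For anyone triaging this trace, most frames are unnamed offsets, so it can help to pull out only the frames that carry a symbol. The sketch below is a hypothetical helper, not part of DeePMD-kit, TensorFlow, or ROCm; the names `FRAME_RE`, `parse_frames`, and `named_symbols` are made up for illustration. It assumes the frame layout seen in the log above, `[host:pid] [ n] path(symbol+offset)[address]`. On this log it would surface `hipInit` (frame 11) beneath `stream_executor::gpu::GpuDriver::Init` (frame 13, mangled as `_ZN15stream_executor3gpu9GpuDriver4InitEv`), which suggests the abort happens while TensorFlow initializes the GPU driver during session creation.

```python
import re

# Assumed frame layout, copied from the log above:
# [hpe01:00397] [11] /opt/rocm/lib/libamdhip64.so.4(hipInit+0x65)[0x7f2e8b970ea5]
FRAME_RE = re.compile(
    r"^\[[^\]]+\]\s+"                 # "[host:pid]" prefix
    r"\[\s*(?P<index>\d+)\]\s+"       # frame number, e.g. "[ 3]" or "[11]"
    r"(?P<path>[^(\[\s]+)"            # path of the binary or shared library
    r"(?:\((?P<symbol>[^)]*)\))?"     # optional "(symbol+offset)"
    r"\[0x[0-9a-f]+\]$"               # return address
)


def parse_frames(log_text):
    """Return (index, library_path, symbol_name) for each backtrace frame."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # signal banners and other non-frame lines
        symbol = match.group("symbol") or ""
        # "hipInit+0x65" -> "hipInit"; "+0x1cfdf9" -> "" (unnamed frame)
        name = symbol.split("+", 1)[0]
        frames.append((int(match.group("index")), match.group("path"), name))
    return frames


def named_symbols(frames):
    """Keep only the frames that have a symbol name, innermost first."""
    return [(index, name) for index, _path, name in frames if name]
```

Calling `named_symbols(parse_frames(log_text))` on the pasted log lists the named frames from the abort outward; the mangled C++ names can then be demangled with a tool such as `c++filt`.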
Bug summary
DeePMD-kit Version
v2.0.4
TensorFlow Version
2.6.0
How did you download the software?
docker
Input Files, Running Commands, Error Log, etc.
Input: /deepmd/examples/water/se_e2_a/input.json
Commands: dp train input.json
Log:
Steps to Reproduce
Further Information, Files, and Links
No response