[BUG] Using Deepmd's ROCM official docker, there is a problem with training on the rocm machine #1998

wangych6 · 2022-10-14T07:20:06Z

Bug summary

PS: Is the ROCm version of DeePMD-kit no longer maintained? Is there an updated Docker image that includes newer DeePMD-kit and ROCm versions?

Main topic:
Environment: Docker image pulled with docker pull deepmodeling/dpmdkit-rocm:dp2.0.3-rocm4.5.2-tf2.6-lmp29Sep2021; the physical machine has 8 MI GPUs.
Environment test: Inside the DeePMD-kit container, rocm-smi runs correctly and reports all eight MI GPUs, and tensorflow/benchmarks also runs correctly.

Problem: In the examples/water/se_e2_a folder, running the command dp train input.json fails with an error.
Error message:

Instructions for updating:
non-resource variables are not supported in the long term
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
[hpe01:00397] *** Process received signal ***
[hpe01:00397] Signal: Aborted (6)
[hpe01:00397] Signal code: (-6)
[hpe01:00397] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f3100f7e040]
[hpe01:00397] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f3100f7dfb7]
[hpe01:00397] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f3100f7f921]
[hpe01:00397] [ 3] /opt/rocm/lib/libamdhip64.so.4(+0x1cfdf9)[0x7f2e8badadf9]
[hpe01:00397] [ 4] /opt/rocm/lib/libamdhip64.so.4(+0x623a0)[0x7f2e8b96d3a0]
[hpe01:00397] [ 5] /opt/rocm/lib/libamdhip64.so.4(+0x97aa9)[0x7f2e8b9a2aa9]
[hpe01:00397] [ 6] /opt/rocm/lib/libamdhip64.so.4(+0x5fc44)[0x7f2e8b96ac44]
[hpe01:00397] [ 7] /opt/rocm/lib/libamdhip64.so.4(+0x13ec98)[0x7f2e8ba49c98]
[hpe01:00397] [ 8] /opt/rocm/lib/libamdhip64.so.4(+0x6e159)[0x7f2e8b979159]
[hpe01:00397] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf907)[0x7f3100d2f907]
[hpe01:00397] [10] /opt/rocm/lib/libamdhip64.so.4(+0x65acc)[0x7f2e8b970acc]
[hpe01:00397] [11] /opt/rocm/lib/libamdhip64.so.4(hipInit+0x65)[0x7f2e8b970ea5]
[hpe01:00397] [12] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x10bbd36a)[0x7f2ebe45f36a]
[hpe01:00397] [13] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor3gpu9GpuDriver4InitEv+0x1dd)[0x7f2ebe45f60d]
[hpe01:00397] [14] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZNK15stream_executor3gpu12ROCmPlatform18VisibleDeviceCountEv+0x18)[0x7f2ebe446278]
[hpe01:00397] [15] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISE_EESaISH_EE+0x7f)[0x7f2e8d73256f]
[hpe01:00397] [16] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISE_EESaISH_EE+0xc6)[0x7f2e8d4df4a6]
[hpe01:00397] [17] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow20DirectSessionFactory10NewSessionERKNS_14SessionOptionsEPPNS_7SessionE+0x2e7)[0x7f2eb2b77cc7]
[hpe01:00397] [18] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow10NewSessionERKNS_14SessionOptionsEPPNS_7SessionE+0xe3)[0x7f2e8d8a6613]
[hpe01:00397] [19] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TF_NewSession+0x3b)[0x7f2eb23aa0db]
[hpe01:00397] [20] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16TF_NewSessionRefEP8TF_GraphPK17TF_SessionOptionsP9TF_Status+0x12)[0x7f2eb22cb0b2]
[hpe01:00397] [21] /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/_pywrap_tf_session.so(+0x6eaf7)[0x7f2e8a482af7]
[hpe01:00397] [22] /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/_pywrap_tf_session.so(+0x5c1e7)[0x7f2e8a4701e7]
[hpe01:00397] [23] /usr/bin/python3[0x50a865]
[hpe01:00397] [24] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[hpe01:00397] [25] /usr/bin/python3[0x507f94]
[hpe01:00397] [26] /usr/bin/python3[0x509cc0]
[hpe01:00397] [27] /usr/bin/python3[0x50a6bd]
[hpe01:00397] [28] /usr/bin/python3(_PyEval_EvalFrameDefault+0x1225)[0x50d055]
[hpe01:00397] [29] /usr/bin/python3[0x507f94]
[hpe01:00397] *** End of error message ***
Aborted (core dumped)

DeePMD-kit Version

v2.0.4

TensorFlow Version

2.6.0

How did you download the software?

docker

Input Files, Running Commands, Error Log, etc.

Input: /deepmd/examples/water/se_e2_a/input.json
Commands: dp train input.json
Log: identical to the error message and backtrace quoted in the bug summary above.

Steps to Reproduce

docker pull deepmodeling/dpmdkit-rocm:dp2.0.3-rocm4.5.2-tf2.6-lmp29Sep2021
cd /root/deepmd-kit/examples/water/se_e2_a
dp train input.json

Further Information, Files, and Links

No response

njzjz · 2022-10-14T20:42:50Z

hipErrorNoBinaryForGpu: Unable to find code object for all current devices!

See ROCm/ROCm#1623 (comment) for this error.
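
For context, hipErrorNoBinaryForGpu generally indicates that the HIP runtime found no compiled code object matching the architecture (the gfx target) of at least one visible GPU. The linked upstream comment is not reproduced here, so the following is only a minimal diagnostic sketch, not the fix it describes. It assumes the rocminfo tool is available in the container and that its output contains gfx names; the built_for list is a hypothetical placeholder, because this thread does not state which targets the image's binaries were compiled for.

```python
import re
import shutil
import subprocess

# Matches AMD GPU ISA names such as gfx906 or gfx90a.
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]{3,4}\b")


def extract_gfx_targets(text):
    """Return the sorted, de-duplicated gfx architecture names found in text."""
    return sorted(set(GFX_PATTERN.findall(text)))


def missing_targets(device_targets, built_targets):
    """Return device architectures that have no matching compiled target."""
    built = set(built_targets)
    return [t for t in device_targets if t not in built]


def detect_device_targets():
    """Run rocminfo (if installed) and return the gfx names it reports."""
    if shutil.which("rocminfo") is None:
        return []
    # Python 3.6-compatible call (the container in this report ships Python 3.6).
    result = subprocess.run(
        ["rocminfo"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return extract_gfx_targets(result.stdout)


if __name__ == "__main__":
    # ASSUMPTION: placeholder list; replace it with the targets your
    # TensorFlow / DeePMD-kit binaries were actually compiled for.
    built_for = ["gfx906", "gfx908"]
    devices = detect_device_targets()
    print("GPU architectures reported by rocminfo:", devices)
    print("Architectures without a compiled code object:",
          missing_targets(devices, built_for))
```

If the second list is non-empty, the GPUs are likely newer (or simply different) than the targets the image was built for, which would be consistent with the error above.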

wangych6 added the bug label Oct 14, 2022

wangych6 closed this as completed Oct 14, 2022

njzjz added the upstream label Oct 14, 2022
