[BUG] Training fails on a ROCm machine with DeePMD-kit's official ROCm Docker image #1998

Closed
wangych6 opened this issue Oct 14, 2022 · 1 comment

Comments

@wangych6

Bug summary

PS: Is the ROCm version of DeePMD-kit no longer maintained? Is there an updated Docker image that includes a newer DeePMD-kit version and a newer ROCm version?

Main Topic:
Environment: Docker image pulled with docker pull deepmodeling/dpmdkit-rocm:dp2.0.3-rocm4.5.2-tf2.6-lmp29Sep2021; the physical machine has eight MI GPUs.
Environment checks: Inside the DeePMD-kit container, rocm-smi runs correctly and reports all eight MI GPUs, and tensorflow/benchmarks also runs correctly.

Problem: Running dp train input.json in the examples/water/se_e2_a folder fails with the error below.
Error message:

Instructions for updating:
non-resource variables are not supported in the long term
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
[hpe01:00397] *** Process received signal ***
[hpe01:00397] Signal: Aborted (6)
[hpe01:00397] Signal code: (-6)
[hpe01:00397] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f3100f7e040]
[hpe01:00397] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f3100f7dfb7]
[hpe01:00397] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f3100f7f921]
[hpe01:00397] [ 3] /opt/rocm/lib/libamdhip64.so.4(+0x1cfdf9)[0x7f2e8badadf9]
[hpe01:00397] [ 4] /opt/rocm/lib/libamdhip64.so.4(+0x623a0)[0x7f2e8b96d3a0]
[hpe01:00397] [ 5] /opt/rocm/lib/libamdhip64.so.4(+0x97aa9)[0x7f2e8b9a2aa9]
[hpe01:00397] [ 6] /opt/rocm/lib/libamdhip64.so.4(+0x5fc44)[0x7f2e8b96ac44]
[hpe01:00397] [ 7] /opt/rocm/lib/libamdhip64.so.4(+0x13ec98)[0x7f2e8ba49c98]
[hpe01:00397] [ 8] /opt/rocm/lib/libamdhip64.so.4(+0x6e159)[0x7f2e8b979159]
[hpe01:00397] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf907)[0x7f3100d2f907]
[hpe01:00397] [10] /opt/rocm/lib/libamdhip64.so.4(+0x65acc)[0x7f2e8b970acc]
[hpe01:00397] [11] /opt/rocm/lib/libamdhip64.so.4(hipInit+0x65)[0x7f2e8b970ea5]
[hpe01:00397] [12] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x10bbd36a)[0x7f2ebe45f36a]
[hpe01:00397] [13] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor3gpu9GpuDriver4InitEv+0x1dd)[0x7f2ebe45f60d]
[hpe01:00397] [14] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZNK15stream_executor3gpu12ROCmPlatform18VisibleDeviceCountEv+0x18)[0x7f2ebe446278]
[hpe01:00397] [15] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISE_EESaISH_EE+0x7f)[0x7f2e8d73256f]
[hpe01:00397] [16] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISE_EESaISH_EE+0xc6)[0x7f2e8d4df4a6]
[hpe01:00397] [17] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow20DirectSessionFactory10NewSessionERKNS_14SessionOptionsEPPNS_7SessionE+0x2e7)[0x7f2eb2b77cc7]
[hpe01:00397] [18] /usr/local/lib/python3.6/dist-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow10NewSessionERKNS_14SessionOptionsEPPNS_7SessionE+0xe3)[0x7f2e8d8a6613]
[hpe01:00397] [19] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TF_NewSession+0x3b)[0x7f2eb23aa0db]
[hpe01:00397] [20] /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow16TF_NewSessionRefEP8TF_GraphPK17TF_SessionOptionsP9TF_Status+0x12)[0x7f2eb22cb0b2]
[hpe01:00397] [21] /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/_pywrap_tf_session.so(+0x6eaf7)[0x7f2e8a482af7]
[hpe01:00397] [22] /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/_pywrap_tf_session.so(+0x5c1e7)[0x7f2e8a4701e7]
[hpe01:00397] [23] /usr/bin/python3[0x50a865]
[hpe01:00397] [24] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[hpe01:00397] [25] /usr/bin/python3[0x507f94]
[hpe01:00397] [26] /usr/bin/python3[0x509cc0]
[hpe01:00397] [27] /usr/bin/python3[0x50a6bd]
[hpe01:00397] [28] /usr/bin/python3(_PyEval_EvalFrameDefault+0x1225)[0x50d055]
[hpe01:00397] [29] /usr/bin/python3[0x507f94]
[hpe01:00397] *** End of error message ***
Aborted (core dumped)

DeePMD-kit Version

v2.0.4

TensorFlow Version

2.6.0

How did you download the software?

docker

Input Files, Running Commands, Error Log, etc.

Input: /deepmd/examples/water/se_e2_a/input.json
Commands: dp train input.json
Log: identical to the error message in the Bug summary above.

Steps to Reproduce

  1. docker pull deepmodeling/dpmdkit-rocm:dp2.0.3-rocm4.5.2-tf2.6-lmp29Sep2021
  2. cd /root/deepmd-kit/examples/water/se_e2_a
  3. dp train input.json

Further Information, Files, and Links

No response

@wangych6 wangych6 added the bug label Oct 14, 2022
@njzjz
Member

njzjz commented Oct 14, 2022

hipErrorNoBinaryForGpu: Unable to find code object for all current devices!

See ROCm/ROCm#1623 (comment) for this error.
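
For context, hipErrorNoBinaryForGpu generally means that the HIP libraries being loaded contain no code object built for the architecture of the GPUs that are present. A minimal sketch for checking which architecture the GPUs report, assuming rocminfo is installed in the container (the helper name list_gfx_targets is hypothetical and not part of ROCm or DeePMD-kit):

```shell
#!/bin/sh
# Hypothetical helper: read rocminfo-style text on stdin and print
# each distinct gfx target (for example gfx908) once.
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Run against the real tool only when it is available (assumption:
# rocminfo is on PATH inside the container).
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | list_gfx_targets
fi
```

If the reported target is not one that the image's TensorFlow and DeePMD-kit binaries were compiled for, the abort during hipInit shown above would be the expected outcome. Which targets this particular image supports is not stated in this thread.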

@njzjz njzjz added the upstream label Oct 14, 2022