[urgent] HipBuildImpl() seems fail to create temp dir or due to no GPU presents in build env #2958

Open
junliume opened this issue May 11, 2024 · 6 comments
@junliume
Collaborator

build_fail.log

Actual Result:

[2024-05-10T20:01:46.194Z] /src/hip/hip_build_utils.cpp:186:42: error: expected ')'
[2024-05-11T09:45:27.064Z]   186 |             MIOPEN_THROW("Failed cmd: '" MIOPEN_HIP_COMPILER "', args: '" + args + '\'');
[2024-05-11T09:45:27.064Z]       |                                          ^
[2024-05-11T09:45:27.064Z] /build_hip/include/miopen/config.h:98:29: note: expanded from macro 'MIOPEN_HIP_COMPILER'
[2024-05-11T09:45:27.064Z]    98 | #define MIOPEN_HIP_COMPILER getHIPCompilerPath()
[2024-05-11T09:45:27.064Z]       |                             ^
[2024-05-11T09:45:27.064Z] /src/hip/hip_build_utils.cpp:186:13: note: to match this '('
[2024-05-11T09:45:27.064Z]   186 |             MIOPEN_THROW("Failed cmd: '" MIOPEN_HIP_COMPILER "', args: '" + args + '\'');
[2024-05-11T09:45:27.064Z]       |             ^
[2024-05-11T09:45:27.064Z] /src/include/miopen/errors.hpp:69:28: note: expanded from macro 'MIOPEN_THROW'
[2024-05-11T09:45:27.064Z]    69 |         miopen::MIOpenThrow(__FILE__, __LINE__, __VA_ARGS__); \

It looks like the following check has failed:

        if(!fs::exists(bin_file))
            MIOPEN_THROW("Failed cmd: '" MIOPEN_HIP_COMPILER "', args: '" + args + '\'');

@junliume junliume self-assigned this May 11, 2024
@junliume
Collaborator Author

@atamazov @apwojcik could you help take a look? I know it is not easily reproducible, and I am requesting an exact reproduction environment now. Could you check statically what the potential issue might be? Is it a permission problem when creating the temp dir?

@junliume
Collaborator Author

@apwojcik @JehandadKhan @atamazov Another theory: the line

MIOPEN_THROW("Failed cmd: '" MIOPEN_HIP_COMPILER "', args: '" + args + '\'');

has mismatched quotation marks. This was not triggered in normal situations, but on a node where no GPU is present, MIOPEN_THROW was triggered and produced this error. In other words, this throw is unfortunately not properly tested.

How about changing it to:

MIOPEN_THROW("Failed cmd: '" + MIOPEN_HIP_COMPILER + "', args: '" + args + '\'');
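A minimal sketch of why the original line stops parsing once the macro expands to a function call, and why an explicit `+` needs a `std::string` operand to work in both cases. All names below (`COMPILER_AS_LITERAL`, `COMPILER_AS_CALL`, `GetCompilerPathSketch`) and the path are hypothetical stand-ins, not MIOpen's definitions. The return type of the real `getHIPCompilerPath()` is not shown in this thread and is assumed to be string-like here.

```cpp
#include <string>

// Hypothetical stand-in: the macro expands to a string literal.
#define COMPILER_AS_LITERAL "/path/to/clang++"

// Hypothetical stand-in for a path that is looked up at run time.
inline std::string GetCompilerPathSketch() { return "/path/to/clang++"; }

// Hypothetical stand-in: the macro expands to a function call.
#define COMPILER_AS_CALL GetCompilerPathSketch()

// Adjacent string literals are merged by the compiler, so no '+' is needed
// between them. This form only works while the macro is a literal.
inline std::string MessageFromLiteral(const std::string& args)
{
    return "Failed cmd: '" COMPILER_AS_LITERAL "', args: '" + args + '\'';
}

// The same pattern with a call does not parse (clang reports "expected ')'"),
// because a literal followed by an expression is not concatenated:
//   return "Failed cmd: '" COMPILER_AS_CALL "', args: '" + args + '\'';

// Starting the chain with std::string makes every '+' a string concatenation,
// regardless of whether the macro yields a literal or a std::string.
inline std::string MessageFromCall(const std::string& args)
{
    return std::string("Failed cmd: '") + COMPILER_AS_CALL + "', args: '" + args + '\'';
}
```

Note that a bare `"..." + MACRO + "..."` would still fail if the macro were a plain literal, since two `const char*` operands cannot be added. That is one reason to make the first operand a `std::string` explicitly.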

@junliume junliume changed the title [urgent] HipBuildImpl() seems fail to create temp dir? [urgent] HipBuildImpl() seems fail to create temp dir or due to no GPU presents in build env May 12, 2024
@atamazov
Contributor

@junliume

MIOPEN_THROW("Failed cmd: '" + MIOPEN_HIP_COMPILER + "', args: '" + args + '\'');

Almost like that, pls see #2959 (review)

@junliume
Collaborator Author

Thanks @atamazov, especially since it's after hours :)

I am still puzzled why this is happening only now. Staging has likely been building MIOpen on nodes without a GPU for a while, but such errors only started appearing recently. So maybe nogpu backend should be fixed somehow? :)

@atamazov
Contributor

@junliume In the attached logfile I see:

[2024-05-10T20:00:55.207Z] + cmake '-DCMAKE_PREFIX_PATH=/opt/rocm-6.2.0-488/llvm;/opt/rocm-6.2.0-488' '-DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,$ORIGIN' '-DCMAKE_EXE_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,$ORIGIN/../lib' -DCMAKE_VERBOSE_MAKEFILE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=FALSE -DCMAKE_INSTALL_PREFIX=/opt/rocm-6.2.0-488 -DCMAKE_PACKAGING_INSTALL_PREFIX=/opt/rocm-6.2.0-488 -DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF -DROCM_SYMLINK_LIBS=OFF -DCPACK_PACKAGING_INSTALL_PREFIX=/opt/rocm-6.2.0-488 -DROCM_DISABLE_LDCONFIG=ON -DROCM_PATH=/opt/rocm-6.2.0-488 -DCPACK_DEBIAN_DEBUGINFO_PACKAGE=FALSE -DCPACK_RPM_DEBUGINFO_PACKAGE=FALSE -DCPACK_RPM_INSTALL_WITH_EXEC=FALSE -DCMAKE_BUILD_TYPE=Release -DMIOPEN_BACKEND=HIP -DMIOPEN_OFFLINE_COMPILER_PATHS_V2=1 -DCMAKE_CXX_COMPILER=/opt/rocm-6.2.0-488/llvm/bin/clang++ -DCMAKE_C_COMPILER=/opt/rocm-6.2.0-488/llvm/bin/clang '-DCMAKE_PREFIX_PATH=/opt/rocm-6.2.0-488;/opt/rocm-6.2.0-488/hip;/long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps' -DHIP_OC_COMPILER=/opt/rocm-6.2.0-488/bin/clang-ocl /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen

I have no idea where all these options come from and why. Among them is -DMIOPEN_OFFLINE_COMPILER_PATHS_V2=1, which is an indirect source of the error.

So maybe nogpu backend should be fixed somehow? :)

Let's enable MIOPEN_OFFLINE_COMPILER_PATHS_V2 by default and see ;)
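To make the "indirect source" point concrete, here is a hedged sketch of how a build option can change what a config macro expands to, so that a call site written for one form breaks under the other. The switch `SKETCH_OFFLINE_COMPILER_PATHS_V2`, the macro `SKETCH_HIP_COMPILER`, the function `SketchGetHIPCompilerPath`, and the path are all hypothetical. Only the expansion to `getHIPCompilerPath()` in the failing build is confirmed by the log above; the literal form in the other branch is an assumption.

```cpp
#include <string>

// Hypothetical switch standing in for the CMake option.
// 1 mimics the configuration that failed in the log.
#ifndef SKETCH_OFFLINE_COMPILER_PATHS_V2
#define SKETCH_OFFLINE_COMPILER_PATHS_V2 1
#endif

#if SKETCH_OFFLINE_COMPILER_PATHS_V2
// Path resolved at run time, so the macro expands to a function call.
inline std::string SketchGetHIPCompilerPath() { return "/path/to/clang++"; }
#define SKETCH_HIP_COMPILER SketchGetHIPCompilerPath()
#else
// Assumed alternative: path fixed at configure time, so the macro is a literal.
#define SKETCH_HIP_COMPILER "/path/to/clang++"
#endif

// Converting to std::string first keeps this call site valid in both
// configurations, since std::string is constructible from either form.
inline std::string SketchCompilerForMessage()
{
    return std::string(SKETCH_HIP_COMPILER);
}
```

Building the same file once with the switch set to 0 and once with 1 is a cheap way to catch call sites that silently depend on the literal form.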

@junliume
Collaborator Author

-DMIOPEN_OFFLINE_COMPILER_PATHS_V2=1

@atamazov Yes! With -DMIOPEN_OFFLINE_COMPILER_PATHS_V2=1 I can finally reproduce this issue (I should have checked the cmake options more carefully).

The option was added in #2694; however, it was not considered as a default until we discovered it now.

BTW~ with #2959 it seems that we can build successfully even with this option enabled.
