Finding the cause for ZE_RESULT_ERROR_UNINITIALIZED #140

Open
maleadt opened this issue Mar 29, 2024 · 2 comments
maleadt commented Mar 29, 2024

I'm working on oneAPI.jl, which provides Julia support for Intel GPUs through Level Zero. Occasionally users report hitting an opaque ZE_RESULT_ERROR_UNINITIALIZED when we call zeInit while loading oneAPI.jl. This error is unhelpful, and it makes it impossible to use the Level Zero APIs to figure out what's actually going wrong. For example, I've run into:

  • users not having a (supported) GPU
  • restrictive permissions on /dev/dri
  • conflicting library versions picked up (e.g. redistributed libze_loader vs system libze_tracing_layer)

Apart from the last one, I wouldn't expect the loader to fail to initialize: it could still allow iterating drivers (why else have this abstraction?) and ideally make it possible to determine why there are no devices. Currently, we typically find this out only after a painstaking remote debugging session using strace or LD_DEBUG.
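For what it's worth, the common causes above can usually be triaged with a short shell checklist before resorting to strace or LD_DEBUG. This is a sketch assuming a typical Linux setup; tool availability, group names, and library names vary by distribution:

```shell
# 1. Is an Intel GPU visible on the PCI bus at all?
lspci 2>/dev/null | grep -iE 'vga|display' || echo "lspci unavailable or no GPU listed"

# 2. Do the /dev/dri nodes exist, and can the current user access them?
ls -l /dev/dri/ 2>/dev/null || echo "no /dev/dri (kernel driver not loaded?)"
id -nG | grep -qw render && echo "user is in the 'render' group" \
                         || echo "user is NOT in the 'render' group"

# 3. Which libze_* libraries would the dynamic linker pick up?
ldconfig -p 2>/dev/null | grep -E 'libze_(loader|tracing_layer|intel_gpu)' \
    || echo "no libze_* libraries in the ldconfig cache"
```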

Am I missing something in the API here? CUDA, for example, has error codes that indicate at least a little better what may be happening (CUDA_ERROR_NO_DEVICE, CUDA_ERROR_DEVICE_UNAVAILABLE, CUDA_ERROR_DEVICE_NOT_LICENSED, etc).


Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing ZE_RESULT_ERROR_UNINITIALIZED: JuliaGPU/oneAPI.jl#399. LD_DEBUG reveals that the correct libraries are found, and strace shows that /dev/dri nodes are successfully discovered and opened.

I've found out about some environment variables to increase logging, but the output isn't very helpful:

❯ ZE_ENABLE_LOADER_DEBUG_TRACE=1 julia ...

ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_gpu.so.1
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_vpu.so.1
ZE_LOADER_DEBUG_TRACE:Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:Load Library of libze_tracing_layer.so.1 failed with libze_tracing_layer.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:check_drivers(flags=0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED))
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
❯ NEOReadDebugKeys=1 PrintDebugMessages=1 PrintXeLogs=1 julia ...
...
INFO: System Info query failed!
WARNING: Failed to request OCL Turbo Boost
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
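Since "System Info query failed!" suggests the user-space driver couldn't get what it needed from the kernel, the kernel log may hold the other half of the story. A generic sketch (the driver may be i915 or xe depending on hardware and kernel version, and reading dmesg may require root depending on kernel.dmesg_restrict):

```shell
# Look for GPU kernel-driver messages around initialization.
out=$(dmesg 2>/dev/null | grep -iE '\bi915\b|\bxe\b' | tail -20)
if [ -n "$out" ]; then
    printf '%s\n' "$out"
else
    echo "no i915/xe kernel messages readable (try: sudo dmesg)"
fi
```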

Any other suggestions on how to debug this would be much appreciated.


eero-t commented Apr 2, 2024

Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing ZE_RESULT_ERROR_UNINITIALIZED: JuliaGPU/oneAPI.jl#399. LD_DEBUG reveals that the correct libraries are found, and strace shows that /dev/dri nodes are successfully discovered and opened.

Is this with the very latest kernel?

I.e. does this help:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48

See: intel/compute-runtime#710


eero-t commented Apr 2, 2024

Having debugged several of these issues, I think this is a rather important bug...

For example, I've run into:
  • users not having a (supported) GPU

Or there may be a mismatch between the user-space and kernel drivers:

  • restrictive permissions on /dev/dri
  • conflicting library versions picked up (e.g. redistributed libze_loader vs system libze_tracing_layer)

Or the frontend implementing zesInit(), but the backend being an older one that does not implement it (as is the case in Ubuntu 23.10): intel/compute-runtime#650
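Whether a given system is affected by that frontend/backend mismatch can be checked by looking for the zesInit export directly. A sketch; the library name is the usual compute-runtime default and its path may differ per distribution:

```shell
# Locate the installed backend and check whether it exports zesInit.
lib=$(ldconfig -p 2>/dev/null | awk '/libze_intel_gpu\.so/ {print $NF; exit}')
if [ -n "$lib" ]; then
    if nm -D "$lib" 2>/dev/null | grep -qw zesInit; then
        echo "backend exports zesInit"
    else
        echo "backend does NOT export zesInit (older compute-runtime?)"
    fi
else
    echo "libze_intel_gpu.so not found in the ldconfig cache"
fi
```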

Apart from the last one, I wouldn't expect the loader to fail to initialize, but still allow iterating drivers (why else this abstraction?) and ideally being able to determine why there's no devices. Currently, we typically find this out after a painstaking remote debugging session using strace or LD_DEBUG.

Am I missing something in the API here? CUDA, for example, has error codes that indicate at least a little better what may be happening (CUDA_ERROR_NO_DEVICE, CUDA_ERROR_DEVICE_UNAVAILABLE, CUDA_ERROR_DEVICE_NOT_LICENSED, etc).

Looking at the current Level Zero frontend sources, it returns the "uninitialized" error for zesInit() regardless of whether zesInit() support is missing from the backend, or the backend function returned some other error (e.g. because there was no GPU).

Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing ZE_RESULT_ERROR_UNINITIALIZED: JuliaGPU/oneAPI.jl#399. LD_DEBUG reveals that the correct libraries are found, and strace shows that /dev/dri nodes are successfully discovered and opened.
...
Any other suggestions on how to debug this would be much appreciated.

Using the -k (stack trace) option of strace can give some additional clues about where things fail.
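For example, something along these lines (an illustrative invocation; -k requires an strace built with stack-unwinding support, and the syscall filter is just a suggestion):

```shell
# Trace only the syscalls relevant to device discovery, recording a stack
# trace (-k) for each so the failing frame's library is visible.
strace -f -k -e trace=openat,ioctl,access -o ze_init.trace \
    julia -e 'using oneAPI' || true

# A failing openat/ioctl on a /dev/dri/renderD* node then carries a
# user-space backtrace pointing into the responsible library:
grep -B 2 -A 15 'renderD' ze_init.trace | head -60
```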
