Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run_prebuilt_e2e_tests CI jobs fail in cases with UR API changes #16982

Open
EwanC opened this issue Feb 12, 2025 · 8 comments
Open

run_prebuilt_e2e_tests CI jobs fail in cases with UR API changes #16982

EwanC opened this issue Feb 12, 2025 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@EwanC
Copy link
Contributor

EwanC commented Feb 12, 2025

As discovered by @yingcong-wu in #16747 (comment) the run_prebuilt_e2e_tests CI jobs in PR testing fail in the case that the UR API changes.

I notice that those features are all tested in llvm/sycl/test-e2e/lit.cfg.py using subprocess.getstatusoutput. I added a print to those tests and they all show the following error:

2025-02-11T15:09:01.2540004Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:328: note: /usr/bin/ld: /__w/llvm/llvm/toolchain/bin/../lib/libsycl.so: undefined reference to `urEnqueueCommandBufferExp@LIBUR_LOADER_0.12'
2025-02-11T15:09:01.2540580Z clang++: error: linker command failed with exit code 1 (use -v to see invocation)
2025-02-11T15:09:01.2541173Z lit.py: /__w/llvm/llvm/llvm/sycl/test-e2e/lit.cfg.py:356: note: /usr/bin/ld: /__w/llvm/llvm/toolchain/bin/../lib/libsycl-preview.so: undefined reference to `urEnqueueCommandBufferExp@LIBUR_LOADER_0.12'

(In https://github.com/intel/llvm/actions/runs/13265484428/job/37031762080, line 104607).

I think there is a problem with those feature checks(maybe linking to an out-dated UR libs?), and I think it would impact any UR ABI breaking changes.

Experimental features in Unified Runtime are allowed to make API/ABI breaking changes, and the SYCL-Graph experimental oneAPI extension relies on being able to do this. This CI job should be updated so that it can pass in such a case, or a policy should be established with the gatekeepers whereby PRs that fail run_prebuilt_e2e_tests CI runs due to this reason can still be merged.

@EwanC EwanC added the bug Something isn't working label Feb 12, 2025
EwanC added a commit to Bensuo/unified-runtime that referenced this issue Feb 12, 2025
**Same patch as oneapi-src#2606
which was reverted due to intel/llvm#16982

The API to enqueue a closed command-buffer to a queue is defined in the YAML as a part of the command-buffer class, but it should be part of the enqueue class like other enqueue API extensions.

This PR updates the YAML and regenerates UR code, making the associated changes to adapters and CTS.

Closes oneapi-src#2600
EwanC added a commit to reble/llvm that referenced this issue Feb 12, 2025
**Same PR as was closed in intel#16747
due to intel#16982

Reflects change in name of UR entry-point from
`urCommandBufferEnqueueExp` to `urEnqueueCommandBufferExp` in oneapi-src/unified-runtime#2606
@sarnex sarnex self-assigned this Feb 12, 2025
@sarnex
Copy link
Contributor

sarnex commented Feb 12, 2025

@ayylol Do you mind taking a look at this? It seems specific to prebuilt e2e testing. If that's not related to the problem let me know and I'll investigate.

@sarnex sarnex assigned ayylol and unassigned sarnex Feb 12, 2025
@aelovikov-intel
Copy link
Contributor

AFAIK, normally one can use /path/to/freshly/built/clang++ -fsycl without setting extra env. We rely on that here:

cxx_compiler: $GITHUB_WORKSPACE/toolchain/bin/clang++
. Did UR break that? @EwanC

@ayylol
Copy link
Contributor

ayylol commented Feb 12, 2025

Hey @EwanC, for some context the way the "dev_kit" features are added, which seems to be the ones causing the issues in your pr, is by testing to see if a simple file that depends on those dev_kits can be compiled. In the case for the level_zero_dev_kit we try to compile the following file:

#include <level_zero/ze_api.h>
int main() { uint32_t t; zeDriverGet(&t, nullptr); return t; }

using the following command:

clang++ -fsycl l0_include.cpp -L<path-to-repo>/build/lib -lze_loader  -I<path-to-repo>/build/_deps/level-zero-loader-src/include

Given the similarity of the error message undefined reference to 'urEnqueueCommandBufferExp@LIBUR_LOADER_0.12' and the pr title "Update to UR cmd-buf enqueue rename" I don't think its a CI issue, and rather a UR issue from that commit.

@ayylol ayylol assigned EwanC and unassigned ayylol Feb 12, 2025
@EwanC
Copy link
Contributor Author

EwanC commented Feb 13, 2025

Thanks for investigating, I've tried that reproducer locally using #16984 which is the same change and shows the same CI issues in the prebuilt e2e test. Using the reproducer you suggest seems to work fine locally. Is there something I need to tweak locally to reproduce this issue?

$ git status
On branch ewan/rename_enqueue
$ cd build
$ ninja deploy-sycl-toolchain
$ cat l0_include.cpp
#include <level_zero/ze_api.h>
int main() { uint32_t t; zeDriverGet(&t, nullptr); return t; }
$ ./bin/clang++ -fsycl l0_include.cpp -L $PWD/lib -lze_loader  -I $PWD/_deps/level-zero-loader-src/include -o /tmp/l0_include

$ LD_LIBRARY_PATH=$PWD/lib /tmp/l0_include
$ echo $?
72

This is the same output I see as tip sycl

$ git checkout sycl
Switched to branch 'sycl'
Your branch is up-to-date with 'github/sycl'.
$ cd build
$ ninja deploy-sycl-toolchain
$ ./bin/clang++ -fsycl l0_include.cpp -L $PWD/lib -lze_loader  -I $PWD/_deps/level-zero-loader-src/include -o /tmp/l0_include
$ LD_LIBRARY_PATH=$PWD/lib /tmp/l0_include
$ echo $?
72

@EwanC EwanC assigned ayylol and unassigned EwanC Feb 13, 2025
@ayylol
Copy link
Contributor

ayylol commented Feb 13, 2025

@EwanC Ok, I investigated this a bit further, and Andrei's comment above is correct. The way the CI works at the build stage is we don't set the PATH to include the compiler rather we rely on the -DCMAKE_CXX_COMPILER cmake option when configuring the e2e tests. Doing it this way is where the issues are popping up in your pr.

The prebuilt running stages on the other hand do outright set the PATH/LD_LIBRARY_PATH to include the bin/lib folders, in this case the l0 check file is able to be compiled. This is likely why you weren't able to reproduce this issue locally, and this is also why we do end up getting the level_zero_devkit feature on the prebuilt run stage.

I'm not entirely sure why we have this difference (perhaps @sarnex or @aelovikov-intel could comment), but I think its important to note that this same issue is not reproducible on HEAD (there it works fine in either case), just this pr.

@ayylol
Copy link
Contributor

ayylol commented Feb 13, 2025

If you'd like to check I made this reproducer which follows what the CI is doing at the build stage (with some minor tweaks)

#!/bin/bash

# Path to working directory
WD=/home
# Path to intel/llvm repo
REPO=/home/llvm

# Configure
env CUDA_LIB_PATH="/usr/local/cuda/lib64/stubs" python3 $REPO/buildbot/configure.py -w $WD -s $REPO -o $WD/build -t Release --ci-defaults --hip --cuda --native_cpu --cmake-opt=-DCMAKE_C_COMPILER_LAUNCHER=ccache --cmake-opt=-DCMAKE_CXX_COMPILER_LAUNCHER=ccache --cmake-opt="-DLLVM_INSTALL_UTILS=ON" --cmake-opt="-DNATIVECPU_USE_OCK=Off" --cmake-opt="-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=SPIRV"

# Compile
env CUDA_LIB_PATH="/usr/local/cuda/lib64/stubs" cmake --build $WD/build/

# Install
cmake --build $WD/build --target deploy-sycl-toolchain
cmake --build $WD/build --target utils/FileCheck/install
cmake --build $WD/build --target utils/count/install
cmake --build $WD/build --target utils/not/install
cmake --build $WD/build --target utils/lit/install
cmake --build $WD/build --target utils/llvm-lit/install
cmake --build $WD/build --target install-llvm-size
cmake --build $WD/build --target install-llvm-cov
cmake --build $WD/build --target install-llvm-profdata
cmake --build $WD/build --target install-compiler-rt

# Copy toolchain
cp -r $WD/build/install $WD/toolchain

# SET PATH -> IF ADDED SOLVES ISSUE
#export PATH=$WD/toolchain/bin:$PATH
#export LD_LIBRARY_PATH=$WD/toolchain/lib:$LD_LIBRARY_PATH

# Configure
cmake -GNinja -B $WD/build-e2e -S $REPO/sycl/test-e2e -DCMAKE_CXX_COMPILER="$WD/toolchain/bin/clang++" -DLLVM_LIT="$REPO/llvm/utils/lit/lit.py"

# RUN TESTS
env LIT_OPTS="--filter=Basic/device.cpp -v --no-progress-bar --show-unsupported --show-pass --show-xfail --max-time 3600 --time-tests --param print_features=True --param test-mode=build-only --param sycl_devices=all --param sycl_build_targets=\"spir;nvidia\"" ninja -C build-e2e check-sycl-e2e

ran this on #16992 locally in the intel_drivers:alldeps container. Executing this with the bit under #SET PATH commented out will fail to build the l0 check file, and removing the comment it should be able to build the check file.

@ayylol ayylol assigned EwanC and unassigned ayylol Feb 13, 2025
@EwanC
Copy link
Contributor Author

EwanC commented Feb 17, 2025

Thanks again for coming up with a reproducer, unfortunately I still don't see this issue locally. I tried it out the git branch for #16984 and ran the Basic/vector/bool.cpp test since it is one of the ones failing in the CI on that PR. However without exporting PATH or LD_LIBRARY_PATH the test passed.

#!/bin/bash
set -x #echo on

WD=$HOME/Development
REPO=$WD/dpcpp

# configure
python3 $REPO/buildbot/configure.py -w $WD -s $REPO -o $WD/dpcpp_build -t Release --ci-defaults --cuda

#compile
cmake --build $WD/dpcpp_build


# Install
cmake --build $WD/dpcpp_build --target deploy-sycl-toolchain
cmake --build $WD/dpcpp_build --target utils/FileCheck/install
cmake --build $WD/dpcpp_build --target utils/count/install
cmake --build $WD/dpcpp_build --target utils/not/install
cmake --build $WD/dpcpp_build --target utils/lit/install
cmake --build $WD/dpcpp_build --target utils/llvm-lit/install
cmake --build $WD/dpcpp_build --target install-llvm-size
cmake --build $WD/dpcpp_build --target install-llvm-cov
cmake --build $WD/dpcpp_build --target install-llvm-profdata
cmake --build $WD/dpcpp_build --target install-compiler-rt

# Copy toolchain
cp -r $WD/build_dpcpp/install $WD/toolchain

# Configure
cmake -GNinja -B $WD/build-e2e -S $REPO/sycl/test-e2e -DCMAKE_CXX_COMPILER="$WD/toolchain/bin/clang++" -DLLVM_LIT="$REPO/llvm/utils/lit/lit.py"

# Run Test
LIT_OPTS="--filter=Basic/vector/bool.cpp -v --no-progress-bar --show-unsupported --show-pass --show-xfail --max-time 3600 --time-tests --param print_features=True --param test-mode=build-only --param sycl_devices=all --param sycl_build_targets=\"spir;nvidia\"" ninja -C build-e2e check-sycl-e2e

# Test passes

If i add -a to the LIT_OPTS, that boils the // RUN: %{build} -o %t.out line down to

$WD/toolchain/bin/clang++  -Werror -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64  $REPO/sycl/test-e2e/Basic/vector/bool.cpp -o $WD/build-e2e/Basic/vector/Output/bool.cpp.tmp.out

Which if I run by itself in the command-line does indeed give me the error "$WS/toolchain/bin/../lib/libsycl.so: undefined reference to " for all the UR entry-points, but if I add LD_LIBRARY_PATH=$WD/toolchain/lib:$LD_LIBRARY_PATH as is done by $REPO/sycl/test-e2e/lit.cgf then it works fine. This is identical behavior to when I try on tip SYCL.

The only way I've managed to reproduce the "undefined reference to 'urEnqueueCommandBufferExp@LIBUR_LOADER_0.12'" error is by intentionally pointing LD_LIBRARY_PATH to a build of the commit prior to the one I'm testing when running $WD/toolchain/bin/clang++. To me that suggests in CI $REPO/sycl/test-e2e/lit.cgf is somehow picking up an incorrect build, but I will do some more investigation tomorrow into that before assigning this back.

EwanC added a commit to Bensuo/unified-runtime that referenced this issue Feb 18, 2025
**Same patch as oneapi-src#2606
which was reverted due to intel/llvm#16982

The API to enqueue a closed command-buffer to a queue is defined in the YAML as a part of the command-buffer class, but it should be part of the enqueue class like other enqueue API extensions.

This PR updates the YAML and regenerates UR code, making the associated changes to adapters and CTS.

Closes oneapi-src#2600
EwanC added a commit to reble/llvm that referenced this issue Feb 18, 2025
**Same PR as was closed in intel#16747
due to intel#16982

Reflects change in name of UR entry-point from
`urCommandBufferEnqueueExp` to `urEnqueueCommandBufferExp` in oneapi-src/unified-runtime#2606
@aelovikov-intel
Copy link
Contributor

before assigning this back.

Nobody is owning the CI, all contributions there are based on the good will.

EwanC added a commit to reble/llvm that referenced this issue Feb 19, 2025
**Same PR as was closed in intel#16747
due to intel#16982

Reflects change in name of UR entry-point from
`urCommandBufferEnqueueExp` to `urEnqueueCommandBufferExp` in oneapi-src/unified-runtime#2606
EwanC added a commit to reble/llvm that referenced this issue Feb 19, 2025
**Same PR as was closed in intel#16747
due to intel#16982

Reflects change in name of UR entry-point from
`urCommandBufferEnqueueExp` to `urEnqueueCommandBufferExp` in oneapi-src/unified-runtime#2606
EwanC added a commit to reble/llvm that referenced this issue Feb 20, 2025
**Same PR as was closed in intel#16747
due to intel#16982

Reflects change in name of UR entry-point from
`urCommandBufferEnqueueExp` to `urEnqueueCommandBufferExp` in oneapi-src/unified-runtime#2606
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants