Matrix tests crashing when running on SPR CPU #17079

Open
aelovikov-intel opened this issue Feb 19, 2025 · 5 comments
Labels
bug Something isn't working confirmed

Comments

@aelovikov-intel
Contributor

aelovikov-intel commented Feb 19, 2025

It turns out we weren't running our E2E tests targeting SPR CPUs in our CI. When we tried to address that, the following failures were uncovered (https://github.com/intel/llvm/actions/runs/13417907780/job/37485887703):

SYCL :: Matrix/SG32/element_wise_abc.cpp
SYCL :: Matrix/SG32/element_wise_all_ops.cpp
SYCL :: Matrix/SG32/element_wise_all_ops_int8.cpp
SYCL :: Matrix/SG32/element_wise_all_ops_int8_packed.cpp
SYCL :: Matrix/SG32/element_wise_all_sizes.cpp
SYCL :: Matrix/SG32/element_wise_ops.cpp
SYCL :: Matrix/SG32/joint_matrix_apply_bf16.cpp
SYCL :: Matrix/SG32/joint_matrix_apply_two_matrices.cpp
SYCL :: Matrix/SG32/joint_matrix_down_convert.cpp
SYCL :: Matrix/element_wise_abc.cpp
SYCL :: Matrix/element_wise_all_ops.cpp
SYCL :: Matrix/element_wise_all_ops_int8.cpp
SYCL :: Matrix/element_wise_all_ops_int8_packed.cpp
SYCL :: Matrix/element_wise_all_sizes.cpp
SYCL :: Matrix/element_wise_ops.cpp
SYCL :: Matrix/elemwise_irreg_size_ops_bf16.cpp
SYCL :: Matrix/get_coordinate_ops.cpp
SYCL :: Matrix/joint_matrix_apply_bf16.cpp
SYCL :: Matrix/joint_matrix_apply_two_matrices.cpp
SYCL :: Matrix/joint_matrix_down_convert.cpp

Also reproduced locally (for at least one test; I haven't verified all of them), with the following error under the debugger:

Thread 1 "element_wise_ab" received signal SIGSEGV, Segmentation fault.
0x00007fffe427c5b9 in SPIRV::SPIRVToLLVMDbgTran::transDebugScope(SPIRV::SPIRVInstruction const*) () from /rdrive/ref/opencl/runtime/linux/oclcpu/2024.18.10.0.08/libintelocl.so
@aelovikov-intel aelovikov-intel added bug Something isn't working confirmed labels Feb 19, 2025
@aelovikov-intel
Contributor Author

@vmaksimo, @MrSidims, FYI.

@dkhaldi
Contributor

dkhaldi commented Feb 19, 2025

@vmaksimo, is this the same as CMPLRLLVM-65270?
If yes, do we expect the next pull down from the SPIRV translator to solve these issues?

@vmaksimo
Contributor

Looks like the same issue mentioned above. I believe it will be fixed with the next pull down from the SPIRV translator.

@MrSidims
Contributor

I'm confused. Let me break down what causes my confusion.

First of all, thanks for enabling SPR CPU testing! Do you plan to use release versions of the CPU runtime for the testing, or dev builds? (From my perspective, just as with IGC dev, the latter is preferable; see below.)

It turns out we weren't running our E2E tests targeting SPR CPUs in our CI.
(https://github.com/intel/llvm/actions/runs/13417907780/job/37485887703)

The link shows PVC results. Guess SPR and PVC are merged into one report, right?

SYCL :: Matrix/SG32/element_wise_abc.cpp
....
SYCL :: Matrix/joint_matrix_down_convert.cpp

I don't expect the pulldown to fix the tests. The reasons are:
a. some of them would fail in https://github.com/intel/llvm anyway, even without the issue reported in CMPLRLLVM-65270, and the only way to fix them is to update the CPU runtime from the 2025.0 version to something newer. With the newer version (and without the issue reported in CMPLRLLVM-65270), all of the tests should pass;
b. when we update the CPU runtime to the newer release version, we may still get the crash (I can't say for sure, as assertions are disabled and without them we might get lucky and not crash), so the pulled-down code has to go through the release cycle before it fixes the issue in intel/llvm.

So a. and b. combined make me vote for using dev builds of the CPU runtime in our CI (or at least having a dedicated job), but previously there was some resistance to using unreleased binaries.

@aelovikov-intel
Contributor Author

aelovikov-intel commented Feb 20, 2025

We don't have plans for a "dev" version of the OCL CPU RT, unless you're willing to do the work :)

Guess SPR and PVC are merged into one report, right?

Correct.
