This repository was archived by the owner on May 9, 2024. It is now read-only.

IntelGPUEnablingTest hanging due to L0 #527

Closed
lmontigny opened this issue Jun 14, 2023 · 2 comments

@lmontigny
Contributor

lmontigny commented Jun 14, 2023

Summary
On an Intel PVC GPU, with HDK compiled with L0 (Level Zero) enabled, the ./IntelGPUEnablingTest executable aborts in the first test (JoinTest.SimpleJoin) with an L0 driver error:

[==========] Running 40 tests from 7 test suites.
[----------] Global test environment set-up.
[----------] 1 test from JoinTest
[ RUN      ] JoinTest.SimpleJoin
2023-06-14T09:05:03.074750 F 870547 0 0 Execute.cpp:3346 Error launching the GPU kernel: L0 error: device hung, reset, was removed, or driver update occurred
Aborted

Impact
PVC functional enabling is impacted by this issue.

Reproducer
Connect to Intel OneCloud, reserve a node with 1 PVC GPU
Compile HDK with L0 enabled
Add your user to the render group: sudo usermod -aG render $USER
Run ./IntelGPUEnablingTest
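
Before running the test, it can help to confirm that the current session actually has the `render` group, since group changes made with `usermod` only apply to new login sessions. The following is a minimal sketch, not part of HDK: the helper name `process_in_group` is hypothetical, and it only inspects the groups of the running process.

```python
import grp
import os


def process_in_group(group_name: str) -> bool:
    """Return True if the current process runs with the given group.

    Checks the effective and supplementary group IDs of this process,
    so it reflects the live session rather than /etc/group alone.
    """
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this machine.
        return False
    return gid == os.getegid() or gid in os.getgroups()


if __name__ == "__main__":
    if process_in_group("render"):
        print("render group is active for this session")
    else:
        print("render group is not active; log out and back in after usermod")
```

If the check reports that the group is not active right after `usermod`, starting a new login session is usually enough.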

@lmontigny lmontigny added this to the PVC enabling (functional) milestone Jun 14, 2023
@lmontigny lmontigny self-assigned this Jun 14, 2023
@lmontigny lmontigny changed the title IntelGPUEnabling Test failing due to L0 driver not initialized IntelGPUEnabling hanging due to L0 Jun 14, 2023
@lmontigny lmontigny changed the title IntelGPUEnabling hanging due to L0 IntelGPUEnablingTest hanging due to L0 Jun 14, 2023
@lmontigny
Contributor Author

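With JoinTest.SimpleJoin commented out, the run gets further: the first AggregationTest cases pass, then AggregationTest.SimpleAggregations returns wrong results on the GPU and the run aborts with the same L0 error: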

$:~/hdk/build$ ./omniscidb/Tests/IntelGPUEnablingTest
[==========] Running 39 tests from 6 test suites.
[----------] Global test environment set-up.
[----------] 17 tests from AggregationTest
[ RUN      ] AggregationTest.StandaloneCount
[       OK ] AggregationTest.StandaloneCount (647 ms)
[ RUN      ] AggregationTest.StandaloneCountFilter
[       OK ] AggregationTest.StandaloneCountFilter (137 ms)
[ RUN      ] AggregationTest.StandaloneCountWithProjection
[       OK ] AggregationTest.StandaloneCountWithProjection (16 ms)
[ RUN      ] AggregationTest.ConsequentCount
[       OK ] AggregationTest.ConsequentCount (42 ms)
[ RUN      ] AggregationTest.ConsequentCountWithProjection
[       OK ] AggregationTest.ConsequentCountWithProjection (27 ms)
[ RUN      ] AggregationTest.CountStarAfterCountWithProjection
[       OK ] AggregationTest.CountStarAfterCountWithProjection (35 ms)
[ RUN      ] AggregationTest.CountWithProjectionAfterCountStar
[       OK ] AggregationTest.CountWithProjectionAfterCountStar (25 ms)
[ RUN      ] AggregationTest.StandaloneSum
[       OK ] AggregationTest.StandaloneSum (130 ms)
[ RUN      ] AggregationTest.SimpleAggregations
/home/lmontign2/hdk/omniscidb/Tests/ArrowSQLRunner/SQLiteComparator.cpp:148: Failure
The difference between ref_val and omnisci_val is 2309.699951171875, which exceeds EPS * std::fabs(ref_val), where
ref_val evaluates to 5518.5,
omnisci_val evaluates to 3208.800048828125, and
EPS * std::fabs(ref_val) evaluates to 0.068981250000000008.
GPU: SELECT SUM(ff) FROM test;
/home/lmontign2/hdk/omniscidb/Tests/ArrowSQLRunner/SQLiteComparator.cpp:148: Failure
The difference between ref_val and omnisci_val is 2304.199951171875, which exceeds EPS * std::fabs(ref_val), where
ref_val evaluates to -5507.5,
omnisci_val evaluates to -3203.300048828125, and
EPS * std::fabs(ref_val) evaluates to 0.068843750000000009.
GPU: SELECT SUM(fn) FROM test;
/home/lmontign2/hdk/omniscidb/Tests/ArrowSQLRunner/SQLiteComparator.cpp:100: Failure
Expected equality of these values:
  ref_val
    Which is: 995
  omnisci_val
    Which is: -72057326136638942
GPU: SELECT SUM(x + y) FROM test;
2023-06-21T08:37:24.596505 F 40562 0 0 Execute.cpp:3346 Error launching the GPU kernel: L0 error: device hung, reset, was removed, or driver update occurred
Aborted
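
The two floating-point failures above come from a relative-tolerance comparison of the form `|ref_val - omnisci_val| <= EPS * |ref_val|`. The sketch below replays that check with the values from the log. The value `EPS = 1.25e-5` is an assumption inferred from the log (0.06898125 / 5518.5), not taken from SQLiteComparator.cpp, and the function name is hypothetical.

```python
import math

# Assumption: inferred from the log, where EPS * fabs(5518.5) is 0.06898125.
EPS = 1.25e-5


def within_relative_tolerance(ref_val: float, test_val: float, eps: float = EPS) -> bool:
    """Return True if test_val is within eps * |ref_val| of ref_val."""
    return abs(ref_val - test_val) <= eps * math.fabs(ref_val)


# (query, reference value, GPU value) as reported in the log above.
cases = [
    ("SELECT SUM(ff) FROM test;", 5518.5, 3208.800048828125),
    ("SELECT SUM(fn) FROM test;", -5507.5, -3203.300048828125),
]

for query, ref_val, gpu_val in cases:
    diff = abs(ref_val - gpu_val)
    allowed = EPS * math.fabs(ref_val)
    status = "OK" if within_relative_tolerance(ref_val, gpu_val) else "MISMATCH"
    print(f"{status}: {query} diff={diff} allowed={allowed}")
```

Both differences are several orders of magnitude above the allowed bound, so these are wrong results rather than rounding noise.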

@lmontigny
Contributor Author

lmontigny commented Jun 23, 2023

Progress has been made on the smem branch (#534), which seems to resolve this specific query issue.

We are keeping this issue open because there is still a hang related to sort.
To be re-evaluated once smem is available.
