[L0 v2][CTS] Fix problems reported by SYCL e2e tests #2516

igchor · 2025-01-02T20:49:36Z

No description provided.

github-actions · 2025-01-09T19:18:13Z

Compute Benchmarks level_zero_v2 run (with params: --compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/12696264251

github-actions · 2025-01-09T20:01:38Z

Compute Benchmarks level_zero_v2 run (--compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/12696264251
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group api (9): cannot calculate

Benchmark	This PR	baseline	baseline-v2
api_overhead_benchmark_l0 SubmitKernel out of order	11.515 μs	11.114000 μs	11.515 μs
api_overhead_benchmark_sycl SubmitKernel out of order	21.368000 μs	23.476 μs	21.642 μs
api_overhead_benchmark_sycl SubmitKernel in order	22.319 μs	25.185 μs	22.240000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	1.861 μs	2.109 μs	1.826000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.862 μs	1.667000 μs	1.867 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	94844.000000 instr	101923.000 instr	94854.000 instr
api_overhead_benchmark_ur SubmitKernel out of order	13.891 μs	15.629 μs	13.367000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	94844.000000 instr	107041.000 instr	94854.000 instr
api_overhead_benchmark_ur SubmitKernel in order	13.244000 μs	16.305 μs	13.282 μs
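
The report prints "cannot calculate" for the relative performance of this group, so as a rough sketch the ratios can be worked out by hand from the rows above. This is not the benchmark bot's own logic. The `relative_perf` and `geomean` helpers are hypothetical names introduced here, the sketch assumes lower is better for the μs rows, and summarising the group with a geometric mean is also an assumption.

```python
import math

# Rows copied from the api table above: (name, this_pr, baseline), all in μs.
ROWS = [
    ("api_overhead_benchmark_sycl SubmitKernel out of order", 21.368, 23.476),
    ("api_overhead_benchmark_sycl SubmitKernel in order", 22.319, 25.185),
    ("api_overhead_benchmark_ur SubmitKernel out of order", 13.891, 15.629),
    ("api_overhead_benchmark_ur SubmitKernel in order", 13.244, 16.305),
]


def relative_perf(this_pr, baseline, lower_is_better=True):
    """Return a ratio above 1.0 when this PR beats the baseline."""
    # For latency-style metrics a smaller value wins, so the ratio is inverted.
    return baseline / this_pr if lower_is_better else this_pr / baseline


def geomean(values):
    """Geometric mean, a common way to average ratios (assumed here)."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = {name: relative_perf(pr, base) for name, pr, base in ROWS}
group_score = geomean(ratios.values())

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}x")
print(f"geomean: {group_score:.3f}x")
```

For throughput rows such as GB/s or token/s, pass `lower_is_better=False` so that a larger value counts as an improvement.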

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline	baseline-v2
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	201.071 μs	252.552 μs	200.450000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	84.278 μs	133.161 μs	83.808000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.998 μs	5.545000 μs	6.025 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	2.955 GB/s	3.172000 GB/s	2.930 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
miscellaneous_benchmark_sycl VectorSum	807.632000 bw GB/s	807.892 bw GB/s	858.902 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline	baseline-v2
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	3623.765 μs	6913.511 μs	3606.366000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	8462.477 μs	17276.274 μs	8337.052000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	24989.087000 μs	47969.092 μs	25070.404 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	1060.893000 μs	2028.247 μs	1075.836 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	4446.440000 μs	7352.088 μs	4546.023 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	6422.665000 μs	8675.380 μs	6520.838 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	25219.702 μs	25567.887 μs	24863.849000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1088.372 μs	1171.666 μs	1079.399000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	28923.325000 μs	40328.398 μs	28949.925 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	113063.962 μs	112651.427000 μs	116651.165 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Velocity-Bench Hashtable	380.986 M keys/sec	378.053 M keys/sec	384.547201 M keys/sec
Velocity-Bench Bitcracker	35.196 s	35.222 s	35.177000 s
Velocity-Bench CudaSift	200.653000 ms	202.890 ms	-
Velocity-Bench Easywave	235.000000 ms	244.000 ms	238.000 ms
Velocity-Bench QuickSilver	121.570000 MMS/CTT	118.360 MMS/CTT	121.090 MMS/CTT
Velocity-Bench Sobel Filter	516.810000 ms	533.221 ms	519.769 ms
Velocity-Bench dl-cifar	17.075200 s	23.238 s	17.226 s
Velocity-Bench dl-mnist	2.700 s	2.740 s	2.690000 s
Velocity-Bench svm	-	0.135900 s	-

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Runtime_IndependentDAGTaskThroughput_SingleTask	186.919 ms	266.787 ms	175.848000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	186.263 ms	281.351 ms	182.065000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	191.907 ms	277.904 ms	182.281000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	187.252 ms	278.250 ms	177.759000 ms
Runtime_DAGTaskThroughput_SingleTask	1203.222000 ms	1689.703 ms	1226.450 ms
Runtime_DAGTaskThroughput_BasicParallelFor	1266.166000 ms	1751.814 ms	1280.960 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1272.352 ms	1735.799 ms	1269.362000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	1236.965000 ms	1700.528 ms	1240.886 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.377 ms	4.346000 ms	4.363 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.431000 ms	4.524 ms	4.473 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.487000 ms	4.507 ms	4.538 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	3.711000 ms	4.611 ms	3.741 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.178 ms	618.167 ms	618.119000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.175000 ms	618.207 ms	618.183 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.398 ms	4.288000 ms	4.371 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	4.555 ms	4.543000 ms	4.558 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	4.573 ms	4.535000 ms	4.535 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	3.813 ms	4.666 ms	3.760000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.462 ms	617.486 ms	617.419000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.434 ms	617.482 ms	617.406000 ms
MicroBench_LocalMem_int32_4096	29.905 ms	29.867 ms	29.840000 ms
MicroBench_LocalMem_fp32_4096	29.873 ms	29.866000 ms	29.885 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Pattern_Reduction_NDRange_int32	16.774 ms	16.646000 ms	16.750 ms
Pattern_Reduction_Hierarchical_int32	16.840000 ms	16.999 ms	16.868 ms
Pattern_SegmentedReduction_NDRange_int16	2.250000 ms	2.270 ms	2.250 ms
Pattern_SegmentedReduction_NDRange_int32	2.168 ms	2.170 ms	2.167000 ms
Pattern_SegmentedReduction_NDRange_int64	2.343000 ms	2.350 ms	2.343 ms
Pattern_SegmentedReduction_NDRange_fp32	2.160000 ms	2.178 ms	2.164 ms
Pattern_SegmentedReduction_Hierarchical_int16	11.799 ms	11.804 ms	11.794000 ms
Pattern_SegmentedReduction_Hierarchical_int32	11.601 ms	11.594000 ms	11.600 ms
Pattern_SegmentedReduction_Hierarchical_int64	11.786 ms	11.795 ms	11.784000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	11.605 ms	11.597000 ms	11.601 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
ScalarProduct_NDRange_int32	3.984 ms	3.868000 ms	3.940 ms
ScalarProduct_NDRange_int64	5.520 ms	5.463000 ms	5.521 ms
ScalarProduct_NDRange_fp32	3.833 ms	3.784000 ms	3.935 ms
ScalarProduct_Hierarchical_int32	10.599 ms	10.530000 ms	10.558 ms
ScalarProduct_Hierarchical_int64	11.563 ms	11.483000 ms	11.554 ms
ScalarProduct_Hierarchical_fp32	10.179 ms	10.177 ms	10.160000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline	baseline-v2
USM_Allocation_latency_fp32_host	37.384 ms	37.361000 ms	37.611 ms
USM_Allocation_latency_fp32_shared	0.064000 ms	0.069 ms	0.069 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.353 ms	1.648 ms	1.322000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.056 ms	1.034 ms	1.001000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.588 ms	1.797 ms	1.579000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.214 ms	1.192 ms	1.166000 ms
USM_Allocation_latency_fp32_device	-	0.066000 ms	-

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
VectorAddition_int32	1.492000 ms	1.593 ms	1.658 ms
VectorAddition_int64	3.219 ms	3.135 ms	3.115000 ms
VectorAddition_fp32	1.605 ms	1.559 ms	1.491000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Polybench_2mm	1.217000 ms	1.221 ms	1.225 ms
Polybench_3mm	1.811 ms	1.733000 ms	1.821 ms
Polybench_Atax	6.859 ms	6.822000 ms	6.876 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Kmeans_fp32	16.062 ms	16.048000 ms	16.052 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MolecularDynamics	0.029000 ms	0.031 ms	0.030 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
llama.cpp Prompt Processing Batched 128	852.029760 token/s	791.989 token/s	821.910 token/s
llama.cpp Text Generation Batched 128	65.245587 token/s	62.602 token/s	65.177 token/s
llama.cpp Prompt Processing Batched 256	941.225659 token/s	891.414 token/s	938.009 token/s
llama.cpp Text Generation Batched 256	65.141895 token/s	62.599 token/s	65.125 token/s
llama.cpp Prompt Processing Batched 512	479.955972 token/s	444.416 token/s	476.255 token/s
llama.cpp Text Generation Batched 512	65.214514 token/s	62.639 token/s	65.195 token/s

Relative perf in group alloc/max (20): cannot calculate

Benchmark	This PR	baseline	baseline-v2
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 glibc	2635.290 ns	2464.010000 ns	2639.300 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 glibc	715.883000 ns	724.901 ns	724.206 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 glibc	1270.620 ns	1231.150000 ns	1256.870 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 glibc	748.620 ns	763.166 ns	748.475000 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 glibc	888.314 ns	878.565000 ns	896.579 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 glibc	179.599 ns	176.342 ns	174.623000 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 os_provider	2020.190 ns	1980.090000 ns	1984.420 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 os_provider	186.551000 ns	186.830 ns	189.637 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 os_provider	1906.400 ns	1755.850000 ns	1823.540 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 os_provider	190.687000 ns	192.109 ns	196.367 ns
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	4639.740 ns	4002.640000 ns	4161.600 ns
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	265.629 ns	253.808000 ns	257.051 ns
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	3196.410 ns	2995.180000 ns	3218.080 ns
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	294.990 ns	287.926000 ns	290.105 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	288.779000 ns	297.851 ns	312.990 ns
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	217.055 ns	219.901 ns	215.629000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	262.733 ns	270.144 ns	259.951000 ns
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	207.134000 ns	208.751 ns	212.024 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 scalable_pool<os_provider>	994.775 ns	968.236000 ns	1060.040 ns
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 scalable_pool<os_provider>	974.129 ns	971.817000 ns	990.434 ns

Relative perf in group multiple (12): cannot calculate

Benchmark	This PR	baseline	baseline-v2
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 glibc	32430.300 ns	31242.300000 ns	32028.400 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 glibc	4181.510 ns	4136.860 ns	4128.190000 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 glibc	138061.000 ns	137667.000000 ns	139795.000 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 glibc	29392.400000 ns	32264.100 ns	31031.800 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 proxy_pool<os_provider>	1178880.000 ns	1141810.000000 ns	1152010.000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 proxy_pool<os_provider>	155981.000000 ns	159931.000 ns	158010.000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 os_provider	1193550.000 ns	1160200.000000 ns	1193680.000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 os_provider	139330.000000 ns	140786.000 ns	140147.000 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 scalable_pool<os_provider>	42446.900 ns	42412.100000 ns	42744.500 ns
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 scalable_pool<os_provider>	15140.500 ns	14708.200000 ns	15111.600 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 scalable_pool<os_provider>	71557.400000 ns	73219.900 ns	73555.400 ns
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 scalable_pool<os_provider>	25469.900 ns	28349.100 ns	25335.300000 ns

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
LinearRegressionCoeff_fp32	-	-	687.077000 ms

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00407045 s
bitcracker - total time for whole calculation: 35.196 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1163 1266 31.5775% 1 2