Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add 3 more benchmarks #2364

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

pbalcer
Copy link
Contributor

@pbalcer pbalcer commented Nov 21, 2024

This patch implements support for:

  • dl-cifar
  • dl-mnist
  • svm

These are all workloads from Velocity Bench that require oneMKL.

@pbalcer pbalcer requested a review from a team as a code owner November 21, 2024 13:50
Copy link

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/11954609872

Copy link

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/11954609872
Job status: success. Test status: success.

Summary

Total 77 benchmarks in mean.
Geomean 100.042%.
Improved 17 Regressed 19 (threshold 0.50%)

(result is better)

Performance change in benchmark groups

Relative perf in group api (6): 94.825%
Benchmark This PR baseline Relative perf Change -
api_overhead_benchmark_ur SubmitKernel out of order 14.229 μs 14.019000 μs 98.52% -1.48% -
api_overhead_benchmark_ur SubmitKernel in order 13.800 μs 13.560000 μs 98.26% -1.74% --
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.683 μs 1.629000 μs 96.79% -3.21% ---
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.485 μs 2.399000 μs 96.54% -3.46% ---
api_overhead_benchmark_sycl SubmitKernel out of order 25.729 μs 23.344000 μs 90.73% -9.27% --------
api_overhead_benchmark_sycl SubmitKernel in order 25.977 μs 23.009000 μs 88.57% -11.43% ----------
Relative perf in group memory (3): 102.717%
Benchmark This PR baseline Relative perf Change -
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 224.757000 μs 250.368 μs 111.39% 11.39% ++++++++++
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 118.634 μs 117.932000 μs 99.41% -0.59% -
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.907 μs 5.781000 μs 97.87% -2.13% --
Relative perf in group miscellaneous (1): 99.890%
Benchmark This PR baseline Relative perf Change -
miscellaneous_benchmark_sycl VectorSum 804.021 μs 803.138000 μs 99.89% -0.11% .
Relative perf in group multithread (8): 99.574%
Benchmark This PR baseline Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 17139.312000 μs 17264.000 μs 100.73% 0.73% +
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7623.315 μs 7614.679000 μs 99.89% -0.11% .
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8404.589 μs 8369.748000 μs 99.59% -0.41% .
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 25036.204 μs 24930.683000 μs 99.58% -0.42% .
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 24943.599 μs 24830.061000 μs 99.54% -0.46% .
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 1052.654 μs 1046.591000 μs 99.42% -0.58% -
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6779.758 μs 6719.944000 μs 99.12% -0.88% -
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1049.081 μs 1035.853000 μs 98.74% -1.26% -
Relative perf in group Runtime (8): 103.058%
Benchmark This PR baseline Relative perf Change -
Runtime_DAGTaskThroughput_NDRangeParallelFor 1676.520000 ms 1782.877 ms 106.34% 6.34% ++++++
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1702.231000 ms 1801.303 ms 105.82% 5.82% +++++
Runtime_DAGTaskThroughput_BasicParallelFor 1731.725000 ms 1827.461 ms 105.53% 5.53% +++++
Runtime_DAGTaskThroughput_SingleTask 1678.318000 ms 1766.731 ms 105.27% 5.27% +++++
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 273.405000 ms 277.443 ms 101.48% 1.48% +
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 273.271000 ms 274.737 ms 100.54% 0.54% .
Runtime_IndependentDAGTaskThroughput_SingleTask 262.387 ms 262.132000 ms 99.90% -0.10% .
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 270.529 ms 270.176000 ms 99.87% -0.13% .
Relative perf in group MicroBench (14): 100.113%
Benchmark This PR baseline Relative perf Change -
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.412000 ms 4.536 ms 102.81% 2.81% ++
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.326000 ms 4.405 ms 101.83% 1.83% ++
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.442000 ms 4.487 ms 101.01% 1.01% +
MicroBench_LocalMem_fp32_4096 29.847000 ms 29.880 ms 100.11% 0.11% .
MicroBench_LocalMem_int32_4096 29.881000 ms 29.899 ms 100.06% 0.06% .
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.363000 ms 4.365 ms 100.05% 0.05% .
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 618.073000 ms 618.194 ms 100.02% 0.02% .
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 618.141000 ms 618.203 ms 100.01% 0.01% .
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 617.434000 ms 617.471 ms 100.01% 0.01% .
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.464 ms 617.439000 ms 100.00% -0.00% .
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.650 ms 4.625000 ms 99.46% -0.54% .
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 4.541 ms 4.504000 ms 99.19% -0.81% -
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 4.552 ms 4.498000 ms 98.81% -1.19% -
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.557 ms 4.480000 ms 98.31% -1.69% -
Relative perf in group Pattern (10): 100.042%
Benchmark This PR baseline Relative perf Change -
Pattern_Reduction_NDRange_int32 16.831000 ms 16.937 ms 100.63% 0.63% +
Pattern_SegmentedReduction_NDRange_fp32 2.165000 ms 2.166 ms 100.05% 0.05% .
Pattern_SegmentedReduction_NDRange_int32 2.171000 ms 2.172 ms 100.05% 0.05% .
Pattern_SegmentedReduction_Hierarchical_int32 11.591000 ms 11.595 ms 100.03% 0.03% .
Pattern_SegmentedReduction_Hierarchical_int64 11.772000 ms 11.776 ms 100.03% 0.03% .
Pattern_SegmentedReduction_Hierarchical_int16 11.802 ms 11.799000 ms 99.97% -0.03% .
Pattern_SegmentedReduction_NDRange_int64 2.348 ms 2.347000 ms 99.96% -0.04% .
Pattern_SegmentedReduction_Hierarchical_fp32 11.595 ms 11.590000 ms 99.96% -0.04% .
Pattern_SegmentedReduction_NDRange_int16 2.271 ms 2.269000 ms 99.91% -0.09% .
Pattern_Reduction_Hierarchical_int32 17.019 ms 16.990000 ms 99.83% -0.17% .
Relative perf in group ScalarProduct (6): 100.044%
Benchmark This PR baseline Relative perf Change -
ScalarProduct_NDRange_int32 3.749000 ms 3.757 ms 100.21% 0.21% .
ScalarProduct_Hierarchical_int32 10.309000 ms 10.320 ms 100.11% 0.11% .
ScalarProduct_Hierarchical_int64 11.299000 ms 11.306 ms 100.06% 0.06% .
ScalarProduct_Hierarchical_fp32 9.955000 ms 9.955 ms 100.00% 0.00% .
ScalarProduct_NDRange_int64 5.443 ms 5.441000 ms 99.96% -0.04% .
ScalarProduct_NDRange_fp32 3.749 ms 3.746000 ms 99.92% -0.08% .
Relative perf in group USM (7): 99.530%
Benchmark This PR baseline Relative perf Change -
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.647000 ms 1.653 ms 100.36% 0.36% .
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.798000 ms 1.800 ms 100.11% 0.11% .
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.043000 ms 1.044 ms 100.10% 0.10% .
USM_Allocation_latency_fp32_shared 0.055000 ms 0.055 ms 100.00% 0.00% .
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.207 ms 1.202000 ms 99.59% -0.41% .
USM_Allocation_latency_fp32_host 37.628 ms 37.435000 ms 99.49% -0.51% .
USM_Allocation_latency_fp32_device 0.069 ms 0.067000 ms 97.10% -2.90% ---
Relative perf in group VectorAddition (3): 99.740%
Benchmark This PR baseline Relative perf Change -
VectorAddition_int32 1.449000 ms 1.452 ms 100.21% 0.21% .
VectorAddition_int64 3.060 ms 3.055000 ms 99.84% -0.16% .
VectorAddition_fp32 1.462 ms 1.450000 ms 99.18% -0.82% -
Relative perf in group Polybench (3): 100.086%
Benchmark This PR baseline Relative perf Change -
Polybench_Atax 6.849000 ms 6.883 ms 100.50% 0.50% .
Polybench_3mm 1.727000 ms 1.730 ms 100.17% 0.17% .
Polybench_2mm 1.218 ms 1.213000 ms 99.59% -0.41% .
Relative perf in group Kmeans (1): 100.012%
Benchmark This PR baseline Relative perf Change -
Kmeans_fp32 16.049000 ms 16.051 ms 100.01% 0.01% .
Relative perf in group LinearRegressionCoeff (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
LinearRegressionCoeff_fp32 856.672000 ms -
Relative perf in group MolecularDynamics (1): 103.846%
Benchmark This PR baseline Relative perf Change -
MolecularDynamics 0.026000 ms 0.027 ms 103.85% 3.85% +++
Relative perf in group llama.cpp (6): 100.719%
Benchmark This PR baseline Relative perf Change -
llama.cpp Prompt Processing Batched 128 843.935068 token/s 815.476 token/s 103.49% 3.49% +++
llama.cpp Text Generation Batched 256 62.833672 token/s 62.255 token/s 100.93% 0.93% +
llama.cpp Text Generation Batched 128 62.752080 token/s 62.270 token/s 100.77% 0.77% +
llama.cpp Text Generation Batched 512 62.786236 token/s 62.368 token/s 100.67% 0.67% +
llama.cpp Prompt Processing Batched 512 457.319 token/s 458.128303 token/s 99.82% -0.18% .
llama.cpp Prompt Processing Batched 256 897.760 token/s 909.695608 token/s 98.69% -1.31% -
Relative perf in group Velocity-Bench (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
Velocity-Bench Hashtable - 379.894103 M keys/sec
Velocity-Bench Bitcracker - 35.221900 s
Velocity-Bench CudaSift - 204.950000 ms
Velocity-Bench Easywave - 237.000000 ms
Velocity-Bench QuickSilver - 118.990000 MMS/CTT
Velocity-Bench Sobel Filter - 534.267000 ms

Details

Benchmark details - environment, command, output...
api_overhead_benchmark_sycl SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),25.729,26.061,6.51%,22.208,436.522,[CPU],[us]

api_overhead_benchmark_sycl SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),25.977,26.068,3.02%,22.695,151.349,[CPU],[us]

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Device destinationPlacement=Device size=1KB count=100),224.757,224.478,1.22%,220.800,443.266,[CPU],[us]

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Host destinationPlacement=Device size=1KB count=100),118.634,119.541,2.83%,110.221,348.061,[CPU],[us]

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueMemcpy(api=sycl sourcePlacement=Device destinationPlacement=Device size=1KB),5.907,5.824,16.10%,5.373,81.974,[CPU],[us]

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Device dst=Device size=1KB ioq=0),2.485,2.481,4.88%,2.179,12.528,[CPU],[us]

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Host dst=Host size=1KB ioq=1),1.683,1.676,3.72%,1.585,6.288,[CPU],[us]

miscellaneous_benchmark_sycl VectorSum

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
VectorSum(api=sycl numberOfElementsX=512 numberOfElementsY=256 numberOfElementsZ=256),804.021,805.306,0.58%,768.657,813.112,[GPU],bw [GB/s]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=1 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),6779.758,6756.541,0.67%,6750.124,6869.389,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=100 NumThreads=8 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),17139.312,17099.524,5.94%,15908.559,19484.881,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=8 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),24943.599,24693.131,2.18%,23728.616,25996.714,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=10 NumThreads=16 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),1052.654,1018.310,43.81%,668.108,14850.761,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=1 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),7623.315,7582.912,1.66%,7459.904,7894.888,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=100 NumThreads=8 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),8404.589,8300.608,3.84%,8186.599,9337.164,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=8 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),25036.204,24788.885,2.10%,23857.110,26183.130,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=10 NumThreads=16 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),1049.081,1016.995,46.54%,691.020,16349.073,[CPU],[us]

api_overhead_benchmark_ur SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=ur Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),14.229,14.203,4.55%,13.732,201.749,[CPU],[us]

api_overhead_benchmark_ur SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=ur Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),13.800,13.695,5.72%,12.916,145.307,[CPU],[us]

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.273299', '0.273271', '0.272456', '0.272456 0.273180 0.273271 0.273401 0.274184', '0.000616', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_IndependentDAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_SingleTask', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.262816', '0.262387', '0.253176', '0.253176 0.260528 0.262387 0.263326 0.274661', '0.007729', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_BasicParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.273175', '0.273405', '0.270476', '0.270476 0.271232 0.273405 0.274709 0.276054', '0.002332', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.276156', '0.270529', '0.270016', '0.270016 0.270392 0.270529 0.271280 0.298563', '0.012534', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_BasicParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.731921', '1.731725', '1.729209', '1.729209 1.730248 1.731725 1.731873 1.736552', '0.002812', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_NDRangeParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.677170', '1.676520', '1.675415', '1.675415 1.675698 1.676520 1.678950 1.679270', '0.001820', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_HierarchicalParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.702001', '1.702231', '1.700349', '1.700349 1.701259 1.702231 1.702753 1.703414', '0.001214', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_SingleTask', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.678307', '1.678318', '1.675849', '1.675849 1.676296 1.678318 1.679879 1.681195', '0.002286', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_H2D_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004369', '0.004326', '0.004289', '0.004289 0.004314 0.004326 0.004455 0.004460', '0.000082', '29.141890', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_D2H_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004667', '0.004650', '0.004629', '0.004629 0.004636 0.004650 0.004706 0.004713', '0.000040', '27.001672', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004672', '0.004363', '0.004301', '0.004301 0.004316 0.004363 0.004385 0.005993', '0.000740', '29.063022', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_H2D_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004543', '0.004552', '0.004483', '0.004483 0.004545 0.004552 0.004564 0.004572', '0.000035', '27.884333', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_D2H_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.617443', '0.617464', '0.617339', '0.617339 0.617459 0.617464 0.617475 0.617479', '0.000059', '0.202482', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.618081', '0.618073', '0.618023', '0.618023 0.618046 0.618073 0.618123 0.618140', '0.000050', '0.202258', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004435', '0.004442', '0.004355', '0.004355 0.004381 0.004442 0.004467 0.004529', '0.000069', '28.703010', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_D2H_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.617447', '0.617434', '0.617427', '0.617427 0.617432 0.617434 0.617465 0.617475', '0.000022', '0.202453', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.618119', '0.618141', '0.617996', '0.617996 0.618111 0.618141 0.618166 0.618182', '0.000074', '0.202267', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004468', '0.004412', '0.004395', '0.004395 0.004399 0.004412 0.004519 0.004615', '0.000097', '28.442006', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004565', '0.004557', '0.004446', '0.004446 0.004520 0.004557 0.004629 0.004672', '0.000089', '28.115179', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_H2D_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004491', '0.004541', '0.004383', '0.004383 0.004391 0.004541 0.004557 0.004584', '0.000096', '28.519806', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_LocalMem_int32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

Output:

['MicroBench_LocalMem_int32_4096', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.029904', '0.029881', '0.029862', '0.029862 0.029878 0.029881 0.029910 0.029988', '0.000050', '10448.042887', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '312.000000']

MicroBench_LocalMem_fp32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

Output:

['MicroBench_LocalMem_fp32_4096', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.029853', '0.029847', '0.029805', '0.029805 0.029811 0.029847 0.029876 0.029927', '0.000050', '10468.030333', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '312.000000']

Pattern_Reduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Output:

['Pattern_Reduction_Hierarchical_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.016963', '0.017019', '0.016736', '0.016736 0.016958 0.017019 0.017023 0.017081', '0.000134', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_Reduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Output:

['Pattern_Reduction_NDRange_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.016828', '0.016831', '0.016604', '0.016604 0.016793 0.016831 0.016950 0.016963', '0.000145', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_NDRange_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.003755', '0.003749', '0.003737', '0.003737 0.003742 0.003749 0.003766 0.003783', '0.000019', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_NDRange_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.003756', '0.003749', '0.003731', '0.003731 0.003746 0.003749 0.003760 0.003795', '0.000024', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_Hierarchical_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.009957', '0.009955', '0.009943', '0.009943 0.009947 0.009955 0.009956 0.009985', '0.000017', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_NDRange_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.005446', '0.005443', '0.005434', '0.005434 0.005443 0.005443 0.005444 0.005465', '0.000011', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_Hierarchical_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.010306', '0.010309', '0.010272', '0.010272 0.010300 0.010309 0.010317 0.010332', '0.000022', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_Hierarchical_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011293', '0.011299', '0.011268', '0.011268 0.011271 0.011299 0.011313 0.011316', '0.000023', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_int16', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011802', '0.011802', '0.011784', '0.011784 0.011798 0.011802 0.011806 0.011820', '0.000013', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002168', '0.002165', '0.002162', '0.002162 0.002164 0.002165 0.002165 0.002183', '0.000008', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_int16', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002275', '0.002271', '0.002268', '0.002268 0.002270 0.002271 0.002275 0.002293', '0.000010', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011785', '0.011772', '0.011757', '0.011757 0.011768 0.011772 0.011782 0.011845', '0.000035', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002173', '0.002171', '0.002170', '0.002170 0.002170 0.002171 0.002172 0.002183', '0.000006', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002349', '0.002348', '0.002341', '0.002341 0.002345 0.002348 0.002353 0.002359', '0.000007', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011594', '0.011591', '0.011574', '0.011574 0.011589 0.011591 0.011595 0.011618', '0.000016', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011593', '0.011595', '0.011572', '0.011572 0.011589 0.011595 0.011597 0.011612', '0.000014', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Allocation_latency_fp32_device

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

Output:

['USM_Allocation_latency_fp32_device', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '1024000000', '0.000064', '0.000069', '0.000046', '0.000046 0.000052 0.000069 0.000069 0.000085', '0.000015', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Allocation_latency_fp32_host

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

Output:

['USM_Allocation_latency_fp32_host', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '1024000000', '0.037514', '0.037628', '0.037325', '0.037325 0.037359 0.037628 0.037629 0.037629', '0.000158', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Allocation_latency_fp32_shared

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

Output:

['USM_Allocation_latency_fp32_shared', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '1024000000', '0.000059', '0.000055', '0.000050', '0.000050 0.000051 0.000055 0.000065 0.000072', '0.000010', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001073', '0.001043', '0.001041', '0.001041 0.001041 0.001043 0.001082 0.001157', '0.000050', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001207', '0.001207', '0.001202', '0.001202 0.001206 0.001207 0.001207 0.001211', '0.000003', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001958', '0.001647', '0.001642', '0.001642 0.001642 0.001647 0.001652 0.003207', '0.000698', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001815', '0.001798', '0.001792', '0.001792 0.001796 0.001798 0.001805 0.001886', '0.000040', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

VectorAddition_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Output:

['VectorAddition_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.001462', '0.001449', '0.001440', '0.001440 0.001449 0.001449 0.001474 0.001500', '0.000025', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

VectorAddition_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Output:

['VectorAddition_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.003064', '0.003060', '0.003053', '0.003053 0.003058 0.003060 0.003068 0.003079', '0.000010', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

VectorAddition_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Output:

['VectorAddition_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.001468', '0.001462', '0.001452', '0.001452 0.001457 0.001462 0.001464 0.001504', '0.000021', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Polybench_2mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/2mm.csv --size=512

Output:

['Polybench_2mm', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.001218', '0.001218', '0.001211', '0.001211 0.001212 0.001218 0.001221 0.001229', '0.000008', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Polybench_3mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/3mm.csv --size=512

Output:

['Polybench_3mm', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.001732', '0.001727', '0.001722', '0.001722 0.001724 0.001727 0.001737 0.001749', '0.000011', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Polybench_Atax

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Atax.csv --size=8192

Output:

['Polybench_Atax', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.006794', '0.006849', '0.006699', '0.006699 0.006702 0.006849 0.006851 0.006871', '0.000086', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Kmeans_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Kmeans.csv --size=700000000

Output:

['Kmeans_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '700000000', '0.016060', '0.016049', '0.016045', '0.016045 0.016047 0.016049 0.016079 0.016080', '0.000018', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

LinearRegressionCoeff_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/lin_reg_coeff --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/LinearRegressionCoeff.csv --size=1638400000

Output:

['LinearRegressionCoeff_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '1638400000', '0.858873', '0.856672', '0.841709', '0.841709 0.841853 0.856672 0.863435 0.890694', '0.020140', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

MolecularDynamics

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/MolecularDynamics.csv --size=8196

Output:

['MolecularDynamics', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8196', '0.000033', '0.000026', '0.000025', '0.000025 0.000026 0.000026 0.000030 0.000058', '0.000014', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

llama.cpp Prompt Processing Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:20Z","606771220","8302129","843.935068","11.369513"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:24Z","2039773824","1434550","62.752080","0.044084"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:34Z","570335949","4425464","897.759541","6.931640"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:38Z","2037124433","620822","62.833672","0.019046"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:48Z","1119576832","3312613","457.318846","1.352793"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:55Z","2038663662","865860","62.786236","0.026669"

llama.cpp Text Generation Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:20Z","606771220","8302129","843.935068","11.369513"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:24Z","2039773824","1434550","62.752080","0.044084"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:34Z","570335949","4425464","897.759541","6.931640"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:38Z","2037124433","620822","62.833672","0.019046"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:48Z","1119576832","3312613","457.318846","1.352793"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:55Z","2038663662","865860","62.786236","0.026669"

llama.cpp Text Generation Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:20Z","606771220","8302129","843.935068","11.369513"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:24Z","2039773824","1434550","62.752080","0.044084"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:34Z","570335949","4425464","897.759541","6.931640"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:38Z","2037124433","620822","62.833672","0.019046"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:48Z","1119576832","3312613","457.318846","1.352793"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:55Z","2038663662","865860","62.786236","0.026669"

llama.cpp Text Generation Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:20Z","606771220","8302129","843.935068","11.369513"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:24Z","2039773824","1434550","62.752080","0.044084"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:34Z","570335949","4425464","897.759541","6.931640"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:38Z","2037124433","620822","62.833672","0.019046"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:48Z","1119576832","3312613","457.318846","1.352793"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:55Z","2038663662","865860","62.786236","0.026669"

llama.cpp Prompt Processing Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:20Z","606771220","8302129","843.935068","11.369513"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:24Z","2039773824","1434550","62.752080","0.044084"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:34Z","570335949","4425464","897.759541","6.931640"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:38Z","2037124433","620822","62.833672","0.019046"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:48Z","1119576832","3312613","457.318846","1.352793"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:55Z","2038663662","865860","62.786236","0.026669"

llama.cpp Prompt Processing Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:20Z","606771220","8302129","843.935068","11.369513"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:24Z","2039773824","1434550","62.752080","0.044084"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:34Z","570335949","4425464","897.759541","6.931640"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:38Z","2037124433","620822","62.833672","0.019046"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:16:48Z","1119576832","3312613","457.318846","1.352793"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:16:55Z","2038663662","865860","62.786236","0.026669"

Copy link

Compute Benchmarks level_zero_v2 run (with params: --compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11954614246

Copy link

Compute Benchmarks level_zero_v2 run (--compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11954614246
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group api (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
api_overhead_benchmark_sycl SubmitKernel out of order 24.240 μs 23.344000 μs 23.745 μs
api_overhead_benchmark_sycl SubmitKernel in order 24.598 μs 23.009000 μs 24.748 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 1.891 μs 2.399 μs 1.868000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.912 μs 1.629000 μs 1.899 μs
api_overhead_benchmark_ur SubmitKernel out of order 12.678000 μs 14.019 μs 15.342 μs
api_overhead_benchmark_ur SubmitKernel in order 15.079 μs 13.560000 μs 15.480 μs
Relative perf in group memory (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 202.074 μs 250.368 μs 196.473000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 85.184000 μs 117.932 μs 88.565 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.854 μs 5.781 μs 5.684000 μs
Relative perf in group miscellaneous (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
miscellaneous_benchmark_sycl VectorSum 803.578 μs 803.138000 μs 804.478 μs
Relative perf in group multithread (8): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 3571.593000 μs 6719.944 μs 3588.421 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 8132.681 μs 17264.000 μs 7922.105000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 25522.220 μs 24830.061000 μs 24977.963 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 1071.074 μs 1046.591000 μs 1063.816 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 4517.869 μs 7614.679 μs 4507.512000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 6416.048 μs 8369.748 μs 6390.880000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 24560.764000 μs 24930.683 μs 25445.122 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1060.457 μs 1035.853000 μs 1068.329 μs
Relative perf in group Runtime (8): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 186.652 ms 277.443 ms 179.923000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 190.006 ms 274.737 ms 185.023000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 186.364 ms 270.176 ms 182.648000 ms
Runtime_IndependentDAGTaskThroughput_SingleTask 181.620 ms 262.132 ms 175.188000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor 1301.018 ms 1782.877 ms 1284.252000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1323.247 ms 1801.303 ms 1313.605000 ms
Runtime_DAGTaskThroughput_BasicParallelFor 1346.561 ms 1827.461 ms 1318.343000 ms
Runtime_DAGTaskThroughput_SingleTask 1276.034 ms 1766.731 ms 1249.041000 ms
Relative perf in group MicroBench (14): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.417 ms 4.405000 ms 4.476 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.458 ms 4.480 ms 4.446000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.486 ms 4.487 ms 4.461000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 4.483000 ms 4.498 ms 4.528 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 3.677000 ms 4.536 ms 3.716 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 3.707000 ms 4.625 ms 3.753 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 4.470000 ms 4.504 ms 4.547 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 617.413 ms 617.471 ms 617.302000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 618.144 ms 618.194 ms 618.005000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.438 ms 617.439 ms 617.296000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.351 ms 4.365 ms 4.300000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 618.130 ms 618.203 ms 617.994000 ms
MicroBench_LocalMem_int32_4096 29.879 ms 29.899 ms 29.853000 ms
MicroBench_LocalMem_fp32_4096 29.857000 ms 29.880 ms 29.858 ms
Relative perf in group Pattern (10): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Pattern_Reduction_NDRange_int32 16.825 ms 16.937 ms 16.718000 ms
Pattern_Reduction_Hierarchical_int32 17.079 ms 16.990 ms 16.906000 ms
Pattern_SegmentedReduction_NDRange_int32 2.166000 ms 2.172 ms 2.170 ms
Pattern_SegmentedReduction_Hierarchical_int16 11.793000 ms 11.799 ms 11.799 ms
Pattern_SegmentedReduction_NDRange_int64 2.344000 ms 2.347 ms 2.351 ms
Pattern_SegmentedReduction_NDRange_fp32 2.161000 ms 2.166 ms 2.164 ms
Pattern_SegmentedReduction_Hierarchical_fp32 11.595 ms 11.590000 ms 11.595 ms
Pattern_SegmentedReduction_NDRange_int16 2.252000 ms 2.269 ms 2.253 ms
Pattern_SegmentedReduction_Hierarchical_int32 11.598 ms 11.595000 ms 11.605 ms
Pattern_SegmentedReduction_Hierarchical_int64 11.777 ms 11.776 ms 11.775000 ms
Relative perf in group ScalarProduct (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
ScalarProduct_Hierarchical_int32 10.320 ms 10.320 ms 10.313000 ms
ScalarProduct_NDRange_fp32 3.798 ms 3.746000 ms 3.797 ms
ScalarProduct_NDRange_int64 5.497 ms 5.441000 ms 5.499 ms
ScalarProduct_Hierarchical_fp32 9.961 ms 9.955000 ms 9.961 ms
ScalarProduct_Hierarchical_int64 11.362 ms 11.306000 ms 11.356 ms
ScalarProduct_NDRange_int32 3.809 ms 3.757000 ms 3.811 ms
Relative perf in group USM (7): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
USM_Allocation_latency_fp32_shared 0.065 ms 0.055000 ms 0.062 ms
USM_Allocation_latency_fp32_host 37.388000 ms 37.435 ms 37.568 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.603 ms 1.800 ms 1.543000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.012 ms 1.044 ms 0.979000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.316 ms 1.653 ms 1.298000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.168 ms 1.202 ms 1.140000 ms
USM_Allocation_latency_fp32_device - 0.067000 ms -
Relative perf in group VectorAddition (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
VectorAddition_int32 1.495 ms 1.452000 ms 1.494 ms
VectorAddition_fp32 1.494 ms 1.450000 ms 1.495 ms
VectorAddition_int64 3.113 ms 3.055000 ms 3.119 ms
Relative perf in group Polybench (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Polybench_2mm 1.212000 ms 1.213 ms 1.214 ms
Polybench_3mm 1.812 ms 1.730000 ms 1.813 ms
Polybench_Atax 6.880 ms 6.883 ms 6.860000 ms
Relative perf in group Kmeans (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Kmeans_fp32 16.050000 ms 16.051 ms 16.054 ms
Relative perf in group LinearRegressionCoeff (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
LinearRegressionCoeff_fp32 686.779000 ms - -
Relative perf in group MolecularDynamics (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
MolecularDynamics 0.027000 ms 0.027 ms 0.027 ms
Relative perf in group llama.cpp (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
llama.cpp Text Generation Batched 512 65.231763 token/s 62.368 token/s 65.191 token/s
llama.cpp Text Generation Batched 256 65.210 token/s 62.255 token/s 65.269429 token/s
llama.cpp Prompt Processing Batched 256 944.202 token/s 909.696 token/s 949.465825 token/s
llama.cpp Text Generation Batched 128 65.391429 token/s 62.270 token/s 65.291 token/s
llama.cpp Prompt Processing Batched 128 876.617762 token/s 815.476 token/s 864.649 token/s
llama.cpp Prompt Processing Batched 512 487.997363 token/s 458.128 token/s 486.573 token/s
Relative perf in group Velocity-Bench (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Velocity-Bench Hashtable - 379.894 M keys/sec 383.152085 M keys/sec
Velocity-Bench Bitcracker - 35.222 s 35.193300 s
Velocity-Bench CudaSift - 204.950 ms 202.081000 ms
Velocity-Bench Easywave - 237.000 ms 234.000000 ms
Velocity-Bench QuickSilver - 118.990 MMS/CTT 121.320000 MMS/CTT
Velocity-Bench Sobel Filter - 534.267 ms 516.832000 ms

Details

Benchmark details - environment, command, output...
api_overhead_benchmark_sycl SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),24.240,24.223,3.97%,23.002,300.663,[CPU],[us]

api_overhead_benchmark_sycl SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),24.598,24.586,3.60%,23.319,276.888,[CPU],[us]

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Device destinationPlacement=Device size=1KB count=100),202.074,202.423,2.09%,194.523,471.090,[CPU],[us]

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Host destinationPlacement=Device size=1KB count=100),85.184,85.174,0.85%,83.319,126.588,[CPU],[us]

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueMemcpy(api=sycl sourcePlacement=Device destinationPlacement=Device size=1KB),5.854,5.995,13.98%,4.591,53.273,[CPU],[us]

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Device dst=Device size=1KB ioq=0),1.891,1.861,8.92%,1.672,25.918,[CPU],[us]

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Host dst=Host size=1KB ioq=1),1.912,1.879,11.65%,1.685,56.603,[CPU],[us]

miscellaneous_benchmark_sycl VectorSum

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
VectorSum(api=sycl numberOfElementsX=512 numberOfElementsY=256 numberOfElementsZ=256),803.578,804.791,0.55%,770.540,812.325,[GPU],bw [GB/s]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=1 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),3571.593,3571.696,0.11%,3566.429,3579.532,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=100 NumThreads=8 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),8132.681,8110.627,3.47%,7490.372,8540.392,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=8 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),25522.220,25307.099,5.09%,22609.408,31509.932,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=1 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=10 NumThreads=16 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=1 DstUSM=1),1071.074,1061.155,7.03%,874.521,1834.429,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=1 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),4517.869,4511.490,1.34%,4415.599,4652.838,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=100 NumThreads=8 AllocSize=102400 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),6416.048,6430.264,4.59%,5941.959,6914.902,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=400 NumThreads=8 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),24560.764,24414.588,4.87%,21958.582,29399.667,[CPU],[us]

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=0 --DstUSM=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
MemcpyExecute(api=ur Ioq=1 NumOpsPerThread=10 NumThreads=16 AllocSize=1024 MeasureCompletion=1 UseEvents=1 UseQueuePerThread=1 SrcUSM=0 DstUSM=1),1060.457,1050.233,7.13%,851.246,1791.188,[CPU],[us]

api_overhead_benchmark_ur SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=ur Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),12.678,12.543,4.41%,11.960,26.173,[CPU],[us]

api_overhead_benchmark_ur SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=ur Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),15.079,15.064,5.38%,14.173,251.281,[CPU],[us]

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_BasicParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.190922', '0.186652', '0.182696', '0.182696 0.183532 0.186652 0.186689 0.215041', '0.013603', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.197491', '0.190006', '0.189277', '0.189277 0.189470 0.190006 0.190348 0.228352', '0.017257', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.186533', '0.186364', '0.185129', '0.185129 0.185568 0.186364 0.187079 0.188525', '0.001341', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_IndependentDAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Output:

['Runtime_IndependentDAGTaskThroughput_SingleTask', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '32768', '0.190598', '0.181620', '0.176535', '0.176535 0.177970 0.181620 0.183571 0.233294', '0.024032', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_NDRangeParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.300368', '1.301018', '1.294351', '1.294351 1.296758 1.301018 1.304569 1.305147', '0.004747', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_HierarchicalParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.325413', '1.323247', '1.314688', '1.314688 1.321380 1.323247 1.326673 1.341077', '0.009785', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_BasicParallelFor', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.346098', '1.346561', '1.339927', '1.339927 1.344470 1.346561 1.348534 1.350998', '0.004210', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Runtime_DAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Output:

['Runtime_DAGTaskThroughput_SingleTask', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '327680', '1.277738', '1.276034', '1.270503', '1.270503 1.275233 1.276034 1.276356 1.290565', '0.007549', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_H2D_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004349', '0.004417', '0.004218', '0.004218 0.004246 0.004417 0.004427 0.004438', '0.000108', '29.634238', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004445', '0.004458', '0.004377', '0.004377 0.004451 0.004458 0.004464 0.004474', '0.000039', '28.561192', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004461', '0.004486', '0.004349', '0.004349 0.004446 0.004486 0.004503 0.004523', '0.000069', '28.743700', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_H2D_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004509', '0.004483', '0.004473', '0.004473 0.004482 0.004483 0.004508 0.004601', '0.000053', '27.943058', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.003671', '0.003677', '0.003621', '0.003621 0.003626 0.003677 0.003709 0.003723', '0.000047', '34.518439', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_D2H_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.003705', '0.003707', '0.003592', '0.003592 0.003634 0.003707 0.003792 0.003803', '0.000094', '34.803905', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_H2D_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004467', '0.004470', '0.004416', '0.004416 0.004463 0.004470 0.004474 0.004513', '0.000035', '28.308121', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_D2H_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.617413', '0.617413', '0.617393', '0.617393 0.617401 0.617413 0.617423 0.617434', '0.000016', '0.202464', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.618141', '0.618144', '0.618104', '0.618104 0.618122 0.618144 0.618161 0.618173', '0.000028', '0.202231', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_D2H_Strided', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.617436', '0.617438', '0.617385', '0.617385 0.617401 0.617438 0.617449 0.617506', '0.000047', '0.202467', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.004657', '0.004351', '0.004252', '0.004252 0.004349 0.004351 0.004369 0.005963', '0.000732', '29.396064', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

Output:

['MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.618138', '0.618130', '0.618120', '0.618120 0.618124 0.618130 0.618134 0.618184', '0.000026', '0.202226', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '0.125000']

MicroBench_LocalMem_int32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

Output:

['MicroBench_LocalMem_int32_4096', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.029885', '0.029879', '0.029780', '0.029780 0.029863 0.029879 0.029933 0.029968', '0.000072', '10476.654538', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '312.000000']

MicroBench_LocalMem_fp32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

Output:

['MicroBench_LocalMem_fp32_4096', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.029860', '0.029857', '0.029831', '0.029831 0.029849 0.029857 0.029879 0.029882', '0.000021', '10459.077401', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '312.000000']

Pattern_Reduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Output:

['Pattern_Reduction_NDRange_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.016856', '0.016825', '0.016701', '0.016701 0.016774 0.016825 0.016940 0.017040', '0.000135', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_Reduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Output:

['Pattern_Reduction_Hierarchical_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '10240000', '0.017090', '0.017079', '0.017030', '0.017030 0.017078 0.017079 0.017117 0.017146', '0.000044', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_Hierarchical_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.010321', '0.010320', '0.010312', '0.010312 0.010315 0.010320 0.010321 0.010337', '0.000009', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_NDRange_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.003809', '0.003798', '0.003792', '0.003792 0.003796 0.003798 0.003815 0.003842', '0.000021', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_NDRange_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.005499', '0.005497', '0.005485', '0.005485 0.005489 0.005497 0.005500 0.005526', '0.000016', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_Hierarchical_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.009970', '0.009961', '0.009926', '0.009926 0.009955 0.009961 0.009999 0.010007', '0.000033', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_Hierarchical_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011349', '0.011362', '0.011313', '0.011313 0.011317 0.011362 0.011364 0.011387', '0.000032', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

ScalarProduct_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Output:

['ScalarProduct_NDRange_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.003813', '0.003809', '0.003805', '0.003805 0.003808 0.003809 0.003814 0.003831', '0.000010', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002166', '0.002166', '0.002164', '0.002164 0.002165 0.002166 0.002166 0.002171', '0.000003', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_int16', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011793', '0.011793', '0.011783', '0.011783 0.011793 0.011793 0.011798 0.011800', '0.000007', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002346', '0.002344', '0.002341', '0.002341 0.002343 0.002344 0.002349 0.002355', '0.000006', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002163', '0.002161', '0.002161', '0.002161 0.002161 0.002161 0.002163 0.002169', '0.000004', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011595', '0.011595', '0.011588', '0.011588 0.011594 0.011595 0.011597 0.011602', '0.000005', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_NDRange_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_NDRange_int16', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.002255', '0.002252', '0.002251', '0.002251 0.002251 0.002252 0.002257 0.002264', '0.000005', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011599', '0.011598', '0.011591', '0.011591 0.011592 0.011598 0.011605 0.011610', '0.000008', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Pattern_SegmentedReduction_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Output:

['Pattern_SegmentedReduction_Hierarchical_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.011778', '0.011777', '0.011755', '0.011755 0.011761 0.011777 0.011777 0.011818', '0.000025', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Allocation_latency_fp32_shared

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

Output:

['USM_Allocation_latency_fp32_shared', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '1024000000', '0.000067', '0.000065', '0.000063', '0.000063 0.000064 0.000065 0.000071 0.000073', '0.000005', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Allocation_latency_fp32_host

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

Output:

['USM_Allocation_latency_fp32_host', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '1024000000', '0.037373', '0.037388', '0.037308', '0.037308 0.037354 0.037388 0.037391 0.037423', '0.000044', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001613', '0.001603', '0.001600', '0.001600 0.001602 0.001603 0.001608 0.001651', '0.000022', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001027', '0.001012', '0.001005', '0.001005 0.001007 0.001012 0.001049 0.001064', '0.000027', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001621', '0.001316', '0.001307', '0.001307 0.001310 0.001316 0.001325 0.002846', '0.000685', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

Output:

['USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.001169', '0.001168', '0.001157', '0.001157 0.001167 0.001168 0.001173 0.001178', '0.000008', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

VectorAddition_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Output:

['VectorAddition_int32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.001509', '0.001495', '0.001488', '0.001488 0.001491 0.001495 0.001519 0.001554', '0.000028', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

VectorAddition_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Output:

['VectorAddition_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.001508', '0.001494', '0.001487', '0.001487 0.001487 0.001494 0.001525 0.001548', '0.000027', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

VectorAddition_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Output:

['VectorAddition_int64', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '102400000', '0.003114', '0.003113', '0.003104', '0.003104 0.003107 0.003113 0.003119 0.003126', '0.000009', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Polybench_2mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/2mm.csv --size=512

Output:

['Polybench_2mm', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.001214', '0.001212', '0.001204', '0.001204 0.001209 0.001212 0.001221 0.001225', '0.000009', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Polybench_3mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/3mm.csv --size=512

Output:

['Polybench_3mm', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '512', '0.001815', '0.001812', '0.001804', '0.001804 0.001810 0.001812 0.001818 0.001834', '0.000011', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Polybench_Atax

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Atax.csv --size=8192

Output:

['Polybench_Atax', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8192', '0.006866', '0.006880', '0.006814', '0.006814 0.006843 0.006880 0.006882 0.006909', '0.000037', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

Kmeans_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/Kmeans.csv --size=700000000

Output:

['Kmeans_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '700000000', '0.016052', '0.016050', '0.016045', '0.016045 0.016047 0.016050 0.016056 0.016061', '0.000007', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

LinearRegressionCoeff_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/lin_reg_coeff --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/LinearRegressionCoeff.csv --size=1638400000

Output:

['LinearRegressionCoeff_fp32', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '1638400000', '0.691469', '0.686779', '0.686624', '0.686624 0.686706 0.686779 0.686985 0.710250', '0.010500', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

MolecularDynamics

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=5 --output=/home/pmdk/bench_workdir/MolecularDynamics.csv --size=8196

Output:

['MolecularDynamics', 'PASS', 'Intel(R) Data Center GPU Max 1100', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '256', '8196', '0.000033', '0.000027', '0.000025', '0.000025 0.000027 0.000027 0.000031 0.000054', '0.000012', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'LLVM (Intel DPC++)', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A']

llama.cpp Text Generation Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:41:50Z","584067375","1790850","876.617762","2.678036"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:41:54Z","1957444244","1687494","65.391429","0.056276"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:04Z","542257330","430602","944.201634","0.748458"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:07Z","1962898378","983654","65.209706","0.032628"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:17Z","1049216261","6308709","487.997363","2.926289"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:24Z","1962236311","2241898","65.231763","0.074527"

llama.cpp Text Generation Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:41:50Z","584067375","1790850","876.617762","2.678036"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:41:54Z","1957444244","1687494","65.391429","0.056276"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:04Z","542257330","430602","944.201634","0.748458"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:07Z","1962898378","983654","65.209706","0.032628"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:17Z","1049216261","6308709","487.997363","2.926289"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:24Z","1962236311","2241898","65.231763","0.074527"

llama.cpp Prompt Processing Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:41:50Z","584067375","1790850","876.617762","2.678036"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:41:54Z","1957444244","1687494","65.391429","0.056276"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:04Z","542257330","430602","944.201634","0.748458"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:07Z","1962898378","983654","65.209706","0.032628"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:17Z","1049216261","6308709","487.997363","2.926289"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:24Z","1962236311","2241898","65.231763","0.074527"

llama.cpp Text Generation Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:41:50Z","584067375","1790850","876.617762","2.678036"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:41:54Z","1957444244","1687494","65.391429","0.056276"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:04Z","542257330","430602","944.201634","0.748458"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:07Z","1962898378","983654","65.209706","0.032628"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:17Z","1049216261","6308709","487.997363","2.926289"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:24Z","1962236311","2241898","65.231763","0.074527"

llama.cpp Prompt Processing Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:41:50Z","584067375","1790850","876.617762","2.678036"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:41:54Z","1957444244","1687494","65.391429","0.056276"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:04Z","542257330","430602","944.201634","0.748458"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:07Z","1962898378","983654","65.209706","0.032628"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:17Z","1049216261","6308709","487.997363","2.926289"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:24Z","1962236311","2241898","65.231763","0.074527"

llama.cpp Prompt Processing Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Output:

build_commit,build_number,cuda,vulkan,kompute,metal,sycl,rpc,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:41:50Z","584067375","1790850","876.617762","2.678036"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","128","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:41:54Z","1957444244","1687494","65.391429","0.056276"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:04Z","542257330","430602","944.201634","0.748458"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","256","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:07Z","1962898378","983654","65.209706","0.032628"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","2024-11-21T14:42:17Z","1049216261","6308709","487.997363","2.926289"
"1ee9eea0","4073","0","0","0","0","1","0","1","1","INTEL(R) XEON(R) PLATINUM 8580","Intel(R) Data Center GPU Max 1100","/home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf","phi3 3B Q4_K - Medium","2392493568","3821079552","512","512","56","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","2024-11-21T14:42:24Z","1962236311","2241898","65.231763","0.074527"

Copy link

Compute Benchmarks level_zero run (with params: --filter Velocity):
https://github.com/oneapi-src/unified-runtime/actions/runs/11955206272

Copy link

Compute Benchmarks level_zero run (--filter Velocity):
https://github.com/oneapi-src/unified-runtime/actions/runs/11955206272
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group Velocity-Bench (7): cannot calculate
Benchmark This PR baseline Relative perf Change -
Velocity-Bench svm 0.133400 s -
Velocity-Bench Hashtable - 379.894103 M keys/sec
Velocity-Bench Bitcracker - 35.221900 s
Velocity-Bench CudaSift - 204.950000 ms
Velocity-Bench Easywave - 237.000000 ms
Velocity-Bench QuickSilver - 118.990000 MMS/CTT
Velocity-Bench Sobel Filter - 534.267000 ms
Relative perf in group api (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
api_overhead_benchmark_sycl SubmitKernel out of order - 23.344000 μs
api_overhead_benchmark_sycl SubmitKernel in order - 23.009000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 - 2.399000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 - 1.629000 μs
api_overhead_benchmark_ur SubmitKernel out of order - 14.019000 μs
api_overhead_benchmark_ur SubmitKernel in order - 13.560000 μs
Relative perf in group memory (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 - 250.368000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 - 117.932000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 - 5.781000 μs
Relative perf in group miscellaneous (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
miscellaneous_benchmark_sycl VectorSum - 803.138000 μs
Relative perf in group multithread (8): cannot calculate
Benchmark This PR baseline Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 - 6719.944000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 - 17264.000000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 - 24830.061000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 - 1046.591000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 - 7614.679000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 - 8369.748000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 - 24930.683000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 - 1035.853000 μs
Relative perf in group Runtime (8): cannot calculate
Benchmark This PR baseline Relative perf Change -
Runtime_IndependentDAGTaskThroughput_SingleTask - 262.132000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor - 274.737000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor - 270.176000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor - 277.443000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor - 1801.303000 ms
Runtime_DAGTaskThroughput_BasicParallelFor - 1827.461000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor - 1782.877000 ms
Runtime_DAGTaskThroughput_SingleTask - 1766.731000 ms
Relative perf in group MicroBench (14): cannot calculate
Benchmark This PR baseline Relative perf Change -
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous - 4.487000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided - 4.498000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided - 4.504000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous - 4.536000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided - 4.625000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous - 618.194000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided - 4.405000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous - 4.480000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous - 618.203000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous - 4.365000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided - 617.471000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided - 617.439000 ms
MicroBench_LocalMem_fp32_4096 - 29.880000 ms
MicroBench_LocalMem_int32_4096 - 29.899000 ms
Relative perf in group Pattern (10): cannot calculate
Benchmark This PR baseline Relative perf Change -
Pattern_Reduction_NDRange_int32 - 16.937000 ms
Pattern_Reduction_Hierarchical_int32 - 16.990000 ms
Pattern_SegmentedReduction_NDRange_int64 - 2.347000 ms
Pattern_SegmentedReduction_NDRange_int16 - 2.269000 ms
Pattern_SegmentedReduction_NDRange_fp32 - 2.166000 ms
Pattern_SegmentedReduction_Hierarchical_int64 - 11.776000 ms
Pattern_SegmentedReduction_Hierarchical_int32 - 11.595000 ms
Pattern_SegmentedReduction_Hierarchical_int16 - 11.799000 ms
Pattern_SegmentedReduction_NDRange_int32 - 2.172000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 - 11.590000 ms
Relative perf in group ScalarProduct (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
ScalarProduct_Hierarchical_int64 - 11.306000 ms
ScalarProduct_NDRange_fp32 - 3.746000 ms
ScalarProduct_NDRange_int32 - 3.757000 ms
ScalarProduct_Hierarchical_int32 - 10.320000 ms
ScalarProduct_Hierarchical_fp32 - 9.955000 ms
ScalarProduct_NDRange_int64 - 5.441000 ms
Relative perf in group USM (7): cannot calculate
Benchmark This PR baseline Relative perf Change -
USM_Allocation_latency_fp32_device - 0.067000 ms
USM_Allocation_latency_fp32_host - 37.435000 ms
USM_Allocation_latency_fp32_shared - 0.055000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch - 1.653000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch - 1.800000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch - 1.202000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch - 1.044000 ms
Relative perf in group VectorAddition (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
VectorAddition_fp32 - 1.450000 ms
VectorAddition_int64 - 3.055000 ms
VectorAddition_int32 - 1.452000 ms
Relative perf in group Polybench (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
Polybench_2mm - 1.213000 ms
Polybench_3mm - 1.730000 ms
Polybench_Atax - 6.883000 ms
Relative perf in group Kmeans (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
Kmeans_fp32 - 16.051000 ms
Relative perf in group MolecularDynamics (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
MolecularDynamics - 0.027000 ms
Relative perf in group llama.cpp (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
llama.cpp Text Generation Batched 128 - 62.269711 token/s
llama.cpp Prompt Processing Batched 256 - 909.695608 token/s
llama.cpp Prompt Processing Batched 128 - 815.475889 token/s
llama.cpp Text Generation Batched 512 - 62.367806 token/s
llama.cpp Prompt Processing Batched 512 - 458.128303 token/s
llama.cpp Text Generation Batched 256 - 62.255402 token/s

Details

Benchmark details - environment, command, output...
Velocity-Bench svm

Environment Variables:

Command:

/home/pmdk/bench_workdir/svm/svm_sycl /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Output:

Number of args 3
Using cuSVM (Carpenter)...

Buffering input text file (6989624 B).
Load Done
Starting Training
_C 1.000000
Workgroup Size: 1024
nbrCtas 80
elemsPerCta 1248
threadsPerCta 128
Total run time: 0.063683 seconds
Iter:100
M:97683
N:123
Train done. Calulate Vector counts
Training done

Loading elapsed time : 0.0621 s
Processing elapsed time : 0.0688 s
Storing elapsed time : 0.0024 s
Total elapsed time : 0.1334 s
Result's are correct: 0.0551

Copy link

Compute Benchmarks level_zero_v2 run (with params: --compare baseline-v2 --filter Velocity):
https://github.com/oneapi-src/unified-runtime/actions/runs/11955214043

Copy link

Compute Benchmarks level_zero_v2 run (--compare baseline-v2 --filter Velocity):
https://github.com/oneapi-src/unified-runtime/actions/runs/11955214043
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group api (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
api_overhead_benchmark_sycl SubmitKernel out of order - 23.344000 μs 23.745 μs
api_overhead_benchmark_sycl SubmitKernel in order - 23.009000 μs 24.748 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 - 2.399 μs 1.868000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 - 1.629000 μs 1.899 μs
api_overhead_benchmark_ur SubmitKernel out of order - 14.019000 μs 15.342 μs
api_overhead_benchmark_ur SubmitKernel in order - 13.560000 μs 15.480 μs
Relative perf in group memory (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 - 250.368 μs 196.473000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 - 117.932 μs 88.565000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 - 5.781 μs 5.684000 μs
Relative perf in group miscellaneous (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
miscellaneous_benchmark_sycl VectorSum - 803.138000 μs 804.478 μs
Relative perf in group multithread (8): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 - 6719.944 μs 3588.421000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 - 17264.000 μs 7922.105000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 - 24830.061000 μs 24977.963 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 - 1046.591000 μs 1063.816 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 - 7614.679 μs 4507.512000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 - 8369.748 μs 6390.880000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 - 24930.683000 μs 25445.122 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 - 1035.853000 μs 1068.329 μs
Relative perf in group Velocity-Bench (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Velocity-Bench Hashtable - 379.894 M keys/sec 383.152085 M keys/sec
Velocity-Bench Bitcracker - 35.222 s 35.193300 s
Velocity-Bench CudaSift - 204.950 ms 202.081000 ms
Velocity-Bench Easywave - 237.000 ms 234.000000 ms
Velocity-Bench QuickSilver - 118.990 MMS/CTT 121.320000 MMS/CTT
Velocity-Bench Sobel Filter - 534.267 ms 516.832000 ms
Relative perf in group Runtime (8): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Runtime_IndependentDAGTaskThroughput_SingleTask - 262.132 ms 175.188000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor - 274.737 ms 185.023000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor - 270.176 ms 182.648000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor - 277.443 ms 179.923000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor - 1801.303 ms 1313.605000 ms
Runtime_DAGTaskThroughput_BasicParallelFor - 1827.461 ms 1318.343000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor - 1782.877 ms 1284.252000 ms
Runtime_DAGTaskThroughput_SingleTask - 1766.731 ms 1249.041000 ms
Relative perf in group MicroBench (14): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous - 4.487 ms 4.461000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided - 4.498000 ms 4.528 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided - 4.504000 ms 4.547 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous - 4.536 ms 3.716000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided - 4.625 ms 3.753000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous - 618.194 ms 618.005000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided - 4.405000 ms 4.476 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous - 4.480 ms 4.446000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous - 618.203 ms 617.994000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous - 4.365 ms 4.300000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided - 617.471 ms 617.302000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided - 617.439 ms 617.296000 ms
MicroBench_LocalMem_fp32_4096 - 29.880 ms 29.858000 ms
MicroBench_LocalMem_int32_4096 - 29.899 ms 29.853000 ms
Relative perf in group Pattern (10): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Pattern_Reduction_NDRange_int32 - 16.937 ms 16.718000 ms
Pattern_Reduction_Hierarchical_int32 - 16.990 ms 16.906000 ms
Pattern_SegmentedReduction_NDRange_int64 - 2.347000 ms 2.351 ms
Pattern_SegmentedReduction_NDRange_int16 - 2.269 ms 2.253000 ms
Pattern_SegmentedReduction_NDRange_fp32 - 2.166 ms 2.164000 ms
Pattern_SegmentedReduction_Hierarchical_int64 - 11.776 ms 11.775000 ms
Pattern_SegmentedReduction_Hierarchical_int32 - 11.595000 ms 11.605 ms
Pattern_SegmentedReduction_Hierarchical_int16 - 11.799000 ms 11.799 ms
Pattern_SegmentedReduction_NDRange_int32 - 2.172 ms 2.170000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 - 11.590000 ms 11.595 ms
Relative perf in group ScalarProduct (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
ScalarProduct_Hierarchical_int64 - 11.306000 ms 11.356 ms
ScalarProduct_NDRange_fp32 - 3.746000 ms 3.797 ms
ScalarProduct_NDRange_int32 - 3.757000 ms 3.811 ms
ScalarProduct_Hierarchical_int32 - 10.320 ms 10.313000 ms
ScalarProduct_Hierarchical_fp32 - 9.955000 ms 9.961 ms
ScalarProduct_NDRange_int64 - 5.441000 ms 5.499 ms
Relative perf in group USM (7): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
USM_Allocation_latency_fp32_device - 0.067000 ms -
USM_Allocation_latency_fp32_host - 37.435000 ms 37.568 ms
USM_Allocation_latency_fp32_shared - 0.055000 ms 0.062 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch - 1.653 ms 1.298000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch - 1.800 ms 1.543000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch - 1.202 ms 1.140000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch - 1.044 ms 0.979000 ms
Relative perf in group VectorAddition (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
VectorAddition_fp32 - 1.450000 ms 1.495 ms
VectorAddition_int64 - 3.055000 ms 3.119 ms
VectorAddition_int32 - 1.452000 ms 1.494 ms
Relative perf in group Polybench (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Polybench_2mm - 1.213000 ms 1.214 ms
Polybench_3mm - 1.730000 ms 1.813 ms
Polybench_Atax - 6.883 ms 6.860000 ms
Relative perf in group Kmeans (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Kmeans_fp32 - 16.051000 ms 16.054 ms
Relative perf in group MolecularDynamics (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
MolecularDynamics - 0.027000 ms 0.027 ms
Relative perf in group llama.cpp (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
llama.cpp Text Generation Batched 128 - 62.270 token/s 65.291276 token/s
llama.cpp Prompt Processing Batched 256 - 909.696 token/s 949.465825 token/s
llama.cpp Prompt Processing Batched 128 - 815.476 token/s 864.649446 token/s
llama.cpp Text Generation Batched 512 - 62.368 token/s 65.190816 token/s
llama.cpp Prompt Processing Batched 512 - 458.128 token/s 486.573404 token/s
llama.cpp Text Generation Batched 256 - 62.255 token/s 65.269429 token/s

Details

Benchmark details - environment, command, output...

This patch implements support for:
 - dl-cifar
 - dl-mnist
 - svm

These are all workloads from Velocity Bench that require oneMKL.
Copy link

Compute Benchmarks level_zero run (with params: --filter Velocity):
https://github.com/oneapi-src/unified-runtime/actions/runs/11956101408

Copy link

Compute Benchmarks level_zero run (--filter Velocity):
https://github.com/oneapi-src/unified-runtime/actions/runs/11956101408
Job status: success. Test status: success.

Summary

Total 6 benchmarks in mean.
Geomean 99.869%.
Improved 0 Regressed 1 (threshold 0.50%)

(result is better)

Performance change in benchmark groups

Relative perf in group Velocity-Bench (9): 99.869%
Benchmark This PR baseline Relative perf Change -
Velocity-Bench Bitcracker 35.144200 s 35.222 s 100.22% 0.22% .
Velocity-Bench CudaSift 204.522000 ms 204.950 ms 100.21% 0.21% .
Velocity-Bench Sobel Filter 533.940000 ms 534.267 ms 100.06% 0.06% .
Velocity-Bench Hashtable 379.191 M keys/sec 379.894103 M keys/sec 99.81% -0.19% .
Velocity-Bench QuickSilver 118.690 MMS/CTT 118.990000 MMS/CTT 99.75% -0.25% .
Velocity-Bench Easywave 239.000 ms 237.000000 ms 99.16% -0.84% ----------
Velocity-Bench dl-cifar 24.017000 s -
Velocity-Bench dl-mnist 2.740000 s -
Velocity-Bench svm 0.151500 s -
Relative perf in group api (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
api_overhead_benchmark_sycl SubmitKernel out of order - 23.344000 μs
api_overhead_benchmark_sycl SubmitKernel in order - 23.009000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 - 2.399000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 - 1.629000 μs
api_overhead_benchmark_ur SubmitKernel out of order - 14.019000 μs
api_overhead_benchmark_ur SubmitKernel in order - 13.560000 μs
Relative perf in group memory (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 - 250.368000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 - 117.932000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 - 5.781000 μs
Relative perf in group miscellaneous (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
miscellaneous_benchmark_sycl VectorSum - 803.138000 μs
Relative perf in group multithread (8): cannot calculate
Benchmark This PR baseline Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 - 6719.944000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 - 17264.000000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 - 24830.061000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 - 1046.591000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 - 7614.679000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 - 8369.748000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 - 24930.683000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 - 1035.853000 μs
Relative perf in group Runtime (8): cannot calculate
Benchmark This PR baseline Relative perf Change -
Runtime_IndependentDAGTaskThroughput_SingleTask - 262.132000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor - 274.737000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor - 270.176000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor - 277.443000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor - 1801.303000 ms
Runtime_DAGTaskThroughput_BasicParallelFor - 1827.461000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor - 1782.877000 ms
Runtime_DAGTaskThroughput_SingleTask - 1766.731000 ms
Relative perf in group MicroBench (14): cannot calculate
Benchmark This PR baseline Relative perf Change -
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous - 4.487000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided - 4.498000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided - 4.504000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous - 4.536000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided - 4.625000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous - 618.194000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided - 4.405000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous - 4.480000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous - 618.203000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous - 4.365000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided - 617.471000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided - 617.439000 ms
MicroBench_LocalMem_fp32_4096 - 29.880000 ms
MicroBench_LocalMem_int32_4096 - 29.899000 ms
Relative perf in group Pattern (10): cannot calculate
Benchmark This PR baseline Relative perf Change -
Pattern_Reduction_NDRange_int32 - 16.937000 ms
Pattern_Reduction_Hierarchical_int32 - 16.990000 ms
Pattern_SegmentedReduction_NDRange_int64 - 2.347000 ms
Pattern_SegmentedReduction_NDRange_int16 - 2.269000 ms
Pattern_SegmentedReduction_NDRange_fp32 - 2.166000 ms
Pattern_SegmentedReduction_Hierarchical_int64 - 11.776000 ms
Pattern_SegmentedReduction_Hierarchical_int32 - 11.595000 ms
Pattern_SegmentedReduction_Hierarchical_int16 - 11.799000 ms
Pattern_SegmentedReduction_NDRange_int32 - 2.172000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 - 11.590000 ms
Relative perf in group ScalarProduct (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
ScalarProduct_Hierarchical_int64 - 11.306000 ms
ScalarProduct_NDRange_fp32 - 3.746000 ms
ScalarProduct_NDRange_int32 - 3.757000 ms
ScalarProduct_Hierarchical_int32 - 10.320000 ms
ScalarProduct_Hierarchical_fp32 - 9.955000 ms
ScalarProduct_NDRange_int64 - 5.441000 ms
Relative perf in group USM (7): cannot calculate
Benchmark This PR baseline Relative perf Change -
USM_Allocation_latency_fp32_device - 0.067000 ms
USM_Allocation_latency_fp32_host - 37.435000 ms
USM_Allocation_latency_fp32_shared - 0.055000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch - 1.653000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch - 1.800000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch - 1.202000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch - 1.044000 ms
Relative perf in group VectorAddition (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
VectorAddition_fp32 - 1.450000 ms
VectorAddition_int64 - 3.055000 ms
VectorAddition_int32 - 1.452000 ms
Relative perf in group Polybench (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
Polybench_2mm - 1.213000 ms
Polybench_3mm - 1.730000 ms
Polybench_Atax - 6.883000 ms
Relative perf in group Kmeans (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
Kmeans_fp32 - 16.051000 ms
Relative perf in group MolecularDynamics (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
MolecularDynamics - 0.027000 ms
Relative perf in group llama.cpp (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
llama.cpp Text Generation Batched 128 - 62.269711 token/s
llama.cpp Prompt Processing Batched 256 - 909.695608 token/s
llama.cpp Prompt Processing Batched 128 - 815.475889 token/s
llama.cpp Text Generation Batched 512 - 62.367806 token/s
llama.cpp Prompt Processing Batched 512 - 458.128303 token/s
llama.cpp Text Generation Batched 256 - 62.255402 token/s

Details

Benchmark details - environment, command, output...
Velocity-Bench Hashtable

Environment Variables:

Command:

/home/pmdk/bench_workdir/hashtable/hashtable_sycl --no-verify

Output:

hashtable - total time for whole calculation: 0.353958 s
379.191141 million keys/second

Velocity-Bench Bitcracker

Environment Variables:

Command:

/home/pmdk/bench_workdir/bitcracker/bitcracker -f /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00399524 s
bitcracker - total time for whole calculation: 35.1442 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1052 1271 28.5637% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1087 1273 29.514% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1175 1266 31.9033% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1236 1269 33.5596% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1203 1272 32.6636% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1226 1259 33.2881% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1222 1255 33.1795% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1098 1263 29.8127% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1119 1281 30.3828% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1149 1265 31.1974% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1237 1270 33.5868% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1233 1271 33.4781% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1100 1263 29.867% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1108 1255 30.0842% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1109 1255 30.1113% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1117 1270 30.3285% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1239 1279 33.6411% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1209 1258 32.8265% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1137 1262 30.8716% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1262 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1230 1264 33.3967% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1214 1264 32.9623% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1036 1258 28.1292% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1245 1279 33.804% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1053 1277 28.5908% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1232 1265 33.451% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1283 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1224 1263 33.2338% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1166 1270 31.659% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1264 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1233 1266 33.4781% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1222 1255 33.1795% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1233 1270 33.4781% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1241 1276 33.6954% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1035 1264 28.1021% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1083 1262 29.4054% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1232 1267 33.451% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1186 1270 32.202% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1247 1279 33.8583% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1087 1259 29.514% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1268 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1093 1256 29.6769% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1107 1275 30.057% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1030 1265 27.9663% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1108 1271 30.0842% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1232 1263 33.451% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1099 1270 29.8398% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1066 1261 28.9438% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1087 1253 29.514% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1235 1266 33.5324% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Avg workload time = 204.522 ms

Velocity-Bench Easywave

Environment Variables:

Command:

/home/pmdk/bench_workdir/easywave/easyWave_sycl -grid /home/pmdk/bench_workdir/data/easywave/examples/e2Asean.grd -source /home/pmdk/bench_workdir/data/easywave/examples/BengkuluSept2007.flt -time 120

Output:

MAIN: Starting SYCL main program
MAIN: Attempting to clean up previous eWave tsunami files
MAIN: Clean up completed
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
MAIN: Program successfully completed

Velocity-Bench QuickSilver

Environment Variables:

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Output:

Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version :
Quicksilver Git Hash :
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1

Loading params
Finished loading params
Simulation:
dt: 1e-08
fMax: 0.1
inputFile: /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp
energySpectrum:
boundaryCondition: octant
loadBalance: 1
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 10000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:

Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0

Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 0.1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 0
fissionCrossSectionRatio: 0
scatteringCrossSectionRatio: 1

CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
setting GPU
setting parameters
Building partition 0
Building partition 1
Building partition 2
Building partition 3
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Started copyMaterialDatabase_device
Finished copyMaterialDatabase_device
Finished copyNuclearData_device
Finished copyDomainDevice
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 1000000 0 9000000 0 18533189 0 0 18533189 1151780 8848220 55527935 1.854923e+09 3.697970e-01 6.065300e-01 0.000000e+00
1 8848220 1000000 0 151478 0 34281997 0 0 34281997 1664159 8335539 94633679 5.047651e+09 3.491410e-01 7.462260e-01 0.000000e+00
2 8335539 1000000 0 663717 0 34354432 0 0 34354432 1366771 8632485 95010375 7.705930e+09 3.398460e-01 7.615320e-01 0.000000e+00
3 8632485 1000000 0 367978 0 34302727 0 0 34302727 1242216 8758247 94953591 9.992076e+09 3.774710e-01 8.138900e-01 0.000000e+00
4 8758247 1000000 0 242076 0 34141236 0 0 34141236 1168452 8831871 94599337 1.199834e+10 3.418790e-01 7.891770e-01 0.000000e+00
5 8831871 1000000 0 168070 0 33948724 0 0 33948724 1121156 8878785 94148236 1.377636e+10 3.471040e-01 7.640660e-01 0.000000e+00
6 8878785 1000000 0 120572 0 33760567 0 0 33760567 1089103 8910254 93689264 1.535668e+10 3.454940e-01 7.622280e-01 0.000000e+00
7 8910254 1000000 0 89810 0 33552179 0 0 33552179 1065203 8934861 93216931 1.676993e+10 3.441710e-01 8.067000e-01 0.000000e+00
8 8934861 1000000 0 65491 0 33384605 0 0 33384605 1047720 8952632 92768273 1.804559e+10 3.378740e-01 7.822370e-01 0.000000e+00
9 8952632 1000000 0 47165 0 33198494 0 0 33198494 1033968 8965829 92324678 1.920208e+10 3.400990e-01 7.575310e-01 0.000000e+00

Timer Cumulative Cumulative Cumulative Cumulative Cumulative Cumulative
Name number microSecs microSecs microSecs microSecs Efficiency
of calls min avg max stddev Rating
main 1 1.108e+07 1.108e+07 1.108e+07 0.000e+00 100.00
cycleInit 10 3.493e+06 3.493e+06 3.493e+06 0.000e+00 100.00
cycleTracking 10 7.590e+06 7.590e+06 7.590e+06 0.000e+00 100.00
cycleTracking_Kernel 104 4.922e+06 4.922e+06 4.922e+06 0.000e+00 100.00
cycleTracking_MPI 117 1.962e+05 1.962e+05 1.962e+05 0.000e+00 100.00
cycleTracking_Test_Done 0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00
cycleFinalize 20 3.840e+02 3.840e+02 3.840e+02 0.000e+00 100.00
Figure Of Merit 118.69 [Num Mega Segments / Cycle Tracking Time]

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Output:

SYMN: Welcome to the SYCL version of Sobel filter workload.
SYMN: Input image file: /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png
SYMN: Launching SYCL kernel with # of iterations: 5
time to subtract from total: 7.42355 s
sobelfilter - total time for whole calculation: 0.53394 s

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Output:

	Welcome to DL-CIFAR workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: DL NW size type: WORKLOAD_DEFAULT_SIZE
WL PARAMS: ==================================================
WL PARAMS:

dataFileReadTimer->getTotalOpTime(): 8.2e-05 s
dl-cifar - total time for whole calculation: 24.017 s

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Output:

	Welcome to DL-MNIST workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: Tensor management policy: per_layer
WL PARAMS: Convolution algorithm: ONEDNN_AUTO
WL PARAMS: Dataset reader format: NCHW
WL PARAMS: Dry run: YES
WL PARAMS: OneDNN Conv PD memory format: ONEDNN_CONVPD_ANY
WL PARAMS: No of iterations for inference: 400
WL PARAMS: ==================================================
WL PARAMS:

dl-mnist - total time for whole calculation: 2.74 s

Velocity-Bench svm

Environment Variables:

Command:

/home/pmdk/bench_workdir/svm/svm_sycl /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Output:

Number of args 3
Using cuSVM (Carpenter)...

Buffering input text file (6989624 B).
Load Done
Starting Training
_C 1.000000
Workgroup Size: 1024
nbrCtas 80
elemsPerCta 1248
threadsPerCta 128
Total run time: 0.070610 seconds
Iter:100
M:97683
N:123
Train done. Calulate Vector counts
Training done

Loading elapsed time : 0.0734 s
Processing elapsed time : 0.0759 s
Storing elapsed time : 0.0022 s
Total elapsed time : 0.1515 s
Result's are correct: 0.0551

Copy link

Compute Benchmarks level_zero_v2 run (with params: --filter Velocity --compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11956108642

Copy link

Compute Benchmarks level_zero_v2 run (--filter Velocity --compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11956108642
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group Velocity-Bench (8): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Velocity-Bench Hashtable 382.457 M keys/sec 379.894 M keys/sec 383.152085 M keys/sec
Velocity-Bench Bitcracker 35.099500 s 35.222 s 35.193 s
Velocity-Bench CudaSift 201.433000 ms 204.950 ms 202.081 ms
Velocity-Bench Easywave 235.000 ms 237.000 ms 234.000000 ms
Velocity-Bench QuickSilver 121.470000 MMS/CTT 118.990 MMS/CTT 121.320 MMS/CTT
Velocity-Bench Sobel Filter 514.382000 ms 534.267 ms 516.832 ms
Velocity-Bench dl-cifar 17.761500 s - -
Velocity-Bench dl-mnist 2.690000 s - -
Relative perf in group api (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
api_overhead_benchmark_sycl SubmitKernel out of order - 23.344000 μs 23.745 μs
api_overhead_benchmark_sycl SubmitKernel in order - 23.009000 μs 24.748 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 - 2.399 μs 1.868000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 - 1.629000 μs 1.899 μs
api_overhead_benchmark_ur SubmitKernel out of order - 14.019000 μs 15.342 μs
api_overhead_benchmark_ur SubmitKernel in order - 13.560000 μs 15.480 μs
Relative perf in group memory (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 - 250.368 μs 196.473000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 - 117.932 μs 88.565000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 - 5.781 μs 5.684000 μs
Relative perf in group miscellaneous (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
miscellaneous_benchmark_sycl VectorSum - 803.138000 μs 804.478 μs
Relative perf in group multithread (8): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 - 6719.944 μs 3588.421000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 - 17264.000 μs 7922.105000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 - 24830.061000 μs 24977.963 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 - 1046.591000 μs 1063.816 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 - 7614.679 μs 4507.512000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 - 8369.748 μs 6390.880000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 - 24930.683000 μs 25445.122 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 - 1035.853000 μs 1068.329 μs
Relative perf in group Runtime (8): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Runtime_IndependentDAGTaskThroughput_SingleTask - 262.132 ms 175.188000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor - 274.737 ms 185.023000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor - 270.176 ms 182.648000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor - 277.443 ms 179.923000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor - 1801.303 ms 1313.605000 ms
Runtime_DAGTaskThroughput_BasicParallelFor - 1827.461 ms 1318.343000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor - 1782.877 ms 1284.252000 ms
Runtime_DAGTaskThroughput_SingleTask - 1766.731 ms 1249.041000 ms
Relative perf in group MicroBench (14): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous - 4.487 ms 4.461000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided - 4.498000 ms 4.528 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided - 4.504000 ms 4.547 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous - 4.536 ms 3.716000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided - 4.625 ms 3.753000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous - 618.194 ms 618.005000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided - 4.405000 ms 4.476 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous - 4.480 ms 4.446000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous - 618.203 ms 617.994000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous - 4.365 ms 4.300000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided - 617.471 ms 617.302000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided - 617.439 ms 617.296000 ms
MicroBench_LocalMem_fp32_4096 - 29.880 ms 29.858000 ms
MicroBench_LocalMem_int32_4096 - 29.899 ms 29.853000 ms
Relative perf in group Pattern (10): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Pattern_Reduction_NDRange_int32 - 16.937 ms 16.718000 ms
Pattern_Reduction_Hierarchical_int32 - 16.990 ms 16.906000 ms
Pattern_SegmentedReduction_NDRange_int64 - 2.347000 ms 2.351 ms
Pattern_SegmentedReduction_NDRange_int16 - 2.269 ms 2.253000 ms
Pattern_SegmentedReduction_NDRange_fp32 - 2.166 ms 2.164000 ms
Pattern_SegmentedReduction_Hierarchical_int64 - 11.776 ms 11.775000 ms
Pattern_SegmentedReduction_Hierarchical_int32 - 11.595000 ms 11.605 ms
Pattern_SegmentedReduction_Hierarchical_int16 - 11.799000 ms 11.799 ms
Pattern_SegmentedReduction_NDRange_int32 - 2.172 ms 2.170000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 - 11.590000 ms 11.595 ms
Relative perf in group ScalarProduct (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
ScalarProduct_Hierarchical_int64 - 11.306000 ms 11.356 ms
ScalarProduct_NDRange_fp32 - 3.746000 ms 3.797 ms
ScalarProduct_NDRange_int32 - 3.757000 ms 3.811 ms
ScalarProduct_Hierarchical_int32 - 10.320 ms 10.313000 ms
ScalarProduct_Hierarchical_fp32 - 9.955000 ms 9.961 ms
ScalarProduct_NDRange_int64 - 5.441000 ms 5.499 ms
Relative perf in group USM (7): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
USM_Allocation_latency_fp32_device - 0.067000 ms -
USM_Allocation_latency_fp32_host - 37.435000 ms 37.568 ms
USM_Allocation_latency_fp32_shared - 0.055000 ms 0.062 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch - 1.653 ms 1.298000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch - 1.800 ms 1.543000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch - 1.202 ms 1.140000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch - 1.044 ms 0.979000 ms
Relative perf in group VectorAddition (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
VectorAddition_fp32 - 1.450000 ms 1.495 ms
VectorAddition_int64 - 3.055000 ms 3.119 ms
VectorAddition_int32 - 1.452000 ms 1.494 ms
Relative perf in group Polybench (3): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Polybench_2mm - 1.213000 ms 1.214 ms
Polybench_3mm - 1.730000 ms 1.813 ms
Polybench_Atax - 6.883 ms 6.860000 ms
Relative perf in group Kmeans (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
Kmeans_fp32 - 16.051000 ms 16.054 ms
Relative perf in group MolecularDynamics (1): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
MolecularDynamics - 0.027000 ms 0.027 ms
Relative perf in group llama.cpp (6): cannot calculate
Benchmark This PR baseline baseline-v2 Relative perf Change -
llama.cpp Text Generation Batched 128 - 62.270 token/s 65.291276 token/s
llama.cpp Prompt Processing Batched 256 - 909.696 token/s 949.465825 token/s
llama.cpp Prompt Processing Batched 128 - 815.476 token/s 864.649446 token/s
llama.cpp Text Generation Batched 512 - 62.368 token/s 65.190816 token/s
llama.cpp Prompt Processing Batched 512 - 458.128 token/s 486.573404 token/s
llama.cpp Text Generation Batched 256 - 62.255 token/s 65.269429 token/s

Details

Benchmark details - environment, command, output...
Velocity-Bench Hashtable

Environment Variables:

Command:

/home/pmdk/bench_workdir/hashtable/hashtable_sycl --no-verify

Output:

hashtable - total time for whole calculation: 0.350935 s
382.457111 million keys/second

Velocity-Bench Bitcracker

Environment Variables:

Command:

/home/pmdk/bench_workdir/bitcracker/bitcracker -f /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00421165 s
bitcracker - total time for whole calculation: 35.0995 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1219 1254 33.098% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1220 1254 33.1252% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1110 1269 30.1385% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1225 1259 33.2609% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1264 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1094 1262 29.704% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1092 1277 29.6497% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1079 1262 29.2968% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1234 1266 33.5053% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1226 1266 33.2881% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1088 1265 29.5411% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1227 1264 33.3152% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1237 1268 33.5868% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1128 1279 30.6272% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1238 1277 33.6139% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1029 1254 27.9392% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1094 1260 29.704% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1223 1260 33.2066% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1262 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1066 1263 28.9438% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1265 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1101 1273 29.8941% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1236 1269 33.5596% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1268 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1234 1268 33.5053% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1238 1270 33.6139% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1236 1266 33.5596% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1233 1269 33.4781% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1234 1267 33.5053% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1231 1265 33.4238% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1262 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1213 1266 32.9351% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1221 1261 33.1523% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1235 1269 33.5324% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1112 1258 30.1928% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1268 33.3695% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1118 1271 30.3557% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1227 1258 33.3152% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1261 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1219 1255 33.098% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1228 1264 33.3424% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1153 1254 31.306% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1225 1257 33.2609% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1225 1264 33.2609% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1053 1249 28.5908% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1214 1266 32.9623% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1132 1265 30.7358% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1222 1263 33.1795% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1226 1260 33.2881% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1234 1267 33.5053% 1 2

Performing data verification
Data verification is SUCCESSFUL.

Avg workload time = 201.433 ms

Velocity-Bench Easywave

Environment Variables:

Command:

/home/pmdk/bench_workdir/easywave/easyWave_sycl -grid /home/pmdk/bench_workdir/data/easywave/examples/e2Asean.grd -source /home/pmdk/bench_workdir/data/easywave/examples/BengkuluSept2007.flt -time 120

Output:

MAIN: Starting SYCL main program
MAIN: Attempting to clean up previous eWave tsunami files
MAIN: Clean up completed
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
MAIN: Program successfully completed

Velocity-Bench QuickSilver

Environment Variables:

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Output:

Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version :
Quicksilver Git Hash :
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1

Loading params
Finished loading params
Simulation:
dt: 1e-08
fMax: 0.1
inputFile: /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp
energySpectrum:
boundaryCondition: octant
loadBalance: 1
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 10000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:

Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0

Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 0.1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 0
fissionCrossSectionRatio: 0
scatteringCrossSectionRatio: 1

CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
setting GPU
setting parameters
Building partition 0
Building partition 1
Building partition 2
Building partition 3
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Started copyMaterialDatabase_device
Finished copyMaterialDatabase_device
Finished copyNuclearData_device
Finished copyDomainDevice
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 1000000 0 9000000 0 18533189 0 0 18533189 1151780 8848220 55527935 1.854923e+09 3.739620e-01 5.943180e-01 0.000000e+00
1 8848220 1000000 0 151478 0 34281997 0 0 34281997 1664159 8335539 94633679 5.047651e+09 3.480890e-01 7.320060e-01 0.000000e+00
2 8335539 1000000 0 663717 0 34354432 0 0 34354432 1366771 8632485 95010375 7.705930e+09 3.470640e-01 7.454950e-01 0.000000e+00
3 8632485 1000000 0 367978 0 34302727 0 0 34302727 1242216 8758247 94953591 9.992076e+09 3.753540e-01 7.984480e-01 0.000000e+00
4 8758247 1000000 0 242076 0 34141236 0 0 34141236 1168452 8831871 94599337 1.199834e+10 3.443860e-01 7.729610e-01 0.000000e+00
5 8831871 1000000 0 168070 0 33948724 0 0 33948724 1121156 8878785 94148236 1.377636e+10 3.462320e-01 7.496790e-01 0.000000e+00
6 8878785 1000000 0 120572 0 33760567 0 0 33760567 1089103 8910254 93689264 1.535668e+10 3.468470e-01 7.478850e-01 0.000000e+00
7 8910254 1000000 0 89810 0 33552179 0 0 33552179 1065203 8934861 93216931 1.676993e+10 3.374660e-01 7.672290e-01 0.000000e+00
8 8934861 1000000 0 65491 0 33384605 0 0 33384605 1047720 8952632 92768273 1.804559e+10 3.375830e-01 7.660760e-01 0.000000e+00
9 8952632 1000000 0 47165 0 33198494 0 0 33198494 1033968 8965829 92324678 1.920208e+10 3.388550e-01 7.422490e-01 0.000000e+00

Timer Cumulative Cumulative Cumulative Cumulative Cumulative Cumulative
Name number microSecs microSecs microSecs microSecs Efficiency
of calls min avg max stddev Rating
main 1 1.091e+07 1.091e+07 1.091e+07 0.000e+00 100.00
cycleInit 10 3.496e+06 3.496e+06 3.496e+06 0.000e+00 100.00
cycleTracking 10 7.416e+06 7.416e+06 7.416e+06 0.000e+00 100.00
cycleTracking_Kernel 104 4.915e+06 4.915e+06 4.915e+06 0.000e+00 100.00
cycleTracking_MPI 117 2.019e+05 2.019e+05 2.019e+05 0.000e+00 100.00
cycleTracking_Test_Done 0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00
cycleFinalize 20 3.940e+02 3.940e+02 3.940e+02 0.000e+00 100.00
Figure Of Merit 121.47 [Num Mega Segments / Cycle Tracking Time]

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Output:

SYMN: Welcome to the SYCL version of Sobel filter workload.
SYMN: Input image file: /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png
SYMN: Launching SYCL kernel with # of iterations: 5
time to subtract from total: 7.42424 s
sobelfilter - total time for whole calculation: 0.514382 s

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Output:

	Welcome to DL-CIFAR workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: DL NW size type: WORKLOAD_DEFAULT_SIZE
WL PARAMS: ==================================================
WL PARAMS:

dataFileReadTimer->getTotalOpTime(): 8.8e-05 s
dl-cifar - total time for whole calculation: 17.7615 s

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Output:

	Welcome to DL-MNIST workload: SYCL version.

=======================================================================
SYCL: SYCL Queue initialization successful
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero
SYCL: Using SYCL device : Intel(R) Data Center GPU Max 1100 (Driver version 1.3.30049+10)
SYCL: Platform : Intel(R) oneAPI Unified Runtime over Level-Zero

WL PARAMS:

WL PARAMS: ==================================================
WL PARAMS: User input parameters:
WL PARAMS: Trace: notrace
WL PARAMS: Tensor management policy: per_layer
WL PARAMS: Convolution algorithm: ONEDNN_AUTO
WL PARAMS: Dataset reader format: NCHW
WL PARAMS: Dry run: YES
WL PARAMS: OneDNN Conv PD memory format: ONEDNN_CONVPD_ANY
WL PARAMS: No of iterations for inference: 400
WL PARAMS: ==================================================
WL PARAMS:

dl-mnist - total time for whole calculation: 2.69 s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant