Algorithm Benchmark is a set of benchmarks aimed at measuring the performance of realistic workloads.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
Heat3D | A 3D heat-equation solving benchmark that overlaps IPC data transfers (nearest-neighbor halo exchange) and GPU compute kernels. Measures multi-process concurrent kernel execution and IPC memory transfer performance on a single device. Linux-only. | | ✔️ | ❌ |
Api Overhead Benchmark is a set of tests aimed at measuring CPU-side execution duration of compute API calls.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
AppendLaunchKernel | measures time spent in zeCommandListAppendLaunchKernel on CPU. | | ✔️ | ❌ |
AppendWaitOnEventsImmediate | Measures time spent in zeCommandListAppendWaitOnEvents using immediate command list. | | ✔️ | ❌ |
CommandListHostSynchronize | measures CPU time spent in zeCommandListHostSynchronize. Optionally, adds an event-signalling barrier and waits for the event before calling zeCommandListHostSynchronize. | | ✔️ | ❌ |
CreateBuffer | measures time spent in clCreateBuffer on CPU. | | ❌ | ✔️ |
CreateCommandList | measures time spent in zeCommandListCreate on CPU. | | ✔️ | ❌ |
CreateCommandListImmediate | measures time spent in zeCommandListCreateImmediate on CPU. | | ✔️ | ❌ |
DestroyCommandList | measures time spent in zeCommandListDestroy on CPU. | | ✔️ | ❌ |
DestroyCommandListImmediate | measures time spent in zeCommandListDestroy on CPU, for immediate cmdlist. | | ✔️ | ❌ |
DriverGet | measures time spent in driver get call on CPU. | | ✔️ | ❌ |
DriverGetApiVersion | measures time spent in zeDriverGetApiVersion call on CPU. | | ✔️ | ❌ |
DriverGetProperties | measures time spent in zeDriverGetProperties call on CPU. | | ✔️ | ❌ |
EnqueueNdrNullLws | measures time spent in clEnqueueNDRangeKernel on CPU. Null LWS is provided, which causes driver to calculate it. | | ❌ | ✔️ |
EnqueueNdrTime | measures time spent in clEnqueueNDRangeKernel on CPU. | | ❌ | ✔️ |
EventCreation | measures time spent to create event | | ✔️ | ❌ |
EventQueryStatus | Measures time spent to query event status | | ✔️ | ❌ |
ExecImmediate | measures time spent in appending launch kernel for immediate command list on CPU. | | ✔️ | ❌ |
ExecImmediateCopyQueue | measures time spent in appending memory copy for immediate command list on CPU with Copy Queue. | | ✔️ | ❌ |
ExecImmediateMultiKernel | measures time spent in executing multiple instances of two different kernels with immediate command list on CPU | | ✔️ | ❌ |
ExecuteCommandList | measures time spent in zeCommandQueueExecuteCommandLists on CPU. | | ✔️ | ❌ |
ExecuteCommandListForCopyEngine | measures CPU time spent in zeCommandQueueExecuteCommandLists for copy-only path | | ✔️ | ❌ |
ExecuteCommandListWithFenceCreate | measures time spent in zeFenceCreate on CPU when fences are used. | | ✔️ | ❌ |
ExecuteCommandListWithFenceDestroy | measures time spent in zeFenceDestroy on CPU when fences are used. | | ✔️ | ❌ |
ExecuteCommandListWithFenceUsage | measures time spent in zeCommandQueueExecuteCommandLists and zeFenceSynchronize on CPU when fences are used. | | ✔️ | ❌ |
ExecuteCommandListWithIndirectAccess | measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are accessed. | | ✔️ | ❌ |
ExecuteCommandListWithIndirectArguments | measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are used. | | ✔️ | ❌ |
FlushTime | measures time spent in clEnqueueNDRangeKernel on CPU. | | ❌ | ✔️ |
GetMemoryProperties | measures time spent in zeMemGetAllocProperties on CPU when driver is queried for memory properties. | | ✔️ | ❌ |
GetMemoryPropertiesWithModifiedAllocations | measures time spent in zeMemGetAllocProperties on CPU, when allocations are modified between each iteration. | | ✔️ | ❌ |
GetMemoryPropertiesWithOffsetedPointer | measures time spent in zeMemGetAllocProperties on CPU when the pointer passed is an offset from the base address. | | ✔️ | ❌ |
KernelSetArgumentValueImmediate | measures time spent in zeKernelSetArgumentValue for immediate arguments on CPU. | | ✔️ | ❌ |
LifecycleCommandList | measures time spent in zeCommandListCreate + Close + Execute on CPU. | | ✔️ | ❌ |
MemGetIpcHandle | measures time spent in zeMemGetIpcHandle on CPU. | | ✔️ | ❌ |
MemOpenIpcHandle | measures time spent in zeMemOpenIpcHandle on CPU. | | ✔️ | ❌ |
MemPutIpcHandle | measures time spent in zeMemPutIpcHandle on CPU. | | ✔️ | ❌ |
ModuleCreateSpv | measures time spent in zeModuleCreate for .spv kernel on CPU. | | ✔️ | ❌ |
PhysicalMemCreate | measures time spent in zePhysicalMemCreate on CPU. | | ✔️ | ❌ |
PhysicalMemDestroy | measures time spent in zePhysicalMemDestroy on CPU. | | ✔️ | ❌ |
ResetCommandList | measures time spent in zeCommandListReset on CPU. | | ✔️ | ❌ |
SetKernelArgSvmPointer | measures time spent in clSetKernelArgSVMPointer on CPU. | | ✔️ | ✔️ |
SetKernelGroupSize | measures time spent in zeKernelSetGroupSize on CPU. | | ✔️ | ❌ |
SubmitKernel | measures time spent in submitting a kernel to a SYCL (or SYCL-like) queue on CPU. | | ✔️ | ✔️ |
UsmMemoryAllocation | measures time spent in USM memory allocation APIs. | | ✔️ | ❌ |
VirtualMemFree | measures time spent in zeVirtualMemFree on CPU. | | ✔️ | ❌ |
VirtualMemGetAccessAttrib | measures time spent in zeVirtualMemGetAccessAttribute on CPU. | | ✔️ | ❌ |
VirtualMemMap | measures time spent in zeVirtualMemMap on CPU. | | ✔️ | ❌ |
VirtualMemQueryPageSize | measures time spent in zeVirtualMemQueryPageSize on CPU. | | ✔️ | ❌ |
VirtualMemReserve | measures time spent in zeVirtualMemReserve on CPU. | | ✔️ | ❌ |
VirtualMemSetAccessAttrib | measures time spent in zeVirtualMemSetAccessAttribute on CPU. | | ✔️ | ❌ |
VirtualMemUnMap | measures time spent in zeVirtualMemUnMap on CPU. | | ✔️ | ❌ |
Atomic Benchmark is a set of tests aimed at measuring performance of atomic operations inside kernels.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
OneAtomic | enqueues kernel performing an atomic operation on a single address | | ❌ | ✔️ |
OneAtomicExplicit | enqueues kernel performing an atomic operation on a single address using OpenCL 2.0 Atomics with explicit memory order and scope | | ❌ | ✔️ |
OneLocalAtomic | enqueues kernel performing an atomic operation on a single location placed in SLM | | ❌ | ✔️ |
OneLocalAtomicExplicit | enqueues kernel performing an atomic operation on a single location placed in SLM using OpenCL 2.0 Atomics with explicit memory order and scope | | ❌ | ✔️ |
SeparateAtomics | enqueues kernel performing an atomic operation on different addresses | | ❌ | ✔️ |
SeparateAtomicsExplicit | enqueues kernel performing an atomic operation on different addresses | | ❌ | ✔️ |
Emulation Benchmark is a set of tests aimed at measuring performance of emulated math operations performed in kernels.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
Int64Div | enqueues kernel performing an int64 division emulation | | ❌ | ✔️ |
EU Benchmark is a set of tests aimed at measuring performance of calculations performed in kernels.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
DoMathOperation | enqueues kernel performing a math operation | | ❌ | ✔️ |
ReadAfterAtomicWrite | enqueues kernel, which writes to global memory using atomic and then reads non-atomically | | ❌ | ✔️ |
Gpu Commands Benchmark is a set of tests aimed at measuring GPU-side execution duration of various commands.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
BarrierBetweenKernels | measures time required to run a barrier command between 2 kernels, including potential cache flush commands | | ✔️ | ❌ |
CopyWithEvent | measures time required to run a copy kernel with various event configurations. | | ✔️ | ❌ |
EmptyKernel | measures time required to run an empty kernel on GPU. | | ✔️ | ❌ |
EventCtxtSwitchLatency | measures context switching latency time required to switch between various engine types | | ✔️ | ❌ |
KernelWithEvent | measures time required to run an empty kernel with various event configurations. | | ✔️ | ❌ |
KernelWithWork | measures time required to run a GPU kernel which assigns values to elements of a buffer. | | ✔️ | ❌ |
WaitOnEventCold | measures time required to service a signalled semaphore that has never been waited for. | | ✔️ | ❌ |
WaitOnEventFromWalker | measures time required to service a signalled semaphore coming from Walker command | | ✔️ | ❌ |
WaitOnEventHot | measures time required to service a signalled semaphore that was previously used | | ✔️ | ❌ |
WriteTimestamp | measures time required to write a timestamp on GPU. | | ✔️ | ❌ |
Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
CopyBuffer | allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
CopyBufferRect | allocates two OpenCL buffers and measures rectangle copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
CopyBufferToImage | allocates buffer and image and measures copy bandwidth between them using immediate command list for Level Zero and command queue for OpenCL. | | ✔️ | ✔️ |
CopyEntireImage | allocates two image objects and measures copy bandwidth between them. Images will be placed in device memory, if it's available. | | ✔️ | ✔️ |
CopyImageRegion | allocates two image objects and measures region copy bandwidth between them using immediate command list for Level Zero and command queue for OpenCL. | | ✔️ | ✔️ |
CopyImageToBuffer | allocates image and buffer and measures copy bandwidth between them using immediate command list for Level Zero and command queue for OpenCL. | | ✔️ | ✔️ |
FillBuffer | allocates an OpenCL buffer and measures fill bandwidth. Buffer will be placed in device memory, if it's available. | | ❌ | ✔️ |
FullRemoteAccessMemory | Uses stream memory in a fashion described by 'type' to measure bandwidth of full remote memory access. | | ❌ | ✔️ |
FullRemoteAccessMemoryXeCoresDistributed | Uses stream memory in a fashion described by 'type' to measure bandwidth of full remote memory access when hwthreads are distributed between XeCores. | | ❌ | ✔️ |
MapBuffer | allocates an OpenCL buffer and measures map bandwidth. Mapping operation means memory transfer from GPU to CPU or a no-op, depending on map flags. | | ❌ | ✔️ |
QueueInOrderMemcpy | measures time on CPU spent for multiple in-order memcpy operations. | | ✔️ | ❌ |
RandomAccessMemory | Measures device-memory random access bandwidth for different allocation sizes, alignments and access modes. The benchmark uses 10 million accesses to memory. | | ✔️ | ❌ |
ReadBuffer | allocates an OpenCL buffer and measures read bandwidth. Read operation means transfer from GPU to CPU. | | ❌ | ✔️ |
ReadBufferMisaligned | allocates an OpenCL buffer and measures read bandwidth. Read operation means transfer from GPU to CPU. Destination pointer passed by the application will be misaligned by the specified amount of bytes. | | ❌ | ✔️ |
ReadBufferRect | allocates an OpenCL buffer and measures rectangle read bandwidth. Rectangle read operation means transfer from GPU to CPU. | | ❌ | ✔️ |
ReadDeviceMemBuffer | allocates two OpenCL buffers and measures source buffer read bandwidth. Source buffer resides in device memory. | | ❌ | ✔️ |
RemoteAccessMemory | Uses stream memory in a fashion described by 'type' to measure bandwidth with different percentages of remote memory access. Triad means two buffers are read and one is written to. In read and write, memory is only read or written to. | | ❌ | ✔️ |
RemoteAccessMemoryMaxSaturation | Uses stream memory write to measure max data bus saturation with different percentages of remote memory access | | ❌ | ✔️ |
SLM_DataAccessLatency | generates SLM local memory transactions inside thread group to measure latency between reads (uses the Intel-only private intel_get_cycle_counter()) | | ❌ | ✔️ |
SlmSwitchLatency | Enqueues 2 kernels with different SLM size. Measures switch time between these kernels. | | ✔️ | ❌ |
StreamAfterTransfer | Measures how stream kernels perform right after the host-to-device transfer that populates the data. The test cleans caches, emits the transfers, then runs a stream kernel and measures its GPU execution time. | | ❌ | ✔️ |
StreamMemory | Streams memory inside of kernel in a fashion described by 'type'. Copy means one memory location is read from and the second one is written to. Triad means two buffers are read and one is written to. In read and write, memory is only read or written to. | | ✔️ | ✔️ |
StreamMemoryImmediate | Streams memory inside of kernel in a fashion described by 'type' using immediate command list. Copy means one memory location is read from and the second one is written to. Triad means two buffers are read and one is written to. In read and write, memory is only read or written to. | | ✔️ | ❌ |
UnmapBuffer | allocates an OpenCL buffer and measures unmap bandwidth. Unmapping operation means memory transfer from CPU to GPU or a no-op, depending on map flags. | | ❌ | ✔️ |
UsmConcurrentCopy | allocates four unified shared memory buffers, 2 in device memory and 2 in host memory. Measures concurrent copy bandwidth between them. | | ✔️ | ❌ |
UsmCopy | allocates two unified shared memory buffers and measures copy bandwidth between them. | | ✔️ | ✔️ |
UsmCopyConcurrentMultipleBlits | Measures copy bandwidth while performing concurrent copies between host and device using different copy engines. Engines for Host to Device copies can be selected using h2dBlitters. Engines for Device to Host copies can be selected using d2hBlitters. | | ✔️ | ❌ |
UsmCopyImmediate | allocates two unified shared memory buffers and measures copy bandwidth between them using immediate command list. | | ✔️ | ❌ |
UsmCopyMultipleBlits | allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided between the remaining copy engines. Otherwise the division is even. | | ✔️ | ✔️ |
UsmCopyRegion | allocates two unified shared memory buffers and measures region copy bandwidth between them using immediate command list. | | ✔️ | ❌ |
UsmCopyStagingBuffers | Measures copy time from device/host to host/device. Host memory is a non-USM allocation. Copy is done through staging USM buffers. The non-USM host pointer is never passed to the L0 API, only through staging buffers. | | ✔️ | ❌ |
UsmFill | allocates a unified memory buffer and measures fill bandwidth | | ✔️ | ✔️ |
UsmFillImmediate | allocates a unified memory buffer and measures fill bandwidth using immediate command list | | ✔️ | ❌ |
UsmFillMultipleBlits | allocates a unified shared memory buffer, divides it into chunks, fills each chunk using a different copy engine and measures bandwidth. Refer to UsmCopyMultipleBlits for more details. | | ✔️ | ✔️ |
UsmFillSpecificPattern | allocates a unified memory buffer and measures fill bandwidth. Allows specifying an arbitrary pattern. | | ✔️ | ✔️ |
UsmImmediateCopyMultipleBlits | allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine with an immediate command list and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided between the remaining copy engines. Otherwise the division is even. | | ✔️ | ❌ |
UsmMemset | allocates a unified memory buffer and measures memset bandwidth | | ❌ | ✔️ |
UsmSharedMigrateCpu | allocates a unified shared memory buffer and measures bandwidth for kernel that must migrate resource from GPU to CPU | | ✔️ | ✔️ |
UsmSharedMigrateGpu | allocates a unified shared memory buffer and measures bandwidth for kernel that must migrate resource from CPU to GPU | | ✔️ | ✔️ |
UsmSharedMigrateGpuForFill | allocates a unified shared memory buffer and measures bandwidth for memory fill operation that must migrate resource from CPU to GPU | | ✔️ | ✔️ |
WriteBuffer | allocates an OpenCL buffer and measures write bandwidth. Write operation means transfer from CPU to GPU. | | ❌ | ✔️ |
WriteBufferRect | allocates an OpenCL buffer and measures rectangle write bandwidth. Rectangle write operation means transfer from CPU to GPU. | | ❌ | ✔️ |
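
The stream 'type' values described for StreamMemory above (read, write, copy, triad) differ only in how many buffers are read and written. The following is a minimal sketch of that bookkeeping, not the benchmark's own code. The names `STREAM_TYPES`, `bytes_moved` and `bandwidth_gb_per_s` are hypothetical, and counting every read and written byte toward bandwidth is an assumption about how the result is reported.

```python
# Hypothetical illustration: (buffers read, buffers written) per stream type,
# following the wording of the StreamMemory description.
STREAM_TYPES = {
    "read": (1, 0),   # memory is only read
    "write": (0, 1),  # memory is only written to
    "copy": (1, 1),   # one location is read, a second one is written to
    "triad": (2, 1),  # two buffers are read and one is written to
}


def bytes_moved(stream_type, buffer_bytes):
    """Total bytes touched, assuming every buffer has the same size."""
    reads, writes = STREAM_TYPES[stream_type]
    return (reads + writes) * buffer_bytes


def bandwidth_gb_per_s(stream_type, buffer_bytes, seconds):
    """Bandwidth in GB/s (assumption: reads and writes both count)."""
    return bytes_moved(stream_type, buffer_bytes) / seconds / 1e9
```

For example, a triad over 1 GB buffers that takes one second would be reported as 3 GB/s under these assumptions.
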
Miscellaneous Benchmark is a set of tests measuring different simple compute scenarios.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
IoqKernelSwitchLatency | measures time from end of one kernel till start of next kernel for in-order queue | | ❌ | ✔️ |
KernelWithWork | measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. Benchmark checks the impact of kernel split. | | ❌ | ✔️ |
Reduction | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction2 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction3 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction4 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
Reduction5 | Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location. | | ❌ | ✔️ |
VectorSum | Performs vector addition | | ❌ | ✔️ |
Multiprocess Benchmark is a set of tests aimed at measuring how different commands benefit from simultaneous execution.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
KernelAndCopy | enqueues kernel and copy operation with the ability to perform both tasks on different command queues. | | ❌ | ✔️ |
MultiProcessCompute | Creates a number of separate processes for each tile specified performing a compute workload and measures average time to complete all of them. Processes will use affinity mask to select specific sub-devices for the execution. | | ✔️ | ❌ |
MultiProcessComputeSharedBuffer | Creates a number of separate processes for each tile specified performing a compute workload and measures average time to complete all of them. Processes will use affinity mask to select specific sub-devices for the execution. A single buffer for each tile is created by parent process. All processes executing on a given tile will share it via IPC calls. | | ✔️ | ❌ |
MultiProcessImmediateCmdlistCompletion | measures completion latency of AppendMemoryCopy issued from multiple processes to Immediate Command Lists. Engines to be used for submissions are selected based on the enabled bits of engineMask. Bits of 'engineMask' are indexed from right to left, so the rightmost bit represents the first engine and the leftmost bit the last engine. 'processesPerEngine' processes submit commands to each selected engine. If 'numberOfProcesses' is greater than 'processesPerEngine' x selected engine count, the excess processes are assigned to the selected engines one each, in a round-robin manner. If the selected engine count is 1, all processes are assigned to that engine. | | ✔️ | ❌ |
MultiProcessImmediateCmdlistSubmission | measures submission latency of walker command issued from multiple processes to Immediate Command Lists. 'processesPerEngine' processes submit commands to each engine. If 'numberOfProcesses' is greater than 'processesPerEngine' x engine count, the excess processes are assigned to engines one each, in a round-robin manner. If the engine count is 1, all processes are assigned to that engine. | | ✔️ | ❌ |
MultiProcessInit | Measures the initialization overhead in a multi-process application. For Level Zero we only measure the first invocation of zeInit() per process execution. | | ✔️ | ❌ |
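
The process-to-engine assignment described for MultiProcessImmediateCmdlistCompletion can be sketched as follows. This is only an illustration of the stated rules, not the benchmark's implementation: the function names are hypothetical, the mask is modelled as a string of '0'/'1' characters, and the behaviour when there are fewer processes than 'processesPerEngine' x engine count (engines are simply filled in order) is an assumption.

```python
def selected_engines(engine_mask):
    """Return indices of enabled engines; bit 0 is the rightmost character."""
    return [i for i, bit in enumerate(reversed(engine_mask)) if bit == "1"]


def assign_processes(number_of_processes, processes_per_engine, engine_mask):
    """Return a list where entry p is the engine index used by process p."""
    engines = selected_engines(engine_mask)
    if not engines:
        raise ValueError("engine mask selects no engines")

    assignment = []
    # First pass: up to processes_per_engine processes for each selected engine.
    for engine in engines:
        for _ in range(processes_per_engine):
            if len(assignment) < number_of_processes:
                assignment.append(engine)

    # Excess processes go to the selected engines one each, round-robin.
    # With a single selected engine this puts every process on that engine.
    next_engine = 0
    while len(assignment) < number_of_processes:
        assignment.append(engines[next_engine % len(engines)])
        next_engine += 1
    return assignment
```

With a mask of "101" (engines 0 and 2 selected), two processes per engine and five processes in total, the fifth process wraps around to engine 0.
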
Multithread Benchmark is a set of tests aimed at measuring how different commands benefit from multithreaded execution.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
ImmediateCommandListCompletion | measures completion latency of AppendMemoryCopy issued from multiple threads to Immediate Command Lists. Engines to be used for submissions are selected based on the enabled bits of engineMask. 'threadsPerEngine' threads submit commands to each selected engine. If 'numberOfThreads' is greater than 'threadsPerEngine' x selected engine count, the excess threads are assigned to the selected engines one each, in a round-robin manner. If the selected engine count is 1, all threads are assigned to that engine. | | ✔️ | ❌ |
ImmediateCommandListSubmission | measures submission latency of AppendLaunchKernel issued from multiple threads to Immediate Command Lists. 'threadsPerEngine' threads submit commands to each engine. If 'numberOfThreads' is greater than 'threadsPerEngine' x engine count, the excess threads are assigned to engines one each, in a round-robin manner. If the engine count is 1, all threads are assigned to that engine. | | ✔️ | ❌ |
SvmCopy | enqueues multiple SVM copies on multiple threads concurrently. | | ✔️ | ✔️ |
Multi-tile Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers performed on a multi-tile device.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
CopyBuffer | allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available. | | ❌ | ✔️ |
FillBuffer | allocates an OpenCL buffer and measures fill bandwidth. Buffer will be placed in device memory, if it's available. | | ❌ | ✔️ |
ReadBuffer | allocates an OpenCL buffer and measures read bandwidth. Read operation means transfer from GPU to CPU. | | ❌ | ✔️ |
UsmBidirectionalCopy | allocates two unified device memory buffers, each on a different tile, and measures copy bandwidth between them. The test measures copies in two directions, which can be controlled with the -write parameter. With -write=1, each tile performs a write operation: for tile 0, the queue is placed in tile 0, the source buffer is in tile 0 and the destination is in tile 1. Similarly, for tile 1, the queue is placed in tile 1, the source is in tile 1 and the destination is in tile 0. With -write=0, the destination and source are flipped: for tile 0, the queue is placed in tile 0, the source buffer is in tile 1 and the destination is in tile 0, while for tile 1, the queue is placed in tile 1, the source is in tile 0 and the destination is in tile 1. | | ✔️ | ❌ |
UsmCopy | allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function. | | ✔️ | ✔️ |
UsmCopyImmediate | allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function appended to an immediate list. | | ✔️ | ❌ |
UsmCopyKernel | allocates two unified shared memory buffers and measures copy bandwidth between them using a custom kernel. | | ✔️ | ✔️ |
UsmFill | allocates a unified shared memory buffer and measures fill bandwidth. | | ✔️ | ✔️ |
UsmSharedMigrateCpu | allocates a unified shared memory buffer and measures time to migrate it from GPU to CPU. | | ✔️ | ✔️ |
UsmSharedMigrateGpu | allocates a unified shared memory buffer and measures time to migrate it from CPU to GPU. | | ✔️ | ✔️ |
WriteBuffer | allocates an OpenCL buffer and measures write bandwidth. Write operation means transfer from CPU to GPU. | | ❌ | ✔️ |
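
The effect of the -write parameter on UsmBidirectionalCopy is easier to follow as a small lookup. The sketch below only restates the placement rules from the description above for a two-tile device. The function name `copy_placement` and the dictionary keys are hypothetical and do not come from the benchmark.

```python
def copy_placement(write):
    """Return queue/source/destination tiles for both copies.

    write=True  -> each tile writes: source is local, destination is remote.
    write=False -> each tile reads: source is remote, destination is local.
    """
    placements = []
    for queue_tile in (0, 1):
        other_tile = 1 - queue_tile
        if write:
            source, destination = queue_tile, other_tile
        else:
            source, destination = other_tile, queue_tile
        placements.append(
            {"queue": queue_tile, "source": source, "destination": destination}
        )
    return placements
```

In both modes the queue stays on its own tile. Only the side of the copy that is remote changes.
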
P2P Benchmark is a set of tests aimed at measuring bandwidth and latency of memory transfers between peer devices.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
UsmCopyMultipleBlits | allocates two unified device memory buffers on separate devices, copies sections (or chunks) of them using a different copy engine for each chunk and measures bandwidth. The test first checks for P2P capabilities in the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided between the remaining copy engines. Otherwise the division is even. | | ✔️ | ❌ |
UsmEUCopy | allocates two unified device memory buffers on separate devices, performs a copy between them using a compute engine, and reports bandwidth. The test first checks for P2P capabilities in the target platform before submitting the copy. | | ✔️ | ❌ |
UsmImmediateCopyMultipleBlits | allocates two unified device memory buffers on separate devices, copies sections (or chunks) of them using a different copy engine for each chunk with an immediate command list and measures bandwidth. The test first checks for P2P capabilities in the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided between the remaining copy engines. Otherwise the division is even. | | ✔️ | ❌ |
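
The division of work among blitters and the total-bandwidth rule described for the MultipleBlits tests can be sketched as below. This is a rough illustration under stated assumptions, not the benchmark's code: the function names are hypothetical, the blitter bitmask is modelled as a string of '0'/'1' characters with the main copy engine as the rightmost bit, and any remainder from integer division is simply dropped.

```python
def split_buffer(blitters, buffer_bytes):
    """Map each selected copy engine index to the number of bytes it copies."""
    engines = [i for i, bit in enumerate(reversed(blitters)) if bit == "1"]
    if not engines:
        raise ValueError("no copy engine selected")

    main_selected = engines[0] == 0
    if main_selected and len(engines) > 1:
        # Main copy engine gets half; the rest is shared by the other engines.
        main_share = buffer_bytes // 2
        others = engines[1:]
        other_share = (buffer_bytes - main_share) // len(others)
        sizes = {0: main_share}
        sizes.update({engine: other_share for engine in others})
        return sizes

    # Main engine not selected (or it is the only engine): even division.
    even_share = buffer_bytes // len(engines)
    return {engine: even_share for engine in engines}


def total_bandwidth(buffer_bytes, engine_seconds):
    """Total buffer size divided by the worst (longest) per-engine time."""
    return buffer_bytes / max(engine_seconds)
```

Under these assumptions, a mask of "0111" on a 1024-byte buffer gives the main engine 512 bytes and engines 1 and 2 get 256 bytes each, while "0110" splits the buffer evenly between engines 1 and 2.
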
Ulls Benchmark is a set of tests aimed at measuring Ultra Low Latency Submission (ULLS) performance impact.
Test name | Description | Params | L0 | OCL |
---|---|---|---|---|
BestSubmission | enqueues a system memory write via PIPE_CONTROL and measures when update becomes visible on the CPU. | | ✔️ | ❌ |
BestWalkerNthCommandListSubmission | enqueues a single kernel on n command lists, which updates system memory location, and then busy-loops on CPU until the update of the kernel of the nth command list becomes visible. This is an L0-only test. | | ✔️ | ❌ |
BestWalkerNthSubmission | enqueues n kernels, which update system memory location, and then busy-loops on CPU until the update of the nth kernel becomes visible. | | ✔️ | ❌ |
BestWalkerNthSubmissionImmediate | enqueues n kernels, which update system memory location, and then busy-loops on CPU until the update of the nth kernel becomes visible. Kernel is enqueued using low-latency immediate command list, so the test is LevelZero-specific. | | ✔️ | ❌ |
BestWalkerSubmission | enqueues kernel, which updates system memory location and then busy-loops on CPU until the update becomes visible. | | ✔️ | ✔️ |
BestWalkerSubmissionImmediate | enqueues kernel, which updates system memory location and then busy-loops on CPU until the update becomes visible. Kernel is enqueued using low-latency immediate command list, so the test is LevelZero-specific. | | ✔️ | ❌ |
BestWalkerSubmissionImmediateMultiCmdlists | Appends N kernels on N cmdlists, which update system memory locations, and then waits using busy-loop on CPU until the update becomes visible. Kernels are appended using immediate command lists. The number of command lists is specified by cmdlistCount. | | ✔️ | ❌ |
CompletionLatency | enqueues system memory write and measures time between the moment when update is visible on CPU and the moment when synchronizing call returns. | | ✔️ | ❌ |
CopySubmissionEvents | enqueues a 4-byte copy to the copy engine and returns the submission delta, which is the time between the host API call and the copy engine start | | ✔️ | ✔️ |
EmptyKernel | enqueues empty kernel and measures time to launch it and wait for it on CPU, thus measuring walker spawn time. | | ✔️ | ✔️ |
EmptyKernelImmediate | enqueues empty kernel and measures time to launch it using immediate command list and wait for it on CPU, thus measuring walker spawn time. | | ✔️ | ❌ |
EnqueueBarrierWithEmptyWaitlist | enqueues kernel with barriers with empty waitlists in between, waiting on the last barrier's event | | ❌ | ✔️ |
KernelSwitchLatency | measures time from end of one kernel till start of next kernel | | ✔️ | ✔️ |
KernelSwitchLatencyImmediate | measures time from end of one kernel till start of next kernel using immediate command lists | | ✔️ | ❌ |
KernelWithWork | measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. | | ✔️ | ✔️ |
KernelWithWorkImmediate | measures time required to run a GPU kernel which assigns constant values to elements of a buffer using immediate command list. Each thread assigns one value. | | ✔️ | ❌ |
KernelWithWorkPeriodic | measures average time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. Kernel is run multiple times with a set delay between submissions. | | ✔️ | ❌ |
MultiQueueSubmission | enqueues kernel on multiple command queues | | ✔️ | ✔️ |
MultipleImmediateWithDependencies | Creates N immediate command lists. Submits kernels in order to each of those. Each kernel has a dependency on the previous one. Submissions are small to allow concurrent execution. Measures time from scheduling start until all command lists are completed. | | ✔️ | ❌ |
NewResourcesSubmissionDevice | enqueues kernel that uses a buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
NewResourcesSubmissionHost | enqueues kernel that uses a buffer placed in host memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
NewResourcesWithGpuAccess | enqueues kernel that accesses an entire buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation. | | ✔️ | ✔️ |
QueueConcurrency | Submits multiple kernels to an out-of-order queue, returning events. There is a sequence of long kernel, short kernel, wait for short kernel. Then calls synchronization and measures performance. | | ❌ | ✔️ |
QueuePriorities | Uses queues with different priorities to measure submission and context switch latencies | | ❌ | ✔️ |
ResourceReassign | Enqueues a stress kernel which utilizes the majority of the GPU's execution units, then enqueues the next kernel, measuring its execution time. Shows overhead related to releasing and assigning GPU resources. | | ❌ | ✔️ |
RoundTripSubmission | enqueues kernel which updates system memory location and waits for it with a synchronizing API. | | ✔️ | ✔️ |
UsmSharedFirstCpuAccess | allocates a unified shared memory buffer and measures time to access it on CPU after creation. | | ✔️ | ✔️ |
UsmSharedFirstGpuAccess | allocates a unified shared memory buffer and measures time to access it on GPU after creation. | | ✔️ | ✔️ |
WalkerCompletionLatency | enqueues a kernel writing to system memory and measures time between the moment when update is visible on CPU and the moment when synchronizing call returns | | ✔️ | ✔️ |
WalkerSubmissionEvents | enqueues an empty kernel with GPU-side profiling and checks delta between queue time and start time. | | ✔️ | ✔️ |
WriteLatency | unblocks event on GPU, then waits for timestamp being written. | | ✔️ | ❌ |