
OpenCL kernel long time to start execution after enqueue on iGPU #735

Open

arshadlab opened this issue May 25, 2024 · 4 comments

Hi,
I am experimenting with kernel execution on the iGPU for solutions tailored to extremely low-latency workloads, where response times matter down to the microsecond. While monitoring performance, I noticed a significant delay between when kernels are enqueued and when they actually start executing. To isolate the issue, I implemented a basic OpenCL kernel that adds two 256-byte arrays and enqueued it with clEnqueueNDRangeKernel(). At the API level, each call takes approximately 156 microseconds. However, VTune's 'GPU Offload' analysis shows that kernel execution begins only after 131 microseconds and then runs for about 9 microseconds. This is with the iGPU running at a 300 MHz frequency.
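For context, the measurement loop looks roughly like the sketch below (a minimal sketch only: the kernel name vec_add, the use of clock_gettime for host timing, and the pre-created context/queue/kernel are illustrative assumptions, not the exact code I ran):

#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

/* Kernel under test is a trivial element-wise add, roughly:
 *   __kernel void vec_add(__global const uchar *a, __global const uchar *b,
 *                         __global uchar *c)
 *   { size_t i = get_global_id(0); c[i] = a[i] + b[i]; }
 */

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

/* queue and kernel are assumed to have been created and set up earlier */
static void measure(cl_command_queue queue, cl_kernel kernel)
{
    const size_t global_size = 256;   /* one work-item per byte */
    struct timespec t0, t1;

    for (int i = 0; i < 1000; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish(queue);              /* blocking host sync every iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("iteration %d: %.1f us end-to-end\n", i, elapsed_us(t0, t1));
    }
}

With this pattern, the ~156 us (300 MHz) and ~36 us (1300 MHz) figures mentioned here are the end-to-end time per iteration, while VTune separates the queued portion from the actual execution.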

I observed that the GPU's operational frequency affects this latency significantly. At a low frequency of 300 MHz, the queue time is much longer compared to when the GPU is at its maximum frequency of 1300 MHz (queued duration drops to 33 microseconds, with a total time of 36 microseconds per call). Despite this improvement at higher frequencies, the initial delay before execution remains concerningly high at 33 microseconds, especially since there are no other compute tasks running simultaneously.

Given this scenario, I'm seeking insights into what might be causing these delays in compute execution startup. Additionally, what strategies could potentially reduce this latency, allowing the kernel to execute nearly immediately after being queued?

CPU Specs: 13th Gen Intel(R) Core(TM) i7-13700H (TDP 45W) on an Asus PN64-E
OS: Ubuntu 22.04
ICD version: 23.52.28202.52-821~22.04

Thank you for any advice or insights you can provide!

Best regards,
Arshad

Screenshots:
iGPU running at low frequency (300 MHz) - ~150 us duration per call, actual kernel execution ~9 us [screenshot: higher_latency_lower_frequency]

iGPU running at higher frequency (1300 MHz) - ~37 us duration per call, actual kernel execution ~2 us [screenshot: less_latency_high_frequency]

[screenshot: compute_focus]

arshadlab changed the title from "OpenCL kernel long time to execute after enqueue on iGPU" to "OpenCL kernel long time to start execution after enqueue on iGPU" on May 25, 2024
BartusW (Contributor) commented May 29, 2024

Arshad,

The default configuration (no GPU frequency throttling) shows that your performance observations are correct. For your scenario and hardware (the i7-13700H CPU and its integrated GPU), the end-to-end timing is in the expected range. The VTune logs show buffer submission timing of ~37 us. By the scenario's design (the logs show a single clEnqueueNDRangeKernel() followed by clFinish()), batching is excluded, so every iteration triggers a workload submission to the iGPU through the i915 KMD plus a hard host synchronization at the clFinish() call. VTune instrumentation also adds ~5 us of overhead for the CPU/GPU event-view timestamps.

When the clock is reduced to a non-default, low frequency, performance is affected by several factors. With a low GPU clock, command-stream parsing takes longer, just as the kernel execution itself scales linearly from 2 us to 9 us. Additionally, on ramp-down there can be a timing coincidence with render-standby (RC6) entry and exit triggered by the next clEnqueue/clFinish call, followed by a compute context switch. With a forced low GPU clock, both RC6 transitions and the GPU context switch, which operate in the GPU clock domain, become more expensive. From the Compute Runtime (UMD) driver perspective, the observed performance impact is valid and expected.

To remove the RC6 and batching/context-switch effects, which add noise to your experiment, please redesign the scenario with a non-blocking clFlush() instead of clFinish(), and disable GPU render standby (RC6) either in the BIOS or at i915 module load time via a modprobe config file:
options i915 enable_rc6=0
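A redesigned measurement loop could look roughly like the sketch below (a sketch only; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE so the queued-to-start delay can be read from the event rather than from VTune):

cl_event evt;
size_t global_size = 256;              /* same toy kernel and size as before (assumption) */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
clFlush(queue);                        /* submit without blocking the host */

/* ... other host work; synchronize once, later ... */
clWaitForEvents(1, &evt);

cl_ulong t_queued, t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(t_queued), &t_queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(t_start),  &t_start,  NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(t_end),    &t_end,    NULL);
printf("queued->start: %llu ns, execution: %llu ns\n",
       (unsigned long long)(t_start - t_queued),
       (unsigned long long)(t_end - t_start));
clReleaseEvent(evt);

The "options i915 enable_rc6=0" line would normally go into a file under /etc/modprobe.d/ (for example /etc/modprobe.d/i915.conf, the file name itself being arbitrary) and may require regenerating the initramfs before rebooting.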

arshadlab (Author) commented

Hi,
I am not able to set enable_rc6 to 0. I also tried modifying the driver code, but RC6 is still reported as enabled:

code: ./drivers/gpu/drm/i915/gt/intel_rc6.c
bool enable_rc6 = true -> false;

$ cat /sys/class/drm/card0/power/rc6_enable
1

It seems the enable_rc6 module option was removed in kernel 4.16 (see https://patchwork.freedesktop.org/patch/191386/).
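As a cross-check, whether the GPU actually enters RC6 during the test can be observed via the i915 residency counter in sysfs (assuming card0 is the iGPU):

$ cat /sys/class/drm/card0/power/rc6_residency_ms

If the value keeps increasing between two reads while the system is otherwise idle, RC6 is still being entered.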

Regards,
Arshad

BartusW (Contributor) commented Jun 11, 2024

Please use BIOS settings to disable RC6 GPU standby capabilities.

eero-t commented Jun 18, 2024

Which i915 kernel driver version is being used? According to the page below, only the out-of-tree driver (from the Intel package repository) supports low-latency submission: https://dgpu-docs.intel.com/driver/kernel-driver-types.html

EDIT: the out-of-tree KMD also requires the compute runtime from the same repository, or building compute-runtime with -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE.
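For reference, if building compute-runtime from source, that flag is passed at CMake configure time, roughly like this (a sketch only; see the project's BUILD instructions for the full dependency and repository layout):

git clone https://github.com/intel/compute-runtime.git
mkdir build && cd build
cmake ../compute-runtime -DCMAKE_BUILD_TYPE=Release -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE
make -j$(nproc)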
