
OpenCL kernel long time to start execution after enqueue on iGPU #735

Open

arshadlab opened this issue May 25, 2024 · 4 comments

Hi,
I am experimenting with kernel execution on the iGPU for solutions tailored to extremely low-latency workloads, where response times matter down to the microsecond. While monitoring performance, I noticed a significant delay between when kernels are enqueued and when they actually start executing. To isolate the issue, I implemented a basic OpenCL kernel that adds two 256-byte arrays and enqueued it with clEnqueueNDRangeKernel(). At the API level, each call takes approximately 156 microseconds. However, VTune's 'GPU Offload' analysis shows that kernel execution begins only after 131 microseconds and then runs for about 9 microseconds. This is with the iGPU running at a 300 MHz frequency.
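For context, the measurement loop looks roughly like the sketch below (a minimal sketch only: the kernel name vec_add, the use of clock_gettime for host timing, and the pre-created context/queue/kernel are illustrative assumptions, not the exact code I ran):

#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

/* Kernel under test is a trivial element-wise add, roughly:
 *   __kernel void vec_add(__global const uchar *a, __global const uchar *b,
 *                         __global uchar *c)
 *   { size_t i = get_global_id(0); c[i] = a[i] + b[i]; }
 */

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

/* queue and kernel are assumed to have been created and set up earlier */
static void measure(cl_command_queue queue, cl_kernel kernel)
{
    const size_t global_size = 256;   /* one work-item per byte */
    struct timespec t0, t1;

    for (int i = 0; i < 1000; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish(queue);              /* blocking host sync every iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("iteration %d: %.1f us end-to-end\n", i, elapsed_us(t0, t1));
    }
}

With this pattern, the ~156 us (300 MHz) and ~36 us (1300 MHz) figures mentioned here are the end-to-end time per iteration, while VTune separates the queued portion from the actual execution.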

I observed that the GPU's operational frequency affects this latency significantly. At a low frequency of 300 MHz, the queue time is much longer compared to when the GPU is at its maximum frequency of 1300 MHz (queued duration drops to 33 microseconds, with a total time of 36 microseconds per call). Despite this improvement at higher frequencies, the initial delay before execution remains concerningly high at 33 microseconds, especially since there are no other compute tasks running simultaneously.

Given this scenario, I'm seeking insights into what might be causing these delays in compute execution startup. Additionally, what strategies could potentially reduce this latency, allowing the kernel to execute nearly immediately after being queued?

CPU Specs: 13th Gen Intel(R) Core(TM) i7-13700H (TDP 45W) on an Asus PN64-E
OS: Ubuntu 22.04
ICD version: 23.52.28202.52-821~22.04

Thank you for any advice or insights you can provide!

Best regards,
Arshad

Screenshots:
iGPU running at low frequency (300 MHz) - ~150 us duration per call, actual kernel execution ~9 us [screenshot: higher_latency_lower_frequency]

iGPU running at higher frequency (1300 MHz) - ~37 us duration per call, actual kernel execution ~2 us [screenshot: less_latency_high_frequency]

[screenshot: compute_focus]

arshadlab changed the title from "OpenCL kernel long time to execute after enqueue on iGPU" to "OpenCL kernel long time to start execution after enqueue on iGPU" on May 25, 2024
BartusW (Contributor) commented May 29, 2024

Arshad,

The default configuration (no GPU frequency throttling) shows that your performance observations are correct. For your scenario and hardware (the i7-13700H CPU and its integrated GPU), the end-to-end timing is in the expected range. The VTune logs show buffer submission timing of ~37 us. By the scenario's design (the logs show a single clEnqueueNDRangeKernel() followed by clFinish()), batching is excluded, so every iteration triggers a workload submission to the iGPU through the i915 KMD plus a hard host synchronization at the clFinish() call. VTune instrumentation also adds ~5 us of overhead for the CPU/GPU event-view timestamps.

When the clock is reduced to a non-default, low frequency, performance is affected by several factors. With a low GPU clock, command-stream parsing takes longer, just as the kernel execution itself scales linearly from 2 us to 9 us. Additionally, on ramp-down there can be a timing coincidence with render-standby (RC6) entry and exit triggered by the next clEnqueue/clFinish call, followed by a compute context switch. With a forced low GPU clock, both RC6 transitions and the GPU context switch, which operate in the GPU clock domain, become more expensive. From the Compute Runtime (UMD) driver perspective, the observed performance impact is valid and expected.

To remove the RC6 and batching/context-switch effects, which add noise to your experiment, please redesign the scenario with a non-blocking clFlush() instead of clFinish(), and disable GPU render standby (RC6) either in the BIOS or at i915 module load time via a modprobe config file:
options i915 enable_rc6=0
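A redesigned measurement loop could look roughly like the sketch below (a sketch only; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE so the queued-to-start delay can be read from the event rather than from VTune):

cl_event evt;
size_t global_size = 256;              /* same toy kernel and size as before (assumption) */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
clFlush(queue);                        /* submit without blocking the host */

/* ... other host work; synchronize once, later ... */
clWaitForEvents(1, &evt);

cl_ulong t_queued, t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(t_queued), &t_queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(t_start),  &t_start,  NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(t_end),    &t_end,    NULL);
printf("queued->start: %llu ns, execution: %llu ns\n",
       (unsigned long long)(t_start - t_queued),
       (unsigned long long)(t_end - t_start));
clReleaseEvent(evt);

The "options i915 enable_rc6=0" line would normally go into a file under /etc/modprobe.d/ (for example /etc/modprobe.d/i915.conf, the file name itself being arbitrary) and may require regenerating the initramfs before rebooting.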

arshadlab (Author) commented

Hi,
I am not able to set enable_rc6 to 0. I also tried modifying the driver code, but RC6 is still reported as enabled:

code: ./drivers/gpu/drm/i915/gt/intel_rc6.c
bool enable_rc6 = true -> false;

$ cat /sys/class/drm/card0/power/rc6_enable
1

It seems the enable_rc6 module option was removed in kernel 4.16 (see https://patchwork.freedesktop.org/patch/191386/).
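As a cross-check, whether the GPU actually enters RC6 during the test can be observed via the i915 residency counter in sysfs (assuming card0 is the iGPU):

$ cat /sys/class/drm/card0/power/rc6_residency_ms

If the value keeps increasing between two reads while the system is otherwise idle, RC6 is still being entered.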

Regards,
Arshad

BartusW (Contributor) commented Jun 11, 2024

Please use BIOS settings to disable RC6 GPU standby capabilities.

eero-t commented Jun 18, 2024

Which i915 kernel driver version is being used? According to the page below, only the out-of-tree driver (from the Intel package repository) supports low-latency submission: https://dgpu-docs.intel.com/driver/kernel-driver-types.html

EDIT: the out-of-tree KMD also requires the compute runtime from the same repository, or building compute-runtime with -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE.
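For reference, if building compute-runtime from source, that flag is passed at CMake configure time, roughly like this (a sketch only; see the project's BUILD instructions for the full dependency and repository layout):

git clone https://github.com/intel/compute-runtime.git
mkdir build && cd build
cmake ../compute-runtime -DCMAKE_BUILD_TYPE=Release -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE
make -j$(nproc)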
