Realm: missed detection of HIP hijack #1557

mariodirenzo · 2023-09-30T12:18:58Z

This issue follows up on a discussion on Zulip (https://legion.zulipchat.com/#narrow/stream/187787-general/topic/Hip.20hijack).
It seems that the HIP hijack is not active even if Realm is compiled with REALM_USE_HIP_HIJACK and if all the kernels are launched on the streams returned by hipGetTaskStream.
In particular, the runtime always issues the following warning:

[0 - 7f1fa3ef9000]    0.374056 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!

@eddy16112 has been able to reproduce the problem using circuit.

The values of cfg_task_context_sync and cudart_hijack_active printed at https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_module.cc#L1515 are:

[0 - 1554499f3140]    0.251715 {4}{hip}: gpu_proc->gpu->module->config->cfg_task_context_sync = -1
[0 - 1554499f3140]    0.251741 {4}{hip}: cudart_hijack_active = 1

The text was updated successfully, but these errors were encountered:

mariodirenzo · 2023-09-30T12:19:51Z

@elliottslaughter, please add this issue to #1032

eddy16112 · 2024-03-06T21:45:04Z

I do not think this is a bug. Here is the output from circuit:

(base) wwu@g0003:/scratch2/wwu/legion-master/examples/circuit$ ./circuit -ll:gpu 1
[0 - 7f5afa950000]    0.149888 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!
[0 - 7f5b0254d000]    0.240584 {3}{circuit}: circuit settings: loops=2 pieces=4 nodes/piece=1024 wires/piece=1024 pct_in_piece=95 seed=12345
[0 - 7f5b0254d000]    0.251828 {3}{circuit}: Initializing circuit simulation...
[0 - 7f5b0254d000]    0.417142 {3}{circuit}: Finished initializing simulation...
Starting main simulation loop
[0 - 7f5afa950000]    0.502278 {4}{hip}: HIP hijack is active - device synchronizations not required after every GPU task!
SUCCESS!
ELAPSED TIME =   0.097 s
GFLOPS =  80.688 GFLOPS
[0 - 7f5b0254d000]    0.515076 {3}{circuit}: simulation complete - destroying regions

We use lazy initialization for hip hijack, the first warning {hip}: HIP hijack code not active - device synchronizations required after every GPU task! is triggered by a realm internal task without calling hipGetTaskStream. Then the circuit GPU kernels are launched with hipGetTaskStream, so we will see HIP hijack is active - device synchronizations not required after every GPU task!

mariodirenzo · 2024-03-11T13:58:05Z

Do you mean that the first warning should be disregarded or that the first Realm internal task should be issued on the hipGetTaskStream?

eddy16112 · 2024-03-11T16:01:51Z

the fist warning can be disregarded

mariodirenzo · 2024-03-11T16:11:05Z

Is there any plan to silence the initial warning (to avoid confusion for other users in the future) or should I close the issue?

eddy16112 · 2024-03-11T16:16:12Z

I think it is OK to have the warning because as long as hipGetTaskStream is used, we would see another output "HIP hijack is active - device synchronizations not required after every GPU task!"

mariodirenzo · 2024-03-11T16:17:40Z

Ok, I'll close the issue. Thanks for the clarification

elliottslaughter · 2024-03-11T17:42:06Z

@eddy16112 I just want to confirm why the first warning appears? Are we actually launching a task that does not use the hijack? If so, where does it come from?

eddy16112 · 2024-03-11T21:05:41Z

@eddy16112 I just want to confirm why the first warning appears? Are we actually launching a task that does not use the hijack? If so, where does it come from?

A NOP task is launched by Legion to make sure the processor is ready, https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/runtime_impl.cc#L2709, which is where the first warning from.

elliottslaughter · 2024-03-11T21:15:06Z

Would it be appropriate to fetch the HIP stream from that task, so we can make the initial warning go away?

In that case, I think we would want to disable the print on "HIP hijack is active" so that if the user misses the hijack in their own code, they'll get the "HIP hijack code not active" warning only and not think they're using the hijack when they aren't.

eddy16112 · 2024-03-11T21:46:54Z

Would it be appropriate to fetch the HIP stream from that task,

We can do that, but I do not think it helps. Once the hijack is enabled, we can not go back to the non-hijack mode, so we can not throw the "hijack code not active" warning. One solution is to completely remove the hijack. and use the get_task_hip_stream and set_task_ctxsync_required API, users can control which task need ctxsync by calling set_task_ctxsync_required inside the task body.

elliottslaughter · 2024-03-12T03:41:29Z

I'm not crazy about set_task_ctxsync_required. I can see how it would help in special cases where you need to go back and forth, but in the common case of always using the hijack, it would be nice if things "just worked".

Could we instead treat the internal task as a special operation somehow, so that we know it does not launch any kernels? It really shouldn't have an impact on the hijack either way, since we know it's not doing anything.

elliottslaughter · 2024-07-03T22:41:23Z

@eddy16112 Instead of requiring a global set_task_ctxsync_required, could we add an API like skip_task_ctxsync_this_task (name subject to change). The semantics of this API are:

The caller task promises not to perform any work on the GPU whatsoever
The caller does NOT promise anything about the rest of the application: i.e., Realm should not make assumptions about the application's behavior from this one task and should essentially just skip it for the purpose of determining whether to synchronize or not

Legion can then use this in its dummy tasks to avoid the warning, without poisoning the state of the application and requiring a problematic global state like set_task_ctxsync_required.

eddy16112 · 2024-07-04T02:13:22Z

@eddy16112 Instead of requiring a global set_task_ctxsync_required, could we add an API like skip_task_ctxsync_this_task (name subject to change). The semantics of this API are:

The caller task promises not to perform any work on the GPU whatsoever

The caller does NOT promise anything about the rest of the application: i.e., Realm should not make assumptions about the application's behavior from this one task and should essentially just skip it for the purpose of determining whether to synchronize or not

Legion can then use this in its dummy tasks to avoid the warning, without poisoning the state of the application and requiring a problematic global state like set_task_ctxsync_required.

Why not just call set_task_ctxsync_required(false) for the dummy task? I think the only issue is that it will record an event for the dummy task, which is not necessary.

elliottslaughter · 2024-07-05T19:19:22Z

Ok. I guess I don't understand the semantics of set_task_ctxsync_required. Is it a global option or does it influence only the current task? If we set_task_ctxsync_required(false) in one task, will the runtime will auto-detect correctly in the next one?

eddy16112 · 2024-07-06T00:43:38Z

It only influences the current task. By default, the ThreadLocal::context_sync_required is set to -1 inside every task. In this case, context sync is always needed https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/cuda/cuda_module.cc#L1642-1647.

elliottslaughter mentioned this issue Oct 3, 2023

Prioritized list of Regent features for HTR (PSAAP) #1032

Open

81 tasks

mariodirenzo closed this as completed Mar 11, 2024

elliottslaughter reopened this Jul 3, 2024

elliottslaughter mentioned this issue Oct 24, 2024

Remove CUDART hijack #1730

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Realm: missed detection of HIP hijack #1557

Realm: missed detection of HIP hijack #1557

mariodirenzo commented Sep 30, 2023

mariodirenzo commented Sep 30, 2023

eddy16112 commented Mar 6, 2024

mariodirenzo commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

mariodirenzo commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

mariodirenzo commented Mar 11, 2024

elliottslaughter commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

elliottslaughter commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

elliottslaughter commented Mar 12, 2024

elliottslaughter commented Jul 3, 2024

eddy16112 commented Jul 4, 2024

elliottslaughter commented Jul 5, 2024

eddy16112 commented Jul 6, 2024

Realm: missed detection of HIP hijack #1557

Realm: missed detection of HIP hijack #1557

Comments

mariodirenzo commented Sep 30, 2023

mariodirenzo commented Sep 30, 2023

eddy16112 commented Mar 6, 2024

mariodirenzo commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

mariodirenzo commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

mariodirenzo commented Mar 11, 2024

elliottslaughter commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

elliottslaughter commented Mar 11, 2024

eddy16112 commented Mar 11, 2024

elliottslaughter commented Mar 12, 2024

elliottslaughter commented Jul 3, 2024

eddy16112 commented Jul 4, 2024

elliottslaughter commented Jul 5, 2024

eddy16112 commented Jul 6, 2024