Realm: Profiling breaks with CUPTI #1800

Open
elliottslaughter opened this issue Dec 4, 2024 · 5 comments

@elliottslaughter
Contributor

When running a Legion GPU program with CUPTI support, I observed that the following GPU task record was generated:

GPUTaskInfo with invalid GPU time
[src/state.rs:4311:5] record = GPUTaskInfo {
    op_id: OpID(
        305,
    ),
    task_id: TaskID(
        10012,
    ),
    variant_id: VariantID(
        19,
    ),
    proc_id: ProcID(
        2089671326611537923,
    ),
    create: Timestamp(
        7258406468,
    ),
    ready: Timestamp(
        10696756299,
    ),
    start: Timestamp(
        10696774926,
    ),
    stop: Timestamp(
        10696982083,
    ),
    gpu_start: Timestamp(
        9223372036854775808,
    ),
    gpu_stop: Timestamp(
        9223372036854775808,
    ),
    creator: Some(
        EventID(
            9223512843103502338,
        ),
    ),
    critical: Some(
        EventID(
            9223512843085676569,
        ),
    ),
    fevent: EventID(
        9223512843120279568,
    ),
}

Note the invalid timestamps for gpu_start and gpu_stop (9223372036854775808, aka 0x8000000000000000). This currently causes the profiler to choke.

I observed that running the program with -cuda:cupti 0 causes the problem to go away, and the profile renders successfully.

There are a couple of things I think we need to do:

  1. Fix whatever issue is causing CUPTI profiling to generate invalid timestamps
  2. Ensure we have GPU profiling coverage in our CI, because apparently this isn't being caught right now

I produced this on Sapling with CUDA 12.1. Logs are available at /scratch/eslaught/legion-retreat-2024/language/pennant.run1/logs_cupti. Up one directory from there is the script (and binaries) to reproduce this issue.

CC @lightsighter for visibility.

@lightsighter
Contributor

Right. The manifestation here is that we make an OperationTimelineGPU profiling request, and in the response we call get_measurement<OperationTimelineGPU>(), which returns true, indicating that the measurement is there; but if you then call OperationTimelineGPU::is_valid() it returns false. It does this intermittently (rarely), but often enough that on a long enough run the probability of it happening approaches one and the profiler crashes. The only reason that is_valid() should return false is if the operation was cancelled (or faulted), which definitely isn't happening in this case.
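
In code terms the pattern is roughly the following (a minimal sketch assuming Realm's bool-returning get_measurement overload from realm/profiling.h, not the actual Legion Prof handler):

#include "realm/profiling.h"  // include path may differ depending on the build

using namespace Realm;
using namespace Realm::ProfilingMeasurements;

// Sketch of a profiling-response handler: the measurement is reported as
// present, but its timestamps were never filled in, so is_valid() fails
// even though the task was neither cancelled nor faulted.
void handle_response(const ProfilingResponse &resp)
{
  OperationTimelineGPU gpu_timeline;
  if (resp.get_measurement<OperationTimelineGPU>(gpu_timeline)) {
    // The measurement exists in the response...
    if (!gpu_timeline.is_valid()) {
      // ...but the GPU start/stop were never set -- the intermittent
      // case reported in this issue.
    }
  }
}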

The other problem here is that the CI currently doesn't have a job that runs the profiler on logs generated from an application running on the GPU. @elliottslaughter to add a new CI job to ensure that we catch cases like this in the future.

@eddy16112
Contributor

I feel like the issue is that we forgot to set the start_time and end_time for this GPU operation, so their values are left at INVALID_TIMESTAMP, which is -9223372036854775808. That's why OperationTimelineGPU::is_valid() returns false.

Maybe run with -level cupti=1 to see if we skip any operations.
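
If that's what's happening, the pieces would line up like this (a minimal sketch, not Realm's actual definitions, just to show where the 9223372036854775808 in the record above would come from):

#include <cassert>
#include <cstdint>

// Sketch of an OperationTimelineGPU-like measurement whose timestamps
// default to INVALID_TIMESTAMP until the CUPTI activity records fill them in.
struct TimelineSketch {
  static constexpr int64_t INVALID_TIMESTAMP = INT64_MIN;  // -9223372036854775808
  int64_t start_time = INVALID_TIMESTAMP;
  int64_t end_time   = INVALID_TIMESTAMP;

  bool is_valid() const {
    return start_time != INVALID_TIMESTAMP && end_time != INVALID_TIMESTAMP;
  }
};

int main()
{
  TimelineSketch t;        // never filled in
  assert(!t.is_valid());

  // Reinterpreted as an unsigned 64-bit value, INT64_MIN shows up as
  // 9223372036854775808 (0x8000000000000000), exactly the gpu_start/gpu_stop
  // values in the record above.
  assert(static_cast<uint64_t>(t.start_time) == 9223372036854775808ULL);
  return 0;
}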

@muraj

muraj commented Dec 4, 2024

@elliottslaughter can you send me the application that this occurs on? This shouldn't happen, since we push the CUPTI correlation activity record for each GPU task, so it'll show up in the CUPTI buffer when we flush (when the task completes). Until the buffer is flushed, yes, the GPU timestamps will be invalid since they aren't known yet.
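
For context, the flow being described is the standard CUPTI activity-buffer mechanism, roughly like the sketch below (generic CUPTI calls, not Realm's actual code; the kernel record struct version varies by CUPTI release, and the correlation bookkeeping that ties a record back to a specific Realm task is omitted):

#include <cupti.h>
#include <cstdio>
#include <cstdlib>

// CUPTI asks for a buffer to record activity into.
static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                      size_t *max_num_records)
{
  *size = 1 << 20;                 // 1 MiB
  *buffer = (uint8_t *)malloc(*size);
  *max_num_records = 0;            // as many records as fit
}

// CUPTI hands back a completed buffer; walk it for kernel records.
static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                      uint8_t *buffer, size_t size,
                                      size_t valid_size)
{
  CUpti_Activity *record = NULL;
  while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
    if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
      const CUpti_ActivityKernel4 *k = (const CUpti_ActivityKernel4 *)record;
      // These device-side timestamps are what would become gpu_start/gpu_stop;
      // correlationId is how they get matched back to the launching task.
      printf("corr=%u start=%llu end=%llu\n", k->correlationId,
             (unsigned long long)k->start, (unsigned long long)k->end);
    }
  }
  free(buffer);
}

int main()
{
  cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

  // ... launch GPU work here ...

  // If a task's records never make it into a flushed buffer, its GPU
  // timestamps stay at their invalid defaults.
  cuptiActivityFlushAll(0);
  return 0;
}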

@lightsighter
Contributor

lightsighter commented Dec 6, 2024

I think Elliott was seeing it on Regent's implementation of Pennant. I saw it once too when running Pennant C++, but that was on the only run I was able to get through the machine without hitting #1802 or #1803, so I wouldn't suggest trying Pennant C++ until those are fixed.

@elliottslaughter
Contributor Author

@muraj I have a reproducer sitting on Sapling here: /scratch/eslaught/legion-retreat-2024/language/bug1800

It includes a script to run the application:

sbatch --nodes 1 sbatch_pennant.sh

Then you just run the profiler on the resulting logs to see it crash.

If you want a reproducer you can rebuild, let me know and I can provide those instructions as well.
