Realm: Profiling breaks with CUPTI #1800

Open
elliottslaughter opened this issue Dec 4, 2024 · 5 comments

@elliottslaughter
Contributor

When running a Legion GPU program with CUPTI support, I observed that the following GPU task record was generated:

GPUTaskInfo with invalid GPU time
[src/state.rs:4311:5] record = GPUTaskInfo {
    op_id: OpID(
        305,
    ),
    task_id: TaskID(
        10012,
    ),
    variant_id: VariantID(
        19,
    ),
    proc_id: ProcID(
        2089671326611537923,
    ),
    create: Timestamp(
        7258406468,
    ),
    ready: Timestamp(
        10696756299,
    ),
    start: Timestamp(
        10696774926,
    ),
    stop: Timestamp(
        10696982083,
    ),
    gpu_start: Timestamp(
        9223372036854775808,
    ),
    gpu_stop: Timestamp(
        9223372036854775808,
    ),
    creator: Some(
        EventID(
            9223512843103502338,
        ),
    ),
    critical: Some(
        EventID(
            9223512843085676569,
        ),
    ),
    fevent: EventID(
        9223512843120279568,
    ),
}

Note the invalid timestamps for gpu_start and gpu_stop (9223372036854775808, aka 0x8000000000000000). This currently causes the profiler to choke.

I observed that running the program with -cuda:cupti 0 causes the problem to go away, and the profile renders successfully.

There are a couple of things I think we need to do:

  1. Fix whatever issue is causing CUPTI profiling to generate invalid timestamps
  2. Ensure we have GPU profiling coverage in our CI, because apparently this isn't being caught right now

I produced this on Sapling with CUDA 12.1. Logs are available at /scratch/eslaught/legion-retreat-2024/language/pennant.run1/logs_cupti. Up one directory from there is the script (and binaries) to reproduce this issue.

CC @lightsighter for visibility.

@lightsighter
Contributor

Right. The manifestation here is that we make an OperationTimelineGPU profiling request, and in the response we call get_measurement<OperationTimelineGPU>(), which returns true, indicating that the measurement is there; but if you then call OperationTimelineGPU::is_valid() it returns false. It does this intermittently (rarely), but often enough that on a long enough run the probability of it happening approaches one and the profiler crashes. The only reason that is_valid() should return false is if the operation was cancelled (or faulted), which definitely isn't happening in this case.
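
In code terms the pattern is roughly the following (a minimal sketch assuming Realm's bool-returning get_measurement overload from realm/profiling.h, not the actual Legion Prof handler):

#include "realm/profiling.h"  // include path may differ depending on the build

using namespace Realm;
using namespace Realm::ProfilingMeasurements;

// Sketch of a profiling-response handler: the measurement is reported as
// present, but its timestamps were never filled in, so is_valid() fails
// even though the task was neither cancelled nor faulted.
void handle_response(const ProfilingResponse &resp)
{
  OperationTimelineGPU gpu_timeline;
  if (resp.get_measurement<OperationTimelineGPU>(gpu_timeline)) {
    // The measurement exists in the response...
    if (!gpu_timeline.is_valid()) {
      // ...but the GPU start/stop were never set -- the intermittent
      // case reported in this issue.
    }
  }
}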

The other problem here is that the CI currently doesn't have a job that runs the profiler on logs generated from an application running on the GPU. @elliottslaughter to add a new CI job to ensure that we catch cases like this in the future.

@eddy16112
Contributor

I feel like the issue is that we forgot to set the start_time and end_time for this GPU operation, so their values are left at INVALID_TIMESTAMP, which is -9223372036854775808. That's why OperationTimelineGPU::is_valid() returns false.

Maybe run with -level cupti=1 to see if we skip any operations.
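
If that's what's happening, the pieces would line up like this (a minimal sketch, not Realm's actual definitions, just to show where the 9223372036854775808 in the record above would come from):

#include <cassert>
#include <cstdint>

// Sketch of an OperationTimelineGPU-like measurement whose timestamps
// default to INVALID_TIMESTAMP until the CUPTI activity records fill them in.
struct TimelineSketch {
  static constexpr int64_t INVALID_TIMESTAMP = INT64_MIN;  // -9223372036854775808
  int64_t start_time = INVALID_TIMESTAMP;
  int64_t end_time   = INVALID_TIMESTAMP;

  bool is_valid() const {
    return start_time != INVALID_TIMESTAMP && end_time != INVALID_TIMESTAMP;
  }
};

int main()
{
  TimelineSketch t;        // never filled in
  assert(!t.is_valid());

  // Reinterpreted as an unsigned 64-bit value, INT64_MIN shows up as
  // 9223372036854775808 (0x8000000000000000), exactly the gpu_start/gpu_stop
  // values in the record above.
  assert(static_cast<uint64_t>(t.start_time) == 9223372036854775808ULL);
  return 0;
}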

@muraj

muraj commented Dec 4, 2024

@elliottslaughter can you send me the application that this occurs on? This shouldn't happen, since we push the CUPTI correlation activity record for each GPU task, so it'll show up in the CUPTI buffer when we flush (when the task completes). Until the buffer is flushed, yes, the GPU timestamps will be invalid since they aren't known yet.
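
For context, the flow being described is the standard CUPTI activity-buffer mechanism, roughly like the sketch below (generic CUPTI calls, not Realm's actual code; the kernel record struct version varies by CUPTI release, and the correlation bookkeeping that ties a record back to a specific Realm task is omitted):

#include <cupti.h>
#include <cstdio>
#include <cstdlib>

// CUPTI asks for a buffer to record activity into.
static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                      size_t *max_num_records)
{
  *size = 1 << 20;                 // 1 MiB
  *buffer = (uint8_t *)malloc(*size);
  *max_num_records = 0;            // as many records as fit
}

// CUPTI hands back a completed buffer; walk it for kernel records.
static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                      uint8_t *buffer, size_t size,
                                      size_t valid_size)
{
  CUpti_Activity *record = NULL;
  while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
    if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
      const CUpti_ActivityKernel4 *k = (const CUpti_ActivityKernel4 *)record;
      // These device-side timestamps are what would become gpu_start/gpu_stop;
      // correlationId is how they get matched back to the launching task.
      printf("corr=%u start=%llu end=%llu\n", k->correlationId,
             (unsigned long long)k->start, (unsigned long long)k->end);
    }
  }
  free(buffer);
}

int main()
{
  cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

  // ... launch GPU work here ...

  // If a task's records never make it into a flushed buffer, its GPU
  // timestamps stay at their invalid defaults.
  cuptiActivityFlushAll(0);
  return 0;
}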

@lightsighter
Contributor

lightsighter commented Dec 6, 2024

I think Elliott was seeing it on Regent's implementation of Pennant. I saw it once too when running Pennant C++, but that was on the only run I was able to get through the machine without hitting #1802 or #1803, so I wouldn't suggest trying Pennant C++ until those are fixed.

@elliottslaughter
Contributor Author

@muraj I have a reproducer sitting on Sapling here: /scratch/eslaught/legion-retreat-2024/language/bug1800

It includes a script to run the application:

sbatch --nodes 1 sbatch_pennant.sh

Then you just run the profiler on the resulting logs to see it crash.

If you want a reproducer you can rebuild, let me know and I can provide those instructions as well.
