
Realm: Assertion `finder != device_functions.end()' failed #1682

Closed
syamajala opened this issue Apr 7, 2024 · 14 comments
Labels: cudart_hijack, Realm, S3D

Comments

@syamajala
Contributor

I'm hitting the following assertion in Realm:

s3d.x: /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb_v2/legion/runtime/realm/cuda/cuda_module.cc:3945: CUfunc_st* Realm::Cuda::GPU::lookup_function(const void*): Assertion `finder != device_functions.end()' failed.

This only seems to be happening on Perlmutter. I was able to run on blaze and sapling without any problems. I tried CUDA 11.7, 12.0, and 12.2 on Perlmutter, but they all have the same issue.

Here is a stack trace:

#0  0x00007f50186e5121 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007f50186eae43 in nanosleep () from /lib64/libc.so.6
#2  0x00007f50186ead5a in sleep () from /lib64/libc.so.6
#3  0x00007f500195089a in Realm::realm_freeze (signal=6) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/runtime_impl.cc:206
#4  <signal handler called>
#5  0x00007f5018653d2b in raise () from /lib64/libc.so.6
#6  0x00007f50186553e5 in abort () from /lib64/libc.so.6
#7  0x00007f501864bc6a in __assert_fail_base () from /lib64/libc.so.6
#8  0x00007f501864bcf2 in __assert_fail () from /lib64/libc.so.6
#9  0x00007f50019e5797 in Realm::Cuda::GPU::lookup_function (this=0x9b3c2e0,
    func=0x7f50067abdfa <Realm::Cuda::ReductionKernels::apply_cuda_kernel<Legion::Internal::AddCudaReductions<Legion::SumReduction<double> >, true>(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, Legion::Internal::AddCudaReductions<Legion::SumReduction<double> >)>) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/cuda/cuda_module.cc:3945
#10 0x00007f5001a1eae5 in Realm::Cuda::GPUreduceXferDes::GPUreduceXferDes (this=0x7f32ec0ec170, _dma_op=139858631131232, _channel=0xc6f2d70, _launch_node=0, _guid=945, inputs_info=..., outputs_info=..., _priority=0, _redop_info=...)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/cuda/cuda_internal.cc:2132
#11 0x00007f5001a20620 in Realm::Cuda::GPUreduceChannel::create_xfer_des (this=0xc6f2d70, dma_op=139858631131232, launch_node=0, guid=945, inputs_info=..., outputs_info=..., priority=0, redop_info=..., fill_data=0x0, fill_size=0,
    fill_total=0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/cuda/cuda_internal.cc:2533
#12 0x00007f50013416eb in Realm::SimpleXferDesFactory::create_xfer_des (this=0xc6f2d98, dma_op=139858631131232, launch_node=0, target_node=0, guid=945, inputs_info=..., outputs_info=..., priority=0, redop_info=..., fill_data=0x0,
    fill_size=0, fill_total=0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/channel.cc:4686
#13 0x00007f5001380725 in Realm::TransferOperation::create_xds (this=0x7f3360070060) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5594
#14 0x00007f500137dfd6 in Realm::TransferOperation::allocate_ibs (this=0x7f3360070060) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5217
#15 0x00007f500137d13c in Realm::TransferOperation::start_or_defer (this=0x7f3360070060) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5059
#16 0x00007f50013a1e2c in Realm::IndexSpace<2, long long>::copy (this=0x7f35aafc7360, srcs=..., dsts=..., indirects=..., requests=..., wait_on=..., priority=0)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5723
#17 0x00007f5006ac7a19 in Realm::IndexSpace<2, long long>::copy (this=0x7f35aafc7360, srcs=..., dsts=..., requests=..., wait_on=..., priority=0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/indexspace.inl:903
#18 0x00007f5006ab046e in Legion::Internal::IndexSpaceExpression::issue_copy_internal<2, long long> (this=0x7f36800e4bb0, forest=0xc6fd1e0, op=0x7f365c01f520, space=..., trace_info=..., dst_fields=..., src_fields=..., reservations=...,
    precondition=..., pred_guard=..., src_unique=..., dst_unique=..., collective=Legion::Internal::COLLECTIVE_NONE, priority=0, replay=false) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/region_tree.inl:239
#19 0x00007f5006a8dff4 in Legion::Internal::IndexSpaceNodeT<2, long long>::issue_copy (this=0x7f36800e4880, op=0x7f365c01f520, trace_info=..., dst_fields=..., src_fields=..., reservations=..., precondition=..., pred_guard=...,
    src_unique=..., dst_unique=..., collective=Legion::Internal::COLLECTIVE_NONE, priority=0, replay=false) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/region_tree.inl:4988
#20 0x00007f500648a5e7 in Legion::Internal::IndividualView::copy_from (this=0x7f3360191320, src_view=0x7f32ec1ab560, precondition=..., predicate_guard=..., reduction_op_id=1048587, copy_expression=0x7f36800e4bb0, op=0x7f365c01f520,
    index=0, collective_match_space=13, copy_mask=..., src_point=0x7f32ec1a83c0, trace_info=..., recorded_events=..., applied_events=..., across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_views.cc:2471
#21 0x00007f5005e6d4ba in Legion::Internal::CopyFillAggregator::issue_copies (this=0x7f33601975a0, target=0x7f3360191320, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true,
    restricted_output=false, dst_events=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:7328
#22 0x00007f5005e6ba20 in Legion::Internal::CopyFillAggregator::perform_updates (this=0x7f33601975a0, updates=..., trace_info=..., precondition=..., recorded_events=..., redop_index=0, manage_dst_events=true, restricted_output=false,
    dst_events=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:7053
#23 0x00007f5005e6b2b6 in Legion::Internal::CopyFillAggregator::issue_updates (this=0x7f33601975a0, trace_info=..., precondition=..., restricted_output=false, manage_dst_events=true, dst_events=0x0, stage=0)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:6911
#24 0x00007f5005e77cfc in Legion::Internal::UpdateAnalysis::perform_updates (this=0x7f3360190d40, perform_precondition=..., applied_events=..., already_deferred=false)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:9294
#25 0x00007f5006506ef7 in Legion::Internal::RegionTreeForest::physical_perform_updates (this=0xc6fd1e0, req=..., version_info=..., op=0x7f365c01f520, index=0, precondition=..., term_event=..., targets=..., sources=..., trace_info=...,
    map_applied_events=..., analysis=@0x7f336011e620: 0x7f3360190d40, log_name=0x457c7a0 "copy_global", uid=6150, collective_rendezvous=false, record_valid=true, check_initialized=true, defer_copies=true)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/region_tree.cc:1923
#26 0x00007f500639e22f in Legion::Internal::SingleTask::map_all_regions (this=0x7f365c01f340, must_epoch_op=0x0, defer_args=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:4253
#27 0x00007f50063ab02e in Legion::Internal::PointTask::perform_mapping (this=0x7f365c01f340, must_epoch_owner=0x0, args=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:7289
#28 0x00007f50063bc991 in Legion::Internal::SliceTask::perform_mapping (this=0x7f333c0c6b40, epoch_owner=0x0, args=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:11453
#29 0x00007f50063a4ea7 in Legion::Internal::MultiTask::trigger_mapping (this=0x7f333c0c6b40) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:5607
#30 0x00007f50066263bd in Legion::Internal::Runtime::legion_runtime_task (args=0x7f33600a2560, arglen=12, userdata=0xc726700, userlen=8, p=...) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/runtime.cc:32276
#31 0x00007f500192987b in Realm::LocalTaskProcessor::execute_task (this=0xc67d260, func_id=4, task_args=...) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/proc_impl.cc:1176
#32 0x00007f500199f21c in Realm::Task::execute_on_processor (this=0x7f33600a23e0, p=...) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:326
#33 0x00007f50019a31e4 in Realm::KernelThreadTaskScheduler::execute_task (this=0xc67d600, task=0x7f33600a23e0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1421
#34 0x00007f50019a202c in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xc67d600) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1160
#35 0x00007f50019a265a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0xc67d600) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1272
#36 0x00007f50019a9508 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0xc67d600)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/threads.inl:97
#37 0x00007f50019b58bf in Realm::KernelThread::pthread_entry (data=0x7f35a0130e20) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/threads.cc:831
#38 0x00007f501c1b76ea in start_thread () from /lib64/libpthread.so.0
#39 0x00007f501872149f in clone () from /lib64/libc.so.6
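For context, the assertion fires inside a lookup of the host-side kernel pointer in a per-GPU table of registered device functions. A minimal sketch of that pattern, with illustrative names rather than Realm's actual code:

#include <cassert>
#include <map>

// Illustrative stand-in for the CUDA driver's function handle type.
struct CUfunc_st;
typedef CUfunc_st *CUfunction;

struct GPU {
  // Populated when kernels are registered with the runtime, e.g. via the
  // cudart hijack's registration callbacks at program startup.
  std::map<const void *, CUfunction> device_functions;

  CUfunction lookup_function(const void *func) {
    std::map<const void *, CUfunction>::iterator finder =
        device_functions.find(func);
    // If the kernel was never registered (for example because the real
    // libcudart handled registration instead of the hijack), the table is
    // empty and this is the assertion that fails.
    assert(finder != device_functions.end());
    return finder->second;
  }
};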
@lightsighter
Contributor

@muraj Can you take a look at this?

@syamajala
Contributor Author

Could really use some help with this, as I'm trying to get some Gordon Bell runs done on Perlmutter.

@eddy16112
Contributor

After talking with @syamajala, we figured out that the issue is that we build Realm with CUDART_HIJACK=ON, but the Cray compiler wrappers link cudart automatically, so the hijack is not active at runtime and none of these kernels get registered with Realm.
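To illustrate the failure mode: a cudart hijack works by exporting its own copies of the cudart registration entry points, so that the fat-binary registration code run at program startup records each host-side kernel pointer in the runtime's own table. The following is a rough sketch of that interception idea only; it is not Realm's actual implementation, __cudaRegisterFunction is an undocumented cudart-internal entry point, and the parameter list here is simplified:

#include <map>
#include <string>

// Illustrative registry keyed by the host-side kernel stub pointer.
static std::map<const void *, std::string> registered_kernels;

// A hijack-style shim exports this symbol itself. If the real libcudart is
// linked ahead of the shim (as the Cray wrappers do by adding -lcudart),
// NVIDIA's implementation is called instead, this registry stays empty, and
// a later lookup by host pointer fails.
extern "C" void __cudaRegisterFunction(void **fat_cubin_handle,
                                       const char *host_fun,
                                       char *device_fun,
                                       const char *device_name,
                                       int thread_limit,
                                       void *tid, void *bid,
                                       void *bdim, void *gdim, int *wsize) {
  (void)fat_cubin_handle; (void)device_fun; (void)thread_limit;
  (void)tid; (void)bid; (void)bdim; (void)gdim; (void)wsize;
  registered_kernels[static_cast<const void *>(host_fun)] = device_name;
}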

@syamajala
Contributor Author

Yeah, the Cray wrappers are broken and provide no way to turn off linking against cudart. I opened a NERSC ticket about this a year ago. Their solution was for me to link things manually and just remove -lcudart from the flags, and then they closed the ticket.

I am able to do my runs now.

@syamajala
Contributor Author

@elliottslaughter

syamajala reopened this Oct 10, 2024
@muraj

muraj commented Oct 11, 2024

@syamajala is there a reason you are using the cudart hijack? Is there something that Realm can do to make the cudart hijack not necessary for your use case?

@syamajala
Contributor Author

#1059 is why we still need the hijack.

@muraj

muraj commented Oct 11, 2024

Roger that. @elliottslaughter, @magnatelee, can we get progress on #1059 so we can close out all these related issues?

@syamajala I don't think there is anything here that Realm or Legion can do to resolve this conflict, as this is a known limitation of the cudart hijack. Can we close this issue and direct you to helping prioritize #1059 so we can remove the cudart hijack?

@syamajala
Contributor Author

I think we should keep it open for now because I had completely forgotten what happens when we try to build/run the TDB branch of S3D on Perlmutter. I will likely forget again. We can close it when #1059 is fixed.

@muraj

muraj commented Oct 11, 2024

Understood. I'll at least label this with a cudart_hijack label so it won't be prioritized until #1059 is dealt with.

muraj added the cudart_hijack label on Oct 11, 2024
@elliottslaughter
Contributor

This bug is open because I urgently need to run S3D on Perlmutter for unrelated reasons. The linking hack is obnoxious enough that I might just rip out the CUDA hijack in Regent, but it will depend on what ends up being easier.

@elliottslaughter
Contributor

Now that https://gitlab.com/StanfordLegion/legion/-/merge_requests/1502 is available I plan to retest this in that branch, and if resolved, I'll close this issue.

@elliottslaughter
Contributor

The Regent changes have merged. There are still a few application-level changes required to run properly without the hijack (because the application includes hand-written CUDA kernels), and I am currently testing those.
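For reference, running without the hijack means the hand-written kernels go through the ordinary CUDA runtime, launched on whatever stream the task is given. The sketch below is hypothetical only; how the stream is obtained is application-specific, and gpu_task_body is a placeholder, not a specific Legion or Realm API:

#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

// Placeholder task body: 'stream' stands in for whatever mechanism the
// application uses to get the task's CUDA stream.
void gpu_task_body(cudaStream_t stream, int n, float a,
                   const float *d_x, float *d_y) {
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  // Plain runtime launch; with no hijack in the picture, the kernel was
  // registered with the real libcudart at startup as usual.
  saxpy<<<blocks, threads, 0, stream>>>(n, a, d_x, d_y);
}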

@elliottslaughter
Contributor

I've ported the application to avoid needing the hijack as well, so I think we're done here. The error does not occur if the application does not use the hijack.
