@devreal
Contributor

@devreal devreal commented Jun 20, 2024

cuMemcpyAsync and cuStreamSynchronize take a CUstream, not a pointer to CUstream.

Artifact of #12617

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal devreal requested a review from wenduwan June 20, 2024 14:52
@wenduwan
Contributor

@devreal Thanks. Curious if this was caught by the compiler or something else?

Meanwhile let me run our CI to double check.

@devreal
Contributor Author

devreal commented Jun 20, 2024

That was a compiler warning I saw when compiling against CUDA. Not sure how that doesn't crash; maybe we never hit this code path.
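
For readers without a CUDA toolchain at hand, the kind of mistake described above can be sketched in plain C. This is only an illustration under stated assumptions: `fake_stream_t`, `fake_memcpy_async`, `fake_stream_synchronize`, and `copy_and_wait` are hypothetical stand-ins, not the CUDA driver API and not the actual Open MPI code. The one fact it relies on is that `CUstream` is itself an opaque pointer type, so passing a `CUstream *` where a `CUstream` is expected typically produces only an incompatible-pointer-type warning in C rather than a hard error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for CUstream: an opaque handle that is a pointer type. */
struct fake_stream_st { int id; int pending; };
typedef struct fake_stream_st *fake_stream_t;

/* Mimics the shape of cuMemcpyAsync: the stream handle is passed by value. */
static int fake_memcpy_async(void *dst, const void *src, size_t n,
                             fake_stream_t stream)
{
    memcpy(dst, src, n);
    stream->pending++;      /* record one outstanding operation */
    return 0;
}

/* Mimics the shape of cuStreamSynchronize: also takes the handle by value. */
static int fake_stream_synchronize(fake_stream_t stream)
{
    stream->pending = 0;    /* everything queued on the stream is done */
    return 0;
}

/* Hypothetical wrapper that receives a POINTER to the stream handle. */
static int copy_and_wait(void *dst, const void *src, size_t n,
                         fake_stream_t *stream)
{
    assert(stream != NULL && *stream != NULL);

    /* Buggy form (compiles with a warning, then misinterprets the address
     * of the handle as the handle itself):
     *     fake_memcpy_async(dst, src, n, stream);
     * Fixed form: dereference the pointer to obtain the handle. */
    int rc = fake_memcpy_async(dst, src, n, *stream);
    if (rc != 0) {
        return rc;
    }
    return fake_stream_synchronize(*stream);
}
```

A caller would hold a `fake_stream_t h` and invoke `copy_and_wait(dst, src, len, &h)`. Building with warnings treated as errors for this class of diagnostic is one way to catch the buggy form at compile time.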

@bosilca bosilca changed the title from "accelerator/cuca: Dereference pointer to stream" to "accelerator/cuda: Dereference pointer to stream" Jun 20, 2024
@devreal
Contributor Author

devreal commented Jun 20, 2024

@janjust It looks like the NVIDIA CI ran out of disk space.

@bosilca
Member

bosilca commented Jun 20, 2024

It crashes all over the place. I was just starting to investigate, but it looks legit.

==== backtrace (tid:2025366) ====
 0 0x000000000044fc34 cudbgMain()  ???:0
 1 0x000000000021ee0c cuEGLApiInit()  ???:0
 2 0x0000000000431e68 cudbgMain()  ???:0
 3 0x0000000000133d64 cuMemGetAttribute_v2()  ???:0
 4 0x000000000029c3c8 cuMemsetD2D8Async()  ???:0
 5 0x0000000000003f30 accelerator_cuda_memcpy()
 6 0x000000000026fdf0 mca_coll_accelerator_memcpy()  
 7 0x000000000026ff98 mca_coll_accelerator_allreduce()  
 8 0x00000000000cf920 PMPI_Allreduce()  
 9 0x0000000000403068 main()  

Member

@bosilca bosilca left a comment

Fixes the CUDA runs.

@bosilca bosilca merged commit 9f58fd2 into open-mpi:main Jun 20, 2024