
Invalid CUDA code: host computation on device buffer after GPU transformations #1773

Open
edopao opened this issue Nov 18, 2024 · 0 comments


The SDFG below produces invalid CUDA code after applying `apply_gpu_transformations`:
(screenshot of the SDFG omitted)

The problem is that the `tlet_1_scalar_expr` nodes in the second nested level result in a symbolic expression that is evaluated in host code, while its result is written to a device buffer in GPU global memory. Since host code cannot dereference a device pointer, calling the SDFG causes a segmentation fault.

Generated host code:

```cpp
int * __tmp3;
DACE_GPU_CHECK(cudaMalloc((void**)&__tmp3, __out_size_0 * sizeof(int)));

int __tmp0;
__tmp0 = (__tmp1 * __tmp1);
```
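
For context, here is a minimal standalone CUDA sketch of why this pattern fails (not DaCe-generated code; `dev_buf` and `host_val` are made-up names mirroring `__tmp3` and `__tmp0`): a value computed on the host has to be moved into a `cudaMalloc`'d buffer with an explicit copy, because dereferencing the device pointer from host code crashes.

```cpp
#include <cuda_runtime.h>

int main() {
    int *dev_buf = nullptr;                 // device buffer, analogous to __tmp3
    cudaMalloc((void **)&dev_buf, sizeof(int));

    int host_val = 5 * 5;                   // host-side scalar expression, analogous to __tmp0

    // Invalid (what the generated code effectively does, per the description above):
    // dev_buf holds a device address, so a host-side store through it segfaults.
    // dev_buf[0] = host_val;

    // Valid: move the host result to the device buffer with an explicit copy.
    cudaMemcpy(dev_buf, &host_val, sizeof(int), cudaMemcpyHostToDevice);

    cudaFree(dev_buf);
    return 0;
}
```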

This SDFG can be reproduced from the GT4Py test case:
tests/next_tests/integration_tests/feature_tests/ffront_tests/test_execution.py::test_double_use_scalar

The current workaround is to run the simplify pass before calling `apply_gpu_transformations`, so that `InlineSDFGs` will bring the SDFG to a canonical form (see GridTools/gt4py#1741).
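
A minimal sketch of that ordering in Python, assuming `sdfg` is the `dace.SDFG` built for the GT4Py program (`prepare_for_gpu` is a hypothetical helper; the two method names exist on `dace.SDFG` in current DaCe releases):

```python
import dace

def prepare_for_gpu(sdfg: dace.SDFG) -> dace.SDFG:
    # Workaround: simplify first, so that InlineSDFGs flattens the nested
    # SDFGs into a canonical form ...
    sdfg.simplify()
    # ... and only then apply the GPU transformations; per the workaround
    # above, code generation then no longer emits the host-side write into
    # the device buffer.
    sdfg.apply_gpu_transformations()
    return sdfg
```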
