Error: ../Common/CUDA/TIGRE_common.cpp (14): CBCT:CUDA:Atb an illegal memory access was encountered #634

ecoArcGaming · 2025-01-30T15:56:06Z

Hi, I am using the latest TIGRE Python (finally installed after many struggles...), and when I tried to run some FDK reconstruction scripts, I encountered the following error:
../Common/CUDA/TIGRE_common.cpp (7): Main loop fail
../Common/CUDA/TIGRE_common.cpp (14): CBCT:CUDA:Atb an illegal memory access was encountered

No further detail was provided by the interpreter. I tried some of the fixes in this issue: #501 to no avail. I am running this on 1 of the 4 A6000 GPUs on a cluster, not sure if that is relevant. How can I resolve this error? Thanks.

AnderBiguri · 2025-01-30T16:10:56Z

Is this the first time that happens, or are you running some ML-pipeline?

There seems to be an issue when running ML-type things (i.e. when calling Atb/FDK thousands of times) #617

ecoArcGaming · 2025-01-30T16:15:16Z

It is a pipeline but the very first call would terminate and return this error. I am not getting CUDA out of memory but CUDA illegal memory access. I'm not sure if they have the same cause.

I also tried different GPUs, and using all GPUs on a node instead of just 1, but still have the same issue.

AnderBiguri · 2025-01-30T16:26:25Z

@ecoArcGaming no it should not be an out of memory issue, TIGRE deals well with smaller GPU memories than the problem at hand.

So you tried on various GPUs? Can you tell me which ones?
Also, what is the CUDA version you are using?

Just trying to pinpoint the error

ecoArcGaming · 2025-01-30T16:31:21Z

I tried Nvidia RTX 2080TI and A6000. Here are a few package versions: cudatoolkit 11.6.2, cudnn 8.9.2.26. They are installed in my conda environment.
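
For reports like this one, the GPU models and driver version can be gathered in one go. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_report` and `query_gpus` are hypothetical and not part of TIGRE. The conda-side CUDA version would still need to be read separately, for example with `conda list cudatoolkit`.

```python
import shutil
import subprocess


def parse_gpu_report(csv_text):
    """Parse 'name, driver_version' CSV rows into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        name, driver = [field.strip() for field in line.split(",", 1)]
        gpus.append({"name": name, "driver": driver})
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU names and driver version; [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_report(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for index, gpu_info in enumerate(query_gpus()):
        print(index, gpu_info["name"], "driver", gpu_info["driver"])
```

Pasting this output next to the error message gives the hardware and driver side of the environment; the toolkit version comes from the conda listing.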

AnderBiguri · 2025-01-30T16:41:49Z

Humm, I have access to an RTX 2080 Ti I think. I'll try to run it with CUDA 11.6.2 in conda and see what happens, but it's quite strange; there should not be any issue.

I tend to have a custom CUDA installation, rather than the conda one, but this should not be an issue.

ecoArcGaming · 2025-01-30T16:51:36Z

Thanks. I tried a few things in the meantime: calling np.ascontiguousarray() on all my inputs, running the Python script with CUDA_LAUNCH_BLOCKING=1, and setting os.environ['LD_LIBRARY_PATH'] = '/usr/local/cuda/lib64'. I also tried running https://github.com/CERN/TIGRE/blob/master/Python/example.py (with the 2080 Ti and my conda env), and I got some slightly different CUDA errors:
0: NVIDIA GeForce RTX 2080 Ti
1: NVIDIA GeForce RTX 2080 Ti
2: NVIDIA GeForce RTX 2080 Ti
3: NVIDIA GeForce RTX 2080 Ti
{'name': 'NVIDIA GeForce RTX 2080 Ti', 'devices': [0, 1, 2, 3]}
../Common/CUDA/TIGRE_common.cpp (7): Texture object creation fail
../Common/CUDA/TIGRE_common.cpp (14): Ax:Siddon_projection invalid argument
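
The input-side steps tried above can be sketched as follows. This is only an illustration under stated assumptions: `as_tigre_input` is a hypothetical helper, the float32 conversion assumes the CUDA wrappers expect single precision, and the array is stand-in data rather than the real projections.

```python
import os

# Set before the first CUDA call in the process; here that is taken to mean
# "before importing any CUDA-backed package" (an assumption of this sketch).
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import numpy as np


def as_tigre_input(array):
    """Return `array` as C-contiguous float32 (hypothetical helper)."""
    return np.ascontiguousarray(array, dtype=np.float32)


# Stand-in data: a transposed float64 array is neither contiguous nor float32.
projections = np.zeros((4, 6, 8), dtype=np.float64).transpose(0, 2, 1)
prepared = as_tigre_input(projections)

print(projections.flags["C_CONTIGUOUS"], projections.dtype)
print(prepared.flags["C_CONTIGUOUS"], prepared.dtype, prepared.shape)
```

One caveat on the third step: on Linux the dynamic loader reads LD_LIBRARY_PATH when the process starts, so assigning it through os.environ inside an already running script generally does not change where shared libraries are found. Exporting it in the shell before launching Python is the usual route.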

AnderBiguri · 2025-01-30T17:05:48Z

Yes, the issue in both cases will be due to something going wrong in texture creation; it's just caught at different times. Can you post your script in minimal form (geometry, angles) so I can test it too?

ecoArcGaming · 2025-01-30T17:11:59Z

It is unfortunately a part of a larger project which I did not write. Would example.py not suffice for a minimal example for testing purposes?

AnderBiguri · 2025-01-30T17:36:16Z

The numerical value of the geometry/angles may be of importance, if you could share something like example.py but with the values you are using, that would help

ecoArcGaming · 2025-01-30T18:06:19Z

Okay, I am calling algs.fdk(prjs, geo, angles). Here is my geometry:

-----
Geometry parameters
Distance from source to detector (DSD) = 7.944359081836327 mm
Distance from source to origin (DSO)= 2.6347865868263476 mm
-----
Detector parameters
Number of pixels (nDetector) = [768 972]
Size of each pixel (dDetector) = [0.00597206 0.00597206] mm
Total size of the detector (sDetector) = [4.58653892 5.80483832] mm
-----
Image parameters
Number of voxels (nVoxel) = [501 501 501]
Total size of the image (sVoxel) = [2. 2. 2.] mm
Size of each voxel (dVoxel) = [0.00399202 0.00399202 0.00399202] mm
-----
Offset correction parameters
Offset of image from origin (offOrigin) = [0. 0. 0.] mm
Offset of detector (offDetector) = [0. 0. 0.] mm
-----
Auxillary parameters
Samples per pixel of forward projection (accuracy) = 0.5
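
As a quick sanity check, the printed geometry can be tested for internal consistency with plain NumPy. This sketch only copies the numbers shown above; the variable names are illustrative and it does not touch TIGRE itself.

```python
import numpy as np

# Values copied from the printed geometry above (all lengths in mm).
DSD = 7.944359081836327
DSO = 2.6347865868263476
n_detector = np.array([768, 972])
d_detector = np.array([0.00597206, 0.00597206])
s_detector = np.array([4.58653892, 5.80483832])
n_voxel = np.array([501, 501, 501])
s_voxel = np.array([2.0, 2.0, 2.0])
d_voxel = np.array([0.00399202, 0.00399202, 0.00399202])

# Derived sizes should match the printed ones up to print rounding.
assert np.allclose(n_detector * d_detector, s_detector, rtol=1e-5)
assert np.allclose(s_voxel / n_voxel, d_voxel, rtol=1e-5)

magnification = DSD / DSO
volume_bytes = int(np.prod(n_voxel)) * 4  # 4 bytes per float32 voxel

print(f"magnification: {magnification:.3f}")
print(f"volume size as float32: {volume_bytes / 2**20:.0f} MiB")
```

Both assertions pass for these values, so the sizes are at least self-consistent, and the volume is far below the memory of the GPUs mentioned in this thread.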

I have stored my projections and angles as two NumPy arrays in two .npy files. You can download them here:

https://drive.google.com/file/d/1JnJQAlgo9B9pvD7t8conP7V79zNpF2Vk/view?usp=sharing
https://drive.google.com/file/d/1hlmOr2snYEi1HsTCK7I75t-JYpfJh-Xm/view?usp=sharing
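
Before handing such arrays to a CUDA kernel, it can help to confirm that they agree with the geometry, since a shape or dtype mismatch is one plausible (unconfirmed) source of out-of-bounds accesses. The sketch below assumes the projection layout is (n_angles, nDetector[0], nDetector[1]) and that float32 is expected; `check_inputs` is a hypothetical helper and the arrays are small stand-ins, not the shared files.

```python
import numpy as np


def check_inputs(projections, angles, n_detector):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    expected = (angles.shape[0], int(n_detector[0]), int(n_detector[1]))
    if projections.shape != expected:
        problems.append(f"projection shape {projections.shape} != {expected}")
    if projections.dtype != np.float32:
        problems.append(f"projections are {projections.dtype}, not float32")
    if not projections.flags["C_CONTIGUOUS"]:
        problems.append("projections are not C-contiguous")
    if not np.all(np.isfinite(projections)):
        problems.append("projections contain NaN or Inf")
    return problems


# Stand-in data; with the real files one would use np.load(...) instead.
angles = np.linspace(0, 2 * np.pi, 10, endpoint=False, dtype=np.float32)
good = np.zeros((10, 16, 20), dtype=np.float32)
bad = np.zeros((10, 20, 16), dtype=np.float64)  # swapped axes, wrong dtype

print(check_inputs(good, angles, [16, 20]))
print(check_inputs(bad, angles, [16, 20]))
```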

AnderBiguri · 2025-01-30T18:13:19Z

Just in case, try with an even number of pixels.

ecoArcGaming · 2025-01-30T18:18:33Z

You mean the [501, 501, 501] in geo? Changed it to [500, 500, 500] and still have the same error.

musetee · 2025-02-03T14:22:51Z

I guess it could be a problem with the multi-GPU setup. I can create projections with TIGRE on my own PC (Windows, Python 3.9.22, CUDA 11.8) but not on our server with 4 GPUs. It reports the same error:
../Common/CUDA/TIGRE_common.cpp (7): Texture object creation fail
../Common/CUDA/TIGRE_common.cpp (14): Ax:Siddon_projection invalid argument

AnderBiguri · 2025-02-03T14:46:22Z

@musetee Interesting. What about with some size that is divisible by 4, like 512^3?

musetee · 2025-02-03T14:54:27Z

Yes, I used this geometry from the r2_gaussian project: https://github.com/Ruyi-Zha/r2_gaussian/tree/main
By the way, I have also set it up on another server with almost the same environment (Windows, Python 3.9.22, CUDA 11.8, torch 1.12.1, MSVC 2019) but with two GPUs, and it works with only this warning:
../Common/CUDA/TIGRE_common.cpp (18): Ax:Siddon_projection:GPUselect Detected one (or more) different GPUs.
This code is not smart enough to separate the memory GPU wise if they have different computational times or memory limits.
First GPU parameters used. If the code errors you might need to change the way GPU selection is performed.

# Mode
mode: cone # X-ray source mode parallel/cone
filter: null

# System configuration
DSD: 7.0 # Distance Source Detector
DSO: 5.0 # Distance Source Origin

# Detector parameters
nDetector: # Number of pixels (Note: [v, u] not [u,v])
  - 512
  - 512
sDetector: # Size of image (not pixel)
  - 4.0
  - 4.0

# Image parameters
nVoxel: # Number of voxels [x, y, z]
  - 256
  - 256
  - 256
sVoxel: # size of volume (not voxel)
  - 2.0
  - 2.0
  - 2.0

# Offsets
offOrigin: # Offset of image from origin
  - 0 # x direction
  - 0 # y direction
  - 0 # z direction
offDetector: # Offset of Detector (only in two direction)
  - 0 # u direction
  - 0 # v direction

# Auxiliary
accuracy: 0.5 # Accuracy of FWD proj

# Angles
totalAngle: 360.0 # Total angle (degree)
startAngle: 0.0 # Start angle (degree)

# Noise
noise: true
possion_noise: 10000 # lambda for possion
gaussian_noise: # mean and std for gaussian
  - 0 # mean
  - 10 # std

AnderBiguri · 2025-02-03T17:05:56Z

@musetee So it also fails on 4 GPUs with 256^3, but works well on 2 GPUs?

Indeed, this almost surely looks like an error in the logic for splitting the problem across 4 GPUs, but I don't seem to be able to reproduce it.
I'll keep trying; I will have access to another machine with 4 GPUs soon.

AnderBiguri · 2025-02-03T17:09:40Z

In the meantime, @ecoArcGaming, can you try limiting your use to 2 GPUs to see if that works? You can use the GPU selection API that TIGRE comes with to select just a couple.

ecoArcGaming · 2025-02-03T17:10:05Z

Sounds good. I will try that and let you know if it works.

ecoArcGaming · 2025-02-03T20:50:28Z

This did not work for me. I did

from tigre.utilities import gpu
import tigre.algorithms as algs

gpuids = gpu.getGpuIds('A6000')
gpuids.devices = [0]  # keep only device 0
print(gpuids) # {'name': 'A6000', 'devices': [0]}
vol = algs.fdk(projs, geo, angles, gpuid = gpuids)

Which still gives:

../Common/CUDA/TIGRE_common.cpp (7): Main loop fail
../Common/CUDA/TIGRE_common.cpp (14): CBCT:CUDA:Atb an illegal memory access was encountered
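
A library-independent way to restrict the GPUs a process can see is the CUDA_VISIBLE_DEVICES environment variable. The sketch below is a minimal illustration, not TIGRE's own selection mechanism; the choice of devices "0,1" is arbitrary, and the guarded import is only there so the snippet also runs where TIGRE is not installed.

```python
import os

# Expose only physical GPUs 0 and 1 to this process. This has to happen before
# CUDA is initialised, so it is set before importing any CUDA-backed package.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

try:
    from tigre.utilities import gpu  # same helper module used in the post above
except ImportError:
    gpu = None  # lets the sketch run on machines without TIGRE

if gpu is not None:
    # Inside the process the visible devices are renumbered starting from 0.
    print(gpu.getGpuNames())
else:
    print("TIGRE not available; visible devices:", os.environ["CUDA_VISIBLE_DEVICES"])
```

A related point that may be worth checking: TIGRE's demo scripts appear to pass the selection with the keyword `gpuids=` (plural). If that is the name a given version expects, a differently spelled keyword such as `gpuid=` could be silently ignored, which would leave all GPUs in use. This is an assumption to verify against the installed version, not a confirmed diagnosis.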

AnderBiguri · 2025-02-03T21:49:16Z

@ecoArcGaming I wonder if this is an A6000 specific issue... I'll keep looking.

musetee · 2025-02-04T13:11:44Z

That makes sense! We have 2 A6000s, 1 A4000 and 1 A5000 on that server (none of them works). The driver version is 555.85.
The other server, which works, has one Quadro P5000 and one TITAN Xp, and its driver version is 461.33.

musetee · 2025-02-04T13:12:37Z

My own PC has one 3090 Ti and it works fine.
