
Artifacts in 3D-TV Denoised Images #594

Closed · zezisme opened this issue Oct 17, 2024 · 30 comments · Fixed by #609

Comments

@zezisme commented Oct 17, 2024

Hello, I found that when using im3DDenoise to denoise 3D images, as the number of iterations increases, the edge slices develop image artifacts, and these artifacts seem to come from other slices. Is there any reference document for this algorithm? Could you provide the corresponding mathematical derivation? I want to know whether the issue comes from the mathematical principle or from its implementation.

Actual Behavior

TV_lambda = 200;

FDK: [image]
FDK+TVdenoise (50 iterations): [image]
FDK+TVdenoise (100 iterations): [image]
FDK+TVdenoise (200 iterations): [image]

The artifact seems to come from the 267th slice (or near it; the total number of slices is 400), which suggests a ratio of about 1/3 ((400-267)/400 ≈ 1/3). [image]

Code to reproduce the problem (If applicable)

x=FDK(rawdata_proj,geo,angles,'filter',filter_type);
% TV denoising
TV_niter = 200; % 50,100,200,..
TV_lambda = 200; % 10,20,.....200
x=im3DDenoise(x,'TV',TV_niter,TV_lambda,'gpuids',GpuIds());

Specifications

  • MATLAB/python version:
  • OS:
  • CUDA version:
@AnderBiguri (Member) commented

Interesting. Is the ghost you get the first slice? This was coded 10 years ago, so I don't exactly remember the implementation, but it could be caused by periodic boundary conditions.

@AnderBiguri (Member) commented

In any case, you can find the math and the relevant paper in my PhD thesis (on my GitHub profile), and the code is in Common/CUDA/tvdenoising.cu. It does not seem to have periodic boundary conditions. Can you identify which slice this ghost is coming from? Can you also tell me the size of the image and the number of GPUs (including models)? Can you reproduce this with a small image, on a single GPU, using TIGRE's default data?

@zezisme (Author) commented Oct 17, 2024

> In any case, you can find the math and the relevant paper in my PhD thesis […] Can you reproduce this with a small image, on a single GPU, using TIGRE's default data?

CASE 1: input image size = (1529, 1529, 400), and I have just one GPU.
The input image (slice 400, the last slice; no ghost): [image]
Using the default settings (iter = 50; hyper = 15; single GPU), the output denoised images are:
The 400th slice (there is an obvious ghost): [image]
The 399th slice: [image]
The 398th slice: [image]
The ghost becomes weaker from slice to slice, and it seems to come from the 267th slice: [image]

CASE 2: input image size = (1529, 1529, 100), using the default settings (iter = 50; hyper = 15; single GPU).
The output image (last slice) has no ghost (great!): [image]

CASE 3: input image size = (1529, 1529, 200), using the default settings (iter = 50; hyper = 15; single GPU).
The output image (last slice) has a ghost again: [image]
And the ghost seems to come from the 100th slice: [image]

Final test: I find that if the GPU memory is not enough, there is a warning, the algorithm falls back to staging through CPU memory, and the ghost appears! If the GPU memory is enough, the ghost does not occur. [image] [image]

Conclusion:
This bug is most likely caused by a problem with the data handling during communication between the CPU and GPU.

@AnderBiguri (Member) commented

Fantastic experiment; indeed, this is why I asked about GPU size etc. It seems that something is broken in the memory in/out path, where either the memory doesn't get appropriately reset or the chunk of memory being copied is wrong. I will investigate.

@zezisme (Author) commented Oct 17, 2024

> Fantastic experiment; indeed, this is why I asked about GPU size etc. […] I will investigate.

Great! Looking forward to your solution.

@AnderBiguri (Member) commented

A possibility:

https://github.com/CERN/TIGRE/blob/master/Common/CUDA/tvdenoising.cu#L378-L389

may need to be changed to:

for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    // New line:
    cudaMemsetAsync(d_src[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    // All these are async
    // New line:
    cudaMemsetAsync(d_u[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_u[dev]+offset_device[dev], d_src[dev]+offset_device[dev], bytes_device[dev]*sizeof(float), cudaMemcpyDeviceToDevice, stream[dev*nStream_device+1]);
    cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
    cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
    cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
}

I cannot try it now, but if you want to try that, great!
If not, give me a bit of time and I'll give it a shot myself.

@zezisme (Author) commented Oct 18, 2024

> A possibility: […] If not, give me a bit of time and I'll give it a shot myself.

I have tested your new code, but the result is the same as before: the same ghost is still there.

@AnderBiguri (Member) commented

Okay, I will investigate further, but this may be hard to fix: in all my years I have never seen this happen, which means it's a relatively niche case, and I am not sure I will be able to reproduce it. I'll try!

@AnderBiguri (Member) commented

Can you tell me exactly which GPU you have? I can try to simulate your machine if I know exactly how much memory your GPU has.

@zezisme (Author) commented Oct 18, 2024

> Can you tell me exactly which GPU you have? […]

Here is my GPU information! Thank you very much; I hope you can fix this bug as soon as possible. In addition, I am also learning CUDA programming, so that I may be able to fix the problem myself. [image]

@zezisme (Author) commented Oct 19, 2024

> Okay, I will investigate further, but this may be hard to fix […] I'll try!

Hello @AnderBiguri:
I seem to have found the bugs. After reading the source code carefully, I found two problems:
(1) For iterations > 1, the device memory is not reset to zero, which matters most for the first and last chunks (because buffer_pixels exists, and the last chunk is not large enough to fill the whole allocation);
(2) After each split iteration sp, cudaMemcpyDeviceToHost updates h_u, h_px, h_py, and h_pz. The next split iteration (sp+1) then performs cudaMemcpyHostToDevice, but since the amount of copied data is

bytes_device[dev] = curr_pixels + buffer_pixels*!is_first_chunk + buffer_pixels*!is_last_chunk;

the host data updated by the previous split iteration is partially copied into iteration sp+1, causing data aliasing.
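To make problem (1) concrete, here is a minimal MATLAB sketch with purely hypothetical chunk sizes (the real split sizes are computed inside tvdenoising.cu):

% Sketch of problem (1), with assumed sizes: the device allocation is sized
% for a full chunk plus buffers, so a smaller last chunk leaves stale slices.
slices_per_chunk = 150;                           % assumed size of a full chunk
last_chunk       = 100;                           % assumed size of the last chunk
stale_slices     = slices_per_chunk - last_chunk  % 50 slices keep old data
% Unless the device memory is zeroed before the copy, those 50 slices still
% hold data from the previously processed chunk and reappear as the ghost.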

The first one is probably the main reason for the ghost, and I tried inserting new lines into the following block:

for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+2]);
    cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+3]);
    cudaMemcpyAsync(d_py[dev]+offset_device[dev], h_py+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+4]);
    cudaMemcpyAsync(d_pz[dev]+offset_device[dev], h_pz+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}

The new code:

for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    // New line:
    cudaMemsetAsync(d_u[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+2]);
    // New line:
    cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+2]);
    cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+3]);
    // New line:
    cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+3]);
    cudaMemcpyAsync(d_py[dev]+offset_device[dev], h_py+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+4]);
    // New line:
    cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+4]);
    cudaMemcpyAsync(d_pz[dev]+offset_device[dev], h_pz+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    // New line:
    cudaMemsetAsync(d_src[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}

It just adds a line that sets d_u, d_px, d_py, d_pz, and d_src to zero before the new values are copied in. And finally, this attempt works! There is no ghost, whatever parameters I use. [image]

Of course, there may be a more efficient solution, but I haven't tried one yet; and I haven't thought of a good solution for problem (2).

@zezisme
Copy link
Author

zezisme commented Oct 19, 2024

Here comes a new problem: for the first and last slices, since 0 is used as the buffer data, there is obvious noise.
The first and last slices (with noise): [image] [image]
A middle slice (smooth, without noise): [image]

Could we use the adjacent slices, flipped along the z-axis, as the buffer data, so that the gradient of the data stays smooth? I am not good at CUDA programming yet, so I don't know how to implement it.
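A minimal MATLAB sketch of the proposed buffer fill (the names are illustrative, not TIGRE's internal ones):

% Fill the z-buffers with mirrored copies of the edge slices instead of zeros,
% so the z-gradient at the ends of the volume stays smooth.
x = rand(16,16,400);                              % stand-in volume
buffer_length = 2;                                % assumed buffer width
head = flip(x(:,:,1:buffer_length), 3);           % mirrored leading slices
tail = flip(x(:,:,end-buffer_length+1:end), 3);   % mirrored trailing slices
xbuf = cat(3, head, x, tail);                     % chunk as the GPU would see it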

@zezisme (Author) commented Oct 20, 2024

I have tried using flipped copies of the first and last slices as the buffer data; here are the changes (note: multi-GPU is not considered here). The original initialization:

for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    // All these are async
    cudaMemcpyAsync(d_u[dev]+offset_device[dev], d_src[dev]+offset_device[dev], bytes_device[dev]*sizeof(float), cudaMemcpyDeviceToDevice, stream[dev*nStream_device+1]);
    cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
    cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
    cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device]);
}

changed to:

for (dev = 0; dev < deviceCount; dev++){
                        cudaSetDevice(gpuids[dev]);
                        // New line: multi-GPU not handled
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_src[dev]+pixels_per_slice*j, src+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_src[dev]+bytes_device[dev]+pixels_per_slice*j, src+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev]  , bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                    }
                    for (dev = 0; dev < deviceCount; dev++){
                        cudaSetDevice(gpuids[dev]);
                        // All these are async
                        // New line:
                        cudaMemcpyAsync(d_u[dev], d_src[dev], mem_img_each_GPU, cudaMemcpyDeviceToDevice,stream[dev*nStream_device+1]);
                        cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU,stream[dev*nStream_device]);
                        cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU,stream[dev*nStream_device]);
                        cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU,stream[dev*nStream_device]);
                    }

And the original per-split copy:

for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+2]);
    cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+3]);
    cudaMemcpyAsync(d_py[dev]+offset_device[dev], h_py+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+4]);
    cudaMemcpyAsync(d_pz[dev]+offset_device[dev], h_pz+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}

changed to:

for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+1]);
                        // New line: multi-GPU not handled
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_u[dev]+pixels_per_slice*j, h_u+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_u[dev]+bytes_device[dev]+pixels_per_slice*j, h_u+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        //cudaMemsetAsync(d_u[dev], 0, mem_img_each_GPU,stream[dev*nStream_device+1]);//zhouenze
                        cudaMemcpyAsync(d_u [dev] +offset_device[dev], h_u +offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                       
                    }

                    for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+2]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_px[dev]+pixels_per_slice*j, h_px+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+2]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_px[dev]+bytes_device[dev]+pixels_per_slice*j, h_px+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+2]);
                            }
                        }
                        // cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU,stream[dev*nStream_device+2]);//zhouenze
                        cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+2]);
                       
                    }
                    for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+3]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_py[dev]+pixels_per_slice*j, h_py+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+3]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_py[dev]+bytes_device[dev]+pixels_per_slice*j, h_py+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+3]);
                            }
                        }
                        // cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU,stream[dev*nStream_device+3]);//zhouenze
                        cudaMemcpyAsync(d_py[dev] +offset_device[dev], h_py+offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+3]);
                        
                    }
                    for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+4]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_pz[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+4]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_pz[dev]+bytes_device[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+4]);
                            }
                        }
                        // cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU,stream[dev*nStream_device+4]);//zhouenze
                        cudaMemcpyAsync(d_pz[dev] +offset_device[dev], h_pz+offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+4]);
                        
                    } 
                    for (dev = 0; dev < deviceCount; dev++){  
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+1]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_src[dev]+pixels_per_slice*j, src+pixels_per_slice*(buffer_length-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_src[dev]+bytes_device[dev]+pixels_per_slice*j, src+pixels_per_slice*(image_size[2]-j-1), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        //cudaMemsetAsync(d_src[dev], 0, mem_img_each_GPU,stream[dev*nStream_device+1]);//zhouenze
                        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src +offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                        

                    }

Here is the result:
Left: the initial denoised end slice (has a ghost). Middle: the denoised end slice using zeros as buffer data (no ghost, but severe noise). Right: the denoised end slice using flipped data as buffer data (no ghost, but a slight noise enhancement compared with the initial image; why?). Note that these methods differ only in the slices at the two ends; all other slices are identical. [image]

@AnderBiguri (Member) commented

Hi @zezisme, I am not sure I understood the second issue or these last images very well. Can you edit the text to explain them at a bit more length?

@zezisme (Author) commented Oct 21, 2024

> Hi @zezisme, I am not sure I understood the second issue or these last images very well. […]

This is the flow chart of the second problem, though I am not sure whether this problem has much impact on the denoising. [flow chart]

@AnderBiguri (Member) commented

Right, let me rephrase it to see if I fully understood:

The issue you are highlighting is that after the first split of the image is processed, the second one will use updated data as a buffer, rather than the original data.

Yes, you are correct then; this is what happens. Looking at my notes, I seem to have tested this when I wrote the multi-GPU code and decided that even if this is wrong, it has a limited impact on the denoising, so I was not going to do much about it. The alternative requires a full extra copy of the image in memory, which I decided was too costly for the little effect it had.

However, this should have no effect on the end slices, I think. I believe the end slices are a problem of the boundary condition. This code assumes Neumann boundary conditions (i.e. the gradient is zero in the slice direction at the first/last slice). This means that the gradient at the edges will always be smaller than everywhere else, and so will the denoising strength. Perhaps using periodic boundary conditions would fix the issue (if this is the issue). Critically, it would mean changing these lines to access the next (rather than the previous) pixel value in the else case:

https://github.com/CERN/TIGRE/blob/master/Common/CUDA/tvdenoising.cu#L70-L123
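For reference, a minimal MATLAB sketch of what the Neumann assumption means for the z-gradient (u is a stand-in volume, not TIGRE's kernel code):

% Neumann boundary in z: forward difference inside the volume, zero gradient
% at the last slice, which is why the edge slices are denoised less strongly.
u  = rand(8,8,8);                    % stand-in volume
dz = 1;
gz = zeros(size(u));
gz(:,:,1:end-1) = diff(u,1,3)/dz;    % u(:,:,z+1)-u(:,:,z) for interior slices
gz(:,:,end)     = 0;                 % boundary slice: gradient forced to zero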

@zezisme (Author) commented Oct 21, 2024

> Right, let me rephrase it to see if I fully understood: […] Critically, it would mean changing these lines to access the next (rather than the previous) pixel value in the else case: […]

Yes, I agree that the second problem is not the cause of the ghost! The main reason is the first problem mentioned before.
Here is the flow chart of the main problem. I have also tested the new code and have some results; you can have a look at the GitHub issue page (#594). [flow chart]

AnderBiguri added a commit that referenced this issue Oct 21, 2024
@AnderBiguri (Member) commented

I just updated TIGRE to solve the ghosting issue. Does the change (it's from your code) solve it?

@zezisme (Author) commented Oct 21, 2024

> I just updated TIGRE to solve the ghosting issue. Does the change (it's from your code) solve it?

And for the boundary buffer expansion method, I have a new method to solve the noise enhancement problem. Here is the new code (it just copies the boundary slices into the buffer, to avoid the negative z-axis gradient that mirrored slices cause):

for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    // Sometimes the last chunk is smaller than the others; if we don't reset the memory to zero, residual ghost data stays in the variable
    cudaMemsetAsync(d_u[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_u[dev]+offset_device[dev], h_u+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+2]);
    // Sometimes the last chunk is smaller than the others; if we don't reset the memory to zero, residual ghost data stays in the variable
    cudaMemsetAsync(d_px[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+2]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+3]);
    // Sometimes the last chunk is smaller than the others; if we don't reset the memory to zero, residual ghost data stays in the variable
    cudaMemsetAsync(d_py[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_py[dev]+offset_device[dev], h_py+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+3]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+4]);
    // Sometimes the last chunk is smaller than the others; if we don't reset the memory to zero, residual ghost data stays in the variable
    cudaMemsetAsync(d_pz[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_pz[dev]+offset_device[dev], h_pz+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+4]);
}
for (dev = 0; dev < deviceCount; dev++){
    cudaSetDevice(gpuids[dev]);
    cudaStreamSynchronize(stream[dev*nStream_device+1]);
    // Sometimes the last chunk is smaller than the others; if we don't reset the memory to zero, residual ghost data stays in the variable
    cudaMemsetAsync(d_src[dev], 0, mem_img_each_GPU, stream[dev*nStream_device+1]);
    cudaMemcpyAsync(d_src[dev]+offset_device[dev], src+offset_host[dev], bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice, stream[dev*nStream_device+1]);
}

The following change may only work on a single GPU; if you use multiple GPUs, it may go wrong!

for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+1]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_u[dev]+pixels_per_slice*j, h_u+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_u[dev]+bytes_device[dev]+pixels_per_slice*j, h_u+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        cudaMemcpyAsync(d_u [dev] +offset_device[dev], h_u +offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                       
                    }

                    for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+2]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_px[dev]+pixels_per_slice*j, h_px+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+2]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_px[dev]+bytes_device[dev]+pixels_per_slice*j, h_px+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+2]);
                            }
                        }
                        cudaMemcpyAsync(d_px[dev]+offset_device[dev], h_px+offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+2]);
                       
                    }
                    for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+3]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_py[dev]+pixels_per_slice*j, h_py+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+3]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_py[dev]+bytes_device[dev]+pixels_per_slice*j, h_py+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+3]);
                            }
                        }
                        cudaMemcpyAsync(d_py[dev] +offset_device[dev], h_py+offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+3]);
                        
                    }
                    for (dev = 0; dev < deviceCount; dev++){   
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+4]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_pz[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+4]);
                            
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_pz[dev]+bytes_device[dev]+pixels_per_slice*j, h_pz+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+4]);
                            }
                        }
                        cudaMemcpyAsync(d_pz[dev] +offset_device[dev], h_pz+offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+4]);
                        
                    } 
                    for (dev = 0; dev < deviceCount; dev++){  
                        cudaSetDevice(gpuids[dev]);
                        cudaStreamSynchronize(stream[dev*nStream_device+1]);
                        // New line:
                        if(is_first_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_src[dev]+pixels_per_slice*j, src+pixels_per_slice*j, pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        if(is_last_chunk){
                            for (unsigned int j=0;j<buffer_length;j++){
                                cudaMemcpyAsync(d_src[dev]+bytes_device[dev]+pixels_per_slice*j, src+pixels_per_slice*(image_size[2]-buffer_length+j), pixels_per_slice*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                            }
                        }
                        cudaMemcpyAsync(d_src[dev]+offset_device[dev], src +offset_host[dev],  bytes_device[dev]*sizeof(float), cudaMemcpyHostToDevice,stream[dev*nStream_device+1]);
                        

                    }

@zezisme (Author) commented Oct 21, 2024

> I just updated TIGRE to solve the ghosting issue. Does the change (it's from your code) solve it?

I have read your modification, and it does indeed solve the ghost problem. However, the assumption that the boundary gradient is 0 (Neumann boundary conditions) makes the boundary slices noisier. But this may not be a very important problem.

@AnderBiguri (Member) commented

Hi @zezisme. I only added the memory-zeroing part, not the part that you mentioned works only for one GPU.

For now, you shared 3 pieces of code:

So, questions for you:

  1. Given just the first change, does this fix your ghosting?
  2. The second piece of code, as we discussed, would indeed fix the fact that different splits don't access exactly the same data. However, do you see a big difference if you don't do anything about it?
  3. Is the third piece of code needed?

@zezisme (Author) commented Oct 21, 2024

> Hi @zezisme. I only added the memory-zeroing part […] So, questions for you: […]

Answering the three questions:

  1. The first change does indeed solve the ghost problem!
  2. Sorry for my unclear expression. All the other code is intended to solve the boundary-slice noise problem (due to the Neumann boundary conditions), not problem 2 in #594 (comment). (I don't think that is an important issue either.)
  3. The third piece of code is not needed under Neumann boundary conditions.

@AnderBiguri (Member) commented

@zezisme Thanks a lot for the explanation!

I think the best way to tackle the fact that the edge slices have a different denoising strength is to change the math, not the memory.

Can you check whether changing the lines I linked before to the following fixes the problem? Essentially this changes the boundary conditions from Neumann to mirror.

   __device__ __inline__
            float divergence(const float* pz, const float* py, const float* px,
            long z, long y, long x, long depth, long rows, long cols,
            float dz, float dy, float dx)
    {
        long size2d = rows*cols;
        long idx = z * size2d + y * cols + x;
        float _div = 0.0f;
        
        if ( z - 1 >= 0 ) {
            _div += (pz[idx] - pz[(z-1)*size2d + y*cols + x]) / dz;
        } else {
            _div += (pz[idx]-pz[(z+1)*size2d + y*cols + x]) / dz;
        }
        
        if ( y - 1 >= 0 ) {
            _div += (py[idx] - py[z*size2d + (y-1)*cols + x]) / dy;
        } else {
            _div += (py[idx] - py[z*size2d + (y+1)*cols + x]) / dy;
        }
        
        if ( x - 1 >= 0 ) {
            _div += (px[idx] - px[z*size2d + y*cols + (x-1)]) / dx;
        } else {
            _div += (px[idx] - px[z*size2d + y*cols + (x+1)]) / dx;
        }
        
        return _div;
    }
    
    __device__ __inline__
            void gradient(const float* u, float* grad,
            long z, long y, long x,
            long depth, long rows, long cols,
            float dz, float dy, float dx)
    {
        long size2d = rows*cols;
        long idx = z * size2d + y * cols + x;
        
        float uidx = u[idx];
        
        if ( z + 1 < depth ) {
            grad[0] = (u[(z+1)*size2d + y*cols + x] - uidx) / dz;
        }else{
            grad[0] = (u[(z-1)*size2d + y*cols + x] - uidx) / dz;
        }
        
        if ( y + 1 < rows ) {
            grad[1] = (u[z*size2d + (y+1)*cols + x] - uidx) / dy;
        }else{
            grad[1] = (u[z*size2d + (y-1)*cols + x] - uidx) / dy;
        }
        
        if ( x + 1 < cols ) {
            grad[2] = (u[z*size2d + y*cols + (x+1)] - uidx) / dx;
        }else{
            grad[2] = (u[z*size2d + y*cols + (x-1)] - uidx) / dx;
        }
    }

@zezisme (Author) commented Oct 21, 2024

> @zezisme Thanks a lot for the explanation! […] Can you check whether changing the lines I linked before to the following fixes the problem? Essentially this changes the boundary conditions from Neumann to mirror. […]

Hello @AnderBiguri
This doesn't seem to work; there is still strong noise at the boundary slices. I guess it is because of the buffer: with buffer slices attached, the kernel never detects the actual boundary slice of the volume.

@AnderBiguri (Member) commented

Ah, you may be right... Let me think about it a little more. Indeed, I think we always call the kernels with a z size of size_slice+buffer*2, which perhaps we should not do for the edge chunks (first and last).

@zezisme (Author) commented Oct 21, 2024

> Ah, you may be right... Let me think about it a little more. […]

Yes. Do you have a better solution now? I have tried passing a parameter to the kernels to identify the real boundary slice, but I ran into a new problem, and I am a little confused now.

@AnderBiguri (Member) commented

@zezisme maybe your solution is the best one, but I need some time to test it and play with multi-GPU cases to ensure it's right.

An alternative for now is to add some slices to the input image before im3DDenoise() and remove them after the GPU call (a sketch follows below). Computationally more expensive, but it should do the trick if you need results soon. Otherwise, it's possible that what you are doing (filling the buffer with edge slices) is the right way to go.
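A minimal MATLAB sketch of that padding workaround (npad is an assumed padding width; TV_niter and TV_lambda are the variables from the report above):

% Pad the volume with copies of its edge slices, denoise, then crop.
npad = 4;                                          % assumed padding width
xp = cat(3, repmat(x(:,:,1),[1 1 npad]), x, repmat(x(:,:,end),[1 1 npad]));
xp = im3DDenoise(xp,'TV',TV_niter,TV_lambda,'gpuids',GpuIds());
x  = xp(:,:,npad+1:end-npad);                      % drop the padded slices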

@zezisme (Author) commented Oct 21, 2024

> @zezisme maybe your solution is the best one […] An alternative for now is to add some slices to the input image before im3DDenoise() and remove them after the GPU call. […]

OK, thank you very much! I will go with the solution of adding some slices to the image before TV denoising, and I look forward to your subsequent, more effective solutions!

@AnderBiguri (Member) commented

Hi @zezisme, I will be quite busy over the next few weeks, so if you want to try to produce a nice piece of code that is also multi-GPU compatible to fix this last error, feel free to give it a shot and make a Pull Request.

If not, no worries, I'll try to tackle this later on. Just that I won't have time for it in the next few weeks.

@AnderBiguri (Member) commented

I made a PR; can you test it, @zezisme, to see whether it fixes the issue? I don't have a computer with multiple GPUs and enough RAM to test it right now.
#609
