Grid-stride iteration and `ceilfracf` #952

jahooker · 2023-03-05T23:32:37Z

Hello Sjors, Dari, & co.!

I have come across what I believe to be a bug that affects pixel sampling in cuda_kernel_softMaskBackgroundValue and cuda_kernel_cosineFilter. These functions make use of ceilfracf:

relion/src/acc/cuda/cuda_kernels/cuda_device_utils.cuh

Line 42 in dcab793

__device__ int ceilfracf(T1 a, T2 b)

So far as I can tell from its name and the places it is called from, ceilfracf(a, b) should return the least integer n such that n * b >= a. But, as is clear from its definition, it does not do that. In cases where b divides a with no remainder,
the present implementation will return the intended result + 1.

template< typename T1, typename T2 >
static inline
__device__ int ceilfracf(T1 a, T2 b)
{
//	return __float2int_ru(__fdividef( (float)a, (float)b ) );
	return (int)(a/b + 1);
}

We want to return not a / b + 1 but ceilf(float(a) / float(b)). Or, without casting to float: a / b + bool(a % b) (assuming T1 and T2 are integral types, which is true at all the present call sites).

template <typename T1, typename T2>
static inline
__device__ int ceilfracf(T1 a, T2 b) {
    return ceilf(float(a) / float(b));
}

Now, why does this matter? As I said, ceilfracf is called from a handful of functions, including cuda_kernel_softMaskBackgroundValue and cuda_kernel_cosineFilter. In these places, it is being used to calculate the number of strides needed to iterate over some image data, given some number of parallel CUDA threads. In the pathological case, when the image size is divisible by the number of threads, there will be too great a gap between where the threads start, and pixels will be missed. For instance, given a 32 × 32 × 32 image, calling cuda_kernel_softMaskBackgroundValue with 128 threads per block and 128 blocks per grid (as is currently done), ~10k out of ~30k pixels will be ignored. The situation is less disastrous for larger images, since as a (the image size) increases, the relative error in ceilfracf(a, b) decreases. Given a 64 × 64 × 64 image, "only" ~15k out of ~260k (6%) get missed.

So, the most obvious fix is to change ceilfracf in the manner described above.

This brings me to my next point. Do we even need ceilfracf? It took me some time to convince myself that (when ceilfracf does what it should) cuda_kernel_softMaskBackgroundValue samples each pixel in the image exactly once. As it stands now, threads iterate over vol in block-sized strides. Each thread block passes over a different subspan of vol,
and each thread within a block samples that subspan once every SOFTMASK_BLOCK_SIZE pixels. The loop that controls this iteration increments two things: texel and a separate counter pass, which is unused in the body of the loop.
The loop checks two conditions on every iteration: whether texel has gone past the end of vol, and whether pass has gone past texel_pass_num (which, as I have explained, will sometimes be off by 1). There need only be one iterator to increment. cuda_kernel_softMaskBackgroundValue has been like this since its inception in 2016. It looks to me like the intention is to do a grid-stride loop. So, that is what I have tried to implement. I have taken the opportunity to get rid of the shared-memory array img_pixels, for which I saw no use, and to introduce a closure weight, defined outside the loop body and invoked within it. Now, there is not even any need for ceilfracf. We can dispense with it. I have similarly rewritten cuda_kernel_softMaskOutsideMap and cuda_kernel_cosineFilter. There are other functions that make use of ceilfracf, but they only do block-stride iteration. I think they are safe.

Best,
James

Now, `ceilfracf` does what its name suggests. Incidentally, it is no longer needed. `cuda_kernel_softMaskBackgroundValue`, `cuda_kernel_cosineFilter`, and `cuda_kernel_softMaskOutsideMap` now do proper grid-stride iteration.

dkimanius · 2023-03-17T14:40:37Z

Hi James,

The ceilfracf function is only used in index management. The cases where there is no remainder in the division are unlikely to occur and even if they do they only add a tiny performance penalty. The numeical results are unaffected. However, making the suggested change would add a performance penalty that would always occur due to the added floating point division in addition to a reduced numerical accuracy for large numbers.

That function is a common way of doing index management in kernels for this reason.

jahooker · 2023-03-17T16:23:14Z

Thanks Dari,

On review, I must concede that cuda_kernel_softMaskBackgroundValue samples the volume evenly. (Imaginary crisis averted?) But boy is it not obvious! Please do consider my proposal, not for ceilfracf but for the actual kernels. My changes obviate the need for ceilfracf in the first place, and with it, the need for this division. Also, do you not find my version so much more readable? 😉

jahooker added 2 commits March 5, 2023 20:51

Fixed stride issues in helper.cu

7edae35

Now, `ceilfracf` does what its name suggests. Incidentally, it is no longer needed. `cuda_kernel_softMaskBackgroundValue`, `cuda_kernel_cosineFilter`, and `cuda_kernel_softMaskOutsideMap` now do proper grid-stride iteration.

Fixed "grid-strided" loops in helper.cpp

f251f92

jahooker mentioned this pull request Mar 5, 2023

Grid-stride iteration: is this a bug? #953

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grid-stride iteration and `ceilfracf` #952

Grid-stride iteration and `ceilfracf` #952

jahooker commented Mar 5, 2023

dkimanius commented Mar 17, 2023

jahooker commented Mar 17, 2023

Grid-stride iteration and ceilfracf #952

Are you sure you want to change the base?

Grid-stride iteration and ceilfracf #952

Conversation

jahooker commented Mar 5, 2023

dkimanius commented Mar 17, 2023

jahooker commented Mar 17, 2023

Grid-stride iteration and `ceilfracf` #952

Grid-stride iteration and `ceilfracf` #952