cub::DeviceSpmv: Various fixes and tests #352
Conversation
gpuCI: NVIDIA/thrust#1500
DVS CL: 30247358
The code is moving towards production quality, thank you!
A few quite complex issues still make this algorithm unreliable. I'm afraid we can't merge the tests until these issues are addressed, because they would lead to random test failures. Please let me know if I've misunderstood anything.
tile_carry.key += tile_start_coord.x;
if (tile_carry.key >= spmv_params.num_rows)
I've been thinking about this fix for too long 😄 My main concern can be formalized as follows. The tile key never exceeds num_rows. It can be equal to num_rows only when tile_idx == num_merge_tiles, which is never the case for this kernel. Only the main thread of each tile writes results here, so it was unclear to me how we could exceed num_rows.
tile_carry is actually a block aggregate, with tile_carry.key standing for the row number in tile-local indexation [0, tile_num_rows). The line above changes the indexation from tile-local to matrix indexation by adding the index of the first row of the tile: tile_carry.key += tile_start_coord.x;. tile_num_rows, in its turn, is calculated as tile_end_coord.x - tile_start_coord.x. The sum of these differences can't exceed num_rows, so the issue has to be somewhere else.
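To make the indexation argument concrete, here is a minimal host-side sketch (hypothetical tile boundaries, not the actual kernel code) of the tile-local to matrix mapping; because consecutive tile row ranges partition [0, num_rows), the shifted keys stay in range:
#include <cassert>

int main()
{
    const int num_rows = 10;

    // Hypothetical row coordinates of consecutive merge-path tiles:
    // tile i covers rows [tile_start_x[i], tile_start_x[i + 1]).
    const int tile_start_x[] = {0, 4, 7, 10};
    const int num_tiles = 3;

    for (int tile = 0; tile < num_tiles; ++tile)
    {
        const int tile_num_rows = tile_start_x[tile + 1] - tile_start_x[tile];

        // tile_carry.key is a tile-local row index in [0, tile_num_rows).
        for (int local_key = 0; local_key < tile_num_rows; ++local_key)
        {
            const int global_key = local_key + tile_start_x[tile]; // the line under review
            assert(global_key < num_rows); // the telescoping sum keeps this in range
        }
    }
    return 0;
}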
tile_carry is computed as:
BlockScanT(temp_storage.aliasable.scan).ExclusiveScan(scan_item, scan_item, scan_op, tile_carry);
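For reference, the reduce-by-key operator combines key-value pairs roughly as in the sketch below (a simplified host-side stand-in, not the actual CUB implementation):
#include <cstdio>

struct KeyValuePair { int key; double value; };

// Simplified stand-in for cub::ReduceByKeyOp<cub::Sum>: the result always
// takes the second operand's key, and only sums the values when keys match.
KeyValuePair reduce_by_key(KeyValuePair first, KeyValuePair second)
{
    KeyValuePair result = second;
    if (first.key == second.key)
        result.value = first.value + second.value;
    return result;
}

int main()
{
    // Folding a few items left to right: the last item's key "wins",
    // which is why the block aggregate (tile_carry) ends up carrying
    // the key of the last thread in the block.
    KeyValuePair items[] = {{2, 1.0}, {2, 2.0}, {3, 4.0}, {3, 8.0}};
    KeyValuePair aggregate = items[0];
    for (int i = 1; i < 4; ++i)
        aggregate = reduce_by_key(aggregate, items[i]);

    std::printf("key = %d, value = %f\n", aggregate.key, aggregate.value); // key = 3, value = 12
    return 0;
}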
Application of the scan_op (ReduceByKeyOp<cub::Sum>) causes the key value (row number) of the last thread in the thread block to be broadcast to the whole thread block as tile_carry.key. So the issue has to appear in the last threads of the last tile. The issue can be found here:
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
{
if (tile_nonzero_indices[thread_current_coord.y] < row_end_offset)
{
// Move down (accumulate)
scan_segment[ITEM].value = nonzero;
running_total += nonzero;
++thread_current_coord.y;
nonzero = s_tile_nonzeros[thread_current_coord.y];
}
else
{
// Move right (reset)
scan_segment[ITEM].value = 0.0;
running_total = 0.0;
++thread_current_coord.x;
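// Note: in the last tile, thread_current_coord.x can keep growing past the
// initialized portion of s_tile_row_end_offsets (see the discussion below).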
row_end_offset = s_tile_row_end_offsets[thread_current_coord.x];
}
scan_segment[ITEM].key = thread_current_coord.x;
}
Up to tile_num_rows elements of s_tile_row_end_offsets were filled, but thread_current_coord.x grows without limit, so we read some random data into row_end_offset, which affects most of the subsequent results. Shared memory isn't zero-filled before each kernel execution, so this is unsafe. I've filled the shared memory with an arbitrary value and all tests are failing now:
// ---- DEBUG CODE ----
CTA_SYNC();
for (int item = threadIdx.x; item < TILE_ITEMS; item += BLOCK_THREADS)
{
s_tile_row_end_offsets[item] = 42;
}
CTA_SYNC();
// ---- DEBUG CODE ----
#pragma unroll 1
for (int item = threadIdx.x; item <= tile_num_rows; item += BLOCK_THREADS)
{
const OffsetT offset =
(cub::min)(static_cast<OffsetT>(tile_start_coord.x + item),
static_cast<OffsetT>(spmv_params.num_rows - 1));
s_tile_row_end_offsets[item] = wd_row_end_offsets[offset];
}
I'm afraid I don't know how to fix this right now. I don't think we should mask the data race with this fix, though. It's also unclear to me what the consequence of ignoring the last tile's results is.
Just to note: the debugging code you added is overwriting the partial results in the s_tile_nonzeros array, since the shared memory layout is:
OffsetT* s_tile_row_end_offsets = &temp_storage.aliasable.merge_items[0].row_end_offset;
ValueT* s_tile_nonzeros = &temp_storage.aliasable.merge_items[tile_num_rows + ITEMS_PER_THREAD].nonzero;
This is why the tests fail with that modification. The results are still correct if we only initialize the valid tile_num_rows + ITEMS_PER_THREAD entries in s_tile_row_end_offsets with random data.
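A minimal host-side sketch of that aliasing (hypothetical sizes and types; the real layout lives in the agent's shared temp_storage): the offsets view and the nonzeros view share one buffer, so initializing every slot of the offsets view also overwrites the region the nonzeros view points into.
#include <cstdio>

using OffsetT = int;   // both 4 bytes here, so flat OffsetT / ValueT
using ValueT  = float; // indexing lines up with the union slots

union MergeItem
{
    OffsetT row_end_offset; // view used by s_tile_row_end_offsets
    ValueT  nonzero;        // view used by s_tile_nonzeros
};

int main()
{
    const int items_per_thread = 2;  // hypothetical
    const int tile_items       = 16; // hypothetical
    const int tile_num_rows    = 4;  // hypothetical

    MergeItem merge_items[tile_items + items_per_thread];

    OffsetT *s_tile_row_end_offsets = &merge_items[0].row_end_offset;
    ValueT  *s_tile_nonzeros =
        &merge_items[tile_num_rows + items_per_thread].nonzero;

    s_tile_nonzeros[0] = 3.5f; // pretend this is a partial result

    // Filling ALL tile_items offset slots (as the debug code did) ...
    for (int i = 0; i < tile_items; ++i)
        s_tile_row_end_offsets[i] = 42;

    // ... lands on slots the nonzeros view aliases, so the partial result
    // is gone. Initializing only tile_num_rows + items_per_thread offsets
    // would have left it intact.
    std::printf("%f\n", s_tile_nonzeros[0]); // no longer 3.5: overwritten by the offset fill
    return 0;
}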
This was only used for SMs < 3.5, which are no longer supported.
Addressed the review feedback, aside from the final issue you pointed out. We'll need to figure out a better solution for that bit.
This extends the solution in PR NVIDIA#160.
This fix was suggested in PR NVIDIA#160.
I extended the solution from PR NVIDIA#160. Also updated the comment on the hackaround for the fixup kernel's invalid memory access. I believe this is harmless, since it just drops the carry-over from the final tile in the input, which should be ignored anyway.
gpuCI: NVIDIA/thrust#1500
@senior-zero I'd like to merge this for 1.14. Can you re-review?
Thank you for these fixes!
Fixes #139 and #285. Includes #160.
This is blocking #276, since we need tests to validate the if-target update for DeviceSpmv.