
cub::DeviceSpmv: Various fixes and tests #352

Merged · 8 commits · Aug 17, 2021

Conversation

alliepiper (Collaborator)

Fixes #139 and #285. Includes #160.

This is blocking #276, since we need tests to validate the if-target update for DeviceSpmv.

@alliepiper added this to the 1.14.0 milestone Jul 30, 2021
alliepiper added a commit to alliepiper/thrust that referenced this pull request Jul 30, 2021
alliepiper (Collaborator, Author)

gpuCI: NVIDIA/thrust#1500

@alliepiper added the testing: gpuCI in progress label Jul 30, 2021
@alliepiper linked an issue Jul 30, 2021 that may be closed by this pull request
alliepiper (Collaborator, Author)

DVS CL: 30247358

gevtushenko (Collaborator) left a comment


The code is moving towards production quality, thank you!

A few fairly complex issues still make this algorithm unreliable. I'm afraid we can't merge the tests until those issues are addressed, since they would lead to random test failures. Please let me know if I've misunderstood anything.

Review threads (all resolved):
cub/agent/agent_spmv_orig.cuh
cub/device/device_spmv.cuh (outdated)
test/test_device_spmv.cu
test/test_device_spmv.cu (outdated) ×3

Review thread on this hunk:

tile_carry.key += tile_start_coord.x;
if (tile_carry.key >= spmv_params.num_rows)

gevtushenko (Collaborator)


I've been thinking about this fix for too long 😄 My main concern can be formalized as follows. The tile key never exceeds num_rows: it can equal num_rows only when tile_idx == num_merge_tiles, which is never the case for this kernel. Only the main thread of each tile writes results here, so it was unclear how we could exceed num_rows.

tile_carry is actually a block aggregate, with tile_carry.key holding a row number in tile-local indexation [0, tile_num_rows). The line above switches from tile-local to matrix indexation by adding the index of the first tile row: tile_carry.key += tile_start_coord.x. tile_num_rows, in turn, is computed as tile_end_coord.x - tile_start_coord.x, and the sum of these differences can't exceed num_rows. So the issue has to be somewhere else.
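
Written out as assertions, the bound argument looks like this (just a sketch using the kernel's names, not code from the PR):

// Tile-local: 0 <= tile_carry.key < tile_num_rows,
// where tile_num_rows == tile_end_coord.x - tile_start_coord.x.
assert(tile_carry.key < tile_end_coord.x - tile_start_coord.x);

tile_carry.key += tile_start_coord.x; // switch to matrix indexation

// Since tile_end_coord.x <= num_rows for every tile of this kernel:
assert(tile_carry.key < tile_end_coord.x); // and hence < num_rows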

tile_carry is computed as

BlockScanT(temp_storage.aliasable.scan).ExclusiveScan(scan_item, scan_item, scan_op, tile_carry);

Applying the scan_op (ReduceByKeyOp&lt;cub::Sum&gt;) causes the key value (row number) held by the last thread in the thread block to be broadcast to the whole block as tile_carry.key, so the issue has to surface in the last threads of the last tile.
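
For context, ReduceByKeyOp&lt;cub::Sum&gt; combines pairs roughly like this (a simplified sketch, not the exact CUB implementation):

// The combined pair always takes the later key, which is why a scan's
// block aggregate carries the key held by the last thread in the block.
template <typename KeyValuePairT>
__device__ KeyValuePairT reduce_by_key_sum(KeyValuePairT first,
                                           KeyValuePairT second)
{
    KeyValuePairT retval = second;                 // key comes from 'second'
    if (first.key == second.key)
        retval.value = first.value + second.value; // same row: accumulate
    return retval;                                 // new row: restart the sum
}

With that in mind, the issue can be found here: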

for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
{
    if (tile_nonzero_indices[thread_current_coord.y] < row_end_offset)
    {
        // Move down (accumulate)
        scan_segment[ITEM].value    = nonzero;
        running_total               += nonzero;
        ++thread_current_coord.y;
        nonzero                     = s_tile_nonzeros[thread_current_coord.y];
    }
    else
    {
        // Move right (reset)
        scan_segment[ITEM].value    = 0.0;
        running_total               = 0.0;
        ++thread_current_coord.x;
        row_end_offset              = s_tile_row_end_offsets[thread_current_coord.x];
    }

    scan_segment[ITEM].key = thread_current_coord.x;
}

Only the first tile_num_rows + 1 elements of s_tile_row_end_offsets are filled (the load loop below runs for item <= tile_num_rows), but thread_current_coord.x grows without bound, so we read random data into row_end_offset, which corrupts most of the subsequent results. Shared memory isn't zero-filled before each kernel execution, so this is unsafe. I've pre-filled the shared memory with an arbitrary value, and all tests now fail:

// ---- DEBUG CODE ----
CTA_SYNC();
for (int item = threadIdx.x; item < TILE_ITEMS; item += BLOCK_THREADS)
{
    s_tile_row_end_offsets[item] = 42;
}
CTA_SYNC();
// ---- DEBUG CODE ----

#pragma unroll 1
for (int item = threadIdx.x; item <= tile_num_rows; item += BLOCK_THREADS)
{
    const OffsetT offset =
      (cub::min)(static_cast<OffsetT>(tile_start_coord.x + item),
                 static_cast<OffsetT>(spmv_params.num_rows - 1));
    s_tile_row_end_offsets[item] = wd_row_end_offsets[offset];
}

I'm afraid I don't know how to fix this right now. I don't think we should mask the data race with this fix, though, and it's also unclear to me what the consequences of ignoring the last tile's results are.

alliepiper (Collaborator, Author)


Just to note: the debugging code you added overwrites the partial results in the s_tile_nonzeros array, since the shared memory layout is:

        OffsetT*    s_tile_row_end_offsets  = &temp_storage.aliasable.merge_items[0].row_end_offset;
        ValueT*     s_tile_nonzeros         = &temp_storage.aliasable.merge_items[tile_num_rows + ITEMS_PER_THREAD].nonzero;
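
Schematically, the two arrays alias the same merge_items storage (a sketch of the layout implied by the two lines above):

// merge_items[0 .. tile_num_rows + ITEMS_PER_THREAD)          -> s_tile_row_end_offsets
// merge_items[tile_num_rows + ITEMS_PER_THREAD .. TILE_ITEMS) -> s_tile_nonzeros
//
// A debug fill over the full [0, TILE_ITEMS) range therefore also stomps
// on the partial results already staged in s_tile_nonzeros.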

This is why the tests fail with that modification. The results are still correct if the random fill only touches the valid tile_num_rows + ITEMS_PER_THREAD entries of s_tile_row_end_offsets.

alliepiper (Collaborator, Author)

Addressed the review feedback, aside from the final issue you pointed out. We'll need to figure out a better solution for that bit.

alliepiper (Collaborator, Author)

I extended the s_tile_row_end_offsets load to fill the first tile_num_rows + ITEMS_PER_THREAD elements, which should keep the reads in the merge-path traversal loop valid.
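
For reference, the extended load looks roughly like this (a sketch based on the loop quoted in the review above; the merged commit may differ in details):

#pragma unroll 1
for (int item = threadIdx.x;
     item < tile_num_rows + ITEMS_PER_THREAD; // was: item <= tile_num_rows
     item += BLOCK_THREADS)
{
    // Clamp the gather index so rows past the end of the matrix re-read
    // the last valid row end offset instead of reading out of bounds.
    const OffsetT offset =
      (cub::min)(static_cast<OffsetT>(tile_start_coord.x + item),
                 static_cast<OffsetT>(spmv_params.num_rows - 1));
    s_tile_row_end_offsets[item] = wd_row_end_offsets[offset];
}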

Also updated the comment on the hackaround for the fixup kernel's invalid memory access. I believe this is harmless, since it just drops the carry-over from the final tile in the input, which should be ignored anyway.
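
In effect, the hackaround neutralizes the out-of-range carry from the hunk at the top of this thread. A sketch of the idea (assuming the carry is clamped to a no-op; the merged code may differ):

tile_carry.key += tile_start_coord.x;
if (tile_carry.key >= spmv_params.num_rows)
{
    // An out-of-range key can only come from the final tile, whose
    // carry-over has no destination row; clamp it to a harmless no-op
    // so the fixup kernel doesn't write out of bounds.
    tile_carry.key   = spmv_params.num_rows - 1;
    tile_carry.value = ValueT{};
}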

alliepiper added a commit to alliepiper/thrust that referenced this pull request Aug 4, 2021
alliepiper (Collaborator, Author)

gpuCI: NVIDIA/thrust#1500
DVS CL: 30264714

alliepiper (Collaborator, Author)

@senior-zero I'd like to merge this for 1.14. Can you re-review?

@alliepiper removed the testing: gpuCI in progress label Aug 16, 2021
@alliepiper added the testing: gpuCI passed and testing: internal ci passed labels and removed the testing: internal ci in progress label Aug 16, 2021
gevtushenko (Collaborator) left a comment


Thank you for these fixes!

Review thread on test/test_device_spmv.cu (outdated, resolved)
Labels
testing: gpuCI passed · testing: internal ci passed
Development

Successfully merging this pull request may close these issues:

cub::DeviceSpmv needs attention
SpMV out-of-bounds accesses
2 participants