
add block and grid loop for index_sample kernel to deal with a large-shape tensor #37816

Merged · 4 commits · Jan 21, 2022

Conversation

FlyingQianMM (Contributor)

PR types

Bug fixes

PR changes

OPs

Describe

When the length of the input tensor is larger than block_dim * grid_dim, the index_sample kernel does not process the exceeding elements, so we add a block and grid (grid-stride) loop in the kernel.
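
For illustration, here is a minimal sketch of the block and grid (grid-stride) loop pattern this PR adds; the kernel name, parameter list, and gather expression below are assumptions made for the sketch, not the exact Paddle kernel:

// Sketch only: names, parameters, and the gather expression are assumptions.
// Each thread strides over the whole 2-D index space, so elements beyond
// block_dim * grid_dim are still processed.
template <typename T, typename IndexT>
__global__ void IndexSampleSketch(const IndexT* index, const T* input,
                                  T* output, size_t index_length,
                                  size_t input_width, size_t batch_size) {
  for (unsigned int index_j = blockDim.y * blockIdx.y + threadIdx.y;
       index_j < batch_size; index_j += blockDim.y * gridDim.y) {
    for (unsigned int index_i = blockDim.x * blockIdx.x + threadIdx.x;
         index_i < index_length; index_i += blockDim.x * gridDim.x) {
      IndexT sample_idx = index[index_j * index_length + index_i];
      output[index_j * index_length + index_i] =
          input[index_j * input_width + sample_idx];
    }
  }
}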

paddle-bot-old bot commented on Dec 3, 2021:

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

paddle-bot-old bot commented:

Sorry to inform you that 9861e2f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

unsigned int index_i = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int index_j = blockDim.y * blockIdx.y + threadIdx.y;
for (; index_j < batch_size; index_j += blockDim.y * gridDim.y) {
  index_i = blockDim.x * blockIdx.x + threadIdx.x;
Contributor commented:

Are you sure this isn't redundant? 😂 There's really no need to recompute it here.

Contributor Author (FlyingQianMM) replied:

Deleted, thanks~

unsigned int index_j = blockDim.y * blockIdx.y + threadIdx.y;

for (; index_j < batch_size; index_j += blockDim.y * gridDim.y) {
  index_i = blockDim.x * blockIdx.x + threadIdx.x;
Contributor commented:

Same as above: there's no need to recompute index_i.

Contributor Author (FlyingQianMM) replied:

Deleted, thanks~

@@ -153,9 +166,16 @@ class IndexSampleGradKernel<platform::CUDADeviceContext, T>
auto block_height =
platform::RoundToPowerOfTwo(index_length * batch_size) / block_width;
dim3 block_dim(block_width, block_height);
unsigned int threads = 512;
Contributor commented:

This is duplicated code; it could be extracted into a helper:

void CheckLaunchParamValid(const framework::ExecutionContext& ctx, dim3* block_dim, dim3* grid_dim) {
  // Clamp each block dimension to at most 512 threads.
  unsigned int threads = 512;
  block_dim->x = block_dim->x < threads ? block_dim->x : threads;
  block_dim->y = block_dim->y < threads ? block_dim->y : threads;

  // Clamp each grid dimension to the device's maximum grid size.
  dim3 max_grid_dim =
      ctx.template device_context<platform::CUDADeviceContext>()
          .GetCUDAMaxGridDimSize();
  grid_dim->x = grid_dim->x < max_grid_dim.x ? grid_dim->x : max_grid_dim.x;
  grid_dim->y = grid_dim->y < max_grid_dim.y ? grid_dim->y : max_grid_dim.y;
}

Then call:

CheckLaunchParamValid(ctx, &block_dim, &grid_dim);

instead of writing the same code twice.

Contributor Author (FlyingQianMM) replied:

Defined a MIN function to check the block dim and a LimitGridDim function to check the grid dim. Thanks~
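
For reference, a minimal sketch of what such helpers might look like, based only on the snippets quoted in this thread; the definitions actually merged in the PR may differ:

// Sketched from the review discussion above; the merged definitions may differ.
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Clamp the grid dimensions to the device's maximum grid size.
void LimitGridDim(const framework::ExecutionContext& ctx, dim3* grid_dim) {
  dim3 max_grid_dim =
      ctx.template device_context<platform::CUDADeviceContext>()
          .GetCUDAMaxGridDimSize();
  grid_dim->x = MIN(grid_dim->x, max_grid_dim.x);
  grid_dim->y = MIN(grid_dim->y, max_grid_dim.y);
}

At the launch site, the block dimensions would then be clamped with MIN against the 512-thread limit mentioned above, and the grid dimensions passed through LimitGridDim before launching the kernel.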

@thisjiang (Contributor) left a comment:

LGTM
