Implement the 3d maxpooling component. (#460)
Conversation
Obviously Vijay will do the review of this.
src/nnet3/nnet-simple-component.cc
Outdated
Er... you seem to have a very tight inner loop here: the operation is nested inside 3 loops! I have a hard time believing this could be efficient on a GPU.
What about doing the max pooling along the x and y axes separately? That could reduce the number of nested loops.
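To make this suggestion concrete, here is a minimal CPU sketch of separable max pooling: a max over a rectangular window factorizes as max over x of (max over y), so the pooling can be done in two passes with one single loop over the window per pass instead of a nested pair. The function name, the row-major layout, and the parameters are illustrative assumptions, not Kaldi's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Max-pool a 2D grid in two separable 1D passes: first along y, then along x.
// Valid because the max over a rectangular window factorizes:
//   max_{dx,dy} v[x+dx][y+dy] = max_dx ( max_dy v[x+dx][y+dy] ).
// Names and layout are illustrative, not Kaldi's actual API.
std::vector<std::vector<float>> MaxPool2dSeparable(
    const std::vector<std::vector<float>>& in,
    int pool_x, int pool_y, int step_x, int step_y) {
  int in_x = in.size(), in_y = in[0].size();
  int out_x = 1 + (in_x - pool_x) / step_x;
  int out_y = 1 + (in_y - pool_y) / step_y;

  // Pass 1: pool along y only (single loop over the y window).
  std::vector<std::vector<float>> tmp(in_x, std::vector<float>(out_y));
  for (int x = 0; x < in_x; ++x)
    for (int qy = 0; qy < out_y; ++qy) {
      float m = in[x][qy * step_y];
      for (int dy = 1; dy < pool_y; ++dy)
        m = std::max(m, in[x][qy * step_y + dy]);
      tmp[x][qy] = m;
    }

  // Pass 2: pool the intermediate result along x only.
  std::vector<std::vector<float>> out(out_x, std::vector<float>(out_y));
  for (int qx = 0; qx < out_x; ++qx)
    for (int qy = 0; qy < out_y; ++qy) {
      float m = tmp[qx * step_x][qy];
      for (int dx = 1; dx < pool_x; ++dx)
        m = std::max(m, tmp[qx * step_x + dx][qy]);
      out[qx][qy] = m;
    }
  return out;
}
```

Note this reduces loop nesting but still issues one kernel call per sub-window position if each inner max is a GPU call, which is the deeper concern raised below.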
I don't really understand the details of this. I'll let Vijay comment. What you have done may be fine as a temporary measure to test whether this even works.
Dan
On Tue, Jan 19, 2016 at 10:05 PM, tomkocse <notifications@github.com> wrote:

In src/nnet3/nnet-simple-component.cc, #460 (comment):

```cpp
int32 num_pools_z = 1 + (input_z_dim_ - pool_z_size_) / pool_z_step_;
// Do the max-pooling first along the x and y axes
CuMatrix<BaseFloat> patches_zyx(num_frames,
                                num_pools_x * num_pools_y * input_z_dim_,
                                kUndefined);
for (int32 qx = 0; qx < num_pools_x; qx++) {
  for (int32 qy = 0; qy < num_pools_y; qy++) {
    // get output buffer of the pool
    int32 q = qy + qx * num_pools_y;
    CuSubMatrix<BaseFloat> pool(
        patches_zyx.ColRange(q * input_z_dim_, input_z_dim_));
    pool.Set(-1e20);  // reset to a large negative value
    int32 offset_x = qx * pool_x_step_;
    int32 offset_y = qy * pool_y_step_;
    for (int32 px = offset_x; px < (pool_x_size_ + offset_x); px++) {
      for (int32 py = offset_y; py < (pool_y_size_ + offset_y); py++) {
        int32 p = py + px * input_y_dim_;
        pool.Max(in.ColRange(p * input_z_dim_, input_z_dim_));
      }
    }
  }
}
```
I believe what Dan is trying to say is that you shouldn't be calling pool.Max in a loop like this. You don't exploit the power of the GPU by serially calling CUDA kernels. (At least, I assume pool.Max calls a CUDA kernel.)
This problem is trivially parallelizable in that each output can be computed independently of all the others. (Each output element is the maximum of a different patch of the input 3-tensor.) You may have to write a new kernel. Without getting too deep into this, Torch's CUDA neural-net implementation has a 3d pooling component. Perhaps it can help you, though the fact that Kaldi has to vectorize its 3-tensors is bound to cause complications when adapting that code.
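The "one thread per output element" formulation described here can be sketched on the CPU: each flattened output index independently decodes into pool coordinates (qx, qy, qz) and reads only its own patch of the vectorized 3-tensor, so the serial loop over `o` below stands in for a CUDA thread grid. The function name, the z-fastest vectorization, and the sentinel value are illustrative assumptions, not the actual kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One pass over flattened output indices; each iteration is the work that a
// single GPU thread would do independently. Input is a vectorized 3-tensor
// with z fastest, then y, then x (an assumed layout for illustration).
std::vector<float> MaxPool3dPerOutput(const std::vector<float>& in,
                                      int in_x, int in_y, int in_z,
                                      int px, int py, int pz,   // pool sizes
                                      int sx, int sy, int sz) { // pool steps
  int ox = 1 + (in_x - px) / sx;
  int oy = 1 + (in_y - py) / sy;
  int oz = 1 + (in_z - pz) / sz;
  std::vector<float> out(ox * oy * oz);
  for (int o = 0; o < (int)out.size(); ++o) {  // each o = one "thread"
    // Decode the flattened output index into pool coordinates.
    int qz = o % oz, qy = (o / oz) % oy, qx = o / (oz * oy);
    float m = -1e20f;  // large negative sentinel, as in the quoted code
    for (int dx = 0; dx < px; ++dx)
      for (int dy = 0; dy < py; ++dy)
        for (int dz = 0; dz < pz; ++dz) {
          int ix = qx * sx + dx, iy = qy * sy + dy, iz = qz * sz + dz;
          m = std::max(m, in[(ix * in_y + iy) * in_z + iz]);
        }
    out[o] = m;
  }
  return out;
}
```

The nested window loops remain, but they now run inside each thread's private work, so there is a single kernel launch rather than one launch per sub-window.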
Solving the problem in a GPU-parallelizable manner requires reshaping and packing the input into different patches on the CPU. Will that be more expensive in time?
Please correct me if I have a wrong concept of this.
That's how it was implemented in nnet/nnet2. It is less expensive than looped calls of the same kernel.
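The patch-based approach mentioned here can be sketched as two bulk operations: first gather every pooling window into its own row of a patches matrix (in Kaldi this copy is itself done in parallel on the GPU, as noted below), then reduce each row with a single max. A minimal 1D illustration follows; the names and layout are assumptions, not the nnet/nnet2 code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of "pack into patches, then one bulk reduction". Two bulk steps
// replace the looped per-window Max() kernel calls. Illustrative names only.
std::vector<float> MaxPool1dViaPatches(const std::vector<float>& in,
                                       int pool_size, int pool_step) {
  int num_pools = 1 + ((int)in.size() - pool_size) / pool_step;

  // Step 1: gather patches (each row = one pooling window; overlapping
  // windows duplicate elements, which is the memory cost of this approach).
  std::vector<std::vector<float>> patches(num_pools);
  for (int q = 0; q < num_pools; ++q)
    patches[q].assign(in.begin() + q * pool_step,
                      in.begin() + q * pool_step + pool_size);

  // Step 2: one row-wise max reduction over the whole patches matrix.
  std::vector<float> out(num_pools);
  for (int q = 0; q < num_pools; ++q)
    out[q] = *std::max_element(patches[q].begin(), patches[q].end());
  return out;
}
```

The trade-off is extra memory (and a copy) for the patches matrix in exchange for replacing many small kernel launches with two large parallel ones.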
Converting the input matrix to patches is already done in the convolution component. You can have a look at that. BTW, the conversion to patches is done on the GPU.
OK, I will rewrite the propagate and backpropagate functions.
BTW, I want to make a comment just for the future: implementing convolution using a simple component may not be the ideal way to do it. IMO it makes more sense to do it using a general component, and for the time axis (and possibly also the 'extra' axis: deltas, etc.) to use the different rows of the input, via the t and x dimensions.
Also, I think the convolutional stuff should be moved to nnet-convolutional-component.{h,cc}. nnet-simple-component.{h,cc} is getting too long.
Good to know that there is a possibility of preserving the time axis through the [...]
Vijay
It won't be trivial to do this with a GeneralComponent - you'll have to [...]
OK. Vijay
Please squash your commits.
The two recent commits have been merged.
src/nnet3/nnet-component-itf.cc
Outdated
@danpovey I don't know of any recipes which are using the MaxpoolingComponent in nnet3. We are already printing a deprecation warning for Convolution1DComponent. How about removing both Convolution1DComponent and MaxpoolingComponent in this commit? I think it might be better to use the name MaxpoolingComponent instead of Maxpooling3dComponent for the new component.
OK, fine.
Is maxpooling3d efficient for the (possibly more common) 1d and 2d cases?
Dan
On Sat, Jan 23, 2016 at 6:20 PM, Vijayaditya Peddinti <notifications@github.com> wrote:

In src/nnet3/nnet-component-itf.cc, #460 (comment):

```cpp
@@ -96,6 +96,8 @@ Component* Component::NewComponentOfType(const std::string &component_type) {
     ans = new ConvolutionComponent();
   } else if (component_type == "MaxpoolingComponent") {
     ans = new MaxpoolingComponent();
+  } else if (component_type == "Maxpooling3dComponent") {
```
Still going through the new commit, will suggest changes to Tom Ko if it is not optimal for 1d/2d cases.
@danpovey I think the component is going to perform similarly for 1d, 2d and 3d inputs. There could be performance optimizations if we chose to have non-overlapping max-pooling windows. We can add additional code blocks that exploit the lack of overlap if there are noticeable delays.
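The non-overlapping special case mentioned above is worth spelling out: when pool_step equals pool_size, every input element belongs to exactly one window, so pooling degenerates to a strided reshape followed by one max per row, with no element read twice. A minimal 1D sketch under that assumption (names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Non-overlapping max pooling (step == size): the input tiles exactly into
// consecutive windows, so each pool is a disjoint slice reduced once.
std::vector<float> MaxPoolNonOverlapping(const std::vector<float>& in,
                                         int pool_size) {
  assert(in.size() % pool_size == 0);  // exact tiling is the precondition
  std::vector<float> out(in.size() / pool_size);
  for (size_t q = 0; q < out.size(); ++q)
    out[q] = *std::max_element(in.begin() + q * pool_size,
                               in.begin() + (q + 1) * pool_size);
  return out;
}
```

With overlap, each element is read by several windows and this simple disjoint-slice formulation no longer applies, which is why overlap costs extra work.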
Actually, forget about this -- I think I will do this myself.
When you mention the rewrite, are you talking about rewriting the [...]
--Vijay
I'm thinking about that too.
Dan
Guys, I changed my mind about implementing the convolutional stuff myself.
@danpovey, are you worried about the case where the block size is large but the number of pools is small?
There just seem to be a lot of individual CUDA calls.
I'll merge this now. If efficiency becomes a problem, we can add a kernel that does the max operation without having to be called in a loop.