Implement the 3d maxpooling component.#460

Merged
danpovey merged 7 commits intokaldi-asr:masterfrom
tomkocse:3dmaxpooling
Feb 6, 2016

Conversation

@tomkocse
Contributor

No description provided.

@danpovey
Contributor

Obviously Vijay will do the review of this.


Contributor

Er... you seem to have a very tight inner loop here: the operation is nested inside 3 loops! I have a hard time believing this could be efficient on a GPU.

Contributor Author

What about doing the max pooling along the x and y axes separately? That would reduce the number of nested loops.

Contributor

I don't really understand the details of this. I'll let Vijay comment.
What you have done may be fine as a temporary measure to test whether this
even works.
Dan

In src/nnet3/nnet-simple-component.cc
#460 (comment):

    int32 num_pools_z = 1 + (input_z_dim_ - pool_z_size_) / pool_z_step_;
    // Do the max-pooling first along the x and y axes.
    CuMatrix<BaseFloat> patches_zyx(num_frames,
                                    num_pools_x * num_pools_y * input_z_dim_,
                                    kUndefined);
    for (int32 qx = 0; qx < num_pools_x; qx++) {
      for (int32 qy = 0; qy < num_pools_y; qy++) {
        // Get the output buffer of this pool.
        int32 q = qy + qx * num_pools_y;
        CuSubMatrix<BaseFloat> pool(patches_zyx.ColRange(q * input_z_dim_,
                                                         input_z_dim_));
        pool.Set(-1e20);  // reset to a large negative value
        int32 offset_x = qx * pool_x_step_;
        int32 offset_y = qy * pool_y_step_;
        for (int32 px = offset_x; px < (pool_x_size_ + offset_x); px++) {
          for (int32 py = offset_y; py < (pool_y_size_ + offset_y); py++) {
            int32 p = py + px * input_y_dim_;
            pool.Max(in.ColRange(p * input_z_dim_, input_z_dim_));


Contributor

I believe what Dan is trying to say is that you shouldn't be calling pool.Max in a loop like this. You don't exploit the power of the GPU by serially calling CUDA kernels. (At least, I assume pool.Max calls a CUDA kernel.)

This problem is trivially parallelizable in that each output can be computed independently of all of the others. (Each output element is the maximum of a different patch of the input 3-tensor.) You may have to write a new kernel. Without getting too deep into this, torch's cuda neural net implementation has a 3d pooling component. Perhaps it can help you, though the fact that Kaldi has to vectorize its 3-tensors is bound to cause adapting that code to have complications.
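To make the parallelization argument concrete, here is a small CPU sketch (hypothetical names, not the Kaldi or torch API): each output element depends only on its own input patch, so each iteration of the outer loop below could become one CUDA thread. The vectorized layout follows the code above, column = (x * y_dim + y) * z_dim + z.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical illustration of fully parallel 3d max pooling.  Every output
// element is the max over one input patch, independent of all other outputs.
struct Pool3dParams {
  int x_dim, y_dim, z_dim;  // input tensor dimensions
  int px, py, pz;           // pool sizes along x, y, z
  int sx, sy, sz;           // pool steps (strides) along x, y, z
};

std::vector<float> MaxPool3d(const std::vector<float> &in,
                             const Pool3dParams &p) {
  int nx = 1 + (p.x_dim - p.px) / p.sx;
  int ny = 1 + (p.y_dim - p.py) / p.sy;
  int nz = 1 + (p.z_dim - p.pz) / p.sz;
  std::vector<float> out(nx * ny * nz);
  // In a CUDA kernel, each iteration of this loop would be one thread.
  for (int q = 0; q < nx * ny * nz; q++) {
    int qx = q / (ny * nz), qy = (q / nz) % ny, qz = q % nz;
    // Start from the patch's first element, then scan the rest of the patch.
    float m = in[((qx * p.sx) * p.y_dim + qy * p.sy) * p.z_dim + qz * p.sz];
    for (int dx = 0; dx < p.px; dx++)
      for (int dy = 0; dy < p.py; dy++)
        for (int dz = 0; dz < p.pz; dz++) {
          int col = ((qx * p.sx + dx) * p.y_dim + (qy * p.sy + dy)) * p.z_dim
                    + (qz * p.sz + dz);
          m = std::max(m, in[col]);
        }
    out[q] = m;
  }
  return out;
}
```

The nested loops over dx/dy/dz remain, but they run inside a single kernel launch per output rather than one kernel launch per (px, py) pair.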

Contributor Author

Solving the problem in a GPU-parallelizable manner requires reshaping and packing the input into patches on the CPU. Will that be more expensive in time?
Please correct me if I have a wrong concept on this.

Contributor

That's how it was implemented in nnet/nnet2. It is less expensive than
looped calls of the same kernel.


Contributor

Converting the input matrix to patches is already done in the convolution component; have a look at that. BTW, the conversion to patches is done on the GPU.
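For reference, the patch-based scheme can be sketched like this (a hypothetical CPU illustration, not the actual Kaldi code): once the input has been rearranged so that the elements of each pooling window occupy consecutive slots of a patches buffer (on the GPU that rearrangement is itself one parallel copy kernel), pooling reduces to a single bulk group-max pass instead of a loop of small kernel launches.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: 'patches' holds the elements of every pooling window
// back to back, group_size elements per window.  A single pass over the
// buffer produces all pool outputs at once.
std::vector<float> GroupMax(const std::vector<float> &patches, int group_size) {
  std::size_t num_groups = patches.size() / group_size;
  std::vector<float> out(num_groups);
  for (std::size_t g = 0; g < num_groups; g++)
    out[g] = *std::max_element(patches.begin() + g * group_size,
                               patches.begin() + (g + 1) * group_size);
  return out;
}
```

On the GPU both steps touch every element a bounded number of times, so the cost is a couple of bulk kernels rather than pool_x_size * pool_y_size launches per pool, which is what makes this cheaper than the looped pool.Max calls above.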

Contributor Author

OK, I will rewrite the propagate and backpropagate functions.

@danpovey
Contributor

BTW, I want to make a comment just for the future, that implementing convolution using a simple component may not be the ideal way to do it. IMO it makes more sense to do it using a general component, and for the time axis and possibly also the 'extra' axis (deltas, etc.) use the different rows of the input, via the t and x dimensions.

@danpovey
Contributor

Also, I think the convolutional stuff should be moved to nnet-convolutional-component.{h,cc}. nnet-simple-component.{h,cc} is getting too long.

@vijayaditya
Contributor

Good to know that there is a possibility of preserving the time axis through the convolution component. It was a concern for me before. I will do the conversion.

Vijay

@danpovey
Contributor

It won't be trivial to do this with a GeneralComponent: you'll have to create the PrecomputedIndexes to keep track of various things. I suggest you wait a few days, as I am working on some other instances of GeneralComponent, and once you see what I'm doing it will be easier to see how to do it.


@vijayaditya
Contributor

OK.

Vijay

@vijayaditya
Contributor

Please squash your commits.

@tomkocse
Contributor Author

The two recent commits have been merged.

Contributor

@danpovey I don't know of any recipes which are using the MaxpoolingComponent in nnet3, and we are already printing a deprecation warning for Convolution1DComponent. How about removing both Convolution1DComponent and MaxpoolingComponent in this commit? I think it would be better to use the name MaxpoolingComponent instead of Maxpooling3dComponent for the new component.

Contributor

OK, fine.
Is maxpooling3d efficient for the (possibly more common) 1d and 2d cases?
Dan

In src/nnet3/nnet-component-itf.cc:

    @@ -96,6 +96,8 @@ Component* Component::NewComponentOfType(const std::string &component_type) {
        ans = new ConvolutionComponent();
      } else if (component_type == "MaxpoolingComponent") {
        ans = new MaxpoolingComponent();
    + } else if (component_type == "Maxpooling3dComponent") {


Contributor

Still going through the new commit; I will suggest changes to Tom Ko if it is not optimal for the 1d/2d cases.

Contributor

@danpovey I think the component is going to perform similarly for 1d, 2d and 3d inputs. There could be performance optimizations if we chose to have non-overlapping maxpooling windows. We can add additional code blocks which exploit the lack of overlap if there are noticeable delays.
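For what it's worth, the 1d and 2d cases would just be degenerate configurations of the 3d component. Assuming the config parameter names implied by the member variables in the code above (input-x-dim, pool-x-size, pool-x-step, and so on; these names are not confirmed in this thread), a 2d pooling over x and y could be written by making the z axis trivial:

```
# Hypothetical nnet3 config line: 2d (x, y) max pooling via the 3d component,
# with the z axis degenerate (input-z-dim=1, pool-z-size=1, pool-z-step=1).
component name=pool1 type=MaxpoolingComponent input-x-dim=10 input-y-dim=8 input-z-dim=1 pool-x-size=2 pool-y-size=2 pool-z-size=1 pool-x-step=2 pool-y-step=2 pool-z-step=1
```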

@danpovey
Contributor

danpovey commented Feb 3, 2016

Actually, forget about this-- I think I will do this myself.
The convolutional code needs a rewrite, I think, for better documentation and more natural use of CUDA.

@vijayaditya
Contributor

When you mention the rewrite, are you talking about rewriting the convolution component as a non-simple component?

--Vijay


@danpovey
Contributor

danpovey commented Feb 3, 2016

I'm thinking about that too.

Dan


@danpovey
Contributor

danpovey commented Feb 4, 2016

Guys, I changed my mind about implementing the convolutional stuff myself.
I realized that my plan was not as advantageous as I had thought, and the amount of work was more than I had imagined.
So you can go ahead with this.
However, I don't think the way you have coded the maxpooling and its backprop is ideal at all. It would be better to do it in a CUDA kernel.

@tomkocse
Contributor Author

tomkocse commented Feb 4, 2016

@danpovey, are you worried about the case where the block size is large but the number of pools is small?
I admit that the way I coded it assumes a small block size and a large number of pools.

@danpovey
Contributor

danpovey commented Feb 4, 2016

There just seem to be a lot of individual CUDA calls.


@danpovey
Contributor

danpovey commented Feb 6, 2016

I'll merge this now. If efficiency becomes a problem we can add a kernel that does max stuff without having to be called in a loop.

danpovey added a commit that referenced this pull request Feb 6, 2016
Implement the 3d maxpooling component.
@danpovey danpovey merged commit 44db487 into kaldi-asr:master Feb 6, 2016