Implement the 3d maxpooling component. (#460)
Conversation
Obviously Vijay will do the review of this.
src/nnet3/nnet-simple-component.cc
Outdated
Er... you seem to have a very tight inner loop here: the operation is nested inside 3 loops! I have a hard time believing this could be efficient on a GPU.
What about doing the max pooling along the x and y axes separately? That could reduce the number of nested loops.
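To make this suggestion concrete, here is a minimal CPU sketch of separable max pooling: a max over a rectangular window factorizes as max over x of (max over y), so the pooling can be done in two passes with one single loop over the window per pass instead of a nested pair. The function name, the row-major layout, and the parameters are illustrative assumptions, not Kaldi's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Max-pool a 2D grid in two separable 1D passes: first along y, then along x.
// Valid because the max over a rectangular window factorizes:
//   max_{dx,dy} v[x+dx][y+dy] = max_dx ( max_dy v[x+dx][y+dy] ).
// Names and layout are illustrative, not Kaldi's actual API.
std::vector<std::vector<float>> MaxPool2dSeparable(
    const std::vector<std::vector<float>>& in,
    int pool_x, int pool_y, int step_x, int step_y) {
  int in_x = in.size(), in_y = in[0].size();
  int out_x = 1 + (in_x - pool_x) / step_x;
  int out_y = 1 + (in_y - pool_y) / step_y;

  // Pass 1: pool along y only (single loop over the y window).
  std::vector<std::vector<float>> tmp(in_x, std::vector<float>(out_y));
  for (int x = 0; x < in_x; ++x)
    for (int qy = 0; qy < out_y; ++qy) {
      float m = in[x][qy * step_y];
      for (int dy = 1; dy < pool_y; ++dy)
        m = std::max(m, in[x][qy * step_y + dy]);
      tmp[x][qy] = m;
    }

  // Pass 2: pool the intermediate result along x only.
  std::vector<std::vector<float>> out(out_x, std::vector<float>(out_y));
  for (int qx = 0; qx < out_x; ++qx)
    for (int qy = 0; qy < out_y; ++qy) {
      float m = tmp[qx * step_x][qy];
      for (int dx = 1; dx < pool_x; ++dx)
        m = std::max(m, tmp[qx * step_x + dx][qy]);
      out[qx][qy] = m;
    }
  return out;
}
```

Note this reduces loop nesting but still issues one kernel call per sub-window position if each inner max is a GPU call, which is the deeper concern raised below.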
I don't really understand the details of this. I'll let Vijay comment. What you have done may be fine as a temporary measure to test whether this even works.
Dan
On Tue, Jan 19, 2016 at 10:05 PM, tomkocse <notifications@github.com> wrote:

In src/nnet3/nnet-simple-component.cc, #460 (comment):

```cpp
int32 num_pools_z = 1 + (input_z_dim_ - pool_z_size_) / pool_z_step_;
// Do the max-pooling first along the x and y axes
CuMatrix<BaseFloat> patches_zyx(num_frames,
                                num_pools_x * num_pools_y * input_z_dim_,
                                kUndefined);
for (int32 qx = 0; qx < num_pools_x; qx++) {
  for (int32 qy = 0; qy < num_pools_y; qy++) {
    // get output buffer of the pool
    int32 q = qy + qx * num_pools_y;
    CuSubMatrix<BaseFloat> pool(
        patches_zyx.ColRange(q * input_z_dim_, input_z_dim_));
    pool.Set(-1e20);  // reset to a large negative value
    int32 offset_x = qx * pool_x_step_;
    int32 offset_y = qy * pool_y_step_;
    for (int32 px = offset_x; px < (pool_x_size_ + offset_x); px++) {
      for (int32 py = offset_y; py < (pool_y_size_ + offset_y); py++) {
        int32 p = py + px * input_y_dim_;
        pool.Max(in.ColRange(p * input_z_dim_, input_z_dim_));
      }
    }
  }
}
```
I believe what Dan is trying to say is that you shouldn't be calling pool.Max in a loop like this. You don't exploit the power of the GPU by serially calling CUDA kernels. (At least, I assume pool.Max calls a CUDA kernel.)
This problem is trivially parallelizable in that each output can be computed independently of all the others. (Each output element is the maximum of a different patch of the input 3-tensor.) You may have to write a new kernel. Without getting too deep into this, Torch's CUDA neural-net implementation has a 3d pooling component. Perhaps it can help you, though the fact that Kaldi has to vectorize its 3-tensors is bound to cause complications when adapting that code.
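The "one thread per output element" formulation described here can be sketched on the CPU: each flattened output index independently decodes into pool coordinates (qx, qy, qz) and reads only its own patch of the vectorized 3-tensor, so the serial loop over `o` below stands in for a CUDA thread grid. The function name, the z-fastest vectorization, and the sentinel value are illustrative assumptions, not the actual kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One pass over flattened output indices; each iteration is the work that a
// single GPU thread would do independently. Input is a vectorized 3-tensor
// with z fastest, then y, then x (an assumed layout for illustration).
std::vector<float> MaxPool3dPerOutput(const std::vector<float>& in,
                                      int in_x, int in_y, int in_z,
                                      int px, int py, int pz,   // pool sizes
                                      int sx, int sy, int sz) { // pool steps
  int ox = 1 + (in_x - px) / sx;
  int oy = 1 + (in_y - py) / sy;
  int oz = 1 + (in_z - pz) / sz;
  std::vector<float> out(ox * oy * oz);
  for (int o = 0; o < (int)out.size(); ++o) {  // each o = one "thread"
    // Decode the flattened output index into pool coordinates.
    int qz = o % oz, qy = (o / oz) % oy, qx = o / (oz * oy);
    float m = -1e20f;  // large negative sentinel, as in the quoted code
    for (int dx = 0; dx < px; ++dx)
      for (int dy = 0; dy < py; ++dy)
        for (int dz = 0; dz < pz; ++dz) {
          int ix = qx * sx + dx, iy = qy * sy + dy, iz = qz * sz + dz;
          m = std::max(m, in[(ix * in_y + iy) * in_z + iz]);
        }
    out[o] = m;
  }
  return out;
}
```

The nested window loops remain, but they now run inside each thread's private work, so there is a single kernel launch rather than one launch per sub-window.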
Solving the problem in a GPU-parallelizable manner requires reshaping and packing the input into different patches on the CPU. Will that be more expensive in time?
Please correct me if I have a wrong concept of this.
That's how it was implemented in nnet/nnet2. It is less expensive than looped calls of the same kernel.
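The patch-based approach mentioned here can be sketched as two bulk operations: first gather every pooling window into its own row of a patches matrix (in Kaldi this copy is itself done in parallel on the GPU, as noted below), then reduce each row with a single max. A minimal 1D illustration follows; the names and layout are assumptions, not the nnet/nnet2 code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of "pack into patches, then one bulk reduction". Two bulk steps
// replace the looped per-window Max() kernel calls. Illustrative names only.
std::vector<float> MaxPool1dViaPatches(const std::vector<float>& in,
                                       int pool_size, int pool_step) {
  int num_pools = 1 + ((int)in.size() - pool_size) / pool_step;

  // Step 1: gather patches (each row = one pooling window; overlapping
  // windows duplicate elements, which is the memory cost of this approach).
  std::vector<std::vector<float>> patches(num_pools);
  for (int q = 0; q < num_pools; ++q)
    patches[q].assign(in.begin() + q * pool_step,
                      in.begin() + q * pool_step + pool_size);

  // Step 2: one row-wise max reduction over the whole patches matrix.
  std::vector<float> out(num_pools);
  for (int q = 0; q < num_pools; ++q)
    out[q] = *std::max_element(patches[q].begin(), patches[q].end());
  return out;
}
```

The trade-off is extra memory (and a copy) for the patches matrix in exchange for replacing many small kernel launches with two large parallel ones.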
Converting the input matrix to patches is already done in the convolution component. You can have a look at that. BTW, the conversion to patches is done on the GPU.
OK, I will rewrite the propagate and backpropagate functions.
BTW, I want to make a comment just for the future: implementing convolution using a simple component may not be the ideal way to do it. IMO it makes more sense to do it using a general component, and for the time axis (and possibly also the 'extra' axis: deltas, etc.) to use the different rows of the input, via the t and x dimensions.
Also, I think the convolutional stuff should be moved to nnet-convolutional-component.{h,cc}. nnet-simple-component.{h,cc} is getting too long.
Good to know that there is a possibility of preserving the time axis through the [...]
Vijay
It won't be trivial to do this with a GeneralComponent - you'll have to [...]
OK. Vijay
Please squash your commits.
The two recent commits have been merged.
src/nnet3/nnet-component-itf.cc
Outdated
@danpovey I don't know of any recipes which are using the MaxpoolingComponent in nnet3. We are already printing a deprecation warning for Convolution1DComponent. How about removing both Convolution1DComponent and MaxpoolingComponent in this commit? I think it might be better to use the name MaxpoolingComponent instead of Maxpooling3dComponent for the new component.
OK, fine.
Is maxpooling3d efficient for the (possibly more common) 1d and 2d cases?
Dan
On Sat, Jan 23, 2016 at 6:20 PM, Vijayaditya Peddinti <notifications@github.com> wrote:

In src/nnet3/nnet-component-itf.cc, #460 (comment):

```cpp
@@ -96,6 +96,8 @@ Component* Component::NewComponentOfType(const std::string &component_type) {
     ans = new ConvolutionComponent();
   } else if (component_type == "MaxpoolingComponent") {
     ans = new MaxpoolingComponent();
+  } else if (component_type == "Maxpooling3dComponent") {
```
Still going through the new commit, will suggest changes to Tom Ko if it is not optimal for 1d/2d cases.
@danpovey I think the component is going to perform similarly for 1d, 2d and 3d inputs. There could be performance optimizations if we chose to have non-overlapping max-pooling windows. We can add additional code blocks that exploit the lack of overlap if there are noticeable delays.
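The non-overlapping special case mentioned above is worth spelling out: when pool_step equals pool_size, every input element belongs to exactly one window, so pooling degenerates to a strided reshape followed by one max per row, with no element read twice. A minimal 1D sketch under that assumption (names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Non-overlapping max pooling (step == size): the input tiles exactly into
// consecutive windows, so each pool is a disjoint slice reduced once.
std::vector<float> MaxPoolNonOverlapping(const std::vector<float>& in,
                                         int pool_size) {
  assert(in.size() % pool_size == 0);  // exact tiling is the precondition
  std::vector<float> out(in.size() / pool_size);
  for (size_t q = 0; q < out.size(); ++q)
    out[q] = *std::max_element(in.begin() + q * pool_size,
                               in.begin() + (q + 1) * pool_size);
  return out;
}
```

With overlap, each element is read by several windows and this simple disjoint-slice formulation no longer applies, which is why overlap costs extra work.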
Actually, forget about this -- I think I will do this myself.
When you mention the rewrite, are you talking about rewriting the [...]
--Vijay
I'm thinking about that too.
Dan
Guys, I changed my mind about implementing the convolutional stuff myself.
@danpovey, are you worried about the case where the block size is large but the number of pools is small?
There just seem to be a lot of individual CUDA calls.
I'll merge this now. If efficiency becomes a problem, we can add a kernel that does the max operation without having to be called in a loop.