add Convolution component in nnet2 #47
danpovey merged 7 commits into kaldi-asr:master from naxingyu:convolution-nnet2
Conversation
Thanks,
Dan
OK. I'll write some comments for documentation, making the terms clear. I do have an example script for training a ConvNet, including Convolution and Maxpooling. The Maxpooling component was meant to be committed in another PR. I could use one PR for Convolution, Maxpooling, and an example script instead of doing these one step at a time. What's your suggestion?
Let's eventually do it in one pull request, but you can keep updating this one.
BackpropNeedsInput() now returns false. The Maxpooling component is committed. Both components come with documentation and component test code. Example scripts are provided.
Sorry I haven't gotten to this yet; I'll try to review it in the next few hours.
[edit: a better name than ConvolutionalComponent1d would be Convolutional1dComponent.]
Hi Dan, thanks for checking.
[Yenda and Karel, the part that relates to you is at the end]. Great-- thanks a lot for this!
I noticed some inefficiency in the Convolutional1dComponent::Propagate …
Now let's consider the backprop code. Currently the code is quite …
// column_map is indexed by the column-index of "patches", and the value is …
In the above, I use the word "list" to refer to a std::vector of elements …
For the core of the Propagate and Backprop functions, rather than have a …
However, the v2 API has been supported since v4.0 of the CUDA toolkit (i.e. …
Dan
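(For context: the column_map idea above amounts to turning patch extraction into one gather through a precomputed index array, instead of a loop of per-patch copies. Below is a minimal self-contained sketch of that scheme using plain std::vector in place of Kaldi's CuMatrix types; patch_dim and patch_step are illustrative parameter names, and in real code the per-row gather would be a single whole-matrix call such as CuMatrixBase::CopyCols.)

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int patch_dim = 3, patch_step = 1, input_dim = 8;
  const int num_patches = 1 + (input_dim - patch_dim) / patch_step;

  // column_map is indexed by the column-index of "patches", and the value
  // is the input column that the patch column is copied from.
  std::vector<int> column_map(num_patches * patch_dim);
  for (int p = 0; p < num_patches; p++)
    for (int d = 0; d < patch_dim; d++)
      column_map[p * patch_dim + d] = p * patch_step + d;

  // Gather one row of input through the map; on the GPU the same gather
  // over all rows is one kernel launch rather than num_patches copies.
  std::vector<float> in(input_dim), patches(column_map.size());
  for (int i = 0; i < input_dim; i++) in[i] = 10.0f + i;
  for (int c = 0; c < (int)column_map.size(); c++)
    patches[c] = in[column_map[c]];

  for (int c = 0; c < (int)patches.size(); c++)
    std::printf("%g ", patches[c]);
  std::printf("\n");
  return 0;
}
```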
Sorry for the pause. Sure, I don't mind. I believe this is necessary for a project like Kaldi.
Thanks.
Hi Dan, I updated the code as you suggested, and it works as expected. The HKUST recipe will be updated independently in another PR.
We're making progress. A few issues remain.
Dan, I'm having trouble rearranging the reversed indexes. The reversing can be done very quickly by parsing the forward_indexes, but the rearranging doesn't seem doable with a single loop. I used many loops with conditions, and it seems very inefficient. Am I missing something?
You'd probably need a loop first to figure out the maximum list length; …
Yes, that's exactly how I did it. The problem is that the third part, the list-to-vector conversion, needs a lot of conditionals to decide which vector each element goes into. The code is ugly; I need to redo it.
It shouldn't be complicated. For each list, just put the i'th element in …
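(A minimal sketch of the reverse-and-rearrange scheme from the last few comments: build one list per input column from forward_indexes, find the maximum list length, then put the i'th element of each list into the i'th fixed-width vector, padding with -1. The names forward_indexes/backward and the -1 padding convention are assumptions for illustration.)

```cpp
#include <cstdio>
#include <vector>

int main() {
  // forward_indexes[c] = input column that patches-column c was copied from.
  std::vector<int> forward_indexes = {0, 1, 2, 1, 2, 3, 2, 3, 4};
  const int input_dim = 5;

  // 1) Reverse: for each input column, list the patches-columns reading it.
  std::vector<std::vector<int> > lists(input_dim);
  for (int c = 0; c < (int)forward_indexes.size(); c++)
    lists[forward_indexes[c]].push_back(c);

  // 2) Find the maximum list length.
  int max_len = 0;
  for (int i = 0; i < input_dim; i++)
    if ((int)lists[i].size() > max_len) max_len = (int)lists[i].size();

  // 3) Rearrange: the i'th element of every list goes into the i'th vector,
  //    padded with -1. Each vector can then drive one indexed-add pass
  //    in the backprop, with no per-element conditionals.
  std::vector<std::vector<int> > backward(max_len,
                                          std::vector<int>(input_dim, -1));
  for (int col = 0; col < input_dim; col++)
    for (int i = 0; i < (int)lists[col].size(); i++)
      backward[i][col] = lists[col][i];

  for (int i = 0; i < max_len; i++) {
    for (int col = 0; col < input_dim; col++)
      std::printf("%3d", backward[i][col]);
    std::printf("\n");
  }
  return 0;
}
```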
It works now. There is one thing I don't follow: if BackpropNeedsInput() returns false, the forward data will be deleted, so how are the gradients computed?
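(A general note on that question, not specific to this code: for a linear map y = W x, the derivative with respect to the input needs only the weights and the output derivative, while it is the weight gradient that needs the forward input, so whether a component can return false here depends on what its Backprop and update code actually read.)

```latex
% Backprop for y = W x: the input derivative does not involve x itself,
% but the weight gradient does.
\[
\frac{\partial L}{\partial x} = W^{\top}\,\frac{\partial L}{\partial y},
\qquad
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top}.
\]
```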
OK, cool. I'll try to review it again soon and maybe merge.
I'm still working on it. I'll let you know when it's ready for review.
Index operations are done. Mutable is removed.
Thanks!
Dan
Sure. Here it is.
I'm having trouble interpreting these line numbers. For instance, here …
The line numbers indicate where the call happens. I posted them because they give a hint about which step of the prop and backprop the program stopped at.
Yes, but what I'm saying is that those line numbers appear not to be …
OK, it could be you just reversed two of the line numbers.
Dan
Thanks. It's done. I ran gdb again, and ReserveIndexes() didn't appear any more; instead, more AffineComponent::Backprop calls showed up.
Cool, I think we're done! Merging.
add Convolution component in nnet2
BTW, if you have time at some point, it would be good to take some of the efficiency improvements and apply them to the nnet1 1d-convolutional code.
Cool :) Yes, that's exactly what I planned to do.
Hi, many thanks for the enhancement! (getting back to work after two weeks off)
Add Convolution component in nnet2, including component test code. The propagation and backprop code is ported from nnet1. A preliminary test on egs/hkust shows reasonable results with the model-averaging parallel training framework.