
contrib ctc interface changes, cudnn7 CTC, and gluon CTC #7442

Merged
merged 3 commits into apache:master from gluon_ctc on Aug 24, 2017

Conversation

szha (Member) commented Aug 13, 2017

This change makes the current contrib CTC compatible with the cudnn7 CTC interface, and adds a CTC loss layer for Gluon.
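As a rough usage sketch (not taken from this PR; parameter names such as `padding_mask` follow the unit test further down and may have changed in later revisions), the new Gluon layer would be driven roughly like this:

```python
import mxnet as mx
from mxnet import gluon

# Hypothetical shapes: batch of 2, 20 time steps, alphabet of 4; 0 marks label padding.
pred = mx.nd.ones((2, 20, 4))                  # pre-softmax activations, 'NTC' layout
label = mx.nd.array([[1, 2, 3, 0, 0],
                     [3, 2, 0, 0, 0]])         # labels padded with 0

ctc_loss = gluon.loss.CTCLoss(padding_mask=0)  # padding_mask as in the test below
loss = ctc_loss(pred, label)                   # one loss value per sequence
print(loss.shape)                              # (2,)
```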

@szha szha force-pushed the gluon_ctc branch 18 times, most recently from 19298a9 to 245a789 on August 14, 2017 19:10
@szha szha changed the title from "contrib ctc interface changes for compatibility, and gluon CTC" to "contrib ctc interface changes, cudnn7 CTC, and gluon CTC" on Aug 14, 2017
szha (Member Author) commented Aug 14, 2017

@sbodenstein the second commit in this PR is to address #7445.

}

template <typename DType, typename xpu>
inline bool PackLabelByLength(mshadow::Tensor<xpu, 2, DType> labels,
Contributor

add comment
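For readers skimming the diff, here is a rough Python illustration of what this label-packing step does conceptually; the helper name and behaviour are assumptions based on the `PackLabelByLength` signature above, not the actual C++ code:

```python
import numpy as np

def pack_label_by_length(labels, lengths):
    """Flatten a padded (batch, max_num_labels) matrix into one packed label
    vector, keeping only the first lengths[b] entries of each row; this is
    roughly the flat layout the warp-ctc/cudnn CTC kernels consume."""
    packed, out_lengths = [], []
    for row, n in zip(labels, lengths):
        packed.extend(int(v) for v in row[:n])
        out_lengths.append(int(n))
    return packed, out_lengths

labels = np.array([[1, 2, 3, 0, 0],
                   [4, 5, 0, 0, 0]])
print(pack_label_by_length(labels, [3, 2]))   # ([1, 2, 3, 4, 5], [3, 2])
```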

@szha szha force-pushed the gluon_ctc branch 4 times, most recently from 6820fd7 to 79a5b1f on August 16, 2017 19:53
@@ -240,15 +461,22 @@ class CTCLossProp : public OperatorProperty {
int NumOutputs() const override { return 2; }

std::vector<std::string> ListArguments() const override {
return {"data", "label"};
if (param_.use_input_lengths && param_.use_label_lengths) {
return {"data", "label", "input_lengths", "label_lengths"};
Contributor

data_lengths

@@ -165,6 +165,36 @@ def test_l1_loss():
assert mod.score(data_iter, eval_metric=mx.metric.Loss())[0][1] < 0.1


def test_ctc_loss():
loss = gluon.loss.CTCLoss(padding_mask=0)
Contributor

test gpu

Member Author

I included test_loss in test_operator_gpu and everything passes.
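A minimal sketch of the kind of CPU/GPU check being discussed (the shapes, `padding_mask`, and the error-based GPU skip are illustrative assumptions, not the actual test):

```python
import mxnet as mx
from mxnet import gluon

def check_ctc_loss(ctx):
    ctc_loss = gluon.loss.CTCLoss(padding_mask=0)
    pred = mx.nd.ones((2, 20, 4), ctx=ctx)
    label = mx.nd.array([[1, 2, 3, 0, 0],
                         [3, 2, 0, 0, 0]], ctx=ctx)
    out = ctc_loss(pred, label)
    out.wait_to_read()        # force the computation to run on this context
    return out

check_ctc_loss(mx.cpu())
try:
    check_ctc_loss(mx.gpu(0))
except mx.base.MXNetError:
    pass                      # no GPU available in this environment
```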

// Since the input is the activation before softmax and cudnn CTC expects
// softmax probabilities, apply softmax to the inputs first.
Tensor<xpu, 3, real_t> prob(data.shape_);
mshadow::AllocSpace(&prob);
Contributor

Don't allocate this way. Use ctx.requested.

PackLabelByLength(labels, in_data[kLabelLength].get<xpu, 1, real_t>(s),
&packed_labels, &label_lengths);
} else {
#if defined(__CUDACC__) && MXNET_USE_CUDNN == 1 && CUDNN_MAJOR >= 7
Contributor

remove these ifs

CUDNN_CALL(cudnnDestroyCTCLossDescriptor(ctc_desc_));
CUDNN_CALL(cudnnDestroyTensorDescriptor(prob_desc_));
CUDNN_CALL(cudnnDestroyTensorDescriptor(grad_desc_));
#endif
Contributor

The #if/#endif shouldn't cross function boundaries.

Layout of the output sequence activation vector.
label_layout : str, default 'NT'
Layout of the labels.
use_input_lengths : bool, default False
Contributor

No need for the flag; just check whether the argument is None.
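A small sketch of that suggestion: derive the behaviour from whether the lengths arguments were passed, instead of carrying separate boolean flags (hypothetical signature and names):

```python
def ctc_forward(data, label, data_lengths=None, label_lengths=None):
    """Hypothetical forward signature: the variable-length code paths are
    selected by checking the optional arguments for None, so no separate
    use_input_lengths / use_label_lengths flags are needed."""
    use_data_lengths = data_lengths is not None
    use_label_lengths = label_lengths is not None
    return use_data_lengths, use_label_lengths
```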


Parameters
----------
output_layout : str, default 'NTC'
piiswrong (Contributor), Aug 18, 2017

output_layout -> layout

lengths of labels. Only required when `use_label_lengths` is false.
weight : float or None
Global scalar weight for loss.
input_lengths : NDArray or None,
Contributor

These are arguments to forward. List them separately in an input/output section.
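For illustration, a hedged sketch of passing the optional per-sample lengths at call time (the argument names and their order follow this docstring and are assumptions; they may have been renamed in later revisions):

```python
import mxnet as mx
from mxnet import gluon

ctc_loss = gluon.loss.CTCLoss(padding_mask=0)

pred = mx.nd.ones((2, 20, 4))               # (batch, max time steps, alphabet)
label = mx.nd.array([[1, 2, 3], [3, 2, 2]])
input_lengths = mx.nd.array([20, 16])       # valid time steps per sequence
label_lengths = mx.nd.array([3, 2])         # valid labels per sequence

loss = ctc_loss(pred, label, input_lengths, label_lengths)
```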

int batch = labels.size(0);
int max_num_labels = labels.size(1);
std::vector<int> cpu_labels(max_num_labels);
IndexTensorToVector(in_label_lengths, label_lengths);
Contributor

This function does a cudaMemcpy.
Can you do it only once?

std::vector<int> *packed_labels,
std::vector<int> *label_lengths) {
int batch = labels.size(0);
int max_num_labels = labels.size(1);
-  std::vector<index_t> cpu_labels(max_num_labels);
+  std::vector<int> cpu_labels(max_num_labels);
bool exceed_limit = false;

for (int b = 0; b < batch; ++b) {
IndexTensorToVector(labels[b], &cpu_labels);
Contributor

Try to do the copy only once.
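The point in both comments above, sketched in Python (a hypothetical helper; the real code is C++): do one device-to-host transfer for the whole label tensor and one for the lengths, then pack on the CPU, rather than issuing a copy per row:

```python
import numpy as np
import mxnet as mx

def pack_labels_with_one_copy(labels_nd, lengths_nd):
    """Copy the full label matrix and the length vector to host memory once
    each (two transfers total), then do the per-row packing in numpy."""
    labels = labels_nd.asnumpy().astype(np.int32)    # single device-to-host copy
    lengths = lengths_nd.asnumpy().astype(np.int32)  # single device-to-host copy
    packed = [int(v) for row, n in zip(labels, lengths) for v in row[:n]]
    return packed, lengths.tolist()
```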

piiswrong (Contributor)

Do you have a performance comparison between baidu and cudnn?

@szha szha force-pushed the gluon_ctc branch 2 times, most recently from 3cfbfeb to 32884ea on August 18, 2017 08:46
szha (Member Author) commented Aug 18, 2017

I compared the two implementations roughly on a K80 by timing only the parts where they diverge. They perform about the same overall, and the cudnn one seems slightly more efficient when the input size is large. The numbers are listed below; each workload is described by the input shape before preprocessing, the label shape, the input length, and the label length.

| input shape (before preprocessing) | label shape | input length | label length | baidu forward (ns) | cudnn forward (ns) |
| --- | --- | --- | --- | --- | --- |
| (32, 256, 28) | (32, 1) | 256 | 1 | 614598 | 545639 |
| (32, 512, 28) | (32, 1) | 512 | 1 | 1030004 | 1063012 |
| (32, 1024, 28) | (32, 1) | 1024 | 1 | 2115799 | 2207311 |
| (32, 2048, 28) | (32, 1) | 2048 | 1 | 4085191 | 4085597 |
| (32, 256, 28) | (32, 2) | 256 | 2 | 680476 | 685418 |
| (32, 512, 28) | (32, 2) | 512 | 2 | 1387201 | 1338115 |
| (32, 1024, 28) | (32, 2) | 1024 | 2 | 11913695 | 4217115 |
| (32, 2048, 28) | (32, 2) | 2048 | 2 | 7950573 | 5184110 |
| (32, 256, 28) | (32, 4) | 256 | 4 | 950801 | 952418 |
| (32, 512, 28) | (32, 4) | 512 | 4 | 1940807 | 1873934 |
| (32, 1024, 28) | (32, 4) | 1024 | 4 | 3801619 | 3755733 |
| (32, 2048, 28) | (32, 4) | 2048 | 4 | 7530379 | 7496358 |
| (32, 256, 28) | (32, 8) | 256 | 8 | 1474249 | 1472007 |
| (32, 512, 28) | (32, 8) | 512 | 8 | 2975453 | 2951099 |
| (32, 1024, 28) | (32, 8) | 1024 | 8 | 7873482 | 7956207 |
| (32, 2048, 28) | (32, 8) | 2048 | 8 | 14353415 | 11856699 |
| (32, 256, 28) | (32, 16) | 256 | 16 | 2529732 | 2527600 |
| (32, 512, 28) | (32, 16) | 512 | 16 | 5129444 | 5202983 |
| (32, 1024, 28) | (32, 16) | 1024 | 16 | 20691226 | 10203190 |
| (32, 2048, 28) | (32, 16) | 2048 | 16 | 20736001 | 20764510 |
| (32, 256, 28) | (32, 32) | 256 | 32 | 4349291 | 4372212 |
| (32, 512, 28) | (32, 32) | 512 | 32 | 9181241 | 9177710 |
| (32, 1024, 28) | (32, 32) | 1024 | 32 | 26316193 | 19705820 |
| (32, 2048, 28) | (32, 32) | 2048 | 32 | 37893973 | 37927728 |
| (32, 256, 28) | (32, 64) | 256 | 64 | 7350089 | 7341236 |
| (32, 512, 28) | (32, 64) | 512 | 64 | 10367995 | 10394812 |
| (32, 1024, 28) | (32, 64) | 1024 | 64 | 26141244 | 26080623 |
| (32, 2048, 28) | (32, 64) | 2048 | 64 | 46190315 | 43888343 |
| (32, 256, 28) | (32, 128) | 256 | 128 | 7244567 | 7197396 |
| (32, 512, 28) | (32, 128) | 512 | 128 | 17969224 | 17975323 |
| (32, 1024, 28) | (32, 128) | 1024 | 128 | 37417761 | 37404967 |
| (32, 2048, 28) | (32, 128) | 2048 | 128 | 134485276 | 134316760 |
| (32, 256, 28) | (32, 256) | 256 | 256 | 7399992 | 7459652 |
| (32, 512, 28) | (32, 256) | 512 | 256 | 20783582 | 20854272 |
| (32, 1024, 28) | (32, 256) | 1024 | 256 | 67250512 | 59459103 |
| (32, 2048, 28) | (32, 256) | 2048 | 256 | 147523944 | 145160144 |
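A rough sketch of how such a per-workload forward-timing run might be scripted (the choice between the baidu and cudnn kernels is assumed to come from the build and device, not from this script; shapes mirror the workload tuples above):

```python
import time
import mxnet as mx
from mxnet import gluon

def time_ctc_forward(batch, seq_len, alphabet, label_len, repeats=50, ctx=mx.gpu(0)):
    """Average forward time for one (input shape, label shape) workload."""
    ctc_loss = gluon.loss.CTCLoss(padding_mask=0)
    pred = mx.nd.ones((batch, seq_len, alphabet), ctx=ctx)
    label = mx.nd.ones((batch, label_len), ctx=ctx)
    mx.nd.waitall()                                  # keep setup out of the timing
    start = time.time()
    for _ in range(repeats):
        ctc_loss(pred, label).wait_to_read()
    return (time.time() - start) / repeats

print(time_ctc_forward(32, 256, 28, 1))
```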

@szha szha force-pushed the gluon_ctc branch 5 times, most recently from 8630456 to c70505b on August 22, 2017 18:02
@piiswrong piiswrong merged commit f489810 into apache:master Aug 24, 2017
@szha szha deleted the gluon_ctc branch August 27, 2017 22:31
mbaijal pushed a commit to mbaijal/incubator-mxnet that referenced this pull request Sep 6, 2017
* contrib ctc interface changes for compatibility

* cudnn ctc

* update per comments
@sbodenstein sbodenstein mentioned this pull request Sep 7, 2017
cjolivier01 pushed a commit to cjolivier01/mxnet that referenced this pull request Sep 11, 2017
* contrib ctc interface changes for compatibility

* cudnn ctc

* update per comments
crazy-cat pushed a commit to crazy-cat/incubator-mxnet that referenced this pull request Oct 26, 2017
* contrib ctc interface changes for compatibility

* cudnn ctc

* update per comments