Using cuDNN for CTC Loss #7445

Open
sbodenstein opened this issue Aug 13, 2017 · 13 comments

@sbodenstein
Contributor

@piiswrong, @szha: Now that cuDNN 7 supports CTC loss, perhaps we should discard the current GPU implementation in contrib.ctc_loss (adapted from the WarpCTC implementation) and only use cuDNN for GPU? The main reasons:

  1. It requires maintenance effort to ensure the GPU implementation works on new GPU architectures, including careful updating of dependencies (like moderngpu).
  2. Users are still reporting memset problems when using the WarpCTC plugin (mx.sym.WarpCTC cuda memcpy or memset failed, issue #6121).

I don't think the maintenance effort is worthwhile, given that almost every user training on CUDA will also have cuDNN.

What are your thoughts?

@szha
Member

szha commented Aug 13, 2017

Thanks for raising this, @sbodenstein. I'm working on using the cudnn7 implementation of CTC for GPU.

@sbodenstein
Contributor Author

@szha: do you agree that we should remove the WarpCTC CUDA implementation?

@szha
Member

szha commented Aug 14, 2017

@sbodenstein I agree. There is one catch: the current WarpCTC supports variable-length inputs, whereas cudnn7 appears to only intend to support them. To elaborate, the current cudnn7 API for getting the CTC workspace size looks like this:

cudnnStatus_t CUDNNWINAPI cudnnGetCTCLossWorkspaceSize(
                                cudnnHandle_t                       handle,
                                const cudnnTensorDescriptor_t       probsDesc,       /* Tensor descriptor for probabilities, the dimensions are T,N,A (T is the timing steps, N is the mini batch size, A is the alphabet size) */
                                const cudnnTensorDescriptor_t       gradientsDesc,   /* Tensor descriptor for gradients, the dimensions are T,N,A. To compute costs only, set it to NULL */
                                const int                          * labels,         /* labels, in CPU memory */
                                const int                          * labelLengths,   /* the length of each label, in CPU memory */
                                const int                          * inputLengths,   /* the lengths of timing steps in each batch, in CPU memory */
                                cudnnCTCLossAlgo_t                  algo,            /* algorithm selected, supported now 0 and 1 */
                                cudnnCTCLossDescriptor_t            ctcLossDesc,
                                size_t                             *sizeInBytes );   /* pointer to the returned workspace size */

However, if I give any *inputLengths entry that differs from T, it throws a CUDNN_STATUS_BAD_PARAM error. The docs have a note saying that this error is thrown if:

‣ The inputLengths do not agree with the first dimension of probsDesc.

...so my guess is that this argument is there so that variable input lengths can be supported going forward; asking for a list of Ts doesn't make much sense otherwise.
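
For concreteness, a minimal sketch of the call pattern (in C; query_ctc_workspace and the variable names are illustrative, not actual MXNet code):

#include <stdlib.h>
#include <cudnn.h>

/* Sketch: query the CTC workspace size with cudnn7. Every inputLengths
   entry must currently equal T (the first dimension of probsDesc);
   anything else yields CUDNN_STATUS_BAD_PARAM. */
size_t query_ctc_workspace(cudnnHandle_t handle,
                           cudnnTensorDescriptor_t probsDesc,
                           cudnnTensorDescriptor_t gradsDesc,
                           cudnnCTCLossDescriptor_t ctcDesc,
                           const int *labels, const int *labelLengths,
                           int T, int N) {
    size_t sizeInBytes = 0;
    int *inputLengths = malloc(N * sizeof(int));
    for (int i = 0; i < N; ++i)
        inputLengths[i] = T;   /* e.g. inputLengths[0] = T - 1 fails */
    cudnnStatus_t status = cudnnGetCTCLossWorkspaceSize(
        handle, probsDesc, gradsDesc, labels, labelLengths, inputLengths,
        CUDNN_CTC_LOSS_ALGO_DETERMINISTIC, ctcDesc, &sizeInBytes);
    free(inputLengths);
    return status == CUDNN_STATUS_SUCCESS ? sizeInBytes : 0;
}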

Back to whether we should remove the WarpCTC implementation: I think we first need to clarify with the cudnn team when *inputLengths will support variable input lengths, and remove the WarpCTC GPU implementation once that becomes available.

@sbodenstein
Contributor Author

Ah, that is annoying! One other limitation I noticed: CUDNN_STATUS_BAD_PARAM is also thrown when

The labelLengths is greater than 256.

whilst the WarpCTC documentation states:

The CUDA implementation supports a maximum label length of 639 (timesteps are unlimited).

@szha
Member

szha commented Aug 14, 2017

Good catch. Let me reflect this in the PR as well.

@starimpact
Contributor

So, the CTC in cudnn7 supports neither variable-length inputs nor label lengths greater than 256.

@sbodenstein
Contributor Author

@szha: one annoying thing about adding cuDNN support: if you build against cuDNN (which you almost always want to do), you automatically get the cuDNN CTC implementation, which you might not want if you need variable-length inputs.

@szha
Member

szha commented Aug 16, 2017

The current implementation still includes WarpCTC in the GPU version and only enables the cudnn version when all input requirements are met, since the cudnn version is strictly more limited.
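
Roughly, the gating condition looks like this (an illustrative sketch, not the actual MXNet code; can_use_cudnn_ctc is a hypothetical helper):

#include <stdbool.h>

/* Fall back to WarpCTC unless every cudnn7 restriction discussed above
   holds for the batch. */
bool can_use_cudnn_ctc(const int *inputLengths, const int *labelLengths,
                       int N, int T) {
    for (int i = 0; i < N; ++i) {
        if (inputLengths[i] != T)   return false;  /* no variable-length inputs */
        if (labelLengths[i] > 256)  return false;  /* cudnn label-length cap */
    }
    return true;
}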

@szha
Member

szha commented Sep 27, 2017

Unfortunately I have to turn cudnn CTC off because of the API design. I have requested API changes from the NVIDIA team, and hopefully we can incorporate cudnn CTC once they change the API.

@sbodenstein
Contributor Author

only enables cudnn version when all input requirements are met, since the cudnn version is strictly more limited.

@szha: why was that not enough? Why do you need to completely turn it off?

@szha
Member

szha commented Sep 28, 2017

Input lengths not being supported means that there's no way for me to enforce consistency between the CPU and GPU implementations. The cudnn version also doesn't reset the padding-area gradients to zero, which can cause instability in training. It's too much of a risk to leave it in.
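
The missing cleanup would look roughly like this (a hypothetical sketch; zero_padding_gradients is not a real function, and it assumes the (T, N, A) gradient layout from the descriptor above with a host-visible buffer for simplicity):

#include <stddef.h>

/* Zero the gradient entries past each sequence's valid length, which
   the cudnn kernel leaves untouched. */
void zero_padding_gradients(float *grad, const int *inputLengths,
                            int T, int N, int A) {
    for (int n = 0; n < N; ++n)
        for (int t = inputLengths[n]; t < T; ++t)
            for (int a = 0; a < A; ++a)
                grad[((size_t)t * N + n) * A + a] = 0.0f;
}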

@szha
Member

szha commented Dec 29, 2017

@apache/mxnet-committers: This issue has been inactive for the past 90 days. It has no label and needs triage.

For general "how-to" questions, our user forum (and Chinese version) is a good place to get help.

@szha
Member

szha commented Apr 10, 2018

cudnn integration for CTC requires action from the cudnn team to improve the API.
