MXNET_CUDNN_AUTOTUNE_DEFAULT problems #15529
Comments
Hello @intgogo, the problem appeared for me more often when I increased the number of layers of the same model architecture. For some models this optimization took roughly 15 s, causing a long delay during training, but disabling the cuDNN optimization resulted in an even longer total training time. I investigated the source code a bit, and it could be that the optimization isn't cached properly for all models:

```cpp
// An algo specification by the user may be cached here, but another
// convolution will match only if identically specified.
// We're caching results of *Get* as well as *Find*, but these records
// will be held distinctly because param_.cudnn_tune is part of the key.
CuDNNDeconvAlgoReg::Get()->FindOrElseRegister(...)
```

It might be useful to additionally provide users an explicit call for this operation, and also to save the results within the model or in a separate file.
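The find-or-register pattern in that snippet can be sketched in Python (a hypothetical illustration of the caching idea, not MXNet's actual C++ registry; the class and key layout here are made up):

```python
# Hypothetical sketch of an algo registry keyed by the full convolution
# specification; an entry is reused only if the spec matches exactly.
class AlgoRegistry:
    def __init__(self):
        self._cache = {}

    def find_or_else_register(self, key, tune_fn):
        # Shapes, strides, dtype, and the tuning mode are all part of the
        # key, so two convolutions hit the same entry only if identically
        # specified (mirroring the comment in the C++ code above).
        if key not in self._cache:
            self._cache[key] = tune_fn()  # expensive autotune runs once per key
        return self._cache[key]

registry = AlgoRegistry()
key = (("data", (1, 3, 224, 224)), ("kernel", (3, 3)), ("stride", (1, 1)), "tune_fastest")
first = registry.find_or_else_register(key, lambda: "winograd")
second = registry.find_or_else_register(key, lambda: "other")  # cache hit, tune_fn not called
```

A model that keeps generating convolutions with new shapes would miss this cache on every one, which matches the repeated-autotune symptom described in this issue.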
See #10567 for the feature request to cache cuDNN autotune results.

cc @ptrendx @DickJC123, it would be good if we could fix this problem.
Sorry this issue got buried on the to-do stack. To be honest though, it may not be a bug in the convolution algo cache implementation, but rather an issue with the policy for emitting warning messages. The repeated message comes after every 50 new additions to the convolution algo cache. At the point when there are 1000 differently-spec'd convolutions in the model, the following message appears once:
So if @intgogo still cares about this issue, maybe we can hear if the model in question is continually generating convolutions with unique parameters (shapes, strides, etc.). Also, I wouldn't mind hearing opinions about a new algo-cache-growth warning policy. Here are a couple of options:
Or we could add an environment variable to silence the warning altogether.
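The policies under discussion (warn on every N new cache entries, as today, versus warn once at a threshold) can be sketched as follows (hypothetical Python, not MXNet code; the thresholds mirror the 50/1000 numbers mentioned above):

```python
# Hypothetical sketch of two cache-growth warning policies.
class CacheWarningPolicy:
    def __init__(self, every_n=50, once_at=1000, warn_once=True):
        self.every_n = every_n      # current behavior: warn every N additions
        self.once_at = once_at      # alternative: warn once at this threshold
        self.warn_once = warn_once
        self.count = 0
        self.warned = False

    def on_new_entry(self):
        """Return True if a warning should be emitted for this addition."""
        self.count += 1
        if self.warn_once:
            if self.count == self.once_at and not self.warned:
                self.warned = True
                return True         # emitted exactly once, then silent
            return False
        return self.count % self.every_n == 0  # repeats as the cache grows

once_policy = CacheWarningPolicy(warn_once=True)
once_warnings = sum(once_policy.on_new_entry() for _ in range(2000))

repeat_policy = CacheWarningPolicy(warn_once=False)
repeat_warnings = sum(repeat_policy.on_new_entry() for _ in range(2000))
```

With the warn-once policy the user sees a single message at 1000 entries; the repeating policy produces 40 messages over the same 2000 additions.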
Warning only once looks good. We should also allow users to query and set the benchmarking options. Ideally, we should allow tuning results to be saved and loaded, so that in a deployment environment things are stable and won't cause surprises.
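Saving and loading tuning results, as suggested, might look something like this (a hypothetical sketch; the file format and helper names are illustrative and not an existing MXNet API):

```python
import json
import os
import tempfile

# Hypothetical persistence for autotune results: a deployment reloads
# stable algo choices from disk instead of re-benchmarking at startup.
def save_tune_cache(cache, path):
    # Keys are assumed to be strings describing the conv spec.
    with open(path, "w") as f:
        json.dump(cache, f)

def load_tune_cache(path):
    # Missing file means no cached results yet; start empty.
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "mxnet_tune_cache.json")
cache = {"conv:1x3x224x224:k3x3:s1:tune_fastest": "implicit_gemm"}
save_tune_cache(cache, path)
restored = load_tune_cache(path)
```

The key point is determinism: once results are pinned to a file, a redeployed model picks the same algorithms every run.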
About a year ago we discussed adding a general interface for graph optimization to MXNet: #16173. I thought the easiest way to do this was to add a
When I set MXNET_CUDNN_AUTOTUNE_DEFAULT=1, it won't stop. WTF~

It keeps printing:

```
this can take a while...
```

and never stops.
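For reference, the autotune messages can be avoided entirely by disabling autotuning before MXNet is imported (a minimal Python sketch; per MXNet's environment-variable documentation, 0 disables autotune):

```python
import os

# Must be set before mxnet is imported, since the value is read at startup.
# 0 = autotune off, 1 = limited-workspace autotune (the default).
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# import mxnet as mx  # import only after the variable is set
```

The trade-off noted earlier in this thread still applies: skipping autotune avoids the startup delay but can leave you with slower convolution algorithms overall.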