Fails on CUDNN_STATUS_EXECUTION_FAILED #14

Closed
neubig opened this issue Aug 2, 2019 · 3 comments

@neubig
Collaborator

neubig commented Aug 2, 2019

This is the same error that was reported by @gsh2014 in #4, but I figure it'd be better to have it as a separate issue. I'm running into the same problem:

Namespace(action_embed_size=128, answer_prune=True, asdl_file='asdl/lang/py3/py3_asdl.simplified.txt', att_vec_size=256, batch_size=10, beam_size=15, clip_grad=5.0, column_att='affine', cuda=True, decay_lr_every_epoch=False, decode_max_time_step=100, decoder_word_dropout=0.0, dev_file='data/conala/dev.var_str_sep.bin', dropout=0.0, embed_size=128, eval_top_pred_only=False, evaluator='conala_evaluator', field_embed_size=64, glorot_init=True, glove_embed_path=None, hidden_size=256, lang='python', load_model=None, log_every=50, lr=0.001, lr_decay=0.5, lr_decay_after_epoch=15, lstm='lstm', max_epoch=50, max_num_trial=5, mode='train', negative_sample_type='best', no_copy=False, no_input_feed=False, no_parent_field_embed=False, no_parent_field_type_embed=True, no_parent_production_embed=True, no_parent_state=False, no_query_vec_to_action_map=False, optimizer='Adam', parser='default_parser', patience=5, primitive_token_label_smoothing=0.0, ptrnet_hidden_dim=32, query_vec_to_action_diff_map=False, readout='identity', reset_optimizer=False, sample_size=5, save_all_models=False, save_decode_to=None, save_to='saved_models/conala/model.sup.conala.lstm.hidden256.embed128.action128.field64.type64.dr0.0.lr0.001.lr_de0.5.lr_da15.beam15.vocab.var_str_sep.src_freq3.code_freq3.bin.train.var_str_sep.bin.glorot.par_state.seed0', seed=0, sql_db_file=None, src_token_label_smoothing=0.0, sup_attention=False, test_file=None, train_file='data/conala/train.var_str_sep.bin', transition_system='python3', type_embed_size=64, uniform_init=None, valid_every_epoch=1, valid_metric='acc', verbose=False, vocab='data/conala/vocab.var_str_sep.src_freq3.code_freq3.bin', word_dropout=0.0)
Traceback (most recent call last):
  File "exp.py", line 251, in <module>
    train(args)
  File "exp.py", line 71, in train
    if args.cuda: model.cuda()
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in _apply
    self.flatten_parameters()
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 102, in flatten_parameters
    fn.rnn_desc = rnn.init_rnn_descriptor(fn, handle)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 42, in init_rnn_descriptor
    cudnn.DropoutDescriptor(handle, dropout_p, fn.dropout_seed)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 207, in __init__
    self._set(dropout, seed)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 232, in _set
    ctypes.c_ulonglong(seed),
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 283, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'

I'm not sure if it's related, but I did find this: pytorch/pytorch#953
That post seemed to indicate it might be an out-of-memory error, so I tried to reduce the batch size and size of the hidden dimensions, but this didn't change anything...

@pcyin: I was able to reproduce the error on ogma, so maybe you'd be able to as well?

@neubig
Collaborator Author

neubig commented Aug 2, 2019

FYI: I found that the problem occurs because I'm using a 2080 Ti, which fails with any CUDA version below 10. The environment suggested by TranX uses CUDA 9. I started fixing this by addressing issue #10 and building a more modern environment, but the newest version of PyTorch doesn't work with tranX, and there are several places that need fixing. Will update when I finish.
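
For anyone hitting the same error, a check along these lines may help confirm a GPU/CUDA mismatch before training starts. This is only a sketch: `MIN_CUDA_FOR_CAPABILITY`, `parse_version` and `cuda_build_supports` are hypothetical names made up for illustration (they are not part of tranX or PyTorch), and the table lists only the case from this thread (the 2080 Ti reports compute capability 7.5, which needs CUDA 10.0 or newer). The final lines assume a PyTorch version that provides `torch.version.cuda` and `torch.cuda.get_device_capability`.

```python
try:
    import torch
except ImportError:  # the pure-Python helpers below still work without PyTorch
    torch = None

# Assumption for illustration: only the case from this issue is listed.
# Turing GPUs such as the RTX 2080 Ti (compute capability 7.5) need CUDA >= 10.0.
MIN_CUDA_FOR_CAPABILITY = {(7, 5): (10, 0)}


def parse_version(version):
    """Turn a string like '9.0' or '10.1.243' into a (major, minor) tuple."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_build_supports(capability, cuda_version):
    """Return False if the CUDA version is known to be too old for the GPU."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        return True  # GPU not in the table: this sketch makes no judgment
    return parse_version(cuda_version) >= required


if torch is not None and torch.cuda.is_available() and torch.version.cuda:
    capability = torch.cuda.get_device_capability(0)
    if not cuda_build_supports(capability, torch.version.cuda):
        print("PyTorch was built against CUDA", torch.version.cuda,
              "which is too old for compute capability", capability)
```

With the CUDA 9 environment described above, `cuda_build_supports((7, 5), "9.0")` is false, which matches the failure reported in this issue.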

@neubig
Collaborator Author

neubig commented Aug 7, 2019

Should be fixed by #15

@chenyangh

chenyangh commented Aug 22, 2019

Hi, Prof. Neubig. I merged your PR into my fork manually, but there were still some issues with the WikiSQL task. I made the following changes in model/wikisql/parser.py to make it work.
Line 247, from
action_prob_var = torch.cat([torch.cat(action_probs_i).log().sum() for action_probs_i in action_probs])
->
action_prob_var = torch.stack([torch.stack(action_probs_i).log().sum() for action_probs_i in action_probs])
Line 459, from
new_hyp_scores = torch.cat([x['new_hyp_score'] for x in new_hyp_meta])
->
new_hyp_scores = torch.stack([x['new_hyp_score'].cuda() for x in new_hyp_meta])
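
A likely reason for swapping `torch.cat` for `torch.stack`, sketched below under the assumption of PyTorch 0.4 or newer: reductions such as `.sum()` return 0-dimensional tensors there, and `torch.cat` refuses 0-dimensional inputs, while `torch.stack` adds a new dimension and accepts them. The helper name `sum_log_probs` and the toy probabilities are made up for illustration and are not taken from parser.py.

```python
try:
    import torch
except ImportError:  # assumption: PyTorch >= 0.4 is installed when running this
    torch = None


def sum_log_probs(action_probs):
    """Return a 1-D tensor with the summed log-probability of each example."""
    # Each inner list holds 0-dim tensors. torch.stack creates a new leading
    # dimension, so it works here; torch.cat would raise on 0-dim inputs.
    return torch.stack([torch.stack(probs_i).log().sum()
                        for probs_i in action_probs])


if torch is not None:
    # Toy data: two examples with two and one action probabilities.
    action_probs = [
        [torch.tensor(0.5), torch.tensor(0.25)],
        [torch.tensor(1.0)],
    ]
    scores = sum_log_probs(action_probs)
    print(scores.shape)  # torch.Size([2])

    try:
        torch.cat([torch.tensor(0.5), torch.tensor(0.25)])
    except RuntimeError as err:
        print("torch.cat on 0-dim tensors failed:", err)
```

The extra `.cuda()` call on line 459 suggests the hypothesis scores were on the CPU at that point; whether that holds in every configuration is not established in this thread.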

@neubig neubig closed this as completed Oct 9, 2019