Can not train deepspeech on gtx 2070 #631

JRMeyer · 2021-03-08T03:12:46Z

JRMeyer
Mar 8, 2021
Maintainer

>>> tyler_stewartt
[May 14, 2019, 12:14pm]

ISSUE
Can not train DeepSpeech on GTX 2070. Tensorflow 1.13 isn't compatible
with the newer graphics card.

ERROR

> Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

> Failed to get convolution algorithm. This is probably because cuDNN
> failed to initialize

ACTIONS

1. Tensorflow 1.13 was compiled and built from source - the issue
persists.

2. Added extra configuration in config.py, line 63:
c.session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
c.session_config.gpu_options.allow_growth = True
The two configurations did not resolve the issue.
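For anyone reproducing step 2, here is a minimal sketch of how those two options are set on a session config. The helper name `apply_gpu_memory_options` and the stand-in config object are hypothetical and only there so the snippet runs without TensorFlow installed; in TensorFlow 1.x you would pass a real `tf.ConfigProto()` instead.

```python
from types import SimpleNamespace


def apply_gpu_memory_options(session_config, allow_growth=True, memory_fraction=None):
    """Set the two GPU memory options from the post on a ConfigProto-like object."""
    gpu_options = session_config.gpu_options
    # Let the process grow its GPU allocation on demand instead of grabbing it all.
    gpu_options.allow_growth = allow_growth
    if memory_fraction is not None:
        if not 0.0 < memory_fraction <= 1.0:
            raise ValueError("memory_fraction must be in (0, 1]")
        # Cap the share of GPU memory this process may use.
        gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return session_config


def make_stand_in_config():
    # Hypothetical stand-in for tf.ConfigProto(), used only to keep this runnable.
    return SimpleNamespace(
        gpu_options=SimpleNamespace(
            allow_growth=False,
            per_process_gpu_memory_fraction=0.0,
        )
    )


config = apply_gpu_memory_options(make_stand_in_config(), memory_fraction=0.6)
print(config.gpu_options.allow_growth)
print(config.gpu_options.per_process_gpu_memory_fraction)
```

As reported above, these settings did not help in this case, so treat this only as a record of what was tried.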

INFO
Using the latest version of DeepSpeech (v0.5.0-alpha) with NVIDIA GTX
2070, CUDA-10, CUDNN-7.5, TensorflowGPU-1.13.1

LOG

> root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech# ./run-ldc93s1.sh
> + [ ! -f DeepSpeech.py ]
> + [ ! -f data/ldc93s1/ldc93s1.csv ]
> + [ -d ]
> + python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path('deepspeech/ldc93s1'))
> + checkpoint_dir=/root/.local/share/deepspeech/ldc93s1
> + export CUDA_VISIBLE_DEVICES=0
> + python -u DeepSpeech.py --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --log_level 0
> 2019-05-14 11:58:21.309114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> 2019-05-14 11:58:21.424454: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
> 2019-05-14 11:58:21.425146: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x33890f0 executing computations on platform CUDA. Devices:
> 2019-05-14 11:58:21.425163: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
> 2019-05-14 11:58:21.427067: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3491870000 Hz
> 2019-05-14 11:58:21.427566: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3c633e0 executing computations on platform Host. Devices:
> 2019-05-14 11:58:21.427587: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
> 2019-05-14 11:58:21.427976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
> name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
> pciBusID: 0000:02:00.0
> totalMemory: 7.76GiB freeMemory: 7.39GiB
> 2019-05-14 11:58:21.427994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
> 2019-05-14 11:58:21.428764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
> 2019-05-14 11:58:21.428779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
> 2019-05-14 11:58:21.428786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
> 2019-05-14 11:58:21.429141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
> WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
> Instructions for updating:
> tf.py_func is deprecated in TF V2. Instead, use
> tf.py_function, which takes a python function which manipulates tf eager
> tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
> an ndarray (just call tensor.numpy()) but having access to eager tensors
> means tf.py_functions can use accelerators such as GPUs as well as
> being differentiable using a gradient tape.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-05-14 11:58:22.280196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:22.280233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:22.280241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-05-14 11:58:22.280247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-05-14 11:58:22.280595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
D Session opened.
I Initializing variables...
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 2019-05-14 11:58:23.028141: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-05-14 11:58:24.275096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-14 11:58:24.289759: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1334, in _do_call
return fn(*args)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d/Conv2D}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File 'DeepSpeech.py', line 829, in <module>
tf.app.run(main)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py', line 125, in run
_sys.exit(main(argv))
File 'DeepSpeech.py', line 813, in main
train()
File 'DeepSpeech.py', line 510, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File 'DeepSpeech.py', line 483, in run_set
feed_dict=feed_dict)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 929, in run
run_metadata_ptr)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1152, in _run
feed_dict_tensor, options, run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1328, in _do_run
run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

Caused by op 'tower_0/conv1d/Conv2D', defined at:
File 'DeepSpeech.py', line 829, in <module>
tf.app.run(main)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py', line 125, in run
_sys.exit(main(argv))
File 'DeepSpeech.py', line 813, in main
train()
File 'DeepSpeech.py', line 400, in train
gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
File 'DeepSpeech.py', line 253, in get_tower_results
avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File 'DeepSpeech.py', line 186, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
File 'DeepSpeech.py', line 119, in create_model
batch_x = create_overlapping_windows(batch_x)
File 'DeepSpeech.py', line 56, in create_overlapping_windows
batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py', line 574, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py', line 574, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py', line 3482, in conv1d
data_format=data_format)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py', line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py', line 788, in _apply_op_helper
op_def=op_def)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py', line 507, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py', line 3300, in create_op
op_def=op_def)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py', line 1801, in __init__
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech#
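Since the interesting lines are buried in the output above, here is a rough sketch for pulling the error-level TensorFlow messages out of a log like this one. The regex assumes the `timestamp: LEVEL source:line] message` layout seen in this log; `find_tf_errors` is a hypothetical helper name, not part of DeepSpeech or TensorFlow.

```python
import re

# Matches lines such as:
# 2019-05-14 11:58:24.275096: E tensorflow/.../cuda_dnn.cc:334] message
TF_LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+:\d+)\] (?P<message>.*)$"
)


def find_tf_errors(log_text):
    """Return (source, message) pairs for error (E) and fatal (F) log lines."""
    errors = []
    for line in log_text.splitlines():
        match = TF_LOG_LINE.search(line)
        if match and match.group("level") in ("E", "F"):
            errors.append((match.group("source"), match.group("message")))
    return errors


# Two lines copied from the log above: one informational, one error.
sample = (
    "2019-05-14 11:58:23.028141: I tensorflow/stream_executor/dso_loader.cc:152] "
    "successfully opened CUDA library libcublas.so.10.0 locally\n"
    "2019-05-14 11:58:24.275096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] "
    "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR\n"
)

errors = find_tf_errors(sample)
for source, message in errors:
    print(source, "->", message)
```

On this log it surfaces only the `cuda_dnn.cc` lines, which is where the failure first shows up before the Python tracebacks.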

Can this issue be resolved? Any help appreciated. Thanks

[This is an archived TTS discussion thread from discourse.mozilla.org/t/can-not-train-deepspeech-on-gtx-2070]

JRMeyer · 2021-03-08T03:12:49Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> lissyx
[May 14, 2019, 12:12pm]

Looks like your cuDNN setup is improper? This is not really a
DeepSpeech issue. It works well here on an RTX 2080 Ti.

[Archived Post]


JRMeyer · 2021-03-08T03:12:51Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> tyler_stewartt
[May 14, 2019, 12:18pm]

Okay will verify. thanks

[Archived Post]


JRMeyer · 2021-03-08T03:12:54Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> daniel.cruzado
[May 16, 2019, 7:26am]

That is a problem with your CUDA version. I had the same problem when I
upgraded to version 0.5 with TF 1.13. Are you sure that you are using
CUDA 10.0 and not CUDA 10.1?
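One way to check this, sketched under a few assumptions: the text being parsed comes from `nvcc --version` or from a `version.txt` file in the CUDA install directory, and the expected version `(10, 0)` is simply the one suggested in this post. The function names are hypothetical, and the sample strings are illustrative rather than taken from the original poster's machine.

```python
import re


def parse_cuda_release(text):
    """Return (major, minor) parsed from CUDA version text, or None if absent."""
    # Handles both "... release 10.0, V10.0.130" and "CUDA Version 10.0.130".
    match = re.search(r"(?:release|CUDA Version)\s+(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def matches_expected(text, expected=(10, 0)):
    """True when the parsed CUDA version equals the expected (major, minor)."""
    return parse_cuda_release(text) == expected


# Illustrative inputs only.
nvcc_sample = "Cuda compilation tools, release 10.0, V10.0.130"
version_txt_sample = "CUDA Version 10.1.105"

print(parse_cuda_release(nvcc_sample), matches_expected(nvcc_sample))
print(parse_cuda_release(version_txt_sample), matches_expected(version_txt_sample))
```

If several CUDA toolkits are installed side by side, the version on the library path that TensorFlow actually loads is the one that matters, which this sketch does not detect.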

[Archived Post]


JRMeyer · 2021-03-08T03:12:57Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> SooMSooM
[May 23, 2019, 1:26pm]

I have exactly the same problem...
Have cuda 10.0 and cudnn 7.5.1 installed and the gpu seems to run out of
memory.

Any updates or solutions?
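To confirm whether the GPU really is running out of memory, a small sketch that parses the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` may help. The helper names are hypothetical and the sample numbers are made up for illustration; they are not measurements from this thread.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse lines like '371, 7952' (used MiB, total MiB), one line per GPU."""
    gpus = []
    lines = [line for line in csv_text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {
                "index": index,
                "used_mib": used,
                "total_mib": total,
                "free_mib": total - used,
            }
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and parse its output; requires the NVIDIA driver tools."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_gpu_memory(output)


# Made-up sample output for a single GPU.
sample_gpus = parse_gpu_memory("371, 7952\n")
print(sample_gpus)
```

Running `query_gpu_memory()` before and during training would show whether another process already holds most of the card's memory.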

[Archived Post]


JRMeyer · 2021-03-08T03:12:59Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> tyler_stewartt
[May 23, 2019, 4:47pm]

No updates or solutions. I verified proper setup of Cuda10 and Cudnn7.5.
Can train with other cards.

[Archived Post]


JRMeyer · 2021-03-08T03:13:02Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> lissyx
[May 24, 2019, 8:02am]

> Have cuda 10.0 and cudnn 7.5.1 installed and the gpu seems to run out
> of memory.
>
> Any updates or solutions?

If you are experiencing OOM on the GPU, please reduce the batch size.
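As a rough illustration of that advice, the sketch below halves the batch size each time a training attempt fails with an out-of-memory error. Everything here is hypothetical: `train_with_backoff` and `fake_train` are made-up names, and `MemoryError` stands in for whatever exception your framework raises (in TensorFlow that would typically be `tf.errors.ResourceExhaustedError`). With DeepSpeech the batch size itself is set through `--train_batch_size`, as in the command shown in the log.

```python
def train_with_backoff(train_fn, batch_size, oom_errors=(MemoryError,), min_batch_size=1):
    """Call train_fn(batch_size), halving batch_size after each OOM failure."""
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except oom_errors:
            if batch_size <= min_batch_size:
                # Nothing left to reduce, so the problem is not the batch size.
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_train(batch_size):
    # Stand-in for a real training run: pretend anything above 4 does not fit.
    if batch_size > 4:
        raise MemoryError("simulated GPU OOM")
    return "trained with batch size {}".format(batch_size)


result = train_with_backoff(fake_train, batch_size=32)
print(result)
```

If the failure persists at a batch size of 1, as in the original report, the cause is unlikely to be the batch size.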

[Archived Post]


JRMeyer · 2021-03-08T03:13:05Z

JRMeyer
Mar 8, 2021
Maintainer Author

>>> lissyx
[May 24, 2019, 8:03am]

> No updates or solutions. I verified proper setup of Cuda10 and
> Cudnn7.5. Can train with other cards.

What do you mean by "can train with other cards"?

Anyway, the reported error does not come from DeepSpeech code. At best
it would be an upstream TensorFlow issue, which is nothing we can help with.

The error explicitly states a failure at initializing cuDNN.

[Archived Post]
