>>> tyler_stewartt
[May 14, 2019, 12:14pm]
ISSUE
Cannot train DeepSpeech on an RTX 2070 (reported as GeForce RTX 2070 in the log below). TensorFlow 1.13 does not seem to be compatible
with this newer graphics card.
ERROR
> Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
> Failed to get convolution algorithm. This is probably because cuDNN
> failed to initialize
ACTIONS
1. TensorFlow 1.13 was compiled and built from source; the issue
persists.
2. Added extra configuration in config.py (line 63):
c.session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
c.session_config.gpu_options.allow_growth = True
These two settings did not resolve the issue.
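For readers unfamiliar with the two options in step 2, here is a minimal sketch of how they are set on a session config. It assumes a TensorFlow 1.x style config object that exposes a `gpu_options` attribute (as `tf.ConfigProto()` does). The helper name `apply_gpu_memory_options` is hypothetical and is not part of DeepSpeech or TensorFlow. A stand-in object is used so the snippet runs without TensorFlow installed.

```python
from types import SimpleNamespace


def apply_gpu_memory_options(session_config, allow_growth=True, memory_fraction=None):
    """Set the two GPU memory options mentioned in the post.

    `session_config` is assumed to behave like a TensorFlow 1.x
    tf.ConfigProto(): it must expose a `gpu_options` attribute.
    This helper is hypothetical and only illustrates the settings.
    """
    gpu_options = session_config.gpu_options
    # Let the process grow its GPU allocation on demand instead of
    # reserving (almost) all GPU memory up front.
    gpu_options.allow_growth = allow_growth
    if memory_fraction is not None:
        if not 0.0 < memory_fraction <= 1.0:
            raise ValueError("memory_fraction must be in (0, 1]")
        # Cap the share of GPU memory this process may use.
        gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return session_config


# Stand-in config so the sketch runs without TensorFlow.
# With TensorFlow 1.x you would pass tf.ConfigProto() instead.
fake_config = SimpleNamespace(gpu_options=SimpleNamespace())
apply_gpu_memory_options(fake_config, allow_growth=True, memory_fraction=0.6)
print(fake_config.gpu_options)
```

These options only change how GPU memory is reserved. As noted above, setting them did not make the cuDNN error go away in this setup.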
INFO
Using the latest version of DeepSpeech (v0.5.0-alpha) with an NVIDIA RTX
2070, CUDA 10, cuDNN 7.5, and tensorflow-gpu 1.13.1.
LOG
> root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech# ./run-ldc93s1.sh
> + [ ! -f DeepSpeech.py ]
> + [ ! -f data/ldc93s1/ldc93s1.csv ]
> + [ -d ]
> + python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path('deepspeech/ldc93s1'))
> + checkpoint_dir=/root/.local/share/deepspeech/ldc93s1
> + export CUDA_VISIBLE_DEVICES=0
> + python -u DeepSpeech.py --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --log_level 0
> 2019-05-14 11:58:21.309114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> 2019-05-14 11:58:21.424454: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
> 2019-05-14 11:58:21.425146: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x33890f0 executing computations on platform CUDA. Devices:
> 2019-05-14 11:58:21.425163: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
> 2019-05-14 11:58:21.427067: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3491870000 Hz
> 2019-05-14 11:58:21.427566: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3c633e0 executing computations on platform Host. Devices:
> 2019-05-14 11:58:21.427587: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
> 2019-05-14 11:58:21.427976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
> name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
> pciBusID: 0000:02:00.0
> totalMemory: 7.76GiB freeMemory: 7.39GiB
> 2019-05-14 11:58:21.427994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
> 2019-05-14 11:58:21.428764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
> 2019-05-14 11:58:21.428779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
> 2019-05-14 11:58:21.428786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
> 2019-05-14 11:58:21.429141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
> WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
> Instructions for updating:
> tf.py_func is deprecated in TF V2. Instead, use
> tf.py_function, which takes a python function which manipulates tf eager
> tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
> an ndarray (just call tensor.numpy()) but having access to eager tensors
> means `tf.py_function`s can use accelerators such as GPUs as well as
> being differentiable using a gradient tape.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-05-14 11:58:22.280196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:22.280233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:22.280241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-05-14 11:58:22.280247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-05-14 11:58:22.280595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
D Session opened.
I Initializing variables...
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 2019-05-14 11:58:23.028141: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-05-14 11:58:24.275096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-14 11:58:24.289759: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1334, in _do_call
return fn(*args)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d/Conv2D}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File 'DeepSpeech.py', line 829, in <module>
tf.app.run(main)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py', line 125, in run
_sys.exit(main(argv))
File 'DeepSpeech.py', line 813, in main
train()
File 'DeepSpeech.py', line 510, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File 'DeepSpeech.py', line 483, in run_set
feed_dict=feed_dict)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 929, in run
run_metadata_ptr)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1152, in _run
feed_dict_tensor, options, run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1328, in _do_run
run_metadata)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py', line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
Caused by op 'tower_0/conv1d/Conv2D', defined at:
File 'DeepSpeech.py', line 829, in <module>
tf.app.run(main)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py', line 125, in run
_sys.exit(main(argv))
File 'DeepSpeech.py', line 813, in main
train()
File 'DeepSpeech.py', line 400, in train
gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
File 'DeepSpeech.py', line 253, in get_tower_results
avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File 'DeepSpeech.py', line 186, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
File 'DeepSpeech.py', line 119, in create_model
batch_x = create_overlapping_windows(batch_x)
File 'DeepSpeech.py', line 56, in create_overlapping_windows
batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py', line 574, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py', line 574, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py', line 3482, in conv1d
data_format=data_format)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py', line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py', line 788, in _apply_op_helper
op_def=op_def)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py', line 507, in new_func
return func(*args, **kwargs)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py', line 3300, in create_op
op_def=op_def)
File '/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py', line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech#
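The log above is long, so here is a small sketch that pulls out the lines that matter when triaging it: the GPU that TensorFlow detected and the cuDNN failure signatures. The regular expressions are modeled on the messages in this particular log and are an assumption. Other TensorFlow versions may word them differently. The function name `summarize_log` is hypothetical.

```python
import re

# Patterns modeled on lines in the log above; other TensorFlow
# versions may phrase these messages differently.
CUDNN_HANDLE = re.compile(r"Could not create cudnn handle: (\w+)")
CONV_FAILURE = re.compile(r"Failed to get convolution algorithm")
GPU_DEVICE = re.compile(
    r"name: (?P<name>[^,]+), pci bus id: [^,]+, "
    r"compute capability: (?P<cc>\d+\.\d+)"
)


def summarize_log(lines):
    """Collect detected GPUs and cuDNN failure signatures from log lines."""
    summary = {"devices": set(), "cudnn_statuses": set(), "conv_failures": 0}
    for line in lines:
        device = GPU_DEVICE.search(line)
        if device:
            summary["devices"].add((device.group("name"), device.group("cc")))
        handle = CUDNN_HANDLE.search(line)
        if handle:
            summary["cudnn_statuses"].add(handle.group(1))
        if CONV_FAILURE.search(line):
            summary["conv_failures"] += 1
    return summary


# Three lines abbreviated from the log above.
SAMPLE = [
    "Created TensorFlow device (/device:GPU:0 with 7185 MB memory) -> physical GPU "
    "(device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)",
    "E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] "
    "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR",
    "UnknownError: Failed to get convolution algorithm. "
    "This is probably because cuDNN failed to initialize",
]

summary = summarize_log(SAMPLE)
print(summary)
```

To run it on a saved log, pass an open file object instead of `SAMPLE`, since iterating a text file yields its lines.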
Can this issue be resolved? Any help is appreciated. Thanks.
[This is an archived TTS discussion thread from discourse.mozilla.org/t/can-not-train-deepspeech-on-gtx-2070]