
Low touch upgrade to TensorFlow 2.3 #3485

Closed
wants to merge 13 commits into from

Conversation

reuben
Contributor

@reuben reuben commented Jan 2, 2021

Keeps changes to a minimum by leveraging the fact that, under a tfv1.Session object, TensorFlow v1-style meta-graph construction works normally, including placeholders. The main change is in the model definition code: the previous LSTMBlockCell/static_rnn/CudnnRNN parametrized RNN implementation is replaced by tf.keras.layers.LSTM, which is supposed to use the most appropriate implementation given the layer configuration and host machine setup. This is a graph-breaking change, so GRAPH_VERSION is bumped.
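The pattern is roughly the following (an illustrative sketch, not the PR's actual code; the placeholder name, layer size, and feature dimension are made up):

```python
# Illustrative sketch of the approach described above, not the PR's code:
# with eager execution disabled, v1-style graph construction (including
# placeholders) still works under TF 2.x, while the RNN itself is a
# tf.keras.layers.LSTM instead of LSTMBlockCell/static_rnn/CudnnRNN.
import tensorflow as tf
import tensorflow.compat.v1 as tfv1

tfv1.disable_eager_execution()

# v1-style placeholder for a batch of features ([batch, time, features]).
batch_x = tfv1.placeholder(tf.float32, [None, None, 26], name='input_node')

# Keras is supposed to pick the best LSTM implementation for the
# layer configuration and host setup.
rnn_out = tf.keras.layers.LSTM(32, return_sequences=True)(batch_x)

with tfv1.Session() as session:
    session.run(tfv1.global_variables_initializer())
    out = session.run(rnn_out, feed_dict={batch_x: [[[0.0] * 26] * 10]})
```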

@DanBmh
Contributor

DanBmh commented Jan 3, 2021

Hi @reuben,
I think updating the code to TensorFlow 2 is a great idea.
But before you put too much work into it, I wanted to let you know that I already started something similar.

Instead of your low-touch upgrade, I'm trying to do a native TensorFlow 2 implementation. Its main goal is to replace the old DeepSpeech network with something newer and more accurate. I found that a complete reimplementation using the new TF2 features was cleaner than mixing this into the current training code. By the way, I also reworked the flags handling, which could be an idea for #3476.

Currently it's very experimental: you can only do single-GPU training for now, and I'm still missing a lot of the features DeepSpeech has, but my plan is to make it compatible with the DS bindings in the long run. You can find it here: https://gitlab.com/Jaco-Assistant/deepspeech-polyglot/-/merge_requests/7

I wanted to talk with you after the Christmas holidays, but it seems you are working too fast^^
Greetings
Daniel

@reuben
Contributor Author

reuben commented Jan 3, 2021

@DanBmh cool stuff! By the way I've also experimented with a full train loop rewrite path here: reuben/STT@cc8a774

@reuben
Contributor Author

reuben commented Jan 7, 2021

Hm, it seems that despite nothing having changed on the TF or DS build system side with this PR, the newly built TF cache contains broken symlinks: instead of referring to tc-workdir, they refer to the place tc-workdir points to, e.g. /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746. This includes some links that bazel requires in order to work, as they define the CXX toolchain.
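A quick way to confirm which of the unpacked links are dangling (not from the PR, just an illustrative helper):

```python
# Illustrative helper, not from the PR: list dangling symlinks under a
# directory, i.e. links whose targets (e.g. an absolute path into another
# worker's task_... dir) don't exist on this machine.
import os

def dangling_symlinks(root):
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() follows the link, so it is False exactly
            # when the link's target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return broken
```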

@reuben
Contributor Author

reuben commented Jan 7, 2021

new_tf_cache = https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.r2.3.4c4c6accdd524ac50150860031184bb1b17a0350.0.ios_arm64/artifacts/public/home.tar.xz - built from the commits in this PR

/private/tmp/scratch $ ls -lh new_tf_cache/DeepSpeech/ds/tensorflow/bazel-*
lrwxr-xr-x  1 build-user  wheel   136B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-bin -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/bin
lrwxr-xr-x  1 build-user  wheel   118B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-out -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow/bazel-out
lrwxr-xr-x  1 build-user  wheel   108B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-tensorflow -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow
lrwxr-xr-x  1 build-user  wheel   141B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-testlogs -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/testlogs

old_tf_cache = https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.r2.3.23ad988fcde60fb01f9533e95004bbc4877a9143.0.ios_arm64/artifacts/public/home.tar.xz - the current master cache

lrwxr-xr-x  1 build-user  wheel   126B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-bin -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/bin
lrwxr-xr-x  1 build-user  wheel   108B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-out -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow/bazel-out
lrwxr-xr-x  1 build-user  wheel    98B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-tensorflow -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow
lrwxr-xr-x  1 build-user  wheel   131B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-testlogs -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/testlogs

@reuben
Contributor Author

reuben commented Jan 7, 2021

Oh, maybe it's related to the .taskcluster.yml v1 transition...

@reuben
Contributor Author

reuben commented Jan 7, 2021

Huh. bazel seems to be picking up some random taskdir, not even the same one from the task...

From this run:


++ pwd
+ export TASKCLUSTER_ARTIFACTS=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740/public/
+ TASKCLUSTER_ARTIFACTS=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740/public/
++ pwd
+ export TASKCLUSTER_ORIG_TASKDIR=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740
+ TASKCLUSTER_ORIG_TASKDIR=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740

And then later when bazel is run, this random task dir comes up:

+ bazel --output_user_root /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/ --output_base /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache//output/ info
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'info' from /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/DeepSpeech/ds/tensorflow/.bazelrc:

task_159956485967440 then gets used throughout the build...

@reuben
Contributor Author

reuben commented Jan 7, 2021

In the old build it correctly uses tc-workdir for the build:

+ bazel --output_user_root /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/ --output_base /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache//output/ build -s --explain bazel_monolithic_tf.log --verbose_explanations --experimental_strict_action_env --config=monolithic -c opt --config=ios_arm64 --define=runtime=tflite --copt=-DTFLITE_WITH_RUY_GEMV //tensorflow/lite/c:libtensorflowlite_c.so
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/DeepSpeech/ds/tensorflow/.bazelrc:

@reuben
Contributor Author

reuben commented Jan 7, 2021

Looking at the whole output of bazel info, it's even mixing the two taskdirs:

bazel-bin: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out/darwin-opt/bin
bazel-genfiles: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out/darwin-opt/bin
bazel-testlogs: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out/darwin-opt/testlogs
character-encoding: file.encoding = ISO-8859-1, defaultCharset = ISO-8859-1
command_log: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/command.log
committed-heap-size: 511MB
execution_root: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow
gc-count: 15
gc-time: 1483ms
install_base: /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/install/1d6d7a22a62da56414387a04bacbf619
java-home: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/install/1d6d7a22a62da56414387a04bacbf619/embedded_tools/jdk
java-runtime: OpenJDK Runtime Environment (build 11.0.6+10-LTS) by Azul Systems, Inc.
java-vm: OpenJDK 64-Bit Server VM (build 11.0.6+10-LTS, mixed mode) by Azul Systems, Inc.
max-heap-size: 2147MB
output_base: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output
output_path: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out
package_path: %workspace%
release: release 3.1.0
repository_cache: /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/cache/repos/v1
server_log: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/java.log.ds-worker-vm-tf-build-b992a325.build-user.log.java.20210107-015341.94028
server_pid: 94028
used-heap-size: 36MB
workspace: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/DeepSpeech/ds/tensorflow

@reuben
Contributor Author

reuben commented Jan 7, 2021

Looks like it was just some macOS-specific bash craziness:

-        (rm -fr ../tc-workdir/ ; mkdir ../tc-workdir/) && cd ../tc-workdir/ &&
+        (ls -lh .. && rm -fr ../tc-workdir && mkdir ../tc-workdir && ls -lh ..) && cd ../tc-workdir &&

@reuben
Contributor Author

reuben commented Jan 7, 2021

Maybe that's also related to the intermittent disk space problems on macOS workers? The rm -fr command wasn't actually deleting anything.

@reuben reuben marked this pull request as ready for review January 8, 2021 11:56
@reuben reuben requested a review from lissyx January 8, 2021 11:56
@reuben
Contributor Author

reuben commented Jan 8, 2021

@lissyx ready for review!

@@ -42,7 +42,7 @@ payload:
- >
export TASKCLUSTER_ARTIFACTS="$(pwd)/public/" &&
export TASKCLUSTER_ORIG_TASKDIR="$(pwd)" &&
(rm -fr ../tc-workdir/ ; mkdir ../tc-workdir/) && cd ../tc-workdir/ &&
(ls -lh .. && rm -fr ../tc-workdir && mkdir ../tc-workdir && ls -lh ..) && cd ../tc-workdir &&
Collaborator


this might trigger adverse effects in the future, I remember fighting weird bugs because of a missing / :)

Contributor Author


Hmm. This was causing problems because tc-workdir wasn't actually being deleted, so the build would pick up older files and fail. This change fixed that problem reliably, but I don't know if there are other side effects hiding in the shadows.


def _ending_tester(self, value_range, clock_min, clock_max, expected_min, expected_max):
    with self.session as session:
        tf_pick = tf_pick_value_from_range(value_range, clock=self.clock_ph)
        clock_ph = tfv1.placeholder(dtype=tf.float64, name='clock')
Collaborator


that looks weird, we create the placeholder at the end now, instead of in the __init__?

# Transpose to batch major and apply softmax for decoder
transposed = tf.nn.softmax(tf.transpose(a=logits, perm=[1, 0, 2]))
# Apply softmax and transpose to batch major for batch decoder
transposed = tf.transpose(tf.nn.softmax(logits), [1, 0, 2])
Collaborator


we moved from softmax(transpose) to transpose(softmax), was that on purpose?

Contributor Author


Not on purpose, just a result of iterating on the patch. But it's equivalent: softmax is applied to the last dimension, which is not (and was not) transposed here.
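As a quick numerical sanity check (not part of the PR; shapes are made up), softmax over the last axis commutes with a transpose that leaves that axis in place:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3, 4))  # [time, batch, classes]

a = softmax(np.transpose(logits, [1, 0, 2]))   # transpose, then softmax
b = np.transpose(softmax(logits), [1, 0, 2])   # softmax, then transpose

assert np.allclose(a, b)
```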

Collaborator


right, but I guess for clarity keep it as before if that's better?

Contributor Author


Yeah, will do.

forget_bias=0,
reuse=reuse,
name='cudnn_compatible_lstm_cell')
class CreateOverlappingWindows(tf.keras.Model):
Collaborator


much nice

Collaborator

@lissyx lissyx left a comment


I'm just a bit worried about the softmax/transpose order.

@CatalinVoss
Collaborator

This looks awesome! Hopefully with lots of positive downstream impact.

@reuben
Contributor Author

reuben commented Jan 27, 2021

WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU

One of the checks for the cuDNN implementation is whether TF is running eagerly outside of any tf.function, which means the training code as it is in this PR fails to meet the constraints and doesn't use cuDNN :/
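For reference, a rough summary of the eligibility checks (paraphrased from TF 2.3's keras/layers/recurrent_v2.py; double-check the exact list against the source):

```python
# Paraphrased, approximate summary of when tf.keras.layers.LSTM can select
# the fused cuDNN kernel in TF 2.3; if any condition fails, the layer falls
# back to the generic GPU/CPU kernel.
CUDNN_LSTM_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}
# In addition to these constructor arguments, the layer must be built while
# executing eagerly outside any tf.function, which is the condition the
# training code in this PR currently fails.
```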

@reuben
Contributor Author

reuben commented Feb 2, 2021

The new graph wrecks memory utilization during training. On Quadro 6000s we used to be able to use a batch size of 128, and now even 64 runs OOM. I hope this is due to it not using the cuDNN kernel, but I'm not entirely sure... Need to investigate more. Without a fix for this it's hard to justify the upgrade.
