
Low touch upgrade to TensorFlow 2.3 #3485

Closed
wants to merge 13 commits into from

Conversation

reuben
Contributor

@reuben reuben commented Jan 2, 2021

Keeps changes to a minimum by leveraging the fact that, under a tfv1.Session object, TensorFlow v1-style meta-graph construction works normally, including placeholders. The main change is in the model definition code: the previous LSTMBlockCell/static_rnn/CudnnRNN parametrized RNN implementation is replaced by tf.keras.layers.LSTM, which is supposed to use the most appropriate implementation given the layer configuration and host machine setup. This is a graph-breaking change, so GRAPH_VERSION is bumped.
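The pattern is roughly the following (an illustrative sketch, not the PR's actual code; the placeholder name, layer size, and feature dimension are made up):

```python
# Illustrative sketch of the approach described above, not the PR's code:
# with eager execution disabled, v1-style graph construction (including
# placeholders) still works under TF 2.x, while the RNN itself is a
# tf.keras.layers.LSTM instead of LSTMBlockCell/static_rnn/CudnnRNN.
import tensorflow as tf
import tensorflow.compat.v1 as tfv1

tfv1.disable_eager_execution()

# v1-style placeholder for a batch of features ([batch, time, features]).
batch_x = tfv1.placeholder(tf.float32, [None, None, 26], name='input_node')

# Keras is supposed to pick the best LSTM implementation for the
# layer configuration and host setup.
rnn_out = tf.keras.layers.LSTM(32, return_sequences=True)(batch_x)

with tfv1.Session() as session:
    session.run(tfv1.global_variables_initializer())
    out = session.run(rnn_out, feed_dict={batch_x: [[[0.0] * 26] * 10]})
```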

@DanBmh
Contributor

DanBmh commented Jan 3, 2021

Hi @reuben,
I think updating the code to TensorFlow 2 is a great idea.
But before you put too much work into it, I wanted to let you know that I already started something similar.

Instead of your low-touch upgrade, I'm trying to do a native TensorFlow 2 implementation. Its main goal is to replace the old DeepSpeech network with something newer and more accurate. I found that a complete reimplementation using the new TF2 features was cleaner than mixing this into the current training code. By the way, I also reworked the flags handling, which could be an idea for #3476.

Currently it's very experimental: you can only do single-GPU training for now, and I'm still missing a lot of the features DeepSpeech has, but my plan is to make it compatible with the DS bindings in the long run. You can find it here: https://gitlab.com/Jaco-Assistant/deepspeech-polyglot/-/merge_requests/7

I wanted to talk with you after the Christmas holidays, but it seems you are working too fast^^
Greetings
Daniel

@reuben
Contributor Author

reuben commented Jan 3, 2021

@DanBmh cool stuff! By the way I've also experimented with a full train loop rewrite path here: reuben/STT@cc8a774

@reuben
Contributor Author

reuben commented Jan 7, 2021

Hm, it seems that despite nothing having changed on the TF or DS build system side with this PR, the newly built TF cache contains broken symlinks: instead of referring to tc-workdir, they refer to the place tc-workdir points to, e.g. /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746. This includes some links that bazel requires in order to work, as they define the CXX toolchain.
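A quick way to confirm which of the unpacked links are dangling (not from the PR, just an illustrative helper):

```python
# Illustrative helper, not from the PR: list dangling symlinks under a
# directory, i.e. links whose targets (e.g. an absolute path into another
# worker's task_... dir) don't exist on this machine.
import os

def dangling_symlinks(root):
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() follows the link, so it is False exactly
            # when the link's target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return broken
```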

@reuben
Contributor Author

reuben commented Jan 7, 2021

new_tf_cache = https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.r2.3.4c4c6accdd524ac50150860031184bb1b17a0350.0.ios_arm64/artifacts/public/home.tar.xz - built from the commits in this PR

/private/tmp/scratch $ ls -lh new_tf_cache/DeepSpeech/ds/tensorflow/bazel-*
lrwxr-xr-x  1 build-user  wheel   136B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-bin -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/bin
lrwxr-xr-x  1 build-user  wheel   118B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-out -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow/bazel-out
lrwxr-xr-x  1 build-user  wheel   108B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-tensorflow -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow
lrwxr-xr-x  1 build-user  wheel   141B Jan  4 06:06 new_tf_cache/DeepSpeech/ds/tensorflow/bazel-testlogs -> /Users/build-user/TaskCluster/Workdir/tasks/task_159957318579746/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/testlogs

old_tf_cache = https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.r2.3.23ad988fcde60fb01f9533e95004bbc4877a9143.0.ios_arm64/artifacts/public/home.tar.xz - the current master cache

lrwxr-xr-x  1 build-user  wheel   126B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-bin -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/bin
lrwxr-xr-x  1 build-user  wheel   108B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-out -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow/bazel-out
lrwxr-xr-x  1 build-user  wheel    98B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-tensorflow -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow
lrwxr-xr-x  1 build-user  wheel   131B Aug 26 02:57 old_tf_cache/DeepSpeech/ds/tensorflow/bazel-testlogs -> /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/output/execroot/org_tensorflow/bazel-out/ios_arm64-opt/testlogs

@reuben
Contributor Author

reuben commented Jan 7, 2021

Oh, maybe it's related to the .taskcluster.yml v1 transition...

@reuben
Contributor Author

reuben commented Jan 7, 2021

Huh. bazel seems to be picking up some random taskdir, not even the same one from the task...

From this run:


++ pwd
+ export TASKCLUSTER_ARTIFACTS=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740/public/
+ TASKCLUSTER_ARTIFACTS=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740/public/
++ pwd
+ export TASKCLUSTER_ORIG_TASKDIR=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740
+ TASKCLUSTER_ORIG_TASKDIR=/Users/build-user/TaskCluster/Workdir/tasks/task_160978434177740

And then later when bazel is run, this random task dir comes up:

+ bazel --output_user_root /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/ --output_base /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache//output/ info
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'info' from /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/DeepSpeech/ds/tensorflow/.bazelrc:

task_159956485967440 then gets used throughout the build...

@reuben
Contributor Author

reuben commented Jan 7, 2021

In the old build it correctly uses tc-workdir for the build:

+ bazel --output_user_root /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/ --output_base /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache//output/ build -s --explain bazel_monolithic_tf.log --verbose_explanations --experimental_strict_action_env --config=monolithic -c opt --config=ios_arm64 --define=runtime=tflite --copt=-DTFLITE_WITH_RUY_GEMV //tensorflow/lite/c:libtensorflowlite_c.so
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/DeepSpeech/ds/tensorflow/.bazelrc:

@reuben
Contributor Author

reuben commented Jan 7, 2021

Looking at the whole output of bazel info, it's even mixing the two taskdirs:

bazel-bin: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out/darwin-opt/bin
bazel-genfiles: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out/darwin-opt/bin
bazel-testlogs: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out/darwin-opt/testlogs
character-encoding: file.encoding = ISO-8859-1, defaultCharset = ISO-8859-1
command_log: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/command.log
committed-heap-size: 511MB
execution_root: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow
gc-count: 15
gc-time: 1483ms
install_base: /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/install/1d6d7a22a62da56414387a04bacbf619
java-home: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/install/1d6d7a22a62da56414387a04bacbf619/embedded_tools/jdk
java-runtime: OpenJDK Runtime Environment (build 11.0.6+10-LTS) by Azul Systems, Inc.
java-vm: OpenJDK 64-Bit Server VM (build 11.0.6+10-LTS, mixed mode) by Azul Systems, Inc.
max-heap-size: 2147MB
output_base: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output
output_path: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/execroot/org_tensorflow/bazel-out
package_path: %workspace%
release: release 3.1.0
repository_cache: /Users/build-user/TaskCluster/Workdir/tasks/tc-workdir/.bazel_cache/cache/repos/v1
server_log: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/.bazel_cache/output/java.log.ds-worker-vm-tf-build-b992a325.build-user.log.java.20210107-015341.94028
server_pid: 94028
used-heap-size: 36MB
workspace: /Users/build-user/TaskCluster/Workdir/tasks/task_159956485967440/DeepSpeech/ds/tensorflow

@reuben
Contributor Author

reuben commented Jan 7, 2021

Looks like it was just some macOS-specific bash craziness:

-        (rm -fr ../tc-workdir/ ; mkdir ../tc-workdir/) && cd ../tc-workdir/ &&
+        (ls -lh .. && rm -fr ../tc-workdir && mkdir ../tc-workdir && ls -lh ..) && cd ../tc-workdir &&

@reuben
Contributor Author

reuben commented Jan 7, 2021

Maybe that's also related to the intermittent disk space problems on macOS workers? The rm -fr command wasn't actually deleting anything.

@reuben reuben marked this pull request as ready for review January 8, 2021 11:56
@reuben reuben requested a review from lissyx January 8, 2021 11:56
@reuben
Contributor Author

reuben commented Jan 8, 2021

@lissyx ready for review!

@@ -42,7 +42,7 @@ payload:
- >
export TASKCLUSTER_ARTIFACTS="$(pwd)/public/" &&
export TASKCLUSTER_ORIG_TASKDIR="$(pwd)" &&
(rm -fr ../tc-workdir/ ; mkdir ../tc-workdir/) && cd ../tc-workdir/ &&
(ls -lh .. && rm -fr ../tc-workdir && mkdir ../tc-workdir && ls -lh ..) && cd ../tc-workdir &&
Collaborator


this might trigger adverse effects in the future, I remember fighting weird bugs because of a missing / :)

Contributor Author


Hmm. This was causing problems because tc-workdir wasn't actually being deleted, so the build would pick up older files and fail. This change fixed that problem reliably, but I don't know if there are other side effects hiding in the shadows.


def _ending_tester(self, value_range, clock_min, clock_max, expected_min, expected_max):
    with self.session as session:
        tf_pick = tf_pick_value_from_range(value_range, clock=self.clock_ph)
        clock_ph = tfv1.placeholder(dtype=tf.float64, name='clock')
Collaborator


that looks weird, we create the placeholder at the end now, instead of in the __init__?

# Transpose to batch major and apply softmax for decoder
transposed = tf.nn.softmax(tf.transpose(a=logits, perm=[1, 0, 2]))
# Apply softmax and transpose to batch major for batch decoder
transposed = tf.transpose(tf.nn.softmax(logits), [1, 0, 2])
Collaborator


we moved from softmax(transpose) to transpose(softmax), was that on purpose?

Contributor Author


Not on purpose, just a result of iterating on the patch. But it's equivalent: softmax is applied to the last dimension, which is not (and was not) transposed here.
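As a quick numerical sanity check (not part of the PR; shapes are made up), softmax over the last axis commutes with a transpose that leaves that axis in place:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3, 4))  # [time, batch, classes]

a = softmax(np.transpose(logits, [1, 0, 2]))   # transpose, then softmax
b = np.transpose(softmax(logits), [1, 0, 2])   # softmax, then transpose

assert np.allclose(a, b)
```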

Collaborator


right, but I guess for clarity keep it as before if that's better?

Contributor Author


Yeah, will do.

forget_bias=0,
reuse=reuse,
name='cudnn_compatible_lstm_cell')
class CreateOverlappingWindows(tf.keras.Model):
Collaborator


much nice

Collaborator

@lissyx lissyx left a comment


I'm just a bit worried about the softmax/transpose order.

@CatalinVoss
Collaborator

This looks awesome! Hopefully with lots of positive downstream impact.

@reuben
Contributor Author

reuben commented Jan 27, 2021

WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU

One of the checks for the cuDNN implementation is whether TF is running eagerly outside of any tf.function, which means the training code as it is in this PR fails to meet the constraints and doesn't use cuDNN :/
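For reference, a rough summary of the eligibility checks (paraphrased from TF 2.3's keras/layers/recurrent_v2.py; double-check the exact list against the source):

```python
# Paraphrased, approximate summary of when tf.keras.layers.LSTM can select
# the fused cuDNN kernel in TF 2.3; if any condition fails, the layer falls
# back to the generic GPU/CPU kernel.
CUDNN_LSTM_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}
# In addition to these constructor arguments, the layer must be built while
# executing eagerly outside any tf.function, which is the condition the
# training code in this PR currently fails.
```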

@reuben
Contributor Author

reuben commented Feb 2, 2021

The new graph wrecks memory utilization during training. On Quadro 6000s we used to be able to use a batch size of 128, and now even 64 runs OOM. I hope this is due to it not using the cuDNN kernel, but I'm not entirely sure... Need to investigate more. Without a fix for this it's hard to justify the upgrade.
