Low touch upgrade to TensorFlow 2.3 #3485
Conversation
Hi @reuben, instead of your low-touch upgrade, I'm trying to do a native TensorFlow 2 implementation. Its main goal is to replace the old DeepSpeech network with something newer and more accurate. I found that a complete reimplementation using the new TF2 features was cleaner than mixing this into the current training code. By the way, I also reworked the flags handling, which could be an idea for #3476. Currently it's very experimental, you can only do single-GPU training for now and I'm still missing a lot of the features DeepSpeech has, but my plan is to make it compatible with the DS bindings in the long run. You can find it here: https://gitlab.com/Jaco-Assistant/deepspeech-polyglot/-/merge_requests/7 I wanted to talk with you after the Christmas holidays, but it seems you are working too fast^^
@DanBmh cool stuff! By the way, I've also experimented with a full train loop rewrite path here: reuben/STT@cc8a774
Hm, it seems that despite nothing having changed on the TF or DS build system side of things with this PR, the newly built TF cache contains broken symlinks: instead of referring to tc-workdir, they refer to the place tc-workdir points to, e.g.:
new_tf_cache = https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.r2.3.4c4c6accdd524ac50150860031184bb1b17a0350.0.ios_arm64/artifacts/public/home.tar.xz (built from the commits in this PR)
old_tf_cache = https://community-tc.services.mozilla.com/api/index/v1/task/project.deepspeech.tensorflow.pip.r2.3.23ad988fcde60fb01f9533e95004bbc4877a9143.0.ios_arm64/artifacts/public/home.tar.xz (the current master cache)
Oh, maybe it's related to the .taskcluster.yml v1 transition...
Huh. bazel seems to be picking up some random taskdir, not even the same one from the task... From this run:
And then later when bazel is run, this random task dir comes up:
In the old build it correctly uses tc-workdir for the build:
Looking at the whole output of …
Looks like it was just some macOS-specific bash craziness:
- (rm -fr ../tc-workdir/ ; mkdir ../tc-workdir/) && cd ../tc-workdir/ &&
+ (ls -lh .. && rm -fr ../tc-workdir && mkdir ../tc-workdir && ls -lh ..) && cd ../tc-workdir &&
Maybe that's also related to the intermittent disk space problems on macOS workers?
@lissyx ready for review!
@@ -42,7 +42,7 @@ payload:
      - >
        export TASKCLUSTER_ARTIFACTS="$(pwd)/public/" &&
        export TASKCLUSTER_ORIG_TASKDIR="$(pwd)" &&
-       (rm -fr ../tc-workdir/ ; mkdir ../tc-workdir/) && cd ../tc-workdir/ &&
+       (ls -lh .. && rm -fr ../tc-workdir && mkdir ../tc-workdir && ls -lh ..) && cd ../tc-workdir &&
This might trigger adverse effects in the future, I remember fighting weird bugs because of a missing / :)
Hmm. This was causing problems because tc-workdir wasn't actually being deleted. So the build would pick up older files and fail. This change fixed that problem reliably but I don't know if there are other side effects hiding in the shadows.
def _ending_tester(self, value_range, clock_min, clock_max, expected_min, expected_max):
    with self.session as session:
-       tf_pick = tf_pick_value_from_range(value_range, clock=self.clock_ph)
+       clock_ph = tfv1.placeholder(dtype=tf.float64, name='clock')
That looks weird, we create the placeholder at the end now, instead of in the __init__?
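For context, v1-style placeholders only exist in graph mode under TF 2.x, so they have to live in the graph the test session actually runs. A minimal sketch of the pattern (illustrative only, names hypothetical):

```python
import tensorflow as tf
import tensorflow.compat.v1 as tfv1

tfv1.disable_eager_execution()  # placeholders require graph mode in TF 2.x

with tfv1.Session() as session:
    # Created inside the test method so it lands in the graph this
    # session executes, rather than once in __init__.
    clock_ph = tfv1.placeholder(dtype=tf.float64, name='clock')
    doubled = clock_ph * 2.0
    print(session.run(doubled, feed_dict={clock_ph: 0.5}))  # -> 1.0
```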
- # Transpose to batch major and apply softmax for decoder
- transposed = tf.nn.softmax(tf.transpose(a=logits, perm=[1, 0, 2]))
+ # Apply softmax and transpose to batch major for batch decoder
+ transposed = tf.transpose(tf.nn.softmax(logits), [1, 0, 2])
We moved from softmax(transpose) to transpose(softmax), is it on purpose?
Not on purpose, just a result of iterating on the patch. But it's equivalent, softmax is applied to the last dimension which is not (and was not) transposed here.
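A quick way to convince yourself of the equivalence (shapes here are hypothetical, [time, batch, classes]):

```python
import numpy as np
import tensorflow as tf

logits = tf.random.normal([5, 3, 29])  # [time, batch, classes], made up

a = tf.nn.softmax(tf.transpose(logits, perm=[1, 0, 2]))  # softmax(transpose)
b = tf.transpose(tf.nn.softmax(logits), perm=[1, 0, 2])  # transpose(softmax)

# softmax normalizes over the last axis, which both orderings leave in place
np.testing.assert_allclose(a.numpy(), b.numpy(), rtol=1e-6)
```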
Right, but for clarity I guess keep it as before if that reads better?
Yeah, will do.
- forget_bias=0,
- reuse=reuse,
- name='cudnn_compatible_lstm_cell')
+ class CreateOverlappingWindows(tf.keras.Model):
much nice
I'm just a bit worried about the softmax/transpose order.
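For readers following along, a rough sketch of what an overlapping-windows layer can look like as a tf.keras.Model, using the identity-convolution trick DeepSpeech has used before (parameter names and defaults here are assumptions, not necessarily what this PR ships):

```python
import numpy as np
import tensorflow as tf

class CreateOverlappingWindows(tf.keras.Model):
    """Stacks each frame with `context` frames on either side."""

    def __init__(self, num_channels=26, context=9, **kwargs):
        super().__init__(**kwargs)
        self.num_channels = num_channels
        self.window_width = 2 * context + 1
        # Identity kernel: the conv1d below just gathers the window_width
        # neighboring frames into the channel dimension, no learned weights.
        identity = np.eye(self.window_width * num_channels)
        self.eye_filter = tf.constant(
            identity.reshape(self.window_width, num_channels, -1),
            dtype=tf.float32)

    def call(self, x):  # x: [batch, time, num_channels]
        batch_size = tf.shape(x)[0]
        x = tf.nn.conv1d(input=x, filters=self.eye_filter,
                         stride=1, padding='SAME')
        return tf.reshape(
            x, [batch_size, -1, self.window_width, self.num_channels])
```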
This looks awesome! Hopefully with lots of positive downstream impact.
One of the checks for the CuDNN implementation is whether TF is running eagerly outside of any tf.functions, which means the training code as it is in this PR fails to meet the constraints and doesn't use CuDNN :/
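For reference, the documented requirements for tf.keras.layers.LSTM to dispatch to the fused cuDNN kernel in TF 2.x (the last one is the check mentioned above):

```python
import tensorflow as tf

# All of these must hold for the cuDNN path, per the TF 2.x docs:
lstm = tf.keras.layers.LSTM(
    units=2048,
    activation='tanh',               # 1. must be tanh
    recurrent_activation='sigmoid',  # 2. must be sigmoid
    recurrent_dropout=0,             # 3. must be 0
    unroll=False,                    # 4. must be False
    use_bias=True,                   # 5. must be True
)
# 6. inputs, if masked, must be strictly right-padded, and
# 7. eager execution must be enabled in the outermost context --
#    inside a v1-style Session graph that check fails, so the layer
#    falls back to the non-fused implementation.
```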
The new graph wrecks memory utilization during training. On Quadro 6000s we used to be able to get a batch size of 128, and now even 64 runs OOM. I hope this is due to it not using the cuDNN kernel, but I'm not entirely sure... Need to investigate more. Without a fix for this it's hard to justify the upgrade.
Keeps changes to a minimum by leveraging the fact that under a tfv1.Session object, TensorFlow v1 style meta-graph construction works normally, including placeholders. The main change is in the model definition code: the previous LSTMBlockCell/static_rnn/CudnnRNN parametrized RNN implementation is replaced by tf.keras.layers.LSTM, which is supposed to use the most appropriate implementation given the layer configuration and host machine setup. This is a graph-breaking change, so GRAPH_VERSION is bumped.
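A minimal sketch of the pattern the PR describes (shapes and names are illustrative, not the actual training code):

```python
import tensorflow as tf
import tensorflow.compat.v1 as tfv1

tfv1.disable_eager_execution()  # v1-style graph mode

with tfv1.Session() as session:
    # v1-style meta-graph construction still works here, placeholders included.
    # The feature size 494 is hypothetical (26 MFCC features x a 19-frame window).
    batch_x = tfv1.placeholder(tf.float32, [None, None, 494], name='input_node')
    # The LSTMBlockCell/static_rnn/CudnnRNN variants collapse into a single
    # layer that is supposed to pick the best available implementation itself:
    outputs = tf.keras.layers.LSTM(2048, return_sequences=True)(batch_x)
    session.run(tfv1.global_variables_initializer())
```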