Prefix decoding #649

patrick-wilken · 2021-09-14T09:55:50Z

This implements enforcing target tokens in ChoiceLayer provided by a separate data stream until it is "used up". After that (different lengths per sequence are possible) search continues as usual. The correct scores are assigned to the prefixes.

Useful to continue a target sequence provided by a user, or to enforce target-side prefix tokens that provide meta-information, e.g. the dialect to translate to.

Getting additional data streams into the recurrent layer that are on the same time axis but have a shorter length is somewhat tricky. I implemented that you have to mark them explicitly via a padded_data_keys parameter. The check for equal length is disabled for those data keys, instead they are padded if necessary.

albertz · 2021-09-14T10:12:37Z

We should not introduce another separate padding concept. This was already discussed as part of #391, where we agreed to add sth like dyn_mask_ext (earlier seq_mask_ext in that discussion) to the DimensionTag.

albertz · 2021-09-14T10:13:08Z

Btw, before you implement some bigger change like this, it would be good to open an issue first where the implementation details are being discussed.

albertz · 2021-09-14T10:22:26Z

One way this would already work without these changes (at least conceptually):

You could do one pass over the given prefix with search disabled (so ChoiceLayer uses the prefix). Then you do a second pass with search enabled, where you init all the hidden states by the final states of the first pass.

So sth (conceptually) like:

_, hidden = decoder(targets=prefix, search=False)
out, _ = decoder(hidden=hidden, search=True)

I write "conceptually" because there are likely a couple of details why this would not work yet. But then we should fix those issues.

I think this is easier and more straight-forward than adding another new option to ChoiceLayer.

Edit However, if you think such prefix option for ChoiceLayer is easier for you, maybe we can still do it. But see my other comments on the code changes regarding that.

albertz · 2021-09-14T10:27:32Z

returnn/tf/layers/rec.py

+        target=self.prefix_target).get_placeholder_as_batch_major()  # (batch * beam,), int32
+
+      # Get prefixes that have already ended (i.e. have a smaller length than the current time step).
+      target_prefix_ended = tf.equal(target_prefix_labels, 0)


This has the hardcoded assumption that EOS=0. I would avoid that. Better rely on the target seq len (using rec step info to check the current pos).

albertz · 2021-09-14T10:32:23Z

returnn/tf/layers/rec.py

-  def get_out_data_from_opts(cls, name, sources, target, network,
-                             beam_size, search=NotSpecified, scheduled_sampling=False, cheating=False, **kwargs):
+  def get_out_data_from_opts(cls, name, sources, target, network, beam_size, search=NotSpecified,
+                             scheduled_sampling=False, cheating=False, prefix_target=None, **kwargs):


Instead of prefix_target, this could be a layer, like prefix. And you could directly refer to the whole sequence. So you would have prefix="base:data:prefix" or so in your config. This would avoid the whole padding logic.

I don't understand. You mean ChoiceLayer gets the whole prefix and selects the label of the current timestep itself? Or would a layer somehow handle the padding for me?

Yea, prefix="base:data:prefix" would be the whole prefix.

albertz · 2021-09-14T11:12:20Z

Btw, also conceptually, what we want in the future is that such things can be written more directly in the config, in a flexible way, without the need to modify RETURNN. One approach would be what I suggested before. But another option is when you write sth like this in the config:

output = cond(
  cond=i < seq_len(prefix),
  true_layer=iter(prefix),
  false_layer=choice(prob))

(I'm using the new proposed syntax from returnn-common here, but just for readability. You can write this in the old-style dict syntax as well if you prefer.)

This uses the IterLayer from #552, which is basically the TensorArray logic but wrapped in a layer, not directly in the RecLayer.

This assumes that CondLayer supports broadcasting, which it does not. But this is needed here to work correctly, because the IterLayer (tensor array read) should only happen if we are still within the prefix. This resolves the padding logic then.

Basically, CondLayer for cond which is not a scalar would behave like:

tf.case(
  [(tf.reduce_all(cond), true_layer),
   (tf.reduce_all(tf.logical_not(cond)), false_layer)],
  default=where_bc(...))

patrick-wilken · 2021-09-14T15:31:04Z

Yep, #552 sounds very related and could improve my padding approach.

I agree, that my implementation is not "modular".

About two decoding passes: Yes, I was aware of that concept, but then I would need to take the graph apart and write a custom search, with two session run calls and feeding/fetching states. I want to avoid that if there is a way around it. Or do you mean having two decoders in the usual search graph? That would mean copying the decoder definition in the config and sharing all the parameters...

About using CondLayer: Yes, this could be a better alternative. As you described, selecting the prefix using layers is possible, even if it is a bit complicated. I wonder about the other things ChoiceLayer does though: Would setting up the search choice dependencies work if the ChoiceLayer output is used only after going through a CondLayer? I guess yes. Would the search scores and beams be set correctly for the prefix timesteps? I think no, currently they would correspond to the n-best hyps and not to the forced prefix. Beams does not matter much, because prefix has no real beam, but we want the correct score. So it is a bit tricky.

Conceptually, what I'm doing is switching between the two "modes" of ChoiceLayer in training vs. search, either provide the target or the predicted label. So maybe it is acceptable to have that behaviour implemented in there, even though it's a separate implementation...

albertz · 2021-09-14T15:46:59Z

About two decoding passes: Yes, I was aware of that concept, but then I would need to take the graph apart and write a custom search, with two session run calls and feeding/fetching states.

No. My example above is just a single session run. Why should it be two session runs? This here would be your normal net dict definition:

_, hidden = decoder(targets=prefix, search=False)
out, _ = decoder(hidden=hidden, search=True)

Or do you mean having two decoders in the usual search graph? That would mean copying the decoder definition in the config and sharing all the parameters...

Yes, this is what I mean. What's the problem with that? It's just a reformulation in the end of what you are doing.

About using CondLayer: Yes, this could be a better alternative. As you described, selecting the prefix using layers is possible, even if it is a bit complicated. I wonder about the other things ChoiceLayer does though: Would setting up the search choice dependencies work if the ChoiceLayer output is used only after going through a CondLayer? I guess yes.

Yes sure, why not? This is really independent of ChoiceLayer also but just how CondLayer works in general.

Would the search scores and beams be set correctly for the prefix timesteps? I think no, currently they would correspond to the n-best hyps and not to the forced prefix. Beams does not matter much, because prefix has no real beam, but we want the correct score. So it is a bit tricky.

Ah yes, you are right. But also this can be solved.

Conceptually, what I'm doing is switching between the two "modes" of ChoiceLayer in training vs. search, either provide the target or the predicted label. So maybe it is acceptable to have that behaviour implemented in there, even though it's a separate implementation...

I don't exactly understand the reasoning. We have three proposed approaches here so far:

Two decoder calls. Logic implemented in user config. Sth like this:

_, hidden = decoder(targets=prefix, search=False)
out, _ = decoder(hidden=hidden, search=True)

Using CondLayer, etc. Logic implemented in user config. Sth like this:

output = cond(
  cond=i < seq_len(prefix),
  true_layer=iter(prefix),
  false_layer=choice(prob))

Extending ChoiceLayer by further logic. Logic implemented in RETURNN.

What you want to do can be done in these three ways (among maybe other ways).

In general, we always prefer if we have RETURNN flexible enough such that the user can implement things in the config. We want to keep RETURNN simple, and not extend more and more. (See e.g. SelfAttentionLayer for another such bad example, and #391 how this was resolved.)

patrick-wilken · 2021-10-14T12:44:50Z

Ok, let me explain here why I opened #714 rather than using the options you proposed.
The two-decoder implementation makes the config unnecessarily complicated. And by config I mean the network dict, which we create and store to a file, not the code creating that dict. Its size would increase by around 50% and would make it even harder to read. On top of that, adding prefix decoding capability to an existing network dict in an automated way is easier if you would just have to adjust the output layers (which are the same for different model architectures!) rather than copying the whole decoder. Also, this solution is very specific for prefixes and cannot be generalized to other search constraints, which we might want to have in the future.

Using a CondLayer is actually what I want to do in combination with #714. The problem with using a CondLayer after a ChoiceLayer is that the beam scores then do not correspond to forced prefix. And if we imagine other search constraints, like only forcing the second token (which we actually have a use case for 😅 ), also the src_beams of ChoiceLayer would correspond to free decoding rather than the forced token. If you wanted to fix those in hindsight you would need a SetBeamScoresLayer and a SetSrcBeamsLayer to overwrite those, which is at least as complicated as dividing ChoiceLayer into multiple layers. So in #714 I want to discuss a way to force the prefix before setting the SearchChoices.

The third option, this PR, is what we'll use in the meantime. 😉 But as I said, I get that extending layers with more special cases is not ideal.

albertz · 2021-10-18T09:56:34Z

The two-decoder implementation makes the config unnecessarily complicated.

Why? In my example above, it basically is one additional line of code.

And by config I mean the network dict, which we create and store to a file, not the code creating that dict.

Why does this matter? You would probably look more at the code to create the layers (creating the dict internally).

The network dict is just an intermediate representation and I expect that we will probably look at it much less once we have the new net construction code (syntax) from returnn-common. This is one of the intentions of the new returnn-common net construction code. Similarly, you probably do not look much at the resulting TF computation graph, I assume. You might only do this for debugging.

Also, this solution is very specific for prefixes and cannot be generalized to other search constraints, which we might want to have in the future.

I'm not arguing against #714 here. Having more flexibility in the search (constraint or whatever else) is good. And surely you can use such flexibility also for prefix decoding if you prefer that solution.

Using a CondLayer is actually what I want to do in combination with #714. The problem with using a CondLayer after a ChoiceLayer is that the beam scores ...

Yes sure, if you want to have the beam scores right, you would need #714.

Although I'm not sure if this is really the right thing in this case. If you force some prefix, you could also argue that it actually should not use the log prob scores of this part. But this is your decision.

The third option, this PR, is what we'll use in the meantime.

Ok but I think we both agree that this is sth we would not want to merge then, to keep the master simple. So maybe we can also close this PR then, and work on #714 to get a better generic solution.

RecLayer: pad certain data keys allowing different length

c74613b

patrick-wilken requested a review from albertz as a code owner September 14, 2021 09:55

albertz reviewed Sep 14, 2021

View reviewed changes

patrick-wilken force-pushed the feature/prefix_decoding branch from 4254590 to d8f1b10 Compare September 14, 2021 10:29

albertz reviewed Sep 14, 2021

View reviewed changes

ChoiceLayer: prefix decoding

e7bbfd2

patrick-wilken force-pushed the feature/prefix_decoding branch from d8f1b10 to e7bbfd2 Compare September 14, 2021 15:32

patrick-wilken mentioned this pull request Sep 14, 2021

RecLayer multiple inputs via explicit unstacking #552

Merged

albertz mentioned this pull request Sep 24, 2021

Rec design for recurrent definitions / loops rwth-i6/returnn_common#16

Closed

patrick-wilken mentioned this pull request Oct 13, 2021

More modular alternative to ChoiceLayer #714

Open

albertz mentioned this pull request May 6, 2022

Explicit search reimplementation rwth-i6/returnn_common#144

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prefix decoding #649

Prefix decoding #649

patrick-wilken commented Sep 14, 2021

albertz commented Sep 14, 2021

albertz commented Sep 14, 2021

albertz commented Sep 14, 2021 •

edited

Loading

albertz Sep 14, 2021

albertz Sep 14, 2021

patrick-wilken Sep 14, 2021

albertz Sep 14, 2021

albertz commented Sep 14, 2021

patrick-wilken commented Sep 14, 2021

albertz commented Sep 14, 2021 •

edited

Loading

patrick-wilken commented Oct 14, 2021 •

edited

Loading

albertz commented Oct 18, 2021

Prefix decoding #649

Are you sure you want to change the base?

Prefix decoding #649

Conversation

patrick-wilken commented Sep 14, 2021

albertz commented Sep 14, 2021

albertz commented Sep 14, 2021

albertz commented Sep 14, 2021 • edited Loading

albertz Sep 14, 2021

Choose a reason for hiding this comment

albertz Sep 14, 2021

Choose a reason for hiding this comment

patrick-wilken Sep 14, 2021

Choose a reason for hiding this comment

albertz Sep 14, 2021

Choose a reason for hiding this comment

albertz commented Sep 14, 2021

patrick-wilken commented Sep 14, 2021

albertz commented Sep 14, 2021 • edited Loading

patrick-wilken commented Oct 14, 2021 • edited Loading

albertz commented Oct 18, 2021

albertz commented Sep 14, 2021 •

edited

Loading

albertz commented Sep 14, 2021 •

edited

Loading

patrick-wilken commented Oct 14, 2021 •

edited

Loading