Prefix decoding #649

Open · wants to merge 2 commits into master

Conversation

patrick-wilken
Contributor

This implements enforcing target tokens in ChoiceLayer that are provided by a separate data stream, until that stream is "used up". After that (different lengths per sequence are possible), search continues as usual. The correct scores are assigned to the prefixes.

This is useful for continuing a target sequence provided by a user, or for enforcing target-side prefix tokens that carry meta-information, e.g. the dialect to translate to.

Getting additional data streams into the recurrent layer that are on the same time axis but have a shorter length is somewhat tricky. I implemented it such that you have to mark them explicitly via a padded_data_keys parameter. The check for equal length is disabled for those data keys; instead, they are padded if necessary.
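
For illustration, a rough sketch of how the options added in this PR could appear in a config (layer names and values are illustrative; the exact placement of padded_data_keys is an assumption):

# Hypothetical ChoiceLayer entry inside the decoder RecLayer unit:
prefix_choice_layer = {
    "class": "choice", "target": "classes", "beam_size": 12,
    "from": "output_prob",      # some probability distribution layer, defined elsewhere
    "prefix_target": "prefix",  # separate data stream whose tokens are forced until used up
}

# The enclosing RecLayer would additionally mark the shorter stream, e.g.
# "padded_data_keys": ["prefix"], so the equal-length check is skipped and it is padded.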

@albertz
Member

albertz commented Sep 14, 2021

We should not introduce another separate padding concept. This was already discussed as part of #391, where we agreed to add sth like dyn_mask_ext (earlier seq_mask_ext in that discussion) to the DimensionTag.

@albertz
Member

albertz commented Sep 14, 2021

Btw, before you implement some bigger change like this, it would be good to open an issue first where the implementation details are being discussed.

@albertz
Member

albertz commented Sep 14, 2021

One way this would already work without these changes (at least conceptually):

You could do one pass over the given prefix with search disabled (so ChoiceLayer uses the prefix). Then you do a second pass with search enabled, where you init all the hidden states by the final states of the first pass.

So sth (conceptually) like:

_, hidden = decoder(targets=prefix, search=False)
out, _ = decoder(hidden=hidden, search=True)

I write "conceptually" because there are likely a couple of details why this would not work yet. But then we should fix those issues.

I think this is easier and more straight-forward than adding another new option to ChoiceLayer.

Edit: However, if you think such a prefix option for ChoiceLayer is easier for you, maybe we can still do it. But see my other comments on the code changes regarding that.

target=self.prefix_target).get_placeholder_as_batch_major() # (batch * beam,), int32

# Get prefixes that have already ended (i.e. have a smaller length than the current time step).
target_prefix_ended = tf.equal(target_prefix_labels, 0)
Member

This has the hardcoded assumption that EOS=0. I would avoid that. Better rely on the target seq len (using rec step info to check the current pos).
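
A minimal plain-TF sketch of that check (tensor names are illustrative, not from the PR; in RETURNN the step index would come from the rec step info):

import tensorflow as tf

t = tf.constant(3)                        # current decoder step, e.g. from rec step info
prefix_seq_lens = tf.constant([5, 2, 4])  # per-sequence prefix lengths, shape (batch,)

# True for sequences whose prefix is already exhausted at step t,
# without assuming any particular EOS label id.
target_prefix_ended = tf.greater_equal(t, prefix_seq_lens)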

-    def get_out_data_from_opts(cls, name, sources, target, network,
-                               beam_size, search=NotSpecified, scheduled_sampling=False, cheating=False, **kwargs):
+    def get_out_data_from_opts(cls, name, sources, target, network, beam_size, search=NotSpecified,
+                               scheduled_sampling=False, cheating=False, prefix_target=None, **kwargs):
Member

Instead of prefix_target, this could be a layer, like prefix. And you could directly refer to the whole sequence. So you would have prefix="base:data:prefix" or so in your config. This would avoid the whole padding logic.

Contributor Author

I don't understand. You mean ChoiceLayer gets the whole prefix and selects the label of the current timestep itself? Or would a layer somehow handle the padding for me?

Member

Yea, prefix="base:data:prefix" would be the whole prefix.
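
For illustration, a hypothetical net-dict fragment for this variant (the prefix option does not exist yet; key names are assumptions):

choice_with_prefix = {
    "class": "choice", "target": "classes", "beam_size": 12,
    "from": "output_prob",         # some probability distribution layer, defined elsewhere
    "prefix": "base:data:prefix",  # the whole prefix sequence, referenced from outside the loop
}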

@albertz
Member

albertz commented Sep 14, 2021

Btw, also conceptually, what we want in the future is that such things can be written more directly in the config, in a flexible way, without the need to modify RETURNN. One approach would be what I suggested before. But another option would be to write sth like this in the config:

output = cond(
  cond=i < seq_len(prefix),
  true_layer=iter(prefix),
  false_layer=choice(prob))

(I'm using the new proposed syntax from returnn-common here, but just for readability. You can write this in the old-style dict syntax as well if you prefer.)

This uses the IterLayer from #552, which is basically the TensorArray logic but wrapped in a layer, not directly in the RecLayer.

This assumes that CondLayer supports broadcasting, which it currently does not. But this is needed here for it to work correctly, because the IterLayer (tensor array read) should only happen if we are still within the prefix. This then avoids the need for the padding logic.

Basically, CondLayer for a cond which is not a scalar would behave like:

tf.case(
  [(tf.reduce_all(cond), true_layer),
   (tf.reduce_all(tf.logical_not(cond)), false_layer)],
  default=where_bc(...))
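
For illustration, a minimal plain-TF sketch of that non-scalar fallback as an element-wise selection (values are illustrative only):

import tensorflow as tf

cond = tf.constant([True, False, True])  # per-sequence condition, e.g. i < seq_len(prefix)
true_vals = tf.constant([4, 4, 4])       # e.g. prefix labels read by the IterLayer branch
false_vals = tf.constant([7, 8, 9])      # e.g. labels chosen by the ChoiceLayer branch

out = tf.where(cond, true_vals, false_vals)  # -> [4, 8, 4], selected per sequence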

@patrick-wilken
Contributor Author

Yep, #552 sounds very related and could improve my padding approach.

I agree that my implementation is not "modular".

About two decoding passes: Yes, I was aware of that concept, but then I would need to take the graph apart and write a custom search, with two session run calls and feeding/fetching states. I want to avoid that if there is a way around it. Or do you mean having two decoders in the usual search graph? That would mean copying the decoder definition in the config and sharing all the parameters...

About using CondLayer: Yes, this could be a better alternative. As you described, selecting the prefix using layers is possible, even if it is a bit complicated. I wonder about the other things ChoiceLayer does, though: Would setting up the search choice dependencies work if the ChoiceLayer output is used only after going through a CondLayer? I guess yes. Would the search scores and beams be set correctly for the prefix timesteps? I think not; currently they would correspond to the n-best hyps and not to the forced prefix. The beams do not matter much, because the prefix has no real beam, but we want the correct score. So it is a bit tricky.

Conceptually, what I'm doing is switching between the two "modes" ChoiceLayer has in training vs. search: either provide the target or the predicted label. So maybe it is acceptable to have that behaviour implemented in there, even though it's a separate implementation...

@albertz
Member

albertz commented Sep 14, 2021

About two decoding passes: Yes, I was aware of that concept, but then I would need to take the graph apart and write a custom search, with two session run calls and feeding/fetching states.

No. My example above is just a single session run. Why should it be two session runs? This here would be your normal net dict definition:

_, hidden = decoder(targets=prefix, search=False)
out, _ = decoder(hidden=hidden, search=True)

Or do you mean having two decoders in the usual search graph? That would mean copying the decoder definition in the config and sharing all the parameters...

Yes, this is what I mean. What's the problem with that? In the end it's just a reformulation of what you are doing.

About using CondLayer: Yes, this could be a better alternative. As you described, selecting the prefix using layers is possible, even if it is a bit complicated. I wonder about the other things ChoiceLayer does though: Would setting up the search choice dependencies work if the ChoiceLayer output is used only after going through a CondLayer? I guess yes.

Yes, sure, why not? This is really independent of ChoiceLayer; it is just how CondLayer works in general.

Would the search scores and beams be set correctly for the prefix timesteps? I think no, currently they would correspond to the n-best hyps and not to the forced prefix. Beams does not matter much, because prefix has no real beam, but we want the correct score. So it is a bit tricky.

Ah yes, you are right. But also this can be solved.

Conceptually, what I'm doing is switching between the two "modes" of ChoiceLayer in training vs. search, either provide the target or the predicted label. So maybe it is acceptable to have that behaviour implemented in there, even though it's a separate implementation...

I don't exactly understand the reasoning. We have three proposed approaches here so far:

  • Two decoder calls. Logic implemented in user config. Sth like this:
    _, hidden = decoder(targets=prefix, search=False)
    out, _ = decoder(hidden=hidden, search=True)
    
  • Using CondLayer, etc. Logic implemented in user config. Sth like this:
    output = cond(
      cond=i < seq_len(prefix),
      true_layer=iter(prefix),
      false_layer=choice(prob))
    
  • Extending ChoiceLayer by further logic. Logic implemented in RETURNN.

What you want to do can be done in these three ways (among maybe other ways).

In general, we always prefer it if RETURNN is flexible enough that the user can implement such things in the config. We want to keep RETURNN simple, and not extend it more and more. (See e.g. SelfAttentionLayer for another such bad example, and #391 for how this was resolved.)

@patrick-wilken
Contributor Author

patrick-wilken commented Oct 14, 2021

Ok, let me explain here why I opened #714 rather than using the options you proposed.
The two-decoder implementation makes the config unnecessarily complicated. And by config I mean the network dict, which we create and store to a file, not the code creating that dict. Its size would increase by around 50%, which would make it even harder to read. On top of that, adding prefix-decoding capability to an existing network dict in an automated way is easier if you only have to adjust the output layers (which are the same across different model architectures!) rather than copy the whole decoder. Also, this solution is very specific to prefixes and cannot be generalized to other search constraints, which we might want to have in the future.

Using a CondLayer is actually what I want to do, in combination with #714. The problem with using a CondLayer after a ChoiceLayer is that the beam scores then do not correspond to the forced prefix. And if we imagine other search constraints, like forcing only the second token (which we actually have a use case for 😅 ), also the src_beams of ChoiceLayer would correspond to free decoding rather than to the forced token. If you wanted to fix those in hindsight, you would need a SetBeamScoresLayer and a SetSrcBeamsLayer to overwrite them, which is at least as complicated as dividing ChoiceLayer into multiple layers. So in #714 I want to discuss a way to force the prefix before setting the SearchChoices.

The third option, this PR, is what we'll use in the meantime. 😉 But as I said, I get that extending layers with more special cases is not ideal.

@albertz
Member

albertz commented Oct 18, 2021

The two-decoder implementation makes the config unnecessarily complicated.

Why? In my example above, it basically is one additional line of code.

And by config I mean the network dict, which we create and store to a file, not the code creating that dict.

Why does this matter? You would probably look more at the code that creates the layers (which creates the dict internally).

The network dict is just an intermediate representation and I expect that we will probably look at it much less once we have the new net construction code (syntax) from returnn-common. This is one of the intentions of the new returnn-common net construction code. Similarly, you probably do not look much at the resulting TF computation graph, I assume. You might only do this for debugging.

Also, this solution is very specific for prefixes and cannot be generalized to other search constraints, which we might want to have in the future.

I'm not arguing against #714 here. Having more flexibility in the search (constraint or whatever else) is good. And surely you can use such flexibility also for prefix decoding if you prefer that solution.

Using a CondLayer is actually what I want to do in combination with #714. The problem with using a CondLayer after a ChoiceLayer is that the beam scores ...

Yes sure, if you want to have the beam scores right, you would need #714.

Although I'm not sure if this is really the right thing in this case. If you force some prefix, you could also argue that it actually should not use the log prob scores of this part. But this is your decision.

The third option, this PR, is what we'll use in the meantime.

Ok but I think we both agree that this is sth we would not want to merge then, to keep the master simple. So maybe we can also close this PR then, and work on #714 to get a better generic solution.
