Prefix decoding #649
base: master
Conversation
We should not introduce another separate padding concept. This was already discussed as part of #391, where we agreed to add something like …
Btw, before you implement some bigger change like this, it would be good to open an issue first where the implementation details can be discussed.
One way this would already work without these changes (at least conceptually): you could do one pass over the given prefix with search disabled (so the prefix labels are just fed as given, like ground truth), and then continue with a second pass with search enabled. So something (conceptually) like:
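A rough sketch of what such a two-pass setup could look like (not the snippet referred to in this comment; layer names, options, and how parameters and state are shared between the two passes are assumptions):

```python
def decoder_unit(search):
  """Recurrent decoder unit; `search` toggles the search mode of the ChoiceLayer."""
  return {
    "embed": {"class": "linear", "activation": None, "n_out": 621, "from": "prev:output"},
    "s": {"class": "rnn_cell", "unit": "LSTMBlock", "n_out": 1000, "from": "embed"},
    "output_prob": {"class": "softmax", "from": "s", "target": "classes"},
    "output": {"class": "choice", "from": "output_prob", "beam_size": 12,
               "target": "classes", "search": search, "initial_output": 0},
  }

network = {
  # Pass 1: run over the given prefix with search disabled, i.e. the prefix labels
  # are used as ground truth ("prefix" as an extra data key is an assumption).
  "prefix_decoder": {"class": "rec", "from": [], "unit": decoder_unit(search=False),
                     "target": "prefix"},
  # Pass 2: continue with search enabled, sharing the parameters of pass 1 and
  # starting from its final state (glossed over here; this is the part that would
  # need the most care in a real config).
  "decoder": {"class": "rec", "from": [], "unit": decoder_unit(search=True),
              "target": "classes", "reuse_params": "prefix_decoder",
              "initial_state": "prefix_decoder"},
  "output": {"class": "copy", "from": "decoder"},
}
```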
I write "conceptually" because there are likely a couple of details why this would not work yet. But then we should fix those issues. I think this is easier and more straight-forward than adding another new option to Edit However, if you think such |
```python
target=self.prefix_target).get_placeholder_as_batch_major()  # (batch * beam,), int32

# Get prefixes that have already ended (i.e. have a smaller length than the current time step).
target_prefix_ended = tf.equal(target_prefix_labels, 0)
```
This has the hardcoded assumption that EOS=0. I would avoid that. Better rely on the target seq len (using rec step info to check the current pos).
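A minimal sketch of such a check, assuming the prefix sequence lengths and the current rec step are available as tensors (the names here are made up, not from the PR):

```python
import tensorflow as tf

def prefix_ended(prefix_seq_lens, current_step):
  """Whether each prefix is already used up, without assuming a particular EOS id.

  :param tf.Tensor prefix_seq_lens: (batch*beam,) int32, per-sequence prefix length
  :param tf.Tensor current_step: scalar int32, current position inside the rec loop
    (in RETURNN this would come from the rec step info)
  :return: (batch*beam,) bool, True where the prefix has ended
  """
  return tf.greater_equal(current_step, prefix_seq_lens)
```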
Force-pushed from 4254590 to d8f1b10.
```diff
- def get_out_data_from_opts(cls, name, sources, target, network,
-                            beam_size, search=NotSpecified, scheduled_sampling=False, cheating=False, **kwargs):
+ def get_out_data_from_opts(cls, name, sources, target, network, beam_size, search=NotSpecified,
+                            scheduled_sampling=False, cheating=False, prefix_target=None, **kwargs):
```
Instead of prefix_target, this could be a layer, like prefix. And you could directly refer to the whole sequence. So you would have prefix="base:data:prefix" or so in your config. This would avoid the whole padding logic.
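Hypothetically, that could look like the snippet below (such a prefix option does not exist; this only illustrates the suggestion, and the data key name is an assumption):

```python
"output": {
  "class": "choice", "from": "output_prob", "beam_size": 12, "target": "classes",
  # Refer directly to the whole prefix sequence from a separate data stream,
  # so no per-step padding logic is needed inside the rec loop.
  "prefix": "base:data:prefix",
},
```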
I don't understand. You mean ChoiceLayer gets the whole prefix and selects the label of the current timestep itself? Or would a layer somehow handle the padding for me?
Yea, prefix="base:data:prefix" would be the whole prefix.
Btw, also conceptually, what we want in the future is that such things can be written more directly in the config, in a flexible way, without the need to modify RETURNN. One approach would be what I suggested before. But another option is when you write something like this in the config:
(I'm using the new proposed syntax from returnn-common here, but just for readability. You can write this in the old-style dict syntax as well if you prefer.) This uses the … This assumes that … Basically, …
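A rough old-style dict sketch of the same per-step idea, inside the rec unit (not the snippet referred to above; the layer options are from memory and the prefix data key is an assumption):

```python
# While the current step is still inside the prefix, take the prefix label at that
# position; otherwise take the label chosen by the ChoiceLayer. "base:data:prefix"
# as a separate data stream is an assumption of this sketch.
"prefix_len": {"class": "length", "from": "base:data:prefix"},
"in_prefix": {"class": "compare", "from": [":i", "prefix_len"], "kind": "less"},
"prefix_label": {"class": "gather", "from": "base:data:prefix", "position": ":i", "axis": "T"},
"choice": {"class": "choice", "from": "output_prob", "beam_size": 12,
           "target": "classes", "initial_output": 0},
"output": {"class": "switch", "condition": "in_prefix",
           "true_from": "prefix_label", "false_from": "choice"},
```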
Yep, #552 sounds very related and could improve my padding approach. I agree that my implementation is not "modular".

About two decoding passes: yes, I was aware of that concept, but then I would need to take the graph apart and write a custom search, with two session run calls and feeding/fetching states. I want to avoid that if there is a way around it. Or do you mean having two decoders in the usual search graph? That would mean copying the decoder definition in the config and sharing all the parameters...

About using …

Conceptually, what I'm doing is switching between the two "modes" of ChoiceLayer (taking the given target vs. searching).
Force-pushed from d8f1b10 to e7bbfd2.
No. My example above is just a single session run. Why should it be two session runs? This here would be your normal net dict definition, essentially the two-decoder setup sketched above.
Yes, this is what I mean. What's the problem with that? In the end it's just a reformulation of what you are doing.
Yes sure, why not? This is really independent of …
Ah yes, you are right. But this can also be solved.
I don't exactly understand the reasoning. We have three proposed approaches here so far:
- this PR, i.e. a new prefix_target option in ChoiceLayer together with the padding logic,
- two decoding passes in the config (first over the prefix with search disabled, then continuing with search),
- selecting per step between the prefix label and the searched label directly in the config.
What you want to do can be done in these three ways (among maybe other ways). In general, we always prefer to have RETURNN flexible enough such that the user can implement things in the config. We want to keep RETURNN simple, and not extend it more and more. (See e.g. …)
Ok, let me explain here why I opened #714 rather than using the options you proposed. Using a … The third option, this PR, is what we'll use in the meantime. 😉 But as I said, I get that extending layers with more special cases is not ideal.
Why? In my example above, it basically is one additional line of code.
Why does this matter? You would probably look more at the code to create the layers (creating the dict internally). The network dict is just an intermediate representation and I expect that we will probably look at it much less once we have the new net construction code (syntax) from returnn-common. This is one of the intentions of the new returnn-common net construction code. Similarly, you probably do not look much at the resulting TF computation graph, I assume. You might only do this for debugging.
I'm not arguing against #714 here. Having more flexibility in the search (constraint or whatever else) is good. And surely you can use such flexibility also for prefix decoding if you prefer that solution.
Yes sure, if you want to have the beam scores right, you would need #714. Although I'm not sure if this is really the right thing in this case. If you force some prefix, you could also argue that it actually should not use the log prob scores of this part. But this is your decision.
Ok, but I think we both agree that this is something we would not want to merge then, to keep the master simple. So maybe we can close this PR and work on #714 to get a better generic solution.
This implements enforcing target tokens in ChoiceLayer, provided by a separate data stream, until the prefix is "used up". After that (different lengths per sequence are possible), search continues as usual. The correct scores are assigned to the prefixes.
Useful to continue a target sequence provided by a user, or to enforce target-side prefix tokens that provide meta-information, e.g. the dialect to translate to.
Getting additional data streams into the recurrent layer that are on the same time axis but have a shorter length is somewhat tricky. I implemented it such that you have to mark them explicitly via a padded_data_keys parameter. The check for equal length is disabled for those data keys; instead, they are padded if necessary.
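A hypothetical config excerpt illustrating the usage described above (the option names follow this description, but the exact placement of padded_data_keys and the data key names are assumptions):

```python
network = {
  # ...
  "decoder": {
    "class": "rec", "from": [],
    # The prefix stream shares the target time axis but can be shorter per sequence;
    # marking it here disables the equal-length check and pads it if necessary.
    "padded_data_keys": ["target_prefix"],
    "unit": {
      # ... the usual decoder layers ...
      "output": {
        "class": "choice", "from": "output_prob", "beam_size": 12, "target": "classes",
        # Force labels from this data stream until each sequence's prefix is used up,
        # then continue with regular beam search.
        "prefix_target": "target_prefix",
      },
    },
  },
}
```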