More modular alternative to ChoiceLayer #714
Comments
One would also have to decide where the target label, that …
But in #649, my argument was actually not to generalize or extend …
No, it doesn't necessarily need to be hidden. On the contrary: I like it when things are explicit. However, what we definitely want is that it is decoupled and modular. Whatever you do on the beam, any layers not specifically handling the beam (just operating on a batch dim) should just operate as before. These are basically all layers except for the ones that explicitly handle the beam.

So, on operating on the beam, we already have these: …

Note that …
So, maybe we can have those operations decoupled. Maybe like: …

This is still not as generic as possible, and could maybe be split further: …
So the prune step could optionally use some alternative prune score instead of the beam score for the top-k selection. The prune step could also use a prune-score threshold instead of just a fixed beam size; then the beam size becomes dynamic. This might not be optimal for batched beam search though, when the dynamic sizes can vary a lot.

There could be filtering after the expand step, e.g. to restrict possible char, subword or word sequences to some grammar, or to restrict a phone sequence to some lexicon. This filtering could simply set the beam scores to -inf.

Such filtering after the expand step can also be used for recombination, where you combine hypotheses. E.g. when the model only has a fixed label context (e.g. the last 3 words), you can recombine new hypotheses by taking either the sum or the max of the partial probabilities. Then it would set this new value on the argmax hypothesis, and set all others to -inf. We currently are doing this via …

Beam scores could be more complex, e.g. having multiple scores for each hypothesis, not just a single one, and then doing something more complex to calculate the prune score. E.g. maybe you want to keep the acoustic score and an external language model score separate.

What you suggest is also somewhat similar. It's also making it more explicit.
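To make that recombination-as-masking idea concrete, here is a minimal sketch in plain NumPy (this is not RETURNN code; the function and parameter names are made up for illustration), assuming hypotheses that share the same truncated label context may be merged:

```python
import numpy as np

def recombine(expanded_scores: np.ndarray, context_ids: np.ndarray, use_sum: bool = False) -> np.ndarray:
    """
    :param expanded_scores: shape (num_hyps,), log-space scores after the expand step
    :param context_ids: shape (num_hyps,), same id means same (e.g. last-3-word) label context
    :param use_sum: combine by (naive) log-sum-exp instead of keeping the max
    :return: scores where, per context, only the best hypothesis survives and all others are -inf
    """
    out = np.full_like(expanded_scores, -np.inf)
    for ctx in np.unique(context_ids):
        idx = np.where(context_ids == ctx)[0]
        best = idx[np.argmax(expanded_scores[idx])]
        if use_sum:
            out[best] = np.log(np.sum(np.exp(expanded_scores[idx])))  # sum of partial probs
        else:
            out[best] = expanded_scores[best]  # max of partial probs
    return out
```

The prune (top-k) step afterwards would then operate on these masked scores as usual.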
Btw, on having the beam be part of the batch dim: this was a very simple way to make sure all other layers can just operate as normal without any modification. In principle though, all layers should be generic enough to accept any kind of input, with any kind of dimensions, and only operate on those which are relevant for the layer. E.g. … So, at some point, when we can really trust all layers to behave this way, we can also explicitly add the beam as a separate dimension. This might make some parts cleaner. But we are not there yet. We definitely need #597 first, and then probably some more, e.g. #573.
Ah, I was actually wondering what this is used for.
Ok, constructing the new beam can be very explicit. Also, accumulated beam scores could be calculated explicitly by layers, I guess, but should this be required then? Which would mean dropping …
As usual, we should not break old configs. So …
This would be part of the pruning. So for …
Another open question: One thing which is nice about … When we split up … Or would those atomic layers (…
@michelwi @jvhoffbauer you are probably also interested in this? You make heavy use of …
#649 is currently pending because we don't want to extend ChoiceLayer with even more special cases.
Quote from #649 (comment)
So I thought a bit about how this could be done for ChoiceLayer. It implements beam pruning and sets SearchChoices, which are used for beam score accumulation and backtracking, so "extending" it would mean we want to implement an alternative way to select the beam entries, and/or an alternative way to calculate the beam scores. Prefix decoding from #649 is one example; other examples are things already implemented as special cases in ChoiceLayer: cheating, sampling (for inference), scheduled sampling, etc.

An important difference to #391 is that here we manipulate the beam, and we want to hide that from the user/network definition as much as possible. So for example, (I assume) we don't want a layer that explicitly calculates the accumulated beam scores. However, to implement the features mentioned above we have to operate on the beam dimension to some degree, which normally is not touched by the layers.
What I came up with so far to re-implement the standard functionality of ChoiceLayer is:

- BeamPruneIndicesLayer (naming is hard... 😅), which gets the scores for the current step via the source layer, accesses the accumulated beam scores via get_search_choices().beam_scores, and calculates the top-k combined scores, but now in contrast to ChoiceLayer does not set SearchChoices itself; instead it has an output of shape (batch, beam_size, 2) which contains tuples (src_beam, label), so it only returns the indices needed to gather the new beam.
- ConstructBeamLayer (or maybe SetSearchChoicesLayer?), which is the layer that owns the SearchChoices. It gets the output of BeamPruneIndicesLayer and also the scores as input layers and sets the beam scores and src_beams of the SearchChoices according to its inputs. The output would be the new beam of labels.

Custom functionality can then be implemented by manipulating the scores and beam indices before feeding them into the ConstructBeamLayer.
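For illustration only, here is a rough NumPy sketch (not RETURNN code; the names and exact shapes are just assumptions based on the description above) of the computation such a BeamPruneIndicesLayer would perform: combine the accumulated beam scores with the per-step scores, take the top-k over the flattened (beam, label) space, and return the (src_beam, label) index pairs:

```python
import numpy as np

def beam_prune_indices(step_scores: np.ndarray, beam_scores: np.ndarray, beam_size: int):
    """
    :param step_scores: shape (batch, in_beam, dim), log-space scores of the current step
    :param beam_scores: shape (batch, in_beam), accumulated beam scores
    :param beam_size: size of the new (pruned) beam
    :return: indices of shape (batch, beam_size, 2) with (src_beam, label) tuples,
             and the corresponding combined scores of shape (batch, beam_size)
    """
    batch, in_beam, dim = step_scores.shape
    combined = step_scores + beam_scores[:, :, None]   # (batch, in_beam, dim)
    flat = combined.reshape(batch, in_beam * dim)       # flatten (beam, label)
    top = np.argsort(-flat, axis=1)[:, :beam_size]       # top-k flat indices per batch entry
    src_beam, label = np.divmod(top, dim)                 # unflatten back to (src_beam, label)
    indices = np.stack([src_beam, label], axis=-1)         # (batch, beam_size, 2)
    scores = np.take_along_axis(flat, top, axis=1)          # (batch, beam_size)
    return indices, scores
```

A ConstructBeamLayer would then take these indices (and the scores) and set the src_beams and beam scores on the SearchChoices it owns.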
For prefix decoding, for example, the beam indices from BeamPruneIndicesLayer would first go through a SwitchLayer that has the prefix labels as a second input (extended with src_beam=0), and the condition would be whether the prefix has ended.
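Purely as a hypothetical sketch of how that wiring could look in a network definition (the layer classes "beam_prune_indices" and "construct_beam" do not exist; "prefix_ended" and "prefix_labels" are assumed to be defined elsewhere in the recurrent unit), it might be roughly:

```python
# Hypothetical net-dict fragment; only "switch" is an existing RETURNN layer class,
# the rest are the proposed layers / assumed helper layers from this issue.
unit = {
    "prune_indices": {"class": "beam_prune_indices", "from": "output_prob", "beam_size": 12},
    "chosen_indices": {
        "class": "switch",
        "condition": "prefix_ended",    # whether the prefix has ended at this step
        "true_from": "prune_indices",   # normal search: use the pruned (src_beam, label) indices
        "false_from": "prefix_labels",  # still in the prefix: forced labels, extended with src_beam=0
    },
    "output": {"class": "construct_beam", "from": ["chosen_indices", "output_prob"]},
}
```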
For cheating, one could replace the last entry in the BeamPruneIndicesLayer output with (beam_size - 1, golden_label), etc.
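And a small illustrative sketch of that cheating manipulation on the indices (again plain NumPy, not RETURNN code; the function name is made up):

```python
import numpy as np

def apply_cheating(indices: np.ndarray, golden_label: np.ndarray) -> np.ndarray:
    """
    :param indices: shape (batch, beam_size, 2) with (src_beam, label) from the prune step
    :param golden_label: shape (batch,), the ground-truth label of the current step
    :return: indices where the last beam entry is forced to (beam_size - 1, golden_label)
    """
    out = indices.copy()
    out[:, -1, 0] = indices.shape[1] - 1  # src_beam = beam_size - 1
    out[:, -1, 1] = golden_label          # force the golden label
    return out
```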
Note that the output of BeamPruneIndicesLayer has no beam; instead, the second dimension contains a kind of preliminary beam that is treated as a feature dimension. This might be pretty unintuitive. An alternative which keeps the beam as part of the batch dimension would be to create zeros of shape (batch * beam, dim) (same as the input scores) and then mark the positions of the top-k scores (inside the hidden beam dim) with integers from 1 to beam_size. But this is much less efficient and probably not really more intuitive.

Would something like that be worth implementing?