Beam search decoding during inference doesn't generate good text. #265

Open
fabrahman opened this issue Dec 29, 2019 · 4 comments
Labels
question Further information is requested

Comments

@fabrahman

Hi,

I have trained a model using reinforcement learning.
When I use beam search to generate text, every generation looks like

"raeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraera"

However, when I use greedy decoding or top-k sampling, the generation looks like:

Sam was watching a movie. He was very focused on the action. He fell asleep. Sam's glasses fell off his face <|endoftext|>eraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraeraera

I used tx.utils.strip_eos to strip everything after <|endoftext|>.

1- I am not sure why beam search is behaving this way; I would appreciate your help. The following is my code for decoding with beam search:

    def _infer_beam_ids(context_name):
        # Beam-search decoding conditioned on the context ids.
        predictions = decoder(
            beam_width=10,
            length_penalty=config_train.length_penalty,
            embedding=_embedding_fn,
            context=batch['%s_ids' % context_name],
            context_sequence_length=batch['%s_len' % context_name],
            max_decoding_length=max_decoding_length,
            end_token=end_token,
            mode=tf.estimator.ModeKeys.PREDICT)

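        # Keep beam 0 (assumed here to be the highest-scoring hypothesis)
        # and roll the context prefix off the front of each sequence.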
        beam_output_ids = tx.utils.varlength_roll(
            predictions["sample_id"][:, :, 0],
            -batch['%s_len' % context_name],
            axis=1)

        return beam_output_ids

    beam_search_ids = _infer_beam_ids('x1')

2- Is it better to use beam search at inference for a model that is trained in a self-critical fashion?

I would appreciate it if you could help me with these.

@gpengzhi added the question label Jan 2, 2020
@fabrahman
Author

fabrahman commented Jan 7, 2020

Hi,
Does anyone have a thought on this?
In another experiment, I used my trained model to generate with beam search, and it produced the same output for different inputs. It is also weird that the greedy result is good but the beam-search one is not.
Am I correctly calling the beam decoding method?

@jchwenger

That is in fact a feature of beam search; see this discussion, this implementation, and this paper! Temperature-based random sampling and/or top_p (nucleus) sampling are, in my experience, always preferable to beam search.

The root cause of the failure of beam search is that 1) a repetitive sequence will have a higher probability than any other, since the more you repeat, the more likely the next token becomes (from the perspective of the network), and so it will be chosen by beam search; and 2) if you ask a network to assign probabilities to human text, the distribution is actually highly irregular (not the most likely sentence, but a stream where some steps are extremely likely and others extremely random). Lovely graphs and explanations in the paper!
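
To make the sampling suggestion concrete, here is a minimal NumPy sketch of one temperature + nucleus decoding step (illustrative only, not the Texar API; the helper name and parameter values are my own):

    import numpy as np

    def sample_top_p(logits, temperature=0.9, top_p=0.9, rng=None):
        # Hypothetical helper for illustration; not part of Texar.
        # Temperature flattens (>1) or sharpens (<1) the distribution,
        # then the nucleus keeps the smallest set of tokens whose
        # cumulative probability exceeds top_p.
        rng = rng or np.random.default_rng()
        logits = np.asarray(logits, dtype=np.float64) / temperature
        probs = np.exp(logits - logits.max())            # stable softmax
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]                  # descending prob
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep = order[:cutoff]
        return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

    print(sample_top_p([4.0, 3.5, 0.1, -2.0]))  # samples token 0 or 1 here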

@fabrahman
Author

@jchwenger Thanks for your reply. I understand, and I agree that sampling methods work much better. But the behavior I reported here is not expected from beam search: it did not generate anything meaningful at all. Besides, it generates the same thing for every input.
Also, greedy decoding is working fine, so isn't it weird that beam search cannot generate anything?
I am thinking maybe there is some issue with the way I am calling it, or with the implementation.
Also, I have heard and seen in many papers that when using self-critical reinforcement learning, it is better to use beam search at inference.

@jchwenger

My pleasure! From the network's perspective, "meaningful" is not particularly relevant. If it is a character-level or BPE model, this repetition of characters over and over might still be the output with the highest probability from the model's perspective, and that, out of all possible outputs of the network, is what beam search will attempt to pick. Beyond that, however, and how beam search is successfully used in other papers, I'm afraid I can't help you.
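
As a back-of-the-envelope illustration of that point (the per-token probabilities below are invented): once a loop is locked in, a sequence of consistently likely repeated tokens accumulates far more log-probability than human-like text whose steps mix very likely and very unlikely tokens, and log-probability is exactly what beam search maximizes:

    import math

    loop = [0.9] * 20          # "rae rae rae ..." once the loop is locked in
    human = [0.9, 0.05] * 10   # likely and unlikely steps interleaved

    log_p = lambda seq: sum(math.log(p) for p in seq)
    print(log_p(loop), log_p(human))   # ~ -2.1 vs ~ -31.0: the loop wins easily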
