
Question about self-BLEU implementation #27

Open
weilinie opened this issue Nov 16, 2018 · 2 comments

@weilinie

As far as I know, the basic idea of self-BLEU is to compute a BLEU score for each sentence in the set of generated sentences, taking that sentence as the hypothesis and all the other generated sentences as references, and then to average the BLEU scores over all generated sentences.
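
For concreteness, here is a minimal leave-one-out sketch of that definition. It is not the Texygen code; it just uses nltk's `sentence_bleu`, and the n-gram order and smoothing choices below are my own assumptions:

```python
# Minimal leave-one-out self-BLEU sketch (NOT the Texygen implementation):
# each generated sentence is the hypothesis, all the others are references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generated, max_ngram=3):
    """generated: list of tokenized sentences (each a list of tokens)."""
    weights = tuple(1.0 / max_ngram for _ in range(max_ngram))
    smoothing = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(generated):
        # all sentences except the current one serve as references
        references = generated[:i] + generated[i + 1:]
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=weights,
                                    smoothing_function=smoothing))
    return sum(scores) / len(scores)
```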

However, when looking into the implementation of self-BLEU (https://github.com/geek-ai/Texygen/blob/master/utils/metrics/SelfBleu.py), I found an issue in how self-BLEU is evaluated over training: only on the first evaluation do the reference and the hypothesis come from the same "test data" (i.e., the set of generated sentences). After that, the hypothesis keeps being updated while the reference stays unchanged (because is_first is set to False), so the hypothesis and the reference no longer come from the same "test data", and the scores produced by this implementation are therefore not self-BLEU scores.

To check this, I modified the implementation so that the hypothesis and the reference always come from the same "test data" (by simply removing the variables "self.reference" and "self.is_first"), and found that the self-BLEU (2-5) scores are then always 1 for all the models evaluated.
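
If my reading is right, this is probably because the hypothesis sentence then stays inside its own reference list, which makes BLEU trivially perfect. A toy check (again just nltk, with made-up sentences) illustrates the effect:

```python
from nltk.translate.bleu_score import sentence_bleu

generated = [["the", "cat", "sat", "down"],
             ["a", "dog", "ran", "away"],
             ["the", "bird", "flew", "off"]]
hypothesis = generated[0]
references = generated  # the hypothesis is still among the references
print(sentence_bleu(references, hypothesis, weights=(0.5, 0.5)))  # -> 1.0
```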

Please let me know whether my concern makes sense, or whether I have just misunderstood the definition of self-BLEU.

@MichaelZhouwang

I also found this problem: when the test data and the reference data are the same, self-BLEU is always 1. However, many papers in this domain use it as a diversity metric, which is quite misleading. I think it only measures the extent to which the samples generated with GAN training differ from the MLE training samples, and lower is not necessarily better. What do you think about the Forward-Backward BLEU metric used in the "Toward Diverse Text Generation with Inverse Reinforcement Learning" paper?

@weilinie
Author

Fully agree! Thank you for the follow-up. As for the Forward-Backward BLEU metric you mentioned, I haven't tried it. But since it is based on (self-)BLEU scores, I suspect the issues we observed exist there as well.
