Dictionary entries with `#fairseq:overwrite` are not preserved in dict.txt output from fairseq-preprocess #3705

nelson-liu · 2021-07-11T08:04:03Z

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

1. Run fairseq-preprocess with a --srcdict whose entries carry the #fairseq:overwrite flag. For example, use the command in the RoBERTa pretraining tutorial: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md#1-preprocess-the-data
2. Look at the dictionary that fairseq-preprocess writes out, and see that #fairseq:overwrite is not preserved in dict.txt.

Expected behavior

When using --srcdict, the output dict.txt should be exactly the same as the dictionary passed in to fairseq-preprocess.
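
One way to check this expectation is to compare which entries carry the flag in the input dictionary and in the written dict.txt. The sketch below is only an illustration: the helper names (`flagged_symbols`, `lost_overwrite_flags`) are hypothetical and not part of fairseq, and it assumes each dictionary line has the form `<symbol> <count>` with an optional trailing `#fairseq:overwrite`. The sample lines are made up for the example, not taken from a real run.

```python
OVERWRITE_FLAG = "#fairseq:overwrite"


def flagged_symbols(lines):
    """Return the set of symbols whose line ends with the overwrite flag."""
    flagged = set()
    for line in lines:
        fields = line.rstrip().split(" ")
        # Assumed layout: symbol, count, optional flag as the last field.
        if len(fields) >= 3 and fields[-1] == OVERWRITE_FLAG:
            flagged.add(fields[0])
    return flagged


def lost_overwrite_flags(src_lines, out_lines):
    """List symbols flagged in the source dictionary but not in the output."""
    return sorted(flagged_symbols(src_lines) - flagged_symbols(out_lines))


# Made-up sample data standing in for the --srcdict file and the output dict.txt.
src = ["<s> 1 #fairseq:overwrite", "hello 10", "<pad> 1 #fairseq:overwrite"]
out = ["hello 10"]

lost = lost_overwrite_flags(src, out)
print(lost)
```

With real files, the two lists would come from reading the --srcdict file and the dict.txt in the destination directory line by line.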

Environment

fairseq version: 0.10.2
PyTorch version: 1.9.0
OS: Linux
How fairseq was installed: pip


lydianish · 2023-09-20T12:55:26Z

Hi @nelson-liu,

I think the bug comes from the add_symbol function in the fairseq/fairseq/data/dictionary.py file:

def add_symbol(self, word, n=1, overwrite=False):
    """Adds a word to the dictionary"""
    if word in self.indices and not overwrite:
        idx = self.indices[word]
        self.count[idx] = self.count[idx] + n
        return idx
    else:
        idx = len(self.symbols)
        self.indices[word] = idx
        self.symbols.append(word)
        self.count.append(n)
        return idx

The condition should be changed to `if word in self.indices and overwrite:`. Otherwise, even when overwrite is set to True, the symbols will not be overwritten.

In your case, this results in the dictionary having duplicate special tokens. Normally, when loading from a file, the dictionary should start with 4 special tokens (<s>, <pad>, </s> and <unk>, in that order), followed by the entries in the file. If your file already has special tokens with #fairseq:overwrite, they should be overwritten. The dict.txt saved during preprocessing is meant to skip those first tokens, and should not have those #fairseq:overwrite tags.
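
The duplication can be reproduced in isolation. This is a minimal sketch, assuming a stripped-down stand-in class (`MiniDictionary` is hypothetical and not fairseq's Dictionary) that keeps only the three fields add_symbol touches and copies the method body quoted above. It seeds the 4 special tokens and then re-adds `<s>` with overwrite=True, the way a flagged file entry would be added.

```python
class MiniDictionary:
    """Hypothetical stand-in with only the fields that add_symbol uses."""

    def __init__(self):
        self.symbols = []   # index -> symbol
        self.count = []     # index -> count
        self.indices = {}   # symbol -> index

    def add_symbol(self, word, n=1, overwrite=False):
        # Logic copied from the snippet quoted in this comment.
        if word in self.indices and not overwrite:
            idx = self.indices[word]
            self.count[idx] = self.count[idx] + n
            return idx
        else:
            idx = len(self.symbols)
            self.indices[word] = idx
            self.symbols.append(word)
            self.count.append(n)
            return idx


d = MiniDictionary()
for special in ["<s>", "<pad>", "</s>", "<unk>"]:
    d.add_symbol(special)

# Re-add a special token as a file entry tagged #fairseq:overwrite would be.
new_idx = d.add_symbol("<s>", n=5, overwrite=True)

print(d.symbols)         # "<s>" now appears at index 0 and at index 4
print(d.indices["<s>"])  # the lookup points at the newly appended slot
```

In this sketch the earlier slot for `<s>` stays in the symbols list while the index lookup moves to the appended entry, which is the duplicate special token described above.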

I have created a pull request that fixes the bug (#5329).

nelson-liu added the "bug" and "needs triage" labels on Jul 11, 2021

lydianish linked a pull request on Sep 20, 2023 that will close this issue: fix overwrite bug when adding symbol to dictionary #5329

