Datasets used for model training are 🤗 Datasets wrapped into PyTorch Lightning data modules (see data package). Datasets are automatically downloaded, preprocessed and cached when their corresponding Lightning data module is loaded during training. For larger datasets, like Wikipedia or BookCorpus, it is recommended to do this prior to training as described in the next section. The C4 dataset is streamed directly and doesn't need preprocessing.
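For a single dataset, the automatic route can also be triggered manually before training by calling the standard PyTorch Lightning data module hooks. A minimal sketch, assuming a `WikiTextDataModule` in `perceiver.data.text` (class name and constructor arguments are illustrative assumptions; see the data package for the actual module definitions):

```python
# Sketch only: class name and constructor arguments are assumptions,
# check the data package for the real module interface.
from perceiver.data.text import WikiTextDataModule

dm = WikiTextDataModule(
    tokenizer="deepmind/language-perceiver",  # assumed parameter name
    max_seq_len=4096,                         # assumed parameter name
    batch_size=16,
)

# Standard Lightning hooks: prepare_data() downloads and preprocesses
# the dataset, setup() loads the cached result for training/validation.
dm.prepare_data()
dm.setup("fit")
```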
Text dataset preprocessing requires a 🤗 fast tokenizer or the `deepmind/language-perceiver` tokenizer. Tokenizers can be specified with the `--tokenizer` command line option. The following preprocessing commands are examples; adjust them to whatever you need for model training.
- bookcorpus (`plain_text`):

  ```shell
  python -m perceiver.scripts.text.preproc bookcorpus \
    --tokenizer=bert-base-uncased \
    --max_seq_len=512 \
    --task=mlm \
    --add_special_tokens=false
  ```
- bookcorpusopen (`plain_text`):

  ```shell
  python -m perceiver.scripts.text.preproc bookcorpusopen \
    --tokenizer=xlnet-base-cased \
    --max_seq_len=4096 \
    --task=clm \
    --add_special_tokens=false \
    --random_train_shift=true
  ```
- wikipedia (`20220301.en`):

  ```shell
  python -m perceiver.scripts.text.preproc wikipedia \
    --tokenizer=bert-base-uncased \
    --max_seq_len=512 \
    --task=mlm \
    --add_special_tokens=false
  ```
- wikitext (`wikitext-103-raw-v1`), used in training examples:

  ```shell
  python -m perceiver.scripts.text.preproc wikitext \
    --tokenizer=deepmind/language-perceiver \
    --max_seq_len=4096 \
    --task=clm \
    --add_special_tokens=false
  ```
- imdb (`plain_text`), used in training examples:

  ```shell
  python -m perceiver.scripts.text.preproc imdb \
    --tokenizer=deepmind/language-perceiver \
    --max_seq_len=2048 \
    --task=clf \
    --add_special_tokens=true
  ```
- enwik8 (`enwik8`):

  ```shell
  python -m perceiver.scripts.text.preproc enwik8 \
    --tokenizer=deepmind/language-perceiver \
    --max_seq_len=4096 \
    --add_special_tokens=false
  ```
- C4 (`c4`), used in training examples: streaming dataset, no preprocessing needed.
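Because C4 is streamed on the fly, there is no preprocessing command for it. As a rough illustration of how streaming works with 🤗 Datasets (independent of the data modules in this repository; the dataset name `allenai/c4` and the `en` config are assumptions for illustration):

```python
from datasets import load_dataset

# Stream the English C4 training split without downloading it to disk first.
# "allenai/c4" and the "en" config are assumptions; the repository's data
# module may reference the dataset differently.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# IterableDataset.take() yields the first N streamed examples.
for example in c4.take(3):
    print(example["text"][:80])
```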