# Dataset preprocessing

Datasets used for model training are 🤗 Datasets wrapped into PyTorch Lightning data modules (see the `data` package). Datasets are automatically downloaded, preprocessed and cached when their corresponding Lightning data module is loaded during training. For larger datasets, such as Wikipedia or BookCorpus, it is recommended to do this prior to training, as described in the next section. The C4 dataset is streamed directly and doesn't need preprocessing.
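
Preprocessing can also be triggered programmatically by instantiating a data module and calling its standard Lightning hooks. A minimal sketch, assuming a `WikiTextDataModule` class with `tokenizer` and `max_seq_len` constructor arguments (import path, class name and parameter names are assumptions; see the `data` package for the actual classes and signatures):

```python
# Sketch only: import path, class name and constructor arguments are
# assumptions. prepare_data() and setup() are standard Lightning hooks.
from perceiver.data.text import WikiTextDataModule  # assumed import path

dm = WikiTextDataModule(
    tokenizer="deepmind/language-perceiver",  # assumed parameter name
    max_seq_len=4096,                         # assumed parameter name
)
dm.prepare_data()      # downloads, preprocesses and caches the dataset
dm.setup(stage="fit")  # loads the cached dataset
```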

## Text datasets

Text dataset preprocessing requires a 🤗 fast tokenizer or the deepmind/language-perceiver tokenizer. Tokenizers can be specified with the `--tokenizer` command line option. The following preprocessing commands are examples; adjust them as needed for model training. A sketch of consuming a preprocessed dataset follows the list.
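
Whether a given tokenizer qualifies as a 🤗 fast tokenizer can be checked via its `is_fast` attribute before passing its name to `--tokenizer` (a small sketch using the 🤗 Transformers API):

```python
from transformers import AutoTokenizer

# Fast tokenizers are backed by the Rust implementation of 🤗 Tokenizers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.is_fast)  # True for fast tokenizers
```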

- bookcorpus (`plain_text`):

  ```shell
  python -m perceiver.scripts.text.preproc bookcorpus \
    --tokenizer=bert-base-uncased \
    --max_seq_len=512 \
    --task=mlm \
    --add_special_tokens=false
  ```

- bookcorpusopen (`plain_text`):

  ```shell
  python -m perceiver.scripts.text.preproc bookcorpusopen \
    --tokenizer=xlnet-base-cased \
    --max_seq_len=4096 \
    --task=clm \
    --add_special_tokens=false \
    --random_train_shift=true
  ```

- wikipedia (`20220301.en`):

  ```shell
  python -m perceiver.scripts.text.preproc wikipedia \
    --tokenizer=bert-base-uncased \
    --max_seq_len=512 \
    --task=mlm \
    --add_special_tokens=false
  ```

- wikitext (`wikitext-103-raw-v1`), used in training examples:

  ```shell
  python -m perceiver.scripts.text.preproc wikitext \
    --tokenizer=deepmind/language-perceiver \
    --max_seq_len=4096 \
    --task=clm \
    --add_special_tokens=false
  ```

- imdb (`plain_text`), used in training examples:

  ```shell
  python -m perceiver.scripts.text.preproc imdb \
    --tokenizer=deepmind/language-perceiver \
    --max_seq_len=2048 \
    --task=clf \
    --add_special_tokens=true
  ```

- enwik8 (`enwik8`):

  ```shell
  python -m perceiver.scripts.text.preproc enwik8 \
    --tokenizer=deepmind/language-perceiver \
    --max_seq_len=4096 \
    --add_special_tokens=false
  ```

- C4 (`c4`), used in training examples: streaming dataset, no preprocessing needed.
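
Once preprocessed, a dataset is served through the data module's standard Lightning interface during training. A minimal sketch, reusing the assumed `WikiTextDataModule` from above; the cache-reuse behavior is inferred from the caching description at the top of this page:

```python
# Sketch only: class name and constructor arguments are assumptions.
dm = WikiTextDataModule(
    tokenizer="deepmind/language-perceiver",
    max_seq_len=4096,
)
dm.prepare_data()      # should find the cache written by the preproc script (assumption)
dm.setup(stage="fit")

# train_dataloader() is standard Lightning data module API.
batch = next(iter(dm.train_dataloader()))
```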

## Image datasets