This repo contains the original implementation of the paper "Nonparametric Masked Language Modeling".
@article{ min2022nonparametric,
title={ Nonparametric Masked Language Modeling },
author={ Min, Sewon and Shi, Weijia and Lewis, Mike and Chen, Xilun and Yih, Wen-tau and Hajishirzi, Hannaneh and Zettlemoyer, Luke },
year={ 2022 }
}
Models are available from Huggingface Hub:hugs:! Check out npm (for phrase retrieval) and npm-single (for token retrieval).
We are working on a simple demo where you can simply download all the resources and deploy on your machine. Stay tuned!
- 01/02/2023: The code for training is released. See train.md for instructions.
- 12/22/2022: The code for inference is released. Stay tuned for the code for training.
conda create -n npm python=3.7
conda activate npm
pip3 install -r requirements.txt --user
If you will use open-set tasks, make sure to install java as well.
conda install -c conda-forge openjdk
Note that multi-gpu inference is not supported for now.
Evaluation datasets and reference corpora can be downloaded via
# To run evaluation on closed-set tasks
bash scripts/download_data.sh closed
bash scripts/download_corpus.sh closed
# To run evaluation on open-set tasks
bash scripts/download_data.sh open
bash scripts/download_corpus.sh enwiki
# To run evaluation on TempLAMA (need Wikipedia 2022)
bash scripts/download_data.sh templama
bash scripts/download_corpus.sh new-enwiki
The corpus data is required for NPM and the retrieve-and-generate baselines. If you will only run parametric baselines, you can skip downloading the corpus.
All reference corpus files are saved under corpus/
and evaluation datasets are saved under data/
.
The following is the script for runing the RoBERTA-large baseline on all 9 datasets used in the paper.
python -m scripts.prompt \
--checkpoint_path roberta-large \
--eval_dataset agn+yahoo+rte+subj+sst2+mr+rt+cr+amazon \
--save_dir save/roberta \
--single
# To run on AGN, Yahoo and RTE:
bash scripts/save_embeddings.sh npm enwiki-0 false 320
bash scripts/save_embeddings.sh npm cc_news false 320
python -m scripts.prompt \
--corpus_data enwiki-0+cc_news \
--checkpoint_path npm \
--eval_dataset agn+yahoo+rte \
--temperature 5.0 \
--save_dir save/npm
# To run on Subj:
bash scripts/save_embeddings.sh npm subj false 320
python -m scripts.prompt \
--corpus_data subj \
--checkpoint_path npm \
--eval_dataset subj \
--temperature 5.0 \
--save_dir save/npm
# To run on SST-2, MR, RT, CR and Amazon:
bash scripts/save_embeddings.sh npm imdb false 320
bash scripts/save_embeddings.sh npm amazon false 320
python -m scripts.prompt \
--corpus_data imdb+amazon \
--checkpoint_path npm \
--eval_dataset sst2+mr+rt+cr+amazon \
--temperature 5.0 \
--save_dir save/npm
Note that scripts/save_embeddings.sh
takes
- model name (npm or npm-single)
- corpus name
- whether it is an open-set task (true or false)
- batch size (
320
is good for a 32gb GPU; iftrainer.precision=16
is used,400
is good for a 32gb GPU) as arguments. Embeddings are saved undersave/{model_name}/dstore
.
# To run on AGN, Yahoo and RTE:
bash scripts/save_embeddings.sh npm-single enwiki-0 false 320
bash scripts/save_embeddings.sh npm-single cc_news false 320
python -m scripts.prompt \
--corpus_data enwiki-0+cc_news \
--checkpoint_path npm-single \
--eval_dataset agn+yahoo+rte \
--temperature 5.0 \
--single \
--save_dir save/npm-single
# To run on Subj:
bash scripts/save_embeddings.sh npm-single subj false 320
python -m scripts.prompt \
--corpus_data subj \
--checkpoint_path npm-single \
--eval_dataset subj \
--temperature 5.0 \
--single \
--save_dir save/npm-single
# To run on SST-2, MR, RT, CR and Amazon:
bash scripts/save_embeddings.sh npm-single imdb false 320
bash scripts/save_embeddings.sh npm-single amazon false 320
python -m scripts.prompt \
--corpus_data imdb+amazon \
--checkpoint_path npm-single \
--eval_dataset sst2+mr+rt+cr+amazon \
--temperature 5.0 \
--single \
--save_dir save/npm-single
Run the following to run causal language model baselines (T5 baselines are TBA!).
python -m scripts.clm_prompt \
--eval_dataset {lama-trex|lama-google_re|kamel|triviaqa|nq|entity_translation} \
--model_name {j-6b|neo-1.3b|neo-2.7b|neox-20b|opt-1.3b|opt-2.7b|opt-6.7b|opt-13b|opt-30b|bloom-1b7|bloom-3b|bloom-7b1} \
--save_dir save
By default, this does not use any passages from an external corpus. Specify --ret bm25
if use BM25 passages from Wikipedia 2019, and --ret bm25_2022
to use BM25 passages from Wikipedia 2022 (for TempLAMA).
Please note that running open-set tasks requires around 70GB of RAM and 1.4TB of disk memory. If you want to reduce the RAM usage, you can specify --keep_uint8
while running python -m scripts.prompt
below, which reduces the RAM usage from 70GB to 40GB while increasing the datastore setting time. We will explore further optimizing RAM/disk usage in the future version of the code (PR is also welcome!).
# Note that this can be executed in parallel with up to 20 GPUs. In total, it takes about 10 GPU hours and 1.4TB of disk memory.
for i in {0..19} ; do
bash scripts/save_embeddings.sh npm enwiki-${i} true 320
done
# Loading the model takes about 40min, and 70GB of RAM (specify `--keep_uint8` to reduce RAM usage to 40GB which increases the model loading time to 60-80min).
python -m scripts.prompt \
--corpus_data enwiki \
--checkpoint_path npm \
--eval_dataset lama-trex+lama-google_re+kamel+triviaqa+nq+entity_translation \
--save_dir save/npm \
--remove_stopwords \
--restricted \
--open
To evaluate on TempLAMA, use new-enwiki
instead of enwiki
, and use --eval_dataset {templama|unchanged_templama}
.
NPM is CC-BY-NC 4.0 licensed.
Please leave Github issues or contact Sewon Min [email protected]
for any questions.