diff --git a/examples/research_projects/pplm/README.md b/examples/research_projects/pplm/README.md
index 204500879fc4..f37ea8e96f21 100644
--- a/examples/research_projects/pplm/README.md
+++ b/examples/research_projects/pplm/README.md
@@ -10,6 +10,9 @@ Blog link: https://eng.uber.com/pplm
 
 Please check out the repo under uber-research for more information: https://github.com/uber-research/PPLM
 
+# Note
+
+⚠️ This project should be run with pytorch-lightning==1.0.4, which has a potential security vulnerability.
 
 ## Setup
 
@@ -20,7 +23,7 @@ pip install nltk torchtext # additional requirements.
 cd examples/research_projects/pplm
 ```
 
-## PPLM-BoW 
+## PPLM-BoW
 
 ### Example command for bag-of-words control
 
@@ -30,7 +33,7 @@ python run_pplm.py -B military --cond_text "The potato" --length 50 --gamma 1.5
 
 ### Tuning hyperparameters for bag-of-words control
 
-1. Increase `--stepsize` to intensify topic control, and decrease its value to soften the control. `--stepsize 0` recovers the original uncontrolled GPT-2 model. 
+1. Increase `--stepsize` to intensify topic control, and decrease its value to soften the control. `--stepsize 0` recovers the original uncontrolled GPT-2 model.
 2. If the language being generated is repetitive (For e.g. "science science experiment experiment"), there are several options to consider:
    a) Reduce the `--stepsize`
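
To illustrate point 1 of the tuning notes above, a minimal sketch built on the bag-of-words command shown in this README (the stepsize values here are illustrative, not tuned recommendations):

```bash
# Softer topic control: a smaller step size (value chosen for illustration only)
python run_pplm.py -B military --cond_text "The potato" --length 50 --gamma 1.5 --stepsize 0.01

# --stepsize 0 turns the perturbation off entirely, recovering plain GPT-2
python run_pplm.py -B military --cond_text "The potato" --length 50 --gamma 1.5 --stepsize 0
```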
@@ -48,7 +51,6 @@ python run_pplm.py -D sentiment --class_label 2 --cond_text "My dog died" --leng
 
 ### Tuning hyperparameters for discriminator control
 
-1. Increase `--stepsize` to intensify topic control, and decrease its value to soften the control. `--stepsize 0` recovers the original uncontrolled GPT-2 model. 
+1. Increase `--stepsize` to intensify topic control, and decrease its value to soften the control. `--stepsize 0` recovers the original uncontrolled GPT-2 model.
 2. Use `--class_label 3` for negative, and `--class_label 2` for positive
-
diff --git a/examples/research_projects/pplm/requirements.txt b/examples/research_projects/pplm/requirements.txt
index 62092cc300ac..70530cd79983 100644
--- a/examples/research_projects/pplm/requirements.txt
+++ b/examples/research_projects/pplm/requirements.txt
@@ -5,7 +5,7 @@ psutil
 sacrebleu
 rouge-score
 tensorflow_datasets
-pytorch-lightning==1.0.4
+pytorch-lightning
 matplotlib
 git-python==1.0.3
 faiss-cpu
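
Since requirements.txt above no longer pins pytorch-lightning, anyone who wants the behavior described in the project's Note has to opt in to the old version manually. A minimal sketch (the security caveat from the Note applies):

```bash
pip install -r examples/research_projects/pplm/requirements.txt
# Explicitly opt in to the version named in the Note; it carries a known security advisory.
pip install "pytorch-lightning==1.0.4"
```

The RAG projects below follow the same pattern with pytorch-lightning==1.3.1.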
diff --git a/examples/research_projects/rag-end2end-retriever/README.md b/examples/research_projects/rag-end2end-retriever/README.md
index 7f6ef0bd6591..dcb615918c2f 100644
--- a/examples/research_projects/rag-end2end-retriever/README.md
+++ b/examples/research_projects/rag-end2end-retriever/README.md
@@ -2,29 +2,32 @@
 
 This finetuning script is actively maintained by [Shamane Siri](https://github.com/shamanez). Feel free to ask questions on the [Forum](https://discuss.huggingface.co/) or post an issue on [GitHub](https://github.com/huggingface/transformers/issues/new/choose) and tag @shamanez.
 
-Others that helped out: Patrick von Platen (@patrickvonplaten), Quentin Lhoest (@lhoestq), and Rivindu Weerasekera (@rivinduw) 
+Others that helped out: Patrick von Platen (@patrickvonplaten), Quentin Lhoest (@lhoestq), and Rivindu Weerasekera (@rivinduw)
 
-The original RAG implementation is able to train the question encoder and generator end-to-end. 
-This extension enables complete end-to-end training of RAG including the context encoder in the retriever component. 
+The original RAG implementation is able to train the question encoder and generator end-to-end.
+This extension enables complete end-to-end training of RAG, including the context encoder in the retriever component.
 Please read the [accompanying blog post](https://shamanesiri.medium.com/how-to-finetune-the-entire-rag-architecture-including-dpr-retriever-4b4385322552) for details on this implementation.
 
 The original RAG code has also been modified to work with the latest versions of pytorch lightning (version 1.2.10) and RAY (version 1.3.0). All other implementation details remain the same as the [original RAG code](https://github.com/huggingface/transformers/tree/master/examples/research_projects/rag).
 Read more about RAG at https://arxiv.org/abs/2005.11401.
 
-This code can be modified to experiment with other research on retrival augmented models which include training of the retriever (e.g. [REALM](https://arxiv.org/abs/2002.08909) and [MARGE](https://arxiv.org/abs/2006.15020)). 
+This code can be modified to experiment with other research on retrieval-augmented models that include training of the retriever (e.g. [REALM](https://arxiv.org/abs/2002.08909) and [MARGE](https://arxiv.org/abs/2006.15020)).
 
-To start training, use the bash script (finetune_rag_ray_end2end.sh) in this folder. This script also includes descriptions on each command-line argument used. 
+To start training, use the bash script (finetune_rag_ray_end2end.sh) in this folder. This script also includes descriptions of each command-line argument used.
+# Note
+
+⚠️ This project should be run with pytorch-lightning==1.3.1, which has a potential security vulnerability.
 
 # Testing
 
 The following two bash scripts can be used to quickly test the implementation.
-1. sh ./test_run/test_rag_new_features.sh 
-   - Tests the newly added functions (set_context_encoder and set_context_encoder_tokenizer) related to modeling rag. 
+1. sh ./test_run/test_rag_new_features.sh
+   - Tests the newly added functions (set_context_encoder and set_context_encoder_tokenizer) related to modeling rag.
    - This is sufficient to check the model's ability to use the set functions correctly.
 2. sh ./test_run/test_finetune.sh script
    - Tests the full end-to-end fine-tuning ability with a dummy knowlendge-base and dummy training dataset (check test_dir directory).
-   - Users can replace the dummy dataset and knowledge-base with their own to do their own finetuning. 
+   - Users can replace the dummy dataset and knowledge-base with their own to do their own finetuning.
 
 # Comparison of end2end RAG (including DPR finetuning) VS original-RAG
 
@@ -34,14 +37,14 @@ We conducted a simple experiment to investigate the effectiveness of this end2en
 - Create a knowledge-base using all the context passages in the SQuAD dataset with their respective titles.
 - Use the question-answer pairs as training data.
 - Train the system for 10 epochs.
-- Test the Exact Match (EM) score with the SQuAD dataset's validation set. 
-- Training dataset, the knowledge-base, and hyperparameters used in experiments can be accessed from [here](https://drive.google.com/drive/folders/1qyzV-PaEARWvaU_jjpnU_NUS3U_dSjtG?usp=sharing). 
+- Test the Exact Match (EM) score with the SQuAD dataset's validation set.
+- The training dataset, the knowledge-base, and the hyperparameters used in the experiments can be accessed from [here](https://drive.google.com/drive/folders/1qyzV-PaEARWvaU_jjpnU_NUS3U_dSjtG?usp=sharing).
 
-# Results 
+# Results
 
-- We train both models for 10 epochs. 
+- We train both models for 10 epochs.
 
 | Model Type | EM-Score|
-| --------------------| --------| 
+| --------------------| --------|
 | RAG-original | 28.12 |
-| RAG-end2end with DPR| 40.02 | 
+| RAG-end2end with DPR| 40.02 |
diff --git a/examples/research_projects/rag-end2end-retriever/requirements.txt b/examples/research_projects/rag-end2end-retriever/requirements.txt
index 473d972761e3..aca89c78e88c 100644
--- a/examples/research_projects/rag-end2end-retriever/requirements.txt
+++ b/examples/research_projects/rag-end2end-retriever/requirements.txt
@@ -2,6 +2,6 @@ faiss-cpu >= 1.7.0
 datasets >= 1.6.2
 psutil >= 5.7.0
 torch >= 1.4.0
-pytorch-lightning == 1.3.1
+pytorch-lightning
 nvidia-ml-py3 == 7.352.0
 ray >= 1.3.0
diff --git a/examples/research_projects/rag/README.md b/examples/research_projects/rag/README.md
index 74a1ab0bf93f..b7b17d731bb1 100644
--- a/examples/research_projects/rag/README.md
+++ b/examples/research_projects/rag/README.md
@@ -11,6 +11,10 @@ Such contextualized inputs are passed to the generator.
 
 Read more about RAG at https://arxiv.org/abs/2005.11401.
 
+# Note
+
+⚠️ This project should be run with pytorch-lightning==1.3.1, which has a potential security vulnerability.
+
 # Finetuning
 
 Our finetuning logic is based on scripts from [`examples/seq2seq`](https://github.com/huggingface/transformers/tree/master/examples/seq2seq). We accept training data in the same format as specified there - we expect a directory consisting of 6 text files:
@@ -52,8 +56,8 @@ You will then be able to pass `path/to/checkpoint` as `model_name_or_path` to th
 
 ## Document Retrieval
 When running distributed fine-tuning, each training worker needs to retrieve contextual documents
-for its input by querying a index loaded into memory. RAG provides two implementations for document retrieval, 
-one with [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) communication package and the other 
+for its input by querying an index loaded into memory. RAG provides two implementations for document retrieval,
+one with the [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) communication package and the other
 with [`Ray`](https://docs.ray.io/en/master/).
 This option can be configured with the `--distributed_retriever` flag which can either be set to `pytorch` or `ray`.
 
@@ -62,7 +66,7 @@ By default this flag is set to `pytorch`.
 For the Pytorch implementation, only training worker 0 loads the index into CPU memory, and a gather/scatter pattern is used
 to collect the inputs from the other training workers and send back the corresponding document embeddings.
 
-For the Ray implementation, the index is loaded in *separate* process(es). The training workers randomly select which 
+For the Ray implementation, the index is loaded in *separate* process(es). The training workers randomly select which
 retriever worker to query. To use Ray for distributed retrieval, you have to set the `--distributed_retriever` arg to `ray`.
 To configure the number of retrieval workers (the number of processes that load the index), you can set the `num_retrieval_workers` flag.
 Also make sure to start the Ray cluster before running fine-tuning.
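
As a sketch of the Ray setup described above: the two retriever flags come from this section, while the data and model arguments are assumed placeholders rather than a documented invocation.

```bash
# Start a single-node Ray cluster first (assumed local setup).
ray start --head

# Illustrative fine-tuning launch; only the retriever flags are documented above.
python examples/research_projects/rag/finetune_rag.py \
    --model_name_or_path facebook/rag-sequence-nq \
    --data_dir path/to/data \
    --output_dir path/to/output \
    --distributed_retriever ray \
    --num_retrieval_workers 4
```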
@@ -119,7 +123,7 @@ We demonstrate how to evaluate retrieval against DPR evaluation data. You can do
    --gold_data_path output/biencoder-nq-dev.pages
 ```
 3. Run evaluation:
-   ```bash 
+   ```bash
    python examples/research_projects/rag/eval_rag.py \
        --model_name_or_path facebook/rag-sequence-nq \
        --model_type rag_sequence \
@@ -139,7 +143,7 @@ We demonstrate how to evaluate retrieval against DPR evaluation data. You can do
        --predictions_path output/retrieval_preds.tsv \ # name of file where predictions will be stored
        --eval_mode retrieval \ # indicates whether we're performing retrieval evaluation or e2e evaluation
        --k 1 # parameter k for the precision@k metric
-    
+
    ```
 
 ## End-to-end evaluation
@@ -153,8 +157,8 @@ who is the owner of reading football club ['Xiu Li Dai', 'Dai Yongge', 'Dai Xiul
 Xiu Li Dai
 ```
 
-Predictions of the model for the samples from the `evaluation_set` will be saved under the path specified by the `predictions_path` parameter. 
-If this path already exists, the script will use saved predictions to calculate metrics. 
+Predictions of the model for the samples from the `evaluation_set` will be saved under the path specified by the `predictions_path` parameter.
+If this path already exists, the script will use saved predictions to calculate metrics.
 Add `--recalculate` parameter to force the script to perform inference from scratch.
 
 An example e2e evaluation run could look as follows:
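
The README's own example command falls outside the hunks shown here, so it is not reproduced. Purely as a hedged sketch, an e2e run assembled from the flags documented in the retrieval-evaluation section above might look like:

```bash
# Sketch only, not the elided original. --eval_mode e2e scores final answers rather than
# retrieved documents; --recalculate forces fresh inference even if predictions_path exists.
python examples/research_projects/rag/eval_rag.py \
    --model_name_or_path facebook/rag-sequence-nq \
    --model_type rag_sequence \
    --evaluation_set path/to/questions.txt \
    --gold_data_path path/to/gold_answers.txt \
    --predictions_path output/e2e_preds.txt \
    --eval_mode e2e \
    --recalculate
```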
@@ -196,4 +200,4 @@ python examples/research_projects/rag/finetune_rag.py \
     --index_name custom
     --passages_path path/to/data/my_knowledge_dataset
     --index_path path/to/my_knowledge_dataset_hnsw_index.faiss
-```
\ No newline at end of file
+```
diff --git a/examples/research_projects/rag/requirements.txt b/examples/research_projects/rag/requirements.txt
index ef065e36e1c9..652821a216cb 100644
--- a/examples/research_projects/rag/requirements.txt
+++ b/examples/research_projects/rag/requirements.txt
@@ -3,5 +3,5 @@ datasets >= 1.0.1
 psutil >= 5.7.0
 torch >= 1.4.0
 transformers
-pytorch-lightning==1.3.1
-GitPython
\ No newline at end of file
+pytorch-lightning
+GitPython
diff --git a/examples/research_projects/seq2seq-distillation/README.md b/examples/research_projects/seq2seq-distillation/README.md
index 8157f753f8ec..62c38bfd7140 100644
--- a/examples/research_projects/seq2seq-distillation/README.md
+++ b/examples/research_projects/seq2seq-distillation/README.md
@@ -13,6 +13,10 @@ Author: Sam Shleifer (https://github.com/sshleifer)
 - `FSMTForConditionalGeneration`
 - `T5ForConditionalGeneration`
 
+# Note
+
+⚠️ This project should be run with pytorch-lightning==1.0.4, which has a potential security vulnerability.
+
 ## Datasets
 
 #### XSUM
@@ -62,7 +66,7 @@ https://github.com/huggingface/transformers/tree/master/scripts/fsmt
 
 #### Pegasus (multiple datasets)
 
-Multiple eval datasets are available for download from: 
+Multiple eval datasets are available for download from:
 https://github.com/stas00/porting/tree/master/datasets/pegasus
 
@@ -210,7 +214,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
 
 ### Converting pytorch-lightning checkpoints
 pytorch lightning ``-do_predict`` often fails, after you are done training, the best way to evaluate your model is to convert it.
 
-This should be done for you, with a file called `{save_dir}/best_tfmr`. 
+This should be done for you, with a file called `{save_dir}/best_tfmr`.
 
 If that file doesn't exist but you have a lightning `.ckpt` file, you can run
 ```bash
@@ -219,7 +223,7 @@ python convert_pl_checkpoint_to_hf.py PATH_TO_CKPT randomly_initialized_hf_mode
 
 Then either `run_eval` or `run_distributed_eval` with `save_dir/best_tfmr` (see previous sections)
 
-# Experimental Features 
+# Experimental Features
 These features are harder to use and not always useful.
 
 ### Dynamic Batch Size for MT
@@ -230,7 +234,7 @@ This feature can only be used:
 - without sortish sampler
 - after calling `./save_len_file.py $tok $data_dir`
 
-For example, 
+For example,
 ```bash
 ./save_len_file.py Helsinki-NLP/opus-mt-en-ro wmt_en_ro
 ./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs
@@ -254,10 +258,10 @@ This section describes all code and artifacts from our [Paper](http://arxiv.org/
 ![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)
 
 + For the CNN/DailyMail dataset, (relatively longer, more extractive summaries), we found a simple technique that works, which we call "Shrink and Fine-tune", or SFT.
-you just copy alternating layers from `facebook/bart-large-cnn` and fine-tune more on the cnn/dm data. `sshleifer/distill-pegasus-cnn-16-4`, `sshleifer/distilbart-cnn-12-6` and all other checkpoints under `sshleifer` that start with `distilbart-cnn` were trained this way. 
+You just copy alternating layers from `facebook/bart-large-cnn` and fine-tune more on the CNN/DM data. `sshleifer/distill-pegasus-cnn-16-4`, `sshleifer/distilbart-cnn-12-6`, and all other checkpoints under `sshleifer` that start with `distilbart-cnn` were trained this way.
 + For the XSUM dataset, training on pseudo-labels worked best for Pegasus (`sshleifer/distill-pegasus-16-4`), while training with KD worked best for `distilbart-xsum-12-6`
 + For `sshleifer/dbart-xsum-12-3`
-+ We ran 100s experiments, and didn't want to document 100s of commands. If you want a command to replicate a figure from the paper that is not documented below, feel free to ask on the [forums](https://discuss.huggingface.co/t/seq2seq-distillation-methodology-questions/1270) and tag `@sshleifer`. 
++ We ran 100s of experiments, and didn't want to document 100s of commands. If you want a command to replicate a figure from the paper that is not documented below, feel free to ask on the [forums](https://discuss.huggingface.co/t/seq2seq-distillation-methodology-questions/1270) and tag `@sshleifer`.
 + You can see the performance tradeoffs of model sizes [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=0).
 and more granular timing results [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=1753259047&range=B2:I23).
@@ -303,10 +307,10 @@ deval 1 sshleifer/distill-pegasus-xsum-16-4 xsum dpx_xsum_eval
 
 + Find a teacher model [Pegasus](https://huggingface.co/models?search=pegasus) (slower, better ROUGE) or `facebook/bart-large-xsum`/`facebook/bart-large-cnn` (faster, slightly lower.). Choose the checkpoint where the corresponding dataset is most similar (or identical to) your dataset.
 + Follow the sections in order below. You can stop after SFT if you are satisfied, or move on to pseudo-labeling if you want more performance.
-+ student size: If you want a close to free 50% speedup, cut the decoder in half. If you want a larger speedup, cut it in 4. 
++ Student size: if you want a close-to-free 50% speedup, cut the decoder in half. If you want a larger speedup, cut it in 4 (keep a quarter of the layers).
 + If your SFT run starts at a validation ROUGE-2 that is more than 10 pts below the teacher's validation ROUGE-2, you have a bug. Switching to a more expensive technique will not help. Try setting a breakpoint and looking at generation and truncation defaults/hyper-parameters, and share your experience on the forums!
 
-  
+
 #### Initialization
 We use [make_student.py](./make_student.py) to copy alternating layers from the teacher, and save the resulting model to disk
 ```bash
@@ -319,7 +323,7 @@ python make_student.py google/pegasus-xsum --save_path dpx_xsum_16_4 --e 16 --d
 we now have an initialized student saved to `dbart_xsum_12_3`, which we will use for the following commands.
 + Extension: To replicate more complicated initialize experiments in section 6.1, or try your own. Use the `create_student_by_copying_alternating_layers` function.
 
-#### Pegasus 
+#### Pegasus
 + The following commands are written for BART and will require, at minimum, the following modifications
 + reduce batch size, and increase gradient accumulation steps so that the product `gpus * batch size * gradient_accumulation_steps = 256`. We used `--learning-rate` = 1e-4 * gradient accumulation steps.
 + don't use fp16
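
To make the batch-size rule in the bullets above concrete, a worked example (an illustration, not from the original README):

```bash
# gpus * batch_size * gradient_accumulation_steps should equal 256.
gpus=1; batch_size=8
accum=$(( 256 / (gpus * batch_size) ))   # = 32
lr=$(python -c "print(1e-4 * $accum)")   # = 0.0032, i.e. 1e-4 * accumulation steps
echo "gradient_accumulation_steps=$accum learning_rate=$lr"
```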
@@ -379,7 +383,7 @@ python finetune.py \
     --output_dir dbart_xsum_12_3_PL --gpus 1 --logger_name wandb
 ```
 
-  
+
 To combine datasets, as in Section 6.2, try something like:
 ```bash
@@ -413,7 +417,7 @@ The command that produced `sshleifer/distilbart-xsum-12-6` is at [./train_distil
 
 ```bibtex
 @misc{shleifer2020pretrained,
-      title={Pre-trained Summarization Distillation}, 
+      title={Pre-trained Summarization Distillation},
       author={Sam Shleifer and Alexander M. Rush},
       year={2020},
       eprint={2010.13002},
diff --git a/examples/research_projects/seq2seq-distillation/requirements.txt b/examples/research_projects/seq2seq-distillation/requirements.txt
index 0cd973d4d5ca..533f6339ab08 100644
--- a/examples/research_projects/seq2seq-distillation/requirements.txt
+++ b/examples/research_projects/seq2seq-distillation/requirements.txt
@@ -4,7 +4,7 @@ psutil
 sacrebleu
 rouge-score
 tensorflow_datasets
-pytorch-lightning==1.0.4
+pytorch-lightning
 matplotlib
 git-python==1.0.3
 faiss-cpu
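
A closing note on the requirements changes in this diff: with the pins removed from all four requirements files, pip resolves whichever pytorch-lightning release is current, so it is worth checking what actually got installed before relying on a project's pinned-version Note:

```bash
pip show pytorch-lightning
# or:
python -c "import pytorch_lightning as pl; print(pl.__version__)"
```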