This repository includes the data and code of the paper: CITB: A Benchmark for Continual Instruction Tuning (Findings of EMNLP 2023) by Zihan Zhang, Meng Fang, Ling Chen, and Mohammad-Reza Namazi-Rad.
The code has been tested under Python 3.9. The following are the steps to set up the environment.
Create conda environment:
conda create -n citb python=3.9 -y
conda activate citb
Install PyTorch: we used PyTorch 1.10.0 with CUDA 11.3 in our experiments; other versions may also work.
# CUDA 11.3
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
Install libraries:
pip install -r requirements.txt
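As a quick sanity check (not a required step, just a suggestion), you can verify that the installed PyTorch build can see your GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"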
We use the instruction data from Super-NaturalInstructions. The processed data for the tasks in the InstrDialog and InstrDialog++ streams are available in the data/ folder. We also provide the scripts used to split the tasks under the scripts/data_scripts/ folder.
- The InstrDialog stream has 19 tasks, all dialogue-related: 4 dialogue state tracking tasks, 11 dialogue generation tasks, and 4 intent identification tasks.
- The InstrDialog++ stream has 38 tasks: all 19 tasks from the InstrDialog stream plus 19 additional tasks from broader categories such as sentence ordering, style transfer, and toxic language detection.
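For reference, the sketch below shows one way to inspect a single task file. It assumes the processed files keep the Super-NaturalInstructions JSON layout (a Definition field holding the natural language instruction and an Instances list of input/output pairs); the file name is hypothetical and the actual layout under data/ may differ.

import json

# Hypothetical path: replace with an actual task file under data/.
task_file = "data/task_example.json"

with open(task_file) as f:
    task = json.load(f)

# In the Super-NaturalInstructions format, "Definition" is a list containing the
# instruction text, and each instance's "output" is a list of reference outputs.
print(task["Definition"][0])
for instance in task["Instances"][:3]:
    print(instance["input"], "->", instance["output"][0])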
We provide executable scripts to reproduce the results. Refer to the .sh files under the scripts/ folder for the different settings. Our results are available under the scores/ folder.
Train an initial model so that it better follows instructions (the Init baseline):
bash run_initial_multitask_tuning.sh
Jointly train an initial model with the subsequent tasks (the Multi baseline):
bash run_initial_multitask_tuning_with_CL.sh
Run the different CL baselines on the InstrDialog stream:
bash short_stream_scripts/meta_job.sh
Run the different CL baselines on the InstrDialog++ stream:
bash long_stream_scripts/meta_job.sh
Run the ablation experiments:
bash ablation/{xxx}.sh
Compute the evaluation scores:
bash score_scripts/{xxx}.sh
Note
Due to limited computing resources, we used T5 as the base LM in our experiments. If you have sufficient compute, you may choose other (larger) models from Hugging Face (such as instruction-finetuned models); however, you may need to adapt the CL code accordingly.
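As an illustration only (not tied to the training scripts in this repository), swapping the base LM usually comes down to pointing the Hugging Face transformers loaders at a different checkpoint; the model name below is just an example of an instruction-finetuned alternative, and decoder-only models would additionally require the causal-LM classes and corresponding changes to the CL code.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Example checkpoint (illustrative): an instruction-finetuned T5 variant.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)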
If you find our code, data, or the paper useful, please cite the paper:
@inproceedings{zhang-etal-2023-citb,
title = "{CITB}: A Benchmark for Continual Instruction Tuning",
author = "Zhang, Zihan and
Fang, Meng and
Chen, Ling and
Namazi-Rad, Mohammad-Reza",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.633",
doi = "10.18653/v1/2023.findings-emnlp.633",
pages = "9443--9455",
abstract = "Continual learning (CL) is a paradigm that aims to replicate the human ability to learn and accumulate knowledge continually without forgetting previous knowledge and transferring it to new tasks. Recent instruction tuning (IT) involves fine-tuning models to make them more adaptable to solving NLP tasks in general. However, it is still uncertain how instruction tuning works in the context of CL tasks. This challenging yet practical problem is formulated as Continual Instruction Tuning (CIT). In this work, we establish a CIT benchmark consisting of learning and evaluation protocols. We curate two long dialogue task streams of different types, InstrDialog and InstrDialog++, to study various CL methods systematically. Our experiments show that existing CL methods do not effectively leverage the rich natural language instructions, and fine-tuning an instruction-tuned model sequentially can yield similar or better results. We further explore different aspects that might affect the learning of CIT. We hope this benchmark will facilitate more research in this direction.",
}
Our data and code are based on previous works:
If you have questions, please raise an issue.