[ACL2024] A Codebase for Incremental Learning with Large Language Models; Official released code for "Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models (ACL 2024)", "Incremental Sequence Labeling: A Tale of Two Shifts (ACL 2024 Findings)", and "Concept-1K: A Novel Benchmark for Instance Incremental Learning (arxiv)"

zzz47zzz/codebase-for-incremental-learning-with-llm

[ACL 2024] A Codebase for Incremental Learning with Large Language Models

Introduction

This is a repository for Incremental Learning with Large Language Models.

  • It supports both generative and discriminative models in transformers.
  • It supports using accelerate for distributed data-parallel and model-parallel training.
  • It supports using wandb for logging.

Supported List

Scenario

  • Instance-Incremental Learning
  • Class-Incremental Learning
  • Task-Incremental Learning
  • Continual Instruction Tuning (Coming soon!)
  • Continual Knowledge Editing (Coming soon!)
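In the usual formulation, the class-incremental and task-incremental scenarios differ at inference time: task-incremental learning provides the task identity, so prediction is restricted to that task's classes, while class-incremental learning must choose among all classes seen so far. The sketch below illustrates this with plain Python; the function predict, the task_to_classes mapping, and the toy logits are hypothetical and are not taken from this codebase.

```python
from typing import Dict, List, Optional


def predict(logits: List[float],
            task_to_classes: Dict[int, List[int]],
            task_id: Optional[int] = None) -> int:
    """Return the predicted class index.

    task_id=None -> class-incremental: argmax over all seen classes.
    task_id=k    -> task-incremental: argmax over the classes of task k only.
    """
    if task_id is None:
        candidates = [c for classes in task_to_classes.values() for c in classes]
    else:
        candidates = task_to_classes[task_id]
    return max(candidates, key=lambda c: logits[c])


# Toy example: two tasks with two classes each (hypothetical values).
task_to_classes = {0: [0, 1], 1: [2, 3]}
logits = [0.1, 0.7, 0.9, 0.2]

cil_prediction = predict(logits, task_to_classes)             # all classes compete
til_prediction = predict(logits, task_to_classes, task_id=0)  # only classes 0 and 1
print(cil_prediction, til_prediction)
```

With the same logits, the two scenarios can disagree: here the class-incremental prediction picks a class from task 1, whereas the task-incremental prediction is confined to task 0.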

Tasks

  • Text Classification
  • Intent Classification
  • Relation Extraction
  • Named Entity Recognition

Methods

More baselines will be released in the future!

General (Text/Intent) Classification

Named Entity Recognition

Original for Image Classification

Datasets

Instance Incremental Learning

  • Concept-1K (The raw and the preprocessed Concept-1K are included in dataset/concept_1k, dataset/concept_1k_task10, dataset/concept_1k_task1).

Text Classification

  • Topic3datasets (agnews, dbpedia, yahoo)

Intent Classification

  • CLINC150
  • Banking77

Relation Extraction

  • FewRel
  • TACRED

Named Entity Recognition

  • Few-NERD
  • Ontonotes5
  • I2B2

Best Practice to Use this Codebase

How to reproduce the performance of SEQ and SEQ*?

The config file for SEQ (plain sequential fine-tuning) is SEQ_full.yaml in the config directory. The config file for SEQ* is SEQ_pre_warm_fix.yaml. Note that the classifier type (linear or cosine linear) is not specified in the config files because we set it in the script. An example can be found in https://github.com/zzz47zzz/codebase-for-incremental-learning-with-llm/blob/main/reproduce_shell/exp-CIL-sota/SOTA-CIL-Intent-discriminative-banking77_task7.sh.
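For readers unfamiliar with the two classifier types, the following is a minimal sketch of how a linear and a cosine linear classifier typically compute logits. It uses plain Python rather than the implementation in utils/classifier.py; the function names and the scale parameter are assumptions made for illustration.

```python
import math
from typing import List


def linear_logits(x: List[float], weights: List[List[float]]) -> List[float]:
    """Dot product between the feature vector and each class weight vector."""
    return [sum(xi * wi for xi, wi in zip(x, w)) for w in weights]


def cosine_linear_logits(x: List[float],
                         weights: List[List[float]],
                         scale: float = 1.0) -> List[float]:
    """Cosine similarity between the feature and each class weight, times a scale."""
    def norm(v: List[float]) -> float:
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(vi * vi for vi in v)) or 1.0

    x_norm = norm(x)
    return [scale * dot / (x_norm * norm(w))
            for dot, w in zip(linear_logits(x, weights), weights)]


# Toy feature and two class weight vectors with very different norms.
x = [3.0, 4.0]
weights = [[1.0, 0.0], [0.0, 10.0]]

lin = linear_logits(x, weights)
cos = cosine_linear_logits(x, weights)
print(lin, cos)
```

The linear logits grow with the norm of each class weight vector, while the cosine logits depend only on direction, so they stay within [-scale, scale].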

Usage

Overview

.
├── main_CL.py              # This is the Python file to execute for running all experiments
├── utils                       # This folder contains all basic files for incremental learning
│   ├── backbone.py             # This file loads backbone models from the transformers library
│   ├── buffer.py               # This file defines the replay buffer
│   ├── classifier.py           # This file loads Linear/CosineLinear classifiers
│   ├── wrapmodel.py            # This file wraps the model for using DeepSpeed with accelerate
│   ├── dataformat_preprocess.py # This file preprocesses the raw datasets into continual learning datasets
│   ├── dataloader.py           # This file prepares the input for language models
│   ├── dataset.py              # This file defines the format of different datasets for continual learning
│   ├── download_backbones.py   # This file downloads models in advance to avoid network problems
│   ├── evaluation.py           # This file defines the evaluation process for various tasks
│   ├── factory.py              # This file loads the various models from the ./models folder
│   ├── logger.py               # This file defines the logger
│   ├── metric.py               # This file defines the evaluation metric for continual learning
│   ├── optimizer.py            # This file defines the optimizer for different models
│   ├── prompt.py               # This file defines the prompt used for different tasks
│   ├── probing.py              # This file computes the probing performance
│   └── config.py               # This file defines general parameters and settings for the experiments
├── config                  # This folder contains the hyper-parameters for each method on each dataset
├── dataset                 # This folder contains datasets for continual learning
├── models                  # This folder contains models for continual learning
└── experiments             # This folder contains log data for each run

Quick Start

Step 1: prepare the environment

pip install -r requirement.txt

Step 2: prepare the dataset

Check the support_dataset_list in utils/dataformat_preprocess.py and select the dataset you want to experiment with.

Then, download the raw dataset to the folder dataset/{dataset-name}. For example, download clinc150 to the folder dataset/clinc150. The raw datasets can be downloaded here. Note that the raw data of Concept-1K is in dataset/concept_1k. The preprocessed Concept-1K for 10-step incremental learning is in dataset/concept_1k_task10. The whole Concept-1K is in dataset/concept_1k_task1.

Next, execute preprocess_dataset.sh. It will automatically preprocess the default datasets for reproducing results ('topic3datasets','clinc150','banking77', 'fewrel','tacred','conll2003','fewnerd','i2b2','ontonotes5') and create new folders in dataset/{dataset-for-continual-learning-name} (e.g., banking77_task7). If you do not need to customize the datasets, you can skip to Step 3.

To customize the datasets, you can run utils/dataformat_preprocess.py with your own parameters (e.g., random seed, number of tasks). This process will create a new target folder dataset/{dataset-for-continual-learning-name}. Two JSON files, continual_data.json and continual_config.json, will be saved in the target folder. For example, you can prepare the clinc150 and fewrel datasets by running

python utils/dataformat_preprocess.py --dataset clinc150 --seed 1

and

python utils/dataformat_preprocess.py --dataset fewrel --seed 1

The program will create target folders dataset/clinc150_task15 and dataset/fewrel_task8.
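The folder names encode the number of tasks. As a rough illustration, assuming the classes are shuffled with the given seed and divided evenly among the tasks (an assumption about the preprocessing, not a description of utils/dataformat_preprocess.py), a split could be sketched as follows. The function split_classes and the class count of 150 are hypothetical choices for this example.

```python
import random
from typing import List


def split_classes(num_classes: int, num_tasks: int, seed: int) -> List[List[int]]:
    """Shuffle class ids with a fixed seed and cut them into equal-sized tasks."""
    if num_tasks <= 0 or num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by a positive num_tasks")
    class_ids = list(range(num_classes))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(class_ids)
    per_task = num_classes // num_tasks
    return [class_ids[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]


# Hypothetical example: 150 classes divided into 15 tasks with seed 1.
tasks = split_classes(num_classes=150, num_tasks=15, seed=1)
print(len(tasks), len(tasks[0]))
```

Calling split_classes again with the same seed returns the same split, which is what makes a seeded preprocessing run repeatable.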

For NER datasets such as ontonotes5, you can run the following command:

python utils/dataformat_preprocess.py --dataset ontonotes5 --seed 1 --base_task_entity 8 --incremental_task_entity 2 --seen_all_labels False

The program will create a target folder dataset/ontonotes5_task6_base8_inc2. Note that fixing the random seed ensures that exactly the same datasets are generated on different devices. Finally, the preprocessed datasets clinc150_task15, fewrel_task8, and ontonotes5_task6_base8_inc2 are ready for continual learning!
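The folder name ontonotes5_task6_base8_inc2 is consistent with a simple count: the first task covers the base entity types and each later task adds a fixed number of new types. The sketch below assumes 18 entity types for ontonotes5 and that the remaining types divide evenly among the incremental tasks; the helpers num_ner_tasks and target_folder are hypothetical and not part of the codebase.

```python
def num_ner_tasks(num_entity_types: int, base: int, inc: int) -> int:
    """One base task plus one task per group of `inc` remaining entity types."""
    remaining = num_entity_types - base
    if inc <= 0 or remaining < 0 or remaining % inc != 0:
        raise ValueError("entity types cannot be divided into base + equal increments")
    return 1 + remaining // inc


def target_folder(dataset: str, num_entity_types: int, base: int, inc: int) -> str:
    """Build a folder name following the pattern seen in the example above."""
    n_tasks = num_ner_tasks(num_entity_types, base, inc)
    return f"{dataset}_task{n_tasks}_base{base}_inc{inc}"


# Mirrors --base_task_entity 8 --incremental_task_entity 2 (18 types assumed).
folder = target_folder("ontonotes5", num_entity_types=18, base=8, inc=2)
print(folder)
```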

Step 3: select the yaml file for hyper-parameters

The yaml file contains the hyper-parameters for each method. For example, the hyper-parameters of SEQ* with and without pre-allocating future classifiers, for generative backbones under the CIL setting, are defined in config/CIL/generative_backbones/clinc150_task15/SEQ_pre_warm_fix.yaml and config/CIL/generative_backbones/clinc150_task15/SEQ_warm_fix.yaml, respectively.
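Judging only from the example paths above, the config files appear to be organized by scenario, backbone type, dataset, and method. The following sketch builds such a path with pathlib; the directory layout beyond this one example is an assumption, and config_path is a hypothetical helper.

```python
from pathlib import PurePosixPath


def config_path(scenario: str, backbone_type: str, dataset: str, method: str,
                root: str = "config") -> PurePosixPath:
    """Compose root/scenario/backbone_type/dataset/method.yaml (assumed layout)."""
    return PurePosixPath(root) / scenario / backbone_type / dataset / f"{method}.yaml"


# The two SEQ* variants mentioned in the text.
with_prealloc = config_path("CIL", "generative_backbones",
                            "clinc150_task15", "SEQ_pre_warm_fix")
without_prealloc = config_path("CIL", "generative_backbones",
                               "clinc150_task15", "SEQ_warm_fix")
print(with_prealloc)
print(without_prealloc)
```

Check the config directory itself before relying on this pattern for other scenarios or datasets.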

Step 4: reproduce the results

The scripts for reproducing the probing study are in the folder reproduce_shell/exp-probing.

The scripts for reproducing the probing study with different pre-training steps are in the folder reproduce_shell/exp-probing-pretraining.

The scripts for reproducing the experiments of comparing SEQ* with SOTA methods are in the folder reproduce_shell/exp-sota.

To run an experiment, execute main_CL.py. For example, you can run the SEQ method on the clinc150_task15 dataset with bert-base-cased using the following command:

python main_CL.py --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5

If you want to use wandb for logging (see here for more help):

python main_CL.py --is_wandb True --wandb_project {your-project-name} --wandb_entity {your-entity-name} --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5 

If you want to use accelerate for data/model parallel (see here for more help):

accelerate launch --config_file {your-accelerate-config-file} main_CL.py --is_wandb True --wandb_project {your-project-name} --wandb_entity {your-entity-name} --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5 
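The three commands above differ only in an optional launcher prefix and the optional wandb flags. As a convenience, the sketch below assembles them as argument lists suitable for subprocess.run. It uses only the flags shown above; the helper build_command and its parameter names are hypothetical.

```python
import shlex
from typing import List, Optional


def build_command(exp_prefix: str,
                  cfg: str,
                  backbone: str = "bert-base-cased",
                  classifier: str = "Linear",
                  training_epochs: int = 5,
                  wandb_project: Optional[str] = None,
                  wandb_entity: Optional[str] = None,
                  accelerate_config: Optional[str] = None) -> List[str]:
    """Return the argument list for one run of main_CL.py."""
    if accelerate_config is not None:
        # Launch through accelerate for data/model parallel runs.
        cmd = ["accelerate", "launch", "--config_file", accelerate_config, "main_CL.py"]
    else:
        cmd = ["python", "main_CL.py"]
    if wandb_project is not None and wandb_entity is not None:
        cmd += ["--is_wandb", "True",
                "--wandb_project", wandb_project,
                "--wandb_entity", wandb_entity]
    cmd += ["--exp_prefix", exp_prefix,
            "--cfg", cfg,
            "--backbone", backbone,
            "--classifier", classifier,
            "--training_epochs", str(training_epochs)]
    return cmd


cmd = build_command("my-experiment", "./config/clinc150_task15/SEQ_full.yaml")
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with experiment names or paths that contain spaces.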

Please refer to utils/config.py for more general parameters and to models/{model-name}.py for model-specific parameters.

Main Results

The results on the IIL scenario. [Figure: main_results]

The results on the CIL and TIL scenarios. [Figure: main_results]

[Figure: main_results]

Questions and Citation

If you have questions about this repository, please feel free to contact me at [email protected].

If you find this repository useful, please consider citing our paper.

@misc{zheng2023learn,
      title={Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models}, 
      author={Junhao Zheng and Shengjie Qiu and Qianli Ma},
      year={2023},
      eprint={2312.07887},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@article{qiu2024incremental,
  title={Incremental Sequence Labeling: A Tale of Two Shifts},
  author={Qiu, Shengjie and Zheng, Junhao and Liu, Zhen and Luo, Yicheng and Ma, Qianli},
  journal={arXiv preprint arXiv:2402.10447},
  year={2024}
}
@misc{zheng2024concept1k,
      title={Concept-1K: A Novel Benchmark for Instance Incremental Learning}, 
      author={Junhao Zheng and Shengjie Qiu and Qianli Ma},
      year={2024},
      eprint={2402.08526},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
