📖 [Arxiv](https://arxiv.org/abs/2412.17365)
Datasets nowadays are generally constructed from multiple sources and with different synthetic techniques, making data de-noising and de-duplication crucial before they are used for post-training. In this work, we propose to perform instruction tuning by iterative data selection (IterIT). We measure the quality of a sample along both complexity and diversity. Instead of calculating the complexity score once and for all before fine-tuning, we highlight the importance of updating this model-specific score during fine-tuning to accurately reflect the dynamic changes of the model. The diversity score, in turn, is defined over the samples' responses, taking their informativeness into account. IterIT integrates the strengths of both worlds by iteratively updating the complexity score for the top-ranked samples and greedily selecting the ones with the highest complexity-diversity score. Experiments on multiple instruction-tuning datasets demonstrate consistent improvements of IterIT over strong baselines. Moreover, our approach also generalizes well to domain-specific scenarios and different backbone models.
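As a rough illustration of the selection loop described above, here is a minimal Python sketch. The function names `complexity_score` and `diversity_score`, the additive combination of the two scores, and the per-epoch update budget are all placeholders for this sketch, not the exact formulas or implementation used in IterIT:

```python
# Hypothetical sketch of iterative complexity-diversity data selection.
# `complexity_score` and `diversity_score` are placeholders, not the exact IterIT formulas.
from typing import Callable, Dict, List


def select_iteratively(
    pool: List[Dict],                  # candidate samples (instruction/input/output dicts)
    complexity_score: Callable,        # model-specific score, recomputed as the model changes
    diversity_score: Callable,         # response-based score w.r.t. already selected samples
    n_epochs: int,
    n_per_epoch: int,
    n_update: int,
) -> List[Dict]:
    selected: List[Dict] = []
    complexity = {i: complexity_score(s) for i, s in enumerate(pool)}
    remaining = set(range(len(pool)))

    for _ in range(n_epochs):
        # Rank remaining samples by a combined complexity-diversity score and
        # greedily pick the top ones (the real method may rescore sample by sample).
        ranked = sorted(
            remaining,
            key=lambda i: complexity[i] + diversity_score(pool[i], selected),
            reverse=True,
        )
        picked = ranked[:n_per_epoch]
        selected.extend(pool[i] for i in picked)
        remaining -= set(picked)

        # ... fine-tune the model on `selected` here ...

        # Refresh the model-specific complexity score only for the current top-ranked
        # candidates, so the score tracks the model as it changes during fine-tuning.
        for i in ranked[n_per_epoch:n_per_epoch + n_update]:
            complexity[i] = complexity_score(pool[i])

    return selected
```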
Clone this repo into your working directory and set up the environment:
```bash
git clone https://github.com/JiaQiSJTU/IterIT.git
cd IterIT
conda create -n IterIT python=3.10
conda activate IterIT
pip install -r requirements.txt
```
Major requirements with version information are listed in requirements.txt.
Please download the datasets from their corresponding sources and put them in the `data` folder after the necessary format transformation (a conversion sketch is given below). Some of the datasets adopted in this work are listed below:
- alpaca: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json
- alpaca-gpt4: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data.json
- wizardlm: https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_70k
- dolly: https://huggingface.co/datasets/databricks/databricks-dolly-15k
The data file should be in JSON format with the following fields for each sample:
- `instruction`: the user query or the task description
- `input`: the input of the instruction
- `output`: the reference response to the instruction

An example of the data file is shown in `data/examples.json`.
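As a hedged example of the format transformation, the sketch below converts the dolly dataset (whose records carry `instruction`/`context`/`response` fields on its Hugging Face card) into the `instruction`/`input`/`output` layout described above. The output path `data/dolly.json` is illustrative, not a path required by the repo:

```python
# Illustrative conversion of databricks-dolly-15k into the instruction/input/output format.
# The target path "data/dolly.json" is an assumption for this example.
import json
import os

from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

converted = [
    {
        "instruction": row["instruction"],  # user query or task description
        "input": row["context"],            # optional supporting context
        "output": row["response"],          # reference response
    }
    for row in dolly
]

os.makedirs("data", exist_ok=True)
with open("data/dolly.json", "w") as f:
    json.dump(converted, f, indent=2, ensure_ascii=False)
```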
To train the model with IterIT, you can use the following command:
```bash
bash scripts/train_IterIT.sh
```
Some major parameters for IterIT are:
- `ratio`: the proportion of samples to be used in each epoch. The recommended value is 0.05.
- `update_ratio`: the number of samples to be updated after each epoch. The recommended value is 3.
- `update_strategy`: the strategy for updating the samples. `iterative` refers to IterIT, where the complexity score is updated during fine-tuning. `baseline` refers to previous baselines, where the complexity score is computed once outside of fine-tuning and never updated.
- `score_type`: the score adopted to measure the quality of each sample. `BOTH` uses both the complexity and diversity scores (an illustrative sketch of how they might combine follows this list). `IFD` uses the complexity score only.
- `diversity_weight_decay`: the weight decay of the diversity score. The recommended value is 0.1 for instruction-tuning datasets and 0.0 for domain-specific datasets.
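As a rough illustration of how these parameters interact (not the formula used in the paper), one way to combine the two scores under `score_type=BOTH` with a decaying diversity weight looks like the sketch below. The epoch-wise decay schedule and the additive combination are assumptions made for this example only:

```python
# Hypothetical combination of complexity and diversity scores; not the authors' exact formula.
def combined_score(
    complexity: float,
    diversity: float,
    epoch: int,
    score_type: str = "BOTH",
    diversity_weight_decay: float = 0.1,
) -> float:
    if score_type == "IFD":
        return complexity  # complexity (IFD) score only

    # Assumed: the diversity weight shrinks by `diversity_weight_decay` each epoch,
    # so later epochs rely more on the model-specific complexity score.
    diversity_weight = max(0.0, 1.0 - diversity_weight_decay * epoch)
    return complexity + diversity_weight * diversity
```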
Please refer to the following repository for evaluation:
The test scripts adopted in this work are listed in `eval_scripts/`.
If you find this work useful for your research, please consider citing:
```bibtex
@article{jia2024iterit,
  title={Boosting LLM via Learning from Data Iteratively and Selectively},
  author={Jia, Qi and Ren, Siyu and Qin, Ziheng and Xue, Fuzhao and Ni, Jinjie and You, Yang},
  journal={arXiv preprint arXiv:2412.17365},
  year={2024}
}
```