LAUG is an open-source toolkit for Language understanding AUGmentation. It provides automatic methods that approximate natural perturbations of existing data. The augmented data can be used for black-box robustness testing or to enhance training. [paper]
Requires Python 3.6.
Clone this repository:
git clone https://github.com/thu-coai/LAUG.git
Install via pip:
cd LAUG
pip install -e .
Download data and models:
The data used in our paper and the model parameters pre-trained by us are available at Link. Please download them and place them into the corresponding dirs. For model parameters released by others, please refer to the README.md under the dir of each augmentation method, such as LAUG/aug/Speech_Recognition/README.md.
Here are the 4 augmentation methods described in our paper. They are placed under the LAUG/aug dir.
- Word Perturbation (WP), at the Word_Perturbation/ dir.
- Text Paraphrasing (TP), at the Text_Paraphrasing/ dir.
- Speech Recognition (SR), at the Speech_Recognition/ dir.
- Speech Disfluency (SD), at the Speech_Disfluency/ dir.
Please see our paper and README.md in each augmentation method for detailed information.
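As a quick sanity check after cloning, the abbreviation-to-directory mapping above can be written down in Python. This is only a sketch: AUG_METHOD_DIRS, method_dir and missing_method_readmes are hypothetical helper names, not part of LAUG, and the code only inspects the paths named in this README without calling any LAUG API.

```python
from pathlib import Path

# Abbreviation -> directory name under LAUG/aug, as listed in this README.
AUG_METHOD_DIRS = {
    "WP": "Word_Perturbation",
    "TP": "Text_Paraphrasing",
    "SR": "Speech_Recognition",
    "SD": "Speech_Disfluency",
}


def method_dir(abbr, repo_root="."):
    """Return the directory of one augmentation method (hypothetical helper)."""
    return Path(repo_root) / "LAUG" / "aug" / AUG_METHOD_DIRS[abbr.upper()]


def missing_method_readmes(repo_root="."):
    """List the abbreviations whose README.md is not found under repo_root."""
    return [
        abbr
        for abbr in AUG_METHOD_DIRS
        if not (method_dir(abbr, repo_root) / "README.md").is_file()
    ]
```

Calling missing_method_readmes() from the repository root should return an empty list if the clone is complete and each method ships its README.md as described.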
See demo.py for the usage of these augmentation methods.
python demo.py
Note that our augmentation methods contain several neural models, so pre-trained parameters need to be downloaded before use. Parameters pre-trained by us are available at Link. For parameters released by others, please follow the instructions of each method.
The data used in our paper is available at Link. Please download it and place it in the data/ dir.
Our data contains 2 datasets: MultiWOZ and Frames, along with their augmented copies.
- MultiWOZ
  - Original data
    - We use MultiWOZ 2.3 as the original data. We place it at the data/multiwoz/ dir.
    - Train/val/test size: 8434/999/1000 dialogs.
    - LICENSE:
  - Augmented data
    - We have 4 augmented test sets:
      - WP (Word Perturbation), size: 1000, placed at data/multiwoz/WP.
      - TP (Text Paraphrasing), size: 1000, placed at data/multiwoz/TP.
      - SR (Speech Recognition), size: 1000, placed at data/multiwoz/SR.
      - SD (Speech Disfluency), size: 1000, placed at data/multiwoz/SD.
    - We have 1 augmented training set:
      - Size: 16868, contains: 50% Original + (12.5% WP + 12.5% TP + 12.5% SR + 12.5% SD), placed at data/multiwoz/Enhanced.
  - Real user evaluation data:
    - We collected 240 utterances from real users for our real user evaluation.
    - We place it at the data/multiwoz/Real dir.
    - Please see our paper for detailed information about the statistics and collection of the real data.
- Frames
  - Original data
    - We process Frames into the same format as MultiWOZ and place it at the data/Frames/ dir.
    - Train/val/test size: 1095/137/137 dialogs.
    - LICENSE:
  - Augmented data
    - We have 4 augmented test sets:
      - WP (Word Perturbation), size: 137, placed at data/Frames/WP.
      - TP (Text Paraphrasing), size: 137, placed at data/Frames/TP.
      - SR (Speech Recognition), size: 137, placed at data/Frames/SR.
      - SD (Speech Disfluency), size: 137, placed at data/Frames/SD.
    - We have 1 augmented training set:
      - Size: 2190, contains: 50% Original + (12.5% WP + 12.5% TP + 12.5% SR + 12.5% SD), placed at data/Frames/Enhanced.
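The layout and sizes listed above can be cross-checked with a small manifest. The sketch below is not part of LAUG: DATASETS, expected_paths, missing_paths and enhanced_breakdown are hypothetical names, the figures are copied from this README, and it is an assumption that each listed location exists as a file or directory under data/ after the download is unpacked.

```python
from pathlib import Path

METHODS = ("WP", "TP", "SR", "SD")

# Figures copied from this README; keys are the directory names under data/.
DATASETS = {
    "multiwoz": {"train": 8434, "val": 999, "test": 1000, "enhanced": 16868},
    "Frames": {"train": 1095, "val": 137, "test": 137, "enhanced": 2190},
}


def expected_paths(dataset, data_root="data"):
    """Locations this README lists for one dataset (hypothetical helper)."""
    base = Path(data_root) / dataset
    names = list(METHODS) + ["Enhanced"]
    if dataset == "multiwoz":
        names.append("Real")  # real user evaluation data, MultiWOZ only
    return [base] + [base / name for name in names]


def missing_paths(data_root="data"):
    """Return the listed locations that do not exist under data_root."""
    return [
        path
        for dataset in DATASETS
        for path in expected_paths(dataset, data_root)
        if not path.exists()
    ]


def enhanced_breakdown(dataset):
    """Apply the stated 50% / 4 x 12.5% shares to the enhanced-set size."""
    total = DATASETS[dataset]["enhanced"]
    shares = {"original": total * 0.5}
    shares.update({method: total * 0.125 for method in METHODS})
    return shares
```

For both datasets the 50% original share equals the training set size (8434 and 1095), so the enhanced set is twice the size of the original one. The per-method shares come out as 2108.5 and 273.75, which are not whole numbers, so the 12.5% figures are presumably approximate.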
We provide four base NLU models which are described in our paper:
- MILU
- BERT
- CopyNet
- GPT-2
These models are adapted from ConvLab-2. For more details, you can refer to the README.md under the LAUG/nlu/$model/$dataset dir, such as LAUG/nlu/gpt/multiwoz/README.md.
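To see which model and dataset combinations ship a README.md in a local clone, the $model/$dataset pattern above can be globbed. This is a minimal sketch: find_nlu_readmes is a hypothetical helper, not a LAUG function, and it assumes nothing about the actual directory names beyond the pattern and the example given here.

```python
from pathlib import Path


def find_nlu_readmes(repo_root="."):
    """Return (model, dataset, path) for each LAUG/nlu/<model>/<dataset>/README.md."""
    nlu_root = Path(repo_root) / "LAUG" / "nlu"
    found = []
    for readme in sorted(nlu_root.glob("*/*/README.md")):
        dataset_dir = readme.parent
        found.append((dataset_dir.parent.name, dataset_dir.name, readme))
    return found
```

Run from the repository root, each returned tuple names one model directory, one dataset directory, and the README.md to read for that pair.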
If you use LAUG in your research, please cite:
@inproceedings{liu2021robustness,
title={Robustness Testing of Language Understanding in Task-Oriented Dialog},
author={Liu, Jiexi and Takanobu, Ryuichi and Wen, Jiaxin and Wan, Dazhen and Li, Hongguang and Nie, Weiran and Li, Cheng and Peng, Wei and Huang, Minlie},
year={2021},
booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
}
Apache License 2.0