Skip to content

amazon-science/efficient-longdoc-classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Source codes for ``Efficient Classification of Long Documents Using Transformers''

Please refer to our paper for more details and cite our paper if you find this repo useful:

@inproceedings{park-etal-2022-efficient,
    title = "Efficient Classification of Long Documents Using Transformers",
    author = "Park, Hyunji  and
      Vyas, Yogarshi  and
      Shah, Kashif",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-short.79",
    doi = "10.18653/v1/2022.acl-short.79",
    pages = "702--709",
}

Instructions

1. Install required libraries

pip install -r requirements.txt
python -m spacy download en_core_web_sm

2. Prepare the datasets

Hyperpartisan News Detection

mkdir data/hyperpartisan
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/articles-training-byarticle-20181122.zip
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/ground-truth-training-byarticle-20181122.zip
unzip data/hyperpartisan/articles-training-byarticle-20181122.zip -d data/hyperpartisan
unzip data/hyperpartisan/ground-truth-training-byarticle-20181122.zip -d data/hyperpartisan
rm data/hyperpartisan/*zip

20NewsGroups

EURLEX-57K

mkdir data/EURLEX57K
wget -O data/EURLEX57K/datasets.zip http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/datasets.zip
unzip data/EURLEX57K/datasets.zip -d data/EURLEX57K
rm data/EURLEX57K/datasets.zip
rm -rf data/EURLEX57K/__MACOSX
mv data/EURLEX57K/dataset/* data/EURLEX57K
rm -rf data/EURLEX57K/dataset
wget -O data/EURLEX57K/EURLEX57K.json http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/eurovoc_en.json
  • Running train.py with the --data eurlex flag reads and prepares the data from data/EURLEX57K/{train, dev, test}/*.json files
  • Running train.py with the --data eurlex --inverted flag creates Inverted EURLEX data by inverting the order of the sections
  • data/EURLEX57K/EURLEX57K.json contains label information.

CMU Book Summary Dataset

wget -P data/ http://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
tar -xf data/booksummaries.tar.gz -C data
  • Running train.py with the --data books flag reads and prepares the data from data/booksummaries/booksummaries.txt
  • Running train.py with the --data books --pairs flag creates Paired Book Summary by combining pairs of summaries and their labels

3. Run the models

e.g. python train.py --model_name bertplusrandom --data books --pairs --batch_size 8 --epochs 20 --lr 3e-05

cf. Note that we use the source code for the CogLTX model: https://github.com/Sleepychord/CogLTX

Hyperparameters used

Hyperpartisan

Parameter BERT BERT+TextRank BERT+Random Longformer ToBERT
Batch size 8 8 8 16 8
Epochs 20 20 20 20 20
LR 3e-05 3e-05 5e-05 5e-05 5e-05
Scheduler NA NA NA warmup NA

20NewsGroups, Book Summary, Paired Book Summary

Parameter BERT BERT+TextRank BERT+Random Longformer ToBERT
Batch size 8 8 8 16 8
Epochs 20 20 20 20 20
LR 3e-05 3e-05 3e-05 0.005 3e-05
Scheduler NA NA NA warmup NA

EURLEX, Inverted EURLEX

Parameter BERT BERT+TextRank BERT+Random Longformer ToBERT
Batch size 8 8 8 16 8
Epochs 20 20 20 20 20
LR 5e-05 5e-05 5e-05 0.005 5e-05
Scheduler NA NA NA warmup NA

About

No description, website, or topics provided.

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages