The code provided assumes the following directory structure.
+-- <work_dir>
| +-- Datasets
| | +-- <dataset>
| +-- models
| +-- results
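If these folders do not exist yet, they can be created up front. A minimal sketch in Python (the `xc` work-directory name matches the tokenization command below and is only a placeholder):

```python
import os

work_dir = "xc"  # placeholder; substitute your actual <work_dir>
for sub in ("Datasets", "models", "results"):
    os.makedirs(os.path.join(work_dir, sub), exist_ok=True)
```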
You can download the datasets from the Extreme Classification Repository.
A dataset folder should have the following directory structure; below we show it for the LF-AmazonTitles-131K dataset:
📁 LF-AmazonTitles-131K/
📄 trn_X_Y.txt # sparse matrix mapping train instance IDs to their relevant label IDs
📄 trn_filter_labels.txt # train-set reciprocal (instance, label) pairs to be ignored during evaluation
📄 tst_X_Y.txt # sparse matrix mapping test instance IDs to their relevant label IDs
📄 tst_filter_labels.txt # test-set reciprocal (instance, label) pairs to be ignored during evaluation
📄 trn_X.txt # each line is the raw text of a train instance; needs to be tokenized
📄 tst_X.txt # each line is the raw text of a test instance; needs to be tokenized
📄 Y.txt # each line is the raw text of a label; needs to be tokenized
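For reference, the *_X_Y.txt files store the instance-to-label ground truth as sparse matrices, and the *_filter_labels.txt files are plain two-column lists of (instance ID, label ID) pairs. A minimal loading sketch, assuming the pyxclib library (`xclib`) used by the NGAME codebase is installed; file paths are illustrative:

```python
import numpy as np
from xclib.data import data_utils

# Ground truth: rows are train instances, columns are labels
trn_X_Y = data_utils.read_sparse_file("xc/Datasets/LF-AmazonTitles-131K/trn_X_Y.txt")
print(trn_X_Y.shape)  # (num_train_instances, num_labels)

# Reciprocal pairs to exclude from evaluation: "instance_id label_id" per line
filter_pairs = np.loadtxt("xc/Datasets/LF-AmazonTitles-131K/trn_filter_labels.txt",
                          dtype=np.int64)
```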
To tokenize the raw train, test, and label texts, use the following command (change the dataset folder path accordingly):
python -W ignore -u utils/CreateTokenizedFiles.py \
--data-dir xc/Datasets/LF-AmazonTitles-131K \
--max-length 32 \
--tokenizer-type bert-base-uncased \
--tokenize-label-texts
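Under the hood, the script converts each raw text file into fixed-length token IDs. A minimal sketch of the same operation, assuming the HuggingFace transformers tokenizer (the output handling is illustrative, not the script's exact format):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with open("xc/Datasets/LF-AmazonTitles-131K/trn_X.txt", encoding="utf-8") as f:
    texts = [line.rstrip("\n") for line in f]

# Truncate/pad every text to 32 tokens, matching --max-length 32 above
enc = tokenizer(texts, max_length=32, truncation=True,
                padding="max_length", return_tensors="np")
print(enc["input_ids"].shape)  # (num_texts, 32)
```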
LEVER builds upon the Module-I codebase of NGAME [1]. Refer to the sample command below to train the Siamese Teacher model; we use the same hyper-parameters as described in [1].
CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --work-dir <work-dir> \
    --dataset LF-AmazonTitles-131K \
    --epochs 300 \
    --batch-size 1600 \
    --margin 0.3 \
    --eval-interval 1 \
    --enc-lr 2e-4 \
    --version lfat-131k-lbl-side \
    --filter-labels tst_filter_labels.txt \
    --num-negatives 10 \
    --num-violators \
    --save-model \
    --batch-type lbl \
    --loss-type ohnm \
    --cl-size 8 \
    --cl-start 10 \
    --cl-update 5 \
    --curr-steps 25,50,75,100,125,150,200
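For intuition, `--loss-type ohnm` with `--margin 0.3` and `--num-negatives 10` corresponds to a triplet-style loss with online hard negative mining over in-batch labels. A minimal PyTorch sketch of the general idea (an illustration, not the exact NGAME implementation):

```python
import torch
import torch.nn.functional as F

def ohnm_triplet_loss(inst_emb, lbl_emb, pos_mask, margin=0.3, num_negatives=10):
    """Triplet margin loss over the hardest in-batch negatives.

    inst_emb: (B, d) instance embeddings; lbl_emb: (B, d) label embeddings;
    pos_mask: (B, B) bool, True where label j is relevant to instance i.
    """
    sim = F.normalize(inst_emb) @ F.normalize(lbl_emb).T  # cosine similarities
    pos_sim = sim.masked_fill(~pos_mask, -1.0).max(dim=1, keepdim=True).values
    neg_sim = sim.masked_fill(pos_mask, -1.0)  # hide the positives
    hard_neg = neg_sim.topk(num_negatives, dim=1).values  # most violating negatives
    return F.relu(hard_neg - pos_sim + margin).mean()
```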
[1]: K. Dahiya, N. Gupta, D. Saini, A. Soni, Y. Wang, K. Dave, J. Jiao, K. Gururaj, P. Dey, A. Singh, D. Hada, V. Jain, B. Paliwal, A. Mittal, S. Mehta, R. Ramjee, S. Agarwal, P. Kar and M. Varma. NGAME: Negative mining-aware mini-batching for extreme classification. In WSDM, Singapore, March 2023.