## Directory Structure

The code provided assumes the following directory structure:

```
+-- <work_dir>
|  +-- Datasets
|  |  +-- <dataset>
|  +-- models
|  +-- results
```
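
A minimal Python sketch that creates this layout (the `xc` work-directory name is an assumption, matching the `--data-dir` prefix used in the tokenization command below):

```python
# Sketch: create the expected directory layout.
# The work_dir value is a placeholder; substitute your own path.
import os

work_dir = "xc"
for sub in ("Datasets", "models", "results"):
    os.makedirs(os.path.join(work_dir, sub), exist_ok=True)
```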

## Data Preparation

You can download the datasets from the Extreme Classification (XML) repository.

A dataset folder should have the following directory structure. Below we show it for the LF-AmazonTitles-131K dataset:

```
📁 LF-AmazonTitles-131K/
    📄 trn_X_Y.txt            # mappings from train IDs to label IDs
    📄 trn_filter_labels.txt  # train reciprocal pairs to be ignored during evaluation
    📄 tst_X_Y.txt            # mappings from test IDs to label IDs
    📄 tst_filter_labels.txt  # test reciprocal pairs to be ignored during evaluation
    📄 trn_X.txt              # each line contains the raw train input text; needs to be tokenized
    📄 tst_X.txt              # each line contains the raw test input text; needs to be tokenized
    📄 Y.txt                  # each line contains the raw label text; needs to be tokenized
```
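
As a rough illustration of how such filter files are typically applied (a hedged sketch, not the repository's evaluation code; the whitespace-separated `instance_id label_id` pair format is an assumption):

```python
# Hedged sketch: mask reciprocal (instance, label) pairs in a score matrix
# before computing ranking metrics, so they are never ranked in the top-k.
# Assumes each line of the filter file is "instance_id label_id".
import numpy as np

def apply_filter(scores: np.ndarray, filter_path: str) -> np.ndarray:
    pairs = np.loadtxt(filter_path, dtype=np.int64).reshape(-1, 2)
    filtered = scores.copy()
    filtered[pairs[:, 0], pairs[:, 1]] = -np.inf
    return filtered
```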

To tokenize the raw train, test, and label texts, use the following command (change the path of the dataset folder accordingly):

```
python -W ignore -u utils/CreateTokenizedFiles.py \
    --data-dir xc/Datasets/LF-AmazonTitles-131K \
    --max-length 32 \
    --tokenizer-type bert-base-uncased \
    --tokenize-label-texts
```
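
For reference, the tokenization step roughly corresponds to standard Hugging Face usage, as in the sketch below (an assumption about the script's internals; the output file names are hypothetical):

```python
# Sketch: tokenize raw texts with bert-base-uncased, mirroring the flags above.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
with open("xc/Datasets/LF-AmazonTitles-131K/trn_X.txt") as f:
    texts = [line.rstrip("\n") for line in f]

enc = tokenizer(
    texts,
    max_length=32,        # matches --max-length 32
    padding="max_length",
    truncation=True,
    return_tensors="np",
)
np.save("trn_input_ids.npy", enc["input_ids"])            # hypothetical name
np.save("trn_attention_mask.npy", enc["attention_mask"])  # hypothetical name
```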

## Training the Tail-Robust Teacher

LEVER builds upon NGAME's [1] Module-I code base. Refer to the sample command below to train the Siamese teacher model; we use the same hyper-parameters as described in [1].

```
CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --work-dir <work-dir> \
    --dataset LF-AmazonTitles-131K \
    --epochs 300 \
    --batch-size 1600 \
    --margin 0.3 \
    --eval-interval 1 \
    --enc-lr 2e-4 \
    --version lfat-131k-lbl-side \
    --filter-labels tst_filter_labels.txt \
    --num-negatives 10 \
    --num-violators \
    --save-model \
    --batch-type lbl \
    --loss-type ohnm \
    --cl-size 8 \
    --cl-start 10 \
    --cl-update 5 \
    --curr-steps 25,50,75,100,125,150,200
```
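
The `--loss-type ohnm`, `--margin 0.3`, and `--num-negatives 10` flags point to a margin loss with online hard negative mining over in-batch candidates. A minimal PyTorch sketch of that general idea (an illustration of the technique only, not LEVER's actual loss; all names are hypothetical):

```python
# Sketch of a margin loss with online hard negative mining (OHNM).
# Illustrative only; details are assumptions, not the repository's code.
import torch
import torch.nn.functional as F

def ohnm_margin_loss(query_emb, label_emb, margin=0.3, num_negatives=10):
    """query_emb, label_emb: (B, d), L2-normalized; row i forms a positive pair."""
    sim = query_emb @ label_emb.T                      # (B, B) cosine similarities
    pos = sim.diagonal()                               # similarity of positive pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_sim = sim.masked_fill(mask, float("-inf"))     # exclude the positives
    hard_neg, _ = neg_sim.topk(num_negatives, dim=1)   # hardest in-batch negatives
    # Hinge on the margin: push each positive above its hard negatives.
    return F.relu(margin - pos.unsqueeze(1) + hard_neg).mean()
```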

## References

[1]: K. Dahiya, N. Gupta, D. Saini, A. Soni, Y. Wang, K. Dave, J. Jiao, K. Gururaj, P. Dey, A. Singh, D. Hada, V. Jain, B. Paliwal, A. Mittal, S. Mehta, R. Ramjee, S. Agarwal, P. Kar and M. Varma. NGAME: Negative mining-aware mini-batching for extreme classification. In WSDM, Singapore, March 2023.