

The code was tested with python==3.7.3.

The required libraries are listed in requirements.txt.


  1. Download datasets from here:
  2. Unpack to <DATA_DIR>.
  3. In config/config_*.ini, change Paths.data_dir value to <DATA_DIR>.

Data format


  • Each dataset contains the following files: train.json, dev.json and test.json.

  • Each file contains N samples, one sample per line.

  • Each sample is a dict with the following important keys:

    • text: original document.
    • title: original title.
    • text_processed: normalized title + document (lower case, no stopwords, no punctuation). Used as input to the model.
    • label: list of relevant labels, where each label is a string. Model target.
  • The remaining keys are legacy from the original datasets and can be useful for running other baselines.
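As a minimal sketch of the format described above, the snippet below reads a JSON-lines stream and pulls out the model input and target. The helper name `read_samples` and the sample contents are made up for illustration; the repository's own data loader is not shown here and may differ.

```python
import io
import json


def read_samples(stream):
    """Yield one sample dict per non-empty line of a JSON-lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for train.json (contents are invented for illustration).
demo = io.StringIO(
    '{"text": "Deep nets for tagging.", "title": "Tagging", '
    '"text_processed": "tagging deep nets tagging", "label": ["A", "A1"]}\n'
    '{"text": "Graph methods.", "title": "Graphs", '
    '"text_processed": "graphs graph methods", "label": ["B"]}\n'
)

samples = list(read_samples(demo))
inputs = [sample["text_processed"] for sample in samples]  # model input
targets = [sample["label"] for sample in samples]          # model target
print(len(samples), targets)
```

With a real dataset, the same function can be applied to `open(path, encoding="utf-8")` instead of the in-memory stream.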

Additional files:

Each dataset contains two additional files required for training: ontology.json and taxonomy.txt.


  • ontology.json: JSONL file; each line describes a single label (key-value mapping):
    • label: string with label identifier.
    • title: label title in natural language.
    • definition: label definition in natural language.
    • txt: normalized title + definition.
    • level: level of the label in the label tree.


  • taxonomy.txt: TXT file; each line contains space-separated labels, where the first label is a parent and the rest are its children.
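The two file layouts above can be parsed in a few lines. This is only a sketch under the stated format: the function names `parse_ontology` and `parse_taxonomy`, the label identifiers, and the field values are hypothetical and are not taken from the repository.

```python
import json


def parse_ontology(lines):
    """Map label identifier -> label record (one JSON object per line)."""
    ontology = {}
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            ontology[record["label"]] = record
    return ontology


def parse_taxonomy(lines):
    """Map parent label -> list of child labels (space-separated per line)."""
    children = {}
    for line in lines:
        parts = line.split()
        if parts:
            children.setdefault(parts[0], []).extend(parts[1:])
    return children


# Invented example lines that follow the described formats.
ontology_lines = [
    '{"label": "A", "title": "Physics", "definition": "Study of matter.", '
    '"txt": "physics study matter", "level": 1}',
    '{"label": "A1", "title": "Optics", "definition": "Study of light.", '
    '"txt": "optics study light", "level": 2}',
]
taxonomy_lines = ["A A1 A2", "A1 A11"]

ontology = parse_ontology(ontology_lines)
children = parse_taxonomy(taxonomy_lines)
print(ontology["A1"]["level"], children["A"])
```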

GloVe embeddings

The model is initialized with GloVe embeddings (840B tokens, 2.2M vocab, cased, 300d vectors). Please download them from the official website and put them next to <DATA_DIR> (the exact path must be specified in config["Paths"]["glove_model"]).
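For reference, a GloVe text file stores one token per line followed by its vector components, separated by spaces. The sketch below parses such a line; `parse_glove_line` is a hypothetical helper, not the repository's loader. It takes the vector from the last `dim` fields because a few tokens in the 840B file are reported to contain spaces themselves, which trips up a naive split.

```python
def parse_glove_line(line, dim=300):
    """Split one line of a GloVe text file into (token, vector).

    The vector is read from the last `dim` fields, so a token that itself
    contains spaces is still recovered intact.
    """
    fields = line.rstrip("\n").split(" ")
    token = " ".join(fields[:-dim])
    vector = [float(value) for value in fields[-dim:]]
    return token, vector


# Made-up 3-dimensional lines, only to exercise the parser.
demo_lines = ["cat 0.1 0.2 0.3\n", ". . . 0.4 0.5 0.6\n"]
embeddings = dict(parse_glove_line(line, dim=3) for line in demo_lines)
print(sorted(embeddings))
```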

Training and Evaluation

Paths, training hyper-parameters and other model configurations depend on the dataset and are specified in the corresponding config files: config/config_*.ini.
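The `config["Paths"]["glove_model"]` notation suggests that the files are read with Python's standard `configparser`; that is an assumption, as is the fragment below. Only the `Paths` section and its `data_dir` and `glove_model` keys come from this README, and the path values are placeholders.

```python
import configparser

# Placeholder fragment; a real config/config_*.ini contains more sections.
example_ini = """
[Paths]
data_dir = /path/to/DATA_DIR
glove_model = /path/to/glove.840B.300d.txt
"""

config = configparser.ConfigParser()
config.read_string(example_ini)

data_dir = config["Paths"]["data_dir"]
glove_model = config["Paths"]["glove_model"]
print(data_dir, glove_model)
```

For a file on disk, `config.read(path)` replaces `config.read_string(...)`.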


To run the training, use the following command:

python --config CONFIG --name NAME
  • CONFIG: path to a config file
  • NAME: model name prefix

The code evaluates the validation loss after each epoch and saves the best model to the ./models/ directory.

Note that the model's code was designed to be trained using GPU acceleration and there is no CPU support.


The script implements a beam search algorithm starting from given prefixes (label refinement task). The prefixes are constructed from labels of level < LEVEL assigned to a test instance.
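To illustrate the general idea, here is a generic beam search over a label tree that starts from given prefixes. It is a toy sketch, not the repository's decoder: the `beam_search` signature, the `"ROOT"` key for top-level labels, and the fixed score table standing in for the model are all assumptions made for this example.

```python
import math


def beam_search(children, score, prefixes, beam_width, max_depth, root="ROOT"):
    """Extend label paths level by level, keeping the best `beam_width` paths.

    children: dict mapping a parent label to its child labels
    score:    function(path, child) -> log-probability of appending child
    prefixes: starting paths (tuples of labels); an empty tuple starts at root
    """
    beam = [(0.0, tuple(prefix)) for prefix in prefixes]
    finished = []
    for _ in range(max_depth):
        candidates = []
        for logp, path in beam:
            parent = path[-1] if path else root
            kids = children.get(parent, [])
            if not kids:
                # Leaf reached: the path cannot be extended any further.
                finished.append((logp, path))
                continue
            for kid in kids:
                candidates.append((logp + score(path, kid), path + (kid,)))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_width]
        if not beam:
            break
    return sorted(finished + beam, key=lambda item: item[0], reverse=True)


# Toy tree and a fixed score table standing in for a trained model.
children = {"ROOT": ["A", "B"], "A": ["A1", "A2"], "A1": ["A11"]}
toy_prob = {"A": 0.6, "B": 0.4, "A1": 0.7, "A2": 0.3, "A11": 1.0}


def toy_score(path, child):
    return math.log(toy_prob[child])


# Label refinement: the level-1 label "A" is given as a prefix.
results = beam_search(children, toy_score, [("A",)], beam_width=2, max_depth=2)
best_paths = [path for _, path in results]
print(best_paths)
```

Passing `[()]` as the prefix list corresponds to predicting from scratch, since the search then starts from the top-level labels.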

To perform predictions using a trained model, use the following command:

python --config CONFIG --model MODEL --level LEVEL --output OUTPUT
  • CONFIG: path to a config file
  • MODEL: path to a trained model
  • LEVEL: for label refinement task: level from which the prediction starts. For example, when LEVEL==2, the model is provided with path prefixes of length 1 and starts predicting labels from level 2. For predicting from scratch (without prefixes), set LEVEL to 1.
  • OUTPUT: path stub for output files.

The script will generate two files: <OUTPUT>-labels.npy and <OUTPUT>-scores.npy with top-1000 predicted labels and their scores, respectively.
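The effect of LEVEL on the prefixes can be sketched as follows, assuming label levels are looked up from ontology.json. The helper `build_prefix` and the labels are hypothetical, and the sketch simplifies by treating all labels of an instance as a single path; an instance with several paths would need to be grouped using taxonomy.txt.

```python
def build_prefix(sample_labels, label_level, level):
    """Return the labels of a sample whose level is below `level`, shallowest first."""
    known = [label for label in sample_labels if label_level[label] < level]
    return sorted(known, key=lambda label: label_level[label])


# Invented levels, as they would be read from the "level" field of ontology.json.
label_level = {"A": 1, "A1": 2, "A11": 3}
sample_labels = ["A11", "A", "A1"]

prefix_level_1 = build_prefix(sample_labels, label_level, 1)  # from scratch
prefix_level_2 = build_prefix(sample_labels, label_level, 2)
prefix_level_3 = build_prefix(sample_labels, label_level, 3)
print(prefix_level_1, prefix_level_2, prefix_level_3)
```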


The script evaluates model predictions generated by the prediction step above. It can also be used to evaluate other baseline methods that produce output in the same format (AttentionXML, MATCH).

The script calculates the following metrics:

  • Precision@k (k = 1, 3, 5)
  • NDCG@k (k = 1, 3, 5)
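For readers unfamiliar with these metrics, the sketch below uses the definitions common in extreme multi-label classification for a single test instance with binary relevance. The function names are illustrative, and the repository's evaluation script may differ in details such as averaging over instances.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted labels that are relevant."""
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / k


def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k predictions divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, label in enumerate(ranked[:k])
        if label in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


# Invented ranking and ground truth for one test instance.
ranked = ["A1", "B", "A2", "C", "A"]
relevant = {"A", "A1", "A2"}

for k in (1, 3, 5):
    print(k, precision_at_k(ranked, relevant, k), ndcg_at_k(ranked, relevant, k))
```

Dataset-level scores are usually obtained by averaging these per-instance values over the test set.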

To run the evaluation, use the following command:

python --testset TESTSET --pred PRED --ontology ONTOLOGY --level LEVEL
  • TESTSET: path to test.json file
  • PRED: path to <OUTPUT>-labels.npy file (see above)
  • ONTOLOGY: path to ontology.json file
  • LEVEL: for label refinement task: only consider labels of level >= LEVEL. For all labels, set LEVEL to 1.
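The LEVEL restriction can be pictured as a filter applied to both the predicted and the ground-truth labels before the metrics are computed. This is a sketch under that assumption; `filter_by_level` and the labels are hypothetical, and dropping labels missing from the ontology is a choice made here, not documented behavior.

```python
def filter_by_level(labels, label_level, level):
    """Keep labels whose level is at least `level`, preserving their order.

    Labels missing from `label_level` are treated as level 0 and dropped.
    """
    return [label for label in labels if label_level.get(label, 0) >= level]


# Invented levels, as they would be read from ontology.json.
label_level = {"A": 1, "A1": 2, "A11": 3, "B1": 2}
predicted = ["A", "A11", "B1", "A1"]

all_labels = filter_by_level(predicted, label_level, 1)
refined_only = filter_by_level(predicted, label_level, 2)
print(all_labels, refined_only)
```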


For the full method description and experimental results please refer to our paper:

Natalia Ostapuk, Julien Audiffren, Ljiljana Dolamic, Alain Mermoud, and Philippe Cudre-Mauroux. 2024. Follow the Path: Hierarchy-Aware Extreme Multi-Label Completion for Semantic Text Tagging. In Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore.