Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot load model if model directory contains symlinks #38

Closed
osma opened this issue Jan 27, 2022 · 3 comments · Fixed by #39
Closed

Cannot load model if model directory contains symlinks #38

osma opened this issue Jan 27, 2022 · 3 comments · Fixed by #39

Comments

@osma
Copy link
Contributor

osma commented Jan 27, 2022

I'm trying to adapt Omikuji (via the integration with Annif) into a Data Version Control workflow. In DVC, large model files are typically stored in a cache outside the working tree (which is a git repository). There are several ways to keep the working directory synchronized with the cache, but one common solution is the use of symbolic links. This means that model files will be moved to the cache directory and replaced with symlinks that point to the original files.

I noticed that Omikuji has problems loading the model if the files in the model directory (settings.json, tree0.cbor, tree1.cbor ...) aren't regular files but symlinks. Loading the model apparently succeeds, but all the predictions are empty. I was able demonstrate this without involving DVC, by editing the Python example into this:

import os
import sys
import shutil
import time

import omikuji

if __name__ == "__main__":
    # Adjust hyper-parameters as needed
    hyper_param = omikuji.Model.default_hyper_param()
    hyper_param.n_trees = 2

    # Train
    model = omikuji.Model.train_on_data("./eurlex_train.txt", hyper_param)

    # Serialize & de-serialize
    model.save("./model")
    
    # create a directory containing symlinks to the saved model files
    shutil.rmtree("./model2", ignore_errors=True)
    os.mkdir("./model2")
    for fn in os.listdir("./model"):
        os.symlink(f"../model/{fn}", f"./model2/{fn}")

    # load the model from the directory containing symlinks
    model = omikuji.Model.load("./model2")

    # Predict
    feature_value_pairs = [
        (0, 0.101468),
        (1, 0.554374),
        (2, 0.235760),
        (3, 0.065255),
        (8, 0.152305),
        (10, 0.155051),
        # ...
    ]
    label_score_pairs = model.predict(feature_value_pairs, top_k=3)
    print("Dummy prediction results: {}".format(label_score_pairs))

The result of running this:

INFO [omikuji::data] Loading data from ./eurlex_train.txt
INFO [omikuji::data] Parsing data
INFO [omikuji::data] Loaded 15539 examples; it took 0.16s
INFO [omikuji::model::train] Training model with hyper-parameters HyperParam { n_trees: 2, min_branch_size: 100, max_depth: 20, centroid_threshold: 0.0, collapse_every_n_layers: 0, linear: HyperParam { loss_type: Hinge, eps: 0.1, c: 1.0, weight_threshold: 0.1, max_iter: 20 }, cluster: HyperParam { k: 2, balanced: true, eps: 0.0001, min_size: 2 }, tree_structure_only: false, train_trees_1_by_1: false }
INFO [omikuji::model::train] Initializing tree trainer
INFO [omikuji::model::train] Computing label centroids
Labels 3786 / 3786 [==============================================================] 100.00 % 23820.55/s INFO [omikuji::model::train] Start training forest
7824 / 7824 [======================================================================] 100.00 % 3112.38/s INFO [omikuji::model::train] Model training complete; it took 3.39s
INFO [omikuji::model] Saving model...
INFO [omikuji::model] Saving tree to ./model/tree0.cbor
INFO [omikuji::model] Saving tree to ./model/tree1.cbor
INFO [omikuji::model] Model saved; it took 0.08s
INFO [omikuji::model] Loading model...
INFO [omikuji::model] Loading model settings from ./model2/settings.json...
INFO [omikuji::model] Loaded model settings Settings { n_features: 5000, classifier_loss_type: Hinge }...
INFO [omikuji::model] Model loaded; it took 0.00s
Dummy prediction results: []

The suspicious part is the model loading (it should take more than 0.00s) and then the empty list of predictions.

I wonder if there's a good reason why the model files have to be actual files. Normally it doesn't matter whether a file used for a read operation is a regular file or a symlink; as long as the symlink points to an actual file (with the correct permissions etc.) that should work fine.

Some information about the system:
Ubuntu Linux 20.04 amd64, ext4 filesystem
Python 3.8.10
Omikuji 0.4.1 installed in a virtual environment with pip

@osma
Copy link
Contributor Author

osma commented Jan 27, 2022

I can't say I know Rust, but I suspect that the problem could be related to this line that checks whether the file is a regular file:

if entry.file_type()?.is_file() {

I suggest making it an OR clause with is_symlink() also passing the test and activating the if block.

@tomtung
Copy link
Owner

tomtung commented Feb 1, 2022

Yeah you're right. Thedoc says

pub fn is_file(&self) -> bool
Tests whether this file type represents a regular file. The result is mutually exclusive to the results of is_dir and is_symlink; only zero or one of these tests may pass.

I should probably just get rid of the is_file check. If the folder contains a folder or an invalid symlink with name treeN.cbor, we should probably throw an error anyways.

I should probably also throw an error / print an warning if no tree is loaded to avoid confusion.

tomtung added a commit that referenced this issue Feb 1, 2022
tomtung added a commit that referenced this issue Feb 1, 2022
It would have made problems like #38 easier to diagnose and debug
tomtung added a commit that referenced this issue Feb 1, 2022
tomtung added a commit that referenced this issue Feb 1, 2022
It would have made problems like #38 easier to diagnose and debug
tomtung added a commit that referenced this issue Feb 1, 2022
tomtung added a commit that referenced this issue Feb 1, 2022
It would have made problems like #38 easier to diagnose and debug
@osma
Copy link
Contributor Author

osma commented Feb 1, 2022

Great, thanks a lot @tomtung !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants