
[Question]: Resume training #3458

Open · alfredwallace7 opened this issue May 17, 2024 · 5 comments
Labels: question (Further information is requested)

Comments

@alfredwallace7

Question

I'm trying to resume training according to this code, where it says:

    # 7. continue training at later point. Load previously trained model checkpoint, then resume
    trained_model = SequenceTagger.load(path + '/checkpoint.pt')

    # resume training best model, but this time until epoch 25
    trainer.resume(trained_model,
                   base_path=path + '-resume',
                   max_epochs=25,
                   )

but resume is not defined in class ModelTrainer(Pluggable).

I'm sure it's a common task with your awesome library, yet I cannot get it working. Any information would be much appreciated.

alfredwallace7 added the question label on May 17, 2024

nturusin commented Jun 7, 2024

Hi guys. I've run into this exact issue too. Was a solution ever found?

@helpmefindaname (Collaborator)

Hi @alfredwallace7,
I'm sorry, but resuming is currently not possible; that feature was removed when the trainer was reworked in 0.13.0. We might reimplement it, but there are no plans to do so soon.

Regarding the documentation: please refer to the doc page, which is maintained and up to date. The /resources/docs/ folder is outdated and only there for legacy reasons.
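
A manual warm-start is still an option, though: load the saved weights and start a new training run from them. A minimal sketch, with illustrative paths and values (note that optimizer and scheduler state are not restored this way):

    # Warm-start sketch (not the removed trainer.resume API): reload saved
    # weights and launch a fresh training run from them. The corpus object
    # and all paths/values here are assumptions for illustration.
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    model = SequenceTagger.load("resources/taggers/my-run/final-model.pt")
    trainer = ModelTrainer(model, corpus)  # corpus: your prepared Corpus
    trainer.train(
        "resources/taggers/my-run-resume",  # new base path for this run
        learning_rate=0.05,
        max_epochs=25,
    )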


nturusin commented Jun 14, 2024

Hi again. To my own surprise, I managed to do it, @alfredwallace7.
Unfortunately, I had to add an ugly hack to handle the w2v embeddings (during the process, the trainer tries to save files to a temporary folder whose path is defined dynamically).

import logging
import os

from flair.embeddings import (
    BytePairEmbeddings,
    FlairEmbeddings,
    StackedEmbeddings,
    WordEmbeddings,
)
from flair.models import SequenceTagger

logger = logging.getLogger(__name__)


def get_tagger(tag_dictionary, tag_type, path_to_checkpoint=None):
    if path_to_checkpoint is not None:
        # Resuming: reload the full tagger (embeddings included) from disk.
        tagger = SequenceTagger.load(path_to_checkpoint)
        # Ugly hack: the loaded BytePairEmbeddings still reference the
        # dynamically generated temp folder from save time, so recreate it
        # before training tries to write there.
        path_to_w2v_file = tagger.embeddings.list_embedding_0.embedder.emb_file
        path_to_w2v = str(path_to_w2v_file).rsplit('/', 1)[0]
        if not os.path.exists(path_to_w2v):
            logger.info(f'Create folder for w2v: {path_to_w2v}')
            os.makedirs(path_to_w2v)
        logger.info(f'Loaded tagger from {path_to_checkpoint}')
    else:
        # Fresh start: build the embedding stack and a new tagger on top.
        embeddings = StackedEmbeddings([
            BytePairEmbeddings(
                language="en",
                dim=25,
                syllables=50000,
            ),
            WordEmbeddings(embeddings="en"),
            FlairEmbeddings(model="news-forward-fast"),
            FlairEmbeddings(model="news-backward-fast"),
        ])
        tagger = SequenceTagger(
            hidden_size=256,
            embeddings=embeddings,
            tag_dictionary=tag_dictionary,
            tag_type=tag_type,
            word_dropout=0.1,
            dropout=0.2,
            rnn_layers=2,
            use_crf=True,
            train_initial_hidden_state=True,
        )

    return tagger

Then you can create the trainer object as usual

trainer: ModelTrainer = ModelTrainer(tagger, column_corpus)
...
trainer.train(args.model_folder, **learning_params)
...
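
For readers following along, the elided learning_params might look something like this; the values below are illustrative assumptions, not the ones from the original comment:

    # Illustrative hyperparameters only -- assumptions, not the originals.
    learning_params = {
        "learning_rate": 0.1,
        "mini_batch_size": 32,
        "max_epochs": 25,
    }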

@alfredwallace7 (Author)

Thanks for your replies. I'll read the docs fully and try the hack!

@david-waterworth

I would add that resuming is important if you're training models on AWS and want to use spot instances: jobs need to survive interruption and continue from a checkpoint automatically.
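
One way to approximate this today, assuming flair's save_model_each_k_epochs training option is available in your version: write periodic snapshots and point a restarted job at the newest one. A sketch reusing the get_tagger() helper from above; latest_snapshot_or_none is a hypothetical variable holding the newest snapshot path (or None on a fresh start):

    # Hedged sketch for interruptible (e.g. spot-instance) runs; all paths
    # and values are illustrative.
    from flair.trainers import ModelTrainer

    tagger = get_tagger(tag_dictionary, tag_type,
                        path_to_checkpoint=latest_snapshot_or_none)
    trainer = ModelTrainer(tagger, column_corpus)
    trainer.train(
        "resources/taggers/spot-run",
        max_epochs=50,
        save_model_each_k_epochs=1,  # periodic model_epoch_<k>.pt snapshots
    )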
