Fine-tuning stuck in endless no-op loop at the end #156

JanPokorny · 2021-03-22T21:32:43Z

When running fine-tuning using the provided Colab Notebook, the output ends with:

[...]
Saving checkpoints for 363000 into gs://peppa-test-1/GPT3_XL/model.ckpt.
Calling checkpoint listeners after saving checkpoint 363000...
Done writing checkpoint.
Stop infeed thread controller
Shutting down InfeedController thread.
InfeedController received shutdown signal, stopping.
Infeed thread finished, shutting down.
infeed marked as finished
Stop output thread controller
Shutting down OutfeedController thread.
OutfeedController received shutdown signal, stopping.
Outfeed thread finished, shutting down.
outfeed marked as finished
Shutdown TPU system.
Done with the session.
Loss for final step: 0.00091601687.
training_loop marked as finished
Skipping training since max_steps has already saved.
training_loop marked as finished
Skipping training since max_steps has already saved.
training_loop marked as finished
Skipping training since max_steps has already saved.
[...]

...the last two lines repeat indefinitely, and the script never stops running. This happens after the checkpoint is saved, so there's no harm in killing the process manually, but it would still be better if the script terminated properly.

StellaAthena · 2021-03-22T21:46:23Z

Lol, that's pretty funny. I think I know what's going wrong: we probably used pass instead of break at some point. I'll hunt it down later, but since this has an easy workaround, it's going to be lower on the priority list.

sdtblck · 2021-03-23T10:01:46Z

Actually @StellaAthena we've been aware of this for a while haha.
It happens because, if you don't specify any eval steps, training just runs inside a while loop: https://github.com/EleutherAI/gpt-neo/blob/master/main.py#L249

It should be an easy fix, thanks for reminding us @JanPokorny !
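
A minimal sketch of the behaviour described above, not the actual code in main.py. It assumes a loop guarded by a current_step that is never updated in the train-only branch. StubEstimator, train_without_eval, and max_iterations are hypothetical names introduced only for this illustration; the iteration cap exists so the demo terminates.

```python
class StubEstimator:
    """Hypothetical stand-in for the real estimator (assumption for this demo)."""

    def __init__(self):
        self.global_step = 0
        self.log = []

    def train(self, max_steps):
        # Like the log in the report: once max_steps is reached, training is skipped.
        if self.global_step >= max_steps:
            self.log.append("skipped")
            return
        self.global_step = max_steps
        self.log.append("trained")


def train_without_eval(estimator, train_steps, max_iterations=5):
    """Model of the train-only branch: current_step is read but never updated."""
    current_step = 0
    iterations = 0
    while current_step < train_steps:
        estimator.train(max_steps=train_steps)
        iterations += 1
        # Safety cap for the demo only; without it this loop would never exit.
        if iterations >= max_iterations:
            break
    return iterations


estimator = StubEstimator()
iterations = train_without_eval(estimator, train_steps=100)
print(iterations, estimator.log)
```

After the first call finishes training, every later pass through the loop is a no-op, which matches the repeating "Skipping training" lines in the report.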

samg7b5 · 2021-03-30T20:57:17Z

You might do it by removing the while line from main.py L249 and just having else: estimator.train() (currently, once you hit L249 your current_step never changes so it will loop forever). I didn't make the change as I don't have it set up to test properly and I'm not familiar enough with train() to know if this will break any checkpointing, but I think this should work? Or you could add a break after estimator.train().
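
The two options suggested above can be sketched roughly as follows. This is an untested illustration against a hypothetical CountingEstimator, not a patch for main.py; the function names and the simplified train() signature are assumptions, and the effect on checkpointing in the real code is not covered here.

```python
class CountingEstimator:
    """Hypothetical stand-in that only counts train() calls (assumption)."""

    def __init__(self):
        self.calls = 0

    def train(self, max_steps):
        self.calls += 1


def train_once(estimator, current_step, train_steps):
    """Option 1: drop the while loop and call train() a single time."""
    if current_step < train_steps:
        estimator.train(max_steps=train_steps)


def train_with_break(estimator, current_step, train_steps):
    """Option 2: keep the while loop but leave it after train() returns."""
    while current_step < train_steps:
        estimator.train(max_steps=train_steps)
        break


first = CountingEstimator()
train_once(first, current_step=0, train_steps=100)

second = CountingEstimator()
train_with_break(second, current_step=0, train_steps=100)

print(first.calls, second.calls)
```

Both variants call train() exactly once when training is still pending and not at all when current_step has already reached train_steps.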

Off-topic but there's also a minor typo in activations.py which I only noticed because it prints out on multiple lines :)

I didn't think it was worth making a PR but you can see what I mean with both quick edits here master...samg7b5:master

JanPokorny added the "bug" (Something isn't working) label on Mar 22, 2021
