Override W&B data on a resumed training #595
I applied the changes (to resume an existing run on W&B) and tested here: https://wandb.ai/teklia/test/runs/tv2vtdha
It's probably worth opening an issue for this. W&B seems to sync (override) logs and artifacts automatically (https://wandb.ai/teklia/test/groups/test/files/output.log).
W&B should handle this automatically, as it ignores data whose step is lower than the last step written.
@vrigal one thing I don't understand is who sets the "RUN_ID" env variable?
This is set automatically by generic-worker before the task starts: https://github.com/taskcluster/taskcluster/blob/a85c8b9f7be096f6b9a4bad38612374b9a702372/workers/generic-worker/multiuser_posix.go#L146-L148
Ok, this can work but from the code perspective there are some issues:
- The W&B publisher should not know anything about Taskcluster environment variables.
- Why don't we just use resume="allow", so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster, except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments, to prevent republishing the same ones. In that case, the publisher should accept an argument, set by the CLI, that indicates what to do if the run already exists.
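To make the suggestion above concrete, here is a minimal sketch of a publisher entry point that takes the resume policy from the CLI and reads no Taskcluster environment variables itself. The function names (`build_init_kwargs`, `parse_args`, `main`) and the `--run-id` / `--resume` flags are hypothetical and not part of this PR; only `wandb.init` and its `project`, `group`, `name`, `id` and `resume` parameters come from the W&B API. It also assumes that `resume="allow"` only continues an existing run when a stable `id` is passed, so the caller is expected to provide one.

```python
import argparse
from typing import Any, Dict, List, Optional

# Resume policies accepted by this sketch (a subset of what wandb.init supports).
RESUME_MODES = ("allow", "must", "never")


def build_init_kwargs(
    project: str,
    group: str,
    name: str,
    run_id: Optional[str] = None,
    resume: str = "allow",
) -> Dict[str, Any]:
    """Build the keyword arguments passed to wandb.init (hypothetical helper)."""
    if resume not in RESUME_MODES:
        raise ValueError(f"Unsupported resume mode: {resume!r}")
    kwargs: Dict[str, Any] = {
        "project": project,
        "group": group,
        "name": name,
        "resume": resume,
    }
    # Assumption: without an explicit id, W&B generates a new one,
    # so there is nothing to resume.
    if run_id is not None:
        kwargs["id"] = run_id
    return kwargs


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hypothetical CLI flags; the caller decides the resume policy."""
    parser = argparse.ArgumentParser(description="Publish a run to W&B")
    parser.add_argument("--project", required=True)
    parser.add_argument("--group", required=True)
    parser.add_argument("--name", required=True)
    parser.add_argument("--run-id", default=None)
    parser.add_argument("--resume", choices=RESUME_MODES, default="allow")
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> None:
    # Imported lazily so the helpers above can be tested without wandb installed.
    import wandb

    args = parse_args(argv)
    run = wandb.init(
        **build_init_kwargs(
            args.project, args.group, args.name, args.run_id, args.resume
        )
    )
    run.finish()
```

With this split, whatever launches the task (for example the Taskcluster task definition) would be the only place that maps its own identifiers to `--run-id`, and the offline publishing case could pass `--resume never` to avoid republishing an existing run.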
Other known issues:
- W&B drops overlapping data from the new run instead of overwriting it:
  wandb: WARNING (User provided step: 15000 is less than current step: 15001. Dropping entry: {'gnorm': 0.7161, '_timestamp': 1715790811.6654584}).
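The warning above can be illustrated with a simplified local model of the step rule, not W&B's actual implementation: an entry whose step is lower than the current step is dropped, and everything else is kept. The function name `split_by_step` and the sample metric values other than the one in the warning are made up for illustration.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, Dict[str, float]]


def split_by_step(entries: List[Entry], current_step: int) -> Tuple[List[Entry], List[Entry]]:
    """Separate entries a resumed run would keep from those it would drop.

    Simplified model: an entry is dropped when its step is lower than the
    highest step seen so far (starting from the step already stored remotely).
    """
    kept: List[Entry] = []
    dropped: List[Entry] = []
    for step, metrics in entries:
        if step < current_step:
            dropped.append((step, metrics))
        else:
            kept.append((step, metrics))
            current_step = step
    return kept, dropped


# A restarted task replays step 15000 although the stored run is at step 15001.
replayed = [
    (15000, {"gnorm": 0.7161}),
    (15001, {"gnorm": 0.70}),
    (15002, {"gnorm": 0.69}),
]
kept, dropped = split_by_step(replayed, current_step=15001)
print("kept steps:", [step for step, _ in kept])
print("dropped steps:", [step for step, _ in dropped])
```

Under this model the replayed values for the overlapping steps never replace the ones from the interrupted run, which matches the "dropping instead of overwriting" behaviour reported above.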
I think since it kind of works we can merge it to unblock enabling spot instances but we should address those issues later.
This is an old issue. The simpler way to handle this would be to use
We can rethink all that in #408, but I would use UID in model names as a last resort because they would clutter the dashboards.
@eu9ene to be clear, the run ID (used to identify a run) is different from the run name (used to display graphs). For now we do not use an ID; it is automatically set by W&B (e.g.
Closes #594