The issue is that one of the trials seems to be re-executing. I think so because the currently running trial is yielding metrics that have already been reported to wandb for the same training iterations. In addition, the currently running trial's results are not being reported to wandb (even though the results of the previous trials in the same execution cycle were). What is going on? Did my execution get corrupted somehow?
I find this to be a strange bug, so I am not sure what info I should provide. Let me know if there are any more details I could provide that would help.
richardliaw added the bug, P1, and tune labels on Jul 22, 2021.
I can provide an example. First thing to note is that 1 training iteration corresponds to 5 logging steps. Thus training iteration 45 corresponds to step 45*5 = 225.
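To make the iteration-to-step mapping explicit, here is a minimal sketch. The function and constant names are hypothetical and only encode the 5-steps-per-iteration ratio stated above; they are not part of ray-tune or wandb.

```python
# Hypothetical helper: names are illustrative, not a ray-tune or wandb API.
# The ratio of 5 logging steps per training iteration comes from the setup above.
LOGGING_STEPS_PER_ITERATION = 5


def iteration_to_step(training_iteration: int) -> int:
    """Convert a ray-tune training iteration to the matching logging step."""
    return training_iteration * LOGGING_STEPS_PER_ITERATION


def step_to_iteration(step: int) -> int:
    """Convert a logging step back to a training iteration."""
    if step % LOGGING_STEPS_PER_ITERATION != 0:
        raise ValueError(f"step {step} does not fall on an iteration boundary")
    return step // LOGGING_STEPS_PER_ITERATION


print(iteration_to_step(45))  # 225
```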
This is the ray tune output for iteration 45.
And this is the WandB output for the same iteration (step 225):
The cursor marks the value for that iteration. As you can see, the accuracy, loss, and f1 scores are all the same as in the ray-tune output. These values have already been reported to WandB. In fact, WandB reports that the trial is finished.
The expected behavior would be for ray-tune to allocate resources to a pending trial instead of one that has already been run.
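The comparison I did by eye could also be done programmatically. The sketch below is only an illustration: the metric key names and the numbers are made-up placeholders (not the values from my run), and `is_replayed` is a hypothetical helper, not something ray-tune or wandb provides. It assumes the already-logged WandB history is available as a dict keyed by step, and that the ray-tune result dict carries `training_iteration`.

```python
import math

# Assumption from the setup above: 1 training iteration = 5 logging steps.
STEPS_PER_ITERATION = 5


def is_replayed(result, logged_history, metric_keys, rel_tol=1e-9):
    """Return True if `result` matches metrics already logged for its step.

    result: dict with "training_iteration" plus metric values (ray-tune side).
    logged_history: dict mapping step -> dict of metrics (WandB side).
    """
    step = result["training_iteration"] * STEPS_PER_ITERATION
    logged = logged_history.get(step)
    if logged is None:
        # Nothing was logged for this step, so this cannot be a replay.
        return False
    return all(
        key in result
        and key in logged
        and math.isclose(result[key], logged[key], rel_tol=rel_tol)
        for key in metric_keys
    )


# Placeholder data for illustration only; these are NOT my real metrics.
logged_history = {225: {"eval_accuracy": 0.90, "eval_loss": 0.30, "eval_f1": 0.88}}
current_result = {
    "training_iteration": 45,
    "eval_accuracy": 0.90,
    "eval_loss": 0.30,
    "eval_f1": 0.88,
}

print(is_replayed(current_result, logged_history,
                  ["eval_accuracy", "eval_loss", "eval_f1"]))
```

A `True` result for several consecutive iterations would suggest the running trial is repeating one that already finished, which is what I am seeing.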
The current script I am running would need a lot of time to see if it would reproduce this bug (I have been running it for about 7 hours by now). I can see if I can get the same effect with a less compute-intensive script.
I am using an ASHAScheduler defined as so:
and passing it to huggingface's ray-tune integration function like so: