ModelCheckpoint with custom filepath doesn't support training on multiple nodes #2916
Comments
No, I think we just need to pass `exist_ok=True` into `makedirs` :)
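For readers following along, here is a minimal sketch of what the `exist_ok=True` suggestion does. It is not the library's actual code: `ensure_dir` and the `checkpoints/run_0` path are hypothetical names used only for illustration. The only real call is the standard library's `os.makedirs`.

```python
import os
import tempfile


def ensure_dir(path: str) -> str:
    """Create `path` (and any missing parents); do nothing if it already exists.

    Hypothetical helper for illustration, not part of any library.
    """
    # exist_ok=True suppresses FileExistsError when the directory is already
    # there, e.g. because another process created it a moment earlier.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        ckpt_dir = os.path.join(root, "checkpoints", "run_0")
        ensure_dir(ckpt_dir)
        ensure_dir(ckpt_dir)  # second call stands in for a second process
        print(os.path.isdir(ckpt_dir))
```

Note that `exist_ok=True` only tolerates an existing directory. If the path exists as a regular file, `os.makedirs` still raises `FileExistsError`.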
@f4hy your PR added this line. Do you know a good way to fix it?
Ah sorry. I think I know what's up. I'll get a patch out this evening. Sorry!
@angshine I found a few issues with the model checkpoint path stuff. Not 100% sure I found the particular bug you were seeing but I think this should fix it. Can you give my branch in the above PR a test? Sorry to have introduced this bug for you.
Sorry for the late reply, but it seems that this bug has not been fully fixed. This line still raises an exception:
🐛 Bug
When training on multiple nodes using `ModelCheckpoint` with a custom `filepath`, it will raise `FileExistsError`, caused by the following line of code: model_checkpoint.py#L127. Maybe a try-except block is needed?
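A minimal sketch of the try-except idea, assuming the failing line is a directory-creation call that several processes reach at about the same time (the line itself is not quoted in this report). `makedirs_tolerant` is a hypothetical name, not a function from the library.

```python
import os
import tempfile


def makedirs_tolerant(path: str) -> None:
    """Create `path`, ignoring the error if another process created it first.

    Hypothetical helper for illustration only.
    """
    try:
        os.makedirs(path)
    except FileExistsError:
        # Losing the race to another process is fine as long as the result
        # is a directory; anything else (e.g. a regular file) is a real error.
        if not os.path.isdir(path):
            raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "checkpoints")
        makedirs_tolerant(target)
        makedirs_tolerant(target)  # would raise without the try-except
        print(os.path.isdir(target))
```

Catching the exception instead of checking `os.path.exists` first avoids a check-then-create gap in which another node could create the directory between the two steps.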