[Feat] Add Fault Tolerant Training for ValidationLoop. #9563

Merged 66 commits on Sep 24, 2021

Commits
8091491
cleanup
tchaton Sep 10, 2021
9ded53b
wip
tchaton Sep 13, 2021
a15b60e
wip
tchaton Sep 13, 2021
8be96ed
update
tchaton Sep 13, 2021
1d74c45
update
tchaton Sep 13, 2021
63e13e6
update
tchaton Sep 13, 2021
2163185
update
tchaton Sep 13, 2021
831368d
wip
tchaton Sep 13, 2021
ec38605
update
tchaton Sep 13, 2021
516cae7
resolve tests
tchaton Sep 14, 2021
48dfb1e
add support validation
tchaton Sep 14, 2021
138ba8d
tiny cleanup
tchaton Sep 14, 2021
0d24120
update changelog
tchaton Sep 14, 2021
528d9f4
update
tchaton Sep 14, 2021
726614e
Merge branch 'master' into fault_tolerant_validation
tchaton Sep 14, 2021
f8abada
update
tchaton Sep 14, 2021
4f5db67
update
tchaton Sep 14, 2021
9dcc5ef
Merge branch 'master' into fault_tolerant_validation
tchaton Sep 14, 2021
5ef80de
update on comments
tchaton Sep 14, 2021
8582f12
Merge branch 'master' into fault_tolerant_validation
tchaton Sep 14, 2021
0ccbf07
update
tchaton Sep 14, 2021
9fea22b
update
tchaton Sep 14, 2021
63ac346
update
tchaton Sep 14, 2021
11e38aa
update test
tchaton Sep 14, 2021
5c820a9
update
tchaton Sep 15, 2021
f2675a0
update
tchaton Sep 16, 2021
262900d
move reset_on_restart in the loop
tchaton Sep 16, 2021
3668fed
update changelog
tchaton Sep 16, 2021
3defdb1
update
tchaton Sep 16, 2021
e9a46d0
update
tchaton Sep 16, 2021
c0c0c85
update
tchaton Sep 16, 2021
e1ad52e
update
tchaton Sep 16, 2021
c4b86ab
Minor test changes
carmocca Sep 16, 2021
ad334f5
Update CHANGELOG.md
tchaton Sep 16, 2021
c672030
updte
tchaton Sep 16, 2021
4a110b5
Merge branch 'move_restart_with_loops' of https://github.com/PyTorchL…
tchaton Sep 16, 2021
122766c
Merge branch 'move_restart_with_loops' into fault_tolerant_validation_2
tchaton Sep 16, 2021
194e0cb
Merge branch 'master' into fault_tolerant_validation_2
tchaton Sep 17, 2021
a454393
update on comments
tchaton Sep 17, 2021
db01b85
resolve conflicts
tchaton Sep 17, 2021
7e6dfaf
resolve mypy
tchaton Sep 17, 2021
b540770
Bad merge
carmocca Sep 17, 2021
a785789
update
tchaton Sep 17, 2021
e95995e
upate
tchaton Sep 17, 2021
f3db789
update
tchaton Sep 20, 2021
6932f6e
Merge branch 'master' into fault_tolerant_validation_2
tchaton Sep 20, 2021
905d6ae
remove out-dated comment
tchaton Sep 20, 2021
b49583e
Merge branch 'fault_tolerant_validation_2' of https://github.com/PyTo…
tchaton Sep 20, 2021
6c62118
update on comments
tchaton Sep 20, 2021
9555eb9
update
tchaton Sep 20, 2021
700be0d
resolve tests
tchaton Sep 21, 2021
f88f43c
Merge branch 'master' into fault_tolerant_validation_2
carmocca Sep 21, 2021
6b99afa
Merge branch 'master' into fault_tolerant_validation_2
tchaton Sep 22, 2021
b03983e
resolve test
tchaton Sep 22, 2021
d364528
Merge branch 'fault_tolerant_validation_2' of https://github.com/PyTo…
carmocca Sep 22, 2021
c80ae90
Merge branch 'master' into fault_tolerant_validation_2
carmocca Sep 22, 2021
941fd66
Merge branch 'master' into fault_tolerant_validation_2
carmocca Sep 24, 2021
5b58673
Refactor and actually assert in test
carmocca Sep 24, 2021
7eaad63
Fix loops test
carmocca Sep 24, 2021
88a9c14
Passing loops tests
carmocca Sep 24, 2021
302fbe3
Simplify auto restart test
carmocca Sep 24, 2021
c16a5a4
Merge branch 'master' into fault_tolerant_validation_2
carmocca Sep 24, 2021
5f310a6
Remove changes from other PR
carmocca Sep 24, 2021
b1c6da9
Merge branch 'master' into fault_tolerant_validation_2
carmocca Sep 24, 2021
bec607a
Merge branch 'master' into fault_tolerant_validation_2
carmocca Sep 24, 2021
9fd83b8
Allclose
carmocca Sep 24, 2021
Allclose
carmocca committed Sep 24, 2021
commit 9fd83b8150b98cbf3dbdc215db3a9371425c38eb
4 changes: 2 additions & 2 deletions tests/utilities/test_auto_restart.py
@@ -1057,6 +1057,6 @@ def run(should_fail, resume):
     pre_fail_train_batches, pre_fail_val_batches = run(should_fail=True, resume=False)
     post_fail_train_batches, post_fail_val_batches = run(should_fail=False, resume=True)
 
-    torch.testing.assert_equal(total_train_batches, pre_fail_train_batches + post_fail_train_batches)
+    torch.testing.assert_allclose(total_train_batches, pre_fail_train_batches + post_fail_train_batches)
     for k in total_val_batches:
-        torch.testing.assert_equal(total_val_batches[k], pre_fail_val_batches[k] + post_fail_val_batches[k])
+        torch.testing.assert_allclose(total_val_batches[k], pre_fail_val_batches[k] + post_fail_val_batches[k])
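
The commit does not state why the exact comparison was swapped for a tolerance-based one. One plausible reason, offered here as an assumption only, is that summing floating-point values in two pieces (before the failure and after the resume) can round differently from summing them in a single pass. The sketch below illustrates that effect with the standard library alone: `math.isclose` stands in for the tolerance-based `torch.testing` check, plain `==` stands in for the exact one, and the helper `sum_in_order` and the sample values are hypothetical, not taken from the PR.

```python
import math


def sum_in_order(values):
    """Accumulate floats left to right, as a running total would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical per-batch values; the real test's data is not shown in the diff.
batches = [0.1, 0.2, 0.3]

# Uninterrupted run: one running total over every batch.
total = sum_in_order(batches)

# Interrupted run: one batch before the failure, the rest after resuming.
pre_fail = sum_in_order(batches[:1])
post_fail = sum_in_order(batches[1:])
resumed = pre_fail + post_fail

# The exact comparison is sensitive to the order of the additions.
exactly_equal = total == resumed

# The tolerance-based comparison accepts the rounding difference.
close_enough = math.isclose(total, resumed, rel_tol=1e-9)

print(exactly_equal, close_enough)
```

With these values the two totals differ only in the last bit, so the exact check fails while the tolerance-based check passes. Whether the PR's test hit this particular effect is not confirmed by the source.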