We should have a test that trains a very simple model with a Pallas kernel across two slices of TPUv4 and checks that it doesn't hang.
Currently our pre-submit CI only runs on one slice of TPUv4, which doesn't cover cases like multi-slice training.
Post-submit CI relies on people diligently monitoring results and reverting changes, which has proven ineffective. As long as we can afford it, we should test in pre-submit rather than post-submit.
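One possible shape for the "doesn't hang" check is to launch the training job as a child process and fail the test if it does not finish within a time budget. The sketch below is only an illustration: `completes_within`, `FAST_JOB`, and `HUNG_JOB` are hypothetical names, and the two commands are stand-ins for the real multi-slice training entry point, which is not specified in this issue.

```python
import subprocess
import sys


def completes_within(cmd, timeout_s):
    """Return True if cmd exits with status 0 before timeout_s seconds elapse."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising,
        # so a hung job is treated as a test failure instead of blocking CI.
        return False
    return result.returncode == 0


# Hypothetical stand-ins for the real training command (assumption):
# one job that finishes quickly and one that simulates a hang.
FAST_JOB = [sys.executable, "-c", "print('step done')"]
HUNG_JOB = [sys.executable, "-c", "import time; time.sleep(60)"]
```

A real multi-slice launcher would likely start several worker processes, and the timeout here only terminates the direct child, so the actual test would also need to clean up any remaining workers.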