-
-
Notifications
You must be signed in to change notification settings - Fork 633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added distributed tests with Horovod and XLA for early_stopping #2165
Conversation
@Ishan-Kumar2 Thank you very much for this PR !! However, it seems that it does not work perfectly. Feel free to ask help if needed. |
Hi @sdesrozis the hvd tests are working now (here). |
@Ishan-Kumar2 it seems there are still some code formatting errors https://github.com/pytorch/ignite/pull/2165/checks?check_run_id=3346963218#step:8:46 Anyway, you are right, some metrics don’t work for the moment on TPU. You have to skip tests using metrics with xla. To do so, please see ignite/tests/ignite/metrics/test_accuracy.py Line 543 in 0e0200d
|
@sdesrozis I have fixed it, now both the hvd tests and xla tests are passing. |
@Ishan-Kumar2 yes, we are doing that for metrics to explicitly check that metrics are correctly computed across devices. In case of early stopping, all process should have the same data and technically should all require to stop if metrics are not improving... So, no need to make the data rank-dependent, I'd say. |
Yup that makes sense. In terms of the checks, I am not sure why some are failing ( |
@vfdev-5 For the above link @pytest.mark.parametrize("y_pred, y, batch_size",[
(torch.randint(0, 2, size=(50,)).long(), torch.randint(0, 2, size=(50,)).long(), 1),
...
])
def test_binary_and_multilabel_inputs(y_pred, y, batch_size):
...
assert isinstance(res, float)
assert average_precision_score(np_y, np_y_pred) == pytest.approx(res) |
I think a good way is to mimic what is done in others metrics' tests 😉 |
@Ishan-Kumar2 It seems that we have some troubles with the PyTorch-Nigthly... We will investigate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @Ishan-Kumar2
Let's merge this PR as failure is unrelated.
Addresses #2101
Description:
Added distributed tests with Horovod and XLA for early_stopping.
Check list: