This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Sharded dataloader causing CI hangs #274

Closed
leezu opened this issue Aug 13, 2018 · 7 comments
Labels
bug (Something isn't working), release focus (Progress focus for release)

Comments

@leezu
Contributor

leezu commented Aug 13, 2018

Recently, many CI jobs have been running into deadlocks and must be killed manually. This morning I killed a few jobs that had been running for more than 12 hours. All of them were hanging in the sharded dataloader tests. @szhengac, do you have any idea what the cause could be?

From http://ci.mxnet.io/blue/organizations/jenkins/gluon-nlp/detail/PR-246/10/pipeline

tests/unittest/train/test_dataloader.py::test_sharded_data_loader Sending interrupt signal to process

Also

@szhengac
Member

szhengac commented Aug 13, 2018 via email

@leezu
Contributor Author

leezu commented Aug 13, 2018

I'm not aware of any change in the environment. It may be due to some other issue, but CI always shows that the job was running the sharded dataloader test when it was killed.

@leezu
Contributor Author

leezu commented Aug 13, 2018

I think the hang also occurs during the test_transformer script tests, as they rely on the ShardedDataloader: ci.mxnet.io/blue/organizations/jenkins/gluon-nlp/detail/PR-275/2/pipeline/22
I disabled those for now too.

@leezu
Contributor Author

leezu commented Aug 13, 2018

CI works again after disabling both. The tests should be re-enabled before 0.4.

@leezu added the bug and release focus labels on Aug 13, 2018
@szha mentioned this issue on Aug 13, 2018
@szhengac
Member

szhengac commented Aug 14, 2018

I think this is due to the recent change in DataLoader (#11908). The sharded DataLoader inherits from DataLoader, and the change possibly introduces some inconsistency between the two. The workaround is to copy the original DataLoader code into the sharded DataLoader instead of using inheritance.
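
To illustrate why copying can be safer than inheriting here, the following is a minimal sketch with made-up classes. `UpstreamV1`, `UpstreamV2`, `make_inherited` and `Standalone` are hypothetical names, not the MXNet or gluon-nlp code, and the real symptom in this issue was a hang rather than an exception. The sketch only shows how a subclass that depends on a base class's private internals can break when those internals change upstream, while a self-contained copy is unaffected.

```python
class UpstreamV1:
    """Hypothetical upstream loader, before the internal change."""

    def __init__(self, data):
        self._items = list(data)

    def __iter__(self):
        return iter(self._items)


class UpstreamV2:
    """Same public behavior, but the private attribute was renamed."""

    def __init__(self, data):
        self._buffer = list(data)

    def __iter__(self):
        return iter(self._buffer)


def make_inherited(base):
    """Build a subclass that reaches into the base class's private state."""

    class Inherited(base):
        def shards(self, n):
            # Relies on the private name `_items` defined by the base class.
            return [self._items[i::n] for i in range(n)]

    return Inherited


class Standalone:
    """Copy-style alternative: owns all of its state, no upstream base class."""

    def __init__(self, data):
        self._items = list(data)

    def shards(self, n):
        return [self._items[i::n] for i in range(n)]


def inherited_works(base):
    """Return True if the inheriting subclass still works on top of `base`."""
    try:
        make_inherited(base)(range(4)).shards(2)
        return True
    except AttributeError:
        return False
```

The trade-off is that a copied class no longer picks up upstream fixes automatically, so it has to be kept in sync by hand.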

@szha
Member

szha commented Aug 14, 2018

@zhreshold may have some ideas.

@zhreshold
Member

@szhengac is correct: the latest changes in #11908 were not correctly handled by _ShardedMultiWorkerIter, so the workers are never actually terminated.
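
For readers unfamiliar with this failure mode, here is a minimal sketch of how workers that never receive a termination signal stay alive indefinitely. It uses plain Python threads and queues; `ToyWorkerPool`, `_STOP` and `demo` are hypothetical names chosen for illustration and do not reflect how `_ShardedMultiWorkerIter` or the MXNet DataLoader is actually implemented.

```python
import queue
import threading

_STOP = None  # sentinel telling a worker to exit (hypothetical convention)


def _worker(tasks, results):
    """Pull items from `tasks` until the stop sentinel arrives."""
    while True:
        item = tasks.get()  # blocks until something is queued
        if item is _STOP:
            break
        results.put(item * 2)


class ToyWorkerPool:
    """Hypothetical stand-in for a multi-worker iterator."""

    def __init__(self, num_workers=2):
        self.tasks = queue.Queue()
        self.results = queue.Queue()
        # daemon=True so this demo itself can never block interpreter exit.
        self.workers = [
            threading.Thread(
                target=_worker, args=(self.tasks, self.results), daemon=True
            )
            for _ in range(num_workers)
        ]
        for w in self.workers:
            w.start()

    def shutdown(self):
        """Send one sentinel per worker, then wait for each to exit."""
        for _ in self.workers:
            self.tasks.put(_STOP)
        for w in self.workers:
            w.join(timeout=5)

    def alive(self):
        """Count the workers that are still running."""
        return sum(w.is_alive() for w in self.workers)


def demo():
    leaked = ToyWorkerPool()
    leaked.tasks.put(21)
    first = leaked.results.get(timeout=5)
    # No shutdown call: the workers remain blocked in tasks.get().
    alive_without_shutdown = leaked.alive()

    clean = ToyWorkerPool()
    clean.shutdown()
    alive_after_shutdown = clean.alive()
    return first, alive_without_shutdown, alive_after_shutdown
```

With non-daemon workers, or with worker processes that the parent waits on, the leaked case would keep the test process from exiting, which is consistent with the CI jobs above having to be killed by hand.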
