Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rabit harden] fix model recovery tests failures #510

Merged
merged 5 commits into from
Mar 8, 2019
Merged

[rabit harden] fix model recovery tests failures #510

merged 5 commits into from
Mar 8, 2019

Conversation

chenqin
Copy link
Contributor

@chenqin chenqin commented Mar 6, 2019

Rabit tests has been failing since early 2018. screenshot from 2019-03-06 19-56-19

According to @tqchen this is largely due to tracker semantic issue. As I have been working on enable checkpoint at rabit could work on GCP. I need to make sure those tests can pass before introducing extra features.

Here is some of my observations to tracker/rabit interaction.

  • Rabit mock relay on command line to pass on which number of trail it runs on as a factor to simulate worker crash (exit(-2)). When worker actually dead and local tracker subprocess exit. worker were restarted without update rabit_num_trail setting. (this is essentially infinite loop of crash)

  • Total number of restart a worker in local tracker is controlled by DMLC_NUM_ATTEMPT, by default it was set to zero. So when a rabit unit test kick off like following, total number of retry is still set to zero (no retry) and local tracker throw exception.

  • Rabit Init has a overwrite code which use value from DMLC_NUM_ATTEMPT to overwrite value from rabit_num_trial. I am not sure why since num_trail is incremental and DMLC_NUM_ATTEMPT if set to greater than 0 will skip most of mock failures.

model_recover_10_10k: ../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=2 model_recover 10000 mock=0,0,1,0 mock=1,1,1,0 rabit_num_trial=0 DMLC_NUM_ATTEMPT=10

@chenqin chenqin changed the title [rabit tests fix] sync rabit_num_trail and DMLC_NUM_ATTEMPT between local tracker and rabit main [rabit harden] fix model recovery tests failures Mar 7, 2019
@@ -8,38 +8,52 @@
import logging
from threading import Thread
from . import tracker
import pdb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. fixed

num_retry = env.get('DMLC_NUM_ATTEMPT', 0)

#overwrite default num of retry with commandline value
for parm in cmd:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

param

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -11,6 +11,8 @@
#include <dmlc/logging.h>
#include <string>
#include <mutex>
#include <utility>
#include <memory>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this refers to a succeeded job?

@CodingCat
Copy link
Member

@hcho3 @tqchen I just found I have no write permission here, can anyone of you grant the permission?

@szha
Copy link
Member

szha commented Mar 7, 2019

Added.

@szha
Copy link
Member

szha commented Mar 7, 2019

@CodingCat note that the master branch is diverging from the one used by mxnet (tracked in #507). cc @hcho3 who's looking into it.

@CodingCat
Copy link
Member

@szha thanks!

@CodingCat
Copy link
Member

I still recommend you point to this branch in dmlc/rabit#81 for cross-validation, and it will prevent the error result by the different environment in your laptop and rabit

@CodingCat CodingCat merged commit cb9e014 into dmlc:master Mar 8, 2019
@CodingCat
Copy link
Member

merged, thanks

if num_retry >= 0:
# failure trail increase by 1 and restart failed worker
for arg in cmd:
if arg.startswith('rabit_num_trial'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed this here, I don't think we should leak more rabit stuff in dmlc-core (hopefully we should rename RabitTracker), essentially they are two different projects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Described on PR item 3, the semantic of rabit_num_trial is to capture the number of worker failures caused by mock. v.s DMLC_NUM_ATTEMPT is setting number of retries in total local tracker can do. The issue really some how up to date rabit_num_trial needs to feed back into rabit init in order to make rabit behavior as expected. Otherwise it will be infinite crash...

https://github.com/dmlc/rabit/blob/master/src/allreduce_mock.h#L164

One previous attempt was to reuse DMLC_NUM_ATTEMPT as replacement of rabit_num_trial. The problem is rabit_num_trial starts with 0 and DMLC_NUM_ATTEMPT needs to >= number of crashes for tests to pass. That basically broke tests in rabit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, in this case, why we need to pass a value as rabit_num_trial?

we should remove it from both of dmlc-core and rabit, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can restarted rabit mock worker know which trail it is running and decide if it will crash or not. There needs something pass via arguments...
https://github.com/dmlc/rabit/blob/master/src/allreduce_mock.h#L52

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest add another argument called
DMLC_WORKER_MAX_RETRY along with DMLC_NUM_ATTEMPT and remove rabit_num_trial from core.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just DMLC_MAX_RETRY and consume rabit_num_trial from rabit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants