
Retry, not wait, when connecting to the instance on submitting solution #484

Merged: 4 commits merged into paris-saclay-cds:master on Dec 3, 2020

Conversation

@maikia (Contributor) commented Nov 30, 2020

Sometimes the worker cannot connect to the instance because not enough instances are available. This can happen for multiple reasons, one of them being that an instance which was set to terminate is not yet available for use again.
For that reason we previously added the possibility to wait and try again a few times before giving an error.

This was not very efficient because it forced the whole dispatcher to wait during that time, preventing it from collecting the results from other workers and possibly freeing an instance, which would solve the problem of not having enough instances if the cause was different from the above.

This PR shortens the waiting time and, if the instances are still not available, puts the worker back into the queue to try again later.
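
Below is a minimal sketch of that behaviour, assuming hypothetical names (request_instance, Worker); it illustrates the idea only and is not the actual ramp_engine code:

```python
import time

WAIT_MINUTES = 2          # short wait between connection attempts
MAX_TRIES_TO_CONNECT = 1  # give up quickly instead of blocking the dispatcher


def request_instance():
    """Stand-in for the AWS call that asks for a free EC2 instance."""
    return None  # pretend no instance is currently available


class Worker:
    def __init__(self):
        self.status = 'initialized'

    def launch_instance(self):
        """Try to get an instance; ask the dispatcher to retry later on failure."""
        for n_try in range(1, MAX_TRIES_TO_CONNECT + 1):
            instance = request_instance()
            if instance is not None:
                self.status = 'running'
                return instance
            if n_try < MAX_TRIES_TO_CONNECT:
                time.sleep(WAIT_MINUTES * 60)  # brief wait, then try again
        # no instance available: hand control back to the dispatcher
        self.status = 'retry'
        return None
```

With MAX_TRIES_TO_CONNECT = 1 the worker gives up after a single short attempt, so the dispatcher can keep collecting results from other workers instead of blocking.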

@codecov bot commented Nov 30, 2020

Codecov Report

Merging #484 (cb5b843) into master (26762a4) will increase coverage by 0.01%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##           master     #484      +/-   ##
==========================================
+ Coverage   93.56%   93.58%   +0.01%     
==========================================
  Files          99       99              
  Lines        8496     8506      +10     
==========================================
+ Hits         7949     7960      +11     
+ Misses        547      546       -1     
Impacted Files | Coverage Δ
ramp-engine/ramp_engine/tests/test_aws.py | 84.00% <93.75%> (+1.36%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@agramfort (Collaborator)

good to go from your end?

@maikia (Contributor, Author) commented Nov 30, 2020

I think so, but it's always better if someone reviews it...

@lucyleeow (Contributor) left a comment

Only gave it a quick look - sorry, I couldn't find the part where it is added to the queue. Can you point me in the right direction? Thanks!

@@ -57,7 +57,7 @@
 # how long to wait for connections
 WAIT_MINUTES = 2
-MAX_TRIES_TO_CONNECT = 5
+MAX_TRIES_TO_CONNECT = 1
A reviewer (Contributor) commented on the diff:

If we change this to 1, do we still need the if n_try < max_tries_to_connect: check? I guess leaving it gives us the option to increase MAX_TRIES_TO_CONNECT in the future?

@maikia (Contributor, Author) replied:

Yes. I prefer to leave it as a param for two reasons:

  1. it is easier to test, because in the test we can set this and WAIT_MINUTES to low values so that the test finishes within a reasonable time (see the sketch below)
  2. we can easily update it in the future
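
As a rough illustration of point 1, a test could patch the two constants to small values so that the no-instance path finishes quickly. The module path ramp_engine.aws.api and the surrounding setup are assumptions here, not the actual test added in this PR:

```python
import pytest


def test_worker_sets_retry_status(monkeypatch):
    # Assumed location of the constants; skip if the module is not importable.
    aws_api = pytest.importorskip('ramp_engine.aws.api')
    monkeypatch.setattr(aws_api, 'WAIT_MINUTES', 0.01)
    monkeypatch.setattr(aws_api, 'MAX_TRIES_TO_CONNECT', 1)
    # ... launch an AWS worker against a mocked EC2 backend that reports no
    # free instances, then assert that worker.status == 'retry'.
```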

@agramfort (Collaborator) commented Dec 1, 2020 via email

@maikia (Contributor, Author) commented Dec 3, 2020

> Only gave it a quick look - sorry, I couldn't find the part where it is added to the queue. Can you point me in the right direction? Thanks!

When the worker status is set to 'retry', the dispatcher sets the submission back to 'new', i.e. puts it back in the queue:

elif worker.status == 'retry':
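
For illustration, the dispatcher-side handling could be sketched roughly like this; the function and queue names are assumptions, not the exact ramp_engine dispatcher code:

```python
from collections import deque


def collect_results(processing_queue, awaiting_queue):
    """Drain finished workers and requeue the ones that asked to be retried."""
    still_processing = deque()
    while processing_queue:
        submission_id, worker = processing_queue.popleft()
        if worker.status == 'finished':
            pass  # store the worker's results (omitted in this sketch)
        elif worker.status == 'retry':
            # reset the submission to 'new', i.e. put it back in the queue
            # so it will be picked up again later
            awaiting_queue.append(submission_id)
        else:
            still_processing.append((submission_id, worker))
    processing_queue.extend(still_processing)
```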

@maikia (Contributor, Author) commented Dec 3, 2020

@agramfort
We can test locally, but if you want to have a staging server, we can have a continuously running EC2 instance that we configure RAMP on. We could even provision the database with a dump from ramp.studio.

@maikia (Contributor, Author) commented Dec 3, 2020

@agramfort @lucyleeow if you are happy, could you please merge? (Then it could already be included in the next release.)

@lucyleeow (Contributor)

Ah, I was looking for an explicit addition to the queue but I had forgotten that I added the 'retry' function! LGTM

@agramfort merged commit 38a929a into paris-saclay-cds:master on Dec 3, 2020
@agramfort (Collaborator)

thx @maikia !

@agramfort (Collaborator)

for the staging server, it's up to you whether it helps you or not

maikia added a commit to maikia/ramp-board that referenced this pull request Dec 3, 2020
…aris-saclay-cds#484)

* updated the wait for an available instance to only 2 minutes; if after this time there is still no available instance, the worker status will be set to "retry"

* added a test to make sure that the retry status is set

* updated the test to make sure the worker status and the log are correct

* updates so that the api can pass to the worker a request to retry later
glemaitre pushed a commit that referenced this pull request Dec 3, 2020
…484)
