
Retry, not wait, when connecting to the instance on submitting solution #484

Merged: 4 commits merged into paris-saclay-cds:master on Dec 3, 2020

Conversation

@maikia (Contributor) commented Nov 30, 2020

Sometimes the worker cannot connect to the instance because not enough instances are available. This can happen for multiple reasons, one of them being that an instance which was set to terminate is not yet available for use again.
For that reason we previously added the possibility to wait and try again a few times before giving an error.

This was not very efficient because it forced the whole dispatcher to wait during that time, preventing it from collecting the results from other workers and possibly freeing an instance, which would solve the problem of not having enough instances if the cause was different from the above.

This PR shortens the waiting time and, if the instances are still not available, puts the worker back into the queue to try again later.
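
Below is a minimal sketch of that behaviour, assuming hypothetical names (request_instance, Worker); it illustrates the idea only and is not the actual ramp_engine code:

```python
import time

WAIT_MINUTES = 2          # short wait between connection attempts
MAX_TRIES_TO_CONNECT = 1  # give up quickly instead of blocking the dispatcher


def request_instance():
    """Stand-in for the AWS call that asks for a free EC2 instance."""
    return None  # pretend no instance is currently available


class Worker:
    def __init__(self):
        self.status = 'initialized'

    def launch_instance(self):
        """Try to get an instance; ask the dispatcher to retry later on failure."""
        for n_try in range(1, MAX_TRIES_TO_CONNECT + 1):
            instance = request_instance()
            if instance is not None:
                self.status = 'running'
                return instance
            if n_try < MAX_TRIES_TO_CONNECT:
                time.sleep(WAIT_MINUTES * 60)  # brief wait, then try again
        # no instance available: hand control back to the dispatcher
        self.status = 'retry'
        return None
```

With MAX_TRIES_TO_CONNECT = 1 the worker gives up after a single short attempt, so the dispatcher can keep collecting results from other workers instead of blocking.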

@codecov bot commented Nov 30, 2020

Codecov Report

Merging #484 (cb5b843) into master (26762a4) will increase coverage by 0.01%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##           master     #484      +/-   ##
==========================================
+ Coverage   93.56%   93.58%   +0.01%     
==========================================
  Files          99       99              
  Lines        8496     8506      +10     
==========================================
+ Hits         7949     7960      +11     
+ Misses        547      546       -1     
Impacted Files | Coverage Δ
ramp-engine/ramp_engine/tests/test_aws.py | 84.00% <93.75%> (+1.36%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@agramfort (Collaborator)

good to go from your end?

@maikia (Contributor, Author) commented Nov 30, 2020

I think so, but it's always better if someone reviews it...

@lucyleeow (Contributor) left a comment

Only gave it a quick look - sorry, I couldn't find the part where it is added to the queue. Can you point me in the right direction? Thanks!

@@ -57,7 +57,7 @@
 # how long to wait for connections
 WAIT_MINUTES = 2
-MAX_TRIES_TO_CONNECT = 5
+MAX_TRIES_TO_CONNECT = 1
A reviewer (Contributor) commented on the diff:

If we change this to 1, do we still need the if n_try < max_tries_to_connect: check? I guess leaving it gives us the option to increase MAX_TRIES_TO_CONNECT in the future?

@maikia (Contributor, Author) replied:

Yes. I prefer to leave it as a param for two reasons:

  1. it is easier to test, because in the test we can set this and WAIT_MINUTES to low values so that the test finishes within a reasonable time (see the sketch below)
  2. we can easily update it in the future
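
As a rough illustration of point 1, a test could patch the two constants to small values so that the no-instance path finishes quickly. The module path ramp_engine.aws.api and the surrounding setup are assumptions here, not the actual test added in this PR:

```python
import pytest


def test_worker_sets_retry_status(monkeypatch):
    # Assumed location of the constants; skip if the module is not importable.
    aws_api = pytest.importorskip('ramp_engine.aws.api')
    monkeypatch.setattr(aws_api, 'WAIT_MINUTES', 0.01)
    monkeypatch.setattr(aws_api, 'MAX_TRIES_TO_CONNECT', 1)
    # ... launch an AWS worker against a mocked EC2 backend that reports no
    # free instances, then assert that worker.status == 'retry'.
```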

@agramfort (Collaborator) commented Dec 1, 2020 via email

@maikia (Contributor, Author) commented Dec 3, 2020

> Only gave it a quick look - sorry, I couldn't find the part where it is added to the queue. Can you point me in the right direction? Thanks!

When the worker status is set to 'retry', the dispatcher sets the submission back to 'new', i.e. puts it back in the queue:

elif worker.status == 'retry':
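
For illustration, the dispatcher-side handling could be sketched roughly like this; the function and queue names are assumptions, not the exact ramp_engine dispatcher code:

```python
from collections import deque


def collect_results(processing_queue, awaiting_queue):
    """Drain finished workers and requeue the ones that asked to be retried."""
    still_processing = deque()
    while processing_queue:
        submission_id, worker = processing_queue.popleft()
        if worker.status == 'finished':
            pass  # store the worker's results (omitted in this sketch)
        elif worker.status == 'retry':
            # reset the submission to 'new', i.e. put it back in the queue
            # so it will be picked up again later
            awaiting_queue.append(submission_id)
        else:
            still_processing.append((submission_id, worker))
    processing_queue.extend(still_processing)
```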

@maikia (Contributor, Author) commented Dec 3, 2020

@agramfort
We can test locally, but if you want to have a staging server, we can have a continuously running EC2 instance that we configure RAMP on. We could even provision the database with a dump from ramp.studio.

@maikia (Contributor, Author) commented Dec 3, 2020

@agramfort @lucyleeow if you are happy, could you please merge? (Then it could already be included in the next release.)

@lucyleeow (Contributor)

Ah, I was looking for an explicit addition to the queue but I had forgotten that I added the 'retry' function! LGTM

@agramfort merged commit 38a929a into paris-saclay-cds:master on Dec 3, 2020
@agramfort (Collaborator)

thx @maikia !

@agramfort (Collaborator)

for the staging server, it's up to you whether it helps you or not

maikia added a commit to maikia/ramp-board that referenced this pull request Dec 3, 2020
…aris-saclay-cds#484)

* updated the wait for an available instance to only 2 minutes; if after this time there is still no available instance, the worker status will be set to "retry"

* added a test to make sure that the retry status is set

* updated the test to make sure the worker status and the log are correct

* updates so that the api can pass to the worker a request to retry later
glemaitre pushed a commit that referenced this pull request Dec 3, 2020
…484)
