-
Notifications
You must be signed in to change notification settings - Fork 262
Conversation
| end | ||
| end | ||
| rescue Errno::ECONNRESET, Net::SSH::Disconnect | ||
| retry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cool to use this ssh_exec wrapper everywhere where ssh is needed by tests.
But to do that we need to make the retry behaviour configurable as some actions might not be retriable or might lead to unexpected results when inadvertently retried.
Also, unbounded retries are a very efficient way of shooting one's foot off.
How about we add a retry count here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah sounds good. I did not want to overengineer this wrapper but it definitely makes sense if we want to generalize its use. I had already ported the wait_for_bootstrap check to use this method in a previous PR. I can port the azure_vpn's use of ssh to this method in a follow up, since it is unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for porting the rest of the use cases to the wrapper. It certainly helps fighting tech debt as we improve the test code.
Even if we were to not use it in a more generalised setting, the unbounded retry is dangerous, especially when put into the perspective of flaky networking to the node instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's a good point; I didn't want to overengineer this method but it makes sense especially if we want to generalize its use. I already had ported the wait_for_bootstrap method to use this wrapper in a previous PR. I'll port the azure_vpn use of SSH to this wrapper as well but in a followup since it is unrelated to this race condition.
0d8de36 to
986b5dc
Compare
|
PTAL again @alexsomesan. Regarding the retries, yes that's a good point; I didn't want to overengineer this wrapper but it makes sense especially if we want to generalize its use. I already had ported the |
986b5dc to
e1dc9e5
Compare
|
retest this please |
e2f63d4 to
420286b
Compare
|
ok to test |
|
Got the following Azure error: |
|
@cpanato all the AWS tests passed this time. Some azure flakes. Going to run again |
|
ok to test |
|
created a Jira ticket for the flaky issue |
8a39f8e to
547f22d
Compare
Make the Container Linux version getter resistant to races against CLUO. By reading /var/lib/update_engine/prefs/aleph-version, we can be sure that no matter if CLUO runs before or after the SSH command, we will always know the version of the OS at installation time. This commit also adds rescues to the SSH commands to ensure that the executions will be retried if the node is rebooted by CLUO during the test.
|
retest this please |
cpanato
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Make the Container Linux version getter resistant to races against CLUO.
By reading /var/lib/update_engine/prefs/aleph-version, we can be sure
that no matter if CLUO runs before or after the SSH command, we will
always know the version of the OS at installation time.
This commit also adds rescues to the SSH commands to ensure that the
executions will be retried if the node is rebooted by CLUO during the
test.
cc @alexsomesan