This repository was archived by the owner on Feb 5, 2020. It is now read-only.

@squat squat commented Oct 11, 2017

Make the Container Linux version getter resistant to races against the
Container Linux Update Operator (CLUO). By reading
/var/lib/update_engine/prefs/aleph-version, we can be sure that, whether
CLUO runs before or after the SSH command, we always know the version
of the OS at installation time.

This commit also adds rescues to the SSH commands to ensure that the
executions will be retried if the node is rebooted by CLUO during the
test.
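
As a rough illustration of the version getter described above, here is a minimal sketch. It is not the code from this PR: `installed_cl_version` and the `run_remote` callable are hypothetical names, and the sketch assumes the aleph-version file holds a bare version string followed by a newline.

```ruby
# Path taken from the PR description.
ALEPH_VERSION_PATH = '/var/lib/update_engine/prefs/aleph-version'

# Hypothetical helper. `run_remote` is assumed to be any callable that
# takes a shell command string and returns the command's stdout, for
# example a lambda wrapping an SSH session.
def installed_cl_version(run_remote)
  output = run_remote.call("cat #{ALEPH_VERSION_PATH}")
  version = output.to_s.strip
  # Fail loudly instead of comparing against an empty version later on.
  raise ArgumentError, 'empty aleph-version output' if version.empty?
  version
end

# Usage with a stubbed remote (no real SSH connection involved):
stub_remote = ->(_cmd) { "1520.6.0\n" }
installed_cl_version(stub_remote)
```

Passing the remote executor in as a callable keeps the parsing logic testable without a live node.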

cc @alexsomesan

    end
  end
rescue Errno::ECONNRESET, Net::SSH::Disconnect
  retry
Contributor

It would be cool to use this ssh_exec wrapper wherever the tests need SSH.
But to do that we need to make the retry behaviour configurable, as some actions might not be retriable or might lead to unexpected results when inadvertently retried.
Also, unbounded retries are a very efficient way of shooting one's foot off.
How about we add a retry count here?
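
One way the suggested retry count could look is sketched below. This is an illustration only, not the wrapper from this PR: `with_bounded_retries` and its `max_retries`, `delay`, and `on` parameters are hypothetical names. The default exception list contains only `Errno::ECONNRESET` so the sketch runs without the net-ssh gem.

```ruby
# Hypothetical helper: runs the block and retries it on the listed
# exceptions, at most `max_retries` times, then re-raises.
def with_bounded_retries(max_retries: 3, delay: 0, on: [Errno::ECONNRESET])
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *on
    # Give up once the retry budget is spent so the failure surfaces
    # instead of looping forever on a flaky connection.
    raise if attempts > max_retries
    sleep(delay) if delay.positive?
    retry
  end
end

# Usage: the block fails twice with a connection reset, then succeeds.
calls = 0
with_bounded_retries(max_retries: 2) do
  calls += 1
  raise Errno::ECONNRESET if calls < 3
  :ok
end
```

With net-ssh loaded, a caller could pass `on: [Errno::ECONNRESET, Net::SSH::Disconnect]` to match the rescue clause in the diff, and `max_retries: 0` for actions that must not be repeated.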

Contributor Author

Yeah, sounds good. I did not want to overengineer this wrapper, but it definitely makes sense if we want to generalize its use. I had already ported the wait_for_bootstrap check to use this method in a previous PR. I can port azure_vpn's use of SSH to this method in a follow-up, since it is unrelated.

Contributor

Thanks a lot for porting the rest of the use cases to the wrapper. It certainly helps fight tech debt as we improve the test code.
Even if we were not to use it in a more generalised setting, the unbounded retry is dangerous, especially given flaky networking to the node instances.

Contributor Author

Yes, that's a good point; I didn't want to overengineer this method, but it makes sense, especially if we want to generalize its use. I had already ported the wait_for_bootstrap method to use this wrapper in a previous PR. I'll port the azure_vpn use of SSH to this wrapper as well, but in a follow-up since it is unrelated to this race condition.

@squat squat force-pushed the refactor_cl_version branch from 0d8de36 to 986b5dc Compare October 11, 2017 15:41
squat commented Oct 11, 2017

PTAL again @alexsomesan.

Regarding the retries, yes, that's a good point; I didn't want to overengineer this wrapper, but it makes sense, especially if we want to generalize its use. I had already ported the wait_for_bootstrap method to use this wrapper in a previous PR. I'll port the azure_vpn use of SSH to this wrapper as well, but in a follow-up since it is unrelated to this race condition.

cpanato commented Oct 12, 2017

retest this please

@squat squat force-pushed the refactor_cl_version branch 4 times, most recently from e2f63d4 to 420286b Compare October 12, 2017 16:27
squat commented Oct 12, 2017

ok to test

squat commented Oct 12, 2017

Got the following Azure error:

2 error(s) occurred:

* module.masters.azurerm_virtual_machine.tectonic_master: 1 error(s) occurred:

* azurerm_virtual_machine.tectonic_master: compute.VirtualMachinesClient#CreateOrUpdate: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="SubscriptionNotAuthorizedForImage" Message="The subscription is not authorized."
* module.workers.azurerm_virtual_machine.tectonic_worker: 1 error(s) occurred:

* azurerm_virtual_machine.tectonic_worker: compute.VirtualMachinesClient#CreateOrUpdate: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="SubscriptionNotAuthorizedForImage" Message="The subscription is not authorized."

squat commented Oct 12, 2017

@cpanato all the AWS tests passed this time. There were some Azure flakes, so I'm going to run the tests again.

squat commented Oct 13, 2017

ok to test

cpanato commented Oct 13, 2017

created a Jira ticket for the flaky issue

@squat squat force-pushed the refactor_cl_version branch 4 times, most recently from 8a39f8e to 547f22d Compare October 13, 2017 11:18
cpanato commented Oct 13, 2017

retest this please

@cpanato cpanato left a comment

👍

@cpanato cpanato merged commit 3e086c9 into coreos:master Oct 13, 2017