hhfab console connectivity failure during switch reinstall #321
Comments
Another one: https://github.com/githedgehog/fabricator/actions/runs/12834731651/job/35792652175
The code that is failing is in the Remote Serial VLAB helper: fabricator/pkg/hhfab/vlabhelpers.go, line 232 in 5d4d2fd
I've reproduced this in another env issuing repeated remote serial connections:
I haven't seen any hits of this one during the last week. Closing.
@sonoble, can we do something about the console server session idle timeout or concurrency? It is causing some switch reinstalls to fail in our CI due to:
We keep hitting this. I'm running out of ideas. Last week I stress-tested the SSH console server from env-1 and env-3 with a Python script and didn't hit errors, but at some point the console server stopped working completely and required a power reset. These are pretty old boxes, and their SSH implementation may be very old. @edipascale was discussing this in #375 (comment)
Do we want to handle this more gracefully in hhfab, or should we try to fix the underlying issue? If we suspect the Avocent ACS units are the culprit, is a replacement feasible? It would probably require reimplementing the hhfab Serial Access helper.
Add retries in the VLAB helper to work around the remote serial issues. Increase the timeout a bit in case of retries.
Fixes #321
Signed-off-by: Pau Capdevila <[email protected]>
https://github.com/githedgehog/fabricator/actions/runs/12813204547/job/35726707779
I observed some failures when running hhfab vlab serial from:
But due to #317 hhfab doesn't catch this error and the CI continues:
Then there are additional unhandled errors (this would be a separate issue, IMO) which keep the CI from ending as quickly as possible:
The CI should fail fast to prevent wasting valuable CI-HW cycles.
As an extra safeguard we should set a timeout (e.g., 1h), if possible, @Frostman: