hfab console connectivity failure during switch reinstall #321

pau-hedgehog · 2025-01-16T21:16:03Z

https://github.com/githedgehog/fabricator/actions/runs/12813204547/job/35726707779

I observed some failures when running hhfab vlab serial in the CI run above:

20:29:03 DBG ds3000-02: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG ds4000-01: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG sse-c4632-01: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG ds3000-01: 20:29:03 ERR serial: failed to run command: exit status 255
20:29:03 DBG ds4000-02: 20:29:03 ERR serial: failed to run command: exit status 255

But due to #317, hhfab doesn't catch this error and the CI continues:

20:30:50 INF All switches placed into NOS Install Mode took=2m20.407841459s

Then there are additional unhandled errors (this would be a separate issue, IMO) which keep the CI job from ending as fast as possible:

20:45:20 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:45:35 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:45:50 ERR Unhandled Error logger=UnhandledError err="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:108: Failed to watch *v1beta1.VLANNamespace: context deadline exceeded"
20:45:50 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:46:05 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"
20:46:20 INF Switches status ready=[ds3000-03] notReady="[ds3000-01 ds3000-02 ds4000-01 ds4000-02 sse-c4632-01]"

The CI should fail fast to prevent wasting valuable CI-HW cycles.

As an extra safeguard, we should set a timeout (e.g., 1h) if possible, @Frostman.


pau-hedgehog · 2025-01-17T13:10:44Z

New hit: https://github.com/githedgehog/fabricator/actions/runs/12827748350/job/35773530955

pau-hedgehog · 2025-01-17T19:33:34Z

Another one: https://github.com/githedgehog/fabricator/actions/runs/12834731651/job/35792652175

So the piece of code that is failing is in the Remote Serial VLAB Helper:

fabricator/pkg/hhfab/vlabhelpers.go

Line 232 in 5d4d2fd

return fmt.Errorf("failed to run command: %w", err)

I've reproduced this in another env by issuing repeated remote serial connections:

ubuntu@env-3:~/hhfab$ ./hhfab vlab serial --name as4630-01 -v
19:27:18 INF Hedgehog Fabricator version=v0.32.1-34-gfbf494c4-dirty-be1939
19:27:18 INF Wiring hydrated successfully mode=if-not-present
19:27:18 INF VLAB config loaded file=vlab/config.yaml
19:27:18 INF Remote serial (hardware) name=as4630-01 remote=192.168.88.10:9004
19:27:18 DBG Running cmd="ssh -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 9004 192.168.88.10"

Type the hot key to suspend the connection: <CTRL>Z

as4630-01 login: 
as4630-01 login: 
--:- AS4630-01 cli-> 19:29:43 ERR serial: failed to run command: exit status 255
ubuntu@env-3:~/hhfab$ ./hhfab vlab serial --name as4630-01 -v
19:29:51 INF Hedgehog Fabricator version=v0.32.1-34-gfbf494c4-dirty-be1939
19:29:51 INF Wiring hydrated successfully mode=if-not-present
19:29:51 INF VLAB config loaded file=vlab/config.yaml
19:29:51 INF Remote serial (hardware) name=as4630-01 remote=192.168.88.10:9004
19:29:51 DBG Running cmd="ssh -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 9004 192.168.88.10"

This connection is in use. User(s) currently connected: ubuntu@1130.
You need privilege to make a simultaneous session.
The connection was unsuccessful.
19:29:51 ERR serial: failed to run command: exit status 255

pau-hedgehog · 2025-01-20T11:47:11Z

Another hit:

https://github.com/githedgehog/fabricator/actions/runs/12858045471/job/35862548645

pau-hedgehog · 2025-01-27T11:53:55Z

I haven't seen any hits of this one during the last week. Closing.

pau-hedgehog · 2025-01-29T13:58:33Z

Hit again:
https://github.com/githedgehog/fabricator/actions/runs/13023345885/job/36340872742#step:8:612

pau-hedgehog · 2025-01-31T08:04:37Z

Another: https://github.com/githedgehog/fabricator/actions/runs/13057259618/job/36446320385#step:8:629

pau-hedgehog · 2025-01-31T08:09:33Z

@sonoble, can we do something about the console server session idle timeout or concurrency? It is causing some switch reinstalls to fail in our CI due to:

This connection is in use. User(s) currently connected: ubuntu@1130.
You need privilege to make a simultaneous session.
The connection was unsuccessful.

sonoble · 2025-01-31T15:35:27Z

Concurrency no, timeout can be adjusted here

pau-hedgehog · 2025-02-18T15:35:19Z

We keep hitting this:
https://github.com/githedgehog/fabricator/actions/runs/13373866939/job/37380750061

I'm running out of ideas. Last week I stress-tested the SSH console server from env-1 and env-3 with a Python script and didn't hit any errors, but at some point the console server stopped working completely and required a power reset. These are pretty old boxes, and their SSH implementation may be very old too.

@edipascale was discussing in #375 (comment)

the question now is, do we want to attempt to handle the connection failure in the expect script? is it worth trying to maybe sleep for x seconds and then attempt to spawn a new serial connection? do we have any idea why we see the serial connection getting closed only some of the time?

Do we want to handle this more gracefully in hhfab, or should we try to fix the underlying issue? If we suspect the Avocent ACS units are the culprit, is a replacement feasible? It would probably require reimplementing the hhfab Serial Access helper.


pau-hedgehog self-assigned this Jan 16, 2025

pau-hedgehog changed the title ~~CI-HW f~~ CI-HW failure during switch reinstall Jan 16, 2025

pau-hedgehog added ci flaky ci-hw Run hardware CI job labels Jan 16, 2025

pau-hedgehog changed the title ~~CI-HW failure during switch reinstall~~ CI-HW console connectivity failure during switch reinstall Jan 20, 2025

pau-hedgehog changed the title ~~CI-HW console connectivity failure during switch reinstall~~ hfab console connectivity failure during switch reinstall Jan 21, 2025

pau-hedgehog closed this as completed Jan 27, 2025

pau-hedgehog reopened this Jan 29, 2025

Frostman added the bug Something isn't working label Jan 31, 2025

pau-hedgehog added a commit that referenced this issue Feb 20, 2025

Add retry logic to switch reinstall

d60e6e0

Add retries in the VLAB helper to work around the remote serial issues. Increase timeout a bit in case of retries. Fixes #321 Signed-off-by: Pau Capdevila <[email protected]>
