Cache and print devices for debugging future outages #2097
/ok-to-test
/lgtm
easier to unit test.
@@ -235,7 +235,7 @@ func (i *InstanceInfo) CreateOrGetInstance(localSSDCount int) error {
	}

	if i.cfg.CloudtopHost {
This seems to no longer be necessary on cloudtop, and was actively causing my local test run to fail.
/retest
Not under active development anymore.
What type of PR is this?
/kind cleanup # Is this right??
What this PR does / why we need it:
This PR adds a cache that periodically (currently every minute) reads the /dev/disk/by-id/ directory and evaluates the symlinks there. It maintains a cache mapping each symlink to the real path it points to.
This will help with debugging future filesystem issues. In a past OMG, we found that our limited insight into changes in symlinks for specific disks hampered our ability to debug. Logging recorded the real path of the disk at mount and unmount, but a change in between couldn't be detected.
This PR will print those links every minute and will also log when entries in the cache change.
An example:
The cache will also note if a symlink is broken.
NOTE: Currently this filters out anything in by-id/ that ends with -part[0-9]*$. This removes partitions, which are noise. Mounting partitions directly isn't well supported in GKE, but we may want to test that in the future.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: