-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(env-check): remove monitor initial pending to fix disk mismatch problem #3426
base: master
Are you sure you want to change the base?
fix(env-check): remove monitor initial pending to fix disk mismatch problem #3426
Conversation
…roblem longhorn/longhorn-10098 Signed-off-by: Raphanus Lo <[email protected]>
WalkthroughThe pull request introduces significant structural changes to the monitoring components in the Longhorn project, transitioning from concrete monitor implementations to a more generic, type-safe approach. The changes primarily focus on the Changes
Assessment against linked issues
Sequence DiagramsequenceDiagram
participant NC as NodeController
participant DM as DiskMonitor
participant DS as DataStore
NC->>DM: createDiskMonitor()
DM-->>NC: Initialize Monitor
NC->>DM: Start()
DM->>DS: Collect Initial Disk Data
DM-->>NC: Notify Data Collection
The sequence diagram illustrates the high-level interaction between the Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
controller/monitor/disk_monitor.go (1)
405-405
: Handling orphaned or unknown disk data
getOrphanedReplicaDataStores
delegates block vs. filesystem disk handling cleanly. Keep watch for potential edge conditions if new disk types are introduced.controller/monitor/fake_disk_monitor.go (1)
24-24
: Consider documenting the return type change.
Changing the function to returnDiskMonitor
instead of*DiskMonitor
can improve abstraction. However, external callers may find it more intuitive if this change is highlighted in a release note or function docstrings.controller/monitor/snapshot_monitor.go (4)
Line range hint
158-180
: Periodic check for snapshots is straightforward.
This approach ensures scheduled checks. If you need more granular or immediate checks triggered by external signals, consider hooking them into this code or the event queue.
258-262
:RunOnce
unimplemented.
If there is a test scenario or CLI usage that expects a one-off snapshot check, implementingRunOnce
might be beneficial. Otherwise, consider documenting this method’s state as intentionally unimplemented.
317-324
: inProgressSnapshotCheckTasks concurrency.
The lock usage is correct. Consider whether a sync.Map or a concurrent set might simplify usage, but your current approach is acceptable.
Line range hint
458-501
: Syncing checksum and evicting corrupted replicas.
The logic is clear and thorough. You might add more logs if you suspect certain replica addresses are frequently corrupted, but otherwise this is well-executed.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
controller/monitor/disk_monitor.go
(11 hunks)controller/monitor/environment_check_monitor.go
(16 hunks)controller/monitor/fake_disk_monitor.go
(1 hunks)controller/monitor/fake_environment_check_monitor.go
(4 hunks)controller/monitor/monitor.go
(1 hunks)controller/monitor/snapshot_monitor.go
(20 hunks)controller/node_controller.go
(7 hunks)
🧰 Additional context used
🪛 GitHub Check: CodeFactor
controller/monitor/snapshot_monitor.go
[notice] 94-94: controller/monitor/snapshot_monitor.go#L94
Exported func NewSnapshotMonitor returns unexported type *monitor.snapshotMonitorImpl, which can be annoying to use. (unexported-return)
🔇 Additional comments (80)
controller/monitor/monitor.go (2)
12-12
: Use of generics enhances type safety
Introducing a type parameter for the Monitor
interface significantly improves type safety and clarity in monitor implementations.
16-16
: Typed return type adds clarity
Changing GetCollectedData
to return CollectedDataType
instead of interface{}
removes ambiguity and ensures that callers can rely on a strongly typed result.
controller/monitor/disk_monitor.go (18)
38-39
: Align the type definition with the interface usage
Defining type DiskMonitor Monitor[map[string]*CollectedDiskInfo]
clarifies the collected data structure. It consistently matches the generics-based Monitor
interface.
40-41
: Implementation check
Declaring var _ DiskMonitor = &diskMonitorImpl{}
ensures diskMonitorImpl
correctly implements the DiskMonitor
interface. This is a good pattern to catch compile-time interface mismatches.
42-42
: Consistent naming for implementation struct
Renaming the struct to diskMonitorImpl
clarifies that it's the concrete implementation of the DiskMonitor
interface. This is a clean approach.
48-48
: Lock usage observation
The combination of sync.Once
and sync.RWMutex
looks appropriate for safely initializing and accessing collectedData
. Remember to avoid potential deadlocks if other methods acquire both locks in a reversed order.
77-77
: Constructor’s explicit return type
The updated return type (DiskMonitor, error)
is consistent with the generics-based interface. This is beneficial for type-checking and maintaining a clean API.
80-80
: Initialization correctness
All handlers, locks, and collectedData
fields are properly instantiated. The readiness for immediate usage of diskMonitorImpl
is clear.
100-119
: Immediate run before goroutine
Running m.run()
once before starting the polling goroutine ensures the initial disk state is collected promptly, aligning well with the PR's stated objectives to fix mismatch issues at startup.
122-122
: Stop method clarity
Calling m.quit()
cleanly cancels the context, ensuring the goroutine can terminate gracefully.
126-127
: Synchronous run usage
RunOnce()
calling m.run()
is a straightforward design for single-iteration data collection. This helps in debugging or on-demand fetching.
130-130
: Optional configuration extension
UpdateConfiguration(map[string]interface{}) error { return nil }
is a no-op. If future configuration changes are needed, this design is easily extended.
134-134
: Deep copying collected data
Copying m.collectedData
into a new map ensures concurrent readers won’t inadvertently mutate this internal state. The approach is well suited for safe external consumption.
146-146
: Main monitoring loop logic
run()
properly retrieves node info, collects disk data, and triggers a sync callback if changes are detected. This pattern is consistent with the normal monitor workflow.
167-167
: Robust instance manager retrieval logic
getRunningInstanceManagerRO
handles multiple data engines and searches for a running manager if the default one isn’t found. This fallback approach is practical for ensuring function continuity.
193-193
: Lazily creating disk service clients
newDiskServiceClients
attempts to instantiate clients for each data engine. The fallback logic ensures no immediate crash if a key client is unavailable.
220-220
: Defensive closure of clients
closeDiskServiceClients
carefully closes all active clients to prevent resource leaks and leftover open connections.
230-230
: Central place for disk data collection
collectDiskData
centralizes the logic of retrieving disk configs, stats, and orphaned replicas. This function’s approach is cohesive and is a good single-responsibility method.
416-416
: Skipping found replicas in the data store
getOrphanedReplicaLvolNames
discards recognized replicas from the orphan set. This is consistent with the approach for cleaning up unknown volumes.
431-431
: Filesystem orphan detection logic
getOrphanedReplicaDirectoryNames
uses both node status checks and the volume metadata file presence. This multi-step validation is beneficial for avoiding false positives.
controller/monitor/environment_check_monitor.go (24)
48-49
: Generics-based definition
type EnvironmentCheckMonitor Monitor[[]longhorn.Condition]
provides consistent usage of generics, mirroring DiskMonitor
.
50-51
: Correct interface check
var _ EnvironmentCheckMonitor = &environmentCheckMonitorImpl{}
is a good mechanism to validate the implementation at compile time.
52-52
: Implementation naming
environmentCheckMonitorImpl
follows the same naming convention as diskMonitorImpl
for clarity and consistency across the codebase.
57-57
: Concurrent data access
Use of sync.RWMutex
for collectedData
is a safe approach to concurrent reads and occasional writes.
68-68
: Constructor returning interface
NewEnvironmentCheckMonitor(...) (EnvironmentCheckMonitor, error)
aligns with the new generics-based Monitor interface. This fosters type clarity throughout the code.
71-71
: Comprehensive initialization
All fields (collectedDataLock
, collectedData
, syncCallback
) are properly initialized. This ensures readiness for immediate monitoring.
85-104
: Immediate run logic
The Start
method pre-runs the environment check before scheduling periodic checks, ensuring immediate data availability. The poll logic is consistent with DiskMonitor
.
107-107
: Graceful shutdown
The Stop()
method properly invokes m.quit()
. This pattern is consistent across monitors for concurrency termination.
111-112
: On-demand environment check
RunOnce
is aligned with the overall design, providing a quick single-run check method when needed.
115-115
: Optional flexible config
Similar to DiskMonitor
, UpdateConfiguration
is left unimplemented. This design is flexible if future runtime adjustments become necessary.
119-119
: Safe copying of condition data
copier.CopyWithOption
usage ensures that callers receive a distinct copy of the environment check conditions without risking concurrency issues.
131-131
: Sync callback on changes
Within run()
, the new conditions are compared to the existing. Upon differences, a syncCallback
is triggered. This is crucial for reactive environment-based logic.
152-152
: Fallback for missing kube node
When GetKubernetesNodeRO
fails, it returns an empty set of conditions, which ensures the rest of the logic can proceed without a crash.
163-163
: Condition-based environment check aggregator
environmentCheck
collects multiple checks (packages, multipathd, NFS versions, kernel modules, huge pages) into a single set of conditions. This design is modular and extendable.
189-189
: Package installation checks per distribution
syncPackagesInstalled
uses OS detection to decide which packages to verify. This approach is thorough, utilizing distribution-based branching.
257-257
: Talos edge case handling
syncPackagesInstalledTalosLinux
accommodates the specialized environment of Talos. These checks ensure package existence despite the unique OS structure.
333-333
: Detecting multipathd conflict
syncMultipathd
warns about known issues with multipathd. This is an example of hazard-based environment checks that can prevent data corruption or performance pitfalls.
352-352
: Batching package checks
checkPackageInstalled
abstracts the logic of validating packages in multiple namespaces. Combining them prevents repetitive code paths.
370-370
: Huge page capacity check
Ensuring the node’s hugepage capacity meets the configured threshold is essential for V2 data engine usage. This logic is properly integrated with node conditions.
404-404
: Kernel module scanning approach
checkKernelModulesLoaded
tries kmod
first, with fallback to config checks if not found. This layered approach is robust for multi-environment setups.
465-465
: Validating modules in kernel config files
checkModulesLoadedByConfigFile
elegantly handles both /boot
and /proc
fallback scenarios. This logic is thorough for containerized environments.
487-487
: NFS mount config checks
checkNFSMountConfigFile
ensures the default NFS protocol version is recognized and supported. This step is critical to avoid unexpected NFS issues.
503-503
: Module state verification
checkKernelModuleEnabled
inspects kernel configs for y
or m
states, then uses kmod
checks for dynamically loaded modules. This is a robust multi-step approach.
539-539
: Confirm NFS client availability
synchronizeNFSClientVersion
checks whether the node’s kernel modules and default mount config allow NFS versions that Longhorn supports, making environment readiness more predictable.
controller/node_controller.go (18)
59-59
: Switch to specialized monitor types
diskMonitor monitor.DiskMonitor
enforces type specificity, removing the need for additional type assertions when retrieving disk data.
60-60
: Enhanced environment checks
Using environmentCheckMonitor monitor.EnvironmentCheckMonitor
clarifies usage, eliminating generic type confusion.
62-62
: Snapshot monitor specialized field
A distinct monitor.SnapshotMonitor
field for snapshot operations ensures each monitor type is semantically correct.
477-479
: Graceful handling of snapshot monitor errors
A check is performed to handle failures in GetCollectedData()
gracefully, preventing a crash loop if the snapshot monitor is temporarily unavailable.
1276-1276
: Lazy creation of DiskMonitor
createDiskMonitor()
returns the existing monitor if present, preventing redundant re-initialization. This pattern is effective in ensuring a single consistent monitor instance.
1281-1281
: Nesting the run operation
Invoking Start()
after constructing the monitor ensures the initial data collection is triggered right away.
1285-1285
: Context-based concurrency
The Start()
method uses the same approach (PollUntilContextCancel
) as in other monitors, ensuring consistent concurrency patterns across the codebase.
1287-1287
: Retain reference
Assigning the newly created monitor to nc.diskMonitor
ensures that future references use the same monitor instance for disk collection logic.
1289-1289
: Returning interface ensures flexibility
Returning the interface type DiskMonitor
keeps the calling code decoupled from the implementation details.
1292-1292
: EnvironmentCheckMonitor creation pattern
Similar to createDiskMonitor
, the method lazily initializes the environment monitor, preventing duplication.
1297-1297
: Immediate monitoring start
mon.Start()
triggers environment checks without waiting for a next reconcile cycle, ensuring environment mismatches are caught early.
1301-1301
: Consistent concurrency initialization
The environment monitor uses the same concurrency design as the disk monitor, keeping concurrency patterns uniform.
1303-1303
: Single reference assignment
Storing mon
in nc.environmentCheckMonitor
ensures stateful reference for environment checks throughout the node controller’s lifecycle.
1305-1305
: Return new or existing environment monitor
Returning the EnvironmentCheckMonitor
interface encapsulates internal details, consistent with the rest of the design.
1444-1444
: Verifying disk mismatch
say if matched := isDiskMatched(); ...
calls out mismatching disks with a warning, rather than silently ignoring. This direct check is essential for debugging.
1458-1460
: Robust environment condition retrieval
Collecting environment conditions from the monitor and tolerating errors ensures partial checks do not block the entire node sync logic.
1580-1580
: Snapshot monitor creation pattern
Similar to disk/environment monitors, createSnapshotMonitor
returns early if already initialized, preventing extra goroutines.
1595-1595
: Consistent concurrency model for snapshot monitor
mon.Start()
uses the same approach as the other monitors, ensuring uniform concurrency across all monitor implementations.
controller/monitor/fake_environment_check_monitor.go (5)
29-29
: Constructor returning an interface
NewFakeEnvironmentCheckMonitor(...) (EnvironmentCheckMonitor, error)
ensures that tests can swap the real environment check monitor with this fake seamlessly.
48-48
: Permission for immediate run
true
parameter to PollUntilContextCancel
triggers a check immediately on Start()
, matching production behavior in a test/fake context.
66-66
: RunOnce pattern reusability
RunOnce
aligning with the same usage as the real monitor ensures test code can replicate single-run environment checks.
73-73
: Concrete typed return
Replacing (interface{}, error)
with ([]longhorn.Condition, error)
clarifies the test behavior by returning strongly typed data from the fake.
85-85
: Minimal environment check cycle
run()
sets an empty slice of conditions, simulating a check result. This is sufficient for testing syncCallback
logic without real scanning.
controller/monitor/fake_disk_monitor.go (1)
27-27
: Returning a pointer to diskMonitorImpl
as DiskMonitor
.
Returning an interface while constructing a pointer to the concrete struct is a sound approach for encapsulation. This pattern should be consistently used in the rest of the codebase for clarity.
controller/monitor/snapshot_monitor.go (12)
64-64
: Easy-to-read alias for generics.
Defining type SnapshotMonitor Monitor[SnapshotMonitorStatus]
can improve readability. Ensure other places—like references or documentation—are updated for consistent usage.
66-66
: Enabling compile-time type checks for the interface.
The assignment var _ SnapshotMonitor = &snapshotMonitorImpl{}
is a clean way to verify that snapshotMonitorImpl
indeed implements the interface.
94-94
: Unexported return type flagged by static analysis.
Returning (*snapshotMonitorImpl, error)
from a factory function named NewSnapshotMonitor
is okay if you are certain no external package needs a named type. If external usage or testing expects a fully exported type, consider returning the interface directly.
🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 94-94: controller/monitor/snapshot_monitor.go#L94
Exported func NewSnapshotMonitor returns unexported type *monitor.snapshotMonitorImpl, which can be annoying to use. (unexported-return)
125-132
: Ensure concurrency safety when starting the monitor.
The usage of startOnce.Do
prevents multiple goroutines from calling Start()
. Confirm that further concurrent calls won’t reinitialize or break any shared state.
Line range hint 135-153
: Efficient snapshot event handling.
Adding snapshot tasks to the queue is non-blocking and appears well-coordinated. Make sure large volumes of events don’t overwhelm downstream processing.
Line range hint 198-222
: Robust error handling logic.
Handling the snapshot hashing error gracefully and controlling the requeue limit is well-implemented. Confirm that the error strings are stable enough for string matching.
245-250
: Clean shutdown sequence.
Calling m.snapshotCheckTaskQueue.ShutDown()
followed by stopping the scheduler helps avoid race conditions in concurrent operations.
297-303
: Snapshot status retrieval is thread-safe.
Locking before accessing m.SnapshotMonitorStatus
is correct. Keep an eye on similar data structures for consistent locking patterns.
Line range hint 363-383
: Guard conditions are neatly organized.
Ensuring the volume isn’t purging, rebuilding, restoring, or migrating before triggering snapshot hashing is crucial. This helps prevent inconsistent snapshot states.
Line range hint 403-429
: Comprehensive retry logic with retry-go
.
The error-based retry mechanism ensures snapshot hashing is retried up to a limit. Good usage of LastErrorOnly
to avoid spamming logs.
Line range hint 440-449
: Dedicated checks for volume restoration.
Properly gating the hashing process while the volume is restoring prevents corrupted states or conflicting operations.
580-580
: Falling back to global data integrity setting.
When volume.Spec.SnapshotDataIntegrity
is Ignored
, retrieving the global setting ensures consistent behavior. This is a robust fallback mechanism.
This pull request is now in conflict. Could you fix it @COLDTURNIP? 🙏 |
EDIT: change to draft since the initial waiting is expected for while disk mismatch really happens.
Which issue(s) this PR fixes:
Issue longhorn/longhorn#10098
What this PR does / why we need it:
Trigger the initial disk status collecting right after the disk monitor get created, to eliminate disk mismatching errors, which blocks the rest node sync logic.
Special notes for your reviewer:
Additional documentation or context