Add code for Monitoring Windows AKS Node for failures #510

sbangari · 2021-11-30T21:18:29Z

Problem:
We have had ICMs in the recent past where Windows worker nodes on AKS ran into problems at random while running. To investigate these issues, we need to collect logs, traces, events, and dumps from the node immediately after the failure. This process has been cumbersome, with challenges including:

How do I detect the node that went into a bad state?
How do I collect the logs? What commands/scripts to run, when and how? This needs to be communicated via an additional hop through an escalation engineer. Multiple cycles were wasted just in communicating the detailed steps across to the customer.
Monitoring the nodes for failure has been a manual process.

As a first step to solve this problem:

The developer implements a few functions that translate their intent into a few lines of code.

Implement these 4 methods:
LogMessage - (Optional) Implements logic to log messages. Defaults to logging to a file.
StartHandler - (Optional) Handler invoked after the monitoring starts (before the node is in repro state).
TerminateHandler - (Optional) Handler invoked before the monitoring stops (after the node is in repro state).
IsNodeFaulted - Returns $true when the node is in repro state, $false otherwise.
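
A minimal sketch of what such a strategy module might look like. Only the four function names come from the description above; the parameter lists, the log file location, and the fault condition (no kubelet process running) are assumptions for illustration, not the implementation in this PR.

```powershell
# Hypothetical strategy module, e.g. saved as MyStrategy.psm1.
# Assumed log location; the real default is not stated in the description.
$script:LogFile = Join-Path ([System.IO.Path]::GetTempPath()) 'MonitorWindowsNode.log'

function LogMessage {
    param([string] $Message)
    # Append a timestamped line to the log file.
    $line = '{0} {1}' -f (Get-Date -Format 'o'), $Message
    Add-Content -Path $script:LogFile -Value $line
}

function StartHandler {
    # Invoked once after monitoring starts, before the node is in repro state.
    LogMessage 'Monitoring started'
}

function IsNodeFaulted {
    # Assumed fault condition: no kubelet process is running on the node.
    $kubelet = Get-Process -Name 'kubelet' -ErrorAction SilentlyContinue
    return ($null -eq $kubelet)
}

function TerminateHandler {
    # Assumed signature: receives the path where logs were collected.
    param([string] $LogPath)
    LogMessage "Node faulted; logs collected under $LogPath"
}
```

A real strategy would replace the body of IsNodeFaulted with whatever check reproduces the issue under investigation.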

Then a scheduled task is created on the nodes using the "Monitoring Strategy" defined above.
Example:
Register-ScheduledJob -Name "MonitorWindowsNode" -FilePath C:\k\debug\MonitorWindowsNode.ps1 -RunNow -ArgumentList "https://raw.githubusercontent.com/sbangari/SDN/MonitoringWindowsNode/Kubernetes/windows/debug/monitoring/strategies/CopyLogsToBlobStorage.psm1"
When the node goes into a faulted state as described by the strategy, logs are collected from the node. I also have a sample strategy where these logs are automatically uploaded to Azure Blob Storage.
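
The flow described above (start, poll for the fault, then terminate) can be sketched as follows. This is an illustration under assumptions, not the contents of MonitorWindowsNode.ps1: the function name Invoke-NodeMonitor, its parameters, and the polling interval are all hypothetical, and the handlers are passed as script blocks to keep the example self-contained.

```powershell
# Hypothetical monitoring loop; every name here is an assumption for illustration.
function Invoke-NodeMonitor {
    param(
        [Parameter(Mandatory)] [scriptblock] $IsNodeFaulted,
        [scriptblock] $StartHandler = {},
        [scriptblock] $TerminateHandler = {},
        [int] $PollSeconds = 30,
        [int] $MaxPolls = 0   # 0 means keep polling until the node faults
    )

    # Run the start handler once, discarding any output it produces.
    $null = & $StartHandler

    $polls = 0
    while (-not (& $IsNodeFaulted)) {
        $polls++
        # Give up without running the terminate handler if the poll budget is spent.
        if ($MaxPolls -gt 0 -and $polls -ge $MaxPolls) { return $false }
        Start-Sleep -Seconds $PollSeconds
    }

    # The node is in repro state: run the terminate handler (e.g. collect logs).
    $null = & $TerminateHandler
    return $true
}
```

For example, `Invoke-NodeMonitor -IsNodeFaulted { IsNodeFaulted } -TerminateHandler { TerminateHandler }` would wire in functions loaded from a strategy module.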

daschott · 2022-08-22T17:23:42Z

@sumicalbin Did you take over this PR and merge it? Can this be closed?

sumicalbin · 2022-08-22T17:28:33Z

@daschott No, I added a partial update. sbangari created comments for improvements that I cannot see for some reason.

sbangari added 2 commits November 30, 2021 12:52

Add code for Monitoring Windows AKS Node for failures

a8e606b

Adding logpath to terminate handler

62db8ee

sbangari requested a review from Keith-Mange November 30, 2021 22:00

sbangari self-assigned this Nov 30, 2021

zip all the content

ac02197
