Add code for Monitoring Windows AKS Node for failures #510

Open
wants to merge 3 commits into
base: master

Conversation

sbangari
Contributor

@sbangari sbangari commented Nov 30, 2021

Problem:
We have recently had ICMs in which Windows worker nodes on AKS ran into problems at random while running. To investigate these issues we need to collect logs, traces, events, and dumps from the node immediately after the failure. This process has been cumbersome, with challenges including:

  1. How do I detect the node that went into a bad state?
  2. How do I collect the logs? Which commands or scripts should be run, when, and how? This has to be communicated through an additional hop via an escalation engineer. Multiple cycles were wasted just communicating the detailed steps to the customer.
  3. Monitoring the nodes for failure has been a manual process.

As a first step to solve this problem:

  1. The developer implements a few functions that translate their intent into a few lines of code; together these form the "Monitoring Strategy".
     Implement these 4 methods:
     LogMessage - (Optional) Implements logic to log messages. Defaults to logging to a file.
     StartHandler - (Optional) Handler invoked after the monitoring starts (before the node is in repro state).
     TerminateHandler - (Optional) Handler invoked before the monitoring stops (after the node is in repro state).
     IsNodeFaulted - Returns $true when the node is in repro state, $false otherwise.
  2. A scheduled task is then created on the nodes using the "Monitoring Strategy" defined above.
    Example:
    Register-ScheduledJob -Name "MonitorWindowsNode" -FilePath C:\k\debug\MonitorWindowsNode.ps1 -RunNow -ArgumentList "https://raw.githubusercontent.com/sbangari/SDN/MonitoringWindowsNode/Kubernetes/windows/debug/monitoring/strategies/CopyLogsToBlobStorage.psm1"

  3. When the node goes into a faulted state as described by the strategy, logs are collected from the node. I also have a sample where these logs are automatically uploaded to Azure Blob storage.
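
For illustration, the strategy contract described above could be sketched as a PowerShell module like the one below. This is a minimal sketch, not the code in this PR: only the four function names come from the description, while the parameter on LogMessage, the log file location, and the fault condition (the kubelet service being absent or not running) are assumptions made for the example.

```powershell
# ExampleStrategy.psm1 - hypothetical strategy module, NOT taken from this PR.
# Only the four function names come from the PR description; parameters,
# the log path, and the fault condition are assumptions for illustration.

# Assumed log location: a file in the current user's temp directory.
$script:LogFile = Join-Path ([System.IO.Path]::GetTempPath()) "MonitorWindowsNode.log"

function LogMessage {
    param([Parameter(Mandatory)][string] $Message)
    # Append a timestamped line to the log file.
    $line = "{0} {1}" -f (Get-Date -Format o), $Message
    Add-Content -Path $script:LogFile -Value $line
}

function StartHandler {
    # Invoked after monitoring starts, before the node is in repro state.
    LogMessage "Monitoring started on $env:COMPUTERNAME"
}

function IsNodeFaulted {
    # Assumed fault condition: the kubelet service is missing or not running.
    $svc = Get-Service -Name "kubelet" -ErrorAction SilentlyContinue
    $faulted = ($null -eq $svc) -or ($svc.Status -ne 'Running')
    return $faulted
}

function TerminateHandler {
    # Invoked before monitoring stops, after the node is in repro state.
    # A real strategy would collect or upload logs at this point.
    LogMessage "Node reported as faulted; log collection would run here"
}
```

A driver script could then call StartHandler once, poll IsNodeFaulted at some interval, and call TerminateHandler when it returns $true. How MonitorWindowsNode.ps1 actually drives the strategy is not shown in this thread, so that flow is also an assumption.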

@sbangari sbangari self-assigned this Nov 30, 2021
@daschott
Contributor

@sumicalbin Did you take over this PR and merge it? Can this be closed?

@sumicalbin
Contributor

@daschott No, I added a partial update. sbangari left comments suggesting improvements, but for some reason I cannot see them.

3 participants