feat(kubevirt): add troubleshoot action to vm_lifecycle tool#653

Merged
manusa merged 2 commits into containers:main from ksimon1:troubleshoot-vms
Feb 3, 2026

Conversation

@ksimon1
Contributor

@ksimon1 ksimon1 commented Jan 15, 2026

Add a new "troubleshoot" action to the vm_lifecycle tool that generates a step-by-step troubleshooting guide for diagnosing VirtualMachine issues.

The troubleshoot action:

  • Renders a diagnostic plan for VMs
  • Guides the AI through checking VM, VMI, DataVolumes, PVCs, pods, and events
  • Includes a summary template for reporting findings

This helps AI assistants systematically diagnose VM issues by providing structured instructions on which MCP tools to use and how to interpret the results.
Original credit goes to @lyarwood

Code was assisted by Cursor AI

Signed-off-by: Karel Simon <ksimon@redhat.com>

@ksimon1
Contributor Author

ksimon1 commented Jan 15, 2026

@lyarwood, @Cali0707, @manusa can you please review this PR?

Contributor

@lyarwood lyarwood left a comment

Looks like a good start but I think this is going to need to be rebased on #626 with evals written for various scenarios before we can merge. @ksimon1 if you agree would you mind trying it and reviewing #626?

I'm also concerned about the impact that rendered plans like this might have on token use. IIRC we don't have a way of measuring this clearly through gevals yet but I wouldn't be surprised if we ended up asking for plans like this to be converted into code-based solutions with simple returns to the calling agent and model.

@ksimon1
Contributor Author

ksimon1 commented Jan 15, 2026

/hold

return api.NewToolCallResult("", err), nil
}

dynamicClient := params.DynamicClient()
Contributor

What is the reason for removing it from here and putting the dynamicClient in each switch-case?

@Cali0707
Collaborator

IIRC we don't have a way of measuring this clearly through gevals yet

This is on the roadmap, but not implemented yet (sorry 😞 )

}
message = "# VirtualMachine restarted successfully\n"

case ActionTroubleshoot:
Collaborator

IMO this is more of an MCP prompt than a tool. We have added support for toolsets to define their own prompts (see https://github.com/containers/kubernetes-mcp-server/blob/main/docs/PROMPTS.md#toolset-prompts)

Contributor

TIL looks like this dropped while I was out, thanks!

@ksimon1 are you able to rework this to use a prompt? Would be good to rebase and land this in the next 2 weeks with some basic eval coverage.

@ksimon1
Contributor Author

ksimon1 commented Jan 27, 2026

@lyarwood, @Cali0707, @manusa can you please review this PR?

@ksimon1
Contributor Author

ksimon1 commented Jan 28, 2026

/retest

Follow these steps to diagnose issues with the VirtualMachine:

## Step 1: Check VirtualMachine Status
Use resources_get with apiVersion=kubevirt.io/v1, kind=VirtualMachine, namespace=%s, name=%s
Member

Some ideas that float into my head that might or might not apply:

The prompt could probably benefit from some dynamic content injection (as defined in Claude Skills).

The idea here would be to execute the relevant queries ourselves and inject the results, instead of delegating the task to the model, which avoids the extra round-trips and overhead. (I believe this is a pattern we're already following with the cluster-health-check prompt).

I would use this dynamic content injection at least for those requests that we know beforehand the model will have to perform.

Regarding token usage, I understand the initial cost of this pattern should be higher, but it should compensate for the extra overhead and round-trips the model+agent would otherwise need to complete the actual troubleshooting task.

Thoughts?

Contributor Author

@ksimon1 ksimon1 Jan 29, 2026

IIUC you mean to prepopulate the prompt text with basic information like vm/vmi manifest, pod description, ...?

Contributor Author

@manusa would you please review the new code, which injects the content into the prompt?

Member

@manusa manusa Feb 2, 2026

IIUC you mean to prepopulate the prompt text with basic information like vm/vmi manifest, pod description, ...?

Yes, exactly

@manusa would you please review the new code, which injects the content into the prompt?

Sorry, last Friday I was completely focused on the code mode feature and demo. Let me check this now.

Member

I'm not sure what the compiled prompt looks like, but you're definitely doing what I was suggesting.
This should prevent the LLM from doing multiple round-trips that we know beforehand it would attempt, because we were instructing it to do so.

kubectl delete namespace "$NS" --ignore-not-found
prompt:
  inline: |-
    There is a VirtualMachine named "broken-vm" in the ${EVAL_NAMESPACE:-vm-test} namespace that is not working correctly.
Contributor

Have you tried running this with mcpchecker? IIRC it no longer supports bash substitutions. That is something we can address in the project, but for now it will lead to this being passed directly to the agent/model in its current form.

Contributor Author

The mcpchecker run passed. Since this substitution is used in all of kubevirt's eval tasks, I would update it across all tasks in a separate PR.
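
For context, `${EVAL_NAMESPACE:-vm-test}` is shell parameter expansion with a default: the variable's value is used when it is set and non-empty, otherwise `vm-test`. The sketch below shows one way a runner could resolve that form before the prompt reaches the model. `expandDefaults` is a hypothetical helper written for this illustration, not part of mcpchecker, and it handles only the `:-` form.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultExpr matches ${NAME:-default}, where NAME is a shell-style
// identifier and the default contains no closing brace.
var defaultExpr = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}`)

// expandDefaults replaces each ${NAME:-default} with the looked-up value,
// falling back to the default when NAME is unset or empty.
func expandDefaults(s string, lookup func(string) (string, bool)) string {
	return defaultExpr.ReplaceAllStringFunc(s, func(m string) string {
		parts := defaultExpr.FindStringSubmatch(m)
		if v, ok := lookup(parts[1]); ok && v != "" {
			return v
		}
		return parts[2]
	})
}

func main() {
	prompt := `VirtualMachine "broken-vm" in the ${EVAL_NAMESPACE:-vm-test} namespace`

	// Variable unset: the default is used.
	unset := func(string) (string, bool) { return "", false }
	fmt.Println(expandDefaults(prompt, unset))

	// Variable set: its value wins over the default.
	set := func(k string) (string, bool) { return "ci-ns", k == "EVAL_NAMESPACE" }
	fmt.Println(expandDefaults(prompt, set))
}
```

In a real runner, `os.LookupEnv` has the right signature to be passed as `lookup`.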

echo ""
echo "=== Troubleshooting Eval Complete ==="
echo "The agent should have:"
echo " 1. Used the vm-troubleshoot prompt with namespace=$NS and name=broken-vm"
Contributor

IMHO this is also missing from the Task API at the moment. We can define this in Evals, but I also think each Task should be able to assert that tools and/or prompts are called.

Collaborator

@lyarwood +1 here - we have an open discussion trying to figure out how we want to solve this: mcpchecker/mcpchecker#126

Interested in hearing if you have any thoughts 😄
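
To make the idea concrete, a per-task assertion could be checked against a recorded call log along these lines. This is purely a sketch: the `call` type and `assertCalled` are hypothetical names, and the design under discussion in mcpchecker may look quite different.

```go
package main

import "fmt"

// call records one MCP interaction observed during an eval run
// (hypothetical shape, invented for this example).
type call struct {
	Kind string // "tool" or "prompt"
	Name string
	Args map[string]string
}

// assertCalled returns an error unless some recorded call matches the
// kind and name and carries every wanted argument.
func assertCalled(log []call, kind, name string, want map[string]string) error {
	for _, c := range log {
		if c.Kind != kind || c.Name != name {
			continue
		}
		matched := true
		for k, v := range want {
			if c.Args[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return nil
		}
	}
	return fmt.Errorf("no %s call to %q with args %v", kind, name, want)
}

func main() {
	log := []call{{
		Kind: "prompt",
		Name: "vm-troubleshoot",
		Args: map[string]string{"namespace": "vm-test", "name": "broken-vm"},
	}}
	// Passes: the prompt was requested for broken-vm.
	fmt.Println(assertCalled(log, "prompt", "vm-troubleshoot", map[string]string{"name": "broken-vm"}))
	// Fails: no tool call was recorded.
	fmt.Println(assertCalled(log, "tool", "resources_get", nil))
}
```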

…urces (gvr.go)

This helps the next commit avoid duplicating GVRs and GVKs.

Signed-off-by: Karel Simon <ksimon@redhat.com>
…tics

Add a new "vm-troubleshoot" MCP prompt to the kubevirt toolset that generates
a step-by-step troubleshooting guide for diagnosing VirtualMachine issues.

The prompt:
        - Provides a structured diagnostic plan for VMs
        - Guides the AI through checking VM, VMI, DataVolumes, PVCs, pods, and events
        - Includes a summary template for reporting findings
        - Tries to fix the VM state

This is implemented as an MCP Prompt (not a tool action).

Code was assisted by Cursor AI

Signed-off-by: Karel Simon <ksimon@redhat.com>
@ksimon1
Contributor Author

ksimon1 commented Feb 2, 2026

@Cali0707, @manusa can you please review this PR?

Member

@manusa manusa left a comment

The prompt logic looks good to me, thx!
For the eval part, I think Calum should give his blessing.

@manusa manusa added this to the 0.1.0 milestone Feb 2, 2026
Collaborator

@Cali0707 Cali0707 left a comment

Evals look fine to me, thanks @ksimon1 !

Only thing to note is that we are hoping to move onto the new task format and off of using bash scripts as much as possible. So in the future, we will need to rework this to e.g. leverage the kubernetes extension

@lyarwood
Contributor

lyarwood commented Feb 2, 2026

Evals look fine to me, thanks @ksimon1 !

Only thing to note is that we are hoping to move onto the new task format and off of using bash scripts as much as possible. So in the future, we will need to rework this to e.g. leverage the kubernetes extension

ACK thanks for the reminder, something for a follow up.

I can create an issue if you want us to track this?

@Cali0707
Collaborator

Cali0707 commented Feb 2, 2026

I can create an issue if you want us to track this?

That would be helpful!

@ksimon1
Contributor Author

ksimon1 commented Feb 3, 2026

So can we merge this PR?

@manusa manusa merged commit 6c74e2a into containers:main Feb 3, 2026
7 checks passed
@manusa
Member

manusa commented Feb 3, 2026

So can we merge this PR?

Merged 🚀, thx!

5 participants