feat(kubevirt): add troubleshoot action to vm_lifecycle tool#653
feat(kubevirt): add troubleshoot action to vm_lifecycle tool#653manusa merged 2 commits intocontainers:mainfrom
Conversation
8eea2e0 to
a2f9f5c
Compare
lyarwood
left a comment
There was a problem hiding this comment.
Looks like a good start but I think this is going to need to be rebased on #626 with evals written for various scenarios before we can merge. @ksimon1 if you agree would you mind trying it and reviewing #626?
I'm also concerned by the actual impact on token use rendered plans like this might have. IIRC we don't have a way of measuring this clearly through gevals yet but I wouldn't be surprised if we ended up asking for plans like this to be converted into code based solutions with simple returns to the calling agent and model.
|
/hold |
| return api.NewToolCallResult("", err), nil | ||
| } | ||
|
|
||
| dynamicClient := params.DynamicClient() |
There was a problem hiding this comment.
What is the reason for removing it from here and putting the dynamicClient in each switch-case?
This is on the roadmap, but not implemented yet (sorry 😞 ) |
| } | ||
| message = "# VirtualMachine restarted successfully\n" | ||
|
|
||
| case ActionTroubleshoot: |
There was a problem hiding this comment.
IMO this is more of a MCP prompt than a tool. We have added support for toolsets to define their own prompts (see https://github.com/containers/kubernetes-mcp-server/blob/main/docs/PROMPTS.md#toolset-prompts)
There was a problem hiding this comment.
TIL looks like this dropped while I was out, thanks!
@ksimon1 are you able to rework this to use a prompt? Would be good to rebase and land this in the next 2 weeks with some basic eval coverage.
a2f9f5c to
d63a12e
Compare
d63a12e to
86b9339
Compare
86b9339 to
c5d4a7e
Compare
c5d4a7e to
7d32c5c
Compare
|
/retest |
| Follow these steps to diagnose issues with the VirtualMachine: | ||
|
|
||
| ## Step 1: Check VirtualMachine Status | ||
| Use resources_get with apiVersion=kubevirt.io/v1, kind=VirtualMachine, namespace=%s, name=%s |
There was a problem hiding this comment.
Some ideas that float into my head that might or might not apply:
The prompt could probably benefit from some dynamic content injection (as defined in Claude Skills).
The idea here would be to execute the relevant queries ourselves and injecting the desired result, instead of delegating the task to the model and avoiding the extra round-trips and overhead. (I believe this is a pattern we're already following with the cluster-health-check prompt).
I would use this dynamic content injection at least for those requests that we know beforehand that will have to be performed by the model.
Regarding token usage, I understand the initial toll of this pattern should be higher, but it should compensate the extra overhead and roundtrips the model+agent would need to complete the actual troubleshooting task.
Thoughts?
There was a problem hiding this comment.
IIUC you mean to prepopulate the prompt text with basic information like vm/vmi manifest, pod description, ...?
There was a problem hiding this comment.
@manusa would you please review the new code, which injects the content to the prompt?
There was a problem hiding this comment.
IIUC you mean to prepopulate the prompt text with basic information like vm/vmi manifest, pod description, ...?
Yes, exactly
@manusa would you please review the new code, which injects the content to the prompt?
Sorry, last Friday I was completely focused on the code mode feature and demo. Let me check this now.
There was a problem hiding this comment.
I'm not sure how the compiled prompt looks like, but you're definitely doing what I was suggesting.
This should prevent the LLM from doing multiple roundtrips which we know beforehand it's going to try because we were instructing it to do so.
7d32c5c to
78a4c31
Compare
| kubectl delete namespace "$NS" --ignore-not-found | ||
| prompt: | ||
| inline: |- | ||
| There is a VirtualMachine named "broken-vm" in the ${EVAL_NAMESPACE:-vm-test} namespace that is not working correctly. |
There was a problem hiding this comment.
Have you tried running this with mcpchecker? IIRC it no longer supports bash substitutions, something we can address in the project but it will lead to this being passed directly to the agent/model in it's current form.
There was a problem hiding this comment.
The mcpchecker passed. Since this substitution is in all kubevirt's eval tasks, I would update it in different PR in all tasks.
| echo "" | ||
| echo "=== Troubleshooting Eval Complete ===" | ||
| echo "The agent should have:" | ||
| echo " 1. Used the vm-troubleshoot prompt with namespace=$NS and name=broken-vm" |
There was a problem hiding this comment.
This is also missing from the Task API at the moment IMHO, we can define this in Evals but I also think each Task should be able to assert that tools and/or prompts are called.
There was a problem hiding this comment.
@lyarwood +1 here - we have an open discussion trying to figure out how we want to solve this: mcpchecker/mcpchecker#126
Interested in hearing if you have any thoughts 😄
…urces (gvr.go) This will help in next commit to not duplicate GVRs and GVKs. Signed-off-by: Karel Simon <ksimon@redhat.com>
…tics
Add a new "vm-troubleshoot" MCP prompt to the kubevirt toolset that generates
a step-by-step troubleshooting guide for diagnosing VirtualMachine issues.
The prompt:
- Provides a structured diagnostic plan for VMs
- Guides the AI through checking VM, VMI, DataVolumes, PVCs, pods, and events
- Includes a summary template for reporting findings
- Tries to fix the VM state
This is implemented as an MCP Prompt (not a tool action).
Code was assisted by Cursor AI
Signed-off-by: Karel Simon <ksimon@redhat.com>
78a4c31 to
8aab0e2
Compare
manusa
left a comment
There was a problem hiding this comment.
The prompt logic looks good to me, thx!
For the eval part, I think Calum should give his blessing.
Cali0707
left a comment
There was a problem hiding this comment.
Evals look fine to me, thanks @ksimon1 !
Only thing to note is that we are hoping to move onto the new task format and off of using bash scripts as much as possible. So in the future, we will need to rework this to e.g. leverage the kubernetes extension
ACK thanks for the reminder, something for a follow up. I can create an issue if you want to us to track this? |
That would be helpful! |
|
So can we merge this PR? |
Merged 🚀, thx! |
Add a new "troubleshoot" action to the vm_lifecycle tool that generates a step-by-step troubleshooting guide for diagnosing VirtualMachine issues.
The troubleshoot action:
This helps AI assistants systematically diagnose VM issues by providing structured instructions on which MCP tools to use and how to interpret the results.
Original credit goes to @lyarwood
Code was assisted by Cursor AI
Signed-off-by: Karel Simon ksimon@redhat.com