Feature Request - Integrate Retina.sh ~> Capture Feature in AKS VS Code Extension. #569

Closed
Tatsinnit opened this issue Mar 19, 2024 · 7 comments · Fixed by #624
Labels
enhancement 🚀 New feature or request or improvements on existing code.

Comments

@Tatsinnit
Member

Tatsinnit commented Mar 19, 2024

💡☕️ Proposal for a feature to integrate the Retina.sh capture command for network packets. - https://retina.sh

To improve network observability and the AKS dev experience, this proposal is to expose the Retina.sh Capture feature via the VS Code extension.

The Retina capture command lets the user capture network traffic and metadata for the capture target, and then sends the capture file to the location specified by the output configuration.

To conceptualise this aspect of the proposal:

Screenshot 2024-03-20 at 10 57 51 AM

cc: @peterbom, @sprab, @gambtho fyi...

How/ What / Where:

Given that we now have a few network diagnostic / user experience / dev-ops / runoffs experiences, we should bundle them into a dedicated menu item with these submenu options:

  • Collect TCP-Dump (Linux support) (Already present in vscode)
  • Deploy/Undeploy and use Inspector Gadget commands like trace, profile et al. (eBPF goodness) (Already present in vscode)
  • Retina.sh -> Distributed Network Capture (Intent for this proposal)

Implementation level raw detail:

Capture Details:

Retina Capture allows users to capture network traffic/metadata for the specified Nodes/Pods.

Captures are on-demand and can be output to the host filesystem, a storage blob, etc.

More detail here: https://retina.sh/docs/captures/

Steps to install:

VS Code will first download the latest release. We can leverage a mechanism similar to the one we use for the Inspector Gadget tool. Installation reference: https://retina.sh/docs/installation/cli

Scenarios it will benefit:
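As a rough illustration of the download step, here is a minimal sketch of choosing the right release asset for the user's platform. It assumes asset file names contain the OS and architecture as substrings; the `pickAsset` helper and the example file names are hypothetical and are not taken from the Retina release page.

```typescript
// Pick the first release asset whose file name mentions both the OS and the
// CPU architecture. Returns undefined when nothing matches.
// Assumption: asset names embed the OS and architecture as plain substrings.
function pickAsset(assetNames: string[], os: string, arch: string): string | undefined {
    for (const name of assetNames) {
        const lower = name.toLowerCase();
        if (lower.indexOf(os) !== -1 && lower.indexOf(arch) !== -1) {
            return name;
        }
    }
    return undefined;
}

// Hypothetical asset names, for illustration only.
const exampleAssets: string[] = [
    "kubectl-retina-linux-amd64.tar.gz",
    "kubectl-retina-darwin-arm64.tar.gz",
    "kubectl-retina-windows-amd64.zip",
];

const chosenAsset = pickAsset(exampleAssets, "darwin", "arm64");
```

Substring matching keeps the sketch short, but a real implementation would need stricter matching, since a short architecture string can also match a longer one.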

//To Do...

Thanks.

@Tatsinnit Tatsinnit self-assigned this Mar 19, 2024
@Tatsinnit Tatsinnit added the enhancement 🚀 New feature or request or improvements on existing code. label Mar 19, 2024
@peterbom
Contributor

I'm keen to aim towards (not immediately, but as a future ideal) an experience where the user flow goes something like:

  • scenario (e.g. networking problem)
    • action (e.g. packet capture)
      • tool for performing that action (e.g. tcpdump or retina capture).

Does that fit with your thoughts, @sprab?

I have concerns about the entry point being the specific tool we're using, because that feels like the reverse flow, and can lead to user confusion around which tool they should be using.

Obviously we need to find ways to work towards that end goal in manageable chunks of work, and it is not realistic at this stage to build a scenario-based diagnostics tool that abstracts away the tooling.

But I think keeping the end goal in mind is important, because we need to track whether we're moving towards it or away from it, and I worry that putting the tools (like retina) as the entrypoints to the experience will create more work for us in the future.

My preference at this stage would be to incorporate retina into the existing tcp dump experience. That way, we have at least part of the scenario->action->tool flow:

  • action (packet capture)
    • tool (retina / tcpdump)

That would also minimize the risk of causing confusion for users: "why do I have two commands for running packet captures?"
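The scenario -> action -> tool flow described above could be modelled roughly as follows. This is only a sketch: the type names and the `toolsFor` helper are hypothetical and do not reflect anything in the extension today.

```typescript
// The two packet-capture tools discussed in this thread.
type Tool = "tcpdump" | "retina";

// An action the user wants to perform, and the tools able to perform it.
interface Action {
    name: string;
    tools: Tool[];
}

// A user-facing scenario, made up of one or more actions.
interface Scenario {
    name: string;
    actions: Action[];
}

const networkingProblem: Scenario = {
    name: "networking problem",
    actions: [{ name: "packet capture", tools: ["tcpdump", "retina"] }],
};

// Return the tools that can be offered for an action within a scenario,
// or an empty list when the scenario has no such action.
function toolsFor(scenario: Scenario, actionName: string): Tool[] {
    for (const action of scenario.actions) {
        if (action.name === actionName) {
            return action.tools;
        }
    }
    return [];
}
```

With this shape the entry point is the scenario or action, and the choice between tcpdump and Retina becomes a detail under a single "packet capture" command.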

@sprab
Collaborator

sprab commented Mar 20, 2024

@peterbom - the network troubleshooting workflow is not a replacement for tcpdump. IMHO we should preserve the tcpdump tool as is, so that customers can capture traces and analyze them. The network troubleshooting scenario would leverage tcpdump and/or Retina.

The workflow I am thinking of (I will provide a pictorial version of it soon) is to first create problem identifiers, viz. missing UDR, custom DNS servers, NSG changes, route changes, firewall changes etc., and then combine and order them in a workflow to narrow the problem down. We can then list the scenarios and the order in which these tools can be run.
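One way to picture the ordered problem identifiers is sketched below. Everything here is an assumption made for illustration: the `ClusterFacts` fields, the checks, and the ordering are hypothetical placeholders, not a proposed design.

```typescript
// Hypothetical snapshot of facts gathered about a cluster's network setup.
interface ClusterFacts {
    hasUdr: boolean;
    usesCustomDns: boolean;
    nsgChangedRecently: boolean;
}

// A named check that reports whether a known problem pattern applies.
interface ProblemIdentifier {
    name: string;
    matches: (facts: ClusterFacts) => boolean;
}

// Identifiers listed in the order in which they should be evaluated.
const workflow: ProblemIdentifier[] = [
    { name: "missing UDR", matches: (facts) => !facts.hasUdr },
    { name: "custom DNS servers", matches: (facts) => facts.usesCustomDns },
    { name: "NSG changes", matches: (facts) => facts.nsgChangedRecently },
];

// Run every identifier in order and return the names of those that match.
function narrowDown(facts: ClusterFacts, steps: ProblemIdentifier[]): string[] {
    return steps.filter((step) => step.matches(facts)).map((step) => step.name);
}
```

Each matching identifier could then point to the scenario and tools (tcpdump and/or Retina) worth running next.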

@Tatsinnit Tatsinnit pinned this issue Mar 20, 2024
@vakalapa

I agree with @sprab here: Retina distributed captures are not trying to replace tcpdump. Instead, they make it extremely intuitive to use Kubernetes language for capturing the traffic of interest on multiple nodes at once. For example:

kubectl retina capture create --host-path /mnt/capture --namespace default --pod-selectors="k8s-app=kube-dns" --namespace-selectors="kubernetes.io/metadata.name=kube-system"

Retina works out which nodes are running the pods with the kube-dns (CoreDNS) label in the kube-system namespace and what their IPs are, then starts a capture on all of those nodes simultaneously, limited to those IPs. This way the rest of the traffic is untouched and secure, the admin gets all the relevant captures from every node with a single click, and there is no need to coordinate timing by running several terminals at once.
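For the extension side, a minimal sketch of assembling that command from user input might look like this. It uses only the flags shown in the example above; the `CaptureOptions` shape and the helper names are hypothetical, and joining multiple labels with commas follows the usual Kubernetes label-selector convention, which is assumed rather than confirmed for Retina.

```typescript
// Hypothetical option shape; field names are illustrative, not part of Retina.
interface CaptureOptions {
    hostPath: string;
    namespace: string;
    podSelectors: { [label: string]: string };
    namespaceSelectors: { [label: string]: string };
}

// Turn a label map such as { "k8s-app": "kube-dns" } into "k8s-app=kube-dns".
function toSelector(labels: { [label: string]: string }): string {
    return Object.keys(labels)
        .map((key) => `${key}=${labels[key]}`)
        .join(",");
}

// Build the kubectl argument list for `kubectl retina capture create`.
function buildCaptureArgs(options: CaptureOptions): string[] {
    return [
        "retina", "capture", "create",
        "--host-path", options.hostPath,
        "--namespace", options.namespace,
        `--pod-selectors=${toSelector(options.podSelectors)}`,
        `--namespace-selectors=${toSelector(options.namespaceSelectors)}`,
    ];
}

// Reproduces the arguments of the example command above.
const exampleArgs = buildCaptureArgs({
    hostPath: "/mnt/capture",
    namespace: "default",
    podSelectors: { "k8s-app": "kube-dns" },
    namespaceSelectors: { "kubernetes.io/metadata.name": "kube-system" },
});
```

Returning an argument array rather than a single string avoids shell-quoting problems when the arguments are later handed to a process-spawning API.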

@peterbom
Contributor

@sprab:

network troubleshooting workflows is not a replacement of tcpdump

I agree - I was thinking it would include, but not be limited to, packet capture (using tcpdump and/or Retina). But the overall network troubleshooting project is more of a future thing, and we're hoping to incorporate Retina in the shorter term. Am I getting this right?

@vakalapa:

Retina distributed captures is not trying to replace tcpdump

I'm actually not totally sure about this. You say:

instead it is making it extremely intuitive to use kubernetes language for packet capturing interested traffic in multiple nodes at once

That's exactly what we've been trying to achieve with the tcpdump functionality (although not very well with the multiple nodes part). I see a lot of overlap here, and Retina looks to me as if it provides almost a superset of what we've already built.

@rbtr

rbtr commented Mar 26, 2024

more platforms for kubectl-retina published in the latest release https://github.com/microsoft/retina/releases/tag/v0.0.2
we will also be adding to Krew microsoft/retina#108

@Tatsinnit
Member Author

Tatsinnit commented Mar 26, 2024

more platforms for kubectl-retina published in the latest release https://github.com/microsoft/retina/releases/tag/v0.0.2 we will also be adding to Krew microsoft/retina#108

Nice, and thanks @rbtr, pretty cool. Now we can make the download part of the initial VS Code drop across the different OSes et al. ❤️🙏

@sprab
Collaborator

sprab commented Apr 8, 2024

Short term use case scenarios:

Customer Scenario 1 - VS Code One Click: Collect Distributed Capture - Day 0

Issue: Intermittent connectivity issues from external to the application running within a pod.

The customer is unsure where the problem lies within the AKS cluster.
Support asks the customer to run a TCP capture, along with other network statistics, while the issue is happening.
Support shares a bunch of commands (tcp capture, netstat, iptables, socket-stats) to be run from within a node/pod.

Customer: Do I have to run all these commands from within a Node/Pod? Could you please help me do that?

Support: I will share them in an email. Could you run them while the issue is happening?

Customer: The issue comes and goes so quickly that it might disappear before I can run all these commands, so we would not capture the required traces/logs. Is there an easy way to do this with a script or something similar?

Support: We can leverage the VS Code extension, which lets you capture all these traces/logs in a single click and saves you a lot of time and effort in capturing the logs and sharing them with us.

Customer: Great, let me run them and share the traces/logs with you :)

Customer Scenario 2 - Analyze the kube-proxy behaviour on a particular node using iptables output from Retina

The customer runs a one-click operation from the VS Code extension to capture the required traces/logs from the targeted node/pod for analysis.

6 participants