Teleport leak file descriptors #1433
Comments
@halfa elevating this to the top, thanks for reporting. |
@halfa Teleport proxy needs multiple handles per connection (at the very least it will use one for the client socket and one for the server's, but in reality it creates more than that), and they may not get released immediately because of the need for "lingering sessions" (for bad connections) and because Go is garbage-collected. Can you see if you can reproduce this with a higher ulimit? |
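For anyone following along, here is a minimal sketch of how the descriptor count of a running process can be compared with its limit on Linux. It assumes the standard `/proc/<pid>/fd` and `/proc/<pid>/limits` layout; the function names and the `proc_root` parameter are hypothetical helpers introduced for illustration, not part of Teleport.

```python
import os
from typing import Optional, Tuple


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count entries under <proc_root>/<pid>/fd (one entry per open descriptor)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def nofile_limits(pid: int, proc_root: str = "/proc") -> Tuple[Optional[int], Optional[int]]:
    """Return the (soft, hard) 'Max open files' limits; None means unlimited."""
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns after the three-word label: soft limit, hard limit, units.
                soft, hard = line.split()[3:5]
                return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")
```

Calling `count_open_fds(pid)` and `nofile_limits(pid)` with the PID of the teleport process (as root or as the user that owns the process) before and after a playbook run shows how close the daemon gets to its limit.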
Hi, sorry for the lag. I'm back with more info. @kontsevoy It seems normal to me for I'm now running at I can provide some stats (around 200 remote nodes):
If this is considered "normal" usage, I don't see how we can scale this thing up :/ |
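A rough way to see what kind of descriptors are piling up (sockets versus regular files, for instance) is to group the symlink targets under `/proc/<pid>/fd`. This is only a sketch under the assumption of a Linux `/proc` filesystem; `classify_target` and `fd_kinds` are hypothetical names, and the categories are deliberately coarse.

```python
import os
from collections import Counter


def classify_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    if target.startswith("/"):
        return "file"
    return "other"


def fd_kinds(pid: int, proc_root: str = "/proc") -> Counter:
    """Count the open descriptors of a process, grouped by kind."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        kinds[classify_target(target)] += 1
    return kinds
```

If the `file` count keeps growing between runs while `socket` stays flat, the leak is more likely in file handling than in connection handling.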
@halfa I will work with you on troubleshooting this. Let's start with instrumentation: https://github.com/gravitational/teleport/tree/master/assets/monitoring After the run, can you give me the diff of the goroutine dumps, so we can see what's leaking? |
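One way to diff two goroutine dumps is to count goroutines by their topmost function and report what grew. The sketch below assumes the dumps are in the Go runtime's plain-text traceback format (blocks starting with `goroutine N [state]:`, separated by blank lines); how the dumps are collected from Teleport is not covered here, and `top_frames` and `dump_growth` are hypothetical helper names.

```python
from collections import Counter


def top_frames(dump: str) -> Counter:
    """Count goroutines by their topmost function in a plain-text dump."""
    counts = Counter()
    for block in dump.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 2 or not lines[0].startswith("goroutine "):
            continue
        frame = lines[1].strip()
        cut = frame.rfind("(")  # drop the argument list, keep the function name
        counts[frame[:cut] if cut > 0 else frame] += 1
    return counts


def dump_growth(before: str, after: str) -> dict:
    """Return {function: increase} for functions with more goroutines in `after`."""
    # Counter subtraction keeps only strictly positive differences.
    return dict(top_frames(after) - top_frames(before))
```

Functions whose count rises with every playbook run are the first places to look for a handle that is never closed.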
@klizhentas thanks for the support
And the grafana dashboard. Open files metrics are consistent with my own findings. |
The playbook's last run (starting after 11:00 and finishing at 11:05) is actually a small file copy job with |
Here is another data sample, after a few more runs, at 15K descriptors/7K open files. Let's dub this state (c). I've put the diffs between (c) and the starting state (a) from earlier...
And here's another between (c) and the end of first run (b):
|
@halfa can you show me the script so I can reproduce? |
Basically it's just:
I'll try to come up with a minimal example tomorrow. |
ok, thanks |
OK, here it is:
---
- hosts: all
  gather_facts: no
  user: root
  tasks:
    - name: copy a local file to remote
      template:
        src=/tmp/teleport-test
        dest=/tmp/teleport-test
        owner=root
        group=root
        mode=0644
    - name: remove remote file
      file:
        dest: /tmp/teleport-test
        state: absent
ansible-playbook -i <inventory> -e ansible_ssh_user=<user> -e ansible_ssh_port=<teleport_ssh_node_daemon> teleport-ofleak-minimal-example.yml

This generated 1.2K uncleaned FDs on my setup. |
thank you! I will try to reproduce it using your example and get back to you |
I have reproduced the bug, working on a fix. |
fix audit log file leak, fixes #1433
I'm using teleport for a particular use case: channeling ansible playbooks through it for hundreds of machines at a time. During playbooks, the bastion's teleport process piles up file descriptors in a linear manner and frequently reaches the maximum allowed by the systemd unit file, crashing all sessions and becoming unresponsive until I restart the daemon.

I first considered increasing the limit above the systemd default of 1024, but I'm now at 16386 and have started to suspect teleport's handling itself. I recognise that my usage is a bit... unconventional, as Ansible output is logged by teleport itself.

Has anyone encountered this issue before? I intend to do a bit of research on this topic in the coming months, but I would appreciate insights from any seasoned teleport developers.