Teleport leaks file descriptors #1433


Closed
halfa opened this issue Oct 27, 2017 · 13 comments
@halfa
Contributor

halfa commented Oct 27, 2017

I'm using Teleport for a particular use case: channeling Ansible playbooks through it for hundreds of machines at a time. During playbook runs, the bastion's teleport process piles up file descriptors linearly and frequently reaches the maximum allowed by the systemd unit file, crashing all sessions and becoming unresponsive until I restart the daemon.

I first considered increasing the limit above the systemd default of 1024, but I'm now at 16386 and have started to suspect Teleport's own descriptor handling. I recognise that my usage is a bit... unconventional, as the Ansible output is logged by Teleport itself.

Has anyone encountered this issue before? I intend to do a bit of research on this topic in the coming months, but I would appreciate insights from any seasoned Teleport developers.

@kontsevoy kontsevoy added this to the 2.4.0 milestone Nov 3, 2017
@kontsevoy
Contributor

@halfa elevating this to the top, thanks for reporting.

@kontsevoy
Contributor

kontsevoy commented Nov 3, 2017

@halfa The Teleport proxy needs multiple handles per connection (at the very least one for the client socket and one for the server's, but in reality it creates more than that), and they may not be released immediately because of "lingering sessions" (kept around for bad connections) and because Go is garbage-collected. Can you see if you can reproduce this with a higher ulimit?

@kontsevoy kontsevoy removed this from the 2.4.0 milestone Nov 3, 2017
@kontsevoy kontsevoy removed the P0 label Nov 3, 2017
@kontsevoy kontsevoy assigned kontsevoy and unassigned klizhentas Nov 3, 2017
@halfa
Contributor Author

halfa commented Nov 6, 2017

Hi, sorry for the delay. I'm back with more info.

@kontsevoy It seems normal to me for Teleport to use a large number of fds during heavy usage (because it needs to log everything), but it should free them when idle.

I'm now running with NOFILE at 32384 (soft & hard). Teleport ate up to 26781 file descriptors during a big run. I let it idle for 48h, and it closed only around a thousand of them.

I can provide some stats (around 200 remote nodes):

Regex           Open file descriptors
session.log     12606
session.bytes   12606
ESTABLISHED     584

If this is considered "normal" usage, I don't see how we can scale this thing up :/

@klizhentas
Contributor

@halfa I will work with you on troubleshooting this. Let's start with instrumentation:

https://github.com/gravitational/teleport/tree/master/assets/monitoring

After the run, can you give me a diff of the goroutine dumps:

https://github.com/gravitational/teleport/tree/master/assets/monitoring#diffing-goroutine-dumps

so we can see what's leaking?

@halfa
Contributor Author

halfa commented Nov 7, 2017

@klizhentas thanks for the support.
I fired up the monitoring stack and ran an average playbook. Here are the results:

change from /tmp/a to /tmp/b
/gopath/src/github.com/gravitational/teleport/lib/auth/tun.go:198 :  383
/gopath/src/github.com/gravitational/teleport/lib/auth/tun.go:406 :  383
/gopath/src/github.com/gravitational/teleport/lib/sshutils/server.go:235 :  386
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:134 :  385
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:135 :  385
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/mux.go:128 :  385
/opt/go/src/net/http/server.go:2668 :  382
/opt/go/src/net/http/server.go:652 :  373
not in a
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:208 :  2
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:217 :  2
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:312 :  2
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:321 :  2
/gopath/src/github.com/gravitational/teleport/lib/srv/sshserver.go:1064 :  4
/gopath/src/github.com/gravitational/teleport/lib/srv/sshserver.go:728 :  4
not in b
total diff: 3078

And the Grafana dashboard. The open-files metrics are consistent with my own findings.

[grafana dashboard screenshot]

@halfa
Contributor Author

halfa commented Nov 7, 2017

The playbook's last run (starting after 11:00 and finishing at 11:05) is actually a small file-copy job using the template task, if you want to reproduce it on your side.

@halfa
Contributor Author

halfa commented Nov 7, 2017

Here is another data sample after a few more runs, at 15K descriptors / 7K open files. Let's dub this state (c). Here are the diffs between (c) and the starting state (a) from earlier...

# python gops.py diff /tmp/a /tmp/c
change from /tmp/a to /tmp/c
/gopath/src/github.com/gravitational/teleport/lib/auth/tun.go:198 :  327
/gopath/src/github.com/gravitational/teleport/lib/auth/tun.go:406 :  328
/gopath/src/github.com/gravitational/teleport/lib/sshutils/server.go:235 :  332
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:134 :  331
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:135 :  331
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/mux.go:128 :  331
/opt/go/src/net/http/server.go:2668 :  328
/opt/go/src/net/http/server.go:652 :  328
not in a
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:208 :  3
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:217 :  3
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:312 :  3
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:321 :  3
/gopath/src/github.com/gravitational/teleport/lib/srv/sshserver.go:1064 :  6
/gopath/src/github.com/gravitational/teleport/lib/srv/sshserver.go:728 :  6
not in b
total diff: 2660

And here's another between (c) and the end of the first run (b):

# python gops.py diff /tmp/b /tmp/c
change from /tmp/b to /tmp/c
/gopath/src/github.com/gravitational/teleport/lib/auth/tun.go:198 :  -56
/gopath/src/github.com/gravitational/teleport/lib/auth/tun.go:406 :  -55
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:208 :  1
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:217 :  1
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:312 :  1
/gopath/src/github.com/gravitational/teleport/lib/srv/proxy.go:321 :  1
/gopath/src/github.com/gravitational/teleport/lib/srv/sshserver.go:1064 :  2
/gopath/src/github.com/gravitational/teleport/lib/srv/sshserver.go:728 :  2
/gopath/src/github.com/gravitational/teleport/lib/sshutils/server.go:235 :  -54
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:134 :  -54
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:135 :  -54
/gopath/src/github.com/gravitational/teleport/vendor/golang.org/x/crypto/ssh/mux.go:128 :  -54
/opt/go/src/net/http/server.go:2668 :  -54
/opt/go/src/net/http/server.go:652 :  -45
not in a
not in b
total diff: -418

@klizhentas
Contributor

@halfa can you show me the script so I can reproduce?

@halfa
Contributor Author

halfa commented Nov 7, 2017

Basically it's just:

  • run tsh agent
  • export vars
  • run the playbook with ansible-playbook -i <inventory> -e ansible_ssh_user=root -e ansible_ssh_port=<teleport-ssh> <playbook>.yml

I'll try to come up with a minimal example tomorrow.

@klizhentas
Contributor

ok, thanks

@klizhentas klizhentas assigned klizhentas and unassigned kontsevoy Nov 7, 2017
@halfa
Contributor Author

halfa commented Nov 9, 2017

OK, here it is:

  • teleport cluster, hosts and nodes at 2.3.5-stable, run with debug on the cluster master
  • ansible at 2.4.0.0
  1. Copy the following playbook into a file named teleport-ofleak-minimal-example.yml:
---
- hosts: all
  gather_facts: no
  user: root

  tasks:
    - name: copy a local file to remote
      template:
        src: /tmp/teleport-test
        dest: /tmp/teleport-test
        owner: root
        group: root
        mode: 0644
    - name: remove remote file
      file:
        dest: /tmp/teleport-test
        state: absent
  2. Copy a file to /tmp/teleport-test on your machine. This file will be sent to all the other hosts. I used Teleport's README.md.

  3. [optional] If, like me, you use GNOME and have a broken ssh-agent that doesn't handle certificates, run tsh agent and export its variables to the environment.

  4. Run the playbook:

ansible-playbook -i <inventory> -e ansible_ssh_user=<user> -e ansible_ssh_port=<teleport_ssh_node_daemon> teleport-ofleak-minimal-example.yml

This generated 1.2K uncleaned FDs on my setup.

@klizhentas
Contributor

Thank you! I will try to reproduce it using your example and get back to you.

@klizhentas klizhentas added the P1 label Nov 11, 2017
@klizhentas
Contributor

I have reproduced the bug, working on a fix.

klizhentas added a commit that referenced this issue Nov 16, 2017
hatched pushed a commit that referenced this issue Jan 11, 2023
* Change placement of info content for SetupAccessWrapper
* Create AnimatedProgressBar
* Allow fixed labels for LabelsCreater
* Fix a regression where events were getting passed
  down to nextStep function on button click (expected number)
* Add matching label rule for DB DownloadScript
* Add a modal loader for CreateDatabase (also add port input field)
    - Add countdown and error to this modal
hatched pushed a commit that referenced this issue Jan 30, 2023
* Refactor DownloadScript Screens (#1367)
* Tentatively implement Create Database screen (#1372)
* Refactor TestConnection Screens (#1375)
* Database Tweaks and Add ons (#1412)
* Implement mutual TLS screen (#1418)
* Add all db options to db selector (#1441)
* Tweaks based on design review and regression fixes (#1433)
* Implement the IAM policy screen (#1459)
* Add database service checker and various db tweaks (#1481)
* Temp remove db service checker until bug is fixed (#1495)

Co-authored-by: Ryan Clark <[email protected]>