
Sasha/exp2 #1022

Merged
merged 15 commits into master from sasha/exp2 on May 30, 2017
Conversation

@klizhentas (Contributor) commented May 28, 2017

Fixes #933

Description

This PR contains the implementation of the new log forwarder.

Some important properties of this implementation:

  • Introduces gRPC- and protobuf-based serialization. My plan is to release a gRPC-based API in future versions of Teleport; for now I only use protobuf as the payload encoding. I settled on the gogoproto marshaler because it is much faster (see benchmarks here).

  • Without back-pressure on posting session chunks, the audit log was losing events: producers were much faster than consumers, and the buffer overflowed even when set to a very large value, e.g. 10K. Back-pressure kicks in on large outputs, e.g. cat /dev/random, and usually paces the output with ~1ms delays while the log server flushes to disk, causing occasional 10-15ms latency spikes. See the sketch after this list.

  • A throttle is important so the session can continue if the audit log gets slow; otherwise the session output would block and time out on every request.

  • It is important to pack chunks, because ls / -laR | lolcat would otherwise generate about 10K requests per second. With this packing approach we reduced that to about 40-50 requests per second, and we can now tune this further via the queue size and flush buffers.

  • The current implementation attaches an audit log forwarder per session. To make sure this is not too expensive, I tested with 100 parallel sessions; memory usage was about 100MB.

  • One could be concerned about the channel and goroutine per audit forwarder, however I would not worry about that for now: the overhead per channel is substantial, but not critical.

  • The new implementation will make it easy to introduce "audit-mandatory" sessions that freeze if events can't be sent to the audit log, however I did not implement those here.
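
A minimal sketch of the bounded-queue pattern described in the bullets above (back-pressure, throttle, and chunk packing). All names here (Chunk, SessionSlice, Forwarder, flushToAuditLog) and the timing values are hypothetical illustrations, not the PR's actual code:

package main

import (
	"fmt"
	"time"
)

// Chunk is a hypothetical stand-in for a piece of captured session output.
type Chunk struct {
	Data []byte
}

// SessionSlice packs many chunks into a single request to the audit log.
type SessionSlice struct {
	Chunks []Chunk
}

// Forwarder owns a bounded queue; a full queue is what creates back-pressure.
type Forwarder struct {
	queue chan Chunk
}

func NewForwarder(queueSize int) *Forwarder {
	f := &Forwarder{queue: make(chan Chunk, queueSize)}
	go f.flushLoop()
	return f
}

// PostChunk blocks when the queue is full (back-pressure). The timeout is the
// throttle: if the audit log is too slow, the session is not stalled forever.
func (f *Forwarder) PostChunk(c Chunk) error {
	select {
	case f.queue <- c:
		return nil
	case <-time.After(100 * time.Millisecond):
		return fmt.Errorf("audit log is too slow, dropping chunk")
	}
}

// flushLoop periodically drains whatever has accumulated and posts it as one
// packed slice, turning thousands of tiny writes into tens of requests per second.
func (f *Forwarder) flushLoop() {
	ticker := time.NewTicker(20 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		var slice SessionSlice
	drain:
		for {
			select {
			case c := <-f.queue:
				slice.Chunks = append(slice.Chunks, c)
			default:
				break drain
			}
		}
		if len(slice.Chunks) == 0 {
			continue
		}
		flushToAuditLog(slice)
	}
}

// flushToAuditLog stands in for the call that posts the packed slice.
func flushToAuditLog(s SessionSlice) {
	fmt.Printf("flushed %d chunks\n", len(s.Chunks))
}

func main() {
	f := NewForwarder(1024)
	for i := 0; i < 5000; i++ {
		_ = f.PostChunk(Chunk{Data: []byte("output")})
	}
	time.Sleep(200 * time.Millisecond)
}

The queue size and the flush interval are the two knobs the description refers to: a larger queue absorbs bursts before back-pressure kicks in, and a shorter flush interval trades request count for latency.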

Thoughts

I spent several days doing benchmarks and various load tests; as a result I have refined the dashboard and added various metrics outputs to the Teleport Prometheus endpoint (a sketch of exposing such a metric follows the list below). Here are some thoughts on overall Teleport performance:

  • We are creating lots of requests to the auth server per interactive session that could hit the local cache instead.
  • We are creating lots of goroutines for interactive sessions (30 goroutines per session).
  • From the CPU standpoint, with many concurrent session opens and closes, handshakes are by far the most expensive part (they account for about 90% of the CPU time).
  • In the case of lolcat, high CPU usage is mostly caused by SSH-driven IO. gogoproto completely removed chunk encoding/decoding from the CPU profile; once I migrate the events stream to protobuf as well, this will be even less of an issue.
  • My planned migration of the auth server API to gRPC will significantly improve our performance, as SSH over HTTP is quite slow.
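
On the metrics point, a hedged sketch of exposing one such measurement on a Prometheus endpoint in Go; the metric name, buckets, and port below are made up for illustration and are not necessarily the ones the PR adds:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// flushLatency tracks how long it takes to flush a packed slice of session
// chunks to the audit log; name and buckets here are illustrative only.
var flushLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "audit_log_flush_latency_seconds",
	Help:    "Time spent flushing packed session chunks to the audit log.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 10), // 1ms .. ~0.5s
})

func main() {
	prometheus.MustRegister(flushLatency)

	// in the flush path one would wrap the write, e.g.:
	// timer := prometheus.NewTimer(flushLatency)
	// defer timer.ObserveDuration()

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":3434", nil)
}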

@klizhentas requested review from kontsevoy and russjones May 28, 2017 23:02
} else {
	done := make(chan bool)
	go func() {
		err := client.SSH(ctx, nil, false)
Contributor:

why bother declaring err?
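
A sketch of what this comment may be suggesting, assuming done only exists to hand the SSH result back to a waiting caller (the rest of the diff is not visible here):

} else {
	done := make(chan error, 1)
	go func() {
		// report the SSH result directly instead of keeping a local err
		done <- client.SSH(ctx, nil, false)
	}()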

func (l *AuditLog) PostSessionChunk(namespace string, sid session.ID, reader io.Reader) error {
	if namespace == "" {
// PostSessionChunks writes a new chunk of session stream into the audit log
func (l *AuditLog) PostSessionSlice(slice SessionSlice) error {
Contributor:

comment doesn't match the function.

		return trace.BadParameter("missing parameter Namespace")
	}
	sl, err := l.LoggerFor(namespace, sid)
	if len(slice.Chunks) == 0 {
Contributor:

why bother? the function appears to be safe to run on empty slice. it won't do anything.

Contributor (Author):

LoggerFor opens a file, so I did not want to open a file on the server when we already know the call won't do anything.
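
Pieced together from the fragments above, the check under discussion sits roughly like this; the slice field names (Namespace, SessionID) and the final calls are guesses for illustration, not the exact diff:

// PostSessionSlice writes a packed slice of session chunks into the audit log.
func (l *AuditLog) PostSessionSlice(slice SessionSlice) error {
	if slice.Namespace == "" {
		return trace.BadParameter("missing parameter Namespace")
	}
	// Return early on an empty slice: LoggerFor opens the session file on the
	// server, and there is no point doing that when there is nothing to write.
	if len(slice.Chunks) == 0 {
		return nil
	}
	sl, err := l.LoggerFor(slice.Namespace, slice.SessionID)
	if err != nil {
		return trace.Wrap(err)
	}
	return sl.PostSessionSlice(slice)
}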

@klizhentas (Contributor, Author): retest this please

1 similar comment
@klizhentas (Contributor, Author): retest this please

@klizhentas merged commit 9ab4931 into master May 30, 2017
@klizhentas deleted the sasha/exp2 branch May 30, 2017 23:24
hatched pushed a commit to hatched/teleport-merge that referenced this pull request Nov 30, 2022
hatched pushed a commit that referenced this pull request Dec 20, 2022
Successfully merging this pull request may close these issues: Stress testing the ssh output failed.