Performance improvements #334
In general, you do not actually want to store variable-length buffers in a sync.Pool for later reuse (see golang/go#23199), because many of the long-capacity buffers will never be released, even when usage does not require such a long buffer. Using a global buffer is prone to threading issues, because that code can no longer run in parallel without duplicating the whole global object. I’m not saying that the code cannot be optimized to do fewer allocations, but much of the structure and design of the library relies upon heavy multi-threaded access, where allocating per-request byte slices may be the only way to do such a thing safely. In particular, you point to a few examples where we make a byte slice with a length determined by … That’s a lot of bookkeeping that we’re not set up to do right now, and it might even require a big refactor of the whole package to get done. 🤷♀
I agree, this seems the most difficult part, and it is what I meant when I wrote "but it doesn't seem so easy to follow allocation/deallocation in all cases"; it is also the main reason I asked for advice before starting to write some code.
For my use case saturating a gigabit connection is enough, and that can be easily achieved. Still, it would be really good to be as fast as OpenSSH; I would be glad to help, but I don't want to rewrite the whole library. @puellanivis thanks for your response
Anyway, we could start with the simplest things. For example: is this copy really needed? https://github.com/pkg/sftp/blob/master/packet.go#L841 Something like this seems to work, and the tests pass:
Here is the profile result with the current code:
And here is the result with the suggested modification:
@drakkan .. I'd be happy to work with you to increase the performance of sftp, but I'd like to make sure we keep the flexibility and don't complicate things. I agree that more targeted optimizations would be better than a large refactor at this point. I think the code first needs some time spent refactoring it, to eliminate the duplication of server code and simplify the data flow. Then larger-scale optimizations would be more feasible, maybe even becoming obvious with more attention paid to the data flow. So I'd love to review any targeted PRs for optimizations. Even simple ones, like eliminating the copy in your last post, add up and could make a significant impact. But I think more work needs to be done before something like a pool would be feasible.
@eikenb I'll try to read the code again to see if I can find some other simple improvements. Regarding the refactoring, I could try to help, but can you please explain in more detail what you think needs to be done? For example, we have server and request-server, and some code is duplicated between these two implementations; do we need to keep both? In general, if you can find some time to describe a refactoring plan, that would be helpful. Regarding the allocator, I just did a quick test with this very basic allocator:
The idea is to use an allocator for each …; I used it in …. I'll postpone my experiments on this allocator for now.
If you use …. Don't delete the …. Instead of …. With the above, and recognizing where pointer semantics apply, you can simplify your allocator a lot:
We can also reduce the chances of stale pointers holding back garbage collection by making sure we don't leave spare pointers around when we release pages:
@puellanivis, thanks! I'll test your suggestions; I'm sure they will improve the allocator. I'll also try to convert random-access reads/writes to sequential reads/writes, using a memory cache of a configurable size, to see whether this can improve performance too. Since I'm using the request server, I can make these changes directly inside my app. I think I can find some time for these tests after finalizing the sftpgo 0.9.6 release.
I did a quick test with the allocator posted above (a really ugly patch); here are some numbers using …. In my tests I write to a ramfs, and I use the sftp and scp CLIs as test tools:
So basically Go's SSH seems very fast, at least for downloads, but there is something in pkg/sftp that makes it slow. I'm a bit lost; I cannot find other bottlenecks. Any help is appreciated, thanks.
Hi again, I published my allocator patch here: https://github.com/drakkan/sftp/commits/master Please take a look at the two latest commits. I think we can still improve download performance, but I was unable to work out how for now (the scp implementation in sftpgo is still faster). I also tried to send the data packet directly from fileget here: https://github.com/drakkan/sftp/blob/master/request.go#L242 instead of returning the data packet, but this does not improve anything. Anyway, the tests on real hardware done by the user who initially reported the issue against sftpgo show that with this patch we get performance similar to OpenSSH (but with higher CPU usage). Probably the disks' limits were reached. My allocator patch is very hacky; I would like to discuss how to improve it and eventually how to refactor pkg/sftp to get it merged. The idea is to use an allocator for each packetManager; each new request, identified by its order id, requests new slices as needed and releases them in https://github.com/drakkan/sftp/blob/master/packet-manager.go#L186. The released slices can be reused by other requests inside the same packetManager, so we need:
In my patch I marshal the data packet directly in fileget (https://github.com/drakkan/sftp/blob/master/request.go#L236); this is very hacky. We could, for example, just leave the needed unallocated bytes at the beginning, but then we would need to do the same for the other packets too, or special-case the data packet in https://github.com/drakkan/sftp/blob/master/packet.go#L125. I'm open to suggestions, and even to completely rewriting my patch, but I would like some guidance/suggestions, thanks!
I’m not 100% sure that we necessarily need a caching allocator like this yet, because ideally Go should already be handling that sort of optimization reasonably well for us. (And it adds a lot of complexity, so I would like to see benchmarks demonstrating that it is absolutely necessary.) I’m sure there are plenty more improvements available just by avoiding reallocations. If the performance is the same, but CPU usage is higher, then it’s way more likely that there is a bunch of …
It is not absolutely necessary, but it improves performance. Here are some numbers. For the [email protected] cipher only the allocator patch applies; the other patches improve aes-ctr and the MAC, while aes128-gcm uses implicit message authentication. The baseline configuration includes my latest two patches, so pkg/sftp git master performance is compared against the allocator branch:
The performance is the same as OpenSSH, not the same as without the allocator patch. The CPU usage is a bit higher than OpenSSH's; anyway, the user didn't include CPU usage in the summary. I don't have such hardware available for testing myself. If it's all the same to you, I suggest trying my allocator branch using a RAM filesystem with the [email protected] cipher, which doesn't require patching Go itself; I developed and tested the patch this way. Thanks
What I’m saying is that if we can avoid at least some of the copies into new slices, we should decrease the pressure being relieved by the allocator. Then we would be addressing the root cause, and not the symptoms. After we have removed all the copies into new slices, we should be able to evaluate whether building in an allocator would be worthwhile. Like I said, I would really like to address the root cause of the memory-allocation pressure without an allocator first, and then evaluate the need for adding complicated allocator code. The driving interests here: K.I.S.S. and “No code is best code.” P.S.: If we implement an allocator, we have to ensure that it eventually releases memory back to the system; otherwise we will basically have written an over-engineered memory leak. P.P.S.: Getting a good code review on the allocator is going to be a complex process, because it’s complex code. Meanwhile, getting rid of unnecessary allocate+copies makes for simple code reviews that we should be able to turn around quickly.
Well, I'll try to read the code once again, but I cannot see any other copies in the critical paths (uploads and downloads; I don't worry about the other sftp packets).
In my branch the allocator is associated with the packet manager, and the memory is released when the packet manager ends: https://github.com/drakkan/sftp/blob/master/request-server.go#L154. We can have a memory leak only if downloads and uploads require packets of different sizes all the time; eventually I can add a guard against this by limiting the maximum number of preallocated packets, if there is interest in this approach. Thanks
Just for info, this is a memory profile while downloading a 1 GB file using current master:
As you can see, there is an allocation both in fileget and in MarshalBinary. Here is a memory profile while downloading the same file using my allocator fork:
Maybe we could at least avoid the second allocation in MarshalBinary; a pull request is coming.
Hi,
SFTPGo uses this library for SFTP and a homemade implementation for SCP.
Generally the bottleneck is the encryption, but for some ciphers this is not the case. For example, [email protected] has an implicit MAC and is really fast. Other ciphers can be optimized; for example, for AES-CTR we can apply this patch: https://go-review.googlesource.com/c/go/+/51670
and SHA-256, used for the MAC, can be optimized this way:
drakkan/crypto@cc78d71
Now encryption is no longer the bottleneck, and SFTPGo's SCP implementation has speed comparable to OpenSSH; sadly, SFTP is still slower than OpenSSH.
I understand that SCP is a simpler protocol than SFTP, and my implementation is really basic too: I have only one goroutine that reads/writes to/from a preallocated []byte, with no packet manager or multiple goroutines as in pkg/sftp.
In SFTPGo, if I move this line:
https://github.com/drakkan/sftpgo/blob/master/sftpd/scp.go#L402
inside the for loop, I get a 25% performance regression. pkg/sftp seems to do exactly this; looking at the pkg/sftp code, for each packet there is at least one line such as this one:
I think we can improve performance using preallocated slices. I'm thinking about something like a global or per-request sync.Pool, but it doesn't seem so easy to follow allocation/deallocation in all cases; do you have better suggestions? We could make pkg/sftp as fast as the OpenSSH implementation!
Please take a look here for some numbers and profiling results:
drakkan/sftpgo#69
thanks!