
Relatively lower performance than OpenSSH #69

Closed
HiFiPhile opened this issue Feb 8, 2020 · 45 comments
@HiFiPhile
Contributor

Hi,
Thanks for this great project !

I did some tests in my environment and the transfer speed is much lower than with OpenSSH.

Server

- OS: Debian 10.2 x64
- CPU: Ryzen 5 3600
- RAM: 64GB ECC
- Disk: 3× Intel P4510 4TB RAID0
- Ethernet: Mellanox ConnectX-3 40GbE

Client

- OS: Windows 10 1909 x64
- CPU: Threadripper 1920X
- RAM: 64GB ECC
- Disk: Samsung 960EVO 1TB
- Ethernet: Mellanox ConnectX-3 40GbE

Under Filezilla I can get 500MB/s with OpenSSH, but only about 200MB/s with sftpgo.

In both cases I'm using AES256-CTR as the cipher and SHA-256 as the MAC; I've also tried AES128-CTR but nothing changes.

CPU usage of sftpgo is higher than OpenSSH:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
 4527 sftp      20   0 1795576  52044   8628 R 133.5   0.6   2:12.13 sftpgo 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
27934 xxxxxx    20   0   17112   5360   4188 R  67.8   0.1   0:10.01 sshd                                                  
27942 xxxxxx    20   0   17112   5344   4176 R  27.4   0.1   0:12.52 sshd 

In both cases I've got a maximum TCP window size of 4MB.

@drakkan
Owner

drakkan commented Feb 8, 2020

Hi,

thanks for taking the time to do this test, it's much appreciated.

I don't have a 40 gigabit ethernet card.

Since the first implementation I have noticed, testing on localhost, that SFTPGo is a bit slower and uses more CPU than OpenSSH, so I can confirm what you report, but the difference is not so big in my tests (280MB/s OpenSSH, 230MB/s SFTPGo/SFTP, 250MB/s SFTPGo/SCP).

In real environments SFTPGo was able to completely saturate my network card, using more CPU than OpenSSH, but my tests were limited to 1 gigabit ethernet cards.

Can you please test multiple parallel transfers and report if you are able to saturate your 40 gigabit network?

Can you repeat the same test using scp? SCP is implemented from scratch inside SFTPGo, while for SFTP we use the pkg/sftp library. If scp performs better I'll try to see if pkg/sftp can be improved.

Also please note that the golang ssh implementation currently does not support zlib compression, so if OpenSSH uses compression the real transferred size is smaller than the one transferred using SFTPGo.

Anyway I get the same results using both sftpgo and the sample, very basic, implementation here:

https://github.com/pkg/sftp/tree/master/examples/go-sftp-server

so we need to improve pkg/sftp

@HiFiPhile
Contributor Author

HiFiPhile commented Feb 8, 2020

I switched to Ubuntu 19.10 since scp sucks on Windows.

I've noticed a huge improvement between Filezilla 3.39 (in the repo) and the latest 3.46; maybe it's because of:

3.42.0-beta1 (2019-04-21)
    Large refactoring of the socket code
    The thread pool from libfilezilla is now used for all worker threads

Parallel test

The speeds under Windows were with 2 streams.

| Streams | SFTPGo MB/s | SFTPGo CPU% | OpenSSH MB/s | OpenSSH CPU% |
| ------- | ----------- | ----------- | ------------ | ------------ |
| 1       | 125         | 137         | 380          | 97           |
| 2       | 210         | 227         | 600          | 190          |
| 3       | 260         | 228         | 700          | 276          |
| 4       | 330         | 336         | 810          | 344          |
| 8       | 387         | 430         | 950          | 400          |

SCP

I got 208MB/s vs 235MB/s, and it seems spaces in paths are not correctly handled; I've tried 3 types of escaping, but scp returns immediately without transferring.

scp [email protected]:"'web/tmp/Master File 18 10 13.xls'" .
scp [email protected]:"web/tmp/Master\ File\ 18\ 10\ 13.xls" .
scp [email protected]:web/tmp/Master\\\ File\\\ 18\\\ 10\\\ 13.xls .

I also suspect it's an issue in golang's sftp package, but since I'm not familiar with Go I didn't look further.

@drakkan
Owner

drakkan commented Feb 8, 2020

Thanks for reporting back. Spaces in SCP paths are now fixed; this command

scp [email protected]:"web/tmp/Master\ File\ 18\ 10\ 13.xls" .

should work now. I'll try to fix the "'web/tmp/Master File 18 10 13.xls'" escape style too.

Regarding the performance issue, I'll do some more tests myself (on localhost; I don't have a network like yours) and eventually I'll ask upstream.

From your test results it seems that SFTPGo can easily saturate a gigabit connection but has issues when more bandwidth is available; 8 streams are served at about 3 gigabit/s.

@HiFiPhile
Contributor Author

For a gigabit connection it's totally fine. I can help if you need more tests in high-bandwidth conditions.

@drakkan
Owner

drakkan commented Feb 9, 2020

Hi, I did some profiling and the bottleneck seems to be the encryption. Without specifying a cipher, the default on my laptop is chacha20Poly1305ID; here are the profile results:

Showing top 10 nodes out of 157
      flat  flat%   sum%        cum   cum%
    1620ms 21.15% 21.15%     1620ms 21.15%  golang.org/x/crypto/chacha20.quarterRound
    1340ms 17.49% 38.64%     1430ms 18.67%  syscall.Syscall
     400ms  5.22% 43.86%      400ms  5.22%  golang.org/x/crypto/poly1305.update
     390ms  5.09% 48.96%     2070ms 27.02%  golang.org/x/crypto/chacha20.(*Cipher).xorKeyStreamBlocksGeneric
     370ms  4.83% 53.79%      370ms  4.83%  runtime.epollwait
     370ms  4.83% 58.62%      370ms  4.83%  runtime.futex
     360ms  4.70% 63.32%      360ms  4.70%  runtime.procyield
     220ms  2.87% 66.19%      220ms  2.87%  runtime.memmove
     180ms  2.35% 68.54%      280ms  3.66%  runtime.scanobject
     180ms  2.35% 70.89%      180ms  2.35%  runtime.usleep

setting "ciphers": ["aes256-ctr"] I get a performance degradation:

Showing top 10 nodes out of 149
      flat  flat%   sum%        cum   cum%
    2550ms 25.91% 25.91%     2550ms 25.91%  crypto/sha256.block
    1540ms 15.65% 41.57%     1590ms 16.16%  syscall.Syscall
     980ms  9.96% 51.52%      980ms  9.96%  crypto/aes.encryptBlockAsm
     540ms  5.49% 57.01%      540ms  5.49%  runtime.futex
     350ms  3.56% 60.57%      350ms  3.56%  runtime.procyield
     330ms  3.35% 63.92%     1460ms 14.84%  crypto/cipher.(*ctr).refill
     250ms  2.54% 66.46%      470ms  4.78%  runtime.scanobject
     240ms  2.44% 68.90%      240ms  2.44%  runtime.epollwait
     230ms  2.34% 71.24%     1260ms 12.80%  runtime.findrunnable
     200ms  2.03% 73.27%      200ms  2.03%  runtime.memmove

while setting "ciphers": ["[email protected]"] I get the best performance and the encryption is no longer the bottleneck:

Showing top 10 nodes out of 140
      flat  flat%   sum%        cum   cum%
    1290ms 31.01% 31.01%     1330ms 31.97%  syscall.Syscall
     240ms  5.77% 36.78%      240ms  5.77%  runtime.memmove
     230ms  5.53% 42.31%      320ms  7.69%  runtime.scanobject
     220ms  5.29% 47.60%      220ms  5.29%  crypto/aes.gcmAesEnc
     180ms  4.33% 51.92%      180ms  4.33%  runtime.epollwait
     170ms  4.09% 56.01%      170ms  4.09%  runtime.futex
     160ms  3.85% 59.86%      160ms  3.85%  syscall.Syscall6
     150ms  3.61% 63.46%      150ms  3.61%  runtime.procyield
     130ms  3.12% 66.59%      130ms  3.12%  runtime.memclrNoHeapPointers
     100ms  2.40% 68.99%      100ms  2.40%  golang.org/x/crypto/argon2.blamkaSSE

Can you post your results using [email protected]? Using the same cipher OpenSSH still performs better, but maybe this way you can get acceptable results for your use case.

Can you also better explain your use case? Thanks!
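The gap between the ciphers in the profiles above comes down to how the work is split: chacha20-poly1305 and AES-CTR spend their time in keystream plus separate MAC code, while AES-GCM does encryption and authentication in one hardware-accelerated pass (the MAC is implicit). A minimal stdlib sketch, not sftpgo code and with illustrative names only, of the two AES constructions:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// ctrThenMAC encrypts with AES-CTR and appends an HMAC-SHA-256 tag:
// two separate passes over the data, as with aes256-ctr + hmac-sha2-256.
func ctrThenMAC(key, iv, plaintext []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	out := make([]byte, len(plaintext))
	cipher.NewCTR(block, iv).XORKeyStream(out, plaintext)
	mac := hmac.New(sha256.New, key)
	mac.Write(out)
	return mac.Sum(out) // ciphertext + 32-byte tag
}

// sealGCM encrypts and authenticates in a single pass, as with
// [email protected] where the MAC is <implicit>.
func sealGCM(key, nonce, plaintext []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	return gcm.Seal(nil, nonce, plaintext, nil) // ciphertext + 16-byte tag
}

func main() {
	key := make([]byte, 32)          // aes-256
	payload := make([]byte, 32*1024) // one 32 KiB SFTP-sized packet
	ct := ctrThenMAC(key, make([]byte, aes.BlockSize), payload)
	sealed := sealGCM(key, make([]byte, 12), payload)
	fmt.Println("ctr+hmac bytes:", len(ct))
	fmt.Println("gcm bytes:", len(sealed))
}
```

Wrapping the two helpers in `testing.B` benchmarks reproduces the GCM-vs-CTR gap seen in the profiles.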

@HiFiPhile
Contributor Author

I've some interesting findings.

  • Filezilla doesn't support [email protected], even though it's listed as a fixed issue.
  • It's possible that Filezilla does some tweaks if it detects an OpenSSH server, because using the sftp command I got only 200MB/s instead of 380MB/s.

Here we go for the [email protected] test

I didn't try to enable profiling, but the performance numbers are enough to show the difference.

SCP

SCP is blazing fast; it even hit other bottlenecks.

| Streams | SFTPGo MB/s | SFTPGo CPU% | OpenSSH MB/s | OpenSSH CPU% |
| ------- | ----------- | ----------- | ------------ | ------------ |
| 1       | 500         | 105         | 360          | 100          |
| 2       | 950         | 195         | 680          | 195          |
| 3       | 1350        | 284         | 980          | 285          |
| 4       | 1650        | 332         | 1100         | 372          |
| 8       | 2400        | 520         | 1550         | 616          |

SFTP

SFTP also got a huge boost.

| Streams | SFTPGo MB/s | SFTPGo CPU% | OpenSSH MB/s | OpenSSH CPU% |
| ------- | ----------- | ----------- | ------------ | ------------ |
| 1       | 260         | 173         | 340          | 102          |
| 2       | 420         | 254         | 640          | 208          |
| 3       | 500         | 297         | 800          | 300          |
| 4       | 580         | 333         | 1000         | 400          |
| 8       | 700         | 390         | 1450         | 648          |

Indeed [email protected] can massively improve the performance, but seems like there are some compatibility issues, Rebex SSH also not working.

@drakkan
Owner

drakkan commented Feb 10, 2020

> I've some interesting findings.
>
> • Filezilla doesn't support [email protected], even though it's listed as a fixed issue.

In my tests I used the sftp CLI. I tested with Filezilla now and I see that it does not support AES-GCM. [email protected] (which on my hw performs better than AES-CTR) seems not to be supported in Filezilla either.

> • It's possible that Filezilla does some tweaks if it detects an OpenSSH server, because using the sftp command I got only 200MB/s instead of 380MB/s.

In my tests on localhost, transferring a 1GB file, Filezilla's speed against OpenSSH is 190MB/s; against SFTPGo (forcing the same cipher, aes-256-ctr) it's 160MB/s.

> Here we go for the [email protected] test
>
> I didn't try to enable profiling, but the performance numbers are enough to show the difference.
>
> SCP
>
> SCP is blazing fast; it even hit other bottlenecks.

I tested SCP now and I confirm that my simple implementation outperforms OpenSSH when the cipher is not the bottleneck (for example aes128-gcm).

This means that pkg/sftp could be optimized in some way, but if we cannot use AES-GCM the bottleneck will remain the encryption.

I did a quick test on a virtual machine, using CentOS 8 and the Red Hat (FIPS certified) go-toolset that replaces the golang crypto implementation with OpenSSL, but OpenSSH still performs better when using AES-CTR based ciphers. I could also try the boringssl golang branch, but I don't think it will perform much better than go-toolset from Red Hat.

Maybe for your use case a load balancer, such as haproxy, could help to balance the load between several backends.

> | Streams | SFTPGo MB/s | SFTPGo CPU% | OpenSSH MB/s | OpenSSH CPU% |
> | ------- | ----------- | ----------- | ------------ | ------------ |
> | 1       | 500         | 105         | 360          | 100          |
> | 2       | 950         | 195         | 680          | 195          |
> | 3       | 1350        | 284         | 980          | 285          |
> | 4       | 1650        | 332         | 1100         | 372          |
> | 8       | 2400        | 520         | 1550         | 616          |
>
> SFTP
>
> SFTP also got a huge boost.
>
> | Streams | SFTPGo MB/s | SFTPGo CPU% | OpenSSH MB/s | OpenSSH CPU% |
> | ------- | ----------- | ----------- | ------------ | ------------ |
> | 1       | 260         | 173         | 340          | 102          |
> | 2       | 420         | 254         | 640          | 208          |
> | 3       | 500         | 297         | 800          | 300          |
> | 4       | 580         | 333         | 1000         | 400          |
> | 8       | 700         | 390         | 1450         | 648          |
>
> Indeed, [email protected] can massively improve performance, but it seems there are some compatibility issues; Rebex SSH doesn't work either.

@HiFiPhile
Contributor Author

HiFiPhile commented Feb 10, 2020

I've looked at this issue: https://github.com/golang/go/issues/20967
It seems it hasn't been merged; I get bad CTR speed.

$ go test -bench '(GCM|CTR).*1K' crypto/cipher
goos: linux
goarch: amd64
pkg: crypto/cipher
BenchmarkAESGCMSeal1K-8   	 5000000	       275 ns/op	3713.72 MB/s
BenchmarkAESGCMOpen1K-8   	 5000000	       282 ns/op	3623.19 MB/s
BenchmarkAESCTR1K-8       	 1000000	      1265 ns/op	 805.12 MB/s
PASS
ok  	crypto/cipher	4.867s

After applying the patch:

$ go test -bench '(GCM|CTR).*1K' crypto/cipher
goos: linux
goarch: amd64
pkg: crypto/cipher
BenchmarkAESGCMSeal1K-8   	 4340227	       277 ns/op	3702.23 MB/s
BenchmarkAESGCMOpen1K-8   	 4380684	       272 ns/op	3763.03 MB/s
BenchmarkAESCTR1K-8       	 4144329	       275 ns/op	3708.23 MB/s
PASS
ok  	crypto/cipher	4.797s

@drakkan
Owner

drakkan commented Feb 10, 2020

> I've looked at this issue: https://github.com/golang/go/issues/20967
> It seems it hasn't been merged; I get bad CTR speed.
>
> $ go test -bench '(GCM|CTR).*1K' crypto/cipher
> goos: linux
> goarch: amd64
> pkg: crypto/cipher
> BenchmarkAESGCMSeal1K-8   	 5000000	       275 ns/op	3713.72 MB/s
> BenchmarkAESGCMOpen1K-8   	 5000000	       282 ns/op	3623.19 MB/s
> BenchmarkAESCTR1K-8       	 1000000	      1265 ns/op	 805.12 MB/s
> PASS
> ok  	crypto/cipher	4.867s
>
> After applying the patch:
>
> $ go test -bench '(GCM|CTR).*1K' crypto/cipher
> goos: linux
> goarch: amd64
> pkg: crypto/cipher
> BenchmarkAESGCMSeal1K-8   	 4340227	       277 ns/op	3702.23 MB/s
> BenchmarkAESGCMOpen1K-8   	 4380684	       272 ns/op	3763.03 MB/s
> BenchmarkAESCTR1K-8       	 4144329	       275 ns/op	3708.23 MB/s
> PASS
> ok  	crypto/cipher	4.797s

Great! And what about SFTPGo performance when compiled against this patched Go version?

Here is a patch that saves a CPU profile to /tmp/cpuprofile after you stop SFTPGo with CTRL+C:

diff --git a/main.go b/main.go
index 1604364..8cb075b 100644
--- a/main.go
+++ b/main.go
@@ -4,6 +4,9 @@
 package main // import "github.com/drakkan/sftpgo"
 
 import (
+	"os"
+	"runtime/pprof"
+
 	"github.com/drakkan/sftpgo/cmd"
 	_ "github.com/go-sql-driver/mysql"
 	_ "github.com/lib/pq"
@@ -11,5 +14,11 @@ import (
 )
 
 func main() {
+	f, _ := os.Create("/tmp/cpuprofile")
+	defer f.Close()
+	if err := pprof.StartCPUProfile(f); err != nil {
+		return
+	}
+	defer pprof.StopCPUProfile()
 	cmd.Execute()
 }
diff --git a/service/service.go b/service/service.go
index fa50ec9..43f754e 100644
--- a/service/service.go
+++ b/service/service.go
@@ -122,6 +122,13 @@ func (s *Service) Wait() {
 	if s.PortableMode != 1 {
 		registerSigHup()
 	}
+	sig := make(chan os.Signal, 1)
+	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
+	go func() {
+		<-sig
+		logger.InfoToConsole("stopping service")
+		s.Stop()
+	}()
 	<-s.Shutdown
 }

@drakkan
Owner

drakkan commented Feb 10, 2020

Hi,

I applied and tested that patch myself; it only gives a small performance improvement. The bottleneck is now the MAC. When using AES-GCM mode the MAC is implicit, from the sftp -vvvv output:

debug1: kex: server->client cipher: [email protected] MAC: <implicit> compression: none

while for aes ctr:

debug1: kex: server->client cipher: aes256-ctr MAC: [email protected] compression: none

so we need a way to improve golang's SHA-2 performance.

We could try to patch crypto/ssh

https://github.com/golang/crypto/blob/master/ssh/mac.go#L12

to use, for example, this sha2 implementation:

https://github.com/minio/sha256-simd

@HiFiPhile
Contributor Author

Wow, we came up with the same thing! I just finished my test and saw you ended up with the same conclusion.
With the AES patch I got only a 10-20% gain; the bottleneck is indeed sha256.

(pprof) top10
Showing nodes accounting for 1.01mins, 78.26% of 1.29mins total
Dropped 372 nodes (cum <= 0.01mins)
Showing top 10 nodes out of 139
      flat  flat%   sum%        cum   cum%
  0.48mins 37.63% 37.63%   0.48mins 37.63%  crypto/sha256.block
  0.29mins 22.86% 60.49%   0.30mins 23.47%  syscall.Syscall
  0.04mins  3.50% 63.99%   0.04mins  3.50%  runtime.memmove
  0.04mins  3.40% 67.39%   0.04mins  3.40%  crypto/aes.fillEightBlocks
  0.04mins  3.02% 70.41%   0.04mins  3.06%  syscall.Syscall6
  0.03mins  2.10% 72.51%   0.05mins  3.93%  runtime.scanobject
  0.03mins  2.05% 74.56%   0.03mins  2.05%  runtime.procyield
  0.02mins  1.65% 76.20%   0.02mins  1.65%  runtime.memclrNoHeapPointers
  0.01mins  1.10% 77.30%   0.01mins  1.10%  runtime.futex
  0.01mins  0.96% 78.26%   0.02mins  1.43%  runtime.findObject

https://github.com/minio/sha256-simd gave me a decent result:

go test -bench=.
goos: linux
goarch: amd64
pkg: github.com/minio/sha256-simd
BenchmarkHash/SHA_/8Bytes-8   	16587733	        73.7 ns/op	 108.54 MB/s
BenchmarkHash/SHA_/1K-8       	 1861538	       623 ns/op	1642.56 MB/s
BenchmarkHash/SHA_/8K-8       	  268082	      4666 ns/op	1755.55 MB/s
BenchmarkHash/SHA_/1M-8       	    2126	    565327 ns/op	1854.81 MB/s
BenchmarkHash/SHA_/5M-8       	     408	   2888687 ns/op	1814.97 MB/s
BenchmarkHash/SHA_/10M-8      	     208	   5668807 ns/op	1849.73 MB/s
BenchmarkHash/AVX2/8Bytes-8   	 4695985	       252 ns/op	  31.74 MB/s
BenchmarkHash/AVX2/1K-8       	  383493	      2955 ns/op	 346.54 MB/s
BenchmarkHash/AVX2/8K-8       	   55555	     22888 ns/op	 357.92 MB/s
BenchmarkHash/AVX2/1M-8       	     435	   2776574 ns/op	 377.65 MB/s
BenchmarkHash/AVX2/5M-8       	      70	  15389700 ns/op	 340.67 MB/s
BenchmarkHash/AVX2/10M-8      	      38	  31855830 ns/op	 329.16 MB/s
BenchmarkHash/AVX_/8Bytes-8   	 5304136	       223 ns/op	  35.85 MB/s
BenchmarkHash/AVX_/1K-8       	  386608	      2947 ns/op	 347.47 MB/s
BenchmarkHash/AVX_/8K-8       	   52936	     23003 ns/op	 356.12 MB/s
BenchmarkHash/AVX_/1M-8       	     394	   2833284 ns/op	 370.09 MB/s
BenchmarkHash/AVX_/5M-8       	      86	  13979228 ns/op	 375.05 MB/s
BenchmarkHash/AVX_/10M-8      	      37	  29248603 ns/op	 358.50 MB/s
BenchmarkHash/SSSE/8Bytes-8   	 5321602	       211 ns/op	  37.96 MB/s
BenchmarkHash/SSSE/1K-8       	  385484	      3087 ns/op	 331.67 MB/s
BenchmarkHash/SSSE/8K-8       	   53042	     22051 ns/op	 371.50 MB/s
BenchmarkHash/SSSE/1M-8       	     427	   2939882 ns/op	 356.67 MB/s
BenchmarkHash/SSSE/5M-8       	      81	  14189059 ns/op	 369.50 MB/s
BenchmarkHash/SSSE/10M-8      	      42	  28117345 ns/op	 372.93 MB/s
BenchmarkHash/GEN_/8Bytes-8   	 5327409	       220 ns/op	  36.44 MB/s
BenchmarkHash/GEN_/1K-8       	  495444	      2496 ns/op	 410.18 MB/s
BenchmarkHash/GEN_/8K-8       	   62418	     19018 ns/op	 430.74 MB/s
BenchmarkHash/GEN_/1M-8       	     522	   2396453 ns/op	 437.55 MB/s
BenchmarkHash/GEN_/5M-8       	     100	  11852390 ns/op	 442.35 MB/s
BenchmarkHash/GEN_/10M-8      	      49	  24008253 ns/op	 436.76 MB/s
PASS
ok  	github.com/minio/sha256-simd	40.496s

While the original version is quite a bit slower:

go test -bench=. crypto/sha256
goos: linux
goarch: amd64
pkg: crypto/sha256
BenchmarkHash8Bytes-8   	 6154984	       195 ns/op	  41.01 MB/s
BenchmarkHash1K-8       	  485937	      2491 ns/op	 411.05 MB/s
BenchmarkHash8K-8       	   61125	     18626 ns/op	 439.81 MB/s
PASS
ok  	crypto/sha256	3.976s

@drakkan
Owner

drakkan commented Feb 11, 2020

Hi,

I forked golang crypto and replaced the sha256 implementation with sha256-simd, but sha256 is still the bottleneck:

    2520ms 29.79% 29.79%     2520ms 29.79%  github.com/minio/sha256-simd.blockAvx2
    1650ms 19.50% 49.29%     1660ms 19.62%  syscall.Syscall
     390ms  4.61% 53.90%      390ms  4.61%  syscall.Syscall6
     360ms  4.26% 58.16%      360ms  4.26%  crypto/aes.fillEightBlocks
     320ms  3.78% 61.94%      320ms  3.78%  runtime.epollwait
     300ms  3.55% 65.48%      300ms  3.55%  runtime.futex
     230ms  2.72% 68.20%      230ms  2.72%  runtime.usleep
     220ms  2.60% 70.80%      400ms  4.73%  runtime.scanobject
     210ms  2.48% 73.29%      210ms  2.48%  runtime.procyield
     190ms  2.25% 75.53%      190ms  2.25%  runtime.memmove

my laptop only supports AVX2; can you please try it on your hw? Based on the benchmark you posted above, it should support the SHA extensions.

Here is a patch to replace the default golang sha256 with sha256-simd

diff --git a/go.mod b/go.mod
index 729c251..2adf5c9 100644
--- a/go.mod
+++ b/go.mod
@@ -43,3 +43,5 @@ require (
 )
 
 replace github.com/eikenb/pipeat v0.0.0-20190316224601-fb1f3a9aa29f => github.com/drakkan/pipeat v0.0.0-20200123131427-11c048cfc0ec
+
+replace golang.org/x/crypto v0.0.0-20200128174031-69ecbb4d6d5d => github.com/drakkan/crypto v0.0.0-20200211081002-cc78d71334be

@HiFiPhile
Contributor Author

Hi,
In my case SHA256 performance improved a lot; I think it's correctly using the SHA instructions.

(pprof) top10
Showing nodes accounting for 121.92s, 72.52% of 168.11s total
Dropped 465 nodes (cum <= 0.84s)
Showing top 10 nodes out of 159
      flat  flat%   sum%        cum   cum%
    48.27s 28.71% 28.71%     49.83s 29.64%  syscall.Syscall
    20.99s 12.49% 41.20%     20.99s 12.49%  github.com/minio/sha256-simd.blockSha
    15.34s  9.12% 50.32%     15.47s  9.20%  syscall.Syscall6
    10.17s  6.05% 56.37%     10.17s  6.05%  runtime.memmove
     7.28s  4.33% 60.70%      7.28s  4.33%  crypto/aes.fillEightBlocks
     6.78s  4.03% 64.74%      6.78s  4.03%  runtime.memclrNoHeapPointers
     4.07s  2.42% 67.16%      4.07s  2.42%  runtime.futex
     3.91s  2.33% 69.48%      6.60s  3.93%  runtime.scanobject
     3.33s  1.98% 71.47%      3.33s  1.98%  runtime.procyield
     1.78s  1.06% 72.52%      1.78s  1.06%  crypto/cipher.xorBytesSSE2

Combining both the AES and SHA patches, speed is increased by more than 60%.

| Streams | Patched MB/s | Patched CPU% | Original MB/s | Original CPU% | Gain % |
| ------- | ------------ | ------------ | ------------- | ------------- | ------ |
| 1       | 210          | 145          | 125           | 137           | 60     |
| 2       | 350          | 250          | 210           | 227           | 66     |
| 3       | 510          | 310          | 260           | 228           | 96     |
| 4       | 560          | 370          | 330           | 336           | 70     |
| 8       | 625          | 420          | 387           | 430           | 61     |

@drakkan
Owner

drakkan commented Feb 11, 2020

Great! So we are now closer to OpenSSH performance. And what about SCP? Thanks!

@HiFiPhile
Copy link
Contributor Author

SCP outperforms OpenSSH with aes128-ctr; looks like your SCP implementation works great!

| Streams | SFTPGo MB/s | SFTPGo CPU% | OpenSSH MB/s | OpenSSH CPU% |
| ------- | ----------- | ----------- | ------------ | ------------ |
| 1       | 380         | 108         | 384          | 103          |
| 2       | 750         | 215         | 720          | 205          |
| 3       | 1100        | 312         | 1050         | 300          |
| 4       | 1350        | 410         | 1250         | 395          |
| 8       | 2100        | 610         | 1850         | 712          |

@drakkan
Owner

drakkan commented Feb 17, 2020

Hi,

I did some minor performance improvements in pkg/sftp, you can test them using this diff

diff --git a/go.mod b/go.mod
index 1c7d75f..679e90f 100644
--- a/go.mod
+++ b/go.mod
@@ -44,3 +44,5 @@ require (
 )
 
 replace github.com/eikenb/pipeat v0.0.0-20190316224601-fb1f3a9aa29f => github.com/drakkan/pipeat v0.0.0-20200123131427-11c048cfc0ec
+
+replace github.com/pkg/sftp v1.11.0 => github.com/drakkan/sftp v0.0.0-20200217072548-e50dec9f7639

Can you also post your results for SCP downloads after applying the following patch?

diff --git a/sftpd/scp.go b/sftpd/scp.go
index 4050f4a..3ad8d41 100644
--- a/sftpd/scp.go
+++ b/sftpd/scp.go
@@ -399,9 +399,9 @@ func (c *scpCommand) sendDownloadFileData(filePath string, stat os.FileInfo, tra
                return err
        }
 
-       buf := make([]byte, 32768)
        var n int
        for {
+               buf := make([]byte, 32768)
                n, err = transfer.ReadAt(buf, readed)
                if err == nil || err == io.EOF {
                        if n > 0 {

this should decrease SCP download performance and should give us an idea of what we can achieve by avoiding memory reallocation inside pkg/sftp.

I retested SCP too and while downloading a file via SCP is as fast as OpenSSH, uploads are slower than OpenSSH on my laptop. Can you please confirm that your benchmark is for SCP downloads?

The main difference between SCP uploads and downloads is that for downloads we use sequential file reads, while for uploads we use random access writes. pkg/sftp uses random access for both reads and writes.

Can you please post the results for SCP uploads and downloads too (removing the scp patch above)?

I suspect that using sequential reads/writes will give more improvement than avoiding reallocating memory for each packet, and I think we'll need to work on this, but I would like to see the results of the above tests on your hardware before starting to write this code.

Thanks!
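As a standalone illustration of what the patch above toggles, here is a stdlib sketch (hypothetical names, not the actual sftpd/scp.go code) of a single reused transfer buffer versus a sync.Pool, the usual middle ground when packets are handled concurrently; both amortize the per-packet 32 KiB allocation:

```go
package main

import (
	"fmt"
	"sync"
)

const packetSize = 32768 // same 32 KiB chunk size as the scp loop

// pool recycles packet buffers across goroutines without per-packet
// garbage; Get falls back to a fresh allocation when the pool is empty.
var pool = sync.Pool{
	New: func() interface{} { return make([]byte, packetSize) },
}

// fillPacket stands in for transfer.ReadAt: it just touches the buffer
// and reports how many bytes were "read".
func fillPacket(buf []byte) int {
	for i := range buf {
		buf[i] = byte(i)
	}
	return len(buf)
}

func main() {
	// Reused buffer: one allocation for the whole transfer, as in the
	// original sendDownloadFileData loop before the test patch.
	buf := make([]byte, packetSize)
	total := 0
	for i := 0; i < 4; i++ {
		total += fillPacket(buf)
	}

	// Pooled buffers: still amortized, but safe to hand off between
	// concurrently processed packets.
	for i := 0; i < 4; i++ {
		b := pool.Get().([]byte)
		total += fillPacket(b)
		pool.Put(b)
	}
	fmt.Println("bytes processed:", total) // 8 * 32768 = 262144
}
```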

@HiFiPhile
Contributor Author

Hi,
I've changed some configs on my server, so the numbers could be slightly different.

SCP patch

| Streams | Before MB/s | Before CPU% | After MB/s | After CPU% |
| ------- | ----------- | ----------- | ---------- | ---------- |
| 1       | 450         | 108         | 390        | 135        |
| 2       | 800         | 210         | 650        | 224        |
| 3       | 1100        | 307         | 900        | 294        |
| 4       | 1450        | 385         | 1050       | 340        |
| 8       | 2250        | 620         | 1450       | 440        |

> I retested SCP too and while downloading a file via SCP is as fast as OpenSSH, uploads are slower than OpenSSH on my laptop. Can you please confirm that your benchmark is for SCP downloads?

Yes my tests are for downloads.

SCP upload

Due to ZFS caching and overhead it's hard to directly compare upload and download, but there's still a big difference.

| Streams | MB/s | CPU% |
| ------- | ---- | ---- |
| 1       | 340  | 170  |
| 2       | 520  | 271  |
| 3       | 630  | 337  |
| 4       | 700  | 364  |
| 8       | 830  | 410  |

With the performance improvements in pkg/sftp I get about a 20% speed gain.

@drakkan
Owner

drakkan commented Feb 18, 2020

Thanks for the results.

In the coming weeks I'll try to write a patch that buffers reads and writes in memory to allow sequential disk access. I'll try to make the read and write chunk sizes configurable so we can, for example, use big chunks such as 1MB; we'll see if it improves something.

@drakkan
Owner

drakkan commented Feb 25, 2020

Hi,

I did a really ugly patch that adds an allocator to pkg/sftp; it improves things somewhat. I posted some results here:

pkg/sftp#334

maybe I'll push this really ugly patch to my repo in the coming days; right now I don't know how to get more improvements.

I tested sequential access vs random access in my scp implementation and it doesn't change anything on my laptop (SSD disk). Implementing it in pkg/sftp requires a lot of effort, so I won't write this patch, at least for now, since it seems useless.

@drakkan
Owner

drakkan commented Feb 26, 2020

Here is my proof of concept allocator

https://github.com/drakkan/sftp

in my tests (on localhost) upload performance is now very similar to my scp implementation; downloads improved too but are still slower than scp downloads. I've looked at the pkg/sftp code several times, but so far I don't understand the reason.

can you please post the results on your hardware?

replace github.com/pkg/sftp  => github.com/drakkan/sftp v0.0.0-20200227085621-6b4abaad1b9a

Please note that this is ugly code and should be used for testing purposes only.
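For readers who don't want to dig through the fork, a heavily simplified sketch of the idea (hypothetical code, not the actual drakkan/sftp allocator): fixed-size pages are handed out per request ID and recycled on release instead of being garbage collected, which is where the `GetPage` frames in the later profiles come from:

```go
package main

import (
	"fmt"
	"sync"
)

const pageSize = 32768 // one packet-sized page

type allocator struct {
	mu    sync.Mutex
	free  [][]byte            // recycled pages ready for reuse
	inUse map[uint32][][]byte // pages handed out, keyed by request ID
}

func newAllocator() *allocator {
	return &allocator{inUse: make(map[uint32][][]byte)}
}

// GetPage returns a pageSize buffer, reusing a freed page when possible
// and allocating a new one only when the free list is empty.
func (a *allocator) GetPage(requestID uint32) []byte {
	a.mu.Lock()
	defer a.mu.Unlock()
	var page []byte
	if n := len(a.free); n > 0 {
		page = a.free[n-1]
		a.free = a.free[:n-1]
	} else {
		page = make([]byte, pageSize)
	}
	a.inUse[requestID] = append(a.inUse[requestID], page)
	return page
}

// ReleasePages returns all pages owned by a request to the free list.
func (a *allocator) ReleasePages(requestID uint32) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.free = append(a.free, a.inUse[requestID]...)
	delete(a.inUse, requestID)
}

func main() {
	a := newAllocator()
	p1 := a.GetPage(1)
	a.ReleasePages(1)
	p2 := a.GetPage(2) // reuses p1's backing memory, no new allocation
	fmt.Println("page reused:", &p1[0] == &p2[0], "size:", len(p2))
}
```

The trade-off is bookkeeping (mutex plus per-request tracking) against allocator/GC pressure, which is why the gain grows with parallel streams.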

@HiFiPhile
Contributor Author

HiFiPhile commented Feb 27, 2020

Hi,
I've done the test with your allocator in aes128-ctr mode.

Download

| Streams | Baseline MB/s | Baseline CPU% | Alloc MB/s | Alloc CPU% | Gain % |
| ------- | ------------- | ------------- | ---------- | ---------- | ------ |
| 1       | 235           | 170           | 280        | 147        | 19     |
| 2       | 395           | 260           | 520        | 250        | 32     |
| 3       | 510           | 310           | 760        | 380        | 49     |
| 4       | 600           | 350           | 1100       | 500        | 83     |
| 8       | 730           | 410           | 1850       | 720        | 153    |

Upload

| Streams | Baseline MB/s | Baseline CPU% | Alloc MB/s | Alloc CPU% | Gain % |
| ------- | ------------- | ------------- | ---------- | ---------- | ------ |
| 1       | 235           | 200           | 280        | 190        | 19     |
| 2       | 350           | 280           | 440        | 280        | 26     |
| 3       | 400           | 320           | 520        | 350        | 30     |
| 4       | 440           | 330           | 570        | 360        | 30     |
| 8       | 480           | 350           | 690        | 420        | 43     |

The results are very promising, especially the huge gain in parallel workloads.

I'm not sure of the reason, but OpenSSH is quicker than in the Feb 8th test even with a slower cipher. Still, we are very close in single/dual stream now, and with more than 4 streams we are equal in speed.

However, OpenSSH still has lower CPU usage.

I'll do a profile when I have time.

@drakkan
Owner

drakkan commented Feb 27, 2020

Hi,

thanks for your tests. Your results are different from mine: in my tests uploads are quicker than downloads, but I have to use a ramfs-based filesystem since my laptop's SSD is not quick enough for these tests, so my results are probably not realistic.

Do your baseline results include these 2 patches?

https://github.com/drakkan/sftp/commits/copy

I think these could be merged quickly upstream.

I added support for the proxy protocol; can you please try to balance the load between two or more instances? For example, using an haproxy configuration like this one (tested on Arch Linux with haproxy 2.1.3):

global
    maxconn     20000
    log         127.0.0.1 local0
    user        haproxy
    chroot      /usr/share/haproxy
    pidfile     /run/haproxy.pid
    daemon

frontend stats
    bind    :1936
    mode    http
    timeout client  30s
    default_backend stats

backend stats
    mode    http
    timeout connect 5s
    timeout server  30s
    timeout queue   30s
    stats   enable
    stats   hide-version
    stats   refresh 30s
    stats   show-node
    stats   auth admin:password
    stats   uri  /haproxy?stats

frontend sftp
    bind    :2222
    mode    tcp
    timeout client  600s
    default_backend sftpgo

backend sftpgo
    mode    tcp
    balance roundrobin
    timeout connect 10s
    timeout server  600s
    timeout queue   30s
    option  tcp-check
    tcp-check expect string SSH-2.0-

    server sftpgo1 127.0.0.1:2022 check send-proxy-v2 weight 10 inter 10s rise 2 fall 3
    server sftpgo2 127.0.0.1:2024 check send-proxy-v2 weight 10 inter 10s rise 2 fall 3

you have to set "proxy_protocol": 1 in sftpgo.json to get the real client's address instead of 127.0.0.1; this is optional. Parsing the proxy protocol could add a small performance loss, but maybe sharing the load could improve total performance.

In these tests you have to use an SQL-based data provider or separate SQLite databases; if you share the same SQLite database between two or more instances you can randomly get a "database is locked" error, as happens sometimes in this test case:

https://github.com/drakkan/sftpgo/blob/master/sftpd/sftpd_test.go#L1093

thanks for your patience

@HiFiPhile
Contributor Author

Hi,
My baseline is HiFiPhile/sftpgo@8e3434b8, which includes the first patch.
I've done some profiling:

Download

aes128-ctr:

Showing nodes accounting for 26.86s, 79.37% of 33.84s total
Dropped 217 nodes (cum <= 0.17s)
Showing top 10 nodes out of 123
      flat  flat%   sum%        cum   cum%
    11.38s 33.63% 33.63%     11.72s 34.63%  syscall.Syscall
     5.25s 15.51% 49.14%      5.25s 15.51%  github.com/minio/sha256-simd.blockSha
     3.43s 10.14% 59.28%      3.47s 10.25%  syscall.Syscall6
     2.13s  6.29% 65.57%      2.13s  6.29%  crypto/aes.fillEightBlocks
     1.40s  4.14% 69.71%      1.40s  4.14%  runtime.futex
     1.08s  3.19% 72.90%      1.08s  3.19%  runtime.memmove
     0.78s  2.30% 75.21%      0.78s  2.30%  runtime.procyield
     0.57s  1.68% 76.89%      0.57s  1.68%  runtime.epollwait
     0.54s  1.60% 78.49%      0.54s  1.60%  crypto/cipher.xorBytesSSE2
     0.30s  0.89% 79.37%      2.46s  7.27%  runtime.findrunnable

[email protected]:

Showing nodes accounting for 20.02s, 78.42% of 25.53s total
Dropped 257 nodes (cum <= 0.13s)
Showing top 10 nodes out of 112
      flat  flat%   sum%        cum   cum%
    10.33s 40.46% 40.46%     10.62s 41.60%  syscall.Syscall
     3.56s 13.94% 54.41%      3.59s 14.06%  syscall.Syscall6
     2.62s 10.26% 64.67%      2.62s 10.26%  crypto/aes.gcmAesEnc
     1.12s  4.39% 69.06%      1.12s  4.39%  runtime.memmove
     0.88s  3.45% 72.50%      0.88s  3.45%  runtime.futex
     0.50s  1.96% 74.46%      0.50s  1.96%  runtime.procyield
     0.44s  1.72% 76.18%      0.44s  1.72%  runtime.epollwait
     0.22s  0.86% 77.05%      0.66s  2.59%  runtime.mallocgc
     0.18s  0.71% 77.75%      1.51s  5.91%  runtime.findrunnable
     0.17s  0.67% 78.42%      0.23s   0.9%  runtime.exitsyscall

[email protected] 8 streams:

Showing nodes accounting for 39.30s, 78.73% of 49.92s total
Dropped 282 nodes (cum <= 0.25s)
Showing top 10 nodes out of 102
      flat  flat%   sum%        cum   cum%
    19.52s 39.10% 39.10%     20.10s 40.26%  syscall.Syscall
     8.72s 17.47% 56.57%      8.79s 17.61%  syscall.Syscall6
     5.76s 11.54% 68.11%      5.76s 11.54%  crypto/aes.gcmAesEnc
     3.28s  6.57% 74.68%      3.28s  6.57%  runtime.memmove
     0.52s  1.04% 75.72%      1.67s  3.35%  runtime.mallocgc
     0.32s  0.64% 76.36%      1.48s  2.96%  github.com/pkg/sftp.(*allocator).GetPage
     0.30s   0.6% 76.96%      0.37s  0.74%  runtime.deferreturn
     0.30s   0.6% 77.56%      0.37s  0.74%  runtime.heapBitsSetType
     0.30s   0.6% 78.17%      0.30s   0.6%  runtime.nextFreeFast
     0.28s  0.56% 78.73%      0.47s  0.94%  sync.(*Mutex).Lock

Upload

aes128-ctr:

Showing nodes accounting for 29360ms, 74.20% of 39570ms total
Dropped 360 nodes (cum <= 197.85ms)
Showing top 10 nodes out of 135
      flat  flat%   sum%        cum   cum%
   10850ms 27.42% 27.42%    11230ms 28.38%  syscall.Syscall
    5060ms 12.79% 40.21%     5060ms 12.79%  github.com/minio/sha256-simd.blockSha
    4950ms 12.51% 52.72%     4990ms 12.61%  syscall.Syscall6
    1920ms  4.85% 57.57%     1920ms  4.85%  runtime.futex
    1760ms  4.45% 62.02%     1760ms  4.45%  runtime.memmove
    1650ms  4.17% 66.19%     1650ms  4.17%  crypto/aes.fillEightBlocks
    1250ms  3.16% 69.35%     1870ms  4.73%  runtime.scanobject
     750ms  1.90% 71.24%      750ms  1.90%  runtime.memclrNoHeapPointers
     620ms  1.57% 72.81%      620ms  1.57%  runtime.procyield
     550ms  1.39% 74.20%      550ms  1.39%  runtime.epollwait

aes128-gcm@openssh.com

Showing nodes accounting for 24.58s, 76.60% of 32.09s total
Dropped 327 nodes (cum <= 0.16s)
Showing top 10 nodes out of 135
      flat  flat%   sum%        cum   cum%
     9.98s 31.10% 31.10%     10.30s 32.10%  syscall.Syscall
     5.29s 16.48% 47.58%      5.38s 16.77%  syscall.Syscall6
     2.16s  6.73% 54.32%      2.16s  6.73%  crypto/aes.gcmAesDec
     1.84s  5.73% 60.05%      1.84s  5.73%  runtime.memmove
     1.60s  4.99% 65.04%      1.60s  4.99%  runtime.futex
     1.34s  4.18% 69.21%      1.78s  5.55%  runtime.scanobject
     1.04s  3.24% 72.45%      1.04s  3.24%  runtime.memclrNoHeapPointers
     0.64s  1.99% 74.45%      0.64s  1.99%  runtime.procyield
     0.42s  1.31% 75.76%      0.42s  1.31%  runtime.epollwait
     0.27s  0.84% 76.60%      0.40s  1.25%  runtime.findObject

aes128-gcm@openssh.com 8 streams:

Showing nodes accounting for 1.44mins, 77.30% of 1.86mins total
Dropped 477 nodes (cum <= 0.01mins)
Showing top 10 nodes out of 127
      flat  flat%   sum%        cum   cum%
  0.46mins 24.61% 24.61%   0.46mins 24.84%  syscall.Syscall6
  0.45mins 24.09% 48.70%   0.46mins 24.91%  syscall.Syscall
  0.19mins 10.47% 59.18%   0.19mins 10.47%  runtime.memmove
  0.12mins  6.46% 65.63%   0.12mins  6.46%  crypto/aes.gcmAesDec
  0.11mins  5.72% 71.35%   0.11mins  5.72%  runtime.memclrNoHeapPointers
  0.04mins  1.96% 73.32%   0.06mins  2.96%  runtime.scanobject
  0.03mins  1.49% 74.80%   0.03mins  1.49%  runtime.futex
  0.02mins  1.08% 75.88%   0.21mins 11.17%  runtime.mallocgc
  0.02mins  0.82% 76.71%   0.02mins  0.82%  runtime.procyield
  0.01mins  0.59% 77.30%   0.01mins  0.76%  runtime.findObject

@drakkan
Owner

drakkan commented Feb 29, 2020

Thanks! If you use the `traces` subcommand instead of `top` you'll get more details about the syscalls. Anyway, these traces look very similar to the ones I have locally. The second patch only affects uploads; in my allocator branch both patches are applied.

I would like to see the results behind a proxy too so we have all the info.

My allocator patch can take a while before being accepted upstream, I need to rewrite it in an acceptable way and this will require a refactoring in pkg/sftp.

I would like to summarize all the info and the required patches described in this issue and add them to the performance section of the README. Are you interested in submitting a pull request?

@HiFiPhile
Contributor Author

HiFiPhile commented Mar 1, 2020

Hi,
I've done the test with HAProxy using your config. Surprisingly, it increases single-stream performance! I've tried multiple times and every time it's 10-25% faster...

With HAProxy we get increased performance, but under high load HAProxy's own CPU usage is also significant; for example, I hit a CPU bottleneck at 8 streams.
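For reference, a minimal HAProxy TCP pass-through setup for a test like this might look as follows (a sketch only: ports, names, and timeouts are assumptions, not the actual config used in these tests; `send-proxy-v2` prepends the PROXY protocol header that SFTPGo can be configured to expect):

```
global
    maxconn 2048

defaults
    mode    tcp
    timeout connect 5s
    timeout client  8h
    timeout server  8h

frontend sftp_in
    bind :2222
    default_backend sftpgo

backend sftpgo
    # hypothetical single backend; add more servers to balance load
    server sftpgo1 127.0.0.1:2022 send-proxy-v2 check
```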

Download

aes128-ctr:

| Streams | MB/s | SFTPGo CPU% | HAProxy CPU% |
| ------- | ---- | ----------- | ------------ |
| 1       | 370  | 135         | 64           |
| 2       | 675  | 248         | 112          |
| 3       | 880  | 365         | 140          |
| 4       | 1150 | 410         | 175          |
| 8       | 1400 | 533         | 208          |

aes128-gcm@openssh.com:

| Streams | MB/s | SFTPGo CPU% | HAProxy CPU% |
| ------- | ---- | ----------- | ------------ |
| 1       | 500  | 132         | 74           |
| 2       | 900  | 230         | 130          |
| 3       | 1100 | 340         | 170          |
| 4       | 1500 | 380         | 200          |
| 8       | 1600 | 490         | 230          |

Upload

aes128-ctr:

| Streams | MB/s | SFTPGo CPU% | HAProxy CPU% |
| ------- | ---- | ----------- | ------------ |
| 1       | 340  | 183         | 55           |
| 2       | 490  | 305         | 85           |
| 3       | 560  | 380         | 100          |
| 4       | 650  | 410         | 110          |
| 8       | 700  | 467         | 128          |

aes128-gcm@openssh.com:

| Streams | MB/s | SFTPGo CPU% | HAProxy CPU% |
| ------- | ---- | ----------- | ------------ |
| 1       | 420  | 180         | 60           |
| 2       | 550  | 300         | 85           |
| 3       | 600  | 340         | 101          |
| 4       | 700  | 365         | 110          |
| 8       | 750  | 440         | 127          |

@drakkan
Owner

drakkan commented Mar 1, 2020

Thanks.

This is strange; on my laptop I see a very small performance loss using HAProxy.

So HAProxy can help with a faster CPU, or if the load is balanced between different servers.

I would like to summarize all the collected info and add it to the README. This way a user interested in SFTPGo performance can quickly find the relevant info without reading all the posts here. Are you interested in sending a pull request?

@HiFiPhile
Contributor Author

I'll try to summarize all elements when I have time.

What I don't understand is that even with only one backend, passing through HAProxy gives me a performance improvement for 1-3 streams. Is the PROXY protocol implementation more efficient than a plain TCP connection?

@drakkan
Owner

drakkan commented Mar 1, 2020

> I'll try to summarize all elements when I have time.

no hurry, thanks!

> What I don't understand is that even with only one backend, passing through HAProxy gives me a performance improvement for 1-3 streams. Is the PROXY protocol implementation more efficient than a plain TCP connection?

The PROXY protocol implementation simply reads the initial PROXY header, and after that it is a normal TCP connection; this is what happens on the Go side. Maybe HAProxy itself does some other optimizations.

Do you get the same result connecting on localhost? When connecting through HAProxy I see a small performance loss (tested on localhost only, using the sftp CLI).

@HiFiPhile
Contributor Author

HiFiPhile commented Mar 2, 2020

Hi,

I can confirm that running locally through HAProxy gives me a small performance loss:
aes128-ctr:
Download from 412MB/s to 387MB/s
Upload from 409MB/s to 393MB/s
aes128-gcm@openssh.com:
Download from 570MB/s to 565MB/s
Upload from 580MB/s to 570MB/s

And the download & upload speeds are very close.

Edit: updating to Go 1.14 gives me about a 10MB/s gain.

@drakkan
Owner

drakkan commented Mar 2, 2020

OK, so HAProxy has some internal optimizations, and it can be useful on localhost too if the CPU is not the bottleneck.

Regarding Go 1.14, I also noticed the performance increase; anyway, I will be a bit conservative here: the binaries for the 0.9.6 release (which should happen quite soon) will still be built using 1.13.x. I have no direct code that should handle EINTR, but some dependencies could.

The optimizations in my copy branch are now merged upstream and I updated SFTPGo to use pkg/sftp git master, so they are available to anyone using SFTPGo git now.

@drakkan
Owner

drakkan commented Mar 6, 2020

To summarize, to match OpenSSH performance we need:

  1. AES-CTR patch for Golang
  2. replace Golang SHA256 implementation with minio/sha256-simd
  3. use my experimental pkg/sftp branch

There is no specific patch for SFTPGo itself

@HiFiPhile
Contributor Author

Hi, have you thought about creating an experimental branch to get more feedback?

@drakkan
Owner

drakkan commented Mar 7, 2020

Hi, now that I have released 0.9.6 I want to try to improve my allocator patch and discuss its inclusion in pkg/sftp. If this cannot happen, maybe I'll create an experimental branch; let's see.

@jovandeginste
Contributor

jovandeginste commented Mar 8, 2020

Include an experimental release then, so current (edit: and new) users can test the new version without needing to compile from source...

@drakkan
Owner

drakkan commented Mar 12, 2020

@HiFiPhile I'm working on submitting a PR with an allocator for pkg/sftp. I have 4 different implementations; can you please report your results using my test branches?

// allocator2
replace github.com/pkg/sftp => github.com/drakkan/sftp v0.0.0-20200312214801-a4843576a666
// allocator
replace github.com/pkg/sftp => github.com/drakkan/sftp v0.0.0-20200312214412-fccc7efd3020
// allocator1
replace github.com/pkg/sftp => github.com/drakkan/sftp v0.0.0-20200312213133-f889f1be157b
// allocator1 pagelist
replace github.com/pkg/sftp => github.com/drakkan/sftp v0.0.0-20200312213115-d9388f0df0ad

These changes should be applied on top of the Optimized configuration.

The internal benchmark here:

https://github.com/drakkan/sftp/blob/allocator/allocator_test.go#L54

has a clear winner, but I think you will get very similar results in a real test.

thanks!

@HiFiPhile
Contributor Author

Hi,
I finished the tests, but it seems something is not right: the results are much lower than the earlier ones.

# allocator
for i in 1 2 3 4 8; do ~/bw_bench sftp hfp:/p4510/in.bench $i; done;
Download: 324MiB/s
Up: 314MiB/s
Download: 593MiB/s
Up: 484MiB/s
Download: 786MiB/s
Up: 571MiB/s
Download: 913MiB/s
Up: 625MiB/s
Download: 1130MiB/s
Up: 687MiB/s

# allocator 1
Download: 363MiB/s
Up: 319MiB/s
Download: 624MiB/s
Up: 471MiB/s
Download: 791MiB/s
Up: 565MiB/s
Download: 899MiB/s
Up: 620MiB/s
Download: 1092MiB/s
Up: 692MiB/s

# allocator 1 pagelist
Download: 348MiB/s
Up: 322MiB/s
Download: 605MiB/s
Up: 474MiB/s
Download: 776MiB/s
Up: 564MiB/s
Download: 895MiB/s
Up: 613MiB/s
Download: 1099MiB/s
Up: 690MiB/s

# allocator 2
Download: 348MiB/s
Up: 314MiB/s
Download: 611MiB/s
Up: 475MiB/s
Download: 785MiB/s
Up: 567MiB/s
Download: 912MiB/s
Up: 616MiB/s
Download: 1122MiB/s
Up: 687MiB/s

Here is the trace of allocator 2:
cpuprofile.gz

@drakkan
Owner

drakkan commented Mar 15, 2020

Sorry, my bad: the optimized mode is disabled by default, you have to enable it explicitly:

diff --git a/sftpd/server.go b/sftpd/server.go
index 91d8e6a..838d6fa 100644
--- a/sftpd/server.go
+++ b/sftpd/server.go
@@ -328,6 +328,7 @@ func (c Configuration) configureKeyboardInteractiveAuth(serverConfig *ssh.Server
 }
 
 func (c Configuration) configureSFTPExtensions() error {
+       sftp.SetEnabledAllocationMode(sftp.AllocationModeOptimized)
        err := sftp.SetSFTPExtensions(sftpExtensions...)
        if err != nil {
                logger.WarnToConsole("unable to configure SFTP extensions: %v", err)

@HiFiPhile
Contributor Author

Now it looks better :) They are quite similar in speed:

# allocator
for i in 1 2 3 4 8; do ~/bw_bench sftp hfp:/p4510/in.bench $i; done;
Download: 391MiB/s
Up: 360MiB/s
Download: 698MiB/s
Up: 557MiB/s
Download: 993MiB/s
Up: 694MiB/s
Download: 1230MiB/s
Up: 771MiB/s
Download: 2012MiB/s
Up: 842MiB/s

# allocator 1
Download: 400MiB/s
Up: 368MiB/s
Download: 711MiB/s
Up: 560MiB/s
Download: 1004MiB/s
Up: 691MiB/s
Download: 1236MiB/s
Up: 770MiB/s
Download: 1989MiB/s
Up: 851MiB/s

# allocator 1 pagelist
Download: 385MiB/s
Up: 366MiB/s
Download: 720MiB/s
Up: 558MiB/s
Download: 1011MiB/s
Up: 688MiB/s
Download: 1237MiB/s
Up: 784MiB/s
Download: 2026MiB/s
Up: 853MiB/s

# allocator 2
Download: 397MiB/s
Up: 359MiB/s
Download: 754MiB/s
Up: 544MiB/s
Download: 976MiB/s
Up: 679MiB/s
Download: 1243MiB/s
Up: 754MiB/s
Download: 2023MiB/s
Up: 839MiB/s

@drakkan
Owner

drakkan commented Mar 15, 2020

Thanks! Are these benchmarks for AES-CTR?

sha256-simd is in git master already; it improves performance on arm64 too (but on arm64 OpenSSH is still much faster: 70MB/s for SFTPGo vs 110MB/s for OpenSSH, for both uploads and downloads, on a Jetson Nano).

For the AES patch for Golang I cannot do anything: I'm unable to review that patch and ensure that it is correct, so I'm a bit reluctant to provide packages compiled with a patched Go.

I submitted a pull request to pkg/sftp using allocator1 (based on the internal benchmark it is the fastest one). If it gets merged I'll enable AllocationModeOptimized by default and close this issue, since nothing remains to be done here: to further improve performance we would need to write/improve assembler code for Golang crypto, and that is out of scope here.

If my patch for pkg/sftp gets merged I would like a pull request to add the new results to the performance doc too.

EDIT: I filed a bug to add support for GCM ciphers in FileZilla, and I sent a pull request to add support for aes256-gcm to Golang crypto.

@drakkan
Owner

drakkan commented Apr 10, 2020

@HiFiPhile, can you please do a last test using current git + AES CTR patch?

A pull request with the performance doc update would be really appreciated too, so we can finally close this issue. Thanks!

@drakkan
Owner

drakkan commented Apr 11, 2020

@HiFiPhile we could add, to the performance doc, a "Baseline next" configuration (or a better name) which is the current SFTPGo git master. What do you think?

The "Optimized" configuration now only needs to include the AES-CTR patch for Go, since both sha256-simd and the SFTP allocator are included in SFTPGo git master.

P.S.: based on the comment here, we didn't use the fastest AES-CTR patch available; anyway, I hope the Go developers will include one of these patches in Go 1.15.

@HiFiPhile
Contributor Author

@drakkan how about naming it "devel"?
I think I will get the tests done next week, but my disk array has changed: now I have only 2 disks instead of 3.

@drakkan
Owner

drakkan commented Apr 11, 2020

> @drakkan how about naming it "devel"?
> I think I will get the tests done next week, but my disk array has changed: now I have only 2 disks instead of 3.

No hurry, but since you now have only 2 disks the numbers will not be comparable with the previous ones; we need to document this or redo the other tests too.

For this test I suggest using Go 1.14.2.

Thanks!

@drakkan drakkan closed this as completed Apr 13, 2020
@jovandeginste
Contributor

If disk speed is relevant, and you have sufficient RAM, you could use /dev/shm (or create a new tmpfs).

@jovandeginste
Contributor

Oops, I only now read the other thread about ramdisks :-)
