Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make crypto functions execute in the threadpool asynchronously #678

Closed
jonathanong opened this issue Jan 31, 2015 · 52 comments
Closed

make crypto functions execute in the threadpool asynchronously #678

jonathanong opened this issue Jan 31, 2015 · 52 comments
Labels
crypto Issues and PRs related to the crypto subsystem. feature request Issues that request new features to be added to Node.js. help wanted Issues that need assistance from volunteers or PRs that need help to proceed.

Comments

@jonathanong
Copy link
Contributor

i would expect:

  • the async implementation to be accessed through the streaming API
  • the sync implementation to be accessed through the legacy .update() and .digest() methods
  • semver major as people expect the stream to be return the result synchronously (have seen it in many modules where they don't listen to the readable event)

btw i have no idea how much performance benefit this would provide, if any. i just like the idea of having everything executing in the threadpool...

nodejs/node-v0.x-archive#4298

@brendanashworth brendanashworth added the crypto Issues and PRs related to the crypto subsystem. label Jan 31, 2015
@kyriosli
Copy link

kyriosli commented Feb 2, 2015

I dont think so. Not like zlib, apis in crypto module are commonly in O(n) or O(nlog n) time.

@Fishrock123
Copy link
Contributor

cc @indutny?

@indutny
Copy link
Member

indutny commented Feb 16, 2015

I don't think that there will be any performance benefit for AES/Hashing and stuff like that. Maybe DiffieHellman, but it does not have stream API.

@brendanashworth brendanashworth added the feature request Issues that request new features to be added to Node.js. label Jun 24, 2015
@Fishrock123
Copy link
Contributor

Is there a good case for keeping this request open?

@indutny
Copy link
Member

indutny commented Jun 25, 2015

Yeah, for DH.

@Trott Trott added the help wanted Issues that need assistance from volunteers or PRs that need help to proceed. label Mar 11, 2016
@jorangreef
Copy link
Contributor

jorangreef commented Aug 29, 2016

I would like to second this, but for symmetric ciphers (createCipher, createCipheriv, createDecipher, createDeciperiv).

It would really be great to have a way to do cipher and decipher ops off the main thread, not for any performance benefit, but so as not to block the event loop.

I know a single cipher op over 64KB takes around 0.2ms (AES-256-GCM), but for a server doing crypto at rest on an SSD this could mean that 5000-10000 cipher ops a second would quickly burn through the entire main thread budget.

Being able to do cipher and decipher ops in the threadpool would make it easier to get more throughput, without the event loop blocking, and with the event loop being the control plane and not the data plane.

@jorangreef
Copy link
Contributor

jorangreef commented Oct 13, 2016

I created a module, @ronomon/crypto-async, to test out the idea of doing cipher, hash and hmac operations in the threadpool. Latency per operation is only slightly affected (or not at all) through interaction with the thread pool, but the throughput gains are up to 3x. Best of all, the event loop is not blocked. The benchmark is here.

@bnoordhuis
Copy link
Member

@jorangreef Looks interesting and the numbers are impressive but I'm curious how it holds up in real-world scenarios. Since it uses thread pool, it's going to suffer from head-of-line blocking if there are slow DNS or file operations queued up.

@jorangreef
Copy link
Contributor

Thanks @bnoordhuis, would increasing UV_THREADPOOL_SIZE to 64 or 128 help? I think UV_THREADPOOL_SIZE should usually be more than the number of CPU cores because most of the threads will be waiting and will not be hot on the CPU?

@bnoordhuis
Copy link
Member

That should help when most threads are blocked on I/O but latency will probably go through the roof when you have all 64 or 128 threads doing encryption.

With computationally bound workloads you normally don't want more than N or N-1 threads (where N = number of cores) because otherwise you pay too much in scheduling overhead.

@jorangreef
Copy link
Contributor

jorangreef commented Oct 13, 2016

Yes, with the benchmark I set concurrency to 1/2 number of available cores to keep latency within reasonable bounds: https://github.com/ronomon/crypto-async/blob/master/benchmark.js#L6

@jorangreef
Copy link
Contributor

I was surprised that latency increased already when using N, and N-1 threads (where N = number of cores). I thought it would only increase from N-1. Is there some contention in the threadpool that can be optimized?

@bnoordhuis
Copy link
Member

perf record should be able to tell you but if there is contention, it's probably around the global queue lock. Are there other processes running when you benchmark? Do you have SMT (hyperthreading) enabled? What does the page fault rate look like? (Perf can measure that last one.)

@jorangreef
Copy link
Contributor

I ran --prof instead of perf record if that's fine. Both machines have hyperthreading I think, os.cpus() shows 4 or 8 cpus when there are physically 2 or 4 cores respectively.

[Summary]:
  ticks  total  nonlib   name
   572    6.2%    6.7%  JavaScript
  6389   69.2%   74.4%  C++
   488    5.3%    5.7%  GC
   646    7.0%          Shared libraries
  1621   17.6%          Unaccounted

[C++ entry points]:
  ticks    cpp   total   name
   773   31.8%    8.4%  v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*)
   666   27.4%    7.2%  sha256_block_data_order_avx
   191    7.9%    2.1%  _aesni_ctr32_encrypt_blocks

...

[Bottom up (heavy) profile]:
 Note: percentage shows a share of a particular caller in the total
 amount of its parent calls.
 Callers occupying less than 2.0% are not shown.

  ticks parent  name
  3024   32.8%  ___shared_region_check_np

  1621   17.6%  UNKNOWN
   834   51.4%    v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*)

@bnoordhuis
Copy link
Member

I ran --prof instead of perf record if that's fine.

--prof only profiles the main thread.

@jorangreef
Copy link
Contributor

jorangreef commented Oct 14, 2016

--prof only profiles the main thread.

I ran perf record and uploaded the flamegraph here.

How can one get the page fault rate from the raw output?

@bnoordhuis
Copy link
Member

You can collect it with perf record -e cpu-clock,page-faults. See perf list -h, you can also split it out by major and minor page faults.

@jorangreef
Copy link
Contributor

jorangreef commented Dec 28, 2016

Just an update, after more tests and benchmarks, there may be a case for not limiting crypto in Node to 1 core by design. It would probably be good if Node's crypto could scale with the number of cores available.

For example, with @ronomon/crypto-async on 4 cores (and with 1 MB buffers), you can do AES-256-CTR encryption at 3050.40 MB/s vs 945.20 MB/s for Node restricted to 1 core.

For buffers larger than 1024 bytes, it may make more throughput sense to do the crypto asynchronously, off the event loop, and there are ways to solve threadpool competition with dns and fs operations. Here are some notes around controlling concurrency and adjusting threadpool size.

@Trott
Copy link
Member

Trott commented Mar 4, 2018

Should this remain open?

@jorangreef
Copy link
Contributor

Yes, this should definitely remain open.

There are huge (2x-3x) real-world throughput gains to be had for cipher, hash, and hmac operations on large (1MB+) buffers, scaling linearly with the number of cores, see crypto-async:

   AES-256-CTR: 64 x 1048576 Bytes
        crypto: Latency: 1.105ms Throughput: 945.20 MB/s
  crypto-async: Latency: 1.362ms Throughput: 3050.40 MB/s
  
   HASH-SHA256: 64 x 1048576 Bytes
        crypto: Latency: 3.023ms Throughput: 347.71 MB/s
  crypto-async: Latency: 3.162ms Throughput: 1290.56 MB/s

   HMAC-SHA256: 64 x 1048576 Bytes
        crypto: Latency: 3.134ms Throughput: 335.54 MB/s
  crypto-async: Latency: 3.974ms Throughput: 1048.58 MB/s

There are some tuning considerations, but it's logical, since you can start leveraging more CPU cores for your crypto, not just a single core. A single core just can't keep up with doing crypto for high throughput network links or non-volatile memory devices these days.

Node's threadpool does have issues with conflating IO tasks (which are run cold) with CPU tasks (which are run hot), but that does not mean async crypto is not worth pursuing, it just means the threadpool needs some work.

@ryzokuken
Copy link
Contributor

@Trott what's the status on this? Are we working on it?

@addaleax
Copy link
Member

addaleax commented Apr 3, 2018

@ryzokuken I don’t think anybody is working on that.

The things that are discussed here, about some separation in the thread pool between I/O- and CPU-bound work might be a good idea, but essentially it’s a different topic, and if you feel comfortable tackling this particular issue, go for it!

@ryzokuken
Copy link
Contributor

@addaleax let me give this a try.

@davisjam
Copy link
Contributor

davisjam commented Apr 18, 2018

about some separation in the thread pool between I/O- and CPU-bound work

There's some discussion on the API in this libuv PR. On my backburner...

jasnell added a commit to jasnell/node that referenced this issue Oct 7, 2020
@jasnell jasnell closed this as completed in dae283d Oct 8, 2020
joesepi pushed a commit to joesepi/node that referenced this issue Jan 8, 2021
Fixes: nodejs#678
Refs: nodejs#26854

Signed-off-by: James M Snell <[email protected]>

PR-URL: nodejs#35093
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Michaël Zasso <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
@tclzcja
Copy link

tclzcja commented Jul 6, 2021

For anyone who's still following this issue, Node.js now support Web Crypto API which is asynchronous.

See also here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
crypto Issues and PRs related to the crypto subsystem. feature request Issues that request new features to be added to Node.js. help wanted Issues that need assistance from volunteers or PRs that need help to proceed.
Projects
None yet
Development

No branches or pull requests