V8 fast API calls #33374

Closed
targos opened this issue May 12, 2020 · 25 comments
Labels
meta Issues and PRs related to the general management of the project. v8 engine Issues and PRs related to the V8 dependency.

Comments

@targos
Member

targos commented May 12, 2020

Today, V8 8.3 landed on the master branch.
It includes a new C++ header, v8-fast-api-calls.h, that's probably interesting for us.

/cc @nodejs/v8 @addaleax

@targos targos added the v8 engine Issues and PRs related to the V8 dependency. label May 12, 2020
@devsnek
Member

devsnek commented May 12, 2020

I've been wanting to use this for a while, but since it only supports void as a return type at the moment, we might have to wait.

@mscdex
Contributor

mscdex commented May 12, 2020

Is there any more information on this anywhere? I'm curious about how much "faster" this is (compared to node's ObjectWrap if I'm understanding this correctly) or if it's more of a convenience thing?

@addaleax
Member

@mscdex It will be quite a bit faster, because the calls to these functions are put directly into the compiled JS code, as I understand it. That skips quite a bit of conversion and setup between C++ and JS.

compared to node's ObjectWrap if I'm understanding this correctly

Not really sure what you’re referring to here; we don’t use ObjectWrap in our own code. The comparison would be to standard C++ binding functions, i.e. those taking a FunctionCallbackInfo as an argument.

@jasnell
Member

jasnell commented May 12, 2020

I've been looking forward to this one also. :-) Other than the void return value, are there any other limitations to be aware of?

@addaleax
Member

@jasnell The biggest limitation is that the function can’t call into V8 in any way, so no interaction with JS objects, basically. Also, there’s a list of argument types in the header that are intended to be supported but aren’t yet, and overall the feature is still labelled experimental.

@devsnek
Member

devsnek commented May 12, 2020

diff --git a/deps/v8/include/v8-fast-api-calls.h b/deps/v8/include/v8-fast-api-calls.h
index bfce66b652..bee1fedf1d 100644
--- a/deps/v8/include/v8-fast-api-calls.h
+++ b/deps/v8/include/v8-fast-api-calls.h
@@ -192,11 +192,13 @@ class CTypeInfo {
                                    ArgFlags flags = ArgFlags::None) {
     uintptr_t wrapper_type_info_ptr =
         reinterpret_cast<uintptr_t>(wrapper_type_info);
+    /*
     // Check that the lower kIsWrapperTypeBit bits are 0's.
     CHECK_EQ(
         wrapper_type_info_ptr & ~(static_cast<uintptr_t>(~0)
                                   << static_cast<uintptr_t>(kIsWrapperTypeBit)),
         0);
+    */
     // TODO(mslekova): Refactor the manual bit manipulations to use
     // PointerWithPayload instead.
     return CTypeInfo(wrapper_type_info_ptr | flags | kIsWrapperTypeBit);
@@ -335,17 +337,17 @@ template <typename R, typename... Args>
 class CFunctionInfoImpl : public CFunctionInfo {
  public:
   CFunctionInfoImpl()
-      : return_info_(i::GetCType<R>::Get()),
+      : return_info_(GetCType<R>::Get()),
         arg_count_(sizeof...(Args)),
-        arg_info_{i::GetCType<Args>::Get()...} {
-    static_assert(i::GetCType<R>::Get().GetType() == CTypeInfo::Type::kVoid,
+        arg_info_{GetCType<Args>::Get()...} {
+    static_assert(GetCType<R>::Get().GetType() == CTypeInfo::Type::kVoid,
                   "Only void return types are currently supported.");
   }
 
   const CTypeInfo& ReturnInfo() const override { return return_info_; }
   unsigned int ArgumentCount() const override { return arg_count_; }
   const CTypeInfo& ArgumentInfo(unsigned int index) const override {
-    CHECK_LT(index, ArgumentCount());
+    // CHECK_LT(index, ArgumentCount());
     return arg_info_[index];
   }
 
diff --git a/lib/internal/process/per_thread.js b/lib/internal/process/per_thread.js
index d219cd7298..f871f20a2f 100644
--- a/lib/internal/process/per_thread.js
+++ b/lib/internal/process/per_thread.js
@@ -41,7 +41,9 @@ function assert(x, msg) {
 function wrapProcessMethods(binding) {
   const {
     hrtime: _hrtime,
-    hrtimeBigInt: _hrtimeBigInt,
+    // hrtimeBigInt: _hrtimeBigInt,
+    hrtimeBigIntBuffer,
+    hrtimeBigIntContext,
     cpuUsage: _cpuUsage,
     memoryUsage: _memoryUsage,
     resourceUsage: _resourceUsage
@@ -140,10 +142,10 @@ function wrapProcessMethods(binding) {
 
   // Use a BigUint64Array in the closure because this is actually a bit
   // faster than simply returning a BigInt from C++ in V8 7.1.
-  const hrBigintValues = new BigUint64Array(1);
+  const hrBigIntValues = new BigUint64Array(hrtimeBigIntBuffer);
   function hrtimeBigInt() {
-    _hrtimeBigInt(hrBigintValues);
-    return hrBigintValues[0];
+    hrtimeBigIntContext.hrtime();
+    return hrBigIntValues[0];
   }
 
   const memValues = new Float64Array(5);
diff --git a/src/node_process_methods.cc b/src/node_process_methods.cc
index 88f4c1cfbd..e6ee795a93 100644
--- a/src/node_process_methods.cc
+++ b/src/node_process_methods.cc
@@ -7,6 +7,7 @@
 #include "node_process.h"
 #include "util-inl.h"
 #include "uv.h"
+#include "v8-fast-api-calls.h"
 #include "v8.h"
 
 #include <vector>
@@ -156,11 +157,23 @@ static void Hrtime(const FunctionCallbackInfo<Value>& args) {
   fields[2] = t % NANOS_PER_SEC;
 }
 
-static void HrtimeBigInt(const FunctionCallbackInfo<Value>& args) {
-  Local<ArrayBuffer> ab = args[0].As<BigUint64Array>()->Buffer();
-  uint64_t* fields = static_cast<uint64_t*>(ab->GetBackingStore()->Data());
-  fields[0] = uv_hrtime();
-}
+class HrtimeData {
+ public:
+  HrtimeData(uint64_t* fields) : fields_(fields) {}
+
+  static void FastMethod(HrtimeData* receiver) {
+    receiver->fields_[0] = uv_hrtime();
+  }
+
+  static void SlowMethod(const v8::FunctionCallbackInfo<v8::Value>& args) {
+    HrtimeData* receiver = reinterpret_cast<HrtimeData*>(
+        args.Holder().As<Object>()->GetAlignedPointerFromInternalField(0));
+    FastMethod(receiver);
+  }
+
+ private:
+  uint64_t* fields_;
+};
 
 static void Kill(const FunctionCallbackInfo<Value>& args) {
   Environment* env = Environment::GetCurrent(args);
@@ -478,9 +491,45 @@ static void InitializeProcessMethods(Local<Object> target,
   env->SetMethod(target, "memoryUsage", MemoryUsage);
   env->SetMethod(target, "cpuUsage", CPUUsage);
   env->SetMethod(target, "hrtime", Hrtime);
-  env->SetMethod(target, "hrtimeBigInt", HrtimeBigInt);
   env->SetMethod(target, "resourceUsage", ResourceUsage);
 
+  {
+    // env->isolate()->set_embedder_wrapper_type_index(kV8EmbedderWrapperTypeIndex);
+    // env->isolate()->set_embedder_wrapper_object_index(kV8EmbedderWrapperObjectIndex);
+    v8::CFunction fast_func = v8::CFunction::Make(HrtimeData::FastMethod);
+    Local<v8::FunctionTemplate> ftmpl =
+        v8::FunctionTemplate::New(env->isolate(),
+                                  HrtimeData::SlowMethod,
+                                  Local<Value>(),
+                                  Local<v8::Signature>(),
+                                  0,
+                                  v8::ConstructorBehavior::kThrow,
+                                  v8::SideEffectType::kHasSideEffect,
+                                  &fast_func);
+
+    Local<v8::ObjectTemplate> otmpl = v8::ObjectTemplate::New(env->isolate());
+    otmpl->SetInternalFieldCount(1);
+    otmpl->Set(FIXED_ONE_BYTE_STRING(env->isolate(), "hrtime"), ftmpl);
+
+    Local<Object> obj = otmpl->NewInstance(env->context()).ToLocalChecked();
+
+    Local<ArrayBuffer> ab = ArrayBuffer::New(env->isolate(), 8);
+    HrtimeData* data = new HrtimeData(
+        reinterpret_cast<uint64_t*>(ab->GetBackingStore()->Data()));
+    obj->SetAlignedPointerInInternalField(0, data);
+
+    target
+        ->Set(env->context(),
+              FIXED_ONE_BYTE_STRING(env->isolate(), "hrtimeBigIntBuffer"),
+              ab)
+        .ToChecked();
+    target
+        ->Set(env->context(),
+              FIXED_ONE_BYTE_STRING(env->isolate(), "hrtimeBigIntContext"),
+              obj)
+        .ToChecked();
+  }
+
   env->SetMethod(target, "_getActiveRequests", GetActiveRequests);
   env->SetMethod(target, "_getActiveHandles", GetActiveHandles);
   env->SetMethod(target, "_kill", Kill);
@@ -494,5 +543,16 @@ static void InitializeProcessMethods(Local<Object> target,
 
 }  // namespace node
 
+namespace v8 {
+template <>
+class WrapperTraits<node::HrtimeData> {
+ public:
+  static const void* GetTypeInfo() {
+    static const int tag = 0;
+    return reinterpret_cast<const void*>(&tag);
+  }
+};
+}  // namespace v8
+
 NODE_MODULE_CONTEXT_AWARE_INTERNAL(process_methods,
                                    node::InitializeProcessMethods)
                                                       confidence improvement accuracy (*) (**) (***)
 process/bench-hrtime.js type='bigint' n=1000000              NA     29.12 %           NA   NA    NA

@mscdex
Contributor

mscdex commented May 12, 2020

@devsnek but how much of that supposed gain is from not having to call GetBackingStore() (which AFAIK is still slow) for every method invocation and how much is from this new "fast call" mechanism?

@devsnek
Member

devsnek commented May 12, 2020

That's a good point, I'm not sure.

@camillobruni
Contributor

@MayaLekova can provide more insights.

@MayaLekova
Contributor

Hi, thanks for bringing this up! The remarks by @addaleax are absolutely correct: the API is really experimental and the current implementation is a prototype of what we'll want to have in the end. The gains are expected to come for hot functions that are optimized by TurboFan, where the C++ callback is called without runtime type marshalling, but only if the type checks can be done at compile time (which in practice seems to be true for the WebGL use case I've tested with). There are two major limitations regarding the "fast" callbacks:

  • They can't call back to JS
  • They shouldn't trigger GC

We've planned to guard against these, but the current implementation doesn't police it, so you'll most probably get a crash instead of a meaningful error. Moreover, if the fast callback wants to signal an error, the idea is to do it via an output parameter of the callback, but this is currently being implemented and might involve some changes to the fast call API in the very near future. So the design is really in flux; ongoing work can be followed here. I'd appreciate any comments on it; we're really aiming to make it useful for different embedders, but we're still in the process of proving that the performance gains are worth the effort. Once we've passed that stage, I'll make sure to share an explainer about it, with more details (about the usage and the motivation) than the comments in the header file.

The benchmark posted by @devsnek is very interesting, thanks for sharing it! After chatting with @ulan, it seems that if the code is completely migrated to the new ArrayBuffer API, the GetBackingStore() should be pretty cheap, as it doesn't register the backing store. Could you modify the benchmark to be able to differentiate where the gain comes from? If it turns out that GetBackingStore() is the bottleneck, then please file a bug against V8.

Side note: the raw pointer returned by ab->GetBackingStore()->Data() might disappear if the holder ArrayBuffer is garbage collected. To prevent this, one might make sure that the shared_ptr returned by ab->GetBackingStore() is stored in some object that would outlive the callbacks accessing the data. Maybe that's a bad idea, but in this case that seems to be the HrtimeData itself, right?

Again, thanks for pinging me, I'll follow the discussion with interest and I'm looking forward to seeing more use cases for this experimental API!

@devsnek
Member

devsnek commented May 13, 2020

diff --git a/src/node_process_methods.cc b/src/node_process_methods.cc
index 88f4c1cfbd..39e36a91c8 100644
--- a/src/node_process_methods.cc
+++ b/src/node_process_methods.cc
@@ -156,9 +156,13 @@ static void Hrtime(const FunctionCallbackInfo<Value>& args) {
   fields[2] = t % NANOS_PER_SEC;
 }
 
+uint64_t* fields;
+
 static void HrtimeBigInt(const FunctionCallbackInfo<Value>& args) {
-  Local<ArrayBuffer> ab = args[0].As<BigUint64Array>()->Buffer();
-  uint64_t* fields = static_cast<uint64_t*>(ab->GetBackingStore()->Data());
+  if (!fields) {
+    Local<ArrayBuffer> ab = args[0].As<BigUint64Array>()->Buffer();
+    fields = static_cast<uint64_t*>(ab->GetBackingStore()->Data());
+  }
   fields[0] = uv_hrtime();
 }
                                                       confidence improvement accuracy (*) (**) (***)
 process/bench-hrtime.js type='bigint' n=1000000              NA     62.56 %           NA   NA    NA

@ulan
Contributor

ulan commented May 13, 2020

@devsnek thanks for checking! Could you please also try keeping
Local<ArrayBuffer> ab = args[0].As<BigUint64Array>()->Buffer(); unconditionally?

It creates a handle and dereference a field which may contribute to the overhead.

@devsnek
Member

devsnek commented May 13, 2020

@ulan

diff --git a/src/node_process_methods.cc b/src/node_process_methods.cc
index 88f4c1cfbd..77d0e634df 100644
--- a/src/node_process_methods.cc
+++ b/src/node_process_methods.cc
@@ -156,9 +156,13 @@ static void Hrtime(const FunctionCallbackInfo<Value>& args) {
   fields[2] = t % NANOS_PER_SEC;
 }
 
+uint64_t* fields;
+
 static void HrtimeBigInt(const FunctionCallbackInfo<Value>& args) {
   Local<ArrayBuffer> ab = args[0].As<BigUint64Array>()->Buffer();
-  uint64_t* fields = static_cast<uint64_t*>(ab->GetBackingStore()->Data());
+  if (!fields) {
+    fields = static_cast<uint64_t*>(ab->GetBackingStore()->Data());
+  }
   fields[0] = uv_hrtime();
 }
                                                  confidence improvement accuracy (*)   (**)  (***)
 process/bench-hrtime.js type='bigint' n=1000000        ***     24.78 %       ±1.55% ±2.23% ±3.26%

@ulan
Contributor

ulan commented May 13, 2020

@devsnek thanks again! Most of the overhead is coming from local handle construction.

@MayaLekova
Contributor

@devsnek Thanks a lot for the more detailed measurements!

Do I read it correctly that the "fast call" version (snippet 1 that you shared) is only better than the original code because of the savings from not always creating the local handle and not always calling GetBackingStore (as snippet 2 suggests)? Still, it's unclear to me why snippet 1 is so much slower than snippet 2.

Also, by comparing snippets 1 and 3, is it fair to conclude that ~25% of the gain comes from not calling GetBackingStore and the rest ~4% (up to the ~29% we see in snippet 1) come from not creating the handle? Or should we not compare those results because of the low confidence level for snippet 1?

@devsnek
Member

devsnek commented May 14, 2020

@MayaLekova do those isolate index values have to be set? I assumed not from the docs in the header (and since they're not exposed on the public isolate yet)

@MayaLekova
Contributor

@devsnek Good point. Actually they need to be set, as this check would fire otherwise. Unfortunately the change that adds them to the public isolate landed in V8 8.4, sorry for this version mismatch. If you want to apply them manually, you could add the following to v8.h and api.cc.

Which makes me wonder whether the "fast" callback fires at all with your change from snippet 1?

@devsnek
Member

devsnek commented May 14, 2020

@MayaLekova ok yeah with that patch we're seeing the good numbers:

                                                  confidence improvement accuracy (*)   (**)  (***)
 process/bench-hrtime.js type='bigint' n=1000000        ***    246.08 %       ±2.61% ±3.49% ±4.57%

@MayaLekova
Contributor

Thanks a lot for applying those manually! Sorry again that they're not part of the original commit (in V8 8.3), but as I mentioned before, the API is still highly experimental.

The numbers that you share are really promising though and I'll have in mind the opportunity to run snippet 1 (with the augmentation from 8.4) as a way to show the isolated improvement and make sure to not regress it with future changes.

For all: we have some ongoing discussion about the implementation details in this doc (also mentioned in the tracking bug), please ping me for commenter access if you're interested to join. Outside of that, any comments and suggestions about the actual interface are welcome here too, thanks!

@targos
Member Author

targos commented May 15, 2020

@devsnek feel free to backport https://chromiumdash.appspot.com/commit/0d6debcc5f08fe3c41d07686f82c9de05310519f to master. We already have it in v14.x for forward-compat.

@devsnek
Member

devsnek commented May 28, 2020

@MayaLekova has automatically generating the slow-path function from the information known about the fast-path function been considered? In Node, the majority of our calls into C++ already use validated arguments, so we will end up with boilerplate conversions, which V8 could probably do faster anyway since it doesn't have to do handle allocation.

@MayaLekova
Contributor

@MayaLekova has automatically generating the slow path function from the information known about the fast path function been considered?

That's a good idea, but I'd say that's up to the embedder. In Blink for example, all bindings (fast and slow) will be generated from the WebIDL describing the low-level APIs. Does Node.js provide some similar typing information?

Regarding handle allocations, not entirely sure I got your point. For the slow version, V8 still has to pass handlified objects, as it's not restricted w.r.t. triggering GC.

@devsnek
Member

devsnek commented May 28, 2020

@MayaLekova We don't have such a way to generate our bindings, though I suppose it would technically be possible? I'm thinking that the large majority of cases where we could use fast calls will look like this:

int FastWhatever(int a, int b) {
  // logic
}

void SlowWhatever(const FunctionCallbackInfo<Value>& args) {
  CHECK(args[0]->IsInt32());
  int a = args[0].As<Integer>()->Value();
  CHECK(args[1]->IsInt32());
  int b = args[1].As<Integer>()->Value();
  args.GetReturnValue().Set(FastWhatever(a, b));
}

Our C++ bindings generally take already-validated arguments; they aren't exposed to userland directly, so we don't usually perform any complex logic for converting the values.

As for handles, I just meant if v8 is the one doing the args[0].As<Integer>()->Value() it doesn't have to create the intermediate FunctionCallbackInfo and Local<Value> and Local<Integer>.

@MayaLekova
Contributor

Regarding automatic generation of the slow callback, just realized we already had such an idea: "We might want to create a FunctionTemplate::New overload that takes only the fast method, and automatically creates an appropriate slow method wrapping it. The automatically created slow version would generate and perform the type checks from the signature." - it's possible, but I need to re-think whether it should be the task of V8 or of the embedder to generate those.

Regarding handle creation, I see your point, it makes sense to spare the work indeed. How would the SlowWhatever look like then?

nodejs-ci pushed a commit to nodejs/node-v8 that referenced this issue Sep 28, 2020
This commit removes the WrapperTraits specialization for FastHrtime
according to recent changes in the V8 API.

Refs: nodejs/node#33374
targos pushed a commit that referenced this issue Oct 18, 2020
This commit removes the WrapperTraits specialization for FastHrtime
according to recent changes in the V8 API.

Refs: #33374

PR-URL: #35415
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Jiawen Geng <[email protected]>
Reviewed-By: Daniel Bevenius <[email protected]>
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Myles Borins <[email protected]>
@targos targos closed this as completed Nov 29, 2020
@Brooooooklyn

Did Node-API benefit from the fast API calls? Is it possible to expose a faster JsObject <=> C struct serialize/deserialize in Node-API?
