Use rkyv Archive types directly instead of deserializing #2758

Closed
Amanieu opened this issue Jan 19, 2022 · 3 comments · Fixed by #3029
Labels: 🎉 enhancement, priority-medium, project-near

@Amanieu
Contributor

Amanieu commented Jan 19, 2022

Wasmer uses rkyv as its serialization framework but performs a full deserialization when loading a module, and therefore misses out on rkyv's main feature: zero-copy deserialization. rkyv works by creating a separate Archive type for every serializable type. These types can be used straight from the byte buffer without any processing.

Wasmer should switch to using the Archive version of serializable types in the VM to improve the speed of module loading by entirely avoiding deserialization overhead.
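To make the difference concrete, here is a minimal std-only sketch of the idea. It is not rkyv's API: the buffer layout (a little-endian u32 count followed by u32 entries) and the names `encode`, `deserialize_all` and `ArchivedView` are assumptions made up for illustration.

```rust
/// Hypothetical layout: a little-endian u32 count, then `count` little-endian u32 entries.
fn encode(values: &[u32]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + values.len() * 4);
    buf.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}

/// Reads one little-endian u32 at `offset`, or None if the buffer is too short.
fn read_u32(buf: &[u8], offset: usize) -> Option<u32> {
    let bytes = buf.get(offset..offset + 4)?;
    let mut raw = [0u8; 4];
    raw.copy_from_slice(bytes);
    Some(u32::from_le_bytes(raw))
}

/// Full deserialization: allocates a Vec and copies every entry up front.
fn deserialize_all(buf: &[u8]) -> Option<Vec<u32>> {
    let count = read_u32(buf, 0)? as usize;
    (0..count).map(|i| read_u32(buf, 4 + i * 4)).collect()
}

/// "Archived" view: borrows the buffer and decodes an entry only when asked.
struct ArchivedView<'a> {
    buf: &'a [u8],
}

impl<'a> ArchivedView<'a> {
    /// Checks the length once so that later accesses stay in bounds.
    fn new(buf: &'a [u8]) -> Option<Self> {
        let count = read_u32(buf, 0)? as usize;
        if buf.len() < 4 + count * 4 {
            return None;
        }
        Some(ArchivedView { buf })
    }

    fn len(&self) -> usize {
        read_u32(self.buf, 0).unwrap_or(0) as usize
    }

    fn get(&self, i: usize) -> Option<u32> {
        if i >= self.len() {
            return None;
        }
        read_u32(self.buf, 4 + i * 4)
    }
}
```

With `deserialize_all` the cost grows with the size of the module even if only a few entries are ever read; with `ArchivedView` the up-front work is a single length check.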

@nagisa
Contributor

nagisa commented Jan 20, 2022

I've been working on implementing a similar change in one of the forks. This is definitely doable and produces significant speed-ups [1], but the separation and isolation between Artifact, Engine, Instance and ModuleInfo is not super great, so implementing a change like this forces significant API changes (e.g. removing module_ref getters, which touch everything else). The change is further complicated by the prevalent use of Arc<dyn ...> types. One more complication is that it isn't super easy to construct rkyv data structures from scratch, which would be necessary when the Artifact is compiled from wasm in the first place.

There are definitely some preparation steps that can help to implement this more easily:

  • Make sure constructing an Artifact does not allocate into an Engine (e.g. UniversalArtifact does this – move those to instantiation time)
  • Remove the ModuleInfo from any structure other than implementers of Artifact that contain it – obtain this information by passing a reference to an Artifact to the methods (all the way to the public facing API)
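A rough sketch of the second preparation step, assuming simplified stand-ins rather than wasmer's real types: `ModuleInfo`, `Artifact`, `OwnedArtifact`, `Instance` and `record_call` below are all hypothetical shapes chosen for illustration. The point is only that the instance stores no module metadata and receives the artifact by reference whenever it needs some.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for ModuleInfo: export names mapped to function indices.
struct ModuleInfo {
    exports: HashMap<String, u32>,
    function_count: usize,
}

/// Hypothetical Artifact trait: implementers are the only owners of ModuleInfo.
trait Artifact {
    fn module_info(&self) -> &ModuleInfo;
}

struct OwnedArtifact {
    info: ModuleInfo,
}

impl Artifact for OwnedArtifact {
    fn module_info(&self) -> &ModuleInfo {
        &self.info
    }
}

/// The instance keeps only its own mutable state, with no copy of ModuleInfo.
struct Instance {
    call_counts: Vec<u64>,
}

impl Instance {
    fn new(artifact: &dyn Artifact) -> Self {
        let count = artifact.module_info().function_count;
        Instance { call_counts: vec![0; count] }
    }

    /// Metadata is looked up through the artifact the caller passes in.
    /// Returns the updated call count, or None for an unknown export.
    fn record_call(&mut self, artifact: &dyn Artifact, name: &str) -> Option<u64> {
        let index = *artifact.module_info().exports.get(name)? as usize;
        let count = self.call_counts.get_mut(index)?;
        *count += 1;
        Some(*count)
    }
}
```

The cost of this shape is visible in the signatures: every caller, up to the public API, has to hold on to the artifact and pass it along.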

Footnotes

  1. Even when the Artifact becomes a Vec<u8> of the serialized data (to ensure it remains “owned”; an artifact that borrows its backing data is a further can of worms), which forces one copy of the artifact data, this can result in speed-ups on the order of 3 to 5x, measured from deserialization to executing the wasm module's entrypoint.

@Amanieu Amanieu added the 🎉 enhancement New feature! label Feb 9, 2022
@Amanieu Amanieu self-assigned this Feb 9, 2022
@Amanieu Amanieu added the priority-medium Medium priority issue label Feb 9, 2022
@nagisa
Contributor

nagisa commented Mar 28, 2022

So, we had more time to work on this on our end, and while I don't anticipate having time to replicate similar changes upstream, I can at least write down some more information for whoever attempts this. Downstream we have removed a fair amount of functionality, which made implementing this feature significantly more straightforward. Of note is, for example, that we removed the ability to obtain the list of Exports and Imports, since they are tightly coupled with having access to the ModuleInfo.

There have also been some changes from what I've written in the comment above, so just ignore that one ^^

  1. rkyv integration: Use of rkyv probably wants to become unconditional (i.e. not a feature);
  2. rkyv integration: Use #[archive(as = "Self")] where possible. This makes the Archive type the same as the “native” one, which simplifies the effort a lot. This is largely applicable to the various index types. rkyv will complain if the type is not compatible with this attribute, so it is safe to attempt adding one everywhere there's a rkyv::Archive derive.
  3. rkyv integration: For types that cannot be made #[archive(as = "Self")] you will want to have some type that represents a reference to data which can be constructed from both the Archive and the “native” type. For example I added something like FunctionTypeRef.

These three steps above are largely independent and can be implemented as a preparatory step.
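Point 3 could look roughly like the following std-only sketch. Only the name FunctionTypeRef comes from the comment above; its fields, the `Type` enum, `signature_matches`, and especially `ArchivedFunctionType` (a hand-written stand-in for what rkyv would generate, using boxed slices in place of rkyv's archived containers) are assumptions.

```rust
/// Value types are plain data, so the native and archived forms can be the
/// same type (the situation #[archive(as = "Self")] is meant for).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Type {
    I32,
    I64,
    F32,
    F64,
}

/// "Native" form, as produced when compiling from wasm.
struct FunctionType {
    params: Vec<Type>,
    results: Vec<Type>,
}

/// Hypothetical stand-in for the archived form read from a serialized buffer.
struct ArchivedFunctionType {
    params: Box<[Type]>,
    results: Box<[Type]>,
}

/// Borrowed view that either form can produce without copying.
#[derive(Clone, Copy, Debug)]
struct FunctionTypeRef<'a> {
    params: &'a [Type],
    results: &'a [Type],
}

impl<'a> From<&'a FunctionType> for FunctionTypeRef<'a> {
    fn from(ty: &'a FunctionType) -> Self {
        FunctionTypeRef { params: &ty.params, results: &ty.results }
    }
}

impl<'a> From<&'a ArchivedFunctionType> for FunctionTypeRef<'a> {
    fn from(ty: &'a ArchivedFunctionType) -> Self {
        FunctionTypeRef { params: &ty.params, results: &ty.results }
    }
}

/// Code written against the view does not care where the data came from.
fn signature_matches<'a, 'b>(expected: FunctionTypeRef<'a>, actual: FunctionTypeRef<'b>) -> bool {
    expected.params == actual.params && expected.results == actual.results
}
```

Anything that only reads signatures can then take a FunctionTypeRef and serve both the freshly compiled and the "deserialized" path.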

  1. Executable: We opted to add another kind of data which represents a “compiled, but not loaded” entity. The engine-universal implementation of this thing has two variants: an owned UniversalExecutable, which is produced by compiling a module, and a UniversalExecutableRef, which is typically obtained when “deserializing” a previously serialized executable. This Executable can then be loaded with UniversalEngine::load to obtain an Artifact.
    • Due to the fact that there are two somewhat distinct implementations of the Executable I ended up having to largely duplicate the implementation of load.
    • When using the UniversalExecutableRef, strive to convert the Archives to the “native” type (unrkyv is what I've been calling this operation) at the very leaf of the data structure – that is, instead of unrkyving a vector and iterating over it, iterate over the archived slice and unrkyv the distinct elements as necessary.
    • load is the last moment when ModuleInfo is available. Operations that heavily rely on that information (such as import resolution/linking) can and should be completed here so that the information required for this step doesn't need to be stored with the Artifacts.
  2. Artifact: Implementations of Artifact become entirely self-contained types and represent Instance data that's idempotent (immutable and shareable between Instances). Notably, load is a major cost centre, so I have strived to reduce the contents of Artifact to the bare minimum, most notably removing fields like ModuleInfo from it.
    • As a result it becomes sensible to “just” store an Arc<dyn Artifact> inside the Instance.
    • Future direction: Decide where the entities WebAssembly specification calls instances are actually stored – in the Store (engine) or with the Artifacts themselves and enforce that decision universally. Right now some data, such as machine code, is stored with the engine, and some other data, such as data initializers, are retained within Artifacts. It is a mess.
  3. Instance: As mentioned above, the process of instantiation becomes a process of allocating space for mutable data only. All of the idempotent data is already within the Artifact and a reference to the Artifact can be stored within the instance.
    • I found that retaining more type information about the contents of the Instance simplifies the loading somewhat significantly. For example I added VMSharedSignature and trampoline fields to the VMFunctionImport, instead of trying to maintain this information in side tables.
  4. Tests and other supporting code: The Instance loses the ability to provide information about the source wasm module, so the calling code is now responsible for maintaining a reference to both an Executable and the Instance in order to achieve some operations. One example is the wast tests, which want access to the list of imports (IIRC).
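The Executable, load, Artifact and Instance flow from the list above can be sketched roughly as follows. This is a toy model under stated assumptions, not the fork's code: `OwnedExecutable`, `ExecutableRef`, `LoadedArtifact`, `resolve` and the `host` map of import addresses are all hypothetical, and a borrowed slice stands in for archived data.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical "compiled, but not loaded" executable in its owned form.
struct OwnedExecutable {
    import_names: Vec<String>,
    initial_memory: Vec<u8>,
}

/// Hypothetical borrowed form, standing in for an archived executable that
/// points into a serialized buffer.
struct ExecutableRef<'a> {
    import_names: &'a [&'a str],
    initial_memory: &'a [u8],
}

/// Immutable data shared between instances. Imports are resolved during
/// load, so no module metadata has to be stored here.
trait Artifact {
    fn resolved_imports(&self) -> &[usize];
    fn initial_memory(&self) -> &[u8];
}

struct LoadedArtifact {
    resolved_imports: Vec<usize>,
    initial_memory: Vec<u8>,
}

impl Artifact for LoadedArtifact {
    fn resolved_imports(&self) -> &[usize] {
        &self.resolved_imports
    }
    fn initial_memory(&self) -> &[u8] {
        &self.initial_memory
    }
}

/// Resolves each name as it is visited, instead of first converting the
/// whole list into owned strings. Returns None if any import is unknown.
fn resolve<'n>(
    names: impl Iterator<Item = &'n str>,
    host: &HashMap<&str, usize>,
) -> Option<Vec<usize>> {
    names.map(|name| host.get(name).copied()).collect()
}

impl OwnedExecutable {
    fn load(&self, host: &HashMap<&str, usize>) -> Option<Arc<dyn Artifact>> {
        let resolved = resolve(self.import_names.iter().map(|s| s.as_str()), host)?;
        let artifact: Arc<dyn Artifact> = Arc::new(LoadedArtifact {
            resolved_imports: resolved,
            initial_memory: self.initial_memory.clone(),
        });
        Some(artifact)
    }
}

impl<'a> ExecutableRef<'a> {
    /// Mostly a duplicate of the owned path, as noted in the list above.
    fn load(&self, host: &HashMap<&str, usize>) -> Option<Arc<dyn Artifact>> {
        let resolved = resolve(self.import_names.iter().copied(), host)?;
        let artifact: Arc<dyn Artifact> = Arc::new(LoadedArtifact {
            resolved_imports: resolved,
            initial_memory: self.initial_memory.to_vec(),
        });
        Some(artifact)
    }
}

/// Instantiation allocates only the mutable state and keeps a handle to the
/// shared, immutable artifact.
struct Instance {
    artifact: Arc<dyn Artifact>,
    memory: Vec<u8>,
}

impl Instance {
    fn new(artifact: Arc<dyn Artifact>) -> Self {
        let memory = artifact.initial_memory().to_vec();
        Instance { artifact, memory }
    }
}
```

Two instances created from the same artifact share the resolved imports through the Arc and each get their own copy of the memory, which is the split between idempotent and mutable data described in points 2 and 3.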

I have found that making changes of this nature keeps touching the entire codebase. Notably compilers were probably the only component of wasmer largely unaffected by the changes I've made, so implementing something like this is going to be not only a major undertaking but also a breaking change.

EDIT: oh, and you know where you can find me (rust-lang zulip) if you have any questions ^^

cc @syrusakbary

@heyjdp heyjdp assigned epilys and unassigned Amanieu Apr 27, 2022
@epilys epilys linked a pull request May 5, 2022 that will close this issue
2 tasks
@syrusakbary syrusakbary added this to the v3.0 milestone Jun 1, 2022
@syrusakbary
Member

This is done in #2869; we just need to rebase it.
