WASI journal and stateful persistence #4263

Merged
merged 137 commits into from
Jan 4, 2024

@john-sharratt john-sharratt commented Oct 20, 2023

WASM Journal Functionality

Wasmer now supports journals for the state of a WASM process. This gives the ability to persist changes made to the temporary file system and to save and store snapshots of the running process.

The journal file is a linear history of the events that occurred while the process was running. Replaying it brings the process to a discrete and deterministic state.

Journal files can be concatenated, compacted and filtered to change the discrete state.

These journals are maintained in a consistent and durable way, ensuring that failures of the system while the process is running do not corrupt the journal.
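To make the replay and concatenation ideas concrete, here is a minimal sketch. The `Event` enum, `FsState` alias, `replay` and `concat` are hypothetical names made up for illustration and are not the types used in this PR; the real journal records many more kinds of events.

```rust
use std::collections::BTreeMap;

/// Hypothetical journal events (illustration only, not the PR's types).
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    WriteFile { path: String, data: Vec<u8> },
    RemoveFile { path: String },
}

/// State rebuilt from a journal: file path -> file contents.
pub type FsState = BTreeMap<String, Vec<u8>>;

/// Replays the events in order. The same history always yields the same state.
pub fn replay(events: &[Event]) -> FsState {
    let mut state = FsState::new();
    for event in events {
        match event {
            Event::WriteFile { path, data } => {
                state.insert(path.clone(), data.clone());
            }
            Event::RemoveFile { path } => {
                state.remove(path);
            }
        }
    }
    state
}

/// Concatenating two journals appends the second history to the first.
pub fn concat(first: &[Event], second: &[Event]) -> Vec<Event> {
    first.iter().chain(second.iter()).cloned().collect()
}
```

Because the state is a pure function of the event list, replaying a concatenated journal gives the same result as replaying the first journal and then applying the second on top of it.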

Snapshot Triggers

The journal records state changes to the sandbox built around the WASM process as it runs. For certain use cases, however, it may be important to take explicit snapshot restoration points in the journal at key moments.

When a snapshot is triggered all the running threads of the process are paused and
the state of the WASM memory and thread stacks are recorded into the journal so that
they can be restored.

To use the snapshot functionality, the WASM process must be compiled with the asyncify modifications. This can be done using the wasm-opt tool.

Note: If a process does not have the asyncify modifications, you can still use the journal functionality to record the file system and WASM memory state. However, the stacks of the threads will be omitted, meaning a restoration will restart the main thread.

Various triggers can cause a snapshot to be taken at a specific point in time. They are as follows:

On Idle

Triggered when all the threads in the process go into an idle state. This trigger is useful to take snapshots at convenient moments without causing unnecessary overhead.

For processes that have TTY/STDIN input this is particularly useful.

On FirstListen

Triggered when a listen syscall is invoked on a socket. This can be an important milestone to take a snapshot when one wants to speed up the boot time of a WASM process up to the moment where it is ready to accept requests.

On FirstStdin

Triggered when the process reads stdin for the first time. This can be useful to speed up the boot time of a WASM process.

On FirstEnviron

Triggered when the process reads an environment variable for the first time. This can be useful to speed up the boot time of a CGI WASM process which reads the environment variables to parse the request that it must execute.

On Timer Interval

Triggered periodically based on a timer (default 10 seconds), which can be specified using the journal-interval option. This can be useful for asynchronous replication of a WASM process from one machine to another with a known lag.

On Sigint (Ctrl+C)

Issued if the user sends an interrupt signal (Ctrl + C).

On Sigalrm

Triggered when the process receives the alarm clock signal, which is used for timers (see man alarm).

On Sigtstp

The SIGTSTP signal is sent to a process by its controlling terminal to request it to stop temporarily. It is commonly initiated by the user pressing Ctrl-Z.

On Sigstop

The SIGSTOP signal instructs the operating system to stop a process for later resumption.

On Non Deterministic Call

Triggered when the WASM process makes a non-deterministic call to the outside world (i.e. it reaches out of the sandbox).
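The trigger list above maps naturally onto an enum. The sketch below is an assumption for illustration only: the `Trigger` type, the `parse_triggers` helper and the string spellings such as `first-listen` are hypothetical and may not match the names the runtime or CLI actually accepts. Only the 10 second default interval comes from the description above.

```rust
use std::str::FromStr;
use std::time::Duration;

/// Default timer interval stated in the description (10 seconds).
pub const DEFAULT_INTERVAL: Duration = Duration::from_secs(10);

/// Hypothetical enum mirroring the triggers listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    Idle,
    FirstListen,
    FirstStdin,
    FirstEnviron,
    TimerInterval,
    Sigint,
    Sigalrm,
    Sigtstp,
    Sigstop,
    NonDeterministicCall,
}

impl FromStr for Trigger {
    type Err = String;

    /// Parses a trigger name. The spellings here are illustrative assumptions.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "idle" => Ok(Trigger::Idle),
            "first-listen" => Ok(Trigger::FirstListen),
            "first-stdin" => Ok(Trigger::FirstStdin),
            "first-environ" => Ok(Trigger::FirstEnviron),
            "timer-interval" => Ok(Trigger::TimerInterval),
            "sigint" => Ok(Trigger::Sigint),
            "sigalrm" => Ok(Trigger::Sigalrm),
            "sigtstp" => Ok(Trigger::Sigtstp),
            "sigstop" => Ok(Trigger::Sigstop),
            "non-deterministic-call" => Ok(Trigger::NonDeterministicCall),
            other => Err(format!("unknown snapshot trigger: {}", other)),
        }
    }
}

/// Parses a comma-separated trigger list, failing on the first unknown name.
pub fn parse_triggers(list: &str) -> Result<Vec<Trigger>, String> {
    list.split(',')
        .filter(|name| !name.trim().is_empty())
        .map(|name| name.parse::<Trigger>())
        .collect()
}
```

Rejecting unknown names up front, rather than silently ignoring them, keeps a typo from disabling a snapshot trigger without anyone noticing.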

Limitations

  • A WASM process that wishes to record the state of its threads must have had the asyncify post-processing step applied to the binary (see wasm-opt).
  • Taking a snapshot can consume large amounts of memory while it is processing.
  • Snapshots are not instant and have overhead when generating.
  • The layout of the memory must be known by the runtime in order to take snapshots.

Design

On startup if the restore journal file is specified then the runtime will restore the state of the WASM process by reading and processing the log entries in the snapshot journal. This restoration will bring the memory and the thread stacks back to a previous point in time and then resume all the threads.

When a trigger occurs, a new snapshot of the WASM process is taken and written to the journal through the following steps:

  1. Pause all threads
  2. Capture the stack of each thread
  3. Write the thread state to the journal
  4. Write the memory (excluding stacks) to the journal
  5. Resume execution.
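The five steps above can be sketched as follows. This is a single-threaded toy model under stated assumptions: `Process`, `Thread`, `Entry` and `take_snapshot` are hypothetical names, a thread is represented by a plain struct with a `paused` flag, and the journal is an in-memory `Vec`. The real runtime pauses actual threads and unwinds their stacks through asyncify.

```rust
/// Hypothetical journal entries written by a snapshot.
#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    ThreadState { id: u32, stack: Vec<u8> },
    Memory { data: Vec<u8> },
}

/// Toy stand-in for a running thread.
pub struct Thread {
    pub id: u32,
    pub stack: Vec<u8>,
    pub paused: bool,
}

/// Toy stand-in for a WASM process.
pub struct Process {
    pub threads: Vec<Thread>,
    pub memory: Vec<u8>,
}

/// Walks through the snapshot steps in the order listed above.
pub fn take_snapshot(process: &mut Process, journal: &mut Vec<Entry>) {
    // 1. Pause all threads.
    for thread in process.threads.iter_mut() {
        thread.paused = true;
    }

    // 2. Capture the stack of each thread.
    let stacks: Vec<(u32, Vec<u8>)> = process
        .threads
        .iter()
        .map(|thread| (thread.id, thread.stack.clone()))
        .collect();

    // 3. Write the thread state to the journal.
    for (id, stack) in stacks {
        journal.push(Entry::ThreadState { id, stack });
    }

    // 4. Write the memory (stacks are held separately in this model).
    journal.push(Entry::Memory { data: process.memory.clone() });

    // 5. Resume execution.
    for thread in process.threads.iter_mut() {
        thread.paused = false;
    }
}
```

Pausing before capturing matters: if threads kept running while memory was copied, the stacks and the memory written to the journal could describe two different moments in time.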

The implementation is currently able to save and restore the following:

  • WASM Memory
  • Stack memory
  • Call stack
  • Open sockets
  • Open files
  • Terminal text

Journal Capturer Implementations

Log File Journal

Writes the log events to a linear log file on the local file system as they are received by the trait. Log files can be concatenated together to make larger log files.
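One way a linear log can stay concatenable is to frame each event as a length-prefixed record, so that two logs joined byte for byte still parse as one longer log. The sketch below assumes such a framing; the 4-byte little-endian length prefix and the `append_record` and `read_records` helpers are illustrative assumptions, not the on-disk format used by this PR.

```rust
use std::io::{self, Read, Write};

/// Appends one record as a 4-byte little-endian length followed by the payload.
pub fn append_record<W: Write>(out: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "record too large"));
    }
    let len = payload.len() as u32;
    out.write_all(&len.to_le_bytes())?;
    out.write_all(payload)?;
    Ok(())
}

/// Reads records until the input ends.
/// A missing or incomplete length header is treated as the end of the log;
/// a payload that is cut short is reported as an error.
pub fn read_records<R: Read>(input: &mut R) -> io::Result<Vec<Vec<u8>>> {
    let mut records = Vec::new();
    loop {
        let mut len_buf = [0u8; 4];
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let mut payload = vec![0u8; u32::from_le_bytes(len_buf) as usize];
        input.read_exact(&mut payload)?;
        records.push(payload);
    }
    Ok(records)
}
```

Both helpers are generic over `Write` and `Read`, so the same code works against a `std::fs::File` opened in append mode or an in-memory buffer.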

Unsupported Journal

The default implementation does not support snapshots and will error out if an attempt is made to send it events. Using the unsupported capturer as a restoration point will restore nothing but will not error out.
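The behaviour described above (writes fail, restores are a no-op) could look like the sketch below. The `Journal` trait shown here is a simplified, hypothetical stand-in with string errors and a placeholder `Event` type; the trait in this PR has its own signatures.

```rust
/// Placeholder event type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Event(pub String);

/// Simplified, hypothetical journal interface.
pub trait Journal {
    /// Records an event.
    fn write(&mut self, event: Event) -> Result<(), String>;
    /// Returns the next event to restore, or `None` when there is nothing left.
    fn read(&mut self) -> Result<Option<Event>, String>;
}

/// Journal that rejects every write and restores nothing.
#[derive(Debug, Default)]
pub struct UnsupportedJournal;

impl Journal for UnsupportedJournal {
    fn write(&mut self, _event: Event) -> Result<(), String> {
        // Sending events to an unsupported journal is an error.
        Err("journaling is not supported".to_string())
    }

    fn read(&mut self) -> Result<Option<Event>, String> {
        // Restoring from it succeeds but yields no events.
        Ok(None)
    }
}
```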

Compacting Journal

Deduplicates memory and stacks to reduce the volume of log events sent to its inner capturer. Compaction occurs inline as the events are generated.
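As a rough illustration of the deduplication idea, the sketch below keeps only the most recent memory write per offset and the most recent stack per thread, and passes every other event through unchanged. It works on a finished batch of events for simplicity, whereas the description says the real capturer compacts inline. The `Event` and `Key` types and the `compact` function are hypothetical, and keying memory by offset alone ignores overlapping regions of different sizes.

```rust
use std::collections::HashMap;

/// Hypothetical events (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Memory { offset: u64, data: Vec<u8> },
    Stack { thread: u32, data: Vec<u8> },
    Other(String),
}

/// Identity used to decide whether a later event supersedes an earlier one.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Key {
    Memory(u64),
    Stack(u32),
}

fn key_of(event: &Event) -> Option<Key> {
    match event {
        Event::Memory { offset, .. } => Some(Key::Memory(*offset)),
        Event::Stack { thread, .. } => Some(Key::Stack(*thread)),
        Event::Other(_) => None,
    }
}

/// Drops superseded memory and stack events, preserving the order of the rest.
pub fn compact(events: &[Event]) -> Vec<Event> {
    // Lookup index: key -> position of the last event carrying that key.
    let mut last: HashMap<Key, usize> = HashMap::new();
    for (index, event) in events.iter().enumerate() {
        if let Some(key) = key_of(event) {
            last.insert(key, index);
        }
    }

    events
        .iter()
        .enumerate()
        .filter(|(index, event)| match key_of(event) {
            Some(key) => last.get(&key) == Some(index),
            None => true,
        })
        .map(|(_, event)| event.clone())
        .collect()
}
```

The lookup index grows with the number of distinct keys, which is one reason compaction needs memory that scales with the event count.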

Filtered Journal

Filters out a specific set of log events and drops the rest. This capturer can be useful for restoring to a previous call point while retaining the memory changes (e.g. the WCGI runner).
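A filtering capturer can be modelled as a wrapper that decides per event whether to forward it to an inner journal. The sketch below is one reading of the description, in which the configured event kinds are dropped and everything else is forwarded; the `Filter` flags, the `FilteredJournal` wrapper and the three event kinds are hypothetical and do not reflect the PR's actual API.

```rust
/// Hypothetical event kinds (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Memory(Vec<u8>),
    Stack(Vec<u8>),
    FileSystem(String),
}

/// Which kinds of events to drop. All flags default to `false` (keep everything).
#[derive(Debug, Default, Clone, Copy)]
pub struct Filter {
    pub drop_memory: bool,
    pub drop_stacks: bool,
    pub drop_file_system: bool,
}

impl Filter {
    /// Returns true when the event should be forwarded to the inner journal.
    pub fn allows(&self, event: &Event) -> bool {
        match event {
            Event::Memory(_) => !self.drop_memory,
            Event::Stack(_) => !self.drop_stacks,
            Event::FileSystem(_) => !self.drop_file_system,
        }
    }
}

/// Wrapper that forwards only the allowed events to an in-memory inner journal.
pub struct FilteredJournal {
    filter: Filter,
    inner: Vec<Event>,
}

impl FilteredJournal {
    pub fn new(filter: Filter) -> Self {
        FilteredJournal { filter, inner: Vec::new() }
    }

    /// Forwards the event if the filter allows it, otherwise discards it.
    pub fn write(&mut self, event: Event) {
        if self.filter.allows(&event) {
            self.inner.push(event);
        }
    }

    /// Events that reached the inner journal.
    pub fn events(&self) -> &[Event] {
        &self.inner
    }
}
```

Setting only `drop_stacks` corresponds to the use case mentioned above: memory changes are retained, but thread stacks are not, so a restore resumes from an earlier call point.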

MaratBR commented Dec 27, 2023

Taking a snapshot can consume large amounts of memory while its processing.

Quick question if you don't mind.
What amount is "large"? Assuming I have a WASM runtime that uses 10MB of memory how much memory taking a snapshot will take? I assume it will be something like this right?

total memory required  = (runtime virtual RAM + file system size) * coefficient

@john-sharratt
Contributor Author

Taking a snapshot can consume large amounts of memory while its processing.

Quick question if you don't mind.
What amount is "large"? Assuming I have a WASM runtime that uses 10MB of memory how much memory taking a snapshot will take? I assume it will be something like this right?

total memory required  = (runtime virtual RAM + file system size) * coefficient

For taking snapshots, it's the amount of active memory, which is equal to the size of the active memory buffer (the memory that grows, not the program itself) behind your wasm app plus the active thread stack size. For normal apps this will be reasonable (a few MB); however, for large databases this could be quite big with indexes.

When the logs are compacting, it will store a lookup index while it's removing duplicates, and it will write a new file to replace the old file. So you need memory proportional to the number of events, plus enough disk space for a copy of the log file before it deletes the old one.

Fairly reasonable constraints, I think, but ones you need to keep in mind.
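As a back-of-the-envelope version of the estimate above, assuming the snapshot cost is simply the active memory buffer plus the active thread stacks, and assuming the compacted log is never larger than the log it replaces (function names are made up for illustration):

```rust
/// Rough memory needed while taking a snapshot:
/// active linear memory plus the active stack of every thread.
pub fn snapshot_memory_estimate(active_memory_bytes: u64, thread_stack_bytes: &[u64]) -> u64 {
    thread_stack_bytes
        .iter()
        .fold(active_memory_bytes, |total, stack| total.saturating_add(*stack))
}

/// Upper bound on disk used during compaction: the old log plus a new copy
/// that is at most the same size, until the old file is deleted.
pub fn compaction_peak_disk_estimate(log_file_bytes: u64) -> u64 {
    log_file_bytes.saturating_mul(2)
}
```

For the 10 MB example in the question, a process with 10 MB of active memory and two threads with 1 MB stacks each would come out at about 12 MB under these assumptions. The file system size does not enter the snapshot estimate.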

@theduke theduke left a comment

This is way too much to reasonably review in detail, but as iterated before, the basics look good.


@syrusakbary syrusakbary left a comment

The API changes (to the wasmer crate) look reasonable to me.

I only have one thought: it would be great if we could separate the abstraction from the implementation, that is, keep the Journal base abstraction outside of WASIX. I think that will help us immensely in the future refactor.

theduke previously requested changes Jan 3, 2024
@john-sharratt john-sharratt requested a review from theduke January 4, 2024 00:18
@john-sharratt john-sharratt dismissed theduke’s stale review January 4, 2024 12:45

resolved outstanding points

@john-sharratt john-sharratt merged commit ad06d7e into master Jan 4, 2024
51 of 52 checks passed
@john-sharratt john-sharratt deleted the dcgi branch January 4, 2024 13:37