Add explicit flush to Durable Object storage interface #87

bcaimano · 2022-10-10T17:28:10Z

This PR provides a way for a worker to wait for its writes (including those performed with allowUnconfirmed) to be flushed to storage. Note that if there is no flush currently pending, the JSG method will provide a resolved promise and the current task will continue through its microtask queue instead of yielding to other tasks.

bcaimano · 2022-10-10T17:28:59Z

@a-robinson @jqmmes, please take a look!

jqmmes · 2022-10-10T17:39:13Z

LGTM!
Do you have any way to test this is working as expected?

bcaimano · 2022-10-10T18:02:31Z

Do you have any way to test this is working as expected?

Yep! I've used our internal Cloudflare test runner to verify the following:

A resolved flush promise runs before a parallel setTimeout() call without a delay specified (microtask vs task).
Cache operations that are not tied to the current task (parallel requests, allowUnconfirmed) can still wait for the current flush.
The operations involved in the flush (and cached operations that follow) are observed via the workerd::ActorObserver at the expected times.
In-flight flush promises resolve after DO brokenness events.

src/workerd/api/actor-state.c++

a-robinson · 2022-10-10T21:56:15Z

src/workerd/io/actor-cache.c++

+
+  if (flushesEnqueued > 0) {
+    // Some previous run had done operations that needed to be flushed and it is still ongoing. This
+    // implies that `allowConcurrency` was set to true for those operations since the input lock did


This isn't really true, is it? Puts having been buffered for flushing does not take the input lock. allowConcurrency only affects writes if backpressure kicks in, and even there the effect is pretty minimal.

Hmmmm, it's actually more about outstanding input gate-required operations being sequenced with the writes. I'll rewrite the comment to explicitly call out that the input gate is the important part.

src/workerd/io/actor-cache.c++

a-robinson · 2022-10-10T21:59:43Z

Yep! I've used our internal Cloudflare test runner to verify the following:

FWIW some testing of this in actor-cache-test.c++ using its capnp mock framework would have covered the basic logic a bit more simply. And might still be worth adding to keep up the great test coverage of actor cache? But I get that there's pretty good internal coverage and that you also wanted to test behavior in a real isolate from a worker's perspective.

kentonv · 2022-10-10T22:57:29Z

Hey, sorry to bikeshed, but: This serves a very similar purpose to fsync() on a file descriptor, right? Should we consider naming it sync()? I think that would give people the right mental model, whereas "flush" is a little more ambiguous.

kentonv · 2022-10-10T22:58:40Z

And might still be worth adding to keep up the great test coverage of actor cache?

I would definitely prefer that any new functionality in actor-cache.{h,c++} be covered by actor-cache-test.c++, rather than just in higher-level integration tests.

a-robinson · 2022-10-11T14:22:05Z

Hey, sorry to bikeshed, but: This serves a very similar purpose to fsync() on a file descriptor, right? Should we consider naming it sync()? I think that would give people the right mental model, whereas "flush" is a little more ambiguous.

Yeah, I'd been a little hesitant to go with sync, but that was back when there was still hope of coming up with something better. If it's a decision between sync() and flush(), then sync() does seem clearer.

bcaimano · 2022-10-11T15:56:04Z

Hey, sorry to bikeshed, but: This serves a very similar purpose to fsync() on a file descriptor, right? Should we consider naming it sync()? I think that would give people the right mental model, whereas "flush" is a little more ambiguous.

Yeah, I'd been a little hesitant to go with sync, but that was back when there was still hope of coming up with something better. If it's a decision between sync() and flush(), then sync() does seem clearer.

Glad this was brought up! I agree that sync() is the right model and I'd prefer it.

bcaimano · 2022-10-11T16:01:30Z

FWIW some testing of this in actor-cache-test.c++ using its capnp mock framework would have covered the basic logic a bit more simply. And might still be worth adding to keep up the great test coverage of actor cache? But I get that there's pretty good internal coverage and that you also wanted to test behavior in a real isolate from a worker's perspective.

I would definitely prefer that any new functionality in actor-cache.{h,c++} be covered by actor-cache-test.c++, rather than just in higher-level integrat

Yeah, my primary focus was testing the DurableObjectState interface since the actor cache change was so simple. The good news is that I think it's a pretty simple unit test to write because of that, happy to do so!

a-robinson · 2022-10-11T20:12:47Z

src/workerd/io/actor-cache.c++

@@ -208,8 +208,8 @@ void ActorCache::evictOrOomIfNeeded(Lock& lock) {

    // We want to break the OutputGate. We can't quite just do `gate.lockWhile(exception)` because
    // that returns a promise which we'd then have to put somewhere so that we don't immediately
-    // cancel it. Instead, we can ensure that a flush has been scheduled. `flush()`, when called,


Is all this renaming still useful if the public method is now called sync()?

Hmmmm, only ever so mildly. We had some functions named flushImpl*() and some named flush*() and it would be nice to unify them. I can understand if you'd like me to drop that commit now.

Nah, it's fine.

a-robinson · 2022-10-11T20:19:18Z

src/workerd/io/actor-cache.c++

+  }
+
+  if (flushScheduled) {
+    // Our current run has done operations that need to be flushed, we need to wait for that flush


FWIW I don't think this comment is strictly accurate either -- it isn't necessarily our current run that set flushScheduled to true. Some other request could have done that, and the field just hasn't been set back to false yet because there was some previous flush that was still in progress when the other request scheduled an additional flush. In that case the previous flush would be keeping lastFlush from resolving and running the callback that would set flushScheduled to false for the new flush.
.
Given how hard it is to explain, I'm not sure this level of detail in the comments to distinguish between flushScheduled and flushEnqueued is all that useful when someone can look up above at how they're used. I'm not even totally sure how to explain it at this point, but I think my best attempt would be to rename flushesEnqueued to flushInProgress or something like that?

But you can keep these comments around if you want.

Alright, I get what you're saying. Yeah, I'll trim the comment down to the bare minimum to explain the difference between the two conditions.

a-robinson · 2022-10-11T20:23:36Z

src/workerd/io/actor-cache-test.c++

@@ -4856,5 +4857,210 @@ KJ_TEST("ActorCache alarm delete when flush fails") {
  }
 }

+KJ_TEST("ActorCache can wait for flush") {


It looks like this doesn't cover the case where one operations schedules a flush, then the flush starts, and then another operation schedules a second flush while the first flush is still in progress. It'd be worth confirming that:

an onNoPendingFlush() from before the second operation resolves when the first flush is resolved

an onNoPendingFlush() from after the second operation schedules a flush but before the first flush completes does not resolve until after the second flush completes.

That makes sense to me as an additional thing to test, thanks! The set of combinations for this is starting to feel a bit nebulous, so I'll just use a put as the second flushable.

We're adding a method to the JSG interface called `flush()`, let's make it clear that these functions are not the same thing.

This commit provides a way for a worker to wait for its writes (including those performed with allowUnconfirmed) to be synchronized with storage. Note that if there is no flush currently pending, the JSG method will provide a resolved promise and the current task will continue through its microtask queue instead of yielding to other tasks.

a-robinson

Nice test coverage!

a-robinson · 2022-10-14T02:29:44Z

src/workerd/io/actor-cache.c++

@@ -2781,6 +2812,9 @@ kj::Promise<void> ActorCache::flushImplDeleteAll(uint retryCount) {

    // Now we must flush any writes that happened after the deleteAll(). (If there are none, this
    // will complete quickly.)
+    // TODO(soon) This will use the write options for the deleteAll() even if the options for future


Interesting, what makes this flushImplDeleteAll() case different from any other case where allowUnconfirmed is set on one write, but not the next, and then they're flushed together? I thought that was all handled in ensureFlushScheduled() and I'm not seeing what would make this case different?

The big difference is that this flush recursion is in the same promise chain as the original output gate wrapper. Imagine you have two sets of operations in different js tasks, normally they get separate flush promises with their own output gate locks. With this follow on flushImpl(), we still have two separate promises/flushes but the we handle the work for the second one in the first one. The second flush noops instead. I think this also happens when we have to retry failures?

I think it mostly matters when the deleteAll isn't allowUnconfirmed but the following operations are. In that case, we'll hold the output lock longer than the user requested. The reverse doesn't really matter because we're still holding the output lock as long as promised (assuming the user actually understood what was going on).

I suppose, yeah. To clarify for my own sake -- it's already the case that if you do a put or normal delete without allowUnconfirmed followed by another one with allowUnconfirmed before the flush has started that the output gate will apply to both. The difference is just that those operations are flushed atomically with each other whereas a deleteAll happens strictly before any follow-up operations.

So I think you're right that this is pessimistic but it seems like a pretty minor case given how infrequently used deleteAll is and that it requires mixing in allowUnconfirmed on only some operations.

bcaimano requested review from kentonv, jasnell, harrishancock, mikea, jclee, a-robinson, bretthoerner and byule as code owners October 10, 2022 17:28

jasnell reviewed Oct 10, 2022

View reviewed changes

src/workerd/api/actor-state.c++ Outdated Show resolved Hide resolved

bcaimano force-pushed the bcaimano/actor-cache-flush branch from 0d5e735 to c0ed164 Compare October 10, 2022 18:29

jasnell reviewed Oct 10, 2022

View reviewed changes

src/workerd/api/actor-state.c++ Outdated Show resolved Hide resolved

bcaimano force-pushed the bcaimano/actor-cache-flush branch from c0ed164 to a70be87 Compare October 10, 2022 19:02

jasnell approved these changes Oct 10, 2022

View reviewed changes

a-robinson approved these changes Oct 10, 2022

View reviewed changes

bcaimano force-pushed the bcaimano/actor-cache-flush branch from a70be87 to 33e6054 Compare October 11, 2022 19:00

a-robinson reviewed Oct 11, 2022

View reviewed changes

bcaimano added 2 commits October 13, 2022 13:33

Refactor: Rename flush[DeleteAll]() to flushImpl[DeleteAll]()

7645aaa

We're adding a method to the JSG interface called `flush()`, let's make it clear that these functions are not the same thing.

bcaimano force-pushed the bcaimano/actor-cache-flush branch from 33e6054 to 7fa31ab Compare October 13, 2022 22:33

a-robinson approved these changes Oct 14, 2022

View reviewed changes

a-robinson merged commit 67661b8 into cloudflare:main Oct 14, 2022

bcaimano deleted the bcaimano/actor-cache-flush branch October 14, 2022 15:19

mrbbot added a commit to cloudflare/miniflare that referenced this pull request Jan 9, 2023

Implement DurableObject#sync() (cloudflare/workerd#87)

8c5ffac

mrbbot mentioned this pull request Jan 9, 2023

Implement DurableObjectStorage#sync() cloudflare/miniflare#467

Merged

mrbbot added a commit to cloudflare/miniflare that referenced this pull request Jan 9, 2023

Implement DurableObjectStorage#sync() (cloudflare/workerd#87)

59dbd32

mrbbot added a commit to cloudflare/miniflare that referenced this pull request Jan 9, 2023

Implement DurableObjectStorage#sync() (cloudflare/workerd#87) (#467)

1cd651c

ankurdotb mentioned this pull request Feb 21, 2023

[Snyk] Security upgrade miniflare from 2.11.0 to 2.12.1 cheqd/studio#170

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add explicit flush to Durable Object storage interface #87

Add explicit flush to Durable Object storage interface #87

bcaimano commented Oct 10, 2022

bcaimano commented Oct 10, 2022

jqmmes commented Oct 10, 2022

bcaimano commented Oct 10, 2022 •

edited

Loading

a-robinson Oct 10, 2022

bcaimano Oct 11, 2022

a-robinson commented Oct 10, 2022

kentonv commented Oct 10, 2022

kentonv commented Oct 10, 2022

a-robinson commented Oct 11, 2022

bcaimano commented Oct 11, 2022

bcaimano commented Oct 11, 2022

a-robinson Oct 11, 2022

bcaimano Oct 11, 2022

a-robinson Oct 11, 2022

a-robinson Oct 11, 2022

bcaimano Oct 11, 2022

a-robinson Oct 11, 2022

bcaimano Oct 11, 2022

a-robinson left a comment

a-robinson Oct 14, 2022

bcaimano Oct 14, 2022

a-robinson Oct 14, 2022

Add explicit flush to Durable Object storage interface #87

Add explicit flush to Durable Object storage interface #87

Conversation

bcaimano commented Oct 10, 2022

bcaimano commented Oct 10, 2022

jqmmes commented Oct 10, 2022

bcaimano commented Oct 10, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

a-robinson commented Oct 10, 2022

kentonv commented Oct 10, 2022

kentonv commented Oct 10, 2022

a-robinson commented Oct 11, 2022

bcaimano commented Oct 11, 2022

bcaimano commented Oct 11, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

a-robinson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bcaimano commented Oct 10, 2022 •

edited

Loading