Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add explicit flush to Durable Object storage interface #87

Merged
merged 2 commits into from
Oct 14, 2022

Conversation

bcaimano
Copy link
Contributor

This PR provides a way for a worker to wait for its writes (including those performed with allowUnconfirmed) to be flushed to storage. Note that if there is no flush currently pending, the JSG method will provide a resolved promise and the current task will continue through its microtask queue instead of yielding to other tasks.

@bcaimano
Copy link
Contributor Author

@a-robinson @jqmmes, please take a look!

@jqmmes
Copy link
Contributor

jqmmes commented Oct 10, 2022

LGTM!
Do you have any way to test this is working as expected?

@bcaimano
Copy link
Contributor Author

bcaimano commented Oct 10, 2022

Do you have any way to test this is working as expected?

Yep! I've used our internal Cloudflare test runner to verify the following:

  • A resolved flush promise runs before a parallel setTimeout() call without a delay specified (microtask vs task).
  • Cache operations that are not tied to the current task (parallel requests, allowUnconfirmed) can still wait for the current flush.
  • The operations involved in the flush (and cached operations that follow) are observed via the workerd::ActorObserver at the expected times.
  • In-flight flush promises resolve after DO brokenness events.

@bcaimano bcaimano force-pushed the bcaimano/actor-cache-flush branch from 0d5e735 to c0ed164 Compare October 10, 2022 18:29
@bcaimano bcaimano force-pushed the bcaimano/actor-cache-flush branch from c0ed164 to a70be87 Compare October 10, 2022 19:02

if (flushesEnqueued > 0) {
// Some previous run had done operations that needed to be flushed and it is still ongoing. This
// implies that `allowConcurrency` was set to true for those operations since the input lock did
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really true, is it? Puts having been buffered for flushing does not take the input lock. allowConcurrency only affects writes if backpressure kicks in, and even there the effect is pretty minimal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, it's actually more about outstanding input gate-required operations being sequenced with the writes. I'll rewrite the comment to explicitly call out that the input gate is the important part.

@a-robinson
Copy link
Member

Yep! I've used our internal Cloudflare test runner to verify the following:

FWIW some testing of this in actor-cache-test.c++ using its capnp mock framework would have covered the basic logic a bit more simply. And might still be worth adding to keep up the great test coverage of actor cache? But I get that there's pretty good internal coverage and that you also wanted to test behavior in a real isolate from a worker's perspective.

@kentonv
Copy link
Member

kentonv commented Oct 10, 2022

Hey, sorry to bikeshed, but: This serves a very similar purpose to fsync() on a file descriptor, right? Should we consider naming it sync()? I think that would give people the right mental model, whereas "flush" is a little more ambiguous.

@kentonv
Copy link
Member

kentonv commented Oct 10, 2022

And might still be worth adding to keep up the great test coverage of actor cache?

I would definitely prefer that any new functionality in actor-cache.{h,c++} be covered by actor-cache-test.c++, rather than just in higher-level integration tests.

@a-robinson
Copy link
Member

Hey, sorry to bikeshed, but: This serves a very similar purpose to fsync() on a file descriptor, right? Should we consider naming it sync()? I think that would give people the right mental model, whereas "flush" is a little more ambiguous.

Yeah, I'd been a little hesitant to go with sync, but that was back when there was still hope of coming up with something better. If it's a decision between sync() and flush(), then sync() does seem clearer.

@bcaimano
Copy link
Contributor Author

Hey, sorry to bikeshed, but: This serves a very similar purpose to fsync() on a file descriptor, right? Should we consider naming it sync()? I think that would give people the right mental model, whereas "flush" is a little more ambiguous.

Yeah, I'd been a little hesitant to go with sync, but that was back when there was still hope of coming up with something better. If it's a decision between sync() and flush(), then sync() does seem clearer.

Glad this was brought up! I agree that sync() is the right model and I'd prefer it.

@bcaimano
Copy link
Contributor Author

FWIW some testing of this in actor-cache-test.c++ using its capnp mock framework would have covered the basic logic a bit more simply. And might still be worth adding to keep up the great test coverage of actor cache? But I get that there's pretty good internal coverage and that you also wanted to test behavior in a real isolate from a worker's perspective.

I would definitely prefer that any new functionality in actor-cache.{h,c++} be covered by actor-cache-test.c++, rather than just in higher-level integrat

Yeah, my primary focus was testing the DurableObjectState interface since the actor cache change was so simple. The good news is that I think it's a pretty simple unit test to write because of that, happy to do so!

@bcaimano bcaimano force-pushed the bcaimano/actor-cache-flush branch from a70be87 to 33e6054 Compare October 11, 2022 19:00
@@ -208,8 +208,8 @@ void ActorCache::evictOrOomIfNeeded(Lock& lock) {

// We want to break the OutputGate. We can't quite just do `gate.lockWhile(exception)` because
// that returns a promise which we'd then have to put somewhere so that we don't immediately
// cancel it. Instead, we can ensure that a flush has been scheduled. `flush()`, when called,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is all this renaming still useful if the public method is now called sync()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, only ever so mildly. We had some functions named flushImpl*() and some named flush*() and it would be nice to unify them. I can understand if you'd like me to drop that commit now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, it's fine.

}

if (flushScheduled) {
// Our current run has done operations that need to be flushed, we need to wait for that flush
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I don't think this comment is strictly accurate either -- it isn't necessarily our current run that set flushScheduled to true. Some other request could have done that, and the field just hasn't been set back to false yet because there was some previous flush that was still in progress when the other request scheduled an additional flush. In that case the previous flush would be keeping lastFlush from resolving and running the callback that would set flushScheduled to false for the new flush.
.
Given how hard it is to explain, I'm not sure this level of detail in the comments to distinguish between flushScheduled and flushEnqueued is all that useful when someone can look up above at how they're used. I'm not even totally sure how to explain it at this point, but I think my best attempt would be to rename flushesEnqueued to flushInProgress or something like that?

But you can keep these comments around if you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I get what you're saying. Yeah, I'll trim the comment down to the bare minimum to explain the difference between the two conditions.

@@ -4856,5 +4857,210 @@ KJ_TEST("ActorCache alarm delete when flush fails") {
}
}

KJ_TEST("ActorCache can wait for flush") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this doesn't cover the case where one operations schedules a flush, then the flush starts, and then another operation schedules a second flush while the first flush is still in progress. It'd be worth confirming that:

  1. an onNoPendingFlush() from before the second operation resolves when the first flush is resolved
  2. an onNoPendingFlush() from after the second operation schedules a flush but before the first flush completes does not resolve until after the second flush completes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me as an additional thing to test, thanks! The set of combinations for this is starting to feel a bit nebulous, so I'll just use a put as the second flushable.

We're adding a method to the JSG interface called `flush()`, let's make it clear that these functions are not the same thing.
This commit provides a way for a worker to wait for its writes (including those performed with allowUnconfirmed) to be synchronized with storage. Note that if there is no flush currently pending, the JSG method will provide a resolved promise and the current task will continue through its microtask queue instead of yielding to other tasks.
@bcaimano bcaimano force-pushed the bcaimano/actor-cache-flush branch from 33e6054 to 7fa31ab Compare October 13, 2022 22:33
Copy link
Member

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test coverage!

@@ -2781,6 +2812,9 @@ kj::Promise<void> ActorCache::flushImplDeleteAll(uint retryCount) {

// Now we must flush any writes that happened after the deleteAll(). (If there are none, this
// will complete quickly.)
// TODO(soon) This will use the write options for the deleteAll() even if the options for future
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, what makes this flushImplDeleteAll() case different from any other case where allowUnconfirmed is set on one write, but not the next, and then they're flushed together? I thought that was all handled in ensureFlushScheduled() and I'm not seeing what would make this case different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big difference is that this flush recursion is in the same promise chain as the original output gate wrapper. Imagine you have two sets of operations in different js tasks, normally they get separate flush promises with their own output gate locks. With this follow on flushImpl(), we still have two separate promises/flushes but the we handle the work for the second one in the first one. The second flush noops instead. I think this also happens when we have to retry failures?

I think it mostly matters when the deleteAll isn't allowUnconfirmed but the following operations are. In that case, we'll hold the output lock longer than the user requested. The reverse doesn't really matter because we're still holding the output lock as long as promised (assuming the user actually understood what was going on).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose, yeah. To clarify for my own sake -- it's already the case that if you do a put or normal delete without allowUnconfirmed followed by another one with allowUnconfirmed before the flush has started that the output gate will apply to both. The difference is just that those operations are flushed atomically with each other whereas a deleteAll happens strictly before any follow-up operations.

So I think you're right that this is pessimistic but it seems like a pretty minor case given how infrequently used deleteAll is and that it requires mixing in allowUnconfirmed on only some operations.

@a-robinson a-robinson merged commit 67661b8 into cloudflare:main Oct 14, 2022
@bcaimano bcaimano deleted the bcaimano/actor-cache-flush branch October 14, 2022 15:19
mrbbot added a commit to cloudflare/miniflare that referenced this pull request Jan 9, 2023
mrbbot added a commit to cloudflare/miniflare that referenced this pull request Jan 9, 2023
mrbbot added a commit to cloudflare/miniflare that referenced this pull request Jan 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants