This repository has been archived by the owner on Oct 7, 2020. It is now read-only.

Promise microtask scheduling can create starvation #25

Closed
getify opened this issue May 31, 2018 · 34 comments

Comments

@getify
Contributor

getify commented May 31, 2018

Not sure this is entirely relevant to bring up here, but I figured I'd throw it out for consideration nonetheless.

One major issue I have with promises is that the microtask queue is specified in a naive enough way that, at least in current observation of browsers, it's far too easy to create a starvation situation when we talk about scheduling between multiple promises.

The use-case where this bit me was in trying to use async..await functions to implement the ideas of CSP (communicating sequential processes), the model for concurrency used by default in Go and ClojureScript, notably.

CSP has traditionally been modeled in JS through the use of generators (using the sync-async pattern of yielding promises, resuming on resolution, etc). When async..await came out, several CSP libraries moved to them from generators. Unfortunately, this exposes CSP code to this dangerous starvation problem of promise microtasks.

The idea behind CSP is that you have two or more "processes" (think functions) that are running concurrently; they are mostly independent, but they "block" whenever trying to send a synchronous message to another process, or receive one from another. The blocking is done locally via a yield or await, which is on a promise that's used to represent the completed message. Whenever both reader and writer of a message are complete, the promises on both ends resolve, and the processes are unblocked.

The problem here is that it's incredibly easy (either for malicious reasons or just by mistake) for one of these processes to get into an "infinite asynchronous" loop where they simply keep resolving a new promise, which adds a new microtask to the queue, and this ends up starving out any other promises that are waiting to fire a microtask to resolve.

Here's a code snippet that illustrates the starvation issue:

var done = false;
var counter = 0;

function makePromise() {
   if (counter < 5) {
      counter++;
      return new Promise(function c(resolve){
         setTimeout(resolve,0);
      });
   }
   else return Promise.resolve();
}

async function a() {
   let i = 0;
   while (!done) {
      await makePromise();
      console.log(`a: ${i++}`);
      if (i > 200) done = true;
   }
}

async function b() {
   let i = 0;
   while (!done) {
      await makePromise();
      console.log(`b: ${i++}`);
      if (i > 200) done = true;
   }
}

a();
b();

If you run that in, say, Chrome, you'll probably see something like:

a: 0
b: 0
a: 1
b: 1
b: 2
b: 3
b: 4
b: 5
b: 6
...
b: 198
b: 199
b: 200
a: 2

As you can see in that sort of output, at first a() and b() trade off nicely, but as soon as b() gets into an immediate microtask loop, it starves out a() entirely, preventing it from ever printing its a: 2 message until b() eventually quits after 200.

Starvation is a particularly "deadly" problem for CSP, as several of its semantics make that situation much more likely than such contrived code as my example here. A CSP library needs to guard carefully against this.

If you use async..await, you can't really guard against it, since you have no control over the inter-promise resolution scheduling. If you use a generator-runner for the CSP, then you have the ability to mitigate the chances of starvation using a simple round-robin (or other more sophisticated algorithm) for scheduling, making sure that you only give each "process" one chance to resolve each turn, spreading the scheduling around so starvation doesn't occur.
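A minimal sketch of such a round-robin runner (not asynquence's actual implementation; the names here are mine): each generator-based "process" gets exactly one resumption per turn, so a process whose promises resolve immediately still can't monopolize the scheduler. Note this only guarantees fairness among the registered processes; it does not by itself yield to the macrotask queue.

```javascript
function roundRobinRun(...makeProcs) {
  const procs = makeProcs.map(make => ({ it: make(), done: false, nextArg: undefined }));
  return new Promise(finish => {
    function turn() {
      const live = procs.filter(p => !p.done);
      if (live.length === 0) return finish();
      // Each live process advances exactly one step per turn, in order,
      // so no single process can monopolize the scheduling.
      Promise.all(live.map(p => {
        const step = p.it.next(p.nextArg);
        if (step.done) { p.done = true; return; }
        // wait for this process's yielded promise before its next step
        return Promise.resolve(step.value).then(v => { p.nextArg = v; });
      })).then(turn);
    }
    turn();
  });
}

// Example: two "processes" that would otherwise race through immediate
// resolutions still alternate one step per turn:
function* ping() {
  for (let i = 0; i < 3; i++) {
    console.log(`ping ${i}`);
    yield Promise.resolve(); // an immediately-ready "message"
  }
}
function* pong() {
  for (let i = 0; i < 3; i++) {
    console.log(`pong ${i}`);
    yield Promise.resolve();
  }
}
roundRobinRun(ping, pong); // ping 0, pong 0, ping 1, pong 1, ping 2, pong 2
```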

I believe this betrays a fundamental mis-design in promises, in that they didn't specify that the microtask queue has to prevent starvation of the sort I described. They should have ensured a minimum of round-robin scheduling, or even random scheduling, or weighted scheduling, or something... but not simply a naive queue.


I bring this issue up because I saw other threads talking about use-cases for hooking into the promise resolution machinery, and it reminded me of this problem. It's perhaps related in that if there was some "hook" provided, better starvation-free scheduling could be ensured if, for example, you know you'll be doing starvation-likely things like CSP.

@ljharb
Member

ljharb commented Jun 1, 2018

That code snippet is quite illustrative, but doesn’t seem to represent a real-world use case where it would be “easy” to get starvation. Could you elaborate on specifically how this occurs in practice?

@benjamingr
Member

CC @MayaLekova @bmeurer - this came in after the discussion and is worth looking into.

@bmeurer
Member

bmeurer commented Jun 1, 2018

cc @hashseed

@getify
Contributor Author

getify commented Jun 1, 2018

I'll come back later with some more code, but I just wanted to point out that the code above only used a setTimeout to illustrate a mix of deferred and immediate asynchrony... ostensibly, that specific code behaves that way because microtasks always cut in line in front of waiting ticks (like a timer). That's one kind of starvation.

But there's another flavor of that starvation where it's just two different promise loops (no timers), and you still don't see a balanced a / b / a / b .. alternation between resolutions the way one might expect from fairer round-robin scheduling.

That idea seems contrived in JS code alone, but it's entirely common and practical when you're in the realm of CSP. For one, you usually make a "process" stay alive "forever" by doing a while..true loop. So, it's not at all uncommon for you to have dozens of these functions that are all running while..true loops concurrently, where each one pauses somewhere (at least once) in each iteration via a yield / await. That usually happens when you're blocking on a put or a take of a message on a channel.

The thing is, design-wise, as you're building each CSP process in your program, you're designing them largely to run "independently", and only to collaborate from time to time when a message needs to be passed. That "independently" part only works if you can assume that each process gets fair/equitable chance for its messages to be sent/received by the rest of the system. In other words, whether you're conscious of it or not, you're expecting a sort of round-robin effect on the resolutions bouncing around the different processes if they happen to stack up at the same moment.

The system is much weaker if there's any chance that one set of channel messaging between two processes in the system can end up starving out all the other independent processes. You really don't want that to happen, and you kind of rely on the system not to let it happen.

One of the things about blocking messaging semantics (like that of CSP) is that you get automatic back-pressure. So it's common to design a producer of some message to just pump out its messages as quickly as possible (think of like a stock ticker), but it's throttled to only generate exactly as many messages as its consumer(s) take. So, the producer might have its message ready immediately on each turn of its while..true loop, but not really care if someone consumes as quickly, or only intermittently. Because both a put and take are blocking, the producing is automatically throttled via that back-pressure.

The point I'm trying to make is, you sort of intentionally design a piece of your system that could push out an "infinite" and "immediate" stream of messages (aka, immediately resolved promises), but you're OK that this doesn't cripple the whole system because you know that this isn't exhausting all the scheduling and starving out other processes' messages from being exchanged. At least, that's what you hope.

Here's a strawman of such (susceptible) code:

var coords = channel();
var userClicks = channel();

async function generateCoordinates() {
   while (true) {
      let x = Math.trunc(Math.random() * 1000);
      let y = Math.trunc(Math.random() * 1000);
      await CSP.put( coords, { x, y } );
   }
}

async function generateShape() {
   while (true) {
      await CSP.take( userClicks );
      let { x, y } = await CSP.take( coords );
      drawCircle( x, y );
   }
}

Here, the generateShape() process is throttled by messages on the userClicks channel, so it wouldn't seem like there was a danger of hijacking the program and causing starvation of scheduling. But all that has to happen is for userClicks to accidentally (or maliciously) get flooded with an unending stream of immediately ready messages, and now generateShape() can exhaust the scheduling.

This is dangerous. And so CSP relies on a system that makes sure that such schedule starvation cannot occur.


To this point, I maintain an async-flow-control library called asynquence, which has CSP support in it. I implemented CSP on generators initially, because I was able to make sure that each promise for each message goes into a round-robin queue, and that each "tick" internally (not JS ticks, but just units of work in the runner) takes one process off the queue, processes its message, then sticks it at the end of the line, so that it can't hijack the scheduling. This was a lot of work to get right, and it's very delicate/intricate code. So I have a whole bunch of crazy edge test cases around it, so I don't later mess something up.

Anyway, fast forward a few years, and async..await comes out, and I decided to play with switching to them. I was quite dismayed when like 1/3 of my test cases failed. I spent quite a while thinking I was doing something else wrong, and then discovered it was because of this starvation thing. So I gave up on async..await for CSP.

@MayaLekova

cc @littledan @domenic Maybe this might be of interest to you as well.

@littledan

Would it be feasible to yield to the event loop occasionally within the implementation of the CSP.put or CSP.take methods?

@getify
Contributor Author

getify commented Jun 25, 2018

@littledan The CSP.put(..) and CSP.take(..) methods basically just return a promise that's wired to resolve once the message transfers through the channel. I'm not sure if it would be possible to do as you're suggesting, but possibly -- it'd basically be that whenever a message is transferred, instead of calling resolve(..) right away on each promise, it might do something like setTimeout(resolve,0). I don't think it would want to do that every time (would hurt perf), so then it'd require inventing some mechanism to track a certain "depth" of microtask recursion (say 5 deep) and bailing out of it with the hack.

What that sort of approach boils down to is the primary complaint I've made here, which is that promises weren't specified with some sort of built-in behavior to prevent this starvation, so user-land has to bend over backwards in hacks.
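A hedged sketch of that bailout hack, assuming a hypothetical fairResolve helper that a channel implementation might call in place of resolve(..) directly (MAX_DEPTH and fairResolve are invented names, not a real API):

```javascript
const MAX_DEPTH = 5;
let depth = 0;

// Called wherever the channel would normally call resolve(..) directly.
function fairResolve(resolve, value) {
  if (depth < MAX_DEPTH) {
    depth++;
    resolve(value);                       // stay on the microtask queue
  } else {
    depth = 0;                            // reset, and yield to the event loop
    setTimeout(() => resolve(value), 0);
  }
}
```

The counter tracks consecutive immediate resolutions; every MAX_DEPTH-th one is pushed out to a macrotask so the rest of the event loop gets a turn, at the perf cost getify mentions.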

@littledan

I'm not sure what to do about it at this point. Now, Promises give you a sort of "atomicity" guarantee that they won't yield to the outer event loop until they are all done. I believe atomicity was an explicit design goal, for the web, which had to do with avoiding multiple renders of intermediate states.

@getify
Contributor Author

getify commented Jun 25, 2018

I'm not sure what to do about it at this point.

We obviously can't change the poor design decision now.

The purpose of this "issue" is to request an extension hook (from node) of some sort, that would allow, for example, for code to be able to configure a trigger for a bailout of the microtask queue to prevent starvation. For example, this hook could fire when one microtask recurses to scheduling another microtask, and the hook could interrupt that default behavior and instead force a yield to the event loop, or could forcibly interleave other microtasks, or something like that.

@getify
Contributor Author

getify commented Jun 25, 2018

I believe atomicity was an explicit design goal, for the web

I'm sure it was on purpose, but that doesn't make it not a bad decision on the whole. Consider how the "design" of setTimeout is not for there to be any clamping, but browsers had to step in and create clamping to prevent run-away timer recursion. Or think about the arbitrary (external to the spec) limits put on call-stack depth to prevent running out of memory. Both are cases where the environment steps in and tempers otherwise unintended "design" side effects.

I see this as a similar issue, one which I'm hoping that Node's exploration of a hook might end up informing future environment behavior, such as them automatically restricting microtask recursion depth to some limit.

@littledan

Would the ability to monkey-patch Promise.prototype.then and have await always call out to that be enough to achieve this goal?

@ljharb
Member

ljharb commented Jun 26, 2018

It was a very intentional design decision to have await not be forgeable; that’s pretty important to never change.

@benjamingr
Member

It was a very intentional design decision to have await not be forgeable;

Can you elaborate on what you mean by that?

@ljharb
Member

ljharb commented Jun 26, 2018

If await dynamically calls out to Promise.prototype.then, then it’s not safe to use, because any code (even halfway through a suspended async function) could replace that method and hijack what the async function is doing.

@benjamingr
Member

benjamingr commented Jun 26, 2018

If await dynamically calls out to Promise.prototype.then, then it’s not safe to use, because any code (even halfway through a suspended async function) could replace that method and hijack what the async function is doing.

Thanks.

Just to clarify - by "safe" you mean "would act in a way that developers find less surprising" and not "object capability theory safe" right?

@ljharb
Member

ljharb commented Jun 26, 2018

I think i mean both - i mean that one can confidently use it even when running untrusted code.

@getify
Contributor Author

getify commented Jun 26, 2018

I'm not asking for the ability to completely hijack/override these mechanisms as much as I am asking for the ability to defer scheduling of a microtask until the next tick so that another promise's scheduling isn't starved.

So for example, a hook could be fired before scheduling a microtask, and if true is returned, it's scheduled right away, but if false is returned, it's deferred until the next tick. Or something like that.

I think this is a much narrower ask and is far less unsafe than overriding then() or being able to force an async function to return userland promise subclasses, or anything like that.
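No such hook exists in Node or V8 today; the following userland sketch only simulates the requested shape. The scheduleMicrotask wrapper and the budget policy are invented names for illustration:

```javascript
// `allowImmediate` stands in for the proposed hook: return true to run the
// task as a microtask now, false to defer it to the next event-loop tick.
let count = 0;
const BUDGET = 100;
const allowImmediate = () => count++ < BUDGET; // example policy: N per tick

function scheduleMicrotask(fn) {
  if (allowImmediate()) {
    Promise.resolve().then(fn);                 // normal microtask
  } else {
    setTimeout(() => { count = 0; fn(); }, 0);  // bail out to the event loop
  }
}
```

A real hook would sit inside the engine's microtask machinery rather than wrapping it like this, but the true/false contract is the same as described above.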

@MayaLekova

So for example, a hook could be fired before scheduling a microtask, and if true is returned, it's scheduled right away, but if false is returned, it's deferred until the next tick.

I can't comment on the safety goal of await (although intuitively it seems to me it should be taken into account), but such a capability will induce some performance overhead, given that the "hook" is not only informative, but its outcome will modify the flow of the scheduling. Moreover, sounds like somebody can get into further trouble by incorrectly implementing such a hook.

I agree that starvation is a problem in the example you presented, but it seems to me the predictability and "atomicity" of the microtask scheduling might outweigh this problem. I would love to hear/read some further elaboration on the motivation of the current scheduling mechanism.

@getify
Contributor Author

getify commented Jun 26, 2018

but such a capability will induce some performance overhead, given that the "hook" is not only informative

I would hope that would only be the case if the hook is registered; if it's absent (the default), it wouldn't have any effect on performance (except for maybe an if-statement).

And if it was registered, it would be intentionally so, and for the greater purpose of ensuring balanced scheduling, even at the slight cost of the hook. I would imagine that this sort of thing would be in experimental/POC code designed to explore whether JS's promise scheduling algorithms should be adjusted. Such a hook would provide us the ability to explore the different use cases around scheduling/starvation, and compare/contrast the tradeoffs, rather than guessing at them from only the current perspective.

sounds like somebody can get into further trouble by incorrectly implementing such a hook.

Every single "non-standard" hook being contemplated in this repository is in that same category. You can totally shoot yourself in the foot if you do bad things. Since this is Node-only and not proposed for the greater ecosystem, we're drastically reducing the surface area of potential mistakes. Also, these hooks would really only be of interest to someone in the research mindset described above. I'd wager 99.9999% of JS devs have never even contemplated scheduling starvation.

However, if you zoom out to the broader world of concurrency management in programming language design, starvation is a very important first class topic. There's tons of writing (academic and practical) on the subject.

I think it's to JS's detriment, and Node's, especially in light of efforts to move Node and JS toward multi-threading, that we would ignore such an important topic because it hasn't impacted much of the ecosystem yet.

What I'm asking for is the ability to pursue research along those lines. I think broadly speaking, that's in line with the spirit of this repository's asks.

@benjamingr
Member

@ljharb

I think i mean both - i mean that one can confidently use it even when running untrusted code.

If that was a concern with TC39 (and knowing @erights cares about this that's likely) - then I don't think we should allow code to opt other code into it. We could provide a MicrotaskQueue hook that would specifically only run N microticks and not all and let events run (if V8 gives us the ability).

That would violate user expectations regarding scheduling microticks vs. event loop yields. We even give a warning about this in the docs:

The next tick queue is completely drained on each pass of the event loop before additional I/O is processed. As a result, recursively setting nextTick() callbacks will block any I/O from happening, just like a while(true); loop.

We should provide a similar warning when we write a guide on promises perhaps. (I've already been planning to work on one).

It would also be useful to provide the same sort of warning when someone actually does this (resolves over a threshold of promises before events run).

One can always subclass their promises and override then to add additional scheduling constraints. This will not work for async functions calling other async functions, but it would address @getify's use case if I understand correctly.
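A rough sketch of that subclassing idea (my own illustration, not a vetted implementation; error routing through the timeout is omitted for brevity): a Promise subclass whose then defers every callback to a macrotask, trading microtask atomicity for starvation resistance.

```javascript
class FairPromise extends Promise {
  then(onFulfilled, onRejected) {
    // Wrap each callback so it runs on a macrotask instead of a microtask.
    const deferred = fn =>
      fn &&
      (value =>
        new Promise(resolve => setTimeout(() => resolve(fn(value)), 0)));
    return super.then(deferred(onFulfilled), deferred(onRejected));
  }
}

// A FairPromise callback runs after plain microtasks and pending timers,
// so a chain of FairPromises cannot starve the event loop.
FairPromise.resolve(1).then(v => console.log('fair', v));
Promise.resolve(2).then(v => console.log('plain', v)); // logs before "fair 1"
```

As the thread notes, await bypasses user-defined then overrides, so this only helps explicit .then chains.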

@MayaLekova

We could provide a MicrotaskQueue hook that would specifically only run N microticks and not all and let events run

This could be done if we broaden the scope of this issue, if we agree that would solve the starvation problem.

@getify
Contributor Author

getify commented Jun 28, 2018

I think this thread shows that we can't rely on monkey-patching then(..) to encapsulate the balanced scheduling semantics on awaited promises.

@littledan

@getify Agreed that current async/await semantics don't allow that.

@ianstormtaylor

ianstormtaylor commented Dec 12, 2018

Hey all, just having to dig into this myself now. I have a question that I think is related.

The use case that just prompted me to investigate this was trying to turn a synchronous, blocking for of loop of JSON.parse calls into something that would be non-blocking. For example my synchronous loop might look like:

async function() {
  const ret = []

  for (const item of items) {
    ret.push(JSON.parse(item))
  }

  return ret
}

// Note that although it's an `async function`, the code is all synchronous.

This one is clearly blocking for the entire for of loop. Until now, I've just worked with the mental model that since promises are asynchronous, using await on one guarantees that you don't block the event loop. (I realize now that this is probably not correct, but it does seem like a widespread misconception if so.) So I assumed to make it non-blocking I could just do:

async function() {
  const ret = []

  for (const item of items) {
    await Promise.resolve()
    ret.push(JSON.parse(item))
  }

  return ret
}

As far as I understand, this makes the loop run asynchronously. But because of the starvation problem of the promise microtask queue being fully drained before moving onto the next phase of the event loop, I believe this doesn't actually unblock the event loop?

Instead you'd have to do something like:

async function() {
  const ret = []

  for (const item of items) {
    await new Promise(resolve => setImmediate(resolve))
    ret.push(JSON.parse(item))
  }

  return ret
}

Which would then unblock the event loop? Is that correct?


The reason I'm asking here instead of on StackOverflow is, if I understood all that correctly...

It makes sense to me how promises have been designed to prioritize "atomicity"—like @littledan explained. I think what could be improved is—like @benjamingr mentioned—making the documentation reference this more clearly. Right now this is super not clear in Node's documentation, and even elsewhere in articles and answers.

For what it's worth...

I started the research journey by re-reading Don't Block the Event Loop carefully. But it didn't answer the question of whether using await is inherently non-blocking, or whether it depends on the content of the awaited function.

And then from there I turned to StackOverflow, but there is lots of misinformation there:

Lots of the answers and comments mention that promises by themselves ensure that the event loop is not blocked, which sounds like it's not true.

Finally after feeling like the StackOverflow questions weren't quite explaining it enough for me, I read a whole bunch of different things without conclusive results. I even found this article by @rauschma that says...

That means that your code can rely on run-to-completion semantics (as explained in part 1) and that chaining promises won't starve other tasks of processing time.

...of which the first phrase makes sense, but the second seems incorrect? Eventually I found this series, which explained that recursive promise resolution can indeed starve the event loop.

Hope that all makes sense. Thanks!

@getify
Contributor Author

getify commented Dec 12, 2018

@ianstormtaylor

Here's an even simpler way of illustrating the problem:

async function main() {
	while (true) {
		await Promise.resolve();
	}
}

main();

setTimeout(function(){ console.log("nope"); },0);

The console.log(..) will never print because the await doesn't ever yield to the event loop to let the timer fire.
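By contrast, a version where each iteration awaits a timer-backed promise (bounded here so the sketch terminates) does yield to the event loop each turn, and the timer callback interleaves instead of being starved:

```javascript
const log = [];

async function main() {
  for (let i = 0; i < 5; i++) {
    // a macrotask-backed promise: each await yields a full event-loop turn
    await new Promise(resolve => setTimeout(resolve, 0));
    log.push(`loop ${i}`);
  }
  log.push('done');
}

main();
setTimeout(() => log.push('timer'), 0); // runs during the loop, not after it
```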

@ianstormtaylor

@getify that's definitely a better way to prove out the behavior!

For my example above, I think my issue is around how to write "partitioning" logic. For infinite loops it's easy to notice the issue (although still frustrating, I admit) because it hangs. But for finite loops I think many people right now are incorrectly assuming that they are not blocking the event loop. Which leads to request timeouts and other things that are hard to debug, especially when lots of the common sources for answers will give them false confidence, and they'll incorrectly dive into other potential blockers.

@devsnek
Member

devsnek commented Dec 12, 2018

Just taking a read through this, i have a few points i'd like to bring up which should help focus the conversation.

As long as you're using await, it isn't possible to starve the microtask queue, because await will yield to another coroutine, putting itself last in the queue.

What is actually happening here isn't even a problem with promises or microtasks; it's a problem with how embedders interact with running the microtask queue.

at a high level, a browser and/or node event loop looks kind of like this:

while (true) {
  runMicrotasks();
  timers.forEach((timer) => {
    if (timer.isReady()) { timer.cb(); }
  });
  IOqueue.forEach((io) => {
    if (io.isLoaded()) { io.cb(); }
  });
}

As long as the microtask queue never empties, the embedder will never be able to run the timers and other various platform specific things.

So i think a better question at this point is what can node/browsers do to make this better? One possible solution could be that embedders insert their own microtask items which would handle timers and whatnot, instead of having to run the queue to completion before doing so.

(as a side note, there's also timer starvation, where recursive timers will also block the event loop, for the same reasons as above, so i'm not convinced this is entirely something that the promise team by itself needs to solve)

@ianstormtaylor
Copy link

As long as you're using await, it isn't possible to starve the microtask queue, because await will yield to another coroutine, putting itself last in the queue.

Isn't that true for promise.then as well? The starvation I was concerned about isn't starving the promise microtask queue, but starving (or blocking) the event loop until the promise microtask queue is completely empty.

But for me, since I don't know enough to have an opinion on what the right call is, it's more of a documentation issue. Since it's really hard to find out what happens with chained promises as far as event loop blocking goes.

@devsnek
Member

devsnek commented Dec 12, 2018

@ianstormtaylor yes it's also true for promise.then, but it's more difficult to model these infinite loops with it.

@getify
Contributor Author

getify commented Dec 12, 2018

To be clear, starvation is not only a concern with "infinite loops" -- those are just a convenient illustration of the problem -- but even a blocking finite loop that happens to prevent some other resource from being able to process, where you would normally expect them to interleave operations with each other, such as was shown in my OP.

@devsnek
Member

devsnek commented Dec 12, 2018

I'm not sure that people expect timers to be interleaved with microtasks. They haven't ever been interleaved in browsers or node. They're even referred to as macrotasks instead, because they're kind of conceptually different.

@getify
Contributor Author

getify commented Dec 12, 2018

It's not specifically a question about "timers", though that's a convenient illustration mechanism.

I think the expectation people have is that "asynchronous" (like how promises resolve, and any other means of asynchrony, like timers or postMessage or ajax or anything else) always means "non-blocking", which most people seem to use as meaning "every async thing gets an equal chance to schedule its next step on the event loop, so everything gets processed eventually in due turn."

Often that's the case, but it's not necessarily true of promises (or other microtask queueing mechanisms) because of the way the microtask queue is eagerly exhausted before yielding to any other part of the system. It's like microtasks get to cut in line and stay at the front of the line, and the rest of the event loop queue is just waiting indefinitely.

What I would have imagined (and have alluded to at various points earlier in this thread) is that to prevent the microtask queue from starving the system, there could sort of be a bailout limit similar to the recursion-stack RangeError exception, or how timers have clamping to prevent run-away timer recursion. In both those cases, the environment steps in and overrides something to prevent undesirable behavior.

For example, the microtask queue could spin for a maximum of N items, and if it exceeds that, then the environment spins a tick of the event loop before proceeding with the microtask queue again.

This still ensures predictable ordering on the microtask queue, because it's still a queue, and any newly scheduled microtasks on that subsequent spin of the event loop would just get in line behind any that were leftover unprocessed.

It's just basically a sanity check so that the system never spins unboundedly.

This suggestion may not turn out to be a good solution (it may not even fix my CSP example), but it's the sort of thing that I wish we were exploring rather than just ignoring the problem entirely.
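For what it's worth, the bounded-drain idea can be approximated in userland today (illustrative names; a real engine-level fix would look different): a tiny task queue that runs at most BATCH items per drain, then reschedules itself on a macrotask so timers and I/O get a turn between batches.

```javascript
const BATCH = 25;
const queue = [];

function enqueue(fn) {
  queue.push(fn);
  if (queue.length === 1) queueMicrotask(drain); // first item starts a drain
}

function drain() {
  let n = 0;
  while (queue.length > 0 && n < BATCH) {
    queue.shift()();
    n++;
  }
  // Anything left over waits behind one full turn of the event loop,
  // so newly scheduled tasks just queue up behind the leftovers.
  if (queue.length > 0) setTimeout(drain, 0);
}
```

This preserves queue ordering, exactly as described above: it's still a queue, only drained in bounded batches.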

@devsnek
Member

devsnek commented Dec 13, 2018

cc @nodejs/v8 @nodejs/chakracore @nodejs/timers on the above

@targos
Member

targos commented Jan 2, 2019

Closing this issue because the repository is being archived: nodejs/admin#283

@targos targos closed this as completed Jan 2, 2019