
What to do about asynchronous exceptions #52291

@Keno

Description


Following up on recent discussions prompted by our improvements to the modeling of exception propagation, I've been thinking again about the semantics of asynchronous exceptions (in particular StackOverflowError and InterruptException). This is of course not a new discussion; see e.g. #4037, #7026, #15514. However, because of our recent improvements to exception flow modeling, I think this issue has gained new urgency.

Current situation

Before jumping into some discussion about potential enhancements, let me summarize some relevant history and our current state. Let me know if I forgot anything and I'll edit it in.

InterruptException

By default, we defer interrupt exceptions to the next GC safepoint. This helps avoid corruption caused by unwinding over state in C that isn't interrupt safe. It helps a bit, but of course, if you are actually inside this kind of region, your experience will be something like the following:

```julia
julia> randn(10000, 10000) * randn(10000, 10000)
^C^C^C^C^C^CWARNING: Force throwing a SIGINT
```

This isn't any better than before (because we're falling back to what we used to do) and is arguably worse (because you had to press ^C many times).
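For code that must not be unwound mid-flight, Base already offers a way to defer SIGINT explicitly. A minimal sketch (the wrapper function name is made up for illustration):

```julia
# Sketch: deferring SIGINT around a region that is not interrupt safe.
# `disable_sigint` masks SIGINT delivery for the duration of the block;
# a pending ^C is raised once the block exits.
function guarded_matmul(A, B)
    disable_sigint() do
        A * B   # stand-in for a ccall that must not be unwound
    end
end
```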

StackOverflowError

StackOverflowError just exposes the OS notion of stack overflow (i.e. if something touches the guard page, the OS sends us a SIGSEGV, which we turn into the appropriate Julia error). We are slightly better here than we used to be, since we now at least stack-probe large allocations (#40068).

Nevertheless, this is again still not particularly well defined semantically. For example, is the following actually semantically sound?

julia/base/strings/io.jl, lines 32 to 40 at 187e8c2:

```julia
function print(io::IO, x)
    lock(io)
    try
        show(io, x)
    finally
        unlock(io)
    end
    return nothing
end
```

I think the answer is probably "no", because setting up the exception frame touches the stack, so we could be generating a StackOverflowError after the lock, but before we enter the try/finally region. Additionally, if we are close enough to the stack bottom to cause a stack overflow, there's no guarantee that we won't immediately hit that same stack overflow again while trying to run the unlock code.
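The same window can be illustrated with a plain ReentrantLock; this is a hypothetical sketch of where the asynchronous exception would have to land in order to leak the lock:

```julia
const demo_lock = ReentrantLock()

function guarded_work(x)
    lock(demo_lock)   # the lock is held from this point on…
    # …but an asynchronous StackOverflowError delivered HERE, before the
    # try region below is entered, would skip the finally entirely and
    # leave `demo_lock` locked forever.
    try
        return 2x     # stand-in for `show(io, x)`
    finally
        unlock(demo_lock)  # unwinding into this call itself also needs
                           # stack space that may not exist near the
                           # guard page
    end
end
```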

Recent try/finally elision

On master, inference has the capability to reason about the type of exceptions and whether or not catch blocks run. As a result, we can end up eliding try/finally blocks if everything inside the try block is proven nothrow:

```julia
@noinline function some_long_running_but_pure_computation(x)
    for i = 1:10000000000
        x ⊻= x >> 7
        x ⊻= x << 9
        x ⊻= x << 13
    end
    return x
end

function try_finally_elision(x)
    println("Hello")
    try
        some_long_running_but_pure_computation(x)
    finally
        println("World")
    end
end
```

As a result, we can get the following behavior:

```julia
julia> try_finally_elision(UInt64(0))
Hello
^C^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:
Stacktrace:
 [1] xor(x::UInt64, y::UInt64)
   @ Base ./int.jl:373 [inlined]
 [2] some_long_running_but_pure_computation(x::UInt64)
   @ Main ./REPL[15]:4
 [3] try_finally_elision(x::UInt64)
   @ Main ./REPL[18]:4
 [4] top-level scope
   @ REPL[19]:1
```

i.e. the finally block never ran.
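To see why this elision is legal, one can inspect what inference has proven about the inner function; a sketch assuming a recent master where the internal `Base.infer_effects` helper is available:

```julia
# Smaller stand-in for the loop above.
@noinline function pure_loop(x)
    for i = 1:1000
        x ⊻= x >> 7
    end
    return x
end

# `Base.infer_effects` is an internal helper; the effects it returns
# include a `nothrow` bit. When that bit is set, inference has proven
# that no exception can escape the call, which is exactly what licenses
# dropping the surrounding try/finally.
eff = Base.infer_effects(pure_loop, (UInt64,))
```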

Some thoughts on how to move forward

I don't think I really have answers, but here's some scattered thoughts:

  • If necessary, we can fix the try/finally elision thing by adding the asynchronous exceptions into the exception set (with an option to disable this for external abstract interpreters as needed), but of course we'd lose the optimization benefits from the modeling precision.
  • I think these sorts of asynchronous exceptions should likely not participate in the ordinary try/catch mechanism at all (as previously proposed e.g. in julep: "chain of custody" error handling #7026).
  • If we do have asynchronous exceptions, this likely implies that we need to separate the IR representations of try/finally and try/catch.
  • If we have asynchronous exceptions, we need to be very clear about when this sort of asynchronous exception can occur (as in the try/lock/unlock example above). For StackOverflowError in particular, I think it would be reasonable to specify that exceptions can only occur at function call boundaries.
  • It seems tempting to try to avoid asynchronous StackOverflowErrors entirely. The two primary solutions here are segmented stacks and stack copying. https://without.boats/blog/futures-and-segmented-stacks/ has a pretty good overview of this. It is worth treading cautiously here, however. As the linked article notes, both Go and Rust ended up backing out segmented stacks again. On the other hand, we devirtualize significantly more aggressively than Rust and don't use as much variable-sized stack space. We also have much more control over our native dependencies, so it's entirely possible that we wouldn't run into these issues.

A possible design for cancellation

I think the general consensus among language and API designers is that arbitrary cancellation is unworkable as an interface. Instead, one should favor explicit cancellation requests and cancellation checks. In that vein, we could consider having an explicit @cancel_check macro that expands to:

```julia
if cancellation_requested(current_task())
    throw(InterruptException())
end
```

For more complex use cases, cancellation_requested could be called directly and additional cleanup performed (e.g. requesting the cancellation of any synchronously launched I/O operations). As an additional optimization, we could take advantage of our (recently much improved) :effect_free modeling to add the ability to reset (by longjmp) to the previous cancellation_requested check if there have been no intervening side effects. This extension could then also be used by external C libraries to register their own cancellation mechanisms, in effect giving us back some variant of scoped asynchronous cancellation, but only where it is semantically unobservable or explicitly opted into.
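As a sketch of how the pieces could fit together (all names here — cancellation_requested, request_cancellation!, the side-table flag storage — are hypothetical, taken from the proposal above rather than any existing API):

```julia
# Hypothetical cancellation plumbing; a real implementation would keep
# the flag on the Task itself rather than in a side table.
const _cancel_flags = IdDict{Task,Bool}()

cancellation_requested(t::Task) = get(_cancel_flags, t, false)
request_cancellation!(t::Task) = (_cancel_flags[t] = true)

macro cancel_check()
    quote
        if cancellation_requested(current_task())
            throw(InterruptException())
        end
        nothing
    end
end

# A long-running loop would then opt into cancellation explicitly:
function cancellable_loop(n)
    acc = 0
    for i = 1:n
        acc += i
        @cancel_check   # the only points where InterruptException may fire
    end
    return acc
end
```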

That of course leaves the question of what would happen if there is no cancellation point set. My preference here would be to wait a reasonable amount of time (a few seconds or so, bypassable by a second press of ^C) and, if no cancellation point is reached in time:

  1. Print a backtrace of the current location, with a suggestion to ask the appropriate package author to add more cancellation points.
  2. Return to the REPL on a new task/thread.

This way, we never throw unsafe asynchronous exceptions that could corrupt the process state, but we give the user back a REPL that they can use to either investigate the problem or at least save any in-progress work they may have. There are few things more frustrating than losing your workspace state because the ^C you pressed happened to corrupt and crash your process.

One final note here is to ask what should happen while we're in inference or LLVM. Since they are not modeled, we are semantically not supposed to throw any InterruptExceptions there. With the design above, the answer would be that on entry we would stop inferring/compiling things and instead proceed in the interpreter, in the hope of hitting the next cancellation point as quickly as possible. If cancellation becomes active while we are compiling, we would try to bail out as soon as feasible.

My recommendation

Having written all this down, I think my preference would be a combination of the above cancellation proposal with some mechanism to avoid StackOverflowErrors entirely. To start with, we could enable some sort of segmented task-stack support, but treat triggering it as an error to be thrown at the next cancellation point. We should also investigate whether we can more fully model a function's stack size requirements, since we tend to be more aggressively devirtualized. If we can, we could consider using a segmented stack mechanism more widely; even if there is some performance penalty, getting rid of the possibility of asynchronous exceptions is well worth it.
