
What to do about asynchronous exceptions #52291

@Keno

Description


Following up on recent discussions prompted by our improvements to the modeling of exception propagation, I've been thinking again about the semantics of asynchronous exceptions (in particular StackOverflowError and InterruptException). This is of course not a new discussion; see e.g. #4037, #7026, #15514. However, because of our recent improvements to exception flow modeling, I think this issue has gained new urgency.

Current situation

Before jumping into some discussion about potential enhancements, let me summarize some relevant history and our current state. Let me know if I forgot anything and I'll edit it in.

InterruptException

By default, we defer interrupt exceptions to the next GC safepoint. This helps avoid corruption caused by unwinding over state in C that isn't interrupt safe. It helps a bit, but of course, if you are actually inside this kind of region, your experience will be something like the following:

```julia
julia> randn(10000, 10000) * randn(10000, 10000)
^C^C^C^C^C^CWARNING: Force throwing a SIGINT
```

This isn't any better than before (because we're falling back to what we used to do) and is arguably worse (because you had to press ^C many times).
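For code that must not be unwound mid-flight, Base already offers a way to defer SIGINT explicitly. A minimal sketch (the wrapper function name is made up for illustration):

```julia
# Sketch: deferring SIGINT around a region that is not interrupt safe.
# `disable_sigint` masks SIGINT delivery for the duration of the block;
# a pending ^C is raised once the block exits.
function guarded_matmul(A, B)
    disable_sigint() do
        A * B   # stand-in for a ccall that must not be unwound
    end
end
```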

StackOverflowError

StackOverflowError just exposes the OS notion of stack overflow (i.e. if something touches the guard page, the OS sends us a SIGSEGV, which we turn into the appropriate Julia error). We are slightly better here than we used to be, since we now at least stack-probe large allocations (#40068).

Nevertheless, this is again still not particularly well defined semantically. For example, is the following actually semantically sound?

julia/base/strings/io.jl, lines 32 to 40 at 187e8c2:

```julia
function print(io::IO, x)
    lock(io)
    try
        show(io, x)
    finally
        unlock(io)
    end
    return nothing
end
```

I think the answer is probably "no", because setting up the exception frame touches the stack, so we could be generating a StackOverflowError after the lock, but before we enter the try/finally region. Additionally, if we are close enough to the stack bottom to cause a stack overflow, there's no guarantee that we won't immediately hit that same stack overflow again while trying to run the unlock code.
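The same window can be illustrated with a plain ReentrantLock; this is a hypothetical sketch of where the asynchronous exception would have to land in order to leak the lock:

```julia
const demo_lock = ReentrantLock()

function guarded_work(x)
    lock(demo_lock)   # the lock is held from this point on…
    # …but an asynchronous StackOverflowError delivered HERE, before the
    # try region below is entered, would skip the finally entirely and
    # leave `demo_lock` locked forever.
    try
        return 2x     # stand-in for `show(io, x)`
    finally
        unlock(demo_lock)  # unwinding into this call itself also needs
                           # stack space that may not exist near the
                           # guard page
    end
end
```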

Recent try/finally elision

On master, inference has the capability to reason about the type of exceptions and whether or not catch blocks run. As a result, we can end up eliding try/finally blocks if everything inside the try block is proven nothrow:

```julia
@noinline function some_long_running_but_pure_computation(x)
    for i = 1:10000000000
        x ⊻= x >> 7
        x ⊻= x << 9
        x ⊻= x << 13
    end
    return x
end

function try_finally_elision(x)
    println("Hello")
    try
        some_long_running_but_pure_computation(x)
    finally
        println("World")
    end
end
```

As a result, we can get the following behavior:

```julia
julia> try_finally_elision(UInt64(0))
Hello
^C^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:
Stacktrace:
 [1] xor(x::UInt64, y::UInt64)
   @ Base ./int.jl:373 [inlined]
 [2] some_long_running_but_pure_computation(x::UInt64)
   @ Main ./REPL[15]:4
 [3] try_finally_elision(x::UInt64)
   @ Main ./REPL[18]:4
 [4] top-level scope
   @ REPL[19]:1
```

i.e. the finally block never ran.
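To see why this elision is legal, one can inspect what inference has proven about the inner function; a sketch assuming a recent master where the internal `Base.infer_effects` helper is available:

```julia
# Smaller stand-in for the loop above.
@noinline function pure_loop(x)
    for i = 1:1000
        x ⊻= x >> 7
    end
    return x
end

# `Base.infer_effects` is an internal helper; the effects it returns
# include a `nothrow` bit. When that bit is set, inference has proven
# that no exception can escape the call, which is exactly what licenses
# dropping the surrounding try/finally.
eff = Base.infer_effects(pure_loop, (UInt64,))
```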

Some thoughts on how to move forward

I don't think I really have answers, but here's some scattered thoughts:

  • If necessary, we can fix the try/finally elision thing by adding the asynchronous exceptions into the exception set (with an option to disable this for external abstract interpreters as needed), but of course we'd lose the optimization benefits from the modeling precision.
  • I think these sorts of asynchronous exceptions should likely not participate in the ordinary try/catch mechanism at all (as previously proposed e.g. in julep: "chain of custody" error handling #7026).
  • If we do have asynchronous exceptions, this likely implies that we need to separate the IR representations of try/finally and try/catch.
  • If we have asynchronous exceptions, we need to be very clear about when this sort of asynchronous exception can occur (as in the try/lock/unlock example above). For StackOverflowError in particular, I think it would be reasonable to specify that exceptions can only occur at function call boundaries.
  • It seems tempting to try to avoid asynchronous StackOverflowErrors entirely. The two primary solutions here are segmented stacks and stack copying. https://without.boats/blog/futures-and-segmented-stacks/ has a pretty good overview of this. It is worth treading cautiously here, however. As the linked article notes, both Go and Rust ended up backing out segmented stacks again. On the other hand, we devirtualize significantly more aggressively than Rust and don't use as much variable-sized stack space. We also have much more control over our native dependencies, so it's entirely possible that we wouldn't run into these issues.

A possible design for cancellation

I think the general consensus among language and API designers is that arbitrary cancellation is unworkable as an interface. Instead, one should favor explicit cancellation requests and cancellation checks. In that vein, we could consider having an explicit @cancel_check macro that expands to:

```julia
if cancellation_requested(current_task())
    throw(InterruptException())
end
```

For more complex use cases, cancellation_requested could be called directly and additional cleanup performed (e.g. requesting the cancellation of any synchronously launched I/O operations). As an additional optimization, we could take advantage of our (recently much improved) :effect_free modeling to add the ability to reset (by longjmp) to the previous cancellation_requested check if there have been no intervening side effects. This extension could then also be used by external C libraries to register their own cancellation mechanisms, in effect giving us back some variant of scoped asynchronous cancellation, but only where it is semantically unobservable or explicitly opted into.
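As a sketch of how the pieces could fit together (all names here — cancellation_requested, request_cancellation!, the side-table flag storage — are hypothetical, taken from the proposal above rather than any existing API):

```julia
# Hypothetical cancellation plumbing; a real implementation would keep
# the flag on the Task itself rather than in a side table.
const _cancel_flags = IdDict{Task,Bool}()

cancellation_requested(t::Task) = get(_cancel_flags, t, false)
request_cancellation!(t::Task) = (_cancel_flags[t] = true)

macro cancel_check()
    quote
        if cancellation_requested(current_task())
            throw(InterruptException())
        end
        nothing
    end
end

# A long-running loop would then opt into cancellation explicitly:
function cancellable_loop(n)
    acc = 0
    for i = 1:n
        acc += i
        @cancel_check   # the only points where InterruptException may fire
    end
    return acc
end
```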

That of course leaves the question of what would happen if there is no cancellation point set. My preference here would be to wait a reasonable amount of time (a few seconds or so, bypassable by a second press of ^C) and, if no cancellation point is reached in time:

  1. Print a backtrace of the current location, with a suggestion to ask the appropriate package author to add more cancellation points.
  2. Return to the REPL on a new task/thread.

This way, we never throw unsafe asynchronous exceptions that could corrupt the process state, but we give the user back a REPL that they can use to either investigate the problem or at least save any in-progress work they may have. There are few things more frustrating than losing your workspace state because the ^C you pressed happened to corrupt and crash your process.

One final note here is to ask what should happen while we're in inference or LLVM. Since they are not modeled, we are semantically not supposed to throw any InterruptExceptions there. With the design above, the answer would be that on entry we would stop inferring/compiling things and instead proceed in the interpreter, in the hope of hitting the next cancellation point as quickly as possible. If cancellation becomes active while we are compiling, we would try to bail out as soon as feasible.

My recommendation

Having written all this down, I think my preference would be a combination of the above cancellation proposal with some mechanism to avoid StackOverflowErrors entirely. To start with, we could enable some sort of segmented task-stack support, but treat triggering it as an error to be thrown at the next cancellation point. We should also investigate whether we can more fully model a function's stack size requirements, since we tend to be more aggressively devirtualized. If we can, we could consider using a segmented stack mechanism more widely; even if there is some performance penalty, getting rid of the possibility of asynchronous exceptions is well worth it.
