@Keno (Member) commented May 27, 2025

Introduction

This PR adds a new control flow mechanism called await. In this PR, it is only exposed by the macro @Base.Experimental.oc_await, which has the following docstring:

    Base.Experimental.@oc_await [argt] [C->retblock]

Capture the current function's execution context for later resumption.
By default, immediately returns to the caller, returning an `OpaqueClosure` that
may be invoked to continue execution. If the optional `C->retblock` argument is
provided, `retblock` is executed in the context of the current function, with the
continuation bound to `C`. If `argt` is provided, the continuation will further
expect arguments `argt` to be provided when invoked.

Adding a feature like this was part of the original design of opaque closures, but was never fully implemented for lack of immediate need. There are several ways to think of this feature:

  1. As an alternative representation for :new_opaque_closure that is more friendly to other optimization passes.
  2. As an implementation of a particular kind of delimited continuation
  3. As an implementation of C++20-style coroutines

The key feature of this design is that the decision of which values go in the capture list/residual/etc. is deferred all the way through to the last possible moment in LLVM. As a result, all the ordinary optimizations (DCE, SROA, various AD transforms, etc.) can be applied as usual across the suspension boundary.
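As a rough mental model only, the effect of a suspension point can be written by hand with an ordinary closure. This sketch does not use `@oc_await` (which needs this PR's branch); the name `scale_then_add` is hypothetical, and the capture list is chosen manually here, whereas the design above would leave that choice to the compiler after optimization.

```julia
# Hand-written model of a function that suspends once and is resumed later.
# Only `scaled` is live after the suspension point, so only it is captured.
function scale_then_add(x)
    scaled = 2x                  # computed before the suspension point
    unused = x + 1               # dead after the suspension point: not captured
    resume = y -> scaled + y     # stands in for the returned continuation
    return resume
end

k = scale_then_add(3)
k(4)    # 10
k(10)   # 16 (an ordinary closure can be invoked more than once)
```

In this picture, an optimization such as DCE removing `unused` before the capture list is fixed is what keeps the captured state small.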

Implementation status

As of the writing of this commit message, the implementation is minimal. I've added lowering support, and the new IR node type, as well as support in the interpreter to play with the semantics. However, there is no compiler support or optimization support yet (so you need to run julia --compile=min to play with it).

Semantic TODO:

  • How does this mix with try/catch?
  • Does await capture other task-bound state?
    • scope (yes?)
    • locks? (no?)
    • timing? (no?)
    • rng? (no?)

Representational TODO:

  • How does argt get represented in the continuation

Inference TODO:

  • Implement AwaitNode inference support

Codegen TODO:

  • Define and implement julia.coro intrinsics to lower this to
  • Implement the appropriate lowering

Runtime TODO:

  • Allow the OC captures to be allocated inline with a GC descriptor for pointers

Detailed semantic discussion

General semantic details

There are some semantic similarities with try/catch (in that they're both kinds of continuations). However, the semantics are quite different:

  1. Try/catch always jumps up the stack; await makes no assumptions (but copies the state of the topmost stack frame, so there are two independent copies of it).
  2. await is always delimited by return (which terminates the continuation).
  3. await is multi-shot. However, I think single-shot is useful, so there is a currently unused flags argument that might be used to ask for a single-shot continuation.
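To make the multi-shot versus single-shot distinction concrete, here is a minimal sketch that models a single-shot continuation as a wrapper around an ordinary callable. `OneShot` is a hypothetical name invented for illustration; the PR itself only reserves an unused flags argument and does not define this type.

```julia
# Hypothetical model: wraps a callable and rejects a second invocation.
mutable struct OneShot{F}
    f::F
    used::Bool
end
OneShot(f) = OneShot(f, false)

function (k::OneShot)(args...)
    k.used && error("single-shot continuation invoked twice")
    k.used = true
    return k.f(args...)
end

multi = x -> x + 1           # multi-shot: may be invoked any number of times
single = OneShot(multi)      # single-shot view of the same continuation

multi(1); multi(1)           # both calls succeed
single(1)                    # 2; a second `single(1)` would throw
```

Knowing that a continuation runs at most once is what would let its captured state be reused rather than copied.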

Syntax level

This adds a new syntax form (symbolicawait continue_at argt flags). continue_at is a label name created with symboliclabel. The semantics are that the execution of symbolicawait captures all local slots and ssa values and returns an opaque closure that, when called, restores all local slots and ssa values and resumes the execution at the label continue_at. Regular execution continues as usual at the next statement after symbolicawait. Modifications to slots (or ssavalues) after symbolicawait do not affect the value of said slots/ssavalues in the continuation.

IR level

This adds a new AwaitNode. It is in some ways structurally similar to EnterNode in that it has a non-local successor that may later be jumped to. The non-local successor in both AwaitNode (i.e. the continuation) and EnterNode (i.e. the catch block) is a statement/bb index integer inside the struct. However, there are also some differences:

  1. AwaitNode is always delimited by ReturnNode; there are no equivalent :leave or :pop_exception statements.
  2. AwaitNode returns a regular value (an opaque closure), not a token. AwaitNode may be DCE'd if there are no uses.

LLVM level [unimplemented]

The rough plan is to implement something similar to llvm.coro, although we cannot use it directly, since we need special handling for our GC-tracked pointers. However, we may be able to borrow some code.

Potential users

I have the following potential use cases in mind immediately, although the mechanism is of course quite general.

In Base:

  1. Task
  2. The futures mechanism in Compiler

In downstream packages:

  1. The carried residual in reverse-mode AD packages like Diffractor or Enzyme (I have no direct insight into Enzyme, but since the plan is to expose this down to the LLVM level, I imagine it could use it).
  2. Carried state between torn partitions in DAECompiler.
  3. A faster, more reliable implementation of ResumableFunctions.jl

@Keno (Member, Author) commented May 27, 2025

Summarizing some design questions from this morning:

1. Isn't this too many allocations?

Q: In the generator use case

function foo()
    i = 0
    while true
        @yield i
        i += 1
    end
end

the PR as is would allocate one OC per iteration. Isn't this too many?

A: Yes, it's too many. My proposal is to have a prealloc_await_state() intrinsic that's used (in combination with one-shot await) like:

function foo()
    i = 0
    state = prealloc_await_state()
    while true
        oc = await(state, AWAIT_ONE_SHOT) continue at #cont
        return (oc, i)
        @label cont
        i += 1
    end
end

The optimizer propagates the necessary information backwards to allocate the state, and the old state is dead on entry (after the one-shot check) and can be re-used. This reduces the total number of allocations to 1. To get to zero, we'd need a way to query LLVM for the size in the callee so that we can allocate an appropriate stack size. We do not currently have such a facility, but there would be several clients for this, so we should discuss it separately.
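For comparison, the single-allocation outcome described above can be sketched as a hand-written state machine in ordinary Julia. This does not use `prealloc_await_state()` or one-shot await (neither exists yet); `CounterState`, `counter` and `resume!` are hypothetical names, and the sketch only shows the shape of code the optimization would aim for.

```julia
# One mutable state object is allocated up front and reused on every resume.
mutable struct CounterState
    i::Int
    started::Bool
end

counter() = CounterState(0, false)   # the single allocation

function resume!(s::CounterState)
    if s.started
        s.i += 1                     # the code after `@label cont`
    else
        s.started = true             # first entry: `i` is still 0
    end
    return s.i                       # the value handed to `@yield`
end

g = counter()
[resume!(g) for _ in 1:3]   # [0, 1, 2]
```

Because each resume overwrites the previous state in place, this model is inherently single-shot: an earlier point of the iteration cannot be resumed again.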

2. Which scope does return return from?

In

#1  %1 = await Tuple{Int64} resuming #2
    %2 = (%1)(1)
    %3 = add_int(10, %2)
    return %3

#2  return 100

Does this return 100 or 110?

A: return returns from the most recent invocation, so this returns 110.
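The same answer falls out of an ordinary-closure model of the IR above. In this sketch the anonymous function stands in for the continuation that resumes at block #2; `outer` and `arg` are hypothetical names.

```julia
function outer()
    k = (arg::Int) -> 100    # stands in for the continuation resuming at #2
    r = k(1)                 # `return 100` returns from this invocation only
    return 10 + r            # corresponds to add_int(10, %2)
end

outer()   # 110
```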

3. What are the slot capture semantics?

In

x = 1
f = $(Expr(:await, :cont))
f()
return x
@label cont
x = 2
return

Does the original invocation return 1 or 2?

A: Slot state is forked at the await point. The original invocation returns 1.
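The forking behaviour can be imitated with an ordinary closure and a `let` block that gives the closure its own copy of `x`. This is only an analogy for the semantics described here, not how the PR implements it; the function name `original` is hypothetical.

```julia
function original()
    x = 1
    f = let x = x            # fork: the continuation gets its own `x`
        () -> begin
            x = 2            # assigns the continuation's copy only
            return nothing
        end
    end
    f()
    return x                 # the original invocation's `x` is untouched
end

original()   # 1
```

Without the `let`, a regular Julia closure would share the binding of `x` with the enclosing function and the result would be 2, which is exactly the difference from await that this question is about.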

4. Does AwaitNode need the rest of the new_opaque_closure arguments?

I think probably yes on isva and nargs (although isva could be folded into flags). For the rt lb/ub, I was thinking of splitting them into a new intrinsic:

constrain_opaque_closure(oc, lb, ub)

and moving the arguments out of new_opaque_closure also.

@Keno (Member, Author) commented May 27, 2025

  • How does this mix with try/catch

I'm inclined to say it's disallowed inside try/catch(/finally) for the time being. I think it would be surprising if:

try
    a()
    @await
    b()
catch
end

didn't catch exceptions from b. Also, should:

try
    a()
    @await
    b()
finally
    println("Finally")
end

print twice? Probably not.

I think it would be a reasonable semantics for the try/catch to last until an exit from the try/catch region on the continue side of the await, but that can be arbitrarily later of course (so the finally needs to be registered as a finalizer on the OpaqueClosure?). I think this is too complicated for unclear benefit at this time, so my inclination is to disallow this.

@Keno (Member, Author) commented May 28, 2025

To get to zero, we'd need a way to query LLVM for the size in the callee so that we can allocate an appropriate stack size.

Per discussion this morning, there are snakes down the path of compile-time queries of optimizer properties (which I'll do a separate writeup on). The proposed fix is the following:

struct AwaitBuffer
    inuse::Bool #= probably sunk into one of the following integers =#
    npointers::UInt
    nbytes::UInt
    #= [ npointers * Any ] =#
    #= [ nbytes * UInt8 ] =#
end

@noinline function foo()
    size = await_size()
    oc = await(nothing, Tuple{AwaitBuffer}, AWAIT_NO_STATE)
    return (size, oc)
    state = $(Expr(:await_acquire, Argument(1), size))
    i = 0
    while true
        oc = await(state, Tuple{}, AWAIT_ONE_SHOT) continue at #cont
        return (i, oc)
        @label cont
        i += 1
    end
end

@inline function iterate_resumable(f)
    (size, oc) = f()
    oc(AwaitBuffer(size))
end
iterate_resumable(f, oc) = oc()

It's a little bit more complicated because of the magic await_acquire, but I think it does what's needed.

@MasonProtter (Contributor) commented Nov 30, 2025

FWIW, I'm very interested in this and hope it could still be made to happen. Just about everything that (mis-)uses try/catch as a non-local control-flow construct should really be using this sort of mechanism, and potentially even a lot of 'legitimate' uses of try/catch could be replaced with it.
