
Scheduling directive to support ring buffering #7967

Merged: 65 commits merged into main on Dec 19, 2023
Conversation

@vksnk (Member) commented Nov 30, 2023

This PR adds a new scheduling directive which allows double buffering to be expressed explicitly in the schedule.

The simple motivating example is a pipeline of two functions A and B, where B depends on A. One common computational pattern is to process tile tile_index of A, then start processing tile tile_index of B while simultaneously starting tile tile_index + 1 of A, overlapping the computation of A and B in time. It's currently possible to express some subset of such schedules using fold_storage and async, but that approach is tied to the storage-folding logic and cannot double buffer at an arbitrary loop level.

The new double_buffer directive (later renamed to ring_buffer in this PR) can be used together with hoist_storage to express the pipeline above like this:

Func A("A"), B("B");
Var x, y, xo, yo, xi, yi;

A(x, y) = x + y;
B(x, y) = A(x - 1, y - 1) + A(x, y) + A(x + 1, y + 1);

B.compute_root()
    .tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::RoundUp);
A.compute_at(B, xo)
    .hoist_storage_root()
    .double_buffer()
    .async();

At a high level, this schedule moves the storage of A outside of the loops over tiles (in this case there are two loops over tiles, xo and yo), expands it by 2x by adding a new dimension to the corresponding Realize node, and adds the corresponding synchronization (similar to what fold_storage would do) to make sure it's safe with async and that A can run ahead by one tile.
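For intuition only, here is a minimal standalone C++ sketch of the pattern this schedule expresses: a producer fills a two-slot ring buffer one tile ahead of a consumer, with counting semaphores providing the synchronization. This is not code generated by Halide and not the implementation in this PR; the names, tile size, and threading machinery are illustrative assumptions.

#include <cstdio>
#include <semaphore>
#include <thread>
#include <vector>

int main() {
    constexpr int num_tiles = 8;
    constexpr int tile_size = 16;
    constexpr int ring_extent = 2;  // "double buffering": the producer may run one tile ahead

    // Storage expanded by the ring-buffer extent, analogous to the widened Realize node.
    std::vector<std::vector<int>> slots(ring_extent, std::vector<int>(tile_size));

    std::counting_semaphore<ring_extent> free_slots(ring_extent);  // producer waits for space
    std::counting_semaphore<ring_extent> full_slots(0);            // consumer waits for data

    std::thread producer([&] {  // plays the role of the async() producer A
        for (int t = 0; t < num_tiles; t++) {
            free_slots.acquire();
            auto &slot = slots[t % ring_extent];
            for (int x = 0; x < tile_size; x++) {
                slot[x] = t * tile_size + x;  // "compute tile t of A"
            }
            full_slots.release();
        }
    });

    for (int t = 0; t < num_tiles; t++) {  // the consumer plays the role of B
        full_slots.acquire();
        const auto &slot = slots[t % ring_extent];
        long sum = 0;
        for (int x = 0; x < tile_size; x++) {
            sum += slot[x];  // "consume tile t of A"
        }
        free_slots.release();
        std::printf("tile %d: sum = %ld\n", t, sum);
    }

    producer.join();
    return 0;
}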

@vksnk requested a review from abadams on November 30, 2023 03:52
@steven-johnson (Contributor):

This is definitely gonna need a tutorial :-)

@mcourteaux (Contributor) left a comment:

Amazing update! A few minor comments: mainly about traces of "double buffer" naming not being replaced by "ring_buffer".

@steven-johnson changed the title from "Scheduling directive to support double buffering" to "Scheduling directive to support ring buffering" on Dec 12, 2023
@steven-johnson (Contributor) left a comment:

Absolutely beautiful work, comments are mostly nits, except:

We should add Python bindings (and at least a basic test using them) before landing this, but that's trivial (and I can contribute it directly to this PR if you like, just LMK)

auto it = env.find(op->name);
internal_assert(it != env.end());
Function f = it->second;
if (f.schedule().async() && hoisted_storages.count(op->name) == 0) {
Contributor:

nit: some clang-tidy variants will always want to prefer .empty() to count() == 0.

vksnk (Member Author):

hoisted_storages is a std::set, and count() is basically used to check whether the element is in the set. I am not sure how empty() could be used instead.
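For reference, a tiny sketch of the membership test being discussed; this is an illustration, not PR code. The C++20 contains() member is the usual drop-in alternative suggested for count() in this pattern, whereas empty() indeed does not apply here:

#include <cassert>
#include <set>
#include <string>

int main() {
    std::set<std::string> hoisted_storages = {"A"};

    // The pattern in the PR: count() == 0 means "not present in the set".
    bool not_present_via_count = hoisted_storages.count("B") == 0;

    // C++20 alternative with the same meaning.
    bool not_present_via_contains = !hoisted_storages.contains("B");

    assert(not_present_via_count == not_present_via_contains);
    return 0;
}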

consumer
.compute_root()
.tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::RoundUp);
producer
Contributor:

Suggestion: even though this is just a test, it's worth adding some comments about what this schedule is doing and how the storage schedules interact.

vksnk (Member Author):

I've made test descriptions a bit more concrete.

});
}

// Two non-async double-buffered producers and two consumers.
Contributor:

Have we tested this under TSAN yet?

consumer.compute_root()
.tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::Auto);

producer
Contributor:

Suggestion: it would be useful to show what the storage cost and execution time of each schedule are.

vksnk (Member Author):

I've added a couple of notes throughout the tutorial that the storage is going to be doubled.

consumer.realize({128, 128, 4});
}

{
Contributor:

I'm a little confused about why the previous example exists -- it seems like we're trying to show that you can use fold_storage for this purpose, but IIUC, it's an awkward tool for the job and ring_buffer is really a better choice? If so, are we doing the user a service by giving them a complex example of how not to do it?

Contributor:

So, the short version is that there are two reasons:

  • fold_storage() only works over storage dimensions of the Func. Not arbitrary loop levels. ring_buffer() can be used over any loop (not only those that explicitly go over a storage dimension). In general, I feel like this is an awkward constraint.
  • ring_buffer() not only ring buffers, but also injects the semaphore logic to make sure that no race condition happens when combining it with async().

I do wonder if there is any use for ring_buffer() without async(), @vksnk?


A more intuitive/cleaner way I can think of for this mechanic is to have fold_storage() support folding over any loop level, and to have async() insert the semaphore synchronization over folded storage. But that would totally break the current implementation of this PR.

vksnk (Member Author):

I'm a little confused about why the previous example exists -- it seems like we're trying to show that you can use fold_storage for this purpose, but IIUC, it's an awkward tool for the job and ring_buffer is really a better choice? If so, are we doing the user a service by giving them a complex example of how not to do it?

That's a good point, but I felt it's important to have it for completeness. Also, this construct can be useful if you actually want a sliding window with async.

I do wonder if there is any use for ring_buffer() without async(), @vksnk?

It's not super useful right now, but it might be useful for some backends. I'm planning to use it for one of the DSPs, which has an async API for DMA (but doesn't really support threads that well).

A more intuitive/cleaner way I can think of for this mechanic is to have fold_storage() support folding over any loop level, and to have async() insert the semaphore synchronization over folded storage. But that would totally break the current implementation of this PR.

I totally agree, it's slightly annoying that fold_storage and the new directive overlap a lot. Actually, one of the initial ideas was to special case fold_storage to allow something like fold_storage(Var::outermost(), extent), which still sort of fits into existing fold_storage semantics (you can argue that whenever you do hoist_storage, you actually add a new outer dimension to storage with extent 1 by default and that would be a way to control its extent). It felt not very user-friendly, so it was decided to go with a separate directive.

@mcourteaux (Contributor), Dec 15, 2023:

I totally agree, it's slightly annoying that fold_storage and the new directive overlap a lot. Actually, one of the initial ideas was to special case fold_storage to allow something like fold_storage(Var::outermost(), extent), which still sort of fits into existing fold_storage semantics (you can argue that whenever you do hoist_storage, you actually add a new outer dimension to storage with extent 1 by default and that would be a way to control its extent).

Ah, then I don't understand Var::outermost(), I think. I thought it referred to the place before the outermost storage dimension, but you seem to suggest that it actually refers to the outermost loop that exists in the schedule for this Func. Meaning that if there is a compute_at() that removes a bunch of outer loops, it will actually represent a loop level just outside the outermost one that exists, due to the compute_at().

It felt not very user-friendly, so it was decided to go with a separate directive.

That might have been a good use case for syntactic sugar perhaps?

Func &Func::ring_buffer(const Expr &extent) {
    assert(some_compute_at_is_specified);
    fold_storage(Var::outermost(), extent);
    return *this;
}

vksnk (Member Author):

As long as we are okay with the Func &Func::ring_buffer(const Expr &extent); API, I think we can think it through really thoroughly and, if possible, reconcile it with fold_storage later.

@vksnk (Member Author) commented Dec 14, 2023

Thanks a lot for the review! I think I addressed all of the comments, please take another look.

// and corresponding Realize node.
int loop_index = hoist_storage_loop_index[op->name] + 1;
Expr current_index = Variable::make(Int(32), loops[loop_index].name);
while (++loop_index < (int)loops.size()) {
Contributor:

dumb q: have we tested this on buffers with nonzero mins to be sure it is all good?

vksnk (Member Author):

I didn't test it, but this only affects intermediate functions and their allocations, so I think buffers with nonzero mins should be ok.

(loops[loop_index].extent - loops[loop_index].min) +
Variable::make(Int(32), loops[loop_index].name);
}
current_index = current_index % f.schedule().ring_buffer();
Contributor:

Is this guaranteed to be > 0?

vksnk (Member Author):

This is guaranteed to be >= 0, which is what we want.

Contributor:

expr % 0 will cause a divide by zero error, which is probably not what we want.

Member:

This is an Expr, so % 0 will give 0.
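To make the index computation above concrete, here is a small hypothetical sketch (plain C++, not Halide source) of how the linearized count of the loops between the hoisted storage and the compute site, taken modulo the ring-buffer extent, cycles through slots. The loop extents and the ring-buffer extent of 3 are made-up values:

#include <cstdio>

int main() {
    const int ring_extent = 3;               // stand-in for f.schedule().ring_buffer()
    const int yo_extent = 2, xo_extent = 4;  // made-up extents of the loops below hoist_storage

    for (int yo = 0; yo < yo_extent; yo++) {
        for (int xo = 0; xo < xo_extent; xo++) {
            // Linearize the loop variables between the hoist_storage level and the
            // compute level, then wrap by the ring-buffer extent to pick a slot.
            int linear = yo * xo_extent + xo;
            int slot = linear % ring_extent;
            std::printf("yo=%d xo=%d -> slot %d\n", yo, xo, slot);
        }
    }
    return 0;
}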

@vksnk (Member Author) commented Dec 19, 2023

All comments are resolved, going to merge in.

@vksnk merged commit 61b8d38 into main on Dec 19, 2023
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* Half-plumbed

* Revert "Half-plumbed"

This reverts commit eb9dd02.

* Interface for double buffer

* Update Provides, Calls and Realizes for double buffering

* Proper sync for double buffering

* Use proper name for the semaphore and use correct initial value

* Rename the class

* Pass expression for index

* Adds storage for double buffering index

* Use a separate index to go through the double buffer

* Failing test

* Better handling of hoisted storage in all of the async-related passes

* New test and clean-up the generated IR

* More tests

* Allow double buffering without async and add corresponding test

* Filter out incorrect double_buffer schedules

* Add tests to the cmake files

* Clean up

* Update the comment

* Clean up

* Clean up

* Update serialization

* complete_x86_target() should enable F16C and FMA when AVX2 is present (halide#7971)

All known AVX2-enabled architectures definitely have these features.

* Add two new tail strategies for update definitions (halide#7949)

* Add two new tail strategies for update definitions

* Stop printing asm

* Update expected number of partitions for Partition::Always

* Add a comment explaining why the blend safety check is per dimension

* Add serialization support for the new tail strategies

* trigger buildbots

* Add comment

---------

Co-authored-by: Steven Johnson <[email protected]>

* Add appropriate mattrs for arm-32 extensions (halide#7978)

* Add appropriate mattrs for arm-32 extensions

Fixes halide#7976

* Pull clauses out of if

* Move canonical version numbers into source, not build system (halide#7980) (halide#7981)

* Move canonical version numbers into source, not build system (halide#7980)

* Fixes

* Silence useless "Insufficient parallelism" autoscheduler warning (halide#7990)

* Add a notebook with a visualization of the approx_* functions and their errors (halide#7974)

* Add a notebook with a visualization of the approx_* functions and their errors

* Fix spelling error

* Make narrowing float->int casts on wasm go via wider ints (halide#7973)

Fixes halide#7972

* Fix handling of assert statements whose conditions get vectorized (halide#7989)

* Fix handling of assert statements whose conditions get vectorized

* Fix test name

* Fix all "unscheduled update()" warnings in our code (halide#7991)

* Fix all "unscheduled update()" warnings in our code

And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose.

* fixes

* fixes

* Update recursive_box_filters.cpp

* Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (halide#7992)

Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler

* Add a tutorial for async and double_buffer

* Renamed double_buffer to ring_buffer

* ring_buffer() now expects an extent Expr

* Actually use extent for ring_buffer()

* Address some of the comments

* Provide an example of the code structure for producer-consumer async example

* Comments updates

* Fix clang-format and clang-tidy

* Add Python binding for Func::ring_buffer()

* Don't use a separate index for ring buffer + add a new test

* Rename the tests

* Clean up the old name

* Add &

* Move test to the right folder

* Move expr

* Add comments for InjectRingBuffering

* Improve ring_buffer doc

* Fix comments

* Comments

* A better error message

* Mention that extent is expected to be a positive integer

* Add another code structure and explain how the indices for ring buffer are computed

* Expand test comments

* Fix spelling

---------

Co-authored-by: Steven Johnson <[email protected]>
Co-authored-by: Andrew Adams <[email protected]>