Scheduling directive to support ring buffering #7967
Conversation
This is definitely gonna need a tutorial :-)
Amazing update! A few minor comments, mainly about leftover traces of the "double buffer" naming that weren't replaced by "ring_buffer".
Absolutely beautiful work; comments are mostly nits, except:
We should add Python bindings (and at least a basic test using them) before landing this, but that's trivial (and I can contribute it directly to this PR if you like, just LMK)
auto it = env.find(op->name);
internal_assert(it != env.end());
Function f = it->second;
if (f.schedule().async() && hoisted_storages.count(op->name) == 0) {
nit: some clang-tidy variants will always want to prefer .empty() to count() == 0.
hoisted_storages is a std::set, and count() is basically used to check whether the element is in the set. I am not sure how empty() could be used here instead.
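(Side note, not from the thread: a minimal sketch of the membership test being discussed, with illustrative names; with C++20, clang-tidy's readability-container-contains check would suggest contains() here rather than count().)

#include <set>
#include <string>

// Illustrative stand-in for the pass state: the set of Funcs whose storage has been hoisted.
std::set<std::string> hoisted_storages;

bool storage_is_hoisted(const std::string &name) {
    // Membership test as written in the PR (pre-C++20 idiom):
    return hoisted_storages.count(name) > 0;
    // C++20 spelling that readability-container-contains would suggest:
    // return hoisted_storages.contains(name);
}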
test/correctness/double_buffer.cpp
consumer
    .compute_root()
    .tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::RoundUp);
producer
Suggestion: even though this is just a test, it's worth adding some comments about what this schedule is doing and how the storage schedules interact.
I've made test descriptions a bit more concrete.
test/correctness/double_buffer.cpp
});
}

// Two non-async double-buffered producers and two consumers.
Have we tested this under TSAN yet?
consumer.compute_root()
    .tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::Auto);

producer
Suggestion: it would be useful to show what the storage footprint and execution time of each schedule are.
I've added a couple of notes across the tutorial that storage is going to be doubled.
consumer.realize({128, 128, 4});
}

{
I'm a little confused about why the previous example exists -- it seems like we're trying to show that you can use fold_storage for this purpose, but IIUC, it's an awkward tool for the job and ring_buffer is really a better choice? If so, are we doing the user a service by giving them a complex example of how not to do it?
So, the short version is for two reasons:

1. fold_storage() only works over storage dimensions of the Func, not arbitrary loop levels; ring_buffer() can be used over any loop (not only those that explicitly go over a storage dimension). In general, I feel like this is an awkward constraint.
2. ring_buffer() not only ring-buffers the storage, but also injects the semaphore logic to make sure that no race condition happens when combining it with async().

I do wonder if there is any use for ring_buffer() without async(), @vksnk?

A more intuitive / clean way I can think of for this mechanic is to have fold_storage() support folding over any loop level, and to have async() insert the semaphore synchronization over folded storage. But that would totally break the current implementation of this PR.
> I'm a little confused about why the previous example exists -- it seems like we're trying to show that you can use fold_storage for this purpose, but IIUC, it's an awkward tool for the job and ring_buffer is really a better choice? If so, are we doing the user a service by giving them a complex example of how not to do it?
That's a good point, but I felt it's important to have it for completeness. Also, this construct can be useful if you actually want a sliding window with async.
> I do wonder if there is any use for ring_buffer() without async(), @vksnk?
It's not super useful right now, but it might be useful for some backends. I'm planning to use it for one of the DSPs, which has an async API for DMA (but doesn't really support threads that well).
> A more intuitive / clean way I can think of for this mechanic is to have fold_storage() support folding over any loop level, and to have async() insert the semaphore synchronization over folded storage. But that would totally break the current implementation of this PR.
I totally agree, it's slightly annoying that fold_storage and the new directive overlap a lot. Actually, one of the initial ideas was to special-case fold_storage to allow something like fold_storage(Var::outermost(), extent), which still sort of fits into the existing fold_storage semantics (you can argue that whenever you do hoist_storage, you actually add a new outer dimension to storage with extent 1 by default, and that would be a way to control its extent). It felt not very user-friendly, so it was decided to go with a separate directive.
> I totally agree, it's slightly annoying that fold_storage and the new directive overlap a lot. Actually, one of the initial ideas was to special-case fold_storage to allow something like fold_storage(Var::outermost(), extent), which still sort of fits into the existing fold_storage semantics (you can argue that whenever you do hoist_storage, you actually add a new outer dimension to storage with extent 1 by default, and that would be a way to control its extent).
Ah, then I don't understand Var::outermost(), I think. I thought it referred to the place before the outermost storage dimension, but you seem to suggest that it actually refers to the outermost loop that exists in the schedule for this Func. Meaning that if there is a compute_at() that removes a bunch of outermost loops, it will actually represent a loop level just outside the outermost one that exists, due to the compute_at().
> It felt not very user-friendly, so it was decided to go with a separate directive.
That might have been a good use case for syntactic sugar perhaps?
Func &Func::ring_buffer(const Expr &extent) {
    assert(some_compute_at_is_specified);
    fold_storage(Var::outermost(), extent);
    return *this;
}
As long as we are okay with the Func &Func::ring_buffer(const Expr &extent); API, I think we can think it through really thoroughly and, if possible, reconcile it with fold_storage later.
Thanks a lot for the review! I think I addressed all of the comments, please take another look.
// and corresponding Realize node.
int loop_index = hoist_storage_loop_index[op->name] + 1;
Expr current_index = Variable::make(Int(32), loops[loop_index].name);
while (++loop_index < (int)loops.size()) {
dumb q: have we tested this on buffers with nonzero mins to be sure it is all good?
I didn't test it, but this only affects intermediate functions and their allocations, so I think buffers with nonzero mins should be ok.
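(For reference, a minimal sketch of what such a check could look like, assuming JIT realization into buffers with nonzero mins; the pipeline, names, and hoist level are illustrative and not taken from the PR's tests.)

#include "Halide.h"
using namespace Halide;

int main() {
    // Input and output buffers whose mins are nonzero, to exercise that path.
    Buffer<int> input(32, 32);
    input.set_min(100, 7);
    input.fill(1);

    Var x("x"), y("y");
    Func producer("producer"), consumer("consumer");
    producer(x, y) = input(x, y) + 1;
    consumer(x, y) = producer(x, y) * 2;

    consumer.compute_root();
    producer.compute_at(consumer, y)
        .hoist_storage(LoopLevel::root())
        .ring_buffer(2)
        .async();

    Buffer<int> output(32, 32);
    output.set_min(100, 7);
    consumer.realize(output);
    return 0;
}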
(loops[loop_index].extent - loops[loop_index].min) +
Variable::make(Int(32), loops[loop_index].name);
}
current_index = current_index % f.schedule().ring_buffer();
Is this guaranteed to be > 0?
This is guaranteed to be >= 0, which is what we want.
expr % 0 will cause a divide-by-zero error, which is probably not what we want.
This is an Expr, so % 0 will give 0.
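(A small sketch of that semantics, relying on Halide defining integer division and modulo by zero to evaluate to zero; the zero divisor here is purely illustrative, since ring_buffer() requires a positive extent anyway.)

#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x");
    Param<int> extent("extent");  // stand-in for a ring buffer extent
    Func f("f");
    // Halide defines x % 0 (and x / 0) as 0 rather than faulting,
    // so a zero divisor produces zeros instead of a divide-by-zero error.
    f(x) = x % extent;
    extent.set(0);
    Buffer<int> out = f.realize({4});  // out(i) == 0 for all i
    return 0;
}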
All comments are resolved, going to merge in.
* Half-plumbed
* Revert "Half-plumbed"
  This reverts commit eb9dd02.
* Interface for double buffer
* Update Provides, Calls and Realizes for double buffering
* Proper sync for double buffering
* Use proper name for the semaphor and use correct initial value
* Rename the class
* Pass expression for index
* Adds storage for double buffering index
* Use a separate index to go through the double buffer
* Failing test
* Better handling of hoisted storage in all of the async-related passes
* New test and clean-up the generated IR
* More tests
* Allow double buffering without async and add corresponding test
* Filter out incorrect double_buffer schedules
* Add tests to the cmake files
* Clean up
* Update the comment
* Clean up
* Clean up
* Update serialization
* complete_x86_target() should enable F16C and FMA when AVX2 is present (halide#7971)
  All known AVX2-enabled architectures definitely have these features.
* Add two new tail strategies for update definitions (halide#7949)
* Add two new tail strategies for update definitions
* Stop printing asm
* Update expected number of partitions for Partition::Always
* Add a comment explaining why the blend safety check is per dimension
* Add serialization support for the new tail strategies
* trigger buildbots
* Add comment
  ---------
  Co-authored-by: Steven Johnson <[email protected]>
* Add appropriate mattrs for arm-32 extensions (halide#7978)
* Add appropriate mattrs for arm-32 extensions
  Fixes halide#7976
* Pull clauses out of if
* Move canonical version numbers into source, not build system (halide#7980) (halide#7981)
* Move canonical version numbers into source, not build system (halide#7980)
* Fixes
* Silence useless "Insufficient parallelism" autoscheduler warning (halide#7990)
* Add a notebook with a visualization of the aprrox_* functions and their errors (halide#7974)
* Add a notebook with a visualization of the aprrox_* functions and their errors
* Fix spelling error
* Make narrowing float->int casts on wasm go via wider ints (halide#7973)
  Fixes halide#7972
* Fix handling of assert statements whose conditions get vectorized (halide#7989)
* Fix handling of assert statements whose conditions get vectorized
* Fix test name
* Fix all "unscheduled update()" warnings in our code (halide#7991)
* Fix all "unscheduled update()" warnings in our code
  And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose.
* fixes
* fixes
* Update recursive_box_filters.cpp
* Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (halide#7992)
  Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler
* Add a tutorial for async and double_buffer
* Renamed double_buffer to ring_buffer
* ring_buffer() now expects an extent Expr
* Actually use extent for ring_buffer()
* Address some of the comments
* Provide an example of the code structure for producer-consumer async example
* Comments updates
* Fix clang-format and clang-tidy
* Add Python binding for Func::ring_buffer()
* Don't use a separate index for ring buffer + add a new test
* Rename the tests
* Clean up the old name
* Add &
* Move test to the right folder
* Move expr
* Add comments for InjectRingBuffering
* Improve ring_buffer doc
* Fix comments
* Comments
* A better error message
* Mention that extent is expected to be a positive integer
* Add another code structure and explain how the indices for ring buffer are computed
* Expand test comments
* Fix spelling

---------

Co-authored-by: Steven Johnson <[email protected]>
Co-authored-by: Andrew Adams <[email protected]>
This PR adds a new scheduling directive which allows double buffering to be expressed explicitly.

The simple motivating example is a pipeline of two functions A and B, where B depends on A. One common computational pattern is to process tile tile_index of A, then start processing tile tile_index of B while simultaneously starting tile tile_index + 1 of A, overlapping the computation of A and B in time. It's currently possible to express some subset of such schedules using fold_storage and async, but that is tied to the storage folding logic and it's not possible to double buffer at an arbitrary loop level.

The new double_buffer directive can be used together with hoist_storage to express the pipeline above, along the lines of the sketch below.
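Roughly, such a schedule might look like the following (Func names, tile sizes, and the hoist level are illustrative, and the directive was later renamed ring_buffer during review):

#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    Func A("A"), B("B");
    A(x, y) = x + y;
    B(x, y) = A(x, y) * 2;

    // The consumer B is computed in 16x16 tiles.
    B.compute_root()
        .tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::RoundUp);

    // The producer A is computed per tile of B, but its storage is hoisted
    // outside the tile loops, doubled via ring_buffer(2), and computed
    // asynchronously, so tile t+1 of A can be produced while tile t of B
    // is still being consumed.
    A.compute_at(B, xo)
        .hoist_storage(LoopLevel::root())
        .ring_buffer(2)
        .async();

    Buffer<int> out = B.realize({128, 128});
    return 0;
}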
At a high level, the schedule moves the storage of A outside of the loops over tiles (in this case there are two loops over tiles, xo and yo), expands it by 2x by adding a new dimension to the corresponding Realize node, and adds the corresponding synchronization (similar to what fold_storage would do) to make sure it's safe with async and A can run ahead by one tile.