-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Fix Vulkan interleave two vectors bug + Vector Legalization lowering pass #8629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| "opencl.dll", | ||
| #else | ||
| "libOpenCL.so", | ||
| "libOpenCL.so.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Driveby change. Still not merged in any other PR. Fixes #8569.
| case VK_ERROR_FRAGMENTED_POOL: | ||
| return "VK_ERROR_FRAGMENTED_POOL"; | ||
| case VK_ERROR_UNKNOWN: | ||
| return "VK_ERROR_UNKNOWN"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More indicative than <Unknown Vulkan Error> in the default branch below.
|
Regarding the build failure: some simplification rules in I see two possible solutions:
|
|
I think the gpu backends need to handle arbitrary vector sizes. The rest of the compiler is free to make vectors of any size. A shared legalization pass for non-llvm backends that maps from Halide IR to Halide IR with narrower vectors might work, but seems a little tricky to get right for things like vectorreduce nodes. The other approach would be just changing how all ops are printed to handle small bundles of values, but this seems even nastier. Basically I agree with your option 2. |
|
I'd consider this to be greatly out of scope of this bugfix PR. I guess I'll skip that test for now and make an issue? Looking at the IR that triggers the error: let t397 = ramp(.thread_id_x + g$8.s0.x.v17.base.s, g$8.extent.0, 4)
let t398 = concat_vectors(f0$8[t397], f1$8[t397])
let t404 = extract_element(t398, 1)
let t405 = extract_element(t398, 0)which came from this: g$8(g$8.s0.x) = let t374 = slice_vectors(
concat_vectors(f0$8(g$8.s0.x, 0), f0$8(g$8.s0.x, 1), f0$8(g$8.s0.x, 2), f0$8(g$8.s0.x, 3)),
concat_vectors(f1$8(g$8.s0.x, 4), f1$8(g$8.s0.x, 5), f1$8(g$8.s0.x, 6), f1$8(g$8.s0.x, 7)),
1, -1, 2)
in (let t375 = (t374*t374)
in (extract_element(t375, 0) + extract_element(t375, 1))
)It seems that the simplifier rules have done a good job simplifying it, but there is some simplifications missing, or the simplifier rules have gotten stuck in a local minimum. The two ramped loads of size 4 get concatenated into a vector of size 8, to then just take elements 0 and 1 out of it. Very inefficient, compared to just doing two loads (or one ramped load). Of course, this is due to the unnatural way of constructing it with all these explicit shuffles in the test, but perhaps, having Halide simplify this further might be achievable for this PR? @abadams Any ideas on improving the codegen for this? |
|
Aaarrrghghhhh llvm/llvm-project@735209c
Clearly doesn't work for us here... 😢 |
|
n.b. - the "fixes #NN" magic belongs on a single line (one per line if multiple) in either the PR description or the final commit description (in the GitHub interface, which by default concatenates all the intermediate commit messages). It only clutters the PR title. |
b90ae86 to
3b6f14d
Compare
|
Specifically asking for review from @abadams as I uncovered another bug, but this time in the Deinterleaver transformation. This transformation is implemented as a GraphIRMutator, but is not supposed to recurse fully, because it's goal is to produce lane-extracted Exprs from other Exprs, such as let t99 = f0[ramp(0, 1, 8)]
let t100 = shuffle(t99, 0, 1, 2, 3)
let t101 = shuffle(t100, 0, 1)When the deinterleaver extracts So, in my opinion, I don't think this Deinterleaver should ever recurse into shuffle arguments. But as I didn't write this transformation, nor do I know fully what it's purpose is, I am hesitant just deleting the recursion there: Lines 389 to 403 in 85a3b07
I don't understand what the purpose of is of trying to extract odd and even lanes from a shuffle argument if the shuffle is actually just an element extraction. |
3b6f14d to
cf6312e
Compare
|
@derek-gerstmann already reviewed the Vulkan interleave codegen. That part didn't change. This involved the changes in:
I worked together with @abadams on a Shuffle simplification bug in It'd be great if @abadams could look at the following for review:
|
src/LegalizeVectors.cpp
Outdated
| return name + ".lanes_" + std::to_string(lane_start) + "_" + std::to_string(lane_start + lane_count - 1); | ||
| } | ||
|
|
||
| Expr simplify_shuffle(const Shuffle *op) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this here rather than in the simplifier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I can't call the simplifier on the shuffle, and expect it to only touch the shuffle. I can only do simplify(...) which runs ALL of the simplifier logic. It's a bit pitty/unintuitive that Simplify_Shuffle.cpp is not accessible as is.
Also, I wasn't too sure I could add that to the general simplifier code either. I could try to merge the two procedures. Perhaps other places benefit from these simplifier rules too then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try this tomorrow. It indeed is late.
| // user_error << "Cannot legalize vectors when tracing is enabled."; | ||
| auto event = as_const_int(op->args[6]); | ||
| internal_assert(event); | ||
| if (*event == halide_trace_load || *event == halide_trace_store) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's a good idea to preserve only trace loads and trace stores, because those are supposed to be nested in other tracing events. Or is the idea that those other events won't see this mutator, because they're scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test suite didn't show these to be nested anywhere. They are surrounded by other trace events, such as begin and end of a Func. AFAIK, they weren't nested. To be transparent: I have never ever used the tracing features. I was just looking at IR before and after legalization, to make sure it all seemed reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, by nested I meant they should execute after a begin_realization event (or whatever it's called), and before an end_realization, so it would be bad to drop those outer events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, I think they never are processed here. The begin_realization and end_realization trace calls are never involved in vectorized expressions. So this ExtractLanes mutator will never be ran on those IR nodes. Perhaps I should turn this into an internal_assert() to validate my idea.
| } | ||
| }; | ||
|
|
||
| class ExtractLanes : public IRMutator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different to the deinterleave function in Deinterleave.cpp? Should they be unified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, perhaps. I think I didn't understand what Deinterleave was doing. And I'm not too sure I do now. Deinterleaver and Interleaver are doing a weird dance together which I didn't understand either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm comparing Deinterleaver and ExtractLanes. Can you have a look at their Load visitors? The alignment gets dropped if the starting lane is not 0. I don't understand why we wouldn't simply update the alignment, like I did in the ExtractLanes version. Is this an oversight in the Deinterleaver, or am I not understanding the rationale behind dropping it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alignment is the alignment of the first lane. I think your logic is only correct if the load is of a ramp with stride 1. If it's some gather of some complex expression, that's not the right way to update it. In the cases we can safely update it, the simplifier can reinfer it very easily, so I thought it best to just leave it to the next simplifier pass.
| } | ||
|
|
||
| Expr visit(const Shuffle *op) override { | ||
| vector<int> new_indices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail one of the new clang-tidy checks, so add new_indices.reserve(lane_count)
| just_in_let_definition = false; | ||
| Stmt mutated = IRMutator::mutate(s); | ||
| for (auto &let : reverse_view(lets)) { | ||
| // There is no recurse into let.second. This is handled by repeatedly calling this tranform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tranform -> transform in the comment
|
I have no major concerns with this - just a few nits. It needs a merge with main though. Sorry for taking so long, I forgot to re-add this to my TODO list after the hexagon fix. |
| ApplySplit.h | ||
| Argument.h | ||
| AssociativeOpsTable.h | ||
| Associativity.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't belong. You seem to have done a case-insensitive sort, when a case sensitive one had been previously used. Please fix this while merging with main.
1cf353d to
78f2007
Compare
|
Failures are due to the issue I describe in: llvm/llvm-project#158426 A stack-allocated LLVM class got really big, so now there are stack overflows inside LLVM when compiling for arm-32 |
src/LegalizeVectors.cpp
Outdated
| for (auto &&vec : op->vectors) { | ||
| if (vec.type().lanes() > max_lanes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for (auto &&vec : op->vectors) { | |
| if (vec.type().lanes() > max_lanes) { | |
| for (const auto &vec : op->vectors) { | |
| if (vec.type().lanes() > max_lanes) { |
Does this not satisfy clang-tidy? Or even this?
| for (auto &&vec : op->vectors) { | |
| if (vec.type().lanes() > max_lanes) { | |
| for (const Expr &vec : op->vectors) { | |
| if (vec.type().lanes() > max_lanes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it would. This was not the case I care about in particular. I was reading about what the use is of auto&& and why clang-tidy sometimes suggests with one & and sometimes with &&. Turns out that iterating over a std::vector<bool> requires auto&& if you wish to adjust the value of the stored bool. The reason being that vector is specialized to store bools as one bit, meaning that the iterator gives you a wrapper struct std::vector<bool>::reference with overloaded assignment operators to be able to overwrite the bit. So for (bool &val : vector_of_bools) doesn't compile: you have to auto&&, because for some weird reason which I didn't understand, also auto& wouldn't compile: https://godbolt.org/z/4os7TqMWo
7823407 to
14c4ed2
Compare
fa6fef6 to
8773e75
Compare
…ix a bug in Deinterleave.
… limited vector lanes.
…t. Other feedback: typos, and clarifications.
…ual Simplifier. Adjust Hexagon simd op tests, as the Simplifier now does optimize away some shuffles. Bugfix in Hexagon shuffle_vector() logic. Co-authored-by: Andrew Adams <[email protected]>
8773e75 to
73d9d3e
Compare
🐛 Bugfixes:
Deinterleaverecursed incorrectly into Shuffle arguments.CodeGen_Hexagon::shuffle_vectors()had a bug rewriting shuffles of shuffles incorrectly.⭐ New feature: Wrote a vector legalization lowering pass that comes near the end of the lowering. For loop Device APIs determine the maximal lane count for the expressions inside that for. Shuffles get lifted into their own variable, such that splitting into groups of lanes is done without recalculations.
internal_errors:VectorReducewith output lanes > 1.Reinterpretwith input/output having different number of bits per element.error/metal_vector_too_largeis converted intocorrectness/metal_long_vectorsas this is now supported.🧹 Cleanup errors: no newline needed for
HeapPrinter, and helper macrovk_report_errorto print error codes. This trailing newline is pretty much everywhere in the codebase with a 50% probability forinternal_assertandinternal_error. This could use a more broad cleanup.Fixes #8628 (see for details): use
OpVectorShuffleinstead ofOpCompositeInsert.