Fix Vulkan interleave two vectors bug + Vector Legalization lowering pass #8629

mcourteaux · 2025-05-20T11:05:45Z

🐛 Bugfixes:

A shuffle bug in the SPIR-V codegen, introduced 2 years ago in 4d86539, is fixed here.
Deinterleave recursed incorrectly into Shuffle arguments.
CodeGen_Hexagon::shuffle_vectors() had a bug rewriting shuffles of shuffles incorrectly.

⭐ New feature: A vector legalization lowering pass that runs near the end of lowering. The Device API of each For loop determines the maximal lane count for the expressions inside that loop. Shuffles get lifted into their own variables, so that splitting into groups of lanes is done without recomputation.

There are a few unsupported scenarios, which are reported as internal_errors:
1. VectorReduce with output lanes > 1.
2. Reinterpret with input/output having different number of bits per element.
This lowering pass was stress tested by running all tests in the test suite with artificial vector lane limit of 4 (this artificial limit is disabled again in this PR).
The test error/metal_vector_too_large is converted into correctness/metal_long_vectors as this is now supported.
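
To make the splitting into groups of lanes concrete, here is a minimal standalone sketch, not the code in LegalizeVectors.cpp. The `LaneGroup` struct and `split_into_lane_groups` function are hypothetical names; only the `.lanes_<first>_<last>` naming follows the helper quoted in the review comments of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one slice of a wide vector.
struct LaneGroup {
    int lane_start;
    int lane_count;
    std::string name;
};

// Split `total_lanes` into consecutive groups of at most `max_lanes` lanes.
// The last group may be smaller when max_lanes does not divide total_lanes.
std::vector<LaneGroup> split_into_lane_groups(const std::string &name, int total_lanes, int max_lanes) {
    assert(total_lanes > 0 && max_lanes > 0);
    std::vector<LaneGroup> groups;
    groups.reserve((total_lanes + max_lanes - 1) / max_lanes);
    for (int start = 0; start < total_lanes; start += max_lanes) {
        int count = std::min(max_lanes, total_lanes - start);
        groups.push_back({start, count,
                          name + ".lanes_" + std::to_string(start) + "_" + std::to_string(start + count - 1)});
    }
    return groups;
}
```

With a limit of 4 lanes, a 10-lane value named `t` would be split into `t.lanes_0_3`, `t.lanes_4_7` and `t.lanes_8_9`.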

🧹 Cleanup errors: no newline needed for HeapPrinter, and helper macro vk_report_error to print error codes. This trailing newline is pretty much everywhere in the codebase with a 50% probability for internal_assert and internal_error. This could use a broader cleanup.

Fixes #8628 (see for details): use OpVectorShuffle instead of OpCompositeInsert.
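
For context on the OpVectorShuffle fix: the instruction picks each result component by an index into the concatenation of its two operand vectors, so interleaving two n-lane vectors corresponds to the index list 0, n, 1, n+1, and so on. The sketch below only computes that index list. The function name is hypothetical and this is not the code in CodeGen_Vulkan_Dev.cpp.

```cpp
#include <cassert>
#include <vector>

// Component indices that interleave two vectors of `lanes` lanes each,
// expressed as indices into the concatenation (first vector, second vector).
std::vector<int> interleave_two_indices(int lanes) {
    assert(lanes > 0);
    std::vector<int> indices;
    indices.reserve(2 * lanes);
    for (int i = 0; i < lanes; i++) {
        indices.push_back(i);          // lane i of the first vector
        indices.push_back(lanes + i);  // lane i of the second vector
    }
    return indices;
}
```

For two 4-lane vectors this gives 0, 4, 1, 5, 2, 6, 3, 7.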

mcourteaux · 2025-05-20T11:10:20Z

src/runtime/opencl.cpp

        "opencl.dll",
 #else
-        "libOpenCL.so",
+        "libOpenCL.so.1",


Driveby change. Still not merged in any other PR. Fixes #8569.

src/CodeGen_Vulkan_Dev.cpp

mcourteaux · 2025-05-20T11:16:33Z

src/runtime/vulkan_internal.h

    case VK_ERROR_FRAGMENTED_POOL:
        return "VK_ERROR_FRAGMENTED_POOL";
+    case VK_ERROR_UNKNOWN:
+        return "VK_ERROR_UNKNOWN";


More indicative than <Unknown Vulkan Error> in the default branch below.

mcourteaux · 2025-05-20T17:22:36Z

Regarding the build failure: some simplification rules in Simplify_Shuffle.cpp rewrite shuffles and thereby exceed the maximal vector length on the GPU backend (triggered by my randomized shuffle indices test). I hate this, as my PR was supposed to be simple, and now I'm uncovering more bugs than I signed up for 😝

I see two possible solutions:

1. Limit the rules to never exceed the maximal vector length per backend. I don't like this for several reasons:
   - These simplification rules are embedded in the general simplify() call, which can happen at any point, and is not yet aware of selected GPU APIs.
   - This makes the simplification rules restricted, and might limit finding better simplifications if the internal rewriting logic cannot exceed some arbitrary vector size.
2. The GPU backends need to handle arbitrary vector size expressions, just like the LLVM backend does. You can .vectorize(x, 17) and it just works. This might not be easy to achieve, but I feel like this is the better solution, as it keeps the simplifier as is, and brings the scheduling possibilities in line with the LLVM-based codegen backends. This might be one shared lowering pass for GPU backends with a max-vector-size parameter that rewrites these expressions again. Not sure if that's easy to do...

Opinions @abadams @zvookin ?

abadams · 2025-05-20T19:41:52Z

I think the gpu backends need to handle arbitrary vector sizes. The rest of the compiler is free to make vectors of any size. A shared legalization pass for non-llvm backends that maps from Halide IR to Halide IR with narrower vectors might work, but seems a little tricky to get right for things like vectorreduce nodes. The other approach would be just changing how all ops are printed to handle small bundles of values, but this seems even nastier.

Basically I agree with your option 2.

mcourteaux · 2025-05-22T09:07:13Z

I'd consider this to be greatly out of scope of this bugfix PR. I guess I'll skip that test for now and make an issue?

Looking at the IR that triggers the error:

let t397 = ramp(.thread_id_x + g$8.s0.x.v17.base.s, g$8.extent.0, 4)
let t398 = concat_vectors(f0$8[t397], f1$8[t397])
let t404 = extract_element(t398, 1)
let t405 = extract_element(t398, 0)

which came from this:

g$8(g$8.s0.x) = let t374 = slice_vectors(
    concat_vectors(f0$8(g$8.s0.x, 0), f0$8(g$8.s0.x, 1), f0$8(g$8.s0.x, 2), f0$8(g$8.s0.x, 3)), 
    concat_vectors(f1$8(g$8.s0.x, 4), f1$8(g$8.s0.x, 5), f1$8(g$8.s0.x, 6), f1$8(g$8.s0.x, 7)),
    1, -1, 2)
  in (let t375 = (t374*t374)
     in (extract_element(t375, 0) + extract_element(t375, 1))
  )

It seems that the simplifier rules have done a good job simplifying it, but some simplifications are missing, or the simplifier rules have gotten stuck in a local minimum. The two ramped loads of size 4 get concatenated into a vector of size 8, only to then take elements 0 and 1 out of it. That is very inefficient compared to just doing two loads (or one ramped load). Of course, this is due to the unnatural way of constructing it with all these explicit shuffles in the test, but perhaps having Halide simplify this further might be achievable for this PR? @abadams Any ideas on improving the codegen for this?
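
The missing simplification amounts to working out which input of a concat_vectors each extracted lane comes from. A minimal sketch of that bookkeeping, using plain lane counts instead of Halide Exprs (the `ConcatLane` struct and `resolve_concat_lane` are hypothetical names, not Halide API):

```cpp
#include <cassert>
#include <vector>

// Which input vector a lane of a concatenation comes from, and which lane of it.
struct ConcatLane {
    int vector_index;
    int lane;
};

// Map an index into concat(v0, v1, ...) back to (input vector, lane),
// given only the lane count of each input.
ConcatLane resolve_concat_lane(const std::vector<int> &lanes_per_vector, int index) {
    assert(index >= 0);
    for (int v = 0; v < static_cast<int>(lanes_per_vector.size()); v++) {
        if (index < lanes_per_vector[v]) {
            return {v, index};
        }
        index -= lanes_per_vector[v];
    }
    assert(false && "index out of bounds");
    return {-1, -1};
}
```

For two 4-lane inputs, indices 0 and 1 both resolve to the first input, so as far as the four IR lines above show, the extracted elements only depend on the f0$8 load.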

mcourteaux · 2025-05-22T17:11:55Z

Aaarrrghghhhh llvm/llvm-project@735209c

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)

Clearly doesn't work for us here... 😢

alexreinking · 2025-05-22T18:13:43Z

n.b. - the "fixes #NN" magic belongs on a single line (one per line if multiple) in either the PR description or the final commit description (in the GitHub interface, which by default concatenates all the intermediate commit messages). It only clutters the PR title.

mcourteaux · 2025-05-23T14:38:52Z

Specifically asking for review from @abadams as I uncovered another bug, but this time in the Deinterleaver transformation.

This transformation is implemented as a GraphIRMutator, but is not supposed to recurse fully, because its goal is to produce lane-extracted Exprs from other Exprs, such as t340.odd_lanes. The bug in this instance was due to recursing too deeply:

let t99 = f0[ramp(0, 1, 8)]
let t100 = shuffle(t99, 0, 1, 2, 3)
let t101 = shuffle(t100, 0, 1)

When the deinterleaver extracts t101.even_lanes, it realizes that the resulting type has 1 lane (i.e., scalar). But then it recurses into t100, and incorrectly assumes t100 is of the same type as t101.

So, in my opinion, this Deinterleaver should never recurse into shuffle arguments. But as I didn't write this transformation, nor do I fully know what its purpose is, I am hesitant to just delete the recursion there:

Halide/src/Deinterleave.cpp

Lines 389 to 403 in 85a3b07

    
        // If this is extracting a single lane, try to recursively deinterleave rather
        // than leaving behind a shuffle.
        if (indices.size() == 1) {
            int index = indices.front();
            for (const auto &i : op->vectors) {
                if (index < i.type().lanes()) {
                    ScopedValue<int> lane(starting_lane, index);
                    return mutate(i);
                }
                index -= i.type().lanes();
            }
            internal_error << "extract_lane index out of bounds: " << Expr(op) << " " << index << "\n";
        }
        return Shuffle::make(op->vectors, indices);

I don't understand the purpose of trying to extract odd and even lanes from a shuffle argument if the shuffle is actually just an element extraction.
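
One way to see why the types matter here: resolving a lane of t101 through t100 into t99 is a composition of index lists, and each level has its own lane count. A minimal sketch on plain index vectors, restricted to shuffles with a single input vector (`compose_shuffle_indices` is a hypothetical helper, not Halide API):

```cpp
#include <cassert>
#include <vector>

// `inner` selects lanes of some source vector; `outer` selects lanes of the
// result of `inner`. The composition selects lanes of the source directly.
std::vector<int> compose_shuffle_indices(const std::vector<int> &inner, const std::vector<int> &outer) {
    std::vector<int> composed;
    composed.reserve(outer.size());
    for (int idx : outer) {
        assert(idx >= 0 && idx < static_cast<int>(inner.size()));
        composed.push_back(inner[idx]);
    }
    return composed;
}
```

In the example above, t100 uses indices 0, 1, 2, 3 on the 8-lane t99 and t101 uses indices 0, 1 on t100, so t101 is lanes 0, 1 of t99. The result always has as many lanes as `outer`, regardless of how wide the inner shuffle or its source is.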

mcourteaux · 2025-05-27T17:14:47Z

@derek-gerstmann already reviewed the Vulkan interleave codegen. That part didn't change. This involved the changes in:

CodeGen_Vulkan_Dev.cpp
vulkan_internal.cpp
vulkan_resources.cpp

I worked together with @abadams on a Shuffle simplification bug in Simplify_Shuffle.cpp, so that part is already "reviewed" (pair programmed, you could say).

It'd be great if @abadams could look at the following for review:

LegalizeVectors.cpp
The small change in CSE.cpp to not lift Calls if they are not pure (with the exception for CallType::Halide, which can be lifted, according to a test case).
My bugfix in Deinterleave.cpp

abadams · 2025-06-04T20:16:07Z

src/LegalizeVectors.cpp

+    return name + ".lanes_" + std::to_string(lane_start) + "_" + std::to_string(lane_start + lane_count - 1);
+}
+
+Expr simplify_shuffle(const Shuffle *op) {


Why is this here rather than in the simplifier?

Because I can't call the simplifier on the shuffle and expect it to only touch the shuffle. I can only do simplify(...), which runs ALL of the simplifier logic. It's a bit of a pity, and unintuitive, that Simplify_Shuffle.cpp is not accessible as is.

Also, I wasn't too sure I could add that to the general simplifier code either. I could try to merge the two procedures. Perhaps other places benefit from these simplifier rules too then.

I'll try this tomorrow. It indeed is late.

src/LegalizeVectors.cpp

abadams · 2025-06-04T20:22:04Z

src/LegalizeVectors.cpp

+        // user_error << "Cannot legalize vectors when tracing is enabled.";
+        auto event = as_const_int(op->args[6]);
+        internal_assert(event);
+        if (*event == halide_trace_load || *event == halide_trace_store) {


I'm not sure it's a good idea to preserve only trace loads and trace stores, because those are supposed to be nested in other tracing events. Or is the idea that those other events won't see this mutator, because they're scalar?

The test suite didn't show these to be nested anywhere. They are surrounded by other trace events, such as begin and end of a Func. AFAIK, they weren't nested. To be transparent: I have never ever used the tracing features. I was just looking at IR before and after legalization, to make sure it all seemed reasonable.

Sorry, by nested I meant they should execute after a begin_realization event (or whatever it's called), and before an end_realization, so it would be bad to drop those outer events.

IIRC, they are never processed here. The begin_realization and end_realization trace calls are never involved in vectorized expressions. So this ExtractLanes mutator will never be run on those IR nodes. Perhaps I should turn this into an internal_assert() to validate my idea.

abadams · 2025-06-04T20:23:41Z

src/LegalizeVectors.cpp

+    }
+};
+
+class ExtractLanes : public IRMutator {


How is this different to the deinterleave function in Deinterleave.cpp? Should they be unified?

Hmm, perhaps. I think I didn't understand what Deinterleave was doing. And I'm not too sure I do now. Deinterleaver and Interleaver are doing a weird dance together which I didn't understand either.

I'm comparing Deinterleaver and ExtractLanes. Can you have a look at their Load visitors? The alignment gets dropped if the starting lane is not 0. I don't understand why we wouldn't simply update the alignment, like I did in the ExtractLanes version. Is this an oversight in the Deinterleaver, or am I not understanding the rationale behind dropping it?

The alignment is the alignment of the first lane. I think your logic is only correct if the load is of a ramp with stride 1. If it's some gather of some complex expression, that's not the right way to update it. In the cases we can safely update it, the simplifier can reinfer it very easily, so I thought it best to just leave it to the next simplifier pass.
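
A small sketch of the arithmetic behind this point. It assumes the alignment is stored as a (modulus, remainder) pair describing the index of the first lane, and that the load index is a ramp with a known constant stride. `Alignment` and `shift_alignment` are illustrative names, not the Halide types.

```cpp
#include <cassert>
#include <cstdint>

// Assumed representation: index of lane 0 is congruent to `remainder` modulo `modulus`.
struct Alignment {
    int64_t modulus;
    int64_t remainder;
};

// Alignment of lane `starting_lane` of a load whose index is ramp(base, stride, n).
// Only valid for ramps: lane k has index base + k * stride.
Alignment shift_alignment(Alignment a, int64_t starting_lane, int64_t stride) {
    assert(a.modulus > 0 && starting_lane >= 0);
    int64_t r = (a.remainder + starting_lane * stride) % a.modulus;
    if (r < 0) {
        r += a.modulus;  // keep the remainder in [0, modulus) for negative strides
    }
    return {a.modulus, r};
}
```

For a stride-1 load aligned as (8, 0), extracting from lane 4 gives (8, 4). For a gather there is no stride to plug in, so nothing can be derived this way, which is consistent with dropping the alignment and letting the simplifier re-infer it.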

src/Simplify_Let.cpp

abadams · 2025-09-25T21:28:16Z

src/LegalizeVectors.cpp

+    }
+
+    Expr visit(const Shuffle *op) override {
+        vector<int> new_indices;


This will fail one of the new clang-tidy checks, so add new_indices.reserve(lane_count)
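
For illustration, a standalone sketch of the pattern being asked for: reserving capacity before filling a vector whose final size is known up front. The function below slices a range of shuffle indices. It is a hypothetical stand-in and not the actual body of ExtractLanes::visit.

```cpp
#include <cassert>
#include <vector>

// Copy `lane_count` indices starting at `lane_start` into a new vector.
std::vector<int> slice_indices(const std::vector<int> &indices, int lane_start, int lane_count) {
    assert(lane_start >= 0 && lane_count >= 0);
    assert(lane_start + lane_count <= static_cast<int>(indices.size()));
    std::vector<int> new_indices;
    new_indices.reserve(lane_count);  // one allocation instead of repeated growth
    for (int i = 0; i < lane_count; i++) {
        new_indices.push_back(indices[lane_start + i]);
    }
    return new_indices;
}
```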

abadams · 2025-09-25T21:29:42Z

src/LegalizeVectors.cpp

+        just_in_let_definition = false;
+        Stmt mutated = IRMutator::mutate(s);
+        for (auto &let : reverse_view(lets)) {
+            // There is no recurse into let.second. This is handled by repeatedly calling this tranform.


tranform -> transform in the comment

abadams · 2025-09-25T21:37:45Z

I have no major concerns with this - just a few nits. It needs a merge with main though.

Sorry for taking so long, I forgot to re-add this to my TODO list after the hexagon fix.

alexreinking · 2025-09-22T17:06:11Z

src/CMakeLists.txt

    ApplySplit.h
    Argument.h
    AssociativeOpsTable.h
    Associativity.h


This doesn't belong. You seem to have done a case-insensitive sort, when a case sensitive one had been previously used. Please fix this while merging with main.

abadams · 2025-09-26T21:52:58Z

Failures are due to the issue I describe in: llvm/llvm-project#158426

A stack-allocated LLVM class got really big, so now there are stack overflows inside LLVM when compiling for arm-32

alexreinking · 2025-09-29T02:19:47Z

src/LegalizeVectors.cpp

+        for (auto &&vec : op->vectors) {
+            if (vec.type().lanes() > max_lanes) {


Suggested change

-        for (auto &&vec : op->vectors) {
+        for (const auto &vec : op->vectors) {
             if (vec.type().lanes() > max_lanes) {

Does this not satisfy clang-tidy? Or even this?

Suggested change

-        for (auto &&vec : op->vectors) {
+        for (const Expr &vec : op->vectors) {
             if (vec.type().lanes() > max_lanes) {

Yeah it would. This was not the case I care about in particular. I was reading about what the use of auto&& is, and why clang-tidy sometimes suggests one & and sometimes &&. It turns out that iterating over a std::vector<bool> requires auto&& if you wish to modify the stored bool. The reason is that std::vector<bool> is specialized to store bools as single bits, so its iterator gives you a wrapper struct std::vector<bool>::reference with overloaded assignment operators to be able to overwrite the bit. So for (bool &val : vector_of_bools) doesn't compile: you have to use auto&&, because for some weird reason which I didn't understand, auto& doesn't compile either: https://godbolt.org/z/4os7TqMWo
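
A minimal sketch of the std::vector<bool> case described above. With the common standard library implementations, dereferencing the iterator returns the proxy object by value, i.e. a temporary. `auto &` deduces a non-const lvalue reference, which cannot bind to a temporary, whereas `auto &&` is a forwarding reference and can. The function name `flip_all` is made up for this example.

```cpp
#include <cassert>
#include <vector>

// Invert every element of a std::vector<bool> in place.
void flip_all(std::vector<bool> &bits) {
    // for (bool &b : bits)  would not compile: there is no bool object to bind to.
    // for (auto &b : bits)  would not compile: the proxy is a temporary.
    for (auto &&b : bits) {
        b = !b;  // assigns through std::vector<bool>::reference
    }
}
```

For read-only iteration, as in the loop over op->vectors, none of this applies and a const reference is enough.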


mcourteaux requested a review from derek-gerstmann May 20, 2025 11:05

mcourteaux added code_cleanup No functional changes. Reformatting, reorganizing, or refactoring existing code. gpu labels May 20, 2025

mcourteaux requested a review from halidebuildbots May 20, 2025 11:08

derek-gerstmann approved these changes May 20, 2025

mcourteaux added skip_buildbots Do not run buildbots on this PR. Must add before opening PR as we scan labels immediately. and removed skip_buildbots Do not run buildbots on this PR. Must add before opening PR as we scan labels immediately. labels May 22, 2025

alexreinking changed the title ~~Fix Vulkan interleave two vectors bug. Fixes #8628.~~ Fix Vulkan interleave two vectors bug May 22, 2025

mcourteaux mentioned this pull request May 22, 2025

Fix top LLVM: renamed NVPTX barrier intrinsics. #8631

Merged

mcourteaux force-pushed the fix-vulkan-interleave branch from b90ae86 to 3b6f14d Compare May 23, 2025 14:27

mcourteaux requested a review from abadams May 23, 2025 14:28

mcourteaux force-pushed the fix-vulkan-interleave branch from 3b6f14d to cf6312e Compare May 24, 2025 13:06

mcourteaux changed the title ~~Fix Vulkan interleave two vectors bug~~ Fix Vulkan interleave two vectors bug + Vector Legalization lowering pass May 27, 2025

mcourteaux added enhancement New user-visible features or improvements to existing features. release_notes For changes that may warrant a note in README for official releases. labels May 27, 2025

abadams reviewed Jun 4, 2025

abadams reviewed Jun 4, 2025

abadams reviewed Sep 25, 2025

alexreinking requested changes Sep 25, 2025

mcourteaux force-pushed the fix-vulkan-interleave branch from 1cf353d to 78f2007 Compare September 26, 2025 16:32

alexreinking self-requested a review September 26, 2025 21:18

alexreinking reviewed Sep 29, 2025

mcourteaux force-pushed the fix-vulkan-interleave branch 4 times, most recently from 7823407 to 14c4ed2 Compare October 11, 2025 16:47

mcourteaux mentioned this pull request Oct 13, 2025

Fix usage of lookupTarget in LLVM #8841

Merged

mcourteaux force-pushed the fix-vulkan-interleave branch from fa6fef6 to 8773e75 Compare October 23, 2025 16:08

mcourteaux and others added 15 commits October 24, 2025 17:36

ac1969a  Fix Vulkan interleave SPIRV codegen. Fix a bug in Simplify_Shuffle. Fix a bug in Deinterleave.
368eb9b  Vector Legalization Pass. Useful for vectorizing to GPU backends with limited vector lanes.
15311e1  Fix Makefile.
cd51b72  Cleanup.
238da93  Cleanup vector legalization.
f2a0e4f  Try to fix the compiler complaint around visibility.
f612af7  GCC-9 does not understand a complete switch?
f49b33e  Do not lift Let out to LetStmt if we are not in a loop with lane limit. Other feedback: typos, and clarifications.
ec398ec  Improve error message for reinterpret.
4f61830  Only run vector legalization mutators on device loops that require it.
2d3909f  Move required simplifier logic for the vector legalization to the actual Simplifier. Adjust Hexagon simd op tests, as the Simplifier now does optimize away some shuffles. Bugfix in Hexagon shuffle_vector() logic. Co-authored-by: Andrew Adams <[email protected]>
c61a989  Remove special handling of strict_float, as those got overhauled.
82c30f9  Hexagon codegen for vdelta fix regarding dont-care values in shuffle indices.
1f9d3be  Clang-format
73d9d3e  Satisfy clang-tidy

mcourteaux force-pushed the fix-vulkan-interleave branch from 8773e75 to 73d9d3e Compare October 24, 2025 15:36
