[QUESTION] Color blend #1172
-
Color blending is quite common in image processing and fits SIMD well. For example, to blend two premultiplied rgba8 images, pixels have their channels cast and deinterleaved into uint16_t vectors, then the actual blend operations are applied, and finally the channels are cast back, interleaved and stored as the original pixel format. You may take a look at Skia's implementation. Blend operations are mostly trivial for eve, apart from interleave/deinterleave. Is there anything I missed?
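To make the pipeline above concrete, here is a minimal scalar sketch. It assumes the common premultiplied "source over" formula, out = src + dst * (255 - src_alpha) / 255, purely as an illustration; the names `div255`, `src_over_channel` and `src_over_rgba8` are hypothetical and come from neither eve nor Skia.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Rounded division by 255, valid for x in [0, 255 * 255].
constexpr std::uint16_t div255(std::uint16_t x) {
    const std::uint16_t t = static_cast<std::uint16_t>(x + 128);
    return static_cast<std::uint16_t>((t + (t >> 8)) >> 8);
}

// One channel of premultiplied "source over" (assumed formula):
// out = s + d * (255 - sa) / 255, computed in 16 bits to avoid overflow.
constexpr std::uint8_t src_over_channel(std::uint8_t s, std::uint8_t d,
                                        std::uint8_t sa) {
    const std::uint16_t inv = static_cast<std::uint16_t>(255 - sa);
    const std::uint16_t scaled = div255(static_cast<std::uint16_t>(d * inv));
    const std::uint16_t sum = static_cast<std::uint16_t>(s + scaled);
    // Clamp in case the inputs are not properly premultiplied.
    return static_cast<std::uint8_t>(std::min<std::uint16_t>(sum, 255));
}

// Blends `pixels` interleaved rgba8 pixels from src over dst into out.
void src_over_rgba8(const std::uint8_t* src, const std::uint8_t* dst,
                    std::uint8_t* out, std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        const std::uint8_t sa = src[4 * i + 3];
        for (std::size_t c = 0; c < 4; ++c) {
            out[4 * i + c] = src_over_channel(src[4 * i + c], dst[4 * i + c], sa);
        }
    }
}
```

A SIMD version would do the same arithmetic on whole vectors of widened channels; the inner loop over `c` is where the interleaved layout gets in the way.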
Replies: 16 comments
-
Tbh, the code is quite large. Any chance you can sketch a scalar version for us to look at? For most of our algorithms so far we assume uniformity of data (so SoA, and not AoS, which is how images are laid out). We don't have full support for this yet: interleave was added recently and is probably not in package managers yet, but that's solvable. We were definitely going to add support for that at some point. If you can tell me exactly what you need, I can have a look at prioritising it (no promises).
-
Wait! I googled, and it seems like you just want a formula that computes the resulting rgba from 2 input rgba values. I'll write you a possible implementation a bit later.
-
Generated code for shuffles is pretty terrible but that's an easy fix. Can you have a look and see if this is sort of what you wanted? https://godbolt.org/z/fGd6Wrcoa
-
Some of the bad codegen is me messing up the pattern. What's the formula you actually need?
-
Updated code with proper masking + shuffle pattern:
The "bad" codegen is due to a * a_alpha in uint8 having no proper codegen anyway.
Similar code working on 32-bit pixel data:
Main loop is:
.L4:
vpermd ymm1, ymm3, YMMWORD PTR [rcx+rax]
vpermd ymm0, ymm3, YMMWORD PTR [rdi+rax]
vpaddd ymm2, ymm0, ymm1
vpmulld ymm0, ymm0, YMMWORD PTR [rdi+rax]
vpmulld ymm1, ymm1, YMMWORD PTR [rcx+rax]
vpaddd ymm0, ymm0, ymm1
vpsrld ymm2, ymm2, 1
vpblendvb ymm0, ymm0, ymm2, ymm4
vmovdqu YMMWORD PTR [r8+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L4
To get further than that, we need the actual formula on the pixels/alpha channels to make a fair comparison.
-
Yes, it's much better, though I was expecting each channel to get its own vector, as not all channels use the same formula.
Yes, uint8_t channels should be cast to uint16_t before arithmetic operations to avoid overflow.
Skia has implementations for both scalar code and various SIMD architectures.
-
At a quick glance we failed to see a formula that differs between color channels. Can you show us one?
-
These blend modes are relatively rare, and Skia does not seem to implement them directly. One example is a tangent normal map, which applies a slightly different formula to blue. But any blending involving hue/saturation/lightness requires converting RGB to HSL/HSV, which in turn requires separate channel vectors. For example, the saturation blend blends the source pixel's saturation into the destination. Skia uses separate stages for them(
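As an illustration of why such blends want separate channel vectors, here is a small sketch of the saturation term, assuming the usual definition sat = max(r, g, b) - min(r, g, b). The function name `saturation` and the planar `std::vector` layout are hypothetical choices for this example, not Skia's or eve's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-pixel saturation computed from planar (deinterleaved) channels.
// Assumes r, g and b all have the same length.
std::vector<std::uint16_t> saturation(const std::vector<std::uint16_t>& r,
                                      const std::vector<std::uint16_t>& g,
                                      const std::vector<std::uint16_t>& b) {
    std::vector<std::uint16_t> sat(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) {
        // Every output value needs all three channels of the same pixel.
        const std::uint16_t hi = std::max({r[i], g[i], b[i]});
        const std::uint16_t lo = std::min({r[i], g[i], b[i]});
        sat[i] = static_cast<std::uint16_t>(hi - lo);
    }
    return sat;
}
```

With planar data each `max`/`min` maps to one element-wise vector operation; with interleaved data the same step would need horizontal operations or shuffles within every pixel.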
-
OK, so I took one random blend from Skia and updated the code with the conversion. It looks OK. There is one bug that makes >> worse on AVX2, which I have fixed locally. Once I push it to main, this is the expected asm for the innermost loop:
I'll ping this issue as soon as the fix for operator>> is up.
-
As for doing complex operations in interleaved format: we have very solid support for SoA, and the conversion, I believe, is autovectorized.
-
Performance-wise, deinterleaved data would be faster, but since most existing libs/frameworks only accept interleaved data, deinterleaved data is very inconvenient. So deinterleaving/interleaving on the fly would be the best of both worlds.
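A rough scalar sketch of what "on the fly" could look like: each chunk of interleaved rgba8 pixels is widened into small planar scratch buffers, processed there, and written back interleaved. The chunk size and the names `Planes`, `deinterleave`, `interleave` and `for_each_chunk` are assumptions made for this example and are not part of eve.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kChunk = 16;  // hypothetical chunk size, in pixels

// Planar scratch space holding one chunk, widened to 16 bits per channel.
struct Planes {
    std::array<std::uint16_t, kChunk> r{}, g{}, b{}, a{};
};

// Splits n (<= kChunk) interleaved rgba8 pixels into widened planes.
inline void deinterleave(const std::uint8_t* px, std::size_t n, Planes& p) {
    for (std::size_t k = 0; k < n; ++k) {
        p.r[k] = px[4 * k + 0];
        p.g[k] = px[4 * k + 1];
        p.b[k] = px[4 * k + 2];
        p.a[k] = px[4 * k + 3];
    }
}

// Narrows the planes back to 8 bits and stores them interleaved.
// Assumes every value is already in [0, 255].
inline void interleave(const Planes& p, std::size_t n, std::uint8_t* px) {
    for (std::size_t k = 0; k < n; ++k) {
        px[4 * k + 0] = static_cast<std::uint8_t>(p.r[k]);
        px[4 * k + 1] = static_cast<std::uint8_t>(p.g[k]);
        px[4 * k + 2] = static_cast<std::uint8_t>(p.b[k]);
        px[4 * k + 3] = static_cast<std::uint8_t>(p.a[k]);
    }
}

// Runs op(planes, n) on each chunk of an interleaved image, in place.
template <typename Op>
void for_each_chunk(std::uint8_t* pixels, std::size_t count, Op op) {
    Planes p;
    for (std::size_t i = 0; i < count; i += kChunk) {
        const std::size_t n = std::min(kChunk, count - i);
        deinterleave(pixels + 4 * i, n, p);
        op(p, n);
        interleave(p, n, pixels + 4 * i);
    }
}
```

The caller keeps its interleaved buffers, while `op` only ever sees per-channel arrays, so a channel-specific formula (for example one applied to `p.b` alone) stays a plain element-wise loop.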
-
I'm not so sure about that - you pay a really big cost, at least in terms of parallelism (assuming the shuffles are free, which they are not), for interleaved data.
-
As image data can be very large, RAM access will eventually become the bottleneck. The shuffles are actually not much of a concern.
That will be good enough.
I've tried it on the latest MSVC: tons of errors and even ICEs, mostly because MSVC has a rather poor implementation of the C++ standard. If you really want to support MSVC, you should test on it from the very beginning. As for now, it's kind of too late. clang-cl may pass compilation if used with libc++ (MSVC's standard library is also quite error-prone for C++20). For now, I'd mostly use eve::wide as a portable vector type, something like clang's ext_vector_type, and do the rather complicated things using intrinsics.
-
Let's move this to a discussion.
-
FYI: there is an implementation now, but I'm not too happy with the codegen: #1206
-
I think this is as good as it is going to get any time soon: https://godbolt.org/z/1rd1E6nE1
Feel free to experiment and report if you find better ways of doing it.