perf(jpeg): faster idct by transposing at zigzag level #157
base: dev
Conversation
This currently causes the image quality to be degraded, probably because the IDCT routine also needs to be adapted, which hasn't been done at this time.
Force-pushed from dd16922 to d86882a
...so that the dequantization process actually works. Doing it this way allows it to be in the same orientation as the DCT coefficients, so it is still eligible for painless vectorization.
Force-pushed from ae3e2ac to f2569fa
Turns out I was just a bit stupid and didn't take into account that the quantization process also needs to be adapted. 🙃 The jpeg tests still don't pass, but the images look visually identical. I'm not sure my eyeballs are a good enough metric, so I'll let you see what differs for yourself instead. The performance improvement with AVX2 is in the neighborhood of 6-8%. However, the changes to the scalar code seem to introduce a regression: the compiler output (with SSE4.2 enabled) is ~100 instructions longer. https://rust.godbolt.org/z/sE3f4eoz6
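For readers skimming the thread, the core idea can be sketched roughly like this. This is a minimal illustration written for this summary, not the PR's actual code, and it assumes the decoder keeps its quantization table in natural (row-major) order.

// Standard JPEG un-zigzag table: zigzag index -> natural (row-major) index.
const UNZIGZAG: [usize; 64] = [
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63,
];

/// Build an un-zigzag table that writes coefficients out already transposed,
/// saving an explicit 8x8 transpose before the column IDCT pass.
fn transposed_unzigzag() -> [usize; 64] {
    let mut table = [0usize; 64];
    for (zz, &nat) in UNZIGZAG.iter().enumerate() {
        table[zz] = (nat % 8) * 8 + nat / 8; // swap row and column
    }
    table
}

/// The quantization table has to be permuted the same way, so that
/// dequantization still multiplies each coefficient by its own entry.
fn transpose_qtable(q: &[u16; 64]) -> [u16; 64] {
    let mut out = [0u16; 64];
    for i in 0..64 {
        out[(i % 8) * 8 + i / 8] = q[i];
    }
    out
}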
A bit tricky; this probably means there is a problem. I'll investigate and report back, but thanks for the PR.
My hypothesis is that reversing the order of the 1-D IDCT passes causes the rounding errors to vary ever so slightly, so they don't match the rounding errors produced when the passes are done the other way around. In exact mathematics there is no difference, so I don't know. I haven't tested this yet either, but it should be trivial to check with the AVX2 code: just do a transpose before, in the middle, and after, which should yield the same behavior as without the zigzag-level transposition.
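That sanity check only needs a plain 8x8 transpose. A minimal sketch of such a helper (illustrative only, not the crate's code):

/// In-place transpose of an 8x8 block stored as a flat row-major array.
fn transpose8x8(block: &mut [i32; 64]) {
    for r in 0..8 {
        for c in (r + 1)..8 {
            block.swap(r * 8 + c, c * 8 + r);
        }
    }
}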
One way to test may be to compare the DSSIM between the new images and ImageMagick, which uses libjpeg-turbo. The script I use is:

set -e
input="$1"
output="$(mktemp --tmpdir result_XXXXXXXXXXXXX.png)"
trap "rm -f "$output"" EXIT
if ! yes | /home/caleb/Documents/rust/zune-image/target/release/zune --input "$input" --out "$output" --yes --experimental 2>&1; then
echo "Failed to decode $input" 1>&2
exit 1
fi
similarity=$(compare -quiet -metric RMSE "$input[0]" "$output" /dev/null 2>&1) || true
echo "RMSE $similarity $input"
similarity=$(compare -quiet -metric AE "$input[0]" "$output" /dev/null 2>&1) || true
echo "Absolute error count $similarity $input"
similarity=$(compare -quiet -metric DSSIM "$input[0]" "$output" /dev/null 2>&1) || true
echo "DSSIM $similarity $input"
similarity=$(compare -quiet -metric PSNR "$input[0]" "$output" /dev/null 2>&1) || true
echo "PSNR $similarity $input" And you call it as sh ./psnr.sh {IMAGE} But remember to change the directory of the above zune binary You can compile a binary by calling |
I did the test for the images in test-images. First, I can confirm my theory is true: the order in which the IDCT passes are done does have an impact. Using the zigzag-level transposition but with three transposes (to "restore" the original IDCT coefficient matrix before doing things the "old" way), the outputs are identical. Regarding the error values, the change in IDCT pass order has a negative impact across the board 😔 It's unfortunate the 2-D DCT introduces these; from a mathematical standpoint my approach would, assuming exact mathematics, have the same output 🤔 Given these results I'm unsure whether pursuing this direction is viable with the current IDCT implementation...

DSSIM (before)
DSSIM (after)
It's weird that the outputs vary depending on the order; I can't seem to figure out why. By the way, what were the perf gains? You can get the difference by running
Could you do a before and after of the PR? We can then see if there is a way to reduce it and not be lying to ourselves that we are making anything better.
The 6-8% perf improvement was measured using cargo bench, although I was running in a quite noisy environment (IntelliJ, Spotify, and a bunch of Chromiums running; thanks for nothing, Electron). I'll do a clean bench in a relatively quiet environment tomorrow!
My theory is that this has to do with the nature of the approximation: if the IDCT is lossy (due to not using perfect mathematics), then the order of operations does matter, because the loss won't be the same and will depend on the input. Since columns are processed first, the intermediate state loses data "differently" than if it were done row-first. But I'm unsure how to put this theory to the test 😵💫
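A toy example can show this effect. It is not the real IDCT, just a 2x2 "transform" built from truncating integer butterflies, but it demonstrates that rows-first and columns-first round differently even though the exact math is identical:

// Truncating "butterfly": the >> 1 discards information, and what gets
// discarded depends on the intermediate values, hence on pass order.
fn pass(a: i32, b: i32) -> (i32, i32) {
    ((a + b) >> 1, (a - b) >> 1)
}

fn rows_then_cols(mut m: [[i32; 2]; 2]) -> [[i32; 2]; 2] {
    for r in 0..2 {
        let (x, y) = pass(m[r][0], m[r][1]);
        m[r] = [x, y];
    }
    for c in 0..2 {
        let (x, y) = pass(m[0][c], m[1][c]);
        m[0][c] = x;
        m[1][c] = y;
    }
    m
}

fn cols_then_rows(mut m: [[i32; 2]; 2]) -> [[i32; 2]; 2] {
    for c in 0..2 {
        let (x, y) = pass(m[0][c], m[1][c]);
        m[0][c] = x;
        m[1][c] = y;
    }
    for r in 0..2 {
        let (x, y) = pass(m[r][0], m[r][1]);
        m[r] = [x, y];
    }
    m
}

fn main() {
    let input = [[1, 0], [0, 1]];
    // Exact math gives the same (symmetric) result either way,
    // but the truncated intermediates disagree:
    assert_eq!(rows_then_cols(input), [[0, -1], [0, 0]]);
    assert_eq!(cols_then_rows(input), [[0, 0], [-1, 0]]);
}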
Here are the results of the benchmark on my computer. Running Arch Linux (6.7.1-arch1-1), Intel i7-7700K, Rust toolchain version 1.75.0
Hi, sorry for the delay in responding.
I forgot how much 6% is when you are doing 68 ms :) I am not comfortable making the IDCT changes that lead to greater divergence from libjpeg-turbo, but I love the change for checking for zeroes and believe that should be merged. An additional place to optimize, by the way, may be moving from platform-specific intrinsics to portable SIMD; we may gain more speed in places like wasm, in case you are still up for this. No pressure ( :) )
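As a rough illustration of what the portable-SIMD route could look like for the zero-coefficient check: this is a hedged sketch using nightly-only std::simd, and the function name is illustrative, not zune-jpeg's actual API.

#![feature(portable_simd)]
use std::simd::Simd;

/// True if every coefficient in an 8x8 block is zero, checked 16 lanes at a time.
fn block_is_all_zero(block: &[i16; 64]) -> bool {
    block
        .chunks_exact(16)
        .all(|chunk| Simd::<i16, 16>::from_slice(chunk) == Simd::splat(0))
}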
Implements some of the optimizations discussed in #155. Marking this PR as a draft, as it currently causes the quality of the image to be degraded; probably because the IDCT routine also needs to be adapted, which hasn't been done at this time (and I'm not super familiar with the algorithm used here).
Rough testing on my computer shows an approx. 10% performance increase from these changes (x86 AVX2). The output images are almost correct, but some noise is introduced (making the jpeg tests fail at this time).
I'm also not a fan of the quick solution I put together for the scalar code: it doesn't do any transposition, and since I didn't want to introduce checks in the huffman decode logic, I sort of tucked the transposition into the scalar decode to put everything back where it's supposed to be. A better solution would probably be to change the access to in_vector instead, but for now it's mostly here as a PoC.
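That in_vector alternative could look roughly like this. The helper and its signature are hypothetical, and in_vector is assumed to be a flat row-major 8x8 block of coefficients:

/// Map a flat row-major index to its transposed counterpart.
#[inline]
fn transposed_index(i: usize) -> usize {
    (i % 8) * 8 + i / 8 // swap row and column
}

/// Read a coefficient as if the block had been transposed, so the scalar
/// IDCT can keep its original orientation without a physical transpose.
fn read_transposed(in_vector: &[i32; 64], i: usize) -> i32 {
    in_vector[transposed_index(i)]
}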