perf(jpeg): faster idct by transposing at zigzag level #157
base: dev
Conversation
This currently causes the image quality to be degraded, probably because the IDCT routine also needs to be adapted, which hasn't been done at this time.
Force-pushed from dd16922 to d86882a
...so that the dequantization process actually works. Doing it this way allows it to be in the same orientation as the DCT coefficients, so it is still eligible for painless vectorization.
Force-pushed from ae3e2ac to f2569fa
Turns out I was just a bit stupid and didn't take into account that the quantization process also needs to be adapted. 🙃 The jpeg tests still don't pass, but the images look visually identical. I'm not sure my eyeballs are a good enough metric, so I'll let you see what differs for yourself instead. The performance improvement with AVX2 is in the neighborhood of 6-8%. However, the changes to the scalar code seem to introduce a regression: the compiler output (with SSE4.2 enabled) is ~100 instructions longer. https://rust.godbolt.org/z/sE3f4eoz6
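For readers skimming the thread, the core idea can be sketched roughly like this. This is a minimal illustration written for this summary, not the PR's actual code, and it assumes the decoder keeps its quantization table in natural (row-major) order.

// Standard JPEG un-zigzag table: zigzag index -> natural (row-major) index.
const UNZIGZAG: [usize; 64] = [
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63,
];

/// Build an un-zigzag table that writes coefficients out already transposed,
/// saving an explicit 8x8 transpose before the column IDCT pass.
fn transposed_unzigzag() -> [usize; 64] {
    let mut table = [0usize; 64];
    for (zz, &nat) in UNZIGZAG.iter().enumerate() {
        table[zz] = (nat % 8) * 8 + nat / 8; // swap row and column
    }
    table
}

/// The quantization table has to be permuted the same way, so that
/// dequantization still multiplies each coefficient by its own entry.
fn transpose_qtable(q: &[u16; 64]) -> [u16; 64] {
    let mut out = [0u16; 64];
    for i in 0..64 {
        out[(i % 8) * 8 + i / 8] = q[i];
    }
    out
}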
A bit tricky; this probably means there is a problem. I'll investigate and report back, but thanks for the PR.
My hypothesis is that reversing the order of the 1-D IDCT passes causes the rounding errors to vary ever so slightly, so they don't match the rounding errors produced when the passes are done the other way around. In exact mathematics there is no difference, so I don't know. I haven't tested this yet either, but it should be trivial to check with the AVX2 code: just do a transpose before, in the middle, and after, which should yield the same behavior as without the zigzag-level transposition.
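That sanity check only needs a plain 8x8 transpose. A minimal sketch of such a helper (illustrative only, not the crate's code):

/// In-place transpose of an 8x8 block stored as a flat row-major array.
fn transpose8x8(block: &mut [i32; 64]) {
    for r in 0..8 {
        for c in (r + 1)..8 {
            block.swap(r * 8 + c, c * 8 + r);
        }
    }
}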
One way to test may be to compare the DSSIM between the new images and ImageMagick, which uses libjpeg-turbo. The script I use is:

set -e
input="$1"
output="$(mktemp --tmpdir result_XXXXXXXXXXXXX.png)"
trap "rm -f "$output"" EXIT
if ! yes | /home/caleb/Documents/rust/zune-image/target/release/zune --input "$input" --out "$output" --yes --experimental 2>&1; then
echo "Failed to decode $input" 1>&2
exit 1
fi
similarity=$(compare -quiet -metric RMSE "$input[0]" "$output" /dev/null 2>&1) || true
echo "RMSE $similarity $input"
similarity=$(compare -quiet -metric AE "$input[0]" "$output" /dev/null 2>&1) || true
echo "Absolute error count $similarity $input"
similarity=$(compare -quiet -metric DSSIM "$input[0]" "$output" /dev/null 2>&1) || true
echo "DSSIM $similarity $input"
similarity=$(compare -quiet -metric PSNR "$input[0]" "$output" /dev/null 2>&1) || true
echo "PSNR $similarity $input" And you call it as sh ./psnr.sh {IMAGE} But remember to change the directory of the above zune binary You can compile a binary by calling |
I did the test for the images in test-images. First, I can confirm my theory is true: the order in which the IDCT passes are done does have an impact. Using the zigzag-level transposition but with three transposes (to "restore" the original IDCT coefficient matrix before doing things the "old" way), the outputs are identical. Regarding the error values, the change in IDCT pass order has a negative impact across the board 😔 It's unfortunate the 2-D DCT introduces these; from a mathematical standpoint my approach would, assuming exact mathematics, have the same output 🤔 Given these results I'm unsure whether pursuing this direction is viable with the current IDCT implementation...

DSSIM (before)
DSSIM (after)
It's weird that the outputs vary depending on the order; I can't seem to figure out why. By the way, what were the perf gains? You can get the difference by running
Could you do a before and after of the PR? We can then see if there is a way to reduce it and not be lying to ourselves that we are making anything better.
The 6-8% perf improvement was measured using cargo bench, although I was running in a quite noisy environment (IntelliJ, Spotify, and a bunch of Chromiums running; thanks for nothing, Electron). I'll do a clean bench in a relatively quiet environment tomorrow!
My theory is that this has to do with the nature of the approximation: if the IDCT is lossy (due to not using perfect mathematics), then the order of operations does matter, because the loss won't be the same and will depend on the input. Since columns are processed first, the intermediate state loses data "differently" than if it were done row-first. But I'm unsure how to put this theory to the test 😵💫
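A toy example can show this effect. It is not the real IDCT, just a 2x2 "transform" built from truncating integer butterflies, but it demonstrates that rows-first and columns-first round differently even though the exact math is identical:

// Truncating "butterfly": the >> 1 discards information, and what gets
// discarded depends on the intermediate values, hence on pass order.
fn pass(a: i32, b: i32) -> (i32, i32) {
    ((a + b) >> 1, (a - b) >> 1)
}

fn rows_then_cols(mut m: [[i32; 2]; 2]) -> [[i32; 2]; 2] {
    for r in 0..2 {
        let (x, y) = pass(m[r][0], m[r][1]);
        m[r] = [x, y];
    }
    for c in 0..2 {
        let (x, y) = pass(m[0][c], m[1][c]);
        m[0][c] = x;
        m[1][c] = y;
    }
    m
}

fn cols_then_rows(mut m: [[i32; 2]; 2]) -> [[i32; 2]; 2] {
    for c in 0..2 {
        let (x, y) = pass(m[0][c], m[1][c]);
        m[0][c] = x;
        m[1][c] = y;
    }
    for r in 0..2 {
        let (x, y) = pass(m[r][0], m[r][1]);
        m[r] = [x, y];
    }
    m
}

fn main() {
    let input = [[1, 0], [0, 1]];
    // Exact math gives the same (symmetric) result either way,
    // but the truncated intermediates disagree:
    assert_eq!(rows_then_cols(input), [[0, -1], [0, 0]]);
    assert_eq!(cols_then_rows(input), [[0, 0], [-1, 0]]);
}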
Here are the results of the benchmark on my computer. Running Arch Linux (6.7.1-arch1-1), Intel i7-7700K, Rust toolchain version 1.75.0
Hi, sorry for the delay in responding.
I forgot how much 6% is when you are doing 68 ms :) I am not comfortable making the IDCT changes that lead to greater divergence from libjpeg-turbo, but I love the change for checking for zeroes and believe that should be merged. An additional place to optimize, by the way, may be moving from platform-specific intrinsics to portable SIMD; we may gain more speed in places like wasm, in case you are still up for this. No pressure ( :) )
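As a rough illustration of what the portable-SIMD route could look like for the zero-coefficient check: this is a hedged sketch using nightly-only std::simd, and the function name is illustrative, not zune-jpeg's actual API.

#![feature(portable_simd)]
use std::simd::Simd;

/// True if every coefficient in an 8x8 block is zero, checked 16 lanes at a time.
fn block_is_all_zero(block: &[i16; 64]) -> bool {
    block
        .chunks_exact(16)
        .all(|chunk| Simd::<i16, 16>::from_slice(chunk) == Simd::splat(0))
}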
Implements some of the optimizations discussed in #155. Marking this PR as a draft, as it currently causes the quality of the image to be degraded; probably because the IDCT routine also needs to be adapted, which hasn't been done at this time (and I'm not super familiar with the algorithm used here).
Rough testing on my computer shows an approx. 10% performance increase from these changes (x86 AVX2). The output images are almost correct, but some noise is introduced (making the jpeg tests fail at this time).
I'm also not a fan of the quick solution I put together for the scalar code: it doesn't do any transposition, and since I didn't want to introduce checks in the huffman decode logic, I sort of tucked the transposition into the scalar decode to put everything back where it's supposed to be. A better solution would probably be to change the access to in_vector instead, but for now it's mostly here as a PoC.
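That in_vector alternative could look roughly like this. The helper and its signature are hypothetical, and in_vector is assumed to be a flat row-major 8x8 block of coefficients:

/// Map a flat row-major index to its transposed counterpart.
#[inline]
fn transposed_index(i: usize) -> usize {
    (i % 8) * 8 + i / 8 // swap row and column
}

/// Read a coefficient as if the block had been transposed, so the scalar
/// IDCT can keep its original orientation without a physical transpose.
fn read_transposed(in_vector: &[i32; 64], i: usize) -> i32 {
    in_vector[transposed_index(i)]
}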