Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SSE2 intrinsics smoothscale backend #2473

Merged
merged 2 commits into from
Nov 3, 2023

Conversation

Starbuck5
Copy link
Member

@Starbuck5 Starbuck5 commented Sep 29, 2023

Tackles the bulk of what I wanted in #1767

This was more difficult than I thought, this was quite difficult. I read through the existing smoothscale assembly (written assembly, not intrinsics) and the C implementations and combined what I saw into an SSE2 intrinsic version of the same algorithm.

I then had lots of ideas for optimizations on top of that, so for three of the filters it processes more pixels at a time than the existing SSE/MMX implementations.

This PR "just" implements the SSE2 backend, there are several follow on steps that would be quite nice:

  • Get it compiling and up at runtime on NEON, so ARM folks can get fast smoothscale
  • Remove legacy MMX/SSE backends eventually (we've been linking the compiled code for this in, at least on Windows, this is a part of the pygame codebase we don't know how to compile). Wanting to remove this uncompileable, unportable and ancient bit of our codebase was my chief motivation for opening this PR.
  • AVX2 backend? Now that we've got it in intrinsics, it's much easier to scale up the parallelism with other intrinsics.
  • Fix so if a malloc fails in the filters it will accurately report an error

Performance relative to SSE backend:

X-shrinking: 25% faster
Y-shrinking: 28% faster
X-expanding: neglible
Y-expanding: 25% faster

These are pixel perfect accurate relative to the SSE backend. Verified by saving out to pngs and checking the file hashes.

@Starbuck5 Starbuck5 requested a review from a team as a code owner September 29, 2023 08:15
@Starbuck5 Starbuck5 marked this pull request as draft September 29, 2023 08:16
@Starbuck5 Starbuck5 added Performance Related to the speed or resource usage of the project transform pygame.transform labels Sep 29, 2023
@Starbuck5 Starbuck5 changed the title Smoothscale intrinsics Add SSE2 intrinsics smoothscale backend Sep 29, 2023
@ankith26 ankith26 added the SIMD label Sep 29, 2023
Comment on lines 172 to 180
Uint16 *templine;
/* allocate a clear memory area for storing the accumulator line */
templine = (Uint16 *)calloc(dstpitch, 2);
if (templine == NULL) {
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it potentially be faster here to have this line allocated as a __m128i* type and then use the bit shifting intrinsic to do the equivalent of incrementing the accumulator? Probably not but I just noticed there was a bit of jumping into SIMD types and then back out again.

I often feel like I need a simple intrinsic testbed for this stuff where I can see what is happening in memory doing some basic operations like allocating, shifting a pointer. Maybe then I could master the nuances of aligning memory as well and see if that makes any difference to performance.

Copy link
Member Author

@Starbuck5 Starbuck5 Sep 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always curious about what the intrinsics actually get compiled to. At first, I thought it was 1-to-1 script with the instructions but now I feel it's much more loose and compiler specific. After your comment here I played with putting some of this into godbolt, which did work to show the assembly that is generated, for what that's worth.

I did some tests last year that showed chaining intrinsics together into larger statements was more performant. So not defining middle states of the calculation into variables gives the compiler more freedom?

Alignment is an interesting thing to bring up here, I did a test on this using SDL_SIMDAlloc() so the accumulate could be loaded and stored with aligned instructions, and did that with __m128i*, and I got a small 7% improvement over the current iteration. However SDL_SIMDAlloc is only available in SDL 2.0.10, and we support down to SDL 2.0.9 for now, so I'll leave a comment. I've heard that alignment used to be extremely important and that it is less important for performance these days.

Reading through the algorithm again showed me an unnecessary pointer add though, I'll take that out and add a comment about pursuing SDL_SimdAlloc for that data region in the future.

Comment on lines +118 to +127
// The operation! Translated from old filter_shrink_X_SSE
// assembly. I don't understand the equivalence between
// these operations and the C version of these operations,
// but it works.
src = _mm_slli_epi16(src, 2);
dst = _mm_mulhi_epu16(src, mm_xcounter);
dst = _mm_add_epi16(dst, accumulate);
accumulate = _mm_mulhi_epu16(src, mm_xfrac);

dst = _mm_mulhi_epu16(dst, xrecip);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably work through this in python. Something a bit like this, but with more than one test pixel and more looping:

red_channel = 128
green_channel = 2
blue_channel = 255
alpha_channel = 255

print("red_channel:", "{:016b}".format(red_channel))
print("green_channel:", "{:016b}".format(green_channel))
print("blue_channel:", "{:016b}".format(blue_channel), "\n")

# doing _mm_slli_epi16(src_pixel, 2)
red_shift = red_channel << 2
green_shift = green_channel << 2
blue_shift = blue_channel << 2
alpha_shift = alpha_channel << 2

print("red_shift:", "{:016b}".format(red_shift))
print("green_shift:", "{:016b}".format(green_shift))
print("blue_shift:", "{:016b}".format(blue_shift), "\n")

xspace = 32768  # 0x04000 * 256 / 128 - from: 0x04000 * srcwidth / dstwidth
xrecip = 1073741824 // xspace  # 0x40000000 / xspace
xcounter = 16384

print("xspace:", "{:016b}".format(xspace))
print("xcounter:", "{:016b}".format(xcounter))
print("xrecip:", "{:016b}".format(xrecip), "\n")

# doing _mm_mulhi_epu16(src_pixel, xcounter)
red_mul = (red_shift * xcounter) >> 16
green_mul = (green_shift * xcounter) >> 16
blue_mul = (blue_shift * xcounter) >> 16
alpha_mul = (alpha_shift * xcounter) >> 16

print("red_mul:", "{:016b}".format(red_mul))
print("green_mul:", "{:016b}".format(green_mul))
print("blue_mul:", "{:016b}".format(blue_mul), "\n")

# doing _mm_mulhi_epu16(dst_pixel, xrecip)
red_dst = (red_mul * xrecip) >> 16
green_dst = (green_mul * xrecip) >> 16
blue_dst = (blue_mul * xrecip) >> 16
alpha_dst = (alpha_mul * xrecip) >> 16

print("red_dst:", "{:016b}".format(red_dst))
print("green_dst:", "{:016b}".format(green_dst))
print("blue_dst:", "{:016b}".format(blue_dst), "\n")

Copy link
Member

@MyreMylar MyreMylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking exciting! Probably the most complicated conversion to SIMD operation yet.

This reminds me that we should probably have an issue tracking the functions in the transform module and what is potentially SIMD-able, what is not and what has already been SIMD'd.

@Starbuck5 Starbuck5 force-pushed the smoothscale-intrinsics branch 2 times, most recently from bcf8dad to bce8af0 Compare September 30, 2023 01:29
@Starbuck5 Starbuck5 marked this pull request as ready for review September 30, 2023 05:29
@Starbuck5
Copy link
Member Author

Looking exciting! Probably the most complicated conversion to SIMD operation yet.

I'm glad to have someone to share it with! I'm very excited about this PR.

@Starbuck5 Starbuck5 added this to the 2.4.0 milestone Sep 30, 2023
@Starbuck5
Copy link
Member Author

Here's one of my test scripts, an interactive one:

Interactive smoothscale test
# Change out background path to get running, change smoothscale backend to test behavior without rebuilding
import pygame

pygame.init()
screen = pygame.display.set_mode((800,600))

pygame.transform.set_smoothscale_backend("SSE2")
print(pygame.transform.get_smoothscale_backend())

background_orig = pygame.image.load(r"C:\Users\charl\Desktop\pygame-ce\examples\data\fist.png")
background_orig.set_colorkey((0,38,255))
background = pygame.Surface(background_orig.get_size(), pygame.SRCALPHA)
background.blit(background_orig, (0,0))

print(background)

clock = pygame.Clock()

while True:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            raise SystemExit

    screen.fill("purple")
    background2 = pygame.transform.smoothscale(background, pygame.mouse.get_pos())
    screen.blit(background2, (0,0))
    pygame.display.flip()
    clock.tick(60)

Copy link
Member

@MyreMylar MyreMylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up creating my own performance testing program to put this through it's paces:

Expand to see code
from sys import stdout
from pstats import Stats
from cProfile import Profile


import pygame

pygame.init()


def scale_down():
    dimensions = [
        (0, 0),
        (3, 3),
        (65, 65),
        (126, 127),
        (254, 255),
        (1024, 1024),
        (2047, 2048),
    ]
    scales = [1.3, 2, 3.2]
    for iterations in range(0, 1000):
        for w, h in dimensions:
            for scale in scales:
                src_surf = pygame.Surface((w, h), pygame.SRCALPHA, 32)
                src_surf.fill((100, 255, 128, 200))

                pygame.transform.smoothscale(
                    src_surf,
                    (
                        int(src_surf.get_width() / scale),
                        int(src_surf.get_height() / scale),
                    ),
                )


def scale_up():
    dimensions = [
        (0, 0),
        (3, 3),
        (65, 65),
        (126, 127),
        (254, 255),
        (1024, 1024),
        (2047, 2048),
    ]
    scales = [1.3, 2, 3.2]
    for iterations in range(0, 1000):
        for w, h in dimensions:
            for scale in scales:
                src_surf = pygame.Surface((w, h), pygame.SRCALPHA, 32)
                src_surf.fill((100, 255, 128, 200))

                pygame.transform.smoothscale(
                    src_surf,
                    (
                        int(src_surf.get_width() * scale),
                        int(src_surf.get_height() * scale),
                    ),
                )


if __name__ == "__main__":
    print("Scale Down")
    profiler = Profile()
    profiler.runcall(scale_down)
    stats = Stats(profiler, stream=stdout)
    stats.strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats()

    print("Scale Up")
    profiler = Profile()
    profiler.runcall(scale_up)
    stats = Stats(profiler, stream=stdout)
    stats.strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats()

Current results on main branch:

Scale Down
         84002 function calls in 50.543 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.715   12.715   50.543   50.543 smoothscale_speed_test.py:11(scale_down)
    21000   32.091    0.002   32.091    0.002 {built-in method pygame.transform.smoothscale}
    21000    5.731    0.000    5.731    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.004    0.000    0.004    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Scale Up
         84002 function calls in 203.356 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   30.437   30.437  203.356  203.356 smoothscale_speed_test.py:37(scale_up)
    21000  166.979    0.008  166.979    0.008 {built-in method pygame.transform.smoothscale}
    21000    5.933    0.000    5.933    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.005    0.000    0.005    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Then the current results with this PR:

Scale Down
         84002 function calls in 41.358 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.520   12.520   41.358   41.358 smoothscale_speed_test.py:11(scale_down)
    21000   22.729    0.001   22.729    0.001 {built-in method pygame.transform.smoothscale}
    21000    6.103    0.000    6.103    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.004    0.000    0.004    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Scale Up
         84002 function calls in 178.964 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   30.329   30.329  178.964  178.964 smoothscale_speed_test.py:37(scale_up)
    21000  142.608    0.007  142.608    0.007 {built-in method pygame.transform.smoothscale}
    21000    6.020    0.000    6.020    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
    21000    0.005    0.000    0.005    0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
    21000    0.002    0.000    0.002    0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Which looks to me like 30% faster scaling down and 15% faster scaling up.

Everything looks good to me here. 👍

Copy link
Member

@itzpr3d4t0r itzpr3d4t0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I'm not gonna pretend like i know exactly how this works but i didn't see anything weird and looks like it's working fine. We can always get back to it if we find ways to improve the code further.

Better parallelization for Y-expand, Y-shrink, X-shrink + cleanups and fixes
@Starbuck5 Starbuck5 force-pushed the smoothscale-intrinsics branch from edf1a4e to 3027208 Compare November 3, 2023 07:47
@Starbuck5 Starbuck5 merged commit 0062ac4 into pygame-community:main Nov 3, 2023
29 checks passed
@Starbuck5 Starbuck5 deleted the smoothscale-intrinsics branch November 3, 2023 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Related to the speed or resource usage of the project SIMD transform pygame.transform
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants