-
-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SSE2 intrinsics smoothscale backend #2473
Add SSE2 intrinsics smoothscale backend #2473
Conversation
src_c/simd_transform_sse2.c
Outdated
Uint16 *templine; | ||
/* allocate a clear memory area for storing the accumulator line */ | ||
templine = (Uint16 *)calloc(dstpitch, 2); | ||
if (templine == NULL) { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it potentially be faster here to have this line allocated as a __m128i* type and then use the bit shifting intrinsic to do the equivalent of incrementing the accumulator? Probably not but I just noticed there was a bit of jumping into SIMD types and then back out again.
I often feel like I need a simple intrinsic testbed for this stuff where I can see what is happening in memory doing some basic operations like allocating, shifting a pointer. Maybe then I could master the nuances of aligning memory as well and see if that makes any difference to performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always curious about what the intrinsics actually get compiled to. At first, I thought it was 1-to-1 script with the instructions but now I feel it's much more loose and compiler specific. After your comment here I played with putting some of this into godbolt, which did work to show the assembly that is generated, for what that's worth.
I did some tests last year that showed chaining intrinsics together into larger statements was more performant. So not defining middle states of the calculation into variables gives the compiler more freedom?
Alignment is an interesting thing to bring up here, I did a test on this using SDL_SIMDAlloc() so the accumulate could be loaded and stored with aligned instructions, and did that with __m128i*, and I got a small 7% improvement over the current iteration. However SDL_SIMDAlloc is only available in SDL 2.0.10, and we support down to SDL 2.0.9 for now, so I'll leave a comment. I've heard that alignment used to be extremely important and that it is less important for performance these days.
Reading through the algorithm again showed me an unnecessary pointer add though, I'll take that out and add a comment about pursuing SDL_SimdAlloc for that data region in the future.
// The operation! Translated from old filter_shrink_X_SSE | ||
// assembly. I don't understand the equivalence between | ||
// these operations and the C version of these operations, | ||
// but it works. | ||
src = _mm_slli_epi16(src, 2); | ||
dst = _mm_mulhi_epu16(src, mm_xcounter); | ||
dst = _mm_add_epi16(dst, accumulate); | ||
accumulate = _mm_mulhi_epu16(src, mm_xfrac); | ||
|
||
dst = _mm_mulhi_epu16(dst, xrecip); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably work through this in python. Something a bit like this, but with more than one test pixel and more looping:
red_channel = 128
green_channel = 2
blue_channel = 255
alpha_channel = 255
print("red_channel:", "{:016b}".format(red_channel))
print("green_channel:", "{:016b}".format(green_channel))
print("blue_channel:", "{:016b}".format(blue_channel), "\n")
# doing _mm_slli_epi16(src_pixel, 2)
red_shift = red_channel << 2
green_shift = green_channel << 2
blue_shift = blue_channel << 2
alpha_shift = alpha_channel << 2
print("red_shift:", "{:016b}".format(red_shift))
print("green_shift:", "{:016b}".format(green_shift))
print("blue_shift:", "{:016b}".format(blue_shift), "\n")
xspace = 32768 # 0x04000 * 256 / 128 - from: 0x04000 * srcwidth / dstwidth
xrecip = 1073741824 // xspace # 0x40000000 / xspace
xcounter = 16384
print("xspace:", "{:016b}".format(xspace))
print("xcounter:", "{:016b}".format(xcounter))
print("xrecip:", "{:016b}".format(xrecip), "\n")
# doing _mm_mulhi_epu16(src_pixel, xcounter)
red_mul = (red_shift * xcounter) >> 16
green_mul = (green_shift * xcounter) >> 16
blue_mul = (blue_shift * xcounter) >> 16
alpha_mul = (alpha_shift * xcounter) >> 16
print("red_mul:", "{:016b}".format(red_mul))
print("green_mul:", "{:016b}".format(green_mul))
print("blue_mul:", "{:016b}".format(blue_mul), "\n")
# doing _mm_mulhi_epu16(dst_pixel, xrecip)
red_dst = (red_mul * xrecip) >> 16
green_dst = (green_mul * xrecip) >> 16
blue_dst = (blue_mul * xrecip) >> 16
alpha_dst = (alpha_mul * xrecip) >> 16
print("red_dst:", "{:016b}".format(red_dst))
print("green_dst:", "{:016b}".format(green_dst))
print("blue_dst:", "{:016b}".format(blue_dst), "\n")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking exciting! Probably the most complicated conversion to SIMD operation yet.
This reminds me that we should probably have an issue tracking the functions in the transform module and what is potentially SIMD-able, what is not and what has already been SIMD'd.
bcf8dad
to
bce8af0
Compare
I'm glad to have someone to share it with! I'm very excited about this PR. |
Here's one of my test scripts, an interactive one: Interactive smoothscale test# Change out background path to get running, change smoothscale backend to test behavior without rebuilding
import pygame
pygame.init()
screen = pygame.display.set_mode((800,600))
pygame.transform.set_smoothscale_backend("SSE2")
print(pygame.transform.get_smoothscale_backend())
background_orig = pygame.image.load(r"C:\Users\charl\Desktop\pygame-ce\examples\data\fist.png")
background_orig.set_colorkey((0,38,255))
background = pygame.Surface(background_orig.get_size(), pygame.SRCALPHA)
background.blit(background_orig, (0,0))
print(background)
clock = pygame.Clock()
while True:
for event in pygame.event.get():
if event.type == pygame.QUIT:
pygame.quit()
raise SystemExit
screen.fill("purple")
background2 = pygame.transform.smoothscale(background, pygame.mouse.get_pos())
screen.blit(background2, (0,0))
pygame.display.flip()
clock.tick(60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up creating my own performance testing program to put this through it's paces:
Expand to see code
from sys import stdout
from pstats import Stats
from cProfile import Profile
import pygame
pygame.init()
def scale_down():
dimensions = [
(0, 0),
(3, 3),
(65, 65),
(126, 127),
(254, 255),
(1024, 1024),
(2047, 2048),
]
scales = [1.3, 2, 3.2]
for iterations in range(0, 1000):
for w, h in dimensions:
for scale in scales:
src_surf = pygame.Surface((w, h), pygame.SRCALPHA, 32)
src_surf.fill((100, 255, 128, 200))
pygame.transform.smoothscale(
src_surf,
(
int(src_surf.get_width() / scale),
int(src_surf.get_height() / scale),
),
)
def scale_up():
dimensions = [
(0, 0),
(3, 3),
(65, 65),
(126, 127),
(254, 255),
(1024, 1024),
(2047, 2048),
]
scales = [1.3, 2, 3.2]
for iterations in range(0, 1000):
for w, h in dimensions:
for scale in scales:
src_surf = pygame.Surface((w, h), pygame.SRCALPHA, 32)
src_surf.fill((100, 255, 128, 200))
pygame.transform.smoothscale(
src_surf,
(
int(src_surf.get_width() * scale),
int(src_surf.get_height() * scale),
),
)
if __name__ == "__main__":
print("Scale Down")
profiler = Profile()
profiler.runcall(scale_down)
stats = Stats(profiler, stream=stdout)
stats.strip_dirs()
stats.sort_stats("cumulative")
stats.print_stats()
print("Scale Up")
profiler = Profile()
profiler.runcall(scale_up)
stats = Stats(profiler, stream=stdout)
stats.strip_dirs()
stats.sort_stats("cumulative")
stats.print_stats()
Current results on main branch:
Scale Down
84002 function calls in 50.543 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 12.715 12.715 50.543 50.543 smoothscale_speed_test.py:11(scale_down)
21000 32.091 0.002 32.091 0.002 {built-in method pygame.transform.smoothscale}
21000 5.731 0.000 5.731 0.000 {method 'fill' of 'pygame.surface.Surface' objects}
21000 0.004 0.000 0.004 0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
21000 0.002 0.000 0.002 0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Scale Up
84002 function calls in 203.356 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 30.437 30.437 203.356 203.356 smoothscale_speed_test.py:37(scale_up)
21000 166.979 0.008 166.979 0.008 {built-in method pygame.transform.smoothscale}
21000 5.933 0.000 5.933 0.000 {method 'fill' of 'pygame.surface.Surface' objects}
21000 0.005 0.000 0.005 0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
21000 0.002 0.000 0.002 0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Then the current results with this PR:
Scale Down
84002 function calls in 41.358 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 12.520 12.520 41.358 41.358 smoothscale_speed_test.py:11(scale_down)
21000 22.729 0.001 22.729 0.001 {built-in method pygame.transform.smoothscale}
21000 6.103 0.000 6.103 0.000 {method 'fill' of 'pygame.surface.Surface' objects}
21000 0.004 0.000 0.004 0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
21000 0.002 0.000 0.002 0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Scale Up
84002 function calls in 178.964 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 30.329 30.329 178.964 178.964 smoothscale_speed_test.py:37(scale_up)
21000 142.608 0.007 142.608 0.007 {built-in method pygame.transform.smoothscale}
21000 6.020 0.000 6.020 0.000 {method 'fill' of 'pygame.surface.Surface' objects}
21000 0.005 0.000 0.005 0.000 {method 'get_width' of 'pygame.surface.Surface' objects}
21000 0.002 0.000 0.002 0.000 {method 'get_height' of 'pygame.surface.Surface' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Which looks to me like 30% faster scaling down and 15% faster scaling up.
Everything looks good to me here. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I'm not gonna pretend like i know exactly how this works but i didn't see anything weird and looks like it's working fine. We can always get back to it if we find ways to improve the code further.
Better parallelization for Y-expand, Y-shrink, X-shrink + cleanups and fixes
edf1a4e
to
3027208
Compare
Tackles the bulk of what I wanted in #1767
This was more difficult than I thought, this was quite difficult. I read through the existing smoothscale assembly (written assembly, not intrinsics) and the C implementations and combined what I saw into an SSE2 intrinsic version of the same algorithm.
I then had lots of ideas for optimizations on top of that, so for three of the filters it processes more pixels at a time than the existing SSE/MMX implementations.
This PR "just" implements the SSE2 backend, there are several follow on steps that would be quite nice:
Performance relative to SSE backend:
These are pixel perfect accurate relative to the SSE backend. Verified by saving out to pngs and checking the file hashes.