AVX Surface.fill() setup, AVX BLEND_ADD #2382

Merged

Member

@itzpr3d4t0r itzpr3d4t0r commented Aug 5, 2023

Our current implementation of Surface.fill() with blend flags only uses the single-pixel strategy, processing one pixel at a time. This is a massive opportunity to speed things up.
This PR starts those changes with BLEND_ADD.
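
For readers unfamiliar with the flag, here is a minimal pure-Python sketch of what a BLEND_ADD fill computes: each channel of the fill color is added to the destination channel and the result is clamped to 255. The helper names `blend_add_pixel` and `fill_blend_add` are made up for illustration and are not pygame functions; the real work happens in C.

```python
def blend_add_pixel(dst, src):
    """Add two RGB tuples channel by channel, clamping each result to 255."""
    return tuple(min(d + s, 255) for d, s in zip(dst, src))


def fill_blend_add(pixels, color):
    """Apply the saturating add to every pixel, one pixel at a time.

    This mirrors the single-pixel strategy: no two pixels are processed together.
    """
    return [blend_add_pixel(p, color) for p in pixels]


# The first pixel uses the colors from the test program below;
# the second one shows the clamp kicking in on the red and blue channels.
row = [(132, 33, 200), (250, 10, 240)]
result = fill_blend_add(row, (24, 24, 24))
print(result)
```

Because every pixel gets the same independent operation, the loop is a natural fit for SIMD registers that handle several pixels per instruction.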

Results (mean seconds per 1000 calls, averaged over 10 repeats; lower is better):

OLD FILL
2.042152139999234
2.060880370000814
1.977689829999872
2.0481357899989234
1.9969090399994456

NEW FILL
0.01658679000029224
0.01656728999951156
0.01650887000068906
0.016579669998463942
0.016620350000448526

BLIT (avx2 with cached surface)
0.02195337999946787
0.021780080000462478
0.02219769999937853
0.022039489999588113
0.02213338999863481

BLIT (avx2 no cached surface)
0.33580584000083036
0.34673017999957667
0.34221240999904695
0.33031794000053194
0.33174736999790183
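
To put the numbers above in perspective, the following sketch computes the relative speedups from the reported timings. It only does arithmetic on the figures listed in this comment; the variable names are illustrative.

```python
from statistics import mean

# Timings copied from the results above (seconds per 1000 calls).
old_fill = [2.042152139999234, 2.060880370000814, 1.977689829999872,
            2.0481357899989234, 1.9969090399994456]
new_fill = [0.01658679000029224, 0.01656728999951156, 0.01650887000068906,
            0.016579669998463942, 0.016620350000448526]
blit_cached = [0.02195337999946787, 0.021780080000462478, 0.02219769999937853,
               0.022039489999588113, 0.02213338999863481]

# Ratio of mean timings: how many times faster the new fill is.
fill_speedup = mean(old_fill) / mean(new_fill)
blit_vs_fill = mean(blit_cached) / mean(new_fill)

print(f"new fill vs old fill: {fill_speedup:.0f}x faster")
print(f"new fill vs cached AVX2 blit: {blit_vs_fill:.2f}x faster")
```

On these figures the new fill is roughly 120 times faster than the old one, and somewhat faster than an AVX2 blit from a cached surface of the same color.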

Test Program:

from statistics import mean
from timeit import repeat

import pygame

pygame.init()

surf = pygame.Surface((500, 500))
surf.fill((132, 33, 200))

color = pygame.Surface((500, 500))
color.fill((24, 24, 24))

G = globals()

teststr = """
surf.fill((24, 24, 24), None, pygame.BLEND_ADD)
"""
for _ in range(5):
    print(mean(repeat(teststr, globals=G, number=1000, repeat=10)))

print()
teststr = """
surf.blit(color, (0, 0), None, pygame.BLEND_ADD)
"""
for _ in range(5):
    print(mean(repeat(teststr, globals=G, number=1000, repeat=10)))

@itzpr3d4t0r itzpr3d4t0r added the Performance Related to the speed or resource usage of the project label Aug 5, 2023
@itzpr3d4t0r itzpr3d4t0r marked this pull request as ready for review August 5, 2023 19:51
@itzpr3d4t0r itzpr3d4t0r requested a review from a team as a code owner August 5, 2023 19:51
@itzpr3d4t0r itzpr3d4t0r added the Surface pygame.Surface label Aug 6, 2023
@itzpr3d4t0r itzpr3d4t0r force-pushed the surface-fill-add-optimization branch from b2bd30d to 5d7f47c Compare August 6, 2023 08:03
@Starbuck5
Member

Couple preliminary things.

I was interested to see if I could find any examples of people actually using Surface.fill with a blend flag online. I did, I found this: https://github.com/Rabbid76/PyGameExamplesAndAnswers/blob/master/documentation/pygame/pygame_blending_and_transaprency.md#change-the-color-of-a-surface-area-mask

AVX2 only works on x86, our SSE2 code also runs on ARM due to SSE2Neon.h. So SSE2 is more broadly important to us, since it will help x86 computers as well as ARM macs and other ARM devices.

I like that you're taking inspiration from my macro strategy. I see that you're using the SSE and AVX2 registers together, my blit macros use masked stores on the non-aligned edges so everything can be done with AVX2 registers and instructions. Is that something you'd like to do here?
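
The masked-store idea can be modeled in plain Python. This is only a sketch of the technique, not the C macro from the blitters: lists stand in for 256-bit registers (eight 32-bit pixels each), and the function name `fill_row_masked` is hypothetical. In C, the masked load and store would typically be done with intrinsics such as `_mm256_maskload_epi32` and `_mm256_maskstore_epi32`.

```python
LANES = 8  # a 256-bit AVX2 register holds eight 32-bit pixels


def add_saturate(vec, color):
    """Saturating per-channel add applied to every lane of a 'register'."""
    return [tuple(min(c + k, 255) for c, k in zip(px, color)) for px in vec]


def fill_row_masked(row, color):
    """BLEND_ADD one row using full-width operations only.

    Full groups of LANES pixels are processed directly. The leftover
    pixels at the edge go through one extra full-width operation whose
    load and store are masked, so no narrower register type is needed.
    """
    out = list(row)
    full = len(row) - len(row) % LANES

    # Main loop: whole registers.
    for start in range(0, full, LANES):
        out[start:start + LANES] = add_saturate(row[start:start + LANES], color)

    tail = len(row) - full
    if tail:
        mask = [lane < tail for lane in range(LANES)]
        # Masked load: inactive lanes read as zero instead of touching memory.
        padded = row[full:] + [(0, 0, 0)] * (LANES - tail)
        blended = add_saturate(padded, color)
        # Masked store: only active lanes are written back.
        for lane, active in enumerate(mask):
            if active:
                out[full + lane] = blended[lane]
    return out


# 13 pixels: one full register of 8 plus a masked tail of 5.
row = [(132, 33, 200)] * 13
result = fill_row_masked(row, (24, 24, 24))
print(len(result), result[0], result[-1])
```

The point of the design is that the edge pixels reuse the same register width and the same blend code as the main loop, so a single code path covers the whole row.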

@itzpr3d4t0r
Member Author

> AVX2 only works on x86, our SSE2 code also runs on ARM due to SSE2Neon.h. So SSE2 is more broadly important to us, since it will help x86 computers as well as ARM macs and other ARM devices.

If you mean that I should do the SSE2 version first, I guess it's fine to treat this PR as the setup for AVX. The SSE2 setup and its BLEND_ADD version would then come next, instead of expanding AVX2 further. After these two PRs we could implement the SSE2 and AVX2 versions of each remaining flag either together in a single PR or separately.

> I like that you're taking inspiration from my macro strategy.

yep =).

> I see that you're using the SSE and AVX2 registers together, my blit macros use masked stores on the non-aligned edges so everything can be done with AVX2 registers and instructions. Is that something you'd like to do here?

Yeah, I've already replicated your work to compare performance and see the benefits. To be honest, I thought about the "AVX only" strategy myself before knowing about your implementation. In practice I didn't see much of a difference; performance was basically the same. I've also replaced the unrolled loops with a for loop. The main benefit is that we would need just one code path for filling instead of two, which is good. I haven't pushed that yet, but it might be wise for simplicity.

@Starbuck5
Member

You’re talking about two code paths; with my macro you only need one, and you only need AVX for the AVX blitter. This is not about performance, it is about code simplicity, and it is the approach I would prefer.

Unrolled vs. normal for loop: I don’t care too much. Unrolled has a larger code size, so if there’s no measurable benefit I’d do a normal for loop.

@itzpr3d4t0r

This comment was marked as outdated.

@itzpr3d4t0r itzpr3d4t0r changed the title SIMD'd Surface.fill when using BLEND_ADD AVX Surface.fill() setup, AVX BLEND_ADD Aug 15, 2023
Review thread on src_c/surface_fill.c (outdated, resolved)
@Starbuck5
Member

I saw that you moved to my favored strategy and then moved away from it. I think there are speedups to be had in your implementation of my suggested strategy.

Another consideration is code size: your current implementation copy-pastes the "add" code into the final binary 14 times by my count, because of the loop unrolling and the macro. Code size would matter more for a bigger routine (like a blit) than for just a handful of instructions, but it's something to keep in mind.

@itzpr3d4t0r

This comment was marked as outdated.

@itzpr3d4t0r itzpr3d4t0r force-pushed the surface-fill-add-optimization branch from da66157 to 46a4483 Compare September 11, 2023 21:56
Review thread on src_c/simd_fill.h (outdated, resolved)
@MyreMylar
Member

Looks like this needs a merge with main to get past the old CircleCI failure.

@itzpr3d4t0r itzpr3d4t0r force-pushed the surface-fill-add-optimization branch from 1e75d63 to de7b49b Compare October 15, 2023 09:55
Member

@MyreMylar MyreMylar left a comment

Alright, LGTM 👍 (passes all my visual tests, and I also see the expected speedup)

SIMD with the add blend is so nice and straightforward! 🎉

Member

@Starbuck5 Starbuck5 left a comment

This looks good to me.

On to an SSE2 implementation, and roll out to other flags?

@Starbuck5
Member

I'd like this to be squashed down a bit before merge, please.

@itzpr3d4t0r itzpr3d4t0r merged commit 3ac78fc into pygame-community:main Nov 12, 2023
@itzpr3d4t0r itzpr3d4t0r added this to the 2.4.0 milestone Nov 12, 2023
@itzpr3d4t0r itzpr3d4t0r deleted the surface-fill-add-optimization branch November 12, 2023 10:22
@itzpr3d4t0r itzpr3d4t0r mentioned this pull request Nov 12, 2023
@itzpr3d4t0r itzpr3d4t0r mentioned this pull request Jan 21, 2024
Labels
Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface
5 participants