
Performance analysis of surface.fill #3227

Open
Starbuck5 opened this issue Nov 20, 2024 · 8 comments
Labels
discussion, Performance (Related to the speed or resource usage of the project), Surface (pygame.Surface)

Comments

@Starbuck5
Member

Introduction

Last week I went and profiled a handful of random games, mostly from pygame community game jams. One thing I noticed is that fill was often one of the higher ranked pygame-ce functions, in terms of runtime. Not as expensive as blits or display.flip or anything, but it tends to be up there. Which surprised me. It doesn't seem like it would be that expensive.

I wondered if pygame-ce's internal fill routines could do a better job than SDL's, so I came up with my own alternative to surface.fill():

# Normal
screen.fill("purple")

# Weird
screen.fill("white", special_flags=pygame.BLEND_SUB)
screen.fill("purple", special_flags=pygame.BLEND_ADD)

And I found that even doing these 2 function calls was faster than a normal fill! Amazing! If going through the surface twice is faster, we could easily have our own special_flags=0 routine take over fill(), do it more efficiently, and show a noticeable speed improvement in a typical game.

However, the story is not that simple. I tested a larger surface and SDL was now faster. What gives? Why is SDL better at large surfaces while we are better at small ones, and is there anything we can contribute to them or learn for ourselves from this?

Data

[Chart: fill() seconds taken (20k repetitions), different strategies]

[Chart: fill() nanoseconds per pixel, different strategies]

Raw data: https://docs.google.com/spreadsheets/d/1WBCVvzkL9HAZJ7Yo1N86-tAFhAP0d2J72Wcp8mveCl4/edit?gid=2144097095#gid=2144097095

Benchmarking script
import time
import pygame

pygame.init()
size = 1550
screen = pygame.Surface((size,size))

print(screen, screen.get_pitch())

def fill_purple_normal():
    print("Normal fill")

    screen.fill("purple")
    print(screen.get_at((0,0))) # Color(160, 32, 240, 255)

    start = time.time()
    for _ in range(10000):
        screen.fill("purple")
    print(time.time() - start) # Takes about 0.19 seconds on my system

def fill_purple_weird():
    print("SUB-ADD fill")

    screen.fill("white", special_flags=pygame.BLEND_SUB)
    screen.fill("purple", special_flags=pygame.BLEND_ADD)
    print(screen.get_at((0,0))) # Color(160, 32, 240, 255)

    start = time.time()
    for _ in range(10000):
        screen.fill("white", special_flags=pygame.BLEND_SUB)
        screen.fill("purple", special_flags=pygame.BLEND_ADD)
    print(time.time() - start) # Takes about 0.07 seconds on my system

fill_purple_normal()
fill_purple_weird()

Analysis

  • SDL's performance is bumpier; they seem to favor extremely round widths. Width=1200 is significantly faster for them than widths 1050, 1100, 1150, 1250, and 1300. I don't think this is an outlier; widths 400 and 800 also stand out with low runtimes in the same way.
  • I'm running this on an x86_64 system w/AVX2, so our fills use our AVX2 routines, while SDL uses SSE.
  • Our performance gradient is smoother overall, but beyond 1000x1000 surfaces our time to process each pixel goes up 4-5x to a new stable plateau.
  • A 1000x1000 32 bpp surface takes up 4MB of memory. My L3 cache is 12MB. Could the amount of cache explain the performance threshold?
  • SDL is using aligned, non-temporal stores (they use _mm_stream_ps)
  • We are using unaligned, normal stores (_mm_storeu_si128, _mm256_storeu_si256); see the sketch after this list for the contrast
  • Source code of what I believe is the SDL routine used here: https://github.com/libsdl-org/SDL/blob/e924f12a7b33678bc71dc96acf3c44142c72c553/src/video/SDL_fillrect.c#L30-L95
  • They have accelerators for ARM as well, so I'm curious whether on ARM they would be faster than us in every scenario. There could be a world where we make our own internal FillRect implementation for x86 but still call theirs on ARM.
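
To make the contrast in those last store bullets concrete, here is a minimal sketch of the two store strategies on one row of 32-bit pixels. This is not the actual code from either project; the function names are made up, the row width is assumed to be a multiple of the vector width, and the streaming version assumes a 16-byte-aligned row pointer (I've used _mm_stream_si128, the integer counterpart of the _mm_stream_ps SDL uses).

#include <immintrin.h>
#include <stdint.h>

/* pygame-ce style: unaligned, cache-allocating AVX2 stores */
static void fill_row_avx2_unaligned(uint32_t *row, int width, uint32_t color)
{
    __m256i v = _mm256_set1_epi32((int)color);   /* broadcast the color to 8 lanes */
    for (int x = 0; x < width; x += 8)
        _mm256_storeu_si256((__m256i *)(row + x), v);
}

/* SDL style: aligned, non-temporal SSE stores that bypass the cache */
static void fill_row_sse_stream(uint32_t *row, int width, uint32_t color)
{
    __m128i v = _mm_set1_epi32((int)color);      /* broadcast the color to 4 lanes */
    for (int x = 0; x < width; x += 4)
        _mm_stream_si128((__m128i *)(row + x), v);
    _mm_sfence();                                /* make the streamed writes visible */
}

The difference matters mostly when the surface doesn't fit in cache: the streaming stores avoid filling the cache with pixels nobody is about to read back, at the cost of being slower on small surfaces that would have fit, which lines up with the data above.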

This is a very open ended issue, I mainly want to bring up what I've found to those who might also be interested. Potentially we can contribute something to SDL or learn something from their strategy to improve our own.

@MyreMylar @itzpr3d4t0r

@Starbuck5 Starbuck5 added the Performance, Surface, and discussion labels Nov 20, 2024
@Starbuck5
Member Author

For all the data I gathered, I haven't actually tried to make a replacement for SDL_FillRect() using our SIMD strategies yet. I see that all our filler code has to load and unpack the source surface, which a replacement for SDL_FillRect() wouldn't need to do; it would just need to broadcast. Maybe the loading and unpacking is what throws our performance off a cliff with large surfaces versus theirs (which does no loading and unpacking because it doesn't need it!). A rough sketch of that broadcast-only idea is below.
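
For illustration, here is what that broadcast-only approach could look like; the function name and structure are mine, not pygame-ce's actual filler code, and it assumes a 32-bit pixel format.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Fill a w-by-h region of 32-bit pixels with one color: no source surface to
 * load or unpack, just a broadcast register stored across each row. */
static void fill_rect_avx2_sketch(uint8_t *pixels, int pitch, int w, int h,
                                  uint32_t color)
{
    __m256i v = _mm256_set1_epi32((int)color);        /* 8 pixels per store */
    for (int y = 0; y < h; y++) {
        uint32_t *row = (uint32_t *)(pixels + (size_t)y * pitch);
        int x = 0;
        for (; x + 8 <= w; x += 8)
            _mm256_storeu_si256((__m256i *)(row + x), v);
        for (; x < w; x++)                             /* leftover pixels in the row */
            row[x] = color;
    }
}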

@itzpr3d4t0r
Member

itzpr3d4t0r commented Nov 21, 2024

Cool to see some research being done in this field. I've done some similar research in the past:

This has everything to do with the cache size and internal cache organization, but mainly size. I used your very same program but with bigger surfaces and got this:
[chart]

As you can see, the weird strategy is faster up until around 2000x2000 surfaces, which is 4M pixels, or in other words 16MB; multiply that by the 2 calls to .fill() and you get 32MB, which is exactly my L3 cache size. In your case the cutoff isn't exactly at cache size, but the cache is also used by other programs, and your system could have had more programs open at the time that consumed a bit more of it. SDL's implementation uses non-temporal memory accesses for this very reason: the processor won't assume your data will be reused later, it simply uses it in the calculation and can evict it from the cache on the very next clock, yielding better cache use (which is good for big surfaces).
This is an interesting read about nontemporal memory access.
Also this is another very good video explaining how algorithms and caches work: here

@Starbuck5
Member Author

> The program should write entire cache lines at a time using non-temporal stores. Writing partial cache lines may lead to severe performance degradations.

(From your link)
Well, this answers why SDL's implementation is doing four 128-bit stores per batch! 4 * 128 bits = 64 bytes, the typical size of a cache line. And maybe it also answers why the performance dips so much at less well-aligned widths!

I bet ya it is writing partial cache lines because it only runs inside each row of the surface individually. If a cache line crosses the end of a row, it does normal writes to catch up instead of being able to broadcast across rows (see the sketch below for the row structure being described).
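
To picture that row structure, here is a loose sketch modeled on (but not copied from) the linked SDL_fillrect.c: each row is filled in 64-byte batches of four 128-bit non-temporal stores, with scalar writes catching up at the unaligned head of the row and at the leftover tail.

#include <emmintrin.h>
#include <stdint.h>

static void fill_row_64byte_batches(uint32_t *row, int w, uint32_t color)
{
    int x = 0;
    /* scalar writes until the pointer is 16-byte aligned */
    while (x < w && (((uintptr_t)(row + x)) & 15) != 0)
        row[x++] = color;

    __m128i v = _mm_set1_epi32((int)color);
    /* 64 bytes (16 pixels) per iteration: four 128-bit streaming stores */
    for (; x + 16 <= w; x += 16) {
        _mm_stream_si128((__m128i *)(row + x),      v);
        _mm_stream_si128((__m128i *)(row + x + 4),  v);
        _mm_stream_si128((__m128i *)(row + x + 8),  v);
        _mm_stream_si128((__m128i *)(row + x + 12), v);
    }
    /* normal writes to catch up on whatever is left at the end of the row */
    for (; x < w; x++)
        row[x] = color;
    _mm_sfence();
}

On narrow or oddly sized surfaces, those head and tail scalar sections become a large fraction of each row, which would line up with the dips at less round widths.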

@Starbuck5
Member Author

Here is updated data with a quick AVX2 implementation that follows all of our typical strategies. I also realize that before I thought I was running 20k iterations when I was actually running 10k, so that is fixed as well.

[Chart: fill() seconds taken (10k repetitions), different strategies]
[Chart: fill() nanoseconds per pixel, different strategies (1)]

This AVX2 implementation is consistently faster than the standard implementation until 1100 by 1100 pixel Surfaces, but it follows the same trend as the prototype SUB-ADD demonstrated.

Also, with cache in mind and having looked into the resources you linked @itzpr3d4t0r, it's true that running something in a tight loop may not be the most representative benchmark, since typical programs will have lots of other things coming in and out of memory as well, with other blits and so on. Maybe that can be simulated by adding fills of several different surfaces to the loop.

@MyreMylar
Member

Talk of tight loop optimisation over fitting brings to mind again the need for a holistic performance testing program that emulates a typical pygame game application (or perhaps several such programs with different game styles). While it would not be that useful in testing an individual optimisation it might help track whether the combined weight of multiple optimisations is having an impact over time and serve as an important double-check on over-fitting to usages that might not occur very often in real programs.

It might also help us design better example programs for users to copy from if we discover that particular ways of building a game app structure work better together than others (e.g. keeping surfaces at certain dimensions to match typical cache sizes).

I guess for this issue, (and pygame-ce more generally) it would be good to experiment with the implications of aligned loads and stores in AVX2? Perhaps we could also dynamically switch from one strategy of filling to another depending on the surface size versus the cache size?
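
For what it's worth, the dynamic switch could be as simple as comparing the total number of bytes a fill will touch against a cache budget. A hedged sketch only; the two row-fill functions and the 8 MB budget are hypothetical, and SDL2 exposes SDL_GetCPUCacheLineSize() but has no query for L3 size, so the budget would have to be guessed or configured.

#include <SDL.h>
#include <stdint.h>
#include <stddef.h>

/* hypothetical row fillers, e.g. the cached AVX2 loop and the streaming SSE loop sketched above */
void fill_row_cached(uint32_t *row, int w, uint32_t color);
void fill_row_streaming(uint32_t *row, int w, uint32_t color);

static void fill_surface_adaptive(SDL_Surface *surf, uint32_t color)
{
    size_t fill_bytes = (size_t)surf->h * (size_t)surf->pitch;
    size_t cache_budget = 8u * 1024u * 1024u;   /* guessed cache budget, hypothetical */

    for (int y = 0; y < surf->h; y++) {
        uint32_t *row = (uint32_t *)((uint8_t *)surf->pixels + (size_t)y * surf->pitch);
        if (fill_bytes < cache_budget)
            fill_row_cached(row, surf->w, color);     /* small fill: let it stay hot in cache */
        else
            fill_row_streaming(row, surf->w, color);  /* large fill: bypass the cache */
    }
}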

@itzpr3d4t0r
Member

itzpr3d4t0r commented Nov 23, 2024

> I guess for this issue, (and pygame-ce more generally) it would be good to experiment with the implications of aligned loads and stores in AVX2? Perhaps we could also dynamically switch from one strategy of filling to another depending on the surface size versus the cache size?

A big issue is that SDL doesn’t align memory in a SIMD-friendly way. Their code checks for alignment on a specific boundary, and only when it’s aligned do they use vector loads/stores; otherwise, they fall back to sequential scalar operations.

In my particle manager module, I’ve tested using the SDL_SIMDAlloc function. It’s excellent for providing properly aligned data with extra padding at the end of the array, eliminating the need for masked or sequential operations when the array size isn’t a multiple of the vector width. This change allowed me to rely entirely on aligned operations and resulted in a roughly 50% performance boost in my particle update loop with floats. I was doing something like 6 aligned loads and 6 aligned stores per loop iteration (the code if you're interested).
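
For anyone who hasn't used it, here is a small sketch of the SDL_SIMDAlloc pattern in SDL2 (illustrative only, not the particle manager code): the allocation is aligned for the widest vector type the CPU supports and padded at the end, so a loop can use only full-width aligned stores even when the element count isn't a multiple of the vector width.

#include <SDL.h>
#include <immintrin.h>

int main(void)
{
    const size_t count = 1003;   /* deliberately not a multiple of 8 */
    float *data = (float *)SDL_SIMDAlloc(count * sizeof(float));
    if (!data)
        return 1;

    /* SDL pads the allocation, so the final partial vector can be written
     * with a full-width aligned store without running off the end. */
    __m256 v = _mm256_set1_ps(1.0f);
    for (size_t i = 0; i < count; i += 8)
        _mm256_store_ps(data + i, v);

    SDL_SIMDFree(data);
    return 0;
}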

> Talk of tight loop optimisation over fitting brings to mind again the need for a holistic performance testing program that emulates a typical pygame game application (or perhaps several such programs with different game styles). While it would not be that useful in testing an individual optimisation it might help track whether the combined weight of multiple optimisations is having an impact over time and serve as an important double-check on over-fitting to usages that might not occur very often in real programs.

In terms of cache utilization, imo real programs don't change much, simply because running stuff from Python is extremely slow from the processor's point of view, and going from function call to function call, or even from one line to the next, is definitely going to flush much of the cache, since caches are generally LRU-based and there's always the Python interpreter running under the hood.

> (From your link)
> Well this answers why SDL's implementation is doing four 128 bit stores per batch! 4 * 128 bits = 64 bytes, the typical size of a cache line. And maybe it also answers why the performance dips so much in less well aligned regions!
>
> I bet ya it is doing partial cache lines because it's only running inside each row of the surface individually. If the cache line crosses the row it's doing normal writes to catch up instead of being able to broadcast across rows.

About these yes that's the reason. I think SDL opting for this strategy is a direct consequence of not knowing what cache amount or cache hierarchy the code will be run on.
Modern processors have much faster and bigger caches and stuff like memcpy or memmove take advantage of this and it shows in the graphs, but they can only do so as long as inside cache size boundaries; outside that performance starts to dip.

As I've stated before, I think pygame could use a specific drawing mode to take advantage of the hardware and use faster blits or fills in those situations, with something like a custom flag. At very small surface sizes though (1-5 pixels), parsing lists and the multiple checks we run on surfaces are what hinder performance the most.

@Starbuck5
Member Author

Starbuck5 commented Nov 24, 2024

On further investigation of SDL's algorithm, they're not cache-line aligning themselves, which makes sense because they are using non-temporal stores. I now believe the 64-bytes-at-once strategy is merely a loop unrolling strategy.

> Talk of tight loop optimisation over fitting brings to mind again the need for a holistic performance testing program that emulates a typical pygame game application (or perhaps several such programs with different game styles). While it would not be that useful in testing an individual optimisation it might help track whether the combined weight of multiple optimisations is having an impact over time and serve as an important double-check on over-fitting to usages that might not occur very often in real programs.

I think the way to go is to grab a bunch of random games (with author permission) and turn them into benchmarks. I would also like a common catalog that could be run on pygame or pygame-ce for performance comparison. It's just hard to convert a game into a benchmark.

> At very small surface sizes though (1-5 pixels) parsing lists and the multiple checks we run on surfaces are what hinders performance the most.

I think that some of the surface prepping / locking stuff we are doing is unnecessary, but it would be very difficult to decisively prove that, so it's hard to remove anything in that area given safety concerns.

I have a patch for smoothscale locally that uses SDL_SIMDAlloc; I need to polish it up for a PR. My dilemma is that the allocations in the scaling routines don't have any error reporting mechanism right now, and it may be quite challenging to add one, given all the ancient paths. So maybe I will ignore it for now.

Anyway, I've prepared a minimal patch for SDL that helps a lot with these "full surface fills", smoothing away the hard performance cliffs I see when the surface width isn't a multiple of 64 bytes. I need to do more testing before I send it off to them though.

[chart]

[Chart: fill() seconds taken (10k repetitions), different strategies (1)]

I also tested in Fluffy's game Wandering Soul, which I have converted into a benchmark by turning off player-object collision, teleporting the player into level 2 at the start, and automatically exiting after 30 seconds. He uses a 900 by 600 resolution. I see a 30% fill performance improvement in profiling the game.

@Starbuck5
Member Author

The SDL patch I came up with was inspired by Myre's work on grayscale_avx2 last year, with the "batches" strategy.
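
To illustrate the general direction a batches strategy points in for full surface fills (a hypothetical sketch, not the actual patch): when a surface has no row padding (pitch == w * 4), the whole pixel buffer is one contiguous run, so the 64-byte batches never have to stop and restart at row boundaries.

#include <stdint.h>
#include <stddef.h>

/* from the earlier per-row sketch */
void fill_row_64byte_batches(uint32_t *row, int w, uint32_t color);

static void fill_surface_contiguous(uint8_t *pixels, int pitch, int w, int h,
                                    uint32_t color)
{
    if (pitch == w * 4) {
        /* no padding: treat width * height pixels as one long run */
        fill_row_64byte_batches((uint32_t *)pixels, w * h, color);
    } else {
        /* padded rows: fall back to filling row by row */
        for (int y = 0; y < h; y++)
            fill_row_64byte_batches((uint32_t *)(pixels + (size_t)y * pitch), w, color);
    }
}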
