Performance analysis of surface.fill #3227
Comments
For all the data I gathered, I haven't actually tried to make a replacement for SDL_FillRect() using our SIMD strategies yet. I see that all of our filler code has to load and unpack the source surface, which a replacement for SDL_FillRect() wouldn't need to do; it would just need to broadcast the fill color. Maybe the loading and unpacking is what throws our performance off a cliff with large surfaces versus theirs (which does no loading and unpacking because it doesn't need to!).
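For concreteness, a broadcast-only AVX2 fill along those lines might look like the sketch below. This is not pygame-ce's actual code; a 32-bit pixel format and a row pitch in bytes are assumed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Broadcast-only fill sketch: no loading or unpacking of existing
 * pixels, just splat the color into a register and store it over and
 * over. Assumes 32-bit pixels; `pitch` is the row stride in bytes. */
static void fill_rect_avx2(uint8_t *pixels, int pitch, int w, int h,
                           uint32_t color)
{
    __m256i v = _mm256_set1_epi32((int)color);   /* color in all 8 lanes */
    for (int y = 0; y < h; y++) {
        uint32_t *row = (uint32_t *)(pixels + (size_t)y * (size_t)pitch);
        int x = 0;
        for (; x + 8 <= w; x += 8)               /* 8 pixels per store */
            _mm256_storeu_si256((__m256i *)(row + x), v);
        for (; x < w; x++)                       /* scalar tail */
            row[x] = color;
    }
}
```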
Cool to see some research being done in this field. I've done some similar research in the past:
This has everything to do with cache size and internal cache organization, but mainly size. I used your very same program but with bigger surfaces and got this:
(From your link) I bet ya it is doing partial cache lines because it's only running inside each row of the surface individually. If the cache line crosses the row boundary, it does normal writes to catch up instead of being able to broadcast across rows.
Here is updated data with a quick AVX2 implementation that follows all of our typical strategies. I also realized that I previously thought I was running 20k iterations when I was actually running 10k, so that is fixed as well. This AVX2 implementation is consistently faster than the standard implementation until around 1100 by 1100 pixel surfaces, but it follows the same trend the prototype SUB-ADD demonstrated. Also, with cache in mind and having looked into the resources you linked @itzpr3d4t0r, it is also true that running something in a tight loop may not be the most representative benchmark, since typical programs will have lots of other things moving in and out of memory as well, with other blits and so on. Maybe that can be simulated by adding more fill() calls on different surfaces to the loop.
Talk of tight-loop optimisation and over-fitting brings to mind again the need for a holistic performance testing program that emulates a typical pygame game application (or perhaps several such programs with different game styles). While it would not be that useful for testing an individual optimisation, it might help track whether the combined weight of multiple optimisations is having an impact over time, and serve as an important double-check against over-fitting to usages that might not occur very often in real programs. It might also help us design better example programs for users to copy from, if we discover that particular ways of structuring a game app work better together than others (e.g. keeping surfaces at certain dimensions to match typical cache sizes). I guess for this issue (and pygame-ce more generally) it would be good to experiment with the implications of aligned loads and stores in AVX2? Perhaps we could also dynamically switch from one fill strategy to another depending on the surface size versus the cache size, as in the sketch below?
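A minimal sketch of what that last idea could look like; everything here is hypothetical (the function names, the hard-coded cache size, and the halving threshold are all made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache threshold; a real implementation would query the
 * last-level cache size at startup (e.g. via CPUID) instead. */
#define LLC_BYTES (8u * 1024u * 1024u)

/* Stand-in strategies: scalar loops here, but imagine cached SIMD
 * stores vs. non-temporal streaming stores. */
static void fill_cached(uint32_t *px, size_t n, uint32_t c)
{
    for (size_t i = 0; i < n; i++) px[i] = c;
}
static void fill_streaming(uint32_t *px, size_t n, uint32_t c)
{
    for (size_t i = 0; i < n; i++) px[i] = c;
}

/* Dispatch on fill size: small fills stay in cache (the pixels may be
 * read again soon), huge fills stream so they don't evict everything. */
static void fill_dispatch(uint32_t *px, size_t n, uint32_t c)
{
    if (n * sizeof(uint32_t) < LLC_BYTES / 2)
        fill_cached(px, n, c);
    else
        fill_streaming(px, n, c);
}
```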
A big issue is that SDL doesn’t align memory in a SIMD-friendly way. Their code checks for alignment on a specific boundary, and only when it’s aligned do they use vector loads/stores; otherwise, they fall back to sequential scalar operations. In my particle manager module, I’ve tested using the SDL_SIMDAlloc function. It’s excellent for providing properly aligned data with extra padding at the end of the array, eliminating the need for masked or sequential operations when the array size isn’t a multiple of the vector width. This change allowed me to rely entirely on aligned operations and resulted in a roughly 50% performance boost in my particle update loop with floats. I was doing something like 6 aligned loads and 6 aligned stores per loop iteration (the code if you're interested).
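For reference, the pattern looks roughly like the sketch below; this is not the actual particle-manager code, just a minimal illustration of SDL_SIMDAlloc's guarantee (aligned memory plus end padding, so the vector loop can safely overrun the element count instead of needing a masked or scalar tail):

```c
#include <SDL.h>
#include <immintrin.h>

/* Minimal illustration of the SDL_SIMDAlloc pattern (hypothetical
 * update loop, not the real particle code). The returned block is
 * aligned for the best available SIMD instructions and padded at the
 * end, so the final iteration may safely run past `count`. */
static void update_positions(size_t count, float dt)
{
    float *pos = (float *)SDL_SIMDAlloc(count * sizeof(float));
    if (!pos)
        return;

    for (size_t i = 0; i < count; i += 8) {   /* 8 floats per AVX register */
        __m256 p = _mm256_load_ps(pos + i);   /* aligned load */
        p = _mm256_add_ps(p, _mm256_set1_ps(dt));
        _mm256_store_ps(pos + i, p);          /* aligned store */
    }

    SDL_SIMDFree(pos);
}
```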
In terms of cache utilization, imo real programs don't change the picture much, simply because running stuff from Python is extremely slow from the processor's point of view, and going from function call to function call, or even from one line to the next, is definitely going to flush much of the cache, since it's generally LRU-based and there's always the Python interpreter running under the hood.
About these: yes, that's the reason. I think SDL opting for this strategy is a direct consequence of not knowing what cache size or cache hierarchy the code will run on. As I've stated before, I think pygame could use a specific drawing mode to take advantage of the hardware and use faster blits or fills in those situations, with something like a custom flag. At very small surface sizes though (1-5 pixels), parsing lists and the multiple checks we run on surfaces are what hinder performance the most.
On further investigation of SDL's algorithm, they're not aligning to cache lines themselves, which makes sense because they are using non-temporal stores. I now believe the 64-bytes-at-once strategy is merely loop unrolling.
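To make that concrete, here is a guess at the general shape of such a loop (a sketch, not SDL's actual code; it uses _mm_stream_si128, the integer counterpart of the _mm_stream_ps discussed in the analysis below):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sketch of an unrolled non-temporal fill (a guess at the shape, not
 * SDL's actual code). Streaming stores bypass the cache entirely, so
 * cache-line alignment buys nothing; writing 64 bytes per iteration is
 * simply 4x loop unrolling. `p` must be 16-byte aligned. */
static void fill_stream_64(uint32_t *p, size_t n_pixels, uint32_t color)
{
    __m128i v = _mm_set1_epi32((int)color);
    size_t i = 0;
    for (; i + 16 <= n_pixels; i += 16) {          /* 16 pixels = 64 bytes */
        _mm_stream_si128((__m128i *)(p + i), v);
        _mm_stream_si128((__m128i *)(p + i + 4), v);
        _mm_stream_si128((__m128i *)(p + i + 8), v);
        _mm_stream_si128((__m128i *)(p + i + 12), v);
    }
    for (; i < n_pixels; i++)                      /* scalar tail */
        p[i] = color;
    _mm_sfence();  /* order the streaming stores before normal access */
}
```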
I think the way to go is to grab a bunch of random games (with author permission) and turn them into benchmarks. I would also like a common catalog that could be run on both pygame and pygame-ce for performance comparison. It's just hard to convert a game into a benchmark.
I think that some of the surface prepping/locking stuff we are doing is unnecessary, but it would be very difficult to decisively prove that, so it's hard to remove anything in that area given safety concerns. I have a patch for smoothscale locally that uses SDL_SIMDAlloc; I need to polish it up for a PR. My dilemma is that the allocations in the scaling routines don't have any error reporting mechanism right now, and it may be quite challenging to add one, given all the ancient paths. So maybe I will ignore that for now. Anyway, I've prepared a minimal patch for SDL that helps a lot with these "full surface fills", smoothing away the hard performance cliffs I see when the surface width isn't a multiple of 64 bytes. I need to do more testing before I send it off to them, though. I also tested in Fluffy's game Wandering Soul, which I converted into a benchmark by turning off player-object collision, teleporting the player into level 2 at the start, and automatically exiting after 30 seconds. It uses a 900 by 600 resolution. I see a 30% fill performance improvement when profiling the game.
The SDL patch I came up with was inspired by Myre's work on grayscale_avx2 last year, with the "batches" strategy.
Introduction
Last week I went and profiled a handful of random games, mostly from pygame community game jams. One thing I noticed is that fill was often one of the higher-ranked pygame-ce functions in terms of runtime. Not as expensive as blits or display.flip or anything, but it tends to be up there. That surprised me; it doesn't seem like it should be that expensive.
I wondered if pygame-ce's internal fill routines could do a better job than SDL's, so I came up with my own alternative to surface.fill(): the two-pass "SUB-ADD" prototype referred to in the comments above (two blend-flag fill() calls in a row).
And I found that even doing these 2 function calls was faster than a normal fill! Amazing! If going through twice is faster, we could easily make our special_flags=0 routine take over fill() and do it more efficiently, showing a noticeable speed improvement in a typical game.
However, the story is not that simple. I tested a larger surface, and SDL was now faster. What gives? Why is SDL better at large surfaces while we are better at small ones, and is there anything we can contribute to them, or learn for ourselves, from this?
Data
Raw data: https://docs.google.com/spreadsheets/d/1WBCVvzkL9HAZJ7Yo1N86-tAFhAP0d2J72Wcp8mveCl4/edit?gid=2144097095#gid=2144097095
Benchmarking script
Analysis
SDL's fill uses non-temporal streaming stores (_mm_stream_ps), while our SIMD code uses ordinary unaligned stores (_mm_storeu_si128, _mm256_storeu_si256).

This is a very open-ended issue; I mainly want to bring up what I've found to those who might also be interested. Potentially we can contribute something to SDL, or learn something from their strategy to improve our own.
@MyreMylar @itzpr3d4t0r