cBournhonesque
Contributor

@cBournhonesque cBournhonesque commented Jul 18, 2024

Objective

As described in the relations RFC: https://github.com/james-j-obrien/rfcs/blob/minimal-fragmenting-relationships/rfcs/79-minimal-fragmenting-relationships.md#access-bitsets-and-component-sparsesets

The reason the access bitsets are currently efficient is that the ComponentIds are dense: they are incremented from 0 and should remain small.
With relations, a ComponentId will have some of its upper bits set to 1, so the amount of memory allocated in the FixedBitSet to represent the value would be non-trivial.
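
To make the memory concern concrete, here is a rough sketch of how a dense bitset's size grows with the largest index it must hold. The 64-bit block size and the `hypothetical_relation_id` layout (a flag in the top bit) are assumptions for illustration only; the RFC's actual encoding and FixedBitSet's internal block size may differ.

```rust
/// Bytes a dense bitset needs so that bit `max_index` can be set,
/// assuming 64-bit blocks (an assumption, not FixedBitSet's documented layout).
fn dense_bitset_bytes(max_index: u64) -> u64 {
    let blocks = max_index / 64 + 1;
    blocks * 8
}

/// Bytes a sorted vector of `len` ids needs, assuming 8-byte ids.
fn sorted_vec_bytes(len: u64) -> u64 {
    len * 8
}

/// Hypothetical relation id: a flag in the top bit plus a target in the low bits.
/// This layout is invented for the example.
fn hypothetical_relation_id(target: u32) -> u64 {
    (1u64 << 63) | target as u64
}
```

Under these assumptions, a bitset covering ids up to 199 needs 32 bytes, while one that must reach an id with the top bit set needs more than 2^59 bytes. A sorted vector holding two such ids needs 16 bytes regardless of how large the ids are.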

Solution

One way to fix the issue would be to replace the FixedBitSets with sorted vectors.
The vectors should remain relatively small since queries usually don't involve hundreds of components.

These allow to do union, difference, intersection operations in O(1).
(however inserting a value is O(n))
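
A minimal sketch of the idea, assuming a plain `Vec<usize>` kept sorted and duplicate-free. `SortedVecSet` is a hypothetical name for this example and is not the type used in the PR. Note that in this sketch a fully materialized union or disjointness check is a single pass over both inputs, so it costs O(n + m) rather than constant time.

```rust
use std::cmp::Ordering;

/// Hypothetical sorted-vector set of ids (illustration only).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct SortedVecSet {
    ids: Vec<usize>, // invariant: sorted ascending, no duplicates
}

impl SortedVecSet {
    /// O(log n) search, then an O(n) shift to keep the vector sorted.
    /// Returns false if the id was already present.
    fn insert(&mut self, id: usize) -> bool {
        match self.ids.binary_search(&id) {
            Ok(_) => false,
            Err(pos) => {
                self.ids.insert(pos, id);
                true
            }
        }
    }

    /// O(log n) membership test.
    fn contains(&self, id: usize) -> bool {
        self.ids.binary_search(&id).is_ok()
    }

    /// Merge-style union: one pass over both inputs.
    fn union(&self, other: &Self) -> Self {
        let (a, b) = (&self.ids, &other.ids);
        let mut out = Vec::with_capacity(a.len() + b.len());
        let (mut i, mut j) = (0, 0);
        while i < a.len() && j < b.len() {
            match a[i].cmp(&b[j]) {
                Ordering::Less => { out.push(a[i]); i += 1; }
                Ordering::Greater => { out.push(b[j]); j += 1; }
                Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
            }
        }
        out.extend_from_slice(&a[i..]);
        out.extend_from_slice(&b[j..]);
        SortedVecSet { ids: out }
    }

    /// True when the two sets share no id; stops at the first common id.
    fn is_disjoint(&self, other: &Self) -> bool {
        let (a, b) = (&self.ids, &other.ids);
        let (mut i, mut j) = (0, 0);
        while i < a.len() && j < b.len() {
            match a[i].cmp(&b[j]) {
                Ordering::Less => i += 1,
                Ordering::Greater => j += 1,
                Ordering::Equal => return false,
            }
        }
        true
    }
}
```

The size of the set depends only on how many ids it holds, not on how large they are, which is the property that matters for ids with high bits set.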

Testing

Ran the running_systems benchmarks

group                                   main                                   pr
-----                                   ----                                   --
busy_systems/01x_entities_03_systems    1.00     27.7±3.44µs        ? ?/sec    1.02     28.2±5.25µs        ? ?/sec
busy_systems/01x_entities_06_systems    1.00     42.3±1.46µs        ? ?/sec    1.11     47.0±2.95µs        ? ?/sec
busy_systems/01x_entities_09_systems    1.08    67.3±13.90µs        ? ?/sec    1.00     62.1±1.42µs        ? ?/sec
busy_systems/01x_entities_12_systems    1.00     77.9±1.76µs        ? ?/sec    1.08    83.8±14.63µs        ? ?/sec
busy_systems/01x_entities_15_systems    1.00     94.2±1.92µs        ? ?/sec    1.08    101.8±5.94µs        ? ?/sec
busy_systems/02x_entities_03_systems    1.00     42.4±0.86µs        ? ?/sec    1.28    54.1±15.42µs        ? ?/sec
busy_systems/02x_entities_06_systems    1.00     74.0±3.57µs        ? ?/sec    1.15     85.0±5.44µs        ? ?/sec
busy_systems/02x_entities_09_systems    1.00    109.8±1.81µs        ? ?/sec    1.05    114.9±5.82µs        ? ?/sec
busy_systems/02x_entities_12_systems    1.00    142.3±1.42µs        ? ?/sec    1.11   157.5±27.83µs        ? ?/sec
busy_systems/02x_entities_15_systems    1.01   184.3±47.23µs        ? ?/sec    1.00   182.3±10.41µs        ? ?/sec
busy_systems/03x_entities_03_systems    1.00     59.7±0.74µs        ? ?/sec    1.06     63.0±5.44µs        ? ?/sec
busy_systems/03x_entities_06_systems    1.00     98.2±0.59µs        ? ?/sec    1.12    110.0±5.41µs        ? ?/sec
busy_systems/03x_entities_09_systems    1.00   156.0±12.20µs        ? ?/sec    1.06   165.5±37.93µs        ? ?/sec
busy_systems/03x_entities_12_systems    1.00   207.1±11.09µs        ? ?/sec    1.05   217.1±33.47µs        ? ?/sec
busy_systems/03x_entities_15_systems    1.00    252.2±1.91µs        ? ?/sec    1.20  301.6±162.98µs        ? ?/sec
busy_systems/04x_entities_03_systems    1.00     75.2±3.92µs        ? ?/sec    1.03     77.2±5.84µs        ? ?/sec
busy_systems/04x_entities_06_systems    1.00    127.2±2.50µs        ? ?/sec    1.14   145.0±16.45µs        ? ?/sec
busy_systems/04x_entities_09_systems    1.00    200.8±1.37µs        ? ?/sec    1.12   224.2±69.54µs        ? ?/sec
busy_systems/04x_entities_12_systems    1.00   277.7±16.88µs        ? ?/sec    1.02   282.3±16.00µs        ? ?/sec
busy_systems/04x_entities_15_systems    1.02   332.8±10.41µs        ? ?/sec    1.00    326.3±7.38µs        ? ?/sec
busy_systems/05x_entities_03_systems    1.04     97.4±4.89µs        ? ?/sec    1.00     93.3±3.49µs        ? ?/sec
busy_systems/05x_entities_06_systems    1.00    159.0±4.15µs        ? ?/sec    1.09    173.1±3.51µs        ? ?/sec
busy_systems/05x_entities_09_systems    1.00    251.8±6.01µs        ? ?/sec    1.00   252.3±13.01µs        ? ?/sec
busy_systems/05x_entities_12_systems    1.06   352.4±25.20µs        ? ?/sec    1.00   332.7±37.06µs        ? ?/sec
busy_systems/05x_entities_15_systems    1.00   415.6±11.30µs        ? ?/sec    1.03   426.7±54.68µs        ? ?/sec
contrived/01x_entities_03_systems       1.00     16.0±0.36µs        ? ?/sec    1.04     16.8±1.12µs        ? ?/sec
contrived/01x_entities_06_systems       1.00     28.5±1.21µs        ? ?/sec    1.00     28.4±0.42µs        ? ?/sec
contrived/01x_entities_09_systems       1.00     40.9±9.31µs        ? ?/sec    1.03     42.0±5.06µs        ? ?/sec
contrived/01x_entities_12_systems       1.00     50.5±0.77µs        ? ?/sec    1.08     54.4±1.75µs        ? ?/sec
contrived/01x_entities_15_systems       1.00     65.1±3.88µs        ? ?/sec    1.01     65.6±1.89µs        ? ?/sec
contrived/02x_entities_03_systems       1.00     25.4±1.16µs        ? ?/sec    1.01     25.7±0.58µs        ? ?/sec
contrived/02x_entities_06_systems       1.00     44.8±1.77µs        ? ?/sec    1.04     46.5±4.55µs        ? ?/sec
contrived/02x_entities_09_systems       1.04     62.7±4.50µs        ? ?/sec    1.00     60.6±0.91µs        ? ?/sec
contrived/02x_entities_12_systems       1.00     76.0±2.91µs        ? ?/sec    1.03     78.1±7.32µs        ? ?/sec
contrived/02x_entities_15_systems       1.00     89.9±3.65µs        ? ?/sec    1.07     96.0±5.51µs        ? ?/sec
contrived/03x_entities_03_systems       1.00     33.7±1.93µs        ? ?/sec    1.10     37.0±3.53µs        ? ?/sec
contrived/03x_entities_06_systems       1.00     58.7±1.16µs        ? ?/sec    1.02     59.6±4.95µs        ? ?/sec
contrived/03x_entities_09_systems       1.08    89.1±55.77µs        ? ?/sec    1.00     82.2±6.06µs        ? ?/sec
contrived/03x_entities_12_systems       1.00    105.6±4.30µs        ? ?/sec    1.01    106.4±6.91µs        ? ?/sec
contrived/03x_entities_15_systems       1.00    125.3±6.35µs        ? ?/sec    1.01    126.9±7.29µs        ? ?/sec
contrived/04x_entities_03_systems       1.00    48.1±13.25µs        ? ?/sec    1.03    49.4±27.36µs        ? ?/sec
contrived/04x_entities_06_systems       1.00     70.3±5.20µs        ? ?/sec    1.06     74.6±5.89µs        ? ?/sec
contrived/04x_entities_09_systems       1.00   103.4±17.56µs        ? ?/sec    1.15   118.5±27.34µs        ? ?/sec
contrived/04x_entities_12_systems       1.00    128.5±2.82µs        ? ?/sec    1.02    131.4±3.46µs        ? ?/sec
contrived/04x_entities_15_systems       1.03    156.9±9.78µs        ? ?/sec    1.00    152.5±2.27µs        ? ?/sec
contrived/05x_entities_03_systems       1.01     52.0±2.36µs        ? ?/sec    1.00     51.4±0.86µs        ? ?/sec
contrived/05x_entities_06_systems       1.00     84.3±4.84µs        ? ?/sec    1.05    88.2±18.08µs        ? ?/sec
contrived/05x_entities_09_systems       1.00    118.9±7.57µs        ? ?/sec    1.02    120.8±3.71µs        ? ?/sec
contrived/05x_entities_12_systems       1.01   150.8±11.52µs        ? ?/sec    1.00    150.0±4.43µs        ? ?/sec
contrived/05x_entities_15_systems       1.00    183.7±4.40µs        ? ?/sec    1.02    186.6±3.80µs        ? ?/sec

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged labels Jul 18, 2024
@hymm
Contributor

hymm commented Jul 25, 2024

I don't think the existing benchmarks are a good test of this PR, as they don't use a lot of archetype components. We need a benchmark that creates hundreds to thousands of them, or maybe more, depending on how many active ones we expect with relations.

@cBournhonesque
Contributor Author

> I don't think the existing benchmarks are a good test of this PR, as they don't use a lot of archetype components. We need a benchmark that creates hundreds to thousands of them, or maybe more, depending on how many active ones we expect with relations.

I'm not sure that to merge this we need a benchmark with tons of archetype components. My thought process is more like: "the current design won't be sustainable with relations, so we want to replace it with something that has equivalent performance for the current usage pattern of bevy". So we need to prove that the change is acceptable for the current kind of queries I think?

Also, even with relations, systems would probably have the same amount of access, even though the ComponentIds themselves can get larger.
You probably wouldn't have systems with tons of queries, but instead systems like Query<&Color, HasPlanet<Mars>>, which doesn't have a ton of archetype components, but instead has a high ComponentId for HasPlanet<Mars>.

That being said, more useful benchmarks are always welcome.

@cBournhonesque cBournhonesque marked this pull request as ready for review August 26, 2024 17:31
@cart
Member

cart commented Aug 26, 2024

> I'm not sure that to merge this we need a benchmark with tons of archetype components. My thought process is more like: "the current design won't be sustainable with relations, so we want to replace it with something that has equivalent performance for the current usage pattern of bevy". So we need to prove that the change is acceptable for the current kind of queries I think?

Given how "hot" access is, I think we absolutely need benchmarks here. The current multithreaded executor was implemented under the assumption that constructing and comparing access was very cheap.

Ex: the active_access field is rebuilt multiple (potentially many) times per frame. And that particular Access is potentially very large depending on the number of systems being run. I can see this being prohibitively expensive.
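
To illustrate why a rebuild could become expensive with sorted vectors, here is a rough sketch. It assumes each running system's access is a sorted, duplicate-free `Vec<usize>` of ids. The names `merge_sorted`, `rebuild_active_access` and `conflicts` below are hypothetical stand-ins, not the executor's actual code, and the read/write distinction is ignored for brevity.

```rust
/// Merge two sorted, duplicate-free id lists in one pass: O(n + m).
fn merge_sorted(a: &[usize], b: &[usize]) -> Vec<usize> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] < b[j] {
            out.push(a[i]);
            i += 1;
        } else if b[j] < a[i] {
            out.push(b[j]);
            j += 1;
        } else {
            out.push(a[i]);
            i += 1;
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

/// Hypothetical rebuild: union the access of every running system.
/// Each step walks the whole accumulated list again, so the cost grows
/// with both the number of systems and the size of the combined access.
fn rebuild_active_access(running: &[Vec<usize>]) -> Vec<usize> {
    let mut active = Vec::new();
    for system_access in running {
        active = merge_sorted(&active, system_access);
    }
    active
}

/// Simplified conflict check: any shared id counts as a conflict.
fn conflicts(active: &[usize], candidate: &[usize]) -> bool {
    candidate.iter().any(|id| active.binary_search(id).is_ok())
}
```

In this sketch every rebuild re-reads all ids accumulated so far, which is the kind of per-frame cost the benchmarks would need to measure.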

> These allow to do union, difference, intersection operations in O(1).

Calling this O(1) is a bit of a stretch, as something like a union with fixedbitsets actually took O(1) within a block, whereas a union with the new approach is very clearly O(N). The O(1) is only in reference to the construction of the iterator. Resolving it is O(N).
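
The distinction can be sketched with a lazy union over two sorted slices. `LazyUnion` and `lazy_union` are hypothetical names for this example and do not come from the PR. Constructing the iterator touches no elements, while consuming it visits every element of both inputs once.

```rust
use std::iter::Peekable;
use std::slice::Iter;

/// Lazy union over two sorted, duplicate-free slices (illustration only).
struct LazyUnion<'a> {
    a: Peekable<Iter<'a, usize>>,
    b: Peekable<Iter<'a, usize>>,
}

/// Constant-time construction: only two iterators are created.
fn lazy_union<'a>(a: &'a [usize], b: &'a [usize]) -> LazyUnion<'a> {
    LazyUnion {
        a: a.iter().peekable(),
        b: b.iter().peekable(),
    }
}

impl<'a> Iterator for LazyUnion<'a> {
    type Item = usize;

    /// Each call advances one or both inputs, so draining the iterator
    /// costs O(n + m) in total.
    fn next(&mut self) -> Option<usize> {
        match (self.a.peek(), self.b.peek()) {
            (Some(&&x), Some(&&y)) => {
                if x <= y {
                    self.a.next();
                }
                if y <= x {
                    self.b.next();
                }
                Some(x.min(y))
            }
            (Some(_), None) => self.a.next().copied(),
            (None, _) => self.b.next().copied(),
        }
    }
}
```

Collecting the result, for example with `lazy_union(&a, &b).collect::<Vec<_>>()`, is where the linear work happens.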

Contributor

@Trashtalk217 Trashtalk217 left a comment

This looks good.

The only thing I can recommend is to maybe extract the SortedSmallVec data structure into a separate file (maybe in bevy_utils).

@cBournhonesque
Contributor Author

> > I'm not sure that to merge this we need a benchmark with tons of archetype components. My thought process is more like: "the current design won't be sustainable with relations, so we want to replace it with something that has equivalent performance for the current usage pattern of bevy". So we need to prove that the change is acceptable for the current kind of queries I think?
>
> Given how "hot" access is, I think we absolutely need benchmarks here. The current multithreaded executor was implemented under the assumption that constructing and comparing access was very cheap.
>
> Ex: the active_access field is rebuilt multiple (potentially many) times per frame. And that particular Access is potentially very large depending on the number of systems being run. I can see this being prohibitively expensive.
>
> > These allow to do union, difference, intersection operations in O(1).
>
> Calling this O(1) is a bit of a stretch, as something like a union with fixedbitsets actually took O(1) within a block, whereas a union with the new approach is very clearly O(N). The O(1) is only in reference to the construction of the iterator. Resolving it is O(N).

Would you like to see benchmarks focused on Access operations directly?

@Trashtalk217
Contributor

Trashtalk217 commented Aug 26, 2024

Also, with regards to performance: More benchmarks are always nice, but is it also possible to see a slowdown in some of the examples? Maybe with regards to fps in complicated render scenes?

@cart
Member

cart commented Aug 26, 2024

> Would you like to see benchmarks focused on Access operations directly?

I think the highest priority is seeing benchmarks of the executor running many systems with many component accesses. I'd like both a small bevy_ecs-scoped executor benchmark that generates thousands of components used by hundreds of systems. The challenge here is ensuring rebuild_active_access is actually getting called in a way that is reflective of real apps.

I'd also like to see if this has measurable effects on frame time in full bevy apps. "Big scenes" matter less than "executes systems in parallel that in combination reference many components". So even something like 3d_scene should be sufficient, as that will run all of the built in bevy systems.

@cBournhonesque
Contributor Author

cBournhonesque commented Aug 27, 2024

> > Would you like to see benchmarks focused on Access operations directly?
>
> I think the highest priority is seeing benchmarks of the executor running many systems with many component accesses. I'd like both a small bevy_ecs-scoped executor benchmark that generates thousands of components used by hundreds of systems. The challenge here is ensuring rebuild_active_access is actually getting called in a way that is reflective of real apps.
>
> I'd also like to see if this has measurable effects on frame time in full bevy apps. "Big scenes" matter less than "executes systems in parallel that in combination reference many components". So even something like 3d_scene should be sufficient, as that will run all of the built in bevy systems.

Here are my results on 3d_scene (red is the PR, yellow is main):
image

Not much difference overall; the median time is very similar.

Comparing the multithreaded executor span:
image

There is a sizable difference: the PR version is about twice as slow. However, the overall difference adds up to less than 1µs.
On main, the executor accounts for 8% of total time:
image
so it's not insignificant by any means. We will still need to move away from the FixedBitSets if we want to have relations, though.

In the PR, the executor accounts for 9.87% of total time:
image

@hymm
Contributor

hymm commented Aug 27, 2024

You might have done it, but when you benchmark 3d_scene, you should change the present mode to immediate. That won't really change the multithreaded span, but should change the frame time significantly

github-merge-queue bot pushed a commit that referenced this pull request Dec 10, 2024
# Objective

We currently have no benchmarks for large worlds with many entities,
components and systems.
Having a benchmark for a world with many components is especially useful
for the performance improvements needed for relations. This is also a
response to this [comment from
cart](#14385 (comment)).

> I'd like both a small bevy_ecs-scoped executor benchmark that
generates thousands of components used by hundreds of systems.

## Solution

I use dynamic components and components to construct a benchmark with
2000 components, 4000 systems, and 10000 entities.

## Some notes

- ~I use a lot of random entities, which creates unpredictable
performance, I should use a seeded PRNG.~
- Not entirely sure if everything is run concurrently at the moment. And
there are many conflicts, meaning there's probably a lot of
first-come-first-serve going on. Not entirely sure if these benchmarks
are very reproducible.
- Maybe add some more safety comments
- Also component_reads_and_writes() is about to be deprecated #16339,
but there's no other way to currently do what I'm trying to do.

---------

Co-authored-by: Chris Russell <[email protected]>
Co-authored-by: BD103 <[email protected]>
BD103 added a commit to BD103/bevy that referenced this pull request Dec 10, 2024
ecoskey pushed a commit to ecoskey/bevy that referenced this pull request Jan 6, 2025