Never Flush Entities #18577

ElliottjPierce · 2025-03-27T19:17:55Z

ElliottjPierce
Mar 27, 2025
Collaborator

Background

This is heavily motivated by remote entity reservation, but much of it can be applied without that context.

How `Entities` works now

Right now, there are 2 parts of Entities. First, we keep a Vec<EntityMeta> that is the source of truth for where an entity is, who freed it, etc. Second, we keep a pending list that tracks which entities no longer exist and can be reused.

Allocating and freeing entities are simple. To allocate, try to pull from the pending list, and if it's empty, extend the meta list, and return a brand new entity. Once allocated, set the EntityLocation of the entity before releasing it though any public API. To free, make sure the entity exists, increase its generation, and add it to the pending list.

We also need to reserve entities, currently via &Entities but soon via an Arc-like structure that can be shared between threads and operate fully concurrently with &mut Entities. Right now, (with &Entities), We keep a free_cursor: AtomicISize that tracks the target length of pending. When we reserve, we decrement the free_cursor as if we were popping from the list (just like an allocation). If free_cursor goes negative, we just need to extend the meta list by its absolute value. In other words, if we do a n = free_cursor.fetch_sub(1) , if n >= 0, we reserved the pending[n] entity, otherwise we reserved the brand new entity at index meta.len() -n - 1. Now the reservation has given us an Entity. But right now, it's invalid; it has no EntityLocation. That's where flush comes in. flush sets the EntityLocation of newly reserved entities to point to the empty archetype.

What's wrong with this

Even aside from remote entity reservation, a lot could be improved. First, let's clarify exactly what flush does. It removes reserved entities from the pending list, creates new, reserved entities on the meta list, and performs an init call on all of them. Technically, init is kept generic, but in practice, it always sets the entity location to the empty archetype. So what does that look like? We push the entity onto the Archetype::entities: Vec<ArchetypeEntity>. ArchetypeEntity is just an Entity and it's TableRow.

So let me get this strait. Let's say we use commands to spawn 100 entities with MyComponent. And lets say there are some pending entities to reuse (which is expected after the app has been running for a bit). For each of those 100 entities: 1) We remove them from the pending list. 2) We re-arrange them into a ArchetypeEntity by adding TableRow::INVALID (the empty archetype has no table.) 3) We set their EntityLocation. Then, now that flush is over, we 4) swap remove them from the empty archetype. 5) Insert them into the MyComponent archetype, and 6) set their new EntityLocation.

What the heck? Steps 1-3 are completely irrelevant after steps 4-6. It's wasted work! And's it's not just "oh we need one more mem-copy calls" since we have to change Entity to ArchetypeEntity at both steps. Plus, we're swap removing front to back, so that takes even longer too! Again, what the heck?

Solution (WIP that needs discussion)

I'll cut to the chase: Let's remove the empty archetype and consider "invalid" (not yet flushed) entities as valid but empty entities. In other words: ArchetypeId::INVALID becomes ArchetypeId::Empty. As soon as an entity is reserved or allocated, it is fully valid and usable; it's just empty.

That means, we never need to flush. We save 50% of the work for most commands. We ditch expensive flush calls that really just existed to let us alloc or free an entity. Etc. This could be a really big win for performance!

There are some downsides too:

Entities needs to keep archetype Edges and have some extra moving parts.
The empty archetype needs to get special treatment. (BundleInserter, BundleSpawner, and BundleRemover all need a new if statement.)
Queries matching the empty archetype are a bit harder. Query<Entity> can't just loop through every Archetype anymore. But in practice, who does Query<Entity> completely unqualified. The minute you add any component into the query at all, this goes away.
Observers that track anything spawning need a bit more work since reserving an entity technically makes it spawned. I would suggest we just trigger these observers the minute an entity moves from the empty archetype to any other. (Since that's when the information becomes usable anyway.)

Details

How does this work with remote entity reservation, and what does the new Entities look like in practice.

First, lets get some obvious things out of the way. 1) The meta list needs to continue to be a normal Vec directly on Entities for performance. 2) reserve_entity and reserve_entities are now exactly the same as alloc mechanically, so we can get rid of them. 3) As a result alloc and a new alloc_many need to be created with only &Entities. 4) Because of remote reservation, ideally alloc at least only needs some shared, Arc'd state, not full &Entities. This prevents remote reservation from being blocking or async waiting. This is bonus points, but it's ends up not costing anything extra.

With that out of the way, the next big question is "how does reservation work?" The answer is complex, so let me break this down. At the end of the day, we need something like a Vec<Entity> that stores a slice of empty archetype entities followed by a slice of pending/free entities. Let's call this owned since these entities are owned by Entities. We also need an atomic int to separate them, free_cursor, which holds the index into owned of the first free/pending entity. And we need an atomic int to track the length of the meta list; let's call it meta_len.

Let's explore that model a little bit:

To get the EntityLocation of an Entity, we need a bit of work. If the index is valid in meta, we can just get it like normal. If the index is greater than the atomic meta_len or the entity's generation is higher than that of a brand new entity, the entity is invalid/from a different world, so we return None. If neither of those conditions are met, the entity must be brand new, so it is not in owned and it's ArchetypeRow is ArchetypeRow::INVALID. (Originally I thought using ArchetypeRow::INVALID here was dangerous, but it ends up being completely safe.) This takes 1 relaxed atomic operation for brand new entities.
To access the EntityMeta of an entity index (internally), we just access meta at the proper index. If that index is currently out of bounds, we just extend it with the EntityMeta for a brand new entity. A brand new entity has ArchetypeId::EMPTY and ArchetypeRow::INVALID, so we just fill that in. (No atomic operations)
To set the EntityLocation of an Entity, there's a bit more work too. First, we update its EntityMeta in meta. If it's previous location is ArchetypeId::EMPTY and it is not ArchetypeRow::INVALID, then we need to remove it from our owned list. This is the hardest part since a remote thread might be doing alloc at the same time as us. First, if the free_cursor index is not valid in owned, we have the owned list all to ourselves, so we can just swap remove and be done. Otherwise, we start by incrementing free_cursor to get an index that we own. Then we swap the entity to remove with that index, and swap their ArchetypeRows accordingly. Now, the entity leaving the empty archetype is right before the current free_cursor. So, we do an atomic compare exchange with the free_cursor to try to decrement it. If that works, we're done. Otherwise, another one must have been reserved, so we need to do that swap again to put the entity leaving the empty archetype right before free_cursor. We try this again until the compare exchange works. Note that the compare exchange will pretty much always work the first time and that none of this will happen unless we're moving a freshly allocated reused entity. In practice, when the current archetype is empty, this has between 1 and 3 atomic operations, all of which can be relaxed I think.
To free an entity, we just make sure it exists, increase the generation, push onto owned, and update its EntityMeta in meta. We also need to set its new location. Importantly, it is now in the empty archetype (though not yet spawned), so we set it's EntityLocation::archetype_id to ArchetypeId::EMPTY and we set its EntityLocation::archetype_row to its index in owned. If free_cursor is greater than this index, that means this entity is now the only pending/free entity, and we need to set free_cursor to its index in owned. This takes 2 relaxed atomic operations.
To alloc/reserve an entity, we just increment the free_cursor. If the previous free_cursor was a valid index in owned, we look it up in owned, and return it. Otherwise, we just return a generation 0 entity at index meta_len.fetch_add(1). This takes either 1 or 2 relaxed atomic operations, which we can amortize some in the future.
To iterate the empty archetype, iterate owned[0..free_cursor], the new entities with indices meta.len()..meta_len, and any new indices ranges we missed because of recent meta resizes. We'll have to do a tiny bit if book keeping for this, but not much. (Or we can just not iterate this because what would be the point?)

If these atomic operations are bothering you, remember that all of them are relaxed, which is pretty much free.

To do this model, what we need is a Vec<Entity>-like data structure for owned that we can push, pop, and index without mutable access. Thankfully, this is a much simper problem that's already solved. We need some form of liked list, but to make things more compact, we need a linked list of Box<[UnsafeCell<Entity>]>s where the length of each slice is doubling. We can also skip the first few powers of 2, and we can flatten the linked list into a Vec<OwnedChunk>, where each chunk is the Box<[UnsafeCell<Entity>]> described, and the first one has length 256. We could even store a pointer to the biggest chunk inline since that's the most likely to be used. We also need an atomic integer for the length, and in reality, those Boxes will need to be AtomicPtr, but we can still use Relaxed ordering, so there's no practical difference.

Note that this structure lets us index easily (we don't have to do any searches), has the same memory footprint as a normal Vec, has the same growing strategy as a normal Vec, etc. The only downside here is that it uses more cache lanes if the chunks are not adjacent in memory. But remember that unless we're finally seting the EntityLocation of an Entity we allocated ages ago, we're only going to be using the biggest chunk (1 cache lane). And worst case, we just need 2 for swaping between chunks. Just like the current flushing system uses 2 (the pending list and the entities in the empty archetype).

Final notes

This could maybe be simplified a little bit without the remote reservations requirement. For example, we could maybe re-introduce flush for fulfilling remote reservations that are waiting in async. But this doesn't give us many wins for performance over this implementation.

Part of me thinks I've missed something (otherwise we'd be doing this already), but I can't think of anything. Note that this would use a few atomic operations even during alloc calls, etc. But they could all be Relaxed (no more costly than a normal operation just not as optimizable by the compiler). So I don't think this is a big deal.

In any case, I think that this would be much much faster than the current flush system, since it's doing double the work it needs to! Plus this solves a lot of design problems with remote reservation, etc.

maniwani · 2025-03-28T02:16:07Z

maniwani
Mar 28, 2025
Collaborator

Let's remove the empty archetype and consider "invalid" (not yet flushed) entities as valid but empty entities.

It may not be useful now, but I think querying for entities with no components won't be so strange once Bevy's editor workflow matures and defining queries at runtime becomes a common thing. (I can imagine users wanting to know if entities without components exist when they're not expected to.)

Let's say we use commands to spawn 100 entities with MyComponent. [proceeds to demonstrate how flush is inefficient]

Hmm, I don't remember why flush can't just leave entities in the invalid archetype. I don't think an entity should be considered "existing" until its spawn command is applied.

Too many intermediate archetype/table moves when applying commands is a known problem. I think that problem is better solved by command "batching" (#10154) where we'd "preview" all commands queued on the same entity to first figure out its destination archetype and then move/place it only once.

There are some downsides too:
<list of downsides>

All of the downsides you've listed sound quite alarming to me. More coupling between the entity metadata and archetype structures, special handling requirements for the empty archetype, and the inherent complexity of the proposed allocation process seem like a combo that'll have future contributors (and even maintainers) very anxious about messing with this code.

So even if this is faster, I'm reluctant to support it because of its inherent complexity. Bevy hasn't even presented Entity as an asset handle yet, so V6 seems (to me) like the only solution I can justify with what we know today. It's small, easily understood, and fully self-contained, so there's virtually no penalty to pay if it turns out to be lacking.

10 replies

chescock Mar 28, 2025
Collaborator

For reference on some of the perf questions: I finally got around to benchmarking the branch I had that used a ConcurrentQueue with thread-local caches (chescock#8), and it was about 20% slower on the spawn_batch benchmarks. Those currently use &mut so that they don't need to do any atomic operations, and it's going to be really hard not to regress that.

ElliottjPierce Mar 28, 2025
Collaborator Author

Yeah. This is one of the reasons I'm kinda allergic to concurrent_queue here. It's just slow. A local cache helps but not enough.

In this proposal, to alloc, we do a relaxed fetch_add to get our index and one relaxed AtomicUsize::load to get the owned length. If it's in bounds we need a relaxed AtomicPtr::load and an pointer deref. If it's not in bounds we need a relaxed fetch_add on meta_len. So alloc is 3 relaxed atomic ops total. I think that's pretty reasonable.

The other half of spawn set has by my count 3 relaxed atomic ops. One fetch_add, one AtomicUsize::load for owned's len, and one AtomicPtr::load. If there are still free entities in owned we need a 4th to compare_exchange.

So the whole spawn is maybe 7 relaxed atomic ops. Honestly not the best, but much better than concurrent_queue. Plus the AtomicPtr::loads we can probably cache if we need to, and we can probably speed it up some more by doing a spawn_at(EntityLocation) in addition to the spawn_empty/alloc.

chescock Mar 28, 2025
Collaborator

I think we must have different mental models for performance here? The slow thing is the atomic read-modify-write operations. ConcurrentQueue is slow only because it needs to do an atomic compare_exchange, plus an additional fetch_or in the unbounded case. Your proposal does an atomic fetch_add, so it should perform similar to a bounded queue.

I think any change that requires atomic RMW operations in alloc() is likely to cause a similar regression in spawn_batch performance.

ElliottjPierce Mar 28, 2025
Collaborator Author

Here's what I mean:

There's a lot more too, depending on some branching.

It all uses Aquire Release though to guard agains race conditions and ordering. I really don't care about race conditions and ordering though. Who cares which entity we reserve as long as it is valid and unique? So we can at least use relaxed here, which should be faster.

Plus we don't need fetch_or or anything I don't think.

I'm doing some very early prototyping just to see what it would actually look like. I'm going to hold off linking it into the ECS until the Entities part is nailed down.

chescock Mar 28, 2025
Collaborator

It all uses Aquire Release though to guard agains race conditions and ordering. I really don't care about race conditions and ordering though. Who cares which entity we reserve as long as it is valid and unique? So we can at least use relaxed here, which should be faster.

We care about race conditions because we don't want data races causing Undefined Behavior, or two clients popping the same entity. You will need to use Acquire and Release operations for this.

But that's fine, because Relaxed isn't that much faster. In particular, on x86, all memory operations have acquire and release semantics, so a load-acquire or store-release isn't any more expensive than even a non-atomic read or write.

chescock · 2025-03-28T14:13:50Z

chescock
Mar 28, 2025
Collaborator

To free an entity, we just make sure it exists, increase the generation, push onto owned, and update its EntityMeta in meta. [...] we set its EntityLocation::archetype_row to its index in owned. If free_cursor is greater than this index, that means this entity is now the only pending/free entity, and we need to set free_cursor to its index in owned. [...]

To alloc/reserve an entity, we just increment the free_cursor. If the previous free_cursor was a valid index in owned, we look it up in owned, and return it. [...]

If I understand this correctly, freeing an entity will write it to the owned array and then increment free_cursor, while allocating a recycled entity will decrement the free_cursor then read from the owned array.

What prevents an additional "free" from happening between the decrement and the read during the recycle? That would overwrite the entity in owned before the recycling thread could read it, causing it to read the wrong value, resulting in the old entity being lost and the new one being used twice.

I believe this is why the concurrent queue implementations use a "stamp" value, although I don't see a straightforward way to make that work with a stack instead of a queue. More generally, it looks like you're describing a custom concurrent queue implementation, but I'm still not clear how it differs from other queues or how you expect to do better than them.

To do this model, what we need is a Vec<Entity>-like data structure for owned that we can push, pop, and index without mutable access. Thankfully, this is a much simper problem that's already solved. We need some form of liked list, but to make things more compact, we need a linked list of Box<[UnsafeCell<Entity>]>s where the length of each slice is doubling.

This is clever! It reminds me a bit of the data structure used in ThreadLocal. https://github.com/Amanieu/thread_local-rs/blob/b5aa92903596e5ad1af0fce20d7356dc55934ec5/src/lib.rs#L104-L105

5 replies

ElliottjPierce Mar 28, 2025
Collaborator Author

To free an entity, we just make sure it exists, increase the generation, push onto owned, and update its EntityMeta in meta. [...] we set its EntityLocation::archetype_row to its index in owned. If free_cursor is greater than this index, that means this entity is now the only pending/free entity, and we need to set free_cursor to its index in owned. [...]

To alloc/reserve an entity, we just increment the free_cursor. If the previous free_cursor was a valid index in owned, we look it up in owned, and return it. [...]

If I understand this correctly, freeing an entity will write it to the owned array and then increment free_cursor, while allocating a recycled entity will decrement the free_cursor then read from the owned array.

What prevents an additional "free" from happening between the decrement and the read during the recycle? That would overwrite the entity in owned before the recycling thread could read it, causing it to read the wrong value, resulting in the old entity being lost and the new one being used twice.

Really good catch. So at any given time we can have 0 or 1 frees and any number of allocs. Ok. Lets push to the stack first. Then we do a compare exchange to increment free cursor. If it fails, we just move it to the new spot. That should fix it. And we can still used relaxed, so it's free!

Let me know if you still see any problems here. I'm always grateful and annoyed when you find stuff like this. I'm getting flashbacks to v2.

I believe this is why the concurrent queue implementations use a "stamp" value, although I don't see a straightforward way to make that work with a stack instead of a queue. More generally, it looks like you're describing a custom concurrent queue implementation, but I'm still not clear how it differs from other queues or how you expect to do better than them.

This is similar to a custom concurrent queue, but there's a few differences. First, it's a stack, but I guess that's superficial. Second, we need to be able to index the queue. Third, we need to be able to swap elements in the queue. concurrent_queue can't do that. Even if it could, we can do better. Concurrent queues need to enforce ordering, grow infinitely, etc. We don't need that, which gives us this window of optimization. For example, concurrent queues typically have small spin locks, but we really don't care if they race, as long as the operation works and is valid.

To do this model, what we need is a Vec<Entity>-like data structure for owned that we can push, pop, and index without mutable access. Thankfully, this is a much simper problem that's already solved. We need some form of liked list, but to make things more compact, we need a linked list of Box<[UnsafeCell<Entity>]>s where the length of each slice is doubling.

This is clever! It reminds me a bit of the data structure used in ThreadLocal. https://github.com/Amanieu/thread_local-rs/blob/b5aa92903596e5ad1af0fce20d7356dc55934ec5/src/lib.rs#L104-L105

Yeah, and thread local is another option kinda. I was pretty happy with this design, but we'll see. I want to make sure everyone is on the same page before attempting this lol.

chescock Mar 28, 2025
Collaborator

Really good catch. So at any given time we can have 0 or 1 frees and any number of allocs. Ok. Lets push to the stack first. Then we do a compare exchange to increment free cursor. If it fails, we just move it to the new spot. That should fix it. And we can still used relaxed, so it's free!

Let me know if you still see any problems here. I'm always grateful and annoyed when you find stuff like this. I'm getting flashbacks to v2.

I don't entirely follow your proposal here. The "push to the stack" is the part that overwrites the old data, so you can't do that first.

You need to make sure that the array read for an alloc happens after any write from the previous free and before any write from a later free. So you need some atomics X and Y such that free does a load-acquire to X, then a write, then a store-release to Y, and alloc does a load-acquire to Y, then a read, then a store-release to X. Y can be the free cursor, but X is the "stamp" that the concurrent queue implementations use, and I don't see what it corresponds to in your proposal.

Hmm, if we can assume the data has a niche and an be updated atomically, we might be able to combine the stamp with the data. That could be a way to beat an off-the-shelf queue that can't make those assumptions. So, alloc would store-release Entity::INVALID after reading, and free would make sure the cell was invalid before writing. Although that just saves a non-atomic write for the data, so it might not matter much.

I also don't see how to do this in a non-blocking way with a stack. If one thread is running alloc and claims an index and then hangs without reading the data, then we can never free another entity into that index! The queues avoid that issue by writing to the other end; a hung reader in a bounded queue eventually looks like a full queue, and a hung reader in an unbounded queue keeps the old allocations alive longer.

Yeah, and thread local is another option kinda. I was pretty happy with this design, but we'll see. I want to make sure everyone is on the same page before attempting this lol.

Oh, I didn't mean to use ThreadLocal, I just meant that the data structure is similar! They use an array of 64 buckets, each with 2^N entries, and each initialized once and then never moved.

chescock Mar 28, 2025
Collaborator

Oh, and the other assumption we might have here is that there is a single writer! That could wind up looking like a work-stealing queue, like crossbeam_deque or st3, although those can have "steal" operations fail if someone else is stealing, which we wouldn't have a way to handle here.

ElliottjPierce Mar 28, 2025
Collaborator Author

I'm not sure we're on the same page. Let me work on a prototype and show you. (I'm not building v7, just the data structure.) I think having a code to discuss will be more productive.

ElliottjPierce Mar 29, 2025
Collaborator Author

Ok. The draft is out here: #18611. (This is not actually fully implemented yet. It's just the atomic synchronization bit that we need to discuss correctness and performance of.)

ElliottjPierce · 2025-03-30T19:20:43Z

ElliottjPierce
Mar 30, 2025
Collaborator Author

@maniwani Your criticism of this design is really helpful. I've been thinking more about it, and I think I now agree with more of what you've been saying.

Here's some improvements that should make this less complex and easier to justify:

First, we don't actually need to make the empty archetype part of the pending list. I thought we did in order to get rid of flush, but we actually don't. All flush does is make it easier to insert, remove, etc. Instead, we could just protect against ArchetypeId::INVALID in those operations. That's an early return on remove and take, and a tiny bit of logic on insert and despawn.
Second, since the pending list no longer includes the empty archetypes, we can greatly simplify the proposed Entities code. Practically the only difference from main then would be the custom Vec type for the pending list, and adding a meta_len atomic.
As a consequence, reserve and alloc still become the same thing, but they don't actually spawn anything. On paper, this is no slower than the current v7 design for spawning empties, too. And we still never need to flush, which is a big win. And the empty archetype doesn't need any special treatment either.

The only downside is that we would be releasing entities to public APIs that still have an invalid archetype. But that doesn't worry me very much. We already do this now (it's just that flush will be a catch all that applies soon). In practice, as long as we queue a move from invalid archetype to the empty archetype with each command that just spawns an empty, I think this should be a non issue.

Do you have any thoughts here? This fixes a lot of valid issues you've brought up, but it could have problems of its own.

1 reply

maniwani Apr 1, 2025
Collaborator

Do you have any thoughts here? This fixes a lot of valid issues you've brought up, but it could have problems of its own.

I haven't had a chance to really grok this, but what you're saying does sound like it avoids most things I was worried about (mostly introducing new special treatment of the empty archetype).

IIUC decoupling entity allocation from archetype assignment even sounds like an improvement.

The only downside is that we would be releasing entities to public APIs that still have an invalid archetype. But that doesn't worry me very much. We already do this now (it's just that flush will be a catch all that applies soon).

Handing user code an Entity before it's been spawned is OK, yeah. As long as an entity is still assigned a real location by the time the user expects it to have one, I can't think of any obvious issues off the top of my head.

Uh oh!

Uh oh!

Never Flush Entities #18577

Uh oh!

Uh oh!

ElliottjPierce Mar 27, 2025 Collaborator

Background

How Entities works now

What's wrong with this

Solution (WIP that needs discussion)

Details

Final notes

Replies: 3 comments · 16 replies

Uh oh!

maniwani Mar 28, 2025 Collaborator

Uh oh!

chescock Mar 28, 2025 Collaborator

Uh oh!

ElliottjPierce Mar 28, 2025 Collaborator Author

Uh oh!

chescock Mar 28, 2025 Collaborator

Uh oh!

ElliottjPierce Mar 28, 2025 Collaborator Author

Uh oh!

chescock Mar 28, 2025 Collaborator

Uh oh!

chescock Mar 28, 2025 Collaborator

Uh oh!

ElliottjPierce Mar 28, 2025 Collaborator Author

Uh oh!

chescock Mar 28, 2025 Collaborator

Uh oh!

chescock Mar 28, 2025 Collaborator

Uh oh!

ElliottjPierce Mar 28, 2025 Collaborator Author

Uh oh!

ElliottjPierce Mar 29, 2025 Collaborator Author

Uh oh!

ElliottjPierce Mar 30, 2025 Collaborator Author

Uh oh!

Uh oh!

maniwani Apr 1, 2025 Collaborator

ElliottjPierce
Mar 27, 2025
Collaborator

How `Entities` works now

Replies: 3 comments 16 replies

maniwani
Mar 28, 2025
Collaborator

chescock Mar 28, 2025
Collaborator

ElliottjPierce Mar 28, 2025
Collaborator Author

chescock Mar 28, 2025
Collaborator

ElliottjPierce Mar 28, 2025
Collaborator Author

chescock Mar 28, 2025
Collaborator

chescock
Mar 28, 2025
Collaborator

ElliottjPierce Mar 28, 2025
Collaborator Author

chescock Mar 28, 2025
Collaborator

chescock Mar 28, 2025
Collaborator

ElliottjPierce Mar 28, 2025
Collaborator Author

ElliottjPierce Mar 29, 2025
Collaborator Author

ElliottjPierce
Mar 30, 2025
Collaborator Author

maniwani Apr 1, 2025
Collaborator