Faster queue free #61932

Closed

Conversation

@lawnjelly lawnjelly (Member) commented Jun 11, 2022

Calling queue_free() for large numbers of siblings could previously be pathologically slow, with the time taken rising roughly quadratically with the number of children. This appeared to be partly due to the ordered_remove from the child list and the notifications.

This PR identifies objects in the delete queue that are Nodes, and deletes children flagged for deletion in reverse order. This minimizes the cost of reordering.

Fixes #61929

Performance testing, number of child nodes deleted:

  • 50000 nodes: before 137017 ms, after 422 ms (325x faster)
  • 20000 nodes: before 19425 ms, after 250 ms (78x faster)
  • 5000 nodes: before 1228 ms, after 169 ms (7.3x faster)
  • 1000 nodes: before 194 ms, after 150 ms (1.3x faster)

(At lower node counts the measurement from the MRP in #61929 becomes less useful, as there are some fixed delays; however, the pattern is that the benefit really starts to show with a decent number of children.)

Notes

  • It might be even nicer to sort the delete_queue, but there seems to be no efficient way of getting the child index from a node.
  • The element "id" could be removed from the linked list during the first stage, but it is probably not worth it provided the ObjectDB::get_instance() call is efficient for deleted objects.
  • This could alternatively be done as part of the lower iteration section of the function (instead of separately), which might save some ObjectDB::get_instance() calls, but it might be trickier to turn off efficiently.
  • The flip side of this PR is that it can add a small cost when e.g. a single child is being deleted from among 1000 siblings - the siblings will be traversed just in case they are queued for deletion. However, this will usually be avoided by the check on the overall queue size.
  • The 2 magic numbers are to some extent guessed. I've set them rather conservatively so that in most cases during gameplay they won't activate, but during level shutdown, closing the editor, etc. they will help prevent long pauses (see the sketch below).
  • The same principle can be applied in 4.x, as the code is similar.
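For illustration, a simplified sketch of the approach (abbreviated, not the exact diff; the SceneTree::_flush_delete_queue() and delete_queue names follow the 3.x sources, only the queue-size threshold of the two magic numbers is shown, and the existing flush loop is elided):

```cpp
// Simplified sketch of the flush step: when the delete queue is large enough
// to be worth it, free queued siblings in reverse child order, so each
// removal pops from the end of the parent's child list instead of shuffling
// every later sibling via ordered_remove.
void SceneTree::_flush_delete_queue() {
	// Less need to apply this when deleting small numbers of nodes
	// (one of the two conservative "magic numbers" mentioned above).
	if (delete_queue.size() > 128) {
		for (List<ObjectID>::Element *E = delete_queue.front(); E; E = E->next()) {
			Node *node = Object::cast_to<Node>(ObjectDB::get_instance(E->get()));
			if (!node || !node->get_parent()) {
				// Not a Node, already freed by an earlier sibling pass, or not in the tree.
				continue;
			}
			Node *parent = node->get_parent();
			// Walk the parent's children back to front and free everything
			// flagged by queue_free(); back to front keeps each removal cheap.
			for (int i = parent->get_child_count() - 1; i >= 0; i--) {
				Node *child = parent->get_child(i);
				if (child->is_queued_for_deletion()) {
					memdelete(child);
				}
			}
		}
	}

	// ...followed by the existing loop that frees whatever remains in the queue
	// (entries freed above are skipped because ObjectDB::get_instance() returns null).
}
```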

Update

It turns out that after the fix, the remaining bottleneck was in the BVH doing collision detection, with all the overlapping boxes creating a lot of pairs. With the boxes hidden, this artificial BVH bottleneck is removed and the new PR is much faster (i.e. around 100x faster).

@lawnjelly lawnjelly added this to the 3.x milestone Jun 11, 2022
@lawnjelly lawnjelly marked this pull request as ready for review June 11, 2022 12:45
@lawnjelly lawnjelly requested a review from a team as a code owner June 11, 2022 12:45
// Less need to apply this when deleting small numbers of nodes.
// This magic number is guessed .. to get most of the benefit while not checking through
// child lists when unnecessary (as this will have a cost).
if (delete_queue.size() > 128) {


Is there performance degradation for <128 nodes?

@lawnjelly lawnjelly (Member, Author) Jun 13, 2022

There is a cost to the technique. If a parent has e.g. 1000 children and only one is queued for deletion, then it will iterate through all the children just in case any of them are queued. In the worst case this could result in a bunch of cache misses, so ideally we don't want to use it unless we know it will help.

We know there is no possible benefit in using the optimization with a queue size of 1, and we know the benefits tend to scale roughly with (number of children being deleted / number of children) on average - it depends on the exact deletion patterns. So it makes sense to turn it on at some overall queue size between 2 and 1000 (the 1000-child case was measured, somewhat inaccurately, as 1.3x faster earlier).

In the same way, there is less absolute benefit possible for a parent that has few children, as the effect scales something like O(n*n) where n is the number of children.

The magic numbers in both cases could probably be decreased (maybe as a result of further testing), but this conservative approach should be a good start, because it will eliminate most long pauses.

There are a few more complex techniques which could be used to achieve the same thing in a less "haphazard" fashion, such as sorting the queue delete list, or keeping a list of queue-deleted nodes on each node. The main difficulty is that currently each node has no idea of its child index relative to its parent. This could be fixed by storing a child index per node and keeping it up to date as children are added / moved, but it is pretty easy for bugs to creep in here and for it to get out of sync. So the simple approach in this PR is probably more sensible to get started with (unless profiling shows the same thing is a problem in other situations).

Another variation: we could store a counter of the number of queue-deleted children per node, and use the ratio "number in queue / number of children" as a heuristic to turn the optimization on or off.
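Purely as an illustration of that last variation (the struct and helper here are hypothetical, not part of the PR or the engine):

```cpp
// Hypothetical bookkeeping: each node tracks how many of its children are
// currently queued for deletion (incremented by queue_free() on a child,
// decremented when that child is actually freed).
struct ParentDeletionStats {
	int queued_children = 0;
	int child_count = 0;
};

// Heuristic: only pay for the reverse child-list scan when a reasonable
// fraction of this parent's children are queued for deletion.
bool should_use_reverse_scan(const ParentDeletionStats &p_stats) {
	if (p_stats.child_count == 0) {
		return false;
	}
	// e.g. at least a quarter of the children are queued.
	return p_stats.queued_children * 4 >= p_stats.child_count;
}
```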

@hawkinsonb hawkinsonb Jun 14, 2022

I understand where you're coming from. It's definitely a band-aid for a larger problem regarding deletion performance. It makes more sense to look deeper into the deletion queue for a cleaner solution.

I'm newer to the engine, so I apologize if this is off.

But it appears nodes know their depth at the time they are removed. Maybe having a different queue for each scene tree level would be more efficient? If you process deletion of the deepest level first, you'll always delete the deepest children first.

@lawnjelly lawnjelly (Member, Author)

This seems to be a misunderstanding of the problem. The relevant factor is the child ID (i.e. which position the child occupies in the parent's child list); the depth is unrelated to the problem (afaik).

@hawkinsonb hawkinsonb

Maybe? The PR appears to be looking for children of the parent to delete first, because deleting the parent before the children is causing the issue.

Wouldn't the depth determine what would potentially be a child? I.e. the deepest node would always be a child.

So there would be no need to check all of a parent's children to see if they need to be deleted first, as anything deeper than the parent (all of its potential children) was already deleted from the deletion queue.

@lawnjelly lawnjelly (Member, Author) Jun 14, 2022

> Maybe? The PR appears to be looking for children of the parent to delete first, because deleting the parent before the children is causing the issue.
>
> Wouldn't the depth determine what would potentially be a child? I.e. the deepest node would always be a child.
>
> So there would be no need to check all of a parent's children to see if they need to be deleted first, as anything deeper than the parent (all of its potential children) was already deleted from the deletion queue.

You seem to be misunderstanding the issue and the PR; all of your suppositions here are incorrect. I can only suggest re-reading the issue and examining what the MRP does.

EDIT: Were you thinking in terms of optimizing by deleting parents before children? That is a valid avenue, but it is not addressed here.

@goatchurchprime

This seems like an overly narrow fix for an interesting performance issue that's going to occur every time an object is deleted, even if it's not always big enough to be noticed and isolated.

Also, won't it make things worse if I am already calling queue_free() on objects in reverse order to work around this problem?

There must be something wrong in the implementation if it makes this much difference simply by changing the order of deletion. Is there a low-level function missing its binary search?

I am also skeptical about a batch delete function that simply loops through single object deletes, if a lot of time can be wasted through relinking and shuffling objects that are also due to be deleted. There is a memdelete_arr() function, but this also just loops through its individual records, because it doesn't know enough about what the object types are.

@lawnjelly lawnjelly (Member, Author) commented Jun 13, 2022

> an interesting performance issue that's going to occur every time an object is deleted, even if it's not always big enough to be noticed and isolated.

It only seems to occur when deleting multiple objects. Deleting a single object (that has many children) doesn't seem to incur the same cost, presumably because the children of this object are deleted in bulk, and don't get reshuffled and sent notifications multiple times etc.

> I am also skeptical about a batch delete function that simply loops through single object deletes, if a lot of time can be wasted through relinking and shuffling objects that are also due to be deleted.

Absolutely, a bulk delete function could be made more efficient; see in particular #61929 (comment):

> Additionally, if there is a notification on re-ordering, perhaps it can be sent once (after all deletes) instead of after each deletion.

The API doesn't currently offer a "multiple delete" afaik; only queue_free() can be called multiple times, and we are constrained by history here. However, we can make the machinery behind this API more efficient in a number of different ways.

This PR is currently a super simple way of addressing this and getting an easy "bang for your buck", but if desired it could be made more comprehensive - there are several obvious ways this could be improved further. But this is something we will typically decide at a PR meeting. It is good practice to make a simple solution first before investing significant time, as often reduz will prefer the simple solution, and this saves developer time and frustration.

There is a trade-off between code complexity and benefit. There's also a difference between theoretical benefit and practical benefit. The PR is currently simple and has a lot of practical benefit. We could increase the complexity and theoretical benefit further but maybe not gain much in terms of practical benefit. The Godot philosophy on the whole errs on the side of simple, maintainable code where possible (over outright performance).

Aside from other techniques I mentioned in earlier posts, we could also run through all the children of the parent node as a pre-step, make a list of all the child indices to be deleted, then do a bulk delete and send the notifications once. I may end up doing further work on this, depending on profiling - I discovered this bottleneck when working on #61568, and if the reduced bottleneck is still significant I may spend further time on it.
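To make that pre-step idea concrete, a rough sketch (not code from this PR; the helper name is hypothetical, and a real implementation would also need an internal way to suppress the per-removal NOTIFICATION_MOVED_IN_PARENT sent from remove_child(), which is omitted here):

```cpp
// Gather the parent's children flagged by queue_free(), free them back to
// front, then notify the surviving children of the re-order just once.
void _bulk_free_queued_children(Node *p_parent) {
	Vector<Node *> to_delete;
	for (int i = 0; i < p_parent->get_child_count(); i++) {
		Node *child = p_parent->get_child(i);
		if (child->is_queued_for_deletion()) {
			to_delete.push_back(child);
		}
	}

	// Reverse order keeps each removal a cheap pop from the end of the
	// parent's child list rather than an ordered_remove shuffle.
	for (int i = to_delete.size() - 1; i >= 0; i--) {
		memdelete(to_delete[i]);
	}

	// Send the re-order notification to the remaining children once,
	// instead of once per deletion.
	for (int i = 0; i < p_parent->get_child_count(); i++) {
		p_parent->get_child(i)->notification(Node::NOTIFICATION_MOVED_IN_PARENT);
	}
}
```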

> Is there a low-level function missing its binary search?

I didn't understand what you meant by this.

> Also, won't it make things worse if I am already calling queue_free() on objects in reverse order to work around this problem?

In theory, yes, but in practice it will make little difference, because the first queued child it hits in the list will trigger the deletion of all the queued children of that parent, and the later get_instance() calls will then simply fail (this relies on get_instance() being reasonably fast, but it does seem to be fast in my tests, relative to the costs of deletion).

@lawnjelly lawnjelly (Member, Author)

Some testing of alternative strategies based on discussion with reduz:

Dirty flag for "moved in parent"

Based on the hypothesis that a lot of the cost might be the "moved in parent" notifications being sent multiple times: instead of calling notification(NOTIFICATION_MOVED_IN_PARENT) directly during remove_child(), we set a flag on the node (if not already set) and add it to a list, then flush this list once (per frame / iteration), ensuring that there is only a single NOTIFICATION_MOVED_IN_PARENT per node rather than multiple.
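A minimal sketch of this dirty-flag variant (hypothetical helper, not the code that was actually tested; the thread safety that a real change would need is omitted):

```cpp
// Deferred "moved in parent" notifications: remove_child() would call mark()
// instead of notifying immediately, and flush() runs once per frame /
// iteration, so each node receives at most one NOTIFICATION_MOVED_IN_PARENT
// no matter how many siblings were removed before it.
class MovedInParentQueue {
	Set<ObjectID> pending; // set membership acts as the per-node dirty flag

public:
	void mark(Node *p_node) {
		pending.insert(p_node->get_instance_id());
	}

	void flush() {
		for (Set<ObjectID>::Element *E = pending.front(); E; E = E->next()) {
			Node *node = Object::cast_to<Node>(ObjectDB::get_instance(E->get()));
			if (node) { // the node may have been freed since it was marked
				node->notification(Node::NOTIFICATION_MOVED_IN_PARENT);
			}
		}
		pending.clear();
	}
};
```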

Timings (5000 nodes, debug):

  • This PR (reverse iteration approach): 280 ms
  • Dirty flag: 2526 ms
  • Existing 3.x: 3424 ms

So using a dirty flag seems to be faster than the existing 3.x branch, but not nearly as fast as the reverse iteration approach (for this particular benchmark at least, which is a worst case).

One advantage of the dirty flag (over reverse iteration alone) is that it will also work when calling remove_child() directly, rather than via queue_free(). However, a disadvantage is that the dirty flag needs to be made thread safe, and it creates the potential for bugs because the timing of NOTIFICATION_MOVED_IN_PARENT will have been altered (it will be called in a deferred fashion rather than immediately). This could potentially cause knock-on bugs, so it is less safe in terms of regressions.

The dirty flag could potentially be used in addition to the reverse iteration in this PR. I tested this combined approach: it is slightly slower than reverse iteration on its own, but as said above it could be helpful in more cases. It should probably be added in a separate PR, if we were to go with it.

@lawnjelly lawnjelly (Member, Author)

Superseded by #62444.

@lawnjelly lawnjelly closed this Jun 27, 2022
@YuriSizov YuriSizov removed this from the 3.x milestone Dec 2, 2023