Faster queue free #61932
Conversation
Calling queue_free() for large numbers of siblings could previously be very slow, with the time taken rising roughly quadratically with the number of children. This looked partly due to ordered_remove from the child list and notifications. This PR identifies queued objects that are nodes, and deletes children flagged for deletion in reverse order. This minimizes the cost of reordering.
```cpp
// Less need to apply this when deleting small numbers of nodes.
// This magic number is guessed .. to get most of the benefit while not checking through
// child lists when unnecessary (as this will have a cost).
if (delete_queue.size() > 128) {
```
Is there performance degradation for <128 nodes?
There is a cost to the technique. If a parent has e.g. 1000 children, and only one is queued for deletion, then it will iterate through all the children just in case any of them are queued. This could worst case result in a bunch of cache misses. So we ideally don't want to use it unless we know it will help.
We know there is no possible benefit in using the optimization with a queue size of 1, and we know the benefits tend to scale approximately with the ratio (number of children being deleted) / (number of children), on average (it depends on the exact deletion patterns). So it makes sense to turn it on at some overall queue size between 2 and 1000 (the 1000-child case was measured, somewhat inaccurately, as 1.3x faster earlier).
In the same way, there is less absolute benefit possible for a parent that has few children, as the effect scales something like O(n*n), where n is the number of children.
The magic numbers in both cases could probably be decreased (maybe as a result of further testing), but this conservative approach should be good to start with, because it will eliminate most long pauses.
There are a few more complex techniques which could be used to achieve the same thing in a less "haphazard" fashion, such as sorting the queue delete list, or keeping a list of queue-deleted nodes on each node. The main difficulty is that currently each node has no idea of its child ID relative to its parent. This could be fixed by storing a child ID per node and keeping it up to date as children are added / moved, but it is pretty easy for bugs to creep in here and for things to get out of sync. So the simple approach in this PR is probably more sensible to get started with (unless profiling shows the same thing is a problem in other situations).
Another variation is that we could store a counter of the number of queue-deleted children per node, and use this ("number in queue / number of children") as a heuristic to turn the optimization on or off; a rough sketch of that idea follows below.
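To make that variation concrete, here is a minimal standalone sketch (hypothetical names such as FakeNode and queued_child_count; this is not engine code, just an illustration of the heuristic): the parent keeps a running count of queued children and only pays for the full child-list scan when the queued fraction is large enough.

```cpp
#include <vector>

// Hypothetical sketch of the "counter per parent" heuristic discussed above.
// Each parent tracks how many of its children are currently queued for deletion;
// the reverse-iteration cleanup only scans the child list when enough of the
// children are queued to make the scan (and its potential cache misses) worthwhile.
struct FakeNode {
	std::vector<FakeNode *> children; // stands in for the engine's child list
	int queued_child_count = 0; // incremented on queue_free(), decremented when a child is actually freed
	bool queued_for_deletion = false;

	void queue_child(FakeNode *p_child) {
		if (!p_child->queued_for_deletion) {
			p_child->queued_for_deletion = true;
			queued_child_count++;
		}
	}

	// Heuristic: "number in queue / number of children".
	bool worth_reverse_scan() const {
		if (children.empty()) {
			return false;
		}
		return queued_child_count * 4 >= (int)children.size(); // e.g. at least 25% queued
	}
};
```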
I understand where you're coming from. It's definitely a band-aid for a larger problem regarding deletion performance. It makes more sense to look deeper into the deletion queue for a cleaner solution.
I'm newer to the engine so I apologize if this is off.
But it appears nodes know their depth at the time they are removed. Maybe having different queues for each scene_tree level would be more efficient? If you process deletion of the deepest level first, you'll always delete the deepest children first.
This seems to be a misunderstanding of the problem. The relevant factor is the child ID (i.e. which number child in the list of the parent), the depth is unrelated to the problem (afaik).
Maybe? The PR appears to be looking for children of the parent to delete first, because deleting the parent before the children is causing the issue.
Wouldn't the depth determine what would potentially be a child? I.e., the deepest node would always be a child.
So there would be no need to check all of a parent's children to see if they need to be deleted first, as anything deeper than the parent (all of its potential children) was already deleted from the deletion queue.
> Maybe? The PR appears to be looking for children of the parent to delete first, because deleting the parent before the children is causing the issue.
> Wouldn't the depth determine what would potentially be a child? I.e., the deepest node would always be a child.
> So there would be no need to check all of a parent's children to see if they need to be deleted first, as anything deeper than the parent (all of its potential children) was already deleted from the deletion queue.
You seem to be misunderstanding the issue and the PR; all of your suppositions here are incorrect. I can only suggest re-reading the issue, and examining what the MRP does.
EDIT: Were you thinking in terms of optimizing by deleting parents before children? That is a valid avenue, but is not addressed here.
This seems like an overly-narrow fix for an interesting performance issue that's going to occur every time an object is deleted, even if it's not always big enough to be noticed and isolated. Also, won't it make things worse if I am already calling queue_free() on objects in reverse order to work around this problem? There must be something wrong in the implementation if it makes this much difference by simply changing the order of deletion. Is there a low level function missing its binary search?

I am also skeptical about a batch delete function that simply loops through a single object delete, if a lot of time can be wasted through relinking and shuffling objects that are also due to be deleted. There is a memdelete_arr() function, but this also does it by looping through its individual records because it doesn't know enough about what the object types are.
It only seems to occur when deleting multiple objects. Deleting a single object (that has many children) doesn't seem to incur the same cost, presumably because the children of this object are deleted in bulk, and don't get reshuffled and sent notifications multiple times etc.
Absolutely, a bulk delete function could be made more efficient, see particularly #61929 (comment):

The API doesn't currently offer a "multiple delete" afaik, only deleting nodes one at a time. This PR is currently a super simple way of addressing this and getting an easy "bang for your buck", but if desired it could be made more comprehensive; there are several obvious ways it could be improved further. But this is something we will typically decide at a PR meeting. It is good practice to make a simple solution first before investing significant time, as often reduz will prefer the simple solution, and this saves developer time and frustration.

There is a trade-off between code complexity and benefit. There's also a difference between theoretical benefit and practical benefit. The PR currently is simple and has a lot of practical benefit. We could increase the complexity and theoretical benefit further, but maybe not gain much in terms of practical benefit. The Godot philosophy on the whole errs on the side of simple maintainable code where possible (over outright performance).

Aside from other techniques I mentioned in earlier posts, we could also run through all the children of the parent node as a pre-step, make a list of all the child IDs to be deleted, then do a bulk delete and send the notifications once (see the sketch after this comment). I may end up doing further work on this, depending on profiling (I discovered this bottleneck when working on #61568, and if the reduced bottleneck is still significant I may end up spending further time on this).
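As a rough illustration of that pre-step (again a sketch over a plain std::vector stand-in, not the real Node internals), the bulk delete can be done as a single compaction pass, so the surviving siblings are shuffled only once and a single "children changed" style notification could be sent at the end:

```cpp
#include <vector>

// Illustrative sketch of a bulk delete of queued children in one pass, instead
// of an ordered_remove per child (which shuffles every later sibling each time).
struct Child {
	bool queued_for_deletion = false;
};

void bulk_delete_queued(std::vector<Child *> &children) {
	size_t write = 0;
	for (size_t read = 0; read < children.size(); read++) {
		Child *c = children[read];
		if (c->queued_for_deletion) {
			delete c; // in the engine this is where the node's deletion handling would run
		} else {
			children[write++] = c; // keep survivors, preserving their relative order
		}
	}
	children.resize(write);
	// A single "children reordered" style notification could be sent here, once,
	// rather than once per deleted child.
}
```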
I didn't understand what you meant by this.
> Also, won't it make things worse if I am already calling queue_free() on objects in reverse order to work around this problem?

In theory, but in practice it will make little difference, because the first child it hits in the list to be deleted will delete all the queued children of the parent, and the later entries in the queue will then refer to nodes that have already been deleted.
Some testing of alternative strategies based on discussion with reduz:

Dirty flag for "moved in parent"

Based on the hypothesis that a lot of the cost might be the notifications sent to siblings when children are moved within the parent, this approach marks the moved children with a dirty flag instead of notifying them immediately.

Timings (5000 nodes, debug)

So using a dirty flag seems to be faster than the existing 3.x branch, but not nearly as fast as the reverse iteration approach (for this particular benchmark test anyway, which is worst case).

One advantage of the dirty flag (over reverse iteration alone) is that it will also work when calling free() directly rather than queue_free().

The dirty flag could potentially be used in addition to the reverse iteration in this PR. I tested this combined approach, and it is slightly slower than just reverse iteration on its own, but as said above could be helpful in more cases. But this should probably be added in a separate PR, if we were to go with this.
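For illustration, a sketch of what such a dirty flag might look like (hypothetical and simplified to a plain std::vector; the real engine uses NOTIFICATION_MOVED_IN_PARENT and its own child storage): removals only flag the siblings that shifted position, and a later flush sends each pending notification once, no matter how many removals occurred.

```cpp
#include <vector>

// Hypothetical illustration of the "dirty flag for moved in parent" strategy.
// Each removal still shifts the later siblings, but instead of notifying each of
// them immediately (so a sibling can be notified many times during a batch of
// removals), they are only flagged, and one flush notifies each dirty node once.
struct FlagNode {
	bool moved_in_parent_dirty = false;
	void moved_in_parent() { /* the expensive per-node work would happen here */ }
};

void remove_child_deferred_notify(std::vector<FlagNode *> &children, size_t p_index) {
	children.erase(children.begin() + p_index); // the ordered_remove equivalent
	for (size_t i = p_index; i < children.size(); i++) {
		children[i]->moved_in_parent_dirty = true; // flag, rather than notify, shifted siblings
	}
}

void flush_moved_in_parent(std::vector<FlagNode *> &children) {
	for (FlagNode *c : children) {
		if (c->moved_in_parent_dirty) {
			c->moved_in_parent_dirty = false;
			c->moved_in_parent(); // sent once per node, however many removals occurred
		}
	}
}
```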
Superseded by #62444.
Calling queue_free() for large numbers of siblings could previously be pathologically slow, with the time taken rising roughly quadratically with the number of children. This looked partly due to ordered_remove from the child list and notifications.
This PR identifies queued objects that are nodes, and deletes children flagged for deletion in reverse order. This minimizes the cost of reordering.
Fixes #61929
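For readers skimming the thread, here is a condensed, standalone sketch of the idea (illustrative only: a plain std::vector stands in for the engine's child list, and the delete-queue / ObjectDB handling is omitted). Queued siblings are erased starting from the highest child index, so queued children never shift each other; in the benchmark case where every child is queued, no shifting or re-notification happens at all.

```cpp
#include <vector>

// Simplified illustration (not the actual engine code) of deleting queued
// siblings in reverse child order. Erasing from the back first means a queued
// child is never shifted down repeatedly by the removal of earlier siblings,
// which is what made the forward-order case roughly O(n^2) in moves and
// "moved in parent" notifications.
struct SimpleNode {
	bool queued_for_deletion = false;
	std::vector<SimpleNode *> children;
};

void delete_queued_children_reverse(SimpleNode *parent) {
	// Walk the child list backwards and erase every queued child.
	for (int i = (int)parent->children.size() - 1; i >= 0; i--) {
		SimpleNode *child = parent->children[i];
		if (child->queued_for_deletion) {
			parent->children.erase(parent->children.begin() + i);
			delete child; // the engine would also send deletion notifications here
		}
	}
}
```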
Performance testing, number of child nodes deleted:
50000 nodes: before 137017 ms, after 422 ms (325x faster)
20000 nodes: before 19425 ms, after 250 ms (78x faster)
5000 nodes: before 1228 ms, after 169 ms (7.3x faster)
1000 nodes: before 194 ms, after 150 ms (1.3x faster)
(at lower node numbers the measurement from the MRP in #61929 becomes less useful as there are some fixed delays, however the pattern is that the benefit really starts to show with a decent number of children)
Notes

- Sorting the delete_queue was considered, but there seems no efficient way of getting the child number from a node.
- The ObjectDB::get_instance() call is efficient for deleted objects.
- Storing a deletion flag on the node itself could avoid the ObjectDB::get_instance() calls, but might be more tricky to turn off efficiently.

Update

It turns out that after the fix, the bottleneck was in the BVH doing collision detection of all the overlapping boxes, creating a lot of pairs. When the boxes were hidden, this artificial BVH bottleneck was removed and the new PR was much faster (i.e. 100x faster).