Garbage collection should be aware of app_id/recording_id semantics #1904
Comments
There are many ways to solve this:
Different use cases have different requirements. What is obvious is that the current behavior sucks. I suggest we just GC every open recording, OR drop the oldest recording, whichever is easier.
As far as I'm aware, we now evenly distribute the GC pass across all recordings:

```rust
pub fn purge_fraction_of_ram(&mut self, fraction_to_purge: f32) {
    re_tracing::profile_function!();

    for store_db in self.store_dbs.values_mut() {
        store_db.purge_fraction_of_ram(fraction_to_purge);
    }
}
```

which is overall an improvement, but there should probably be a blueprint setting to configure whether you want to distribute evenly, or prioritize cleaning up previous recordings of the same `app_id` first.
I think the default behavior should be to drop old data in such a way that if you are running serial experiments, old recordings get dropped first, and if you are doing parallel experiments, we drop data from all recordings evenly. That is, we should drop based on time.
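A sketch of that time-based policy, again with hypothetical types: every row carries a global insertion counter, and a GC pass drops the globally oldest rows regardless of which recording owns them. Serial experiments then shed their stale recordings first, while parallel experiments shed data roughly evenly.

```rust
/// Hypothetical row bookkeeping; not Rerun's actual data structures.
struct Row {
    inserted_at: u64,  // global, monotonically increasing insertion counter
    recording_id: u64, // which recording this row belongs to
    bytes: u64,
}

/// Drop rows oldest-first, across *all* recordings, until enough RAM is freed.
fn drop_oldest_rows(rows: &mut Vec<Row>, bytes_to_free: u64) {
    rows.sort_by_key(|r| r.inserted_at);

    let mut freed = 0u64;
    rows.retain(|r| {
        if freed < bytes_to_free {
            freed += r.bytes;
            false // old row: drop it
        } else {
            true // newer rows survive this pass
        }
    });
}
```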
As it stands, I think this is a blocker for the Hugging Face Spaces demo.
A good starting strategy: only run the GC on the oldest recording. When it is empty, close it. Wrinkle: row protection.
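A sketch of that strategy with the row-protection wrinkle made explicit (illustrative types only, not Rerun's): the most recent rows of each recording are protected from collection, so a recording counts as empty once only protected rows remain.

```rust
/// Illustrative per-recording bookkeeping; not Rerun's actual types.
struct RecordingState {
    opened_at: u64,        // used to find the oldest recording
    unprotected_rows: usize,
    protected_rows: usize, // e.g. the latest row per timeline, never collected
}

/// One GC pass: shrink only the oldest recording, and close it once nothing
/// but protected rows is left.
fn gc_pass(recordings: &mut Vec<RecordingState>, rows_to_drop: usize) {
    if let Some(oldest) = recordings.iter_mut().min_by_key(|r| r.opened_at) {
        oldest.unprotected_rows = oldest.unprotected_rows.saturating_sub(rows_to_drop);
    }

    // "Empty" here means only protected rows remain, so the recording is closed.
    recordings.retain(|r| r.unprotected_rows > 0);
}
```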
**Commit by commit, there's renaming involved!**

The GC will now focus on the oldest-modified recording first. Tried a lot of fancy things, but a lot of stress testing has shown that nothing worked as well as doing this the dumb way.

Speaking of stress testing, the scripts I've used are now committed in the repository. Make sure to try them out when modifying the GC code :grimacing:.

In general, the GC holds up under stress much better than I thought/hoped:

- `many_medium_sized_single_row_recordings.py`, `many_medium_sized_many_rows_recordings.py` & `many_large_many_rows_recordings.py` all behave pretty nicely, something like this: https://github.com/rerun-io/rerun/assets/2910679/26f67d69-de0e-4002-8936-2ac32c451cc3
- `many_large_single_row_recordings.py` on the other hand is _still_ a disaster (watch til the end, this slowly devolves into a black hole): https://github.com/rerun-io/rerun/assets/2910679/673ee10c-2eca-4e3e-b285-77714e5c3d61

This is not a new problem (not to me at least 😬): large recordings with very few rows have always been a nightmare for the GC (not specifically the DataStore GC, but the GC as a whole throughout the entire app). I've never had time to investigate why, but now we have an issue for it at least:

- #4185

---

- Fixes #1904
We've seen plenty of reports of users that start a Rerun instance and then run their algorithm a bunch of times as they go through their iterative improvement cycle.
It ends up looking a little like this:
Now obviously at some point these users run out of memory, at which point they learn about `--memory-limit`.

That's all fine, except that garbage collection is completely unaware of these `app_id`/`recording_id` semantics, and so will only purge the currently active datastore (which likely contains the only data that the user still cares about at this point) while the old recordings are left untouched forever.
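For concreteness, a rough sketch of the loop described above, using present-day Rust SDK names (`RecordingStreamBuilder`, `TextLog`) that may not match the API at the time this issue was filed: the script passes the same app_id on every run, and the SDK generates a fresh recording_id per run, which is exactly why old recordings pile up in the viewer.

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same app_id on every run of the script; a new recording_id is
    // generated each time, so every run shows up as yet another recording
    // in the already-open viewer.
    let rec = rerun::RecordingStreamBuilder::new("my_experiment").spawn()?;

    // ... the user's algorithm runs and logs its data ...
    rec.log("status", &rerun::TextLog::new("iteration finished"))?;

    Ok(())
}
```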