Garbage collection should be aware of app_id/recording_id semantics #1904

Closed
Tracked by #1898
teh-cmc opened this issue Apr 18, 2023 · 7 comments · Fixed by #4183
Assignees: teh-cmc
Labels: 🪳 bug (Something isn't working), ⛃ re_datastore (affects the datastore itself), 📺 re_viewer (affects re_viewer itself)

Comments

@teh-cmc (Member) commented Apr 18, 2023

We've seen plenty of reports of users who start a Rerun instance and then run their algorithm a bunch of times as they go through their iterative improvement cycle.

It ends up looking a little like this:
[image]

Now, obviously, at some point these users run out of memory, at which point they learn about `--memory-limit`.
That's all fine, except that garbage collection is completely unaware of app_id/recording_id semantics: it only purges the currently active datastore (which likely contains the only data the user still cares about at this point), while the old recordings are left untouched forever.

[image]

teh-cmc added the 🪳 bug, ⛃ re_datastore, and 📺 re_viewer labels on Apr 18, 2023
@emilk (Member) commented Apr 18, 2023

There are many ways to solve this:

  • GC the oldest recording first
  • GC all at the same rate

Different use cases have different requirements. What is obvious is that the current behavior sucks.

I suggest we just GC every open recording, OR drop the oldest recording, whatever is easier.

@teh-cmc (Member, Author) commented Nov 7, 2023

As far as I'm aware, we now distribute the GC pass evenly across all recordings:

pub fn purge_fraction_of_ram(&mut self, fraction_to_purge: f32) {
    re_tracing::profile_function!();

    for store_db in self.store_dbs.values_mut() {
        store_db.purge_fraction_of_ram(fraction_to_purge);
    }
}

which is overall an improvement, but there probably should be a blueprint setting to configure whether you want to distribute evenly, or prioritize cleaning up previous recordings of the same app_id first.
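
A rough sketch of what the app_id-aware variant could look like (hypothetical types and field names, not the actual Rerun code): purge older recordings of each app_id before touching the most recent one, which is presumably the recording the user still cares about.

```rust
use std::collections::BTreeMap;

/// Stand-in for the real per-recording store; the field names are made up.
struct StoreDb {
    app_id: String,
    last_modified_ns: i64,
}

impl StoreDb {
    fn purge_fraction_of_ram(&mut self, _fraction_to_purge: f32) {
        // ... the per-store GC shown above ...
    }
}

fn purge_prioritizing_old_recordings(
    store_dbs: &mut BTreeMap<String, StoreDb>, // recording_id -> store
    fraction_to_purge: f32,
) {
    // Find, per app_id, the most recently modified recording.
    let mut newest_per_app: BTreeMap<String, i64> = BTreeMap::new();
    for db in store_dbs.values() {
        let newest = newest_per_app.entry(db.app_id.clone()).or_insert(i64::MIN);
        *newest = (*newest).max(db.last_modified_ns);
    }

    // Purge every recording that is *not* the newest of its app_id. A real
    // implementation would re-check the memory budget afterwards and only
    // fall back to purging the active recordings if still over the limit.
    for db in store_dbs.values_mut() {
        if db.last_modified_ns < newest_per_app[&db.app_id] {
            db.purge_fraction_of_ram(fraction_to_purge);
        }
    }
}
```

Whether to default to this or to the even split above is exactly the kind of thing that blueprint setting could control.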

@nikolausWest (Member) commented

I think the default behavior should be to drop old data such that, if you are running serial experiments, old recordings get dropped first, and if you are running parallel experiments, data is dropped from all recordings evenly. That is, we should drop based on time.
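
A minimal sketch of that time-based policy (hypothetical `StoreDb` methods, not the actual Rerun API): pick a single global cutoff over the insertion times of all rows, then let every recording drop whatever falls before it.

```rust
/// Stand-in for the real per-recording store; the methods are made up.
struct StoreDb;

impl StoreDb {
    fn row_insertion_times(&self) -> Vec<i64> {
        Vec::new() // would return the insertion time of every row
    }

    fn purge_everything_before(&mut self, _cutoff_ns: i64) {
        // would drop all rows inserted before the cutoff
    }
}

/// Drop roughly `fraction_to_purge` of all rows, oldest first, across every
/// open recording, by picking a single global cutoff time.
fn purge_fraction_by_time(store_dbs: &mut [StoreDb], fraction_to_purge: f32) {
    let mut times: Vec<i64> = store_dbs
        .iter()
        .flat_map(|db| db.row_insertion_times())
        .collect();
    if times.is_empty() {
        return;
    }
    times.sort_unstable();

    // The cutoff is the insertion time below which `fraction_to_purge` of all
    // rows (across all recordings) fall.
    let idx = ((times.len() as f32) * fraction_to_purge) as usize;
    let cutoff_ns = times[idx.min(times.len() - 1)];

    for db in store_dbs.iter_mut() {
        db.purge_everything_before(cutoff_ns);
    }
}
```

With serial experiments, old recordings fall entirely below the cutoff and are emptied first; with parallel experiments, every recording gets trimmed at roughly the same rate.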

@nikolausWest (Member) commented

As it stands, I think this is a blocker for the Hugging Face Spaces demo.

@emilk (Member) commented Nov 7, 2023

A good starting strategy: only run GC on the oldest recording. When it is empty, close it.

Wrinkle: row-protection.
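
A minimal sketch of that starting strategy, again with hypothetical types rather than the actual viewer code, and ignoring row-protection:

```rust
/// Stand-in for the real per-recording store.
struct StoreDb {
    num_rows: u64,
}

impl StoreDb {
    fn purge_fraction_of_ram(&mut self, _fraction_to_purge: f32) {
        // ... per-store GC ...
    }

    fn is_empty(&self) -> bool {
        self.num_rows == 0
    }
}

/// `store_dbs` is assumed to be kept sorted from oldest-modified to newest.
fn gc_oldest_recording(store_dbs: &mut Vec<StoreDb>, fraction_to_purge: f32) {
    let oldest_is_now_empty = match store_dbs.first_mut() {
        Some(oldest) => {
            oldest.purge_fraction_of_ram(fraction_to_purge);
            oldest.is_empty()
        }
        None => return,
    };

    if oldest_is_now_empty {
        // Nothing left in the oldest recording: close it entirely.
        store_dbs.remove(0);
    }
}
```

The wrinkle is that rows protected from GC can keep the oldest recording from ever becoming empty, so a real implementation needs a rule for when to move on to the next recording anyway.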

teh-cmc self-assigned this on Nov 8, 2023
teh-cmc added a commit that referenced this issue on Nov 9, 2023:
**Commit by commit, there's renaming involved!**

GC will now focus on the oldest-modified recording first.
I tried a lot of fancy things, but stress testing has shown that nothing
worked as well as doing it the dumb way.

Speaking of stress testing, the scripts I've used are now committed to
the repository. Make sure to try them out when modifying the GC code
:grimacing:.

In general, the GC holds up under stress much better than I thought/hoped:
- `many_medium_sized_single_row_recordings.py`,
`many_medium_sized_many_rows_recordings.py` &
`many_large_many_rows_recordings.py` all behave pretty nicely, something
like this:


https://github.com/rerun-io/rerun/assets/2910679/26f67d69-de0e-4002-8936-2ac32c451cc3


- `many_large_single_row_recordings.py`, on the other hand, is _still_ a
disaster (watch till the end; this slowly devolves into a black hole):


https://github.com/rerun-io/rerun/assets/2910679/673ee10c-2eca-4e3e-b285-77714e5c3d61



This is not a new problem (not to me at least 😬): large
recordings with very few rows have always been a nightmare for the GC
(not just the DataStore GC, but the GC as a whole throughout the entire
app).
I've never had time to investigate why, but now at least we have an issue
for it:
- #4185

---

- Fixes #1904