GC livelocks when temporal data is exhausted #7369

Closed
teh-cmc opened this issue Sep 6, 2024 · 6 comments · Fixed by #7370

@teh-cmc
Member

teh-cmc commented Sep 6, 2024

We're in possession of a recording that triggers this behavior:
image

The time cursor is paused. Data is not being ingested. This is therefore a bug: the GC is constantly trying to clean up to free space, but somehow it's not succeeding.

I haven't been able to reproduce this with a simplified script yet.


How to reproduce with the recording in question:

  • rerun --memory-limit 300MiB rec.rrd
  • open memory panel
  • gently scrub the time cursor through the entire timeline, back and forth, a number of times (we want to fill the latest-at cache with lots of entries)
  • switch between timelines and repeat the scrubbing
  • continue this process until the GC goes into livelock
24-09-07_12.11.42.patched.mp4
@teh-cmc teh-cmc added 🪳 bug Something isn't working 🚀 performance Optimization, memory use, etc labels Sep 6, 2024
@teh-cmc teh-cmc self-assigned this Sep 6, 2024
@teh-cmc
Member Author

teh-cmc commented Sep 6, 2024

I'm 99% sure this is just an edge case I missed in:

I'll be back.

@teh-cmc
Member Author

teh-cmc commented Sep 7, 2024

I think I know what's going on. I was looking in the wrong place: this is unrelated to #7182.

The problem stems from a nasty interaction between static data, latest-at query-time caching, and garbage collection.

Look at the following store stats after the recording has just loaded:

Without --memory-limit:
image

With --memory-limit 300MiB:
image

By the time the GC enters into livelock, the stats will look like this:
image

What's happening here is that all the temporal data that could legally be removed has already been removed.
The 39 remaining temporal chunks that we see in the stats can never go away, since the GC runs with `protect_latest=1`: we're already at the point where all that's left is one temporal chunk per (entity, timeline, component).
The livelock has started: we can run the GC all we want; from this point onwards it will never be able to remove anything.
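To make that mechanism concrete, here is a minimal sketch of a `protect_latest`-style collection pass. The types and names are made up for illustration, not rerun's actual chunk store API:

```rust
use std::collections::HashMap;

/// Illustrative only: the key stands in for (entity, timeline, component),
/// and the values are chunk ids ordered from oldest to newest.
type ChunkKey = (String, String, String);
type ChunkId = u64;

/// With `protect_latest = 1`, at least one temporal chunk must survive per key,
/// so once the store is down to exactly one chunk per key the GC reclaims nothing.
fn collect_garbage(
    store: &mut HashMap<ChunkKey, Vec<ChunkId>>,
    protect_latest: usize,
) -> Vec<ChunkId> {
    let mut dropped = Vec::new();
    for chunks in store.values_mut() {
        // Drop the oldest chunks first, but never touch the protected tail.
        while chunks.len() > protect_latest {
            dropped.push(chunks.remove(0));
        }
    }
    dropped // empty => nothing could be reclaimed; running this again won't help
}
```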

But if we only have ~40MiB of data in the store, why are we hitting the 300MiB limit and running the GC in the first place?

The problem is that the latest-at cache is organized by query time, as opposed to data time.
That is why this cache grows so much when you scrub the time cursor around: even though we are only caching references to chunks, all these btrees still take a lot of space on their own once there are this many query timestamps to keep track of.
The reason we do this is performance: the viewer runs a number of latest-at queries per frame that grows linearly with the number of entities in the recording; every microsecond we can shave off on the latest-at path matters, and it all compounds very quickly by the end of the frame.
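As a rough sketch of why a query-time-keyed cache balloons under scrubbing (toy types, not rerun's actual cache implementation): every distinct timestamp the cursor visits inserts a new btree entry, even when all of those entries resolve to the same underlying chunk.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Stand-in for a chunk of actual data; only references to it are cached.
struct Chunk;

/// One cache per (entity, timeline, component), keyed by *query* time.
struct LatestAtCache {
    per_query_time: BTreeMap<i64, Arc<Chunk>>,
}

impl LatestAtCache {
    /// A cache hit is a single btree lookup, which is what keeps the per-frame
    /// latest-at path cheap -- but every new query timestamp adds an entry.
    fn get_or_insert(
        &mut self,
        query_time: i64,
        fetch: impl FnOnce() -> Arc<Chunk>,
    ) -> Arc<Chunk> {
        self.per_query_time
            .entry(query_time)
            .or_insert_with(fetch)
            .clone()
    }
}
```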

Now here's where this gets ugly: if there's nothing to GC, then there are no ChunkStoreEvents being sent, and thus there is no cache invalidation going on.
The cache will therefore grow indefinitely, while the GC desperately attempts to keep things under control: livelock.

The fix is to garbage collect the query cache directly when such a situation happens.
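A minimal sketch of that fallback, using made-up store and cache types (the actual patch is in #7370): if the store GC is over budget yet reclaimed nothing, no deletion events will reach the caches, so the query cache gets purged directly instead.

```rust
/// Toy stand-ins for the store and the query caches; all names are hypothetical.
struct Store;
struct QueryCaches;

impl Store {
    /// Returns the ids of the chunks that were actually dropped.
    fn gc(&mut self, _budget_bytes: u64) -> Vec<u64> {
        Vec::new()
    }
    fn memory_used(&self) -> u64 {
        0
    }
}

impl QueryCaches {
    fn on_store_deletions(&mut self, _dropped_chunk_ids: &[u64]) {}
    fn purge_everything(&mut self) {}
}

/// If the store GC reclaimed nothing while we are still over budget, there are
/// no deletion events to drive cache invalidation: purge the cache directly.
fn run_gc(store: &mut Store, caches: &mut QueryCaches, budget_bytes: u64) {
    let dropped = store.gc(budget_bytes);
    if dropped.is_empty() && store.memory_used() > budget_bytes {
        caches.purge_everything();
    } else {
        caches.on_store_deletions(&dropped);
    }
}
```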

@teh-cmc teh-cmc changed the title Data is leaking somewhere in some undefined edge case, livelocking the GC GC livelocks when temporal data is exhausted Sep 7, 2024
jleibs pushed a commit that referenced this issue Sep 9, 2024
When the GC needs to reclaim memory but has already exhausted all the
data from the store, then purge the query cache directly.
The complete rationale for this patch is here:
#7369 (comment).

Before:

![image](https://github.com/user-attachments/assets/0ee825d1-11a5-43ad-b4ac-4e9ee6134108)

After:

![image](https://github.com/user-attachments/assets/6bb83281-94e7-42ca-8cd6-5502bdefe47b)


- Fixes #7369

@ricpruss

Can we reopen? I am still seeing this behaviour: rerun does not maintain a memory limit with the latest development build from last night. I am happy to provide a recording for this one if the old recording does not show it.

GCLock

@emilk emilk reopened this Sep 12, 2024
@emilk emilk added this to the 0.19 - Dataframe and web video milestone Sep 16, 2024
@teh-cmc
Member Author

teh-cmc commented Sep 23, 2024

I haven't been able to reproduce this using either `main` or the latest development build @ricpruss (that is `pip install --pre --no-index -f https://build.rerun.io/commit/de49fce/wheels --upgrade rerun-sdk` at the time I'm writing this).

I'm using the second RRD file you've provided (memory-too-much.rrd) with a 300MiB memory limit, and putting infinite pressure on the query cache by repeatedly looping the time cursor over the entire timeline:
image

  • What memory limit are you using in your test?
  • Can you take a screenshot of the information shown in the about menu?
  • Can you also copy the output of `rerun --version`?

@teh-cmc
Member Author

teh-cmc commented Sep 24, 2024

Just to make sure, I tried the above on 0.18.2 one more time, and sure enough it very rapidly descends into madness:
image

@teh-cmc
Member Author

teh-cmc commented Sep 25, 2024

Confirmed fixed, see #7370 (comment)

@teh-cmc teh-cmc closed this as completed Sep 25, 2024