GC livelocks when temporal data is exhausted #7369

Closed
teh-cmc opened this issue Sep 6, 2024 · 6 comments · Fixed by #7370

@teh-cmc
Member

teh-cmc commented Sep 6, 2024

We're in possession of a recording that triggers this behavior:
image

The time cursor is paused. Data is not being ingested. This is therefore a bug: the GC is constantly trying to clean up to free space, but somehow it's not succeeding.

I haven't been able to reproduce this with a simplified script yet.


How to reproduce with the recording in question:

  • rerun --memory-limit 300MiB rec.rrd
  • open memory panel
  • gently scrub the time cursor through the entire timeline, back and forth, a number of times (we want to fill the latest-at cache with lots of entries)
  • switch between timelines and repeat the scrubbing
  • continue this process until the GC goes into livelock
24-09-07_12.11.42.patched.mp4
@teh-cmc teh-cmc added 🪳 bug Something isn't working 🚀 performance Optimization, memory use, etc labels Sep 6, 2024
@teh-cmc teh-cmc self-assigned this Sep 6, 2024
@teh-cmc
Member Author

teh-cmc commented Sep 6, 2024

I'm 99% sure this is just an edge case I missed in:

I'll be back.

@teh-cmc
Member Author

teh-cmc commented Sep 7, 2024

I think I know what's going on. I was looking in the wrong place: this is unrelated to #7182.

The problem stems from a nasty interaction between static data, latest-at query-time caching, and garbage collection.

Look at the following store stats after the recording has just loaded:

Without --memory-limit:
image

With --memory-limit 300MiB:
image

By the time the GC enters into livelock, the stats will look like this:
image

What's happening here is that all the temporal data that could legally be removed has already been removed.
The 39 remaining temporal chunks that we see in the stats can never go away, since the GC runs with `protect_latest=1`: we're already at the point where all that's left is one temporal chunk per (entity, timeline, component).
The livelock has started: we can run the GC all we want; from this point onwards it will never be able to remove anything.
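To make that mechanism concrete, here is a minimal sketch of a `protect_latest`-style collection pass. The types and names are made up for illustration, not rerun's actual chunk store API:

```rust
use std::collections::HashMap;

/// Illustrative only: the key stands in for (entity, timeline, component),
/// and the values are chunk ids ordered from oldest to newest.
type ChunkKey = (String, String, String);
type ChunkId = u64;

/// With `protect_latest = 1`, at least one temporal chunk must survive per key,
/// so once the store is down to exactly one chunk per key the GC reclaims nothing.
fn collect_garbage(
    store: &mut HashMap<ChunkKey, Vec<ChunkId>>,
    protect_latest: usize,
) -> Vec<ChunkId> {
    let mut dropped = Vec::new();
    for chunks in store.values_mut() {
        // Drop the oldest chunks first, but never touch the protected tail.
        while chunks.len() > protect_latest {
            dropped.push(chunks.remove(0));
        }
    }
    dropped // empty => nothing could be reclaimed; running this again won't help
}
```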

But if we only have ~40MiB of data in the store, why are we hitting the 300MiB limit and running the GC in the first place?

The problem is that the latest-at cache is organized by query time, as opposed to data time.
That is why this cache grows so much when you scrub the time cursor around: even though we are only caching references to chunks, all these btrees still take a lot of space on their own once there are this many query timestamps to keep track of.
The reason we do this is performance: the viewer runs a number of latest-at queries per frame that grows linearly with the number of entities in the recording; every microsecond we can shave off on the latest-at path matters, and it all compounds very quickly by the end of the frame.
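As a rough sketch of why a query-time-keyed cache balloons under scrubbing (toy types, not rerun's actual cache implementation): every distinct timestamp the cursor visits inserts a new btree entry, even when all of those entries resolve to the same underlying chunk.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Stand-in for a chunk of actual data; only references to it are cached.
struct Chunk;

/// One cache per (entity, timeline, component), keyed by *query* time.
struct LatestAtCache {
    per_query_time: BTreeMap<i64, Arc<Chunk>>,
}

impl LatestAtCache {
    /// A cache hit is a single btree lookup, which is what keeps the per-frame
    /// latest-at path cheap -- but every new query timestamp adds an entry.
    fn get_or_insert(
        &mut self,
        query_time: i64,
        fetch: impl FnOnce() -> Arc<Chunk>,
    ) -> Arc<Chunk> {
        self.per_query_time
            .entry(query_time)
            .or_insert_with(fetch)
            .clone()
    }
}
```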

Now here's where this gets ugly: if there's nothing to GC, then there are no ChunkStoreEvents being sent, and thus there is no cache invalidation going on.
The cache will therefore grow indefinitely, while the GC desperately attempts to keep things under control: livelock.

The fix is to garbage collect the query cache directly when such a situation happens.
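A minimal sketch of that fallback, using made-up store and cache types (the actual patch is in #7370): if the store GC is over budget yet reclaimed nothing, no deletion events will reach the caches, so the query cache gets purged directly instead.

```rust
/// Toy stand-ins for the store and the query caches; all names are hypothetical.
struct Store;
struct QueryCaches;

impl Store {
    /// Returns the ids of the chunks that were actually dropped.
    fn gc(&mut self, _budget_bytes: u64) -> Vec<u64> {
        Vec::new()
    }
    fn memory_used(&self) -> u64 {
        0
    }
}

impl QueryCaches {
    fn on_store_deletions(&mut self, _dropped_chunk_ids: &[u64]) {}
    fn purge_everything(&mut self) {}
}

/// If the store GC reclaimed nothing while we are still over budget, there are
/// no deletion events to drive cache invalidation: purge the cache directly.
fn run_gc(store: &mut Store, caches: &mut QueryCaches, budget_bytes: u64) {
    let dropped = store.gc(budget_bytes);
    if dropped.is_empty() && store.memory_used() > budget_bytes {
        caches.purge_everything();
    } else {
        caches.on_store_deletions(&dropped);
    }
}
```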

@teh-cmc teh-cmc changed the title Data is leaking somewhere in some undefined edge case, livelocking the GC GC livelocks when temporal data is exhausted Sep 7, 2024
jleibs pushed a commit that referenced this issue Sep 9, 2024
When the GC needs to reclaim memory but has already exhausted all the
data from the store, then purge the query cache directly.
The complete rationale for this patch is here:
#7369 (comment).

Before:

![image](https://github.com/user-attachments/assets/0ee825d1-11a5-43ad-b4ac-4e9ee6134108)

After:

![image](https://github.com/user-attachments/assets/6bb83281-94e7-42ca-8cd6-5502bdefe47b)


- Fixes #7369

@ricpruss

Can we reopen? I am still seeing this behaviour: rerun does not maintain a memory limit with the latest development build from last night. I am happy to provide a recording for this one if the old recording does not show it.

GCLock

@emilk emilk reopened this Sep 12, 2024
@emilk emilk added this to the 0.19 - Dataframe and web video milestone Sep 16, 2024
@teh-cmc
Member Author

teh-cmc commented Sep 23, 2024

I haven't been able to reproduce this using either `main` or the latest development build @ricpruss (that is `pip install --pre --no-index -f https://build.rerun.io/commit/de49fce/wheels --upgrade rerun-sdk` at the time I'm writing this).

I'm using the second RRD file you've provided (memory-too-much.rrd) with a 300MiB memory limit, and putting infinite pressure on the query cache by repeatedly looping the time cursor over the entire timeline:
image

  • What memory limit are you using in your test?
  • Can you take a screenshot of the information shown in the about menu?
  • Can you also copy the output of `rerun --version`?

@teh-cmc
Member Author

teh-cmc commented Sep 24, 2024

Just to make sure, I tried the above on 0.18.2 one more time, and sure enough it very rapidly descends into madness:
image

@teh-cmc
Member Author

teh-cmc commented Sep 25, 2024

Confirmed fixed, see #7370 (comment)

@teh-cmc teh-cmc closed this as completed Sep 25, 2024