# GC livelocks when temporal data is exhausted #7369
I'm 99% sure this is just an edge case I missed in: I'll be back.
I think I know what's going on. I was looking in the wrong place: this is unrelated to #7182.

The problem stems from a nasty interaction between static data, latest-at query-time caching, and garbage collection.

Look at the following store stats after the recording has just loaded:

By the time the GC enters into livelock, the stats will look like this:

What's happening here is that all the temporal data that could be legally removed has already been legally removed. But if we only have ~40MiB of data in the store, why are we hitting the 300MiB limit and running the GC in the first place? The problem is that the latest-at cache is organized by query time, as opposed to data time.

Now here's where this gets ugly: if there's nothing to GC, then there are no store events, and therefore the query cache is never told to evict anything.

The fix is to garbage collect the query cache directly when such a situation happens.
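To make the growth pattern concrete, here is a minimal, self-contained sketch of a latest-at cache keyed by *query* time. All types, names, and sizes are made up for illustration and are not the actual rerun internals: every distinct query timestamp gets its own cache entry, so the cache keeps growing as latest-at queries are issued at new times even though the store contents never change.

```rust
use std::collections::BTreeMap;

/// Hypothetical latest-at cache keyed by *query* time (illustration only).
#[derive(Default)]
struct LatestAtCache {
    /// query time -> cached query result
    per_query_time: BTreeMap<i64, Vec<u8>>,
}

impl LatestAtCache {
    /// Returns the cached result for `query_time`, computing and storing it on a miss.
    fn latest_at(&mut self, query_time: i64, store_lookup: impl FnOnce() -> Vec<u8>) -> &[u8] {
        self.per_query_time
            .entry(query_time)
            .or_insert_with(store_lookup)
    }

    fn size_bytes(&self) -> usize {
        self.per_query_time.values().map(Vec::len).sum()
    }
}

fn main() {
    // Pretend the store holds a single 1 MiB static chunk and nothing else.
    let static_chunk = vec![0_u8; 1024 * 1024];
    let mut cache = LatestAtCache::default();

    // Issuing latest-at queries at 100 distinct times (e.g. while scrubbing the
    // time cursor) creates 100 cache entries, all holding the same answer.
    for query_time in 0..100 {
        cache.latest_at(query_time, || static_chunk.clone());
    }

    // The store still holds ~1 MiB, but the cache now holds ~100 MiB.
    println!("cache size: {} MiB", cache.size_bytes() / (1024 * 1024));
}
```

A cache organized by data time would instead collapse all of those queries onto the same entry, since the latest matching data is the same static chunk every time.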
When the GC needs to reclaim memory but has already exhausted all the data from the store, purge the query cache directly. The complete rationale for this patch is here: #7369 (comment).

Before:

![image](https://github.com/user-attachments/assets/0ee825d1-11a5-43ad-b4ac-4e9ee6134108)

After:

![image](https://github.com/user-attachments/assets/6bb83281-94e7-42ca-8cd6-5502bdefe47b)

- Fixes #7369

### Checklist

* [x] I have read and agree to the [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested the web demo (if applicable):
  * Using examples from latest `main` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7370?manifest_url=https://app.rerun.io/version/main/examples_manifest.json)
  * Using full set of examples from `nightly` build: [rerun.io/viewer](https://rerun.io/viewer/pr/7370?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json)
* [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
* [x] If applicable, add a new check to the [release checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)!
* [x] I have noted any breaking changes to the log API in `CHANGELOG.md` and the migration guide

- [PR Build Summary](https://build.rerun.io/pr/7370)
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.
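In spirit (hypothetical names and numbers, not the actual code in this PR), the change amounts to adding a fallback path to the GC driver: when a pass over the store reclaims nothing but memory usage is still over the limit, purge the query caches directly instead of looping forever.

```rust
/// Sketch of a memory-limited GC driver with the fallback described above.
/// All sizes share the same unit; `gc_store` returns how much it reclaimed.
fn run_gc_pass(
    limit: u64,
    store: &mut u64,
    cache: &mut u64,
    gc_store: impl FnOnce(&mut u64) -> u64,
) {
    if *store + *cache <= limit {
        return; // under budget, nothing to do
    }

    let reclaimed = gc_store(store);

    // Before the patch the pass effectively ended here: with every legally
    // removable temporal chunk already gone, `reclaimed` is 0, nothing ever
    // invalidates the query cache, and the next frame trips the limit again.
    if reclaimed == 0 {
        *cache = 0; // stand-in for "purge the query cache directly"
    }
}

fn main() {
    // Illustrative numbers (MiB): ~40 left in the store, a 300 limit, and a
    // query cache that has grown well past the difference.
    let (mut store, mut cache) = (40_u64, 400_u64);
    run_gc_pass(300, &mut store, &mut cache, |_store| 0);
    println!("store: {store} MiB, cache: {cache} MiB"); // cache is now 0
}
```

Purging the cache wholesale is acceptable here because it only holds derived results: the next latest-at query simply recomputes them from the store.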
I haven't been able to repro using either. I'm using the second RRD file you've provided.
Confirmed fixed, see #7370 (comment).
We're in possession of a recording that triggers this behavior:
The time cursor is paused. Data is not being ingested. This is therefore a bug: the GC is constantly trying to clean up data to free space, but somehow it's not succeeding.
I haven't been able to reproduce this with a simplified script yet.
How to reproduce with the recording in question:
```
rerun --memory-limit 300MiB rec.rrd
```
24-09-07_12.11.42.patched.mp4
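For context, the recording mixes static data with temporal data, as discussed in the comments above. Below is a sketch of how such a recording could be produced with the rerun Rust SDK; the entity paths and data are made up, and, per the note above, a simplified script like this has not been shown to reproduce the livelock on its own.

```rust
// Sketch only: hypothetical entity paths and data, written to an .rrd file
// with the `rerun` crate so it can be replayed with `rerun --memory-limit`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rec = rerun::RecordingStreamBuilder::new("rerun_example_gc_livelock").save("rec.rrd")?;

    // Some static data, which the GC is never allowed to remove.
    rec.log_static(
        "world/background",
        &rerun::Points3D::new([(0.0_f32, 0.0_f32, 0.0_f32)]),
    )?;

    // Plenty of temporal data for the GC to chew through once the memory limit is hit.
    for frame in 0..10_000_i64 {
        rec.set_time_sequence("frame", frame);
        rec.log(
            "world/points",
            &rerun::Points3D::new([(frame as f32, 0.0_f32, 0.0_f32)]),
        )?;
    }

    Ok(())
}
```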