Add usage-aware garbage collection #8870
Comments
Hopefully this should be easy enough to do using some combination of a custom blockstore wrapper and the contexts passed through requests.
This might be tricky. In particular, most proposals for different GC strategies involve tracking more state. While this is fine, it likely means we have to do one of:
I don't know off-hand if any of these are particularly palatable, and it may depend on the types of GC strategies that get built. I'm highly supportive of making new GC strategies easier to work with at the library layer, though, so each application can choose its own strategy. My thinking is that once we've got that, we can figure out if/how to move that level of choice into the binary.
Yeah, you cannot make cache decisions without holding additional data to make them. The current "flat out" write to the key-value store isn't enough for that (short of very costly traversal of all data structures from the pin level down). So the current mark-and-sweep approach is not only unsafe, since it's not thread-safe, but also the worst in terms of performance. My earlier proposal is a bit older, but I still think it's a viable way forward. It not only tracks per-chunk usage by pins with a ref counter, but also allows a mixed time/usage eviction like ARC does (which is sadly patented). It would also eliminate the performance issues of some other approaches, as it is lock-free: it only requires receiving messages about the actions of the rest of the program.
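A minimal sketch of that message-driven tracking, with all names invented for the example: a single goroutine owns the per-block counters and receives usage events over a channel, so nothing else in the program needs to take a lock.

```go
package tracking

// Usage events sent by the rest of the program; names are illustrative only.
type event struct {
	cid   string
	delta int // +1 when a pin/reference appears, -1 when it goes away
}

// tracker owns all per-block counters; only its loop goroutine touches refs,
// so no mutex is needed -- the rest of the program just sends messages.
type tracker struct {
	events chan event
	refs   map[string]int
}

func newTracker() *tracker {
	t := &tracker{
		events: make(chan event, 1024),
		refs:   make(map[string]int),
	}
	go t.loop()
	return t
}

func (t *tracker) loop() {
	for ev := range t.events {
		t.refs[ev.cid] += ev.delta
		if t.refs[ev.cid] <= 0 {
			// No pin references this block any more: it becomes an
			// eviction candidate for the GC strategy to score.
			delete(t.refs, ev.cid)
		}
	}
}

// Record is called from the pin/unpin/read paths; it never takes a lock.
func (t *tracker) Record(cid string, delta int) {
	t.events <- event{cid: cid, delta: delta}
}
```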
Could we address the use case mentioned by @iand by considering only access time? E.g. mount the flatfs datastore w/
@guseggert well, this would be Least Recently Used (LRU) eviction, which is one of the worst choices. It would neither limit the maximum amount of storage used (since we would simply drop everything not touched within an hour) nor work for "normal" IPFS usage, where files are stored long term. But the main issue is that most content is probably not reused within an hour, more like within several hours to several days. We need to be able to track accesses over longer periods of time to make a reasonably optimal decision.
Yes, it's not optimal for all use cases. But for some it would work well enough, such as gateways, which is the use case described in this issue.
I doubt that it would work for the use case. As I said: it might be able to catch some very frequently accessed blocks that way, but most content will be discarded before reuse. That's also not what's described in this issue as a possible solution:
I have used LRU eviction for this kind of caching in the past, and it works. It is easy to extend the strategy to, e.g., only evict the oldest entries until the cache no longer exceeds some size, and then tweak the cache size and eviction threshold to adjust the hit rate for your traffic patterns. Most importantly, it is easy to implement outside of go-ipfs with a script and a cron job, so it is a path forward now for a common use case, without having to build stateful GC into go-ipfs (which is unquestionably more optimal but has a much higher engineering cost).
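A rough sketch of that script-and-cron approach (the directory layout, the use of modification time, and the function itself are assumptions for illustration; a real script would prefer access time and must leave pinned and MFS data alone):

```go
package evict

import (
	"os"
	"path/filepath"
	"sort"
)

// evictOldest walks a block directory (e.g. flatfs's "blocks" dir), sorts the
// files by modification time and removes the oldest until the total size is
// back under maxBytes.
func evictOldest(dir string, maxBytes int64) error {
	type entry struct {
		path string
		info os.FileInfo
	}
	var entries []entry
	var total int64

	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		entries = append(entries, entry{path, info})
		total += info.Size()
		return nil
	})
	if err != nil {
		return err
	}

	// Oldest entries first.
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].info.ModTime().Before(entries[j].info.ModTime())
	})

	for _, e := range entries {
		if total <= maxBytes {
			break
		}
		if err := os.Remove(e.path); err != nil {
			return err
		}
		total -= e.info.Size()
	}
	return nil
}
```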
Tracking will add storage overhead, mostly write operations, and those are slow. It's probably viable to store LRU tracking information in memory and do periodic batch dumps, then during GC load all dumps, delete data, delete the tracking dumps, and write new tracking information. I don't think this feature is needed, though: important data is pinned, and ad hoc data is not important enough long term to justify the additional overhead.
@guseggert the problem with LRU is that it doesn't account for the cost of refilling the cache, which is usually a non-issue in filesystems and databases. In a distributed system like IPFS the fill cost is highly variable, so I want to factor it into the algorithm to meet my goal of reducing time to first byte for the gateway. A good reference on some of the issues here is LNC-R-W3-U (citeseer or dweb.link). In that paper the authors define a delay-savings ratio (DSR) which represents the fraction of communication delays saved by satisfying requests from cache. Ignoring cache revalidation (not meaningful in an immutable system), it's defined as:

DSR = Σ_i (d_i × h_i) / Σ_i (d_i × f_i)

where d_i is the mean delay to fetch document i, h_i is the number of references to document i satisfied from the cache, and f_i is the total number of references to document i.
To maximise the DSR while also maintaining a fixed cache size, they assign a profit to each document and then retain the top N documents that fit the defined cache size. The profit function is:

p_i = (r_i × d_i) / s_i

where r_i is the reference rate of document i, d_i is the mean delay to fetch it, and s_i is its size.
The reference rate and mean delays can be calculated from a sliding window of the last three request times and the last three fetch durations (the paper has a section analysing the size of the sliding window, and 3 is sufficient). At the moment I think it's best to limit this caching algorithm to roots rather than individual blocks. The accounting overhead is then six values per root (3 timestamps and 3 durations), and we need to persist this so it isn't lost between restarts.
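A minimal sketch of how that per-root accounting could feed the profit function (the struct, the names, and the rate estimator below are my illustrative assumptions, not go-ipfs code):

```go
package accounting

import "time"

const window = 3 // size of the sliding window, as in the paper

// rootStats is a hypothetical per-root record: the last 3 reference times,
// the last 3 fetch delays, and the bytes freed if the root's dag is dropped.
type rootStats struct {
	refTimes    [window]time.Time     // most recent first
	fetchDelays [window]time.Duration // most recent first
	size        int64
}

// profit computes p = r*d/s for one root: the reference rate r is estimated
// from the sliding window as window / (now - oldest recorded reference), d is
// the mean of the recorded fetch delays, and s is the dag size. This is one
// plausible reading of the description above, not an agreed implementation.
func profit(s rootStats, now time.Time) float64 {
	age := now.Sub(s.refTimes[window-1]).Seconds()
	if age <= 0 || s.size == 0 {
		return 0
	}
	rate := float64(window) / age // references per second

	var total time.Duration
	for _, d := range s.fetchDelays {
		total += d
	}
	meanDelay := (total / window).Seconds()

	return rate * meanDelay / float64(s.size)
}
```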
Two thoughts on this line of work:
@yiannisbot wrote:
Got an idea on that. We don't need to store the blocks for each request on each node (gateway in this case). We can decide fully locally which blocks to keep and which not:
Distributed way

I planned a kind of "seldom" storage section for blocks which we wanted to drop, but for which we asked the DHT and found no other copies. This way we don't drop the last copy of data in the network. While I still think that could be a good idea for low-throughput nodes, it's not a good idea for high-throughput nodes. High-throughput nodes would use a different mode: a DHT-like probabilistic caching.
This avoids any communication between nodes, but on the other hand makes it predictable where a block may be stored. It would allow sorting the bitswap requests to neighbours by the distance between the CID and each node's ID, to minimise the number of queries we have to make on average to fetch the block before we resort to using the DHT.
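A sketch of what such a purely local placement rule could look like (hashing both identifiers into the same space and the prefix-length threshold are illustrative choices, not part of the proposal above):

```go
package placement

import (
	"crypto/sha256"
	"math/bits"
)

// shouldCache decides purely locally whether this node keeps a block: both the
// CID and the node ID are hashed into the same space and the block is kept if
// enough of their leading bits match. The threshold is an operator-chosen
// tuning knob; nothing here is coordinated with other nodes.
func shouldCache(cidBytes, nodeID []byte, minCommonPrefixBits int) bool {
	c := sha256.Sum256(cidBytes)
	n := sha256.Sum256(nodeID)

	prefix := 0
	for i := range c {
		x := c[i] ^ n[i]
		if x == 0 {
			prefix += 8
			continue
		}
		prefix += bits.LeadingZeros8(x)
		break
	}
	return prefix >= minCommonPrefixBits
}
```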
Updating this to note the in-progress implementation of phase 1 at master...iand:block-accounting
@iand ah, cool. But it looks like a massive overhead to do this amount of accounting for each block. How much memory does this need per block?
I think I will end up doing it only for roots of dags, since the use case is the gateway, which predominantly serves complete dags rather than individual blocks. The accounting is currently 3 values for the last 3 times the block was served, 3 values for the last 3 elapsed times when retrieving it, and a value for the size of the block/dag that would be freed if it were deleted. My current unoptimized version looks like this:

```go
type BlockAccounting struct {
// Last 3 times this block was referenced
ReferenceTimes [3]time.Time
// Last 3 durations of time taken to request block from network
FetchDelays [3]time.Duration
// Size of the block
Size int64
}
```

But I plan to define a smaller representation, since our fetch durations are capped and we don't need to represent reference times in the medieval era 😄
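For example, one possible smaller layout (the specific field widths here are my own assumptions about what "capped" and "recent enough" mean, not something decided in this thread):

```go
// Illustrative only: 3 x 4 bytes of timestamps, 3 x 2 bytes of delays and an
// 8-byte size -- roughly 26 bytes per tracked root before padding.
type CompactBlockAccounting struct {
	// Seconds since a per-repo epoch (e.g. repo creation), not Unix time
	ReferenceTimes [3]uint32
	// Milliseconds, capped at roughly 65 seconds
	FetchDelays [3]uint16
	// Size of the block/dag in bytes
	Size uint64
}
```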
I've got a bit left in the bitfield (I still have no idea what to do with "precaching", so that bit will be dropped). So my proposal could be expanded to have this bit signal blocks that are very hard to find/fetch, to maybe give them another round in the cache, for example.
Btw, how is your locking going to work? Example A: a pinning operation is running and your GC is running as well. Will it delete the blocks already fetched? Example B: while your GC is running, a file gets added to the MFS and the pin for it gets deleted. Will it delete the MFS blocks when it has already marked the pins? Or are you planning a "world hold" lock which waits for all current writing operations to complete but blocks everything from being pinned/modified in the MFS?
go-ipfs already uses the GCLocker type to hold a global lock while GC is in progress. This is checked fairly inconsistently in various places, such as when adding a unixfs file. But we should treat the locking as orthogonal to the algorithm used, which should be interruptible. My sense is that locking needs to be improved throughout as a separate exercise.
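For illustration, the overall pattern is roughly the following (a simplified sketch built on a plain RWMutex; this is not the real GCLocker interface or its signatures):

```go
package gclock

import "sync"

// gcLock is a simplified stand-in for the global GC lock: adds/pins take the
// read side for their duration, GC takes the write side, so a GC run waits for
// in-flight writes and blocks new ones until it finishes.
type gcLock struct {
	mu sync.RWMutex
}

// PinLock is held around an add/pin so GC cannot start halfway through it.
func (l *gcLock) PinLock() (unlock func()) {
	l.mu.RLock()
	return l.mu.RUnlock
}

// GCLock is held for the whole GC run and excludes all adds/pins.
func (l *gcLock) GCLock() (unlock func()) {
	l.mu.Lock()
	return l.mu.Unlock
}
```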
Initial evaluation of the profit function (#8870 (comment)) against a version of go-ipfs instrumented to record block retrieval times and access patterns: protocol/prodeng#7 (comment)
Overall, LRU is one of the worst approaches to caching. That's why ARC in ZFS combines MRU and MFU, and the "recently used" part performs pretty badly, as you can see from the ARC statistics from two of my IPFS nodes:

```
$ uptime
08:19:14 up 3 days, 19:16
# arc_summary -s archits
[...]
ARC total accesses (hits + misses): 45.7M
Cache hit ratio: 84.5 % 38.6M
Cache miss ratio: 15.5 % 7.1M
Actual hit ratio (MFU + MRU hits): 84.2 % 38.5M
Data demand efficiency: 2.5 % 6.9M
Cache hits by cache type:
Most frequently used (MFU): 91.0 % 35.2M
Most recently used (MRU): 8.6 % 3.3M
[...]
# arc_summary -s arc
ARC size (current): 100.2 % 15.7 GiB
[...]
Most Frequently Used (MFU) cache size: 81.6 % 480.4 MiB
Most Recently Used (MRU) cache size: 18.4 % 108.2 MiB
[...]
```

```
$ uptime
08:16:43 up 30 days, 13:46 [...]
# arc_summary -s archits
[...]
ARC total accesses (hits + misses): 7.4G
Cache hit ratio: 99.8 % 7.3G
Cache miss ratio: 0.2 % 14.7M
Actual hit ratio (MFU + MRU hits): 99.8 % 7.3G
Data demand efficiency: 99.7 % 1.5G
Data prefetch efficiency: 14.4 % 6.4M
Cache hits by cache type:
Most frequently used (MFU): 98.5 % 7.2G
Most recently used (MRU): 1.5 % 108.0M
[...]
# arc_summary -s arc
ARC size (current): 100.2 % 15.7 GiB
[...]
Most Frequently Used (MFU) cache size: 90.1 % 13.5 GiB
Most Recently Used (MRU) cache size: 9.9 % 1.5 GiB
[...]
```
Yeah, there are various ways the GC can currently lose data that should actually be retained while the node is running. A world lock in a new design is a pretty bad approach tbh, as it's pretty expensive to stop all new writes while the cleanup is taking place. That's why I designed ipfs/notes#428 to avoid blocking writes entirely, via time-based locks on newly written and touched blocks.
Description
When operating as an HTTP gateway, go-ipfs should adopt a garbage collection strategy that retains blocks that improve gateway request times.
The hypothesis is that blocks are being deleted from the blockstore by garbage collection only to be re-requested at a later date, which is much slower. Garbage collection is required to keep within disk space limits, but the current implementation is unaware of the usage patterns of unpinned blocks.
A crude experiment demonstrates the effect. The same URL (/ipfs/QmVoe22tWS8JqfrMHj5WTHJ6zWrWqTL3YeG5uFQmAH6zhs/182.png) was requested every 30 seconds, with a manual GC run every few minutes (using `ipfs repo gc`). The request times at 154, 583, 830 and 984, after a garbage collection, show a tenfold increase in time to first byte.
The timings at 706 are much higher, indicating that the peer probably dropped from the node's peer list.
The default garbage collection strategy is mark and sweep:
bestEffortRoots, by default, only includes the MFS root
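In outline, that strategy amounts to something like the sketch below (interfaces and names are invented for illustration; this is not the actual go-ipfs gc package):

```go
package gc

import "context"

// Minimal interfaces invented for the sketch.
type blockstore interface {
	AllKeys(ctx context.Context) ([]string, error)
	Delete(ctx context.Context, cid string) error
}

// linkFunc returns the child CIDs of a block (in go-ipfs this information
// comes from decoding the block via the DAG service).
type linkFunc func(ctx context.Context, cid string) ([]string, error)

// run marks everything reachable from the pin roots and best-effort roots,
// then deletes every block that was not marked.
func run(ctx context.Context, bs blockstore, links linkFunc, pinRoots, bestEffortRoots []string) error {
	marked := map[string]bool{}

	var mark func(cid string) error
	mark = func(cid string) error {
		if marked[cid] {
			return nil
		}
		marked[cid] = true
		children, err := links(ctx, cid)
		if err != nil {
			return err
		}
		for _, c := range children {
			if err := mark(c); err != nil {
				return err
			}
		}
		return nil
	}

	roots := append(append([]string{}, pinRoots...), bestEffortRoots...)
	for _, r := range roots {
		if err := mark(r); err != nil {
			return err
		}
	}

	keys, err := bs.AllKeys(ctx)
	if err != nil {
		return err
	}
	for _, k := range keys {
		if !marked[k] {
			if err := bs.Delete(ctx, k); err != nil {
				return err
			}
		}
	}
	return nil
}
```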
Proposal
A cache-aware garbage collection algorithm would assign a cost to blocks, derived from the frequency of access and the time taken to fetch them from peers. The algorithm then deletes the blocks with the least cost until sufficient space has been freed.
For space efficiency we may want to only record metrics for roots of dags, i.e. directories and files. In this case the garbage collector would treat all descendant blocks of the root as a unit: either all are deleted together or none are. The final scoring formula will probably want to factor in the size of the dag.
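A sketch of what the eviction pass could look like when each dag root is treated as a unit (the candidate type, the score, and the deleteDAG callback are hypothetical, shown only to convey the shape of the algorithm):

```go
package gc

import "sort"

// candidate is a hypothetical scored root: score would be the profit-style
// value discussed above, size the bytes freed if the whole dag is removed.
type candidate struct {
	root  string
	score float64
	size  int64
}

// evict deletes the lowest-scoring dags first until at least bytesToFree
// bytes have been reclaimed. deleteDAG stands in for whatever removes a root
// and all of its descendants from the blockstore.
func evict(cands []candidate, bytesToFree int64, deleteDAG func(root string) error) (int64, error) {
	// Lowest score first: cheapest to lose.
	sort.Slice(cands, func(i, j int) bool { return cands[i].score < cands[j].score })

	var freed int64
	for _, c := range cands {
		if freed >= bytesToFree {
			break
		}
		if err := deleteDAG(c.root); err != nil {
			return freed, err
		}
		freed += c.size
	}
	return freed, nil
}
```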
Outline of proposed approach:
Related Work