Improve preload mechanism to support random access via ipfs.cat #3510
As a slight optimization, I've switched to
Current preloading is:
The 'Random Access'-aware preloading:
@achingbrain I've moved it here as preload is specific to js-ipfs. Context: we need a smarter preload, specifically a range-aware cat, to unlock use cases such as:
It's probably useful to restate what 'preloading' is and why it's currently necessary.

If you ask your node for a block and it does not have that block, bitswap will ask all connected peers whether they have it. If they do not, your node will issue a DHT query to find another node on the network that has the block. If the query returns a result, it'll dial that peer, bitswap will kick in and the block will be retrieved.

In order for this to work you have to be able to dial the peer - that is, the peer has to be able to accept unsolicited connections, like a server. Browsers currently have only one mechanism available for this: webrtc-star. There are several problems with webrtc-star - it's really heavy, and you can only maintain a few concurrent connections before the browser starts throttling/killing them. If you switch away from a tab, the browser suspends connections, so you won't be dialable. There's also no webrtc-star implementation for go-IPFS, so even if you could be found on the network and your computer is running with the tab open, you're not dialable by the majority of the network.

Looking at it from the other direction, most content is hosted on go-IPFS nodes, which only listen on TCP. They are capable of listening on websockets, but that requires TLS, which requires certificates and other non-trivial setup (unless you use NAT hole punching + reverse DNS + Let's Encrypt), so there are not many nodes with WS addresses. Browser nodes cannot dial TCP, so even if they could find a node with the content they are looking for, chances are they can't dial it.

This is where 'preloading' comes in. When I add a block to my browser repo, I 'preload' it - that is, I get a go-IPFS node on the network to pull the content from me. The preload nodes also happen to be in the default bootstrap list for all js-IPFS nodes, both in the browser and running under node.js.
So when a remote browser node tries to fetch the content, it's on a preload node, which happens to be a bootstrap node that node is connected to, so bitswap will fetch the block. Note that it doesn't fetch the content directly from you, because you are effectively undialable. If I fetch content, I also 'preload' the CID, which tells the go-IPFS preload node to search the network for the content if it doesn't have it already - all things being equal, this should increase the number of directly connected peers I have that hold the content I am looking for. This works until the preload node runs its hourly GC and the block is deleted; otherwise the preload nodes would end up hosting every single block ever added/fetched by a js-IPFS node.

Random access cat. When trying to start reading from an offset, you don't want to load every byte leading up to the offset, so you need to know the layout of the DAG that represents the file. That way you can say, 'ok, byte 3827874 is contained by the fourth child of this node, I'll just load that one'. This is how offset/length works in js-IPFS today. The layout information is contained in the root block of the DAG, which is the block the initial CID refers to. If this block is not available, you can't get the file, as you don't know the layout. There's no way to link to parent blocks from child blocks because DAGs are acyclic - they have to be, otherwise you couldn't calculate the CID of anything, as the CIDs of linked nodes form part of the content that you are trying to calculate the ID of.
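The offset lookup described above can be sketched in a few lines. This is a simplified model, not js-ipfs internals: it assumes each child link is annotated with the number of file bytes its subtree covers, roughly what the unixfs blocksizes metadata provides.

```javascript
// Sketch: given the byte sizes covered by each child of a DAG node,
// find which child contains a given file offset. This is what lets a
// range-aware reader load one subtree and skip its siblings entirely.
function childForOffset(childSizes, offset) {
  let start = 0
  for (let i = 0; i < childSizes.length; i++) {
    if (offset < start + childSizes[i]) {
      // Return the child index and the offset relative to that child
      return { index: i, offsetInChild: offset - start }
    }
    start += childSizes[i]
  }
  return null // offset is past the end of the file
}

// Example: four children covering 1 MiB of file content each
const sizes = [1048576, 1048576, 1048576, 1048576]
console.log(childForOffset(sizes, 3827874)) // byte 3827874 -> fourth child (index 3)
```

A real reader would apply this recursively down the DAG, which is why the root block (and the intermediate blocks on the path to the offset) must be retrievable even when the leaves are not.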
This does not work because you disabled preloading. If you'd preloaded the DAG, it'd be on the preload node and you could
We have a range-aware cat, but the blocks that make up the file have to be on the network somewhere, held by a node that you can find and then successfully dial.
Preloading wikipedia won't download 80GBs of data to your browser, but it will make a recursive refs call on the root CID, which will take ages and send about 15MB of text to your browser in the response (80GB is about 305k blocks at the default block size, at 46 bytes per CID plus JSON formatting). If you are trying to access an offset some way into the dataset, and that dataset is not available on a preload node, it's probably going to take a long time for that content to become available, since the refs call is likely stepping through the links sequentially.

For use cases like this, stand up a few go-IPFS nodes with websocket transports available at publicly accessible domains/ports and have them host the data on some fast storage, then either dial them directly from the browser nodes (faster) or ensure you have DHT delegates configured (slower) and away you go. You'll obviate the need for preloading entirely.

You could thread preloading deeper into js-IPFS and have it preload everything that goes into the bitswap want list, but you'll likely end up making an enormous number of requests. You could experiment with this - libp2p takes multiple DHT delegates which get asked to find providers when blocks are not available in the blockstore. You could implement one that returns an empty list of providers but also sends a non-recursive refs call to a preload node, which should cause it to find the block, and then bitswap would do its thing?
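The back-of-the-envelope numbers above can be checked directly. This sketch assumes the default 256 KiB chunk size and counts only leaf blocks (intermediate DAG nodes and JSON framing add a little more, which is where the ~15MB figure comes from):

```javascript
// Rough estimate of the recursive /api/v0/refs response size for an
// 80 GB DAG, assuming the default 256 KiB chunker and ~46 bytes per
// CID line in the response. Leaf blocks only; a lower bound.
const fileSize = 80e9           // 80 GB
const blockSize = 256 * 1024    // default chunker block size (262144 bytes)
const bytesPerRef = 46          // approximate length of a CIDv0 string

const leafBlocks = Math.ceil(fileSize / blockSize)
const responseBytes = leafBlocks * bytesPerRef

console.log(leafBlocks)    // ~305k blocks
console.log(responseBytes) // ~14 MB before JSON framing
```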
Thanks for the detailed response and background! I guess a question is whether it is possible to preload a partial DAG, but not all of the data. Perhaps it is not a good idea for 80GB, but what about a smaller data set? What I was trying to accomplish with the ls() call is to preload the root node of the DAG, and other nodes of the file, without loading all of the data leaf nodes.

I actually have it working now - you can look at the network traffic when loading, for example: https://replayweb.page/?source=ipfs%3A%2F%2FQmYvsdJt7ji8bqBFLLjRAcAPgcqFMfb7WGsbXzr6TFk6yM%2Fissue-02.wacz

I removed the built-in preloading, and am calling ls manually via the http api on one of the preloads, causing the preloads to load just enough of the DAG to be able to then cat the data. I've tested this with a local go-ipfs node as well: first calling ls, then doing a cat on a specific range (say even the last 64k of a file). The result is that less data is pulled than when using the default recursive refs.

This works the other way too: a user may be sharing a large archive (say up to 1GB), and makes an ls() call to the preload it's connected to to pull in the list of files. With the default recursive refs preloading, it would start syncing the entire archive, which is not desirable. Again, the goal is to have the preload node be aware of the file structure, but not yet load all of the leaf nodes, saving both bandwidth and space (but mostly bandwidth, since the preload connection is actually over a websocket).
I wanted to add a concrete example of the custom preloading 'hack' that I'm using, and how it could perhaps be better supported. Let's say I just want to load the last ~64k of a single file stored in a hash using unixfs. (This is also how I tested this system to come up with this workaround.) Starting with a blank go-ipfs repo and using this 'custom preloading approach', we get:
To run an ls command, just 2 blocks were needed - presumably the root DAG node and the unixfs node. But the default behavior, which involves the recursive refs call, results in:
Using this default preload behavior, the preload node is forced to load all 360 blocks in this file, when clearly this is not necessary if the goal is just to get the last 64k - only 4 blocks are needed on the preload node. Range-aware preloading would support this use case. The hard-coded recursive refs seems to be the issue - perhaps it is possible to make it configurable, or to disable it? To make the custom behavior work in the browser, an api call is first made to the preload node via HEAD requests, and then the same call is made in the local ipfs instance:
Perhaps the 'preload' system could do this automatically, working more like a proxy, where a command executed on a local instance is mirrored on a connected preload node first. If someone wanted the recursive resolve, they could then explicitly call (Or, perhaps this is more of what the 'delegate routing' is supposed to do? I am not entirely clear on the distinction between delegate vs preload nodes, since they actually resolve to the same nodes in the default config. Are they generally supposed to be the same nodes?)
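The mirrored-call idea above can be sketched as follows. This is illustrative, not an existing js-ipfs API: `preloadUrl` and `preloadLs` are hypothetical names, the HTTP call is injected so the sketch is testable, and the endpoint shape follows the go-ipfs HTTP API convention of `/api/v0/<command>?arg=<cid>`.

```javascript
// Sketch of the custom preload: ask a preload node to run `ls` on a CID
// (pulling just the directory/DAG structure, not the leaf data) before
// running the same command locally. `httpGet` is injected for testing.
function preloadUrl(apiBase, command, cid) {
  return `${apiBase}/api/v0/${command}?arg=${encodeURIComponent(cid)}`
}

async function preloadLs(apiBase, cid, httpGet) {
  // Mirror the command on the preload node first...
  await httpGet(preloadUrl(apiBase, 'ls', cid))
  // ...then the local ipfs.ls(cid) call would follow here.
}
```

A real implementation would need error handling, and would presumably use HEAD requests as described above rather than a plain GET.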
I think the low-hanging fruit here is to implement
It is demonstrated to save bandwidth, and that should translate not only to better performance in the browser, but also to decreased load on the default preload servers. @achingbrain any reason not to do it?
(Not sure if this should be here or in js-ipfs or elsewhere, feel free to move).
As I understand it, the current transport for sharing IPFS data with browsers relies on a 'preload' node, which 'preloads' all the blocks of a multihash in response to a /api/v0/refs API call. The refs call will start pulling all the blocks of the hash onto the preload node over the delegate websocket connection. This makes sense for a default use case where the entire hash should be shared by default. However, it is less than ideal for a use case where a large amount of data is being shared over IPFS, and data should only be loaded on demand.

I've been able to implement the following 'alternative preload' setup, and wonder if this could be improved and/or supported in the API as an option to make in-browser usage more scalable for large amounts of data, or if this is too specific a use case?

The goal is to load data via ipfs.cat() with random access, and avoid preloading anything that is not actually needed. In my setup, the root multihash contains several files, all in the root dir.
1) Disable preload in the config, to avoid making the http /api/v0/refs call by default.

2) When new data is added, make an http /api/v0/ls call to the preload node. This works since I'm sharing several files, all in the root directory.

3) When reading data from an ipfs hash in a different browser, the preloading also makes an http /api/v0/ls call to the preload node in place of /api/v0/refs. Then, locally call ipfs.ls() in the browser and quickly fetch a list of files.

4) Call ipfs.cat() with offset and length to fetch the necessary blocks after calling ipfs.ls(). Unfortunately, this doesn't seem to work currently. Instead, my current workaround is simply to call the http /api/v0/cat on the preload node, and this works perfectly as expected! The preload node searches for the blocks necessary for cat and only preloads those, and only those are loaded by the browser on the other end as well, allowing for quick loads from a large ipfs hash!

To summarize, a few questions from this:
Does it make sense to support an alternate preload behavior, similar to the above? I'm not sure if ls is the right command, but a way to specify which blocks should be preloaded? A non-recursive /api/v0/refs call, or something else?

Is this a generic enough use case that it could make sense for other applications?
For some reason the ipfs.cat() in the end does not work, but calling /api/v0/cat on the preload node does. I would think that the local cat() would do discovery over the preload websocket connection, but it appears that it does not. Maybe I'm doing something wrong, or discovery can't work this way?