-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chaining data loaders #1522
base: main
Are you sure you want to change the base?
chaining data loaders #1522
Conversation
This seems like the right direction. 👍 I’m thinking of going a step further and spinning up a temporary “asset server” each time we invoke a data loader, so that each data loader gets a separate port for making requests to load assets/dependent files. I think that’s cleaner than using a prefixed path on the preview server, means build and preview follow the same code path, and anyway it should be cheap to expose a little asset server port temporarily while running a data loader (since there’s no initialization other than opening the port). And the asset server can further restrict what each data loader has access too (only files, not pages). And we can give it clean paths. I think we could also get away with not tracking the dependencies explicitly and just let circular data loaders deadlock for now… |
Could we make the "protocol" of how to load data loaders flexible enough to provide other capabilities? For example, it would be nice to be able to explicitly record dependencies that Framework can't detect, and one way we could do that is by making an HTTP call to the injected environment variable to record a file that we don't need to request but should be watched for changes. Two ways I could imagine doing this is either to make that environment variable less generic, like (to be clear, I'm not asking you to implement any of that in this PR, just to leave enough room in the naming to allow future extension) |
I have rewritten this based on the suggestions in the review. There is a new TODO list. |
(rebased after #1662)
Some of our examples will benefit from chained data loaders (in lieu of an intermediate archive). This will allow to split the analysis into independent small scripts. (thinking about this in the context of #1667). |
Implemented as a FILE_SERVER variable passed to data loaders (as an environment variable), allowing them to query an asset server.
For instance, a bash data loader
mags.txt.sh
can query another file (say,quakes.json
) by calling,and a JavaScript data loader will call:
In preview, when
quakes.json
is updated,mags.txt
gets updated. Ifquakes.json
is in fact generated by a data loaderquakes.json.sh
, then touching that script live-updatesmags.txt
. (The interpreter used by any of the data loaders is inconsequent: python can talk to typescript, and vice-versa.)We track the dependency "graph" on-disk, by associating to any file (say
filename.json
) generated by a data loader a filesrc/.observablehq/cache/filename.json__dependencies
that is a simple list of the paths that it requested against itsFILE_SERVER
.TODO:
error on circular dependencies, maybe?closes #332
To test this easily within the preview server:
cp test/input/build/chain/chain*.* docs/
then open http://127.0.0.1:3000/chain
You can replace
3
in chain-source.json.sh by$RANDOM
if you want to check that the source data loader runs only once for its two dependents.Old description
We must use the same file server for browser preview and machine calls (i.e. chained data loaders), because concurrent requests on a same data loader must be joined.
But the paths are different. The data loader for
caller.csv
receives a$SERVER
environment variable equal tohttp://127.0.0.1:3000/_chain/caller.csv::
, and might retrieve a dependency by concatenating that variable and the file path it needs, e.g. calling$SERVER/dependency.zip
.This should make it possible to derive the dependency graph, at least after we run the data loaders (not sure how to maintain state when we restart a server and we have a cache). Also, if the file is not found, we send an empty 404 instead of the decorated page intended for the browser; this makes it a bit more foolproof (tip: use
curl -f
to fail on 404).The current server information is saved as a global (in
process.env
—should it beglobalThis
?) for now. It feels a bit wrong, but at the same time it really is a global state.