Ad-Hoc Artifacts (Data Containers) #1234
Comments
I like this design a lot. Of course, we can’t call this package Artifactory, but maybe |
Yeah, I'm thinking this will just live in
Yes, this is an interesting point. We're probably going to have to have
I agree; I explicitly kept it out of this because I didn't want to get bogged down in details of designing REST services and whatnot, but as long as we can keep things working well with content-addressing as the common tongue, I think publishing should be quite straightforward.
Agree, that’s why content addressing is magical. It makes publishing and caching completely obvious.
1277: Add Artifacts to Pkg r=StefanKarpinski a=staticfloat
This adds the artifacts subsystem to Pkg, [read this WIP blog post](https://github.com/JuliaLang/www.julialang.org/pull/417/files?short_path=514f74c#diff-514f74c34d50677638b76f65d910ad17) for more details. Closes #841 and #1234.
This PR still needs:
- [x] A `pkg> gc` hook that looks at the list of projects that we know about, examines which artifacts are bound, and marks all that are unbound. Unbound artifacts that have been continuously unbound for a certain time period (e.g. one month, or something like that) will be automatically reaped.
- [x] Greater test coverage (even without seeing the codecov report, I am certain of this), especially as related to the installation of platform-specific binaries.
- [x] `Overrides.toml` support for global overrides of artifact locations
Co-authored-by: Elliot Saba <[email protected]>
Closed by #1277
While thinking through #841 and #796, I thought that we might have the potential to create a powerful data lifecycle primitive. At first I was calling them "data containers", but really they're "ad-hoc artifacts", so I'm just going to use the word "Artifacts" and generalize a bit compared to the Artifacts described in #841 (comment). Half inspired by docker volumes, the basic idea is that you may want to process some data, then "save it" as something that can be used by other packages on your system. The scope of this is purposefully very low-level, so that other packages can build on top of it in order to create more high-level behaviors.
The kinds of Artifacts I'm talking about here are similar to those in #841 (comment), but would be generated on-machine, on-the-fly, and initially empty. Design goals: these should be immutable, simple, content-addressable, have a well-defined lifecycle, and be composable with #841 (comment) Artifacts.
API
`Pkg` would offer a simple interface to these kinds of Artifacts:
- `create(f::Function)`: Create a new Artifact. Meant to be used in a `do`-block form, this calls a user callback where code can be run to fill the newly created directory with data, after which the artifact is "frozen", a tree hash is calculated and returned. That hash is now the primary way to access the artifact from here on out.
- `installed(hash::SHA1)`: Returns `true` if the given Artifact hash exists and is installed.
- `installed(name::String)`: same as above, but returns `false` if the given mapping doesn't exist, as well as if the mapping points to a hash that does not exist.
- `get(hash::SHA1)`: Get the path to an installed Artifact by its hash (not necessarily bound in `Artifacts.toml`). If not installed, but bound in `Artifacts.toml`, installs the artifact.
- `get(name::String)`: Get the path to an installed Artifact by its name (must be bound in `Artifacts.toml`). Subs out to `get(hash)` once the hash is known, so has identical behavior.
- `bind(hash::SHA1, name::String; force::Bool = false)`: Create a binding from a `name` to a `hash`, and write it out into the current project's `Artifacts.toml`. This has the double effect of allowing `get(name)` as well as defining an explicit lifecycle for this artifact. This errors if the given `hash` is not an installed artifact. If the given `name` is already bound, overwrites if `force` is set to `true`. Calling `bind()` twice does not error.
- `unbind(name::String)`: Unbind an artifact from the current project.
Usage Sketch The Zeroeth: post-processing artifacts
First example: post-processing of downloaded artifacts. Assume `Artifacts.toml` contains an artifact named `data_csv_gz`, and we want to get an artifact with some subset of that data.
Usage Sketch The First: pregenerating expensive data
Let's imagine you have a project that can generate data, but it takes a while. You want a way to cache the fact that you have bothered to create this pile of data, you don't want to clutter your Pkg directory with mutable data (death to all mutable state!), and you might even want to share it with other packages. Easy:
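As a minimal sketch of this caching pattern, the block below stands in for the proposed `create`, `installed`, `get` and `bind` with hypothetical `toy_*` helpers. Everything about the implementation is an assumption made for illustration: an in-memory `Dict` plays the role of `Artifacts.toml`, hashes are plain hex strings rather than `SHA1` values, and the hash is a simplified SHA-1 over sorted file paths and contents, not Git's tree hash. It is not how Pkg implements any of this.

```julia
using SHA

# Hypothetical stand-ins: a temporary artifact store and an in-memory Artifacts.toml.
const STORE = mktempdir()
const BINDINGS = Dict{String,String}()

# Simplified content hash over sorted relative paths and file contents.
# (Assumption for illustration; this is NOT Git's tree hash.)
function toy_tree_hash(dir::AbstractString)
    paths = String[]
    for (root, _, files) in walkdir(dir)
        for f in files
            push!(paths, relpath(joinpath(root, f), dir))
        end
    end
    io = IOBuffer()
    for p in sort(paths)
        write(io, p)
        write(io, read(joinpath(dir, p)))
    end
    return bytes2hex(sha1(take!(io)))
end

# Stand-in for the proposed create(f): fill a fresh directory, then freeze it under its hash.
function toy_create(f::Function)
    staging = mktempdir(STORE)
    f(staging)
    h = toy_tree_hash(staging)
    dest = joinpath(STORE, h)
    if isdir(dest)
        rm(staging; recursive=true)   # identical content is already stored
    else
        mv(staging, dest)
    end
    return h
end

# Stand-ins for the proposed installed(hash) and get(hash).
toy_installed(h::AbstractString) = isdir(joinpath(STORE, h))

function toy_path(h::AbstractString)
    toy_installed(h) || error("artifact $h is not installed")
    return joinpath(STORE, h)
end

# Stand-in for the proposed bind(hash, name; force = false).
function toy_bind(h::AbstractString, name::AbstractString; force::Bool=false)
    toy_installed(h) || error("artifact $h is not installed")
    if haskey(BINDINGS, name) && BINDINGS[name] != h && !force
        error("$name is already bound to a different hash")
    end
    BINDINGS[name] = h
    return nothing
end

# Pretend this is the slow computation worth caching.
expensive_squares(n) = [i^2 for i in 1:n]

# Return the artifact directory, generating and binding it only on first use.
function squares_artifact()
    if haskey(BINDINGS, "squares") && toy_installed(BINDINGS["squares"])
        return toy_path(BINDINGS["squares"])
    end
    h = toy_create() do dir
        open(joinpath(dir, "squares.txt"), "w") do io
            foreach(x -> println(io, x), expensive_squares(5))
        end
    end
    toy_bind(h, "squares")
    return toy_path(h)
end

path = squares_artifact()
println(readlines(joinpath(path, "squares.txt")))
```

A second call to `squares_artifact()` finds the binding and returns the same path without regenerating anything, which is the point of binding the hash to a name.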
Usage Sketch The Second: A Dataflow Management Package
I'm not going to sketch out the code for this one, but you could imagine a full data flow pipeline with dependencies and on-demand downloading of large fundamental blobs, large processing steps, etc... built out of these fundamentals. You could even build a distributed data processing system by publishing these artifacts to some kind of internal server, then writing that URL into the `Artifacts.toml` files.
Details
Here I'm writing down small details as I think of them, to better flesh this out.
Lifecycle
Artifacts would be generally floating in the void; unless they are listed in a project's `Artifacts.toml`, a `gc` would remove them. Therefore, anything not bound to a name within a project somewhere should be thought of as ephemeral.
Portability
All artifacts generated this way would of course not be available to coworkers on other computers without some kind of publishing mechanism. We are explicitly NOT addressing that here, as it's a little out of scope at the moment. What we can do for now is to make all artifacts written into `Artifacts.toml` with `bind()` `lazy` by default (so that they are ignored when installing the parent project); they can then be recreated according to the steps within the parent project itself. As long as they are bit-identical, the tree hashes should match, and we won't have different committers checking in different `Artifacts.toml` files over and over again. So don't do things like store timestamps in these things, unless you want that TOML churn.
Platform-dependence
I kind of don't want to encourage people to generate platform-dependent packages with this, but since it's integrated with the #841 (comment) concept of Artifacts, it would be possible.
Sharing between packages
Packages can share these kinds of artifacts by simply passing around hashes, which can get written into `Artifacts.toml` files to bind a dependency (and thereby stop `gc` from ruining your day), or not, if you like to live dangerously.
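To make the binding-versus-`gc` relationship concrete, here is a small sketch of the mark step, assuming the set of installed hashes and each project's name-to-hash bindings have already been loaded. The hashes, the binding name `subset`, and the helper names `bound_hashes` and `unbound_hashes` are all hypothetical, and any grace period before unbound artifacts are actually deleted is left out.

```julia
# Hypothetical data: hashes currently installed on this machine.
installed = Set(["aaa111", "bbb222", "ccc333"])

# Hypothetical name => hash bindings, one Dict per project's Artifacts.toml.
# Two projects share "aaa111" simply by binding the same hash.
project_bindings = [
    Dict("data_csv_gz" => "aaa111"),
    Dict("subset" => "bbb222", "data_csv_gz" => "aaa111"),
]

# Mark: collect every hash that at least one project binds to a name.
function bound_hashes(projects)
    bound = Set{String}()
    for bindings in projects
        union!(bound, values(bindings))
    end
    return bound
end

# Candidates for removal: installed, but bound by no project anywhere.
unbound_hashes(installed, projects) = setdiff(installed, bound_hashes(projects))

println(unbound_hashes(installed, project_bindings))
```

In this sketch only the hash that was passed around without ever being bound is left exposed to collection.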