Is it possible to easily use Sync with a content-addressable, append-only store? #2328
Sorry for the noisy issue. Let me further illustrate my use case, as I have hit another snag. My goal is to create two stores: a KV store for articles, and a content-addressable store for content. I have functions that replace the content of an article with its hash. Here is a simplified version of my code:

type hash = Irmin.Hash.SHA512.t
let hash_t = Irmin.Hash.SHA512.t
type 'a node = Text of string | Transclude of 'a
and content = Content of content node list
type 'a article =
{ content: 'a;
iri: iri; (* https://ocaml.org/p/iri/1.0.0 *)
metadata: (string * string) list;
}
let store_content : 'a Content_store.t -> content -> hash = ...
let hash_article : 'a Content_store.t -> content article -> hash article =
fun store {content; _;} ->
let hash = store_content store content in
  {content = hash; ...}

The stores are implemented like so (with the appropriate content modules):

module Content_store =
Irmin.Content_addressable.Make
(Irmin_fs_unix.Append_only)
(Irmin.Hash.SHA512)
(Content)
module Article_store = Irmin_git_unix.FS.KV(Article)

Here we have a KV store for articles and a content-addressable store for contents. The thing that makes all this work is that the content store is indexed by the same hash type used in the articles. I have succeeded in creating a git-content-addressable store like so:

module Git_store = Irmin_git.Content_addressable(Git_unix.Make(Digestif.SHA512))
module Content_store = Git_store.Make(Content)

The issue is that this store is no longer indexed by the hash type from above, but by the abstract type Digestif.SHA512.t. So ultimately, my question is about how to proceed. Should I write a representation for the abstract type Digestif.SHA512.t? Of course, I am also interested in any other comments, feedback or advice. Also, thanks for the work on this amazing library! ❤️ For reference, here is the real code:
@art-w You gave an example of how to implement a repr for an abstract type here:

module Hash : Irmin.Contents.S with type t = Digestif.SHA512.t = struct
type t = Digestif.SHA512.t
let t : Digestif.SHA512.t Repr.ty =
let open Digestif.SHA512 in
let pp h hash =
Format.fprintf h "%s" (to_raw_string hash)
in
let of_string str =
try Scanf.sscanf str "%s" (fun str -> Ok (of_raw_string str))
with _ -> Error (`Msg "of_string")
in
let encode_json encoder hash =
let (`Ok | _) = Jsonm.encode encoder (`Lexeme `Os) in
let (`Ok | _) = Jsonm.encode encoder (`Lexeme (`String (to_raw_string hash))) in
let (`Ok | _) = Jsonm.encode encoder (`Lexeme `Oe) in
()
in
let decode_json decoder =
(* Must be able to decode the output of [encode_json] *)
let (let*) = Result.bind in
let* () = Repr.Json.decode decoder |> function `Lexeme `Os -> Ok () | _ -> Error (`Msg "decode_json") in
let* v = Repr.Json.decode decoder |> function `Lexeme (`String v) -> Ok v | _ -> Error (`Msg "decode_json") in
let* () = Repr.Json.decode decoder |> function `Lexeme `Oe -> Ok () | _ -> Error (`Msg "decode_json") in
Ok (of_raw_string v)
in
  (* This is as far as I got; a sketch of a binary encoding that mirrors
     [decode_bin] below: a varint length prefix followed by the raw
     digest bytes. *)
  let encode_bin (hash : Digestif.SHA512.t) write =
    let raw = to_raw_string hash in
    Repr.Binary.Varint.encode (String.length raw) write ;
    write raw
  in
  let decode_bin buffer at =
    (* Must be able to decode the output of [encode_bin] *)
    let len = Repr.Binary.Varint.decode buffer at in
    let v = String.sub buffer !at len in
    at := !at + len ;
    of_raw_string v
in
let size =
Repr.Size.custom_dynamic
      ~of_value:(fun hash ->
        (* Optional, precompute the size that will be used by [encode_bin] *)
        let len = String.length (to_raw_string hash) in
let varint_len_size = ref 0 in
Repr.Binary.Varint.encode len
(fun s -> varint_len_size := !varint_len_size + String.length s) ;
!varint_len_size + len)
~of_encoding:(fun str offset ->
(* You can skip this function, it's unused nowadays *)
let at = ref offset in
let len = Repr.Binary.Varint.decode str at in
let varint_len_size = !at - offset in
varint_len_size + len)
()
in
let pre_hash =
(* Same as [encode_bin], unless the binary serialization has changed
but we want to preserve the previous cryptographic hash for
backward compatibility. *)
encode_bin
in
let equal = equal in
let compare = unsafe_compare in
let short_hash ?seed foo =
match seed with
| None -> Hashtbl.hash foo
| Some seed -> Hashtbl.seeded_hash seed foo
in
Repr.abstract
~pp ~of_string
~json:(encode_json, decode_json)
~bin:(encode_bin, decode_bin, size)
~pre_hash
~equal ~compare ~short_hash
()
let merge = Irmin.Merge.(option (idempotent t))
end
I've succeeded in writing the repr for the hash type. I now have the following module structure:

module Git_store = Irmin_git.Content_addressable(Git_unix.Make(Digestif.SHA512))
module Content_store = Git_store.Make(Content)

where Content_store has the signature:

sig
type 'a t = 'a Content_store.t
type key = hash
type value = Content.t
val mem : [> Irmin.Perms.read ] t -> key -> bool
val find : [> Irmin.Perms.read ] t -> key -> value option
val add : [> Irmin.Perms.write ] t -> value -> key
val unsafe_add : [> Irmin.Perms.write ] t -> key -> value -> unit
val close : 'a t -> unit
val batch : Irmin.Perms.read t -> ([ `Read | `Write ] t -> 'a) -> 'a
end

I am now wondering how to actually obtain a store from this. Furthermore, I still don't know how to obtain a syncable store from the content-addressable git store.
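For comparison, the KV stores in this thread are obtained via the repo constructor (the same pattern appears in the examples further down); a minimal sketch, assuming a Store built with Irmin_git_unix.FS.KV, Lwt, and a hypothetical "/tmp/test-store" path:

```ocaml
(* Sketch: build a repository from an Irmin_git configuration, then get
   a handle on the [main] branch. [Store] is assumed to be an
   [Irmin_git_unix.FS.KV] instance as defined earlier in the thread. *)
open Lwt.Syntax

let main () =
  let* repo = Store.Repo.v @@ Irmin_git.config "/tmp/test-store" in
  let* main = Store.main repo in
  (* ... use [Store.set_exn], [Store.get], etc. on [main] ... *)
  ignore main;
  Lwt.return_unit

let () = Lwt_main.run (main ())
```

The content-addressable functor has no equivalent constructor exposed, which is what the question above is about.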
Hello! Thanks for the report, we are looking into the lack of a constructor for the content-addressable store, as it seems like an oversight... In the meantime, regarding your previous questions: did you consider Repr.map?
Oh, this is actually how Irmin is implemented! What is your motivation for storing the content separately from the iri link / metadata? It's a shame that irmin-git lacks customizable metadata like the other backends, due to its compatibility with git, but you could probably store both together, or split them into two entries in the KV store (with one folder for contents and one folder for articles)?

(I'm asking because I have a small doubt that your separate CA store will be replicated by the git fetch algorithm, since it doesn't "see" that the contents are referenced from the articles by their hash... but I'll check if it works and update shortly)
As I was afraid, the objects stored in the git repo via the separate content-addressable store are not replicated by the git fetch algorithm. Can you tell me a bit more about why you wanted to use Content_addressable directly instead of just Irmin KV?
Just for the record, you can associate metadata with a commit (like a gpg key, for example) and be compatible with Git. Such metadata is not "scannable" (in the sense that it consists only of values, not pointers; it seems to me that Git LFS uses this metadata), but it may be worthwhile for Irmin to make use of this possibility.
Oh neat, thanks for the tip! So we could lift this type restriction to allow users to store any metadata with their commits! But this wouldn't help @kentookura though, as the git file metadata still appears to be restricted to file kind/permissions :/
Ah, I didn't know about Repr.map... That is indeed much simpler.
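For reference, the Repr.map approach applied to the SHA512 hash would look like this; a minimal sketch mirroring the SHA1 one-liner further down in the thread, assuming the digestif and repr libraries:

```ocaml
(* Build a [Digestif.SHA512.t] repr from the existing [string] repr by
   converting to/from the raw digest bytes at the boundary. *)
type hash = Digestif.SHA512.t

let hash_t : hash Repr.ty =
  Digestif.SHA512.(Repr.(map string) of_raw_string to_raw_string)
```

This replaces the entire hand-written Repr.abstract definition above with a single line.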
We (pinging @jonsterling) want to enable what we are calling "federated forestry". In short, we want to enable our users to transclude the content of their collaborators. Forester is tag-based, meaning that each tree has a permanent unique identifier. On the other hand, the content of the tree might change over time, so the architecture I envision will allow us to do things like notifying the user if remote content has changed.

Now, as for the reason for splitting the stores up like this: I thought it was a reasonable approach just due to the limitations the API imposes on me as a programmer. One content type should be indexed by the unique identifiers, whereas the other content type should only ever be referred to by hash. If I just put articles in the store, then the content is not retrievable by itself, right?
Getting more into the weeds, we have this notion of a content target that controls what is actually being transcluded, so it would be nice for the store to be able to directly store the smallest unit of transclusion. My reason for not simply storing the unmodified articles is that (I think/assume) storing the article will not allow me to retrieve a subcomponent of the article by hash. If I am wrong, does this mean that when I have a representation of a type that contains other representations, all components of this type are retrievable via hash?
Thanks for the details, I love that you are bringing transclusions to federations! I could not find a reference to hashes in the code you linked, so reading between the lines I'm guessing the content hash is intended to track whether the transcluded article has been updated since, to inform the user of the change? (Please correct me if I'm wrong!) A couple of thoughts, in the hope that they inspire some ideas even if I turn out to have completely misunderstood your problem (sorry, the examples got very long):
type hash = Digestif.SHA1.t (* = the default hash algorithm used by [Irmin_git_unix.FS.KV] *)
let hash_t = Digestif.SHA1.(Repr.(map string) of_raw_string to_raw_string)
type transclusion = { tag : string; expected_hash : hash } [@@deriving irmin]
type node = Text of string | Transclude of transclusion [@@deriving irmin]
type content = Content of node list [@@deriving irmin]
type article = { content : content; metadata : (string * string) list }
[@@deriving irmin]
module Store = Irmin_git_unix.FS.KV (struct
type t = article [@@deriving irmin]
let merge = Irmin.Merge.(option (idempotent t))
end)
open Lwt.Syntax
(* these type definitions with [@@deriving irmin] are only there for debug prints *)
type transclusion_error =
[ `Deleted of article
| `Updated of article * article
| `Original_article_lost of article
| `Unknown_article of string ]
[@@deriving irmin]
type transcluded_result = (article, transclusion_error) result
[@@deriving irmin]
let get_transcluded repo branch { tag; expected_hash } =
let* article = Store.Contents.of_key repo expected_hash in
let* latest_hash = Store.hash branch [ tag ] in
match (article, latest_hash) with
| Some article, Some latest_hash when latest_hash = expected_hash ->
(* transcluded article is uptodate *)
Lwt.return (Ok article)
| Some article, None ->
(* we still have access to the originally transcluded article contents,
but can inform the user that it was deleted in more recent versions *)
Lwt.return (Error (`Deleted article))
| Some old_article, Some latest_hash ->
(* inform the user that the article was updated since the transclusion *)
let+ latest_article =
Lwt.map Option.get @@ Store.Contents.of_key repo latest_hash
in
Error (`Updated (old_article, latest_article))
| None, Some latest_hash ->
(* should not happen... indicates that the git history
was rewritten + git gc was run to force-fully erase
the previous article contents from existence *)
let+ latest_article =
Lwt.map Option.get @@ Store.Contents.of_key repo latest_hash
in
Error (`Original_article_lost latest_article)
| None, None ->
(* should not happen either, either history rewrite like above
or it might indicate that we are trying to transclude from
an unknown remote *)
Lwt.return (Error (`Unknown_article tag))
let date () = Int64.of_float (Unix.gettimeofday ())
let store_article t tag article =
let message = Printf.sprintf "Update %s" tag in
let info () = Store.Info.v ~author:"art-w" ~message (date ()) in
Store.set_exn t ~info [ tag ] article
let main () =
let* repo = Store.Repo.v @@ Irmin_git.config "/tmp/test-remote" in
let* main = Store.main repo in
(* create a first article *)
let aw_0001 = { metadata = []; content = Content [ Text "first" ] } in
let* () = store_article main "aw-0001" aw_0001 in
(* compute hash of article, either from the article or its tag: *)
let hash_aw_0001 = Store.Contents.hash aw_0001 in
let* hash_aw_0001' = Store.hash main [ "aw-0001" ] in
assert (hash_aw_0001' = Some hash_aw_0001);
(* create a second article with a transclusion *)
let aw_0002 =
{
metadata = [];
content =
Content [ Transclude { tag = "aw-0001"; expected_hash = hash_aw_0001 } ];
}
in
let* () = store_article main "aw-0002" aw_0002 in
(* when computing the transclusion to render the second article: *)
let* tr =
get_transcluded repo main { tag = "aw-0001"; expected_hash = hash_aw_0001 }
in
Format.printf "%a@." (Irmin.Type.pp transcluded_result_t) tr;
(* => prints "Ok article_contents" since the transclusion is uptodate *)
(* if we update the first article, then its hash will change: *)
let aw_0001_upd = { metadata = []; content = Content [ Text "UPDATED" ] } in
let* () = store_article main "aw-0001" aw_0001_upd in
assert (Store.Contents.hash aw_0001_upd <> hash_aw_0001);
(* so if we retry to render the second article, we get a warning that the transcluded content has been updated *)
let* tr =
get_transcluded repo main { tag = "aw-0001"; expected_hash = hash_aw_0001 }
in
Format.printf "%a@." (Irmin.Type.pp transcluded_result_t) tr;
(* => prints "Error (`Updated (old_article, new_article))" *)
(* to accept the changes, update the second article with the new expected_hash *)
Lwt.return_unit
let () = Lwt_main.run (main ())

A difference between this and your previous CA store approach is that the articles are reachable from the git history, and so their hashes will be synced by fetch/pull/push automatically.
If I understand you correctly then no: if you store whole articles, their sub-components are not individually retrievable by hash.
I don't know if that's actually required, considering that your article contents are expected to be small (measured in ~kb?) before their transcluded articles are "inlined" (which I assumed is only done when rendering to xml). At this small scale it's unlikely to bring any performance benefit to directly access the smallest unit of transclusion (but it has a cost, since it implies more content hashes are indexed).

(* ... *)
type article = {
content : content;
title : string;
metadata : (string * string) list;
}
[@@deriving irmin]
(* to represent the fact that we are going to store the sub-components of an article in irmin: *)
module Internal = struct
type t =
| Content of content
| Title of string
| Metadata of (string * string) list
[@@deriving irmin]
let merge = Irmin.Merge.(option (idempotent t))
end
module Store = Irmin_git_unix.FS.KV (Internal)
let date () = Int64.of_float (Unix.gettimeofday ())
(* utility function to split an [article] into its sub-components before storing it *)
let store_article db ~tag { content; title; metadata } =
let message = Printf.sprintf "Update %s" tag in
let info () = Store.Info.v ~author:"art-w" ~message (date ()) in
Store.set_tree_exn ~info db [ tag ]
@@ Store.Tree.of_concrete
@@ `Tree
[
("title", `Contents (Internal.Title title, `Normal));
("metadata", `Contents (Internal.Metadata metadata, `Normal));
("content", `Contents (Internal.Content content, `Normal));
]
open Lwt.Syntax
open Lwt.Infix
(* utility functions to query only a sub-component from an article *)
let get_title db tag =
Store.get db [ tag; "title" ] >|= function
| Internal.Title title -> title
| _ -> invalid_arg "expected title"
let get_metadata db tag =
Store.get db [ tag; "metadata" ] >|= function
| Internal.Metadata metadata -> metadata
| _ -> invalid_arg "expected metadata"
let get_content db tag =
Store.get db [ tag; "content" ] >|= function
| Internal.Content content -> content
| _ -> invalid_arg "expected content"
(* or fetch everything at once: *)
let get_article db tag =
let* root = Store.get_tree db [ tag ] in
let+ title = Store.Tree.get root [ "title" ]
and+ metadata = Store.Tree.get root [ "metadata" ]
and+ content = Store.Tree.get root [ "content" ] in
match (title, metadata, content) with
| Internal.Title title, Internal.Metadata metadata, Internal.Content content
->
{ title; metadata; content }
| _ -> invalid_arg "type error"
(* test *)
let main () =
let* repo = Store.Repo.v @@ Irmin_git.config "/tmp/test-remote" in
let* main = Store.main repo in
let aw_0001 =
{
title = "first";
metadata = [ ("date", "2024-08-03") ];
content = Content [ Text "test" ];
}
in
let* () = store_article main ~tag:"aw-0001" aw_0001 in
let* t = get_title main "aw-0001" in
Format.printf "Title: %S@." t;
let* art = get_article main "aw-0001" in
Format.printf "Article: %a@." (Irmin.Type.pp article_t) art;
Lwt.return_unit
let () = Lwt_main.run (main ())

As you can see, this requires a bit of setup with utility functions to store and get stuff from Irmin. (In the future, we would like to improve the type-safety of the Irmin API to remove the need for this boilerplate.)

When combined with the previous code example, this splitting will allow you to differentiate between the "article hash" (= content+metadata) and the "content hash" (which is what you were trying to get the hash of in your initial post?). I do not understand why the content hash is more important than the rest, as it seems to me that changes to the title/author/etc are also relevant if they happen (?), and those are only captured by the "article hash". (When splitting, your article becomes a "folder", but you can still get its merkle hash, which captures the hash of all of its children title/metadata/content.)

Sorry this got so long... Do not hesitate to correct my misunderstandings or share additional details/constraints about your system :)
Ah, we have several branches that we are working on: https://git.sr.ht/~jonsterling/ocaml-forester/refs

At this point I'd really like for @jonsterling to pitch in :)
This is extremely helpful, nothing to be sorry for! I will come back to you once I've explored the options a bit.
@art-w I am wondering if it would be helpful to you if I report any issues I encounter when using the eio branch of irmin, or if it is too early to experiment with that branch. |
Yes, yes! Any feedback or issue report on the Eio PRs would be extremely welcome! We are planning a minor release with bugfixes for the current lwt version, but then the eio branch will become the main one. If you are already experimenting with it, I think you should use the version from #2316 (the follow-up on the initial eio port, which fixes some known issues). (Thanks for your links btw, I see a lot more content hashes there but haven't had time to dig into it yet :) )
@art-w Using the eio branch, initialization seems very slow on my machine.
@art-w @jonsterling Do you think it is possible/practical to use Sync to replicate remote stores as directories in the local store?

We are introducing the notion of a "host" to forester, to prepare for some features we hope to implement with irmin: suppose the user configures some remote hosts; we would then like to fetch the content of each host and make it available in the local store.

PS: If you are interested, the latest changes in our work regarding irmin are here: https://git.sr.ht/~jonsterling/ocaml-forester/log/irmin-datastore
Hm, that is concerning. Sorry for the late reply, but I could not reproduce it, nor see anything in the code that could explain this, so I didn't know how to respond... Can you confirm that the following is slow on your machine? (it takes at most 1ms on mine)

let () =
  Eio_main.run @@ fun env ->
  let t0 = Eio.Time.now env#clock in
  Mirage_crypto_rng_eio.run (module Mirage_crypto_rng.Fortuna) env @@ fun () ->
  let t1 = Eio.Time.now env#clock in
  Format.printf "%fs@." (t1 -. t0)

And that the following is not? (with the Mirage_crypto_rng_unix backend)

let () = Mirage_crypto_rng_unix.initialize (module Mirage_crypto_rng.Fortuna)

In which case, it should be fine for you to use the unix initializer instead. It should be possible to support the directory replication that you describe for the remotes in the local store, by using:

Store.set_tree_exn ~info local_store ["remotename"]
@@ Commit.tree @@ Result.get_ok @@ Sync.fetch remote

But I think it'll be easier and more robust to use branches instead (like we would do when using git with remotes), as that would also preserve the commit history of the remote:

Store.Branch.set local_store "remotename"
@@ Result.get_ok @@ Sync.fetch remote

(Sorry I haven't had time to check your code, I'll try to update as soon as I can have a look)
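Following the branch-based suggestion, a per-host replication loop could look like this; a hypothetical sketch assuming the Store and Sync modules from the snippets above, and an assumed association list of host names to remotes:

```ocaml
(* Sketch: fetch every configured remote host into a local branch named
   after it, mirroring the [Store.Branch.set] one-liner above.
   [hosts] is an assumed (branch name, remote) association list. *)
let sync_hosts local_store hosts =
  List.iter
    (fun (name, remote) ->
       Store.Branch.set local_store name
       @@ Result.get_ok
       @@ Sync.fetch remote)
    hosts
```

(As in the snippet it is based on, error handling via Result.get_ok is kept minimal for illustration; a real implementation would report fetch failures per host.)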
Hi all,
I have a content-addressable store like so:
I would like to architect my application such that:
It is surely possible to implement this by writing a bunch of custom code, but I was wondering if I was overlooking something in the API to easily enable this. My best lead is to use this module, but it's a bit daunting.
Thanks!