-
Notifications
You must be signed in to change notification settings - Fork 194
Stable Deterministic Node identifiers by encoding edit history in Node name identity
The nodes are named by the edits required to the original paths. We realise as the biology is unstable, stable identifiers are not truly possible but we should have as stable as possible identifiers as possible and each identifier should have its provenance associated with it. The following proposal is a solution to this problem:
We are have a reference assembly ACTGA named r1. As a variant graph we end up with one node. We can name this node after the reference, where it starts and how long it is. Giving an implicit name of r1:0:5 (original sequence:starting:ending)
Then add a second sequence AATGA, lets call that patient1 or p1.
So we have two nodes, r1:0:5 and p1:0:5, this would be turned into four nodes in a VG graph. However we name the nodes as much as possible by where they originated in the graph, giving preference to modifying nodes that where already in the graph.
Inside VG we keep using integer identifiers, only when serialising do we write out the long names. The edit history must be saved, but vg can write these to disk e.g. like a write ahead log as used in databases. Reading the log from future to past allows vg to build these identifiers on demand. The integers to long names can be a simple int to string map that is written to disk.
In the RDF world we would have names like
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/38/chromosome/11/0:6000
and
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/38/chromosome/11/6000:248950422
as well as
http://www.example_hospital/sample/XXXXXX/11/6002:248956422
The first would be for a node that goes from the zeroth base to base 6000, on the grc38 chromosome 11 sequence. The second would follow the first and go from the 6000th base to base 248950422 again on chromosome 11 of cr38. The third would represent a node that is unique on a patient.
So we would prefix the authority of the sequencer in the node names. We can then say
http://www.example_hospital/sample/XXXXXX/11/6002:248956422
owl:sameAs
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/FUTURE/chromosome/11/6007:248956422
Allowing us to link nodes in vg versions.