Internal Representation (row/column based) #8
Replies: 3 comments 12 replies
-
My solution at the moment is to allow the IR to vary with the serialization target. For example, we would use Arrow internally when dealing with Feather, Parquet, etc., and the native IRs for Avro, ORC, etc. I think this is a fine solution because we don't actually have a long-term goal of providing translation between multiple different next-gen formats. We only need translation between "traditional" formats (e.g. BED, VCF) and "the best" next-gen format (e.g. Parquet or Feather). By untethering the serde format from the IR, we guarantee maximal performance. This solution admittedly detracts from the "Rosetta Stone" metaphor we had going, but I think it's the best option we have. Of course we'll need to see what everyone thinks of this idea (@brainstorm).
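To make the "IR varies with the serialization target" idea concrete, here is a minimal sketch. All names here are illustrative, and the payload types are stand-ins for the real Arrow/Avro in-memory structures:

```rust
// Minimal sketch: one IR per serialization family, so each next-gen
// target gets the in-memory representation closest to it.
// Payload types are placeholders (e.g. for arrow::record_batch::RecordBatch).

#[derive(Debug)]
struct ArrowBatch(Vec<String>); // stand-in for an Arrow RecordBatch
#[derive(Debug)]
struct AvroRecords(Vec<String>); // stand-in for Avro's native rows

// Instead of a single universal IR, the IR is an enum over families.
enum Ir {
    Arrow(ArrowBatch), // backs Parquet and Feather
    Avro(AvroRecords), // backs Avro (and similarly ORC, etc.)
}

impl Ir {
    // Which on-disk targets this IR serializes to without translation.
    fn native_targets(&self) -> &'static [&'static str] {
        match self {
            Ir::Arrow(_) => &["parquet", "feather"],
            Ir::Avro(_) => &["avro"],
        }
    }
}

fn main() {
    let ir = Ir::Arrow(ArrowBatch(vec!["read1".into()]));
    assert!(ir.native_targets().contains(&"parquet"));
}
```

The point of the enum is that no Arrow-to-Avro (or any next-gen-to-next-gen) path exists at all; each arm only knows its own targets.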
-
I realise that I hadn't thought much about how we will do the format conversions under this proposed scheme, only the benchmarking. I've since realised that this can be implemented along the following lines:

```rust
use std::path::Path;

// This lets us re-use the same code for Parquet and Feather
enum ArrowSerializationFormat {
    Parquet,
    Feather,
}

// Technically this is our own IR, but it's just a very light wrapper around RecordBatch
struct ArrowIr {
    batch: arrow::record_batch::RecordBatch,
    format: ArrowSerializationFormat,
}

// These traits represent the ability to do things relating to a specific domain in bioinformatics
trait AlignmentOps: BioFormat {
    // The read/record type yielded by the query methods
    type Read;

    // We need any alignment format to be able to interop with BAM
    fn load_bam(path: &Path) -> Self;
    fn save_bam(&self, path: &Path);
    fn save_native(&self, path: &Path);
    fn load_native(path: &Path) -> Self;
    // This is the stuff we will benchmark
    fn reads_at_position(&self, reference: &str, start: u64, end: u64)
        -> Box<dyn Iterator<Item = Self::Read>>;
    fn iter_reads(&self) -> Box<dyn Iterator<Item = Self::Read>>;
}

// Here's another example of a different format Ops trait
trait SequenceOps {
    fn load_fasta(path: &Path) -> Self;
    fn save_fasta(&self, path: &Path);
    fn save_native(&self, path: &Path);
    fn load_native(path: &Path) -> Self;
}

// All the implementation for a given format will live in here
impl AlignmentOps for ArrowIr {
    type Read = (); // placeholder; some row view over the RecordBatch

    // Save to Parquet or Feather, depending on self.format
    fn save_native(&self, path: &Path) {
        todo!()
    }
    // Load from Parquet or Feather
    fn load_native(path: &Path) -> Self {
        todo!()
    }
    // Using noodles for these
    fn load_bam(path: &Path) -> Self {
        todo!()
    }
    fn save_bam(&self, path: &Path) {
        todo!()
    }
    // reads_at_position and iter_reads elided here
}
```

Here's a draft of the conversion process:

```mermaid
graph TD
    A[Arrow IR]:::IR;
    B[Parquet File]:::NextGen;
    C[Feather File]:::NextGen;
    D[BAM File]:::Primitive;
    E[Avro IR]:::IR;
    F[Avro File]:::NextGen;
    A -- save_native --> B;
    B -- load_native --> A;
    A -- save_bam --> D;
    D -- load_bam --> A;
    A -- save_native --> C;
    C -- load_native --> A;
    E -- save_native --> F;
    F -- load_native --> E;
    E -- save_bam --> D;
    D -- load_bam --> E;
    classDef IR fill:orange;
    classDef NextGen fill:white;
    classDef Primitive fill:skyblue;
```
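For illustration, a BAM-to-Parquet conversion under this scheme would follow the `load_bam` and `save_native` edges of the diagram. Here is a hypothetical end-to-end sketch using stand-in types, since the real `ArrowIr` would depend on the arrow and noodles crates:

```rust
// Hypothetical flow from the diagram: BAM file -> Arrow IR -> Parquet file.
// Everything here is a stand-in; the real version would wrap
// arrow::record_batch::RecordBatch and use noodles for BAM I/O.

use std::path::Path;

enum ArrowSerializationFormat {
    Parquet,
    Feather,
}

struct ArrowIr {
    records: Vec<String>, // stand-in for a RecordBatch
    format: ArrowSerializationFormat,
}

impl ArrowIr {
    // In the real sketch this would parse a BAM file via noodles.
    fn load_bam(_path: &Path) -> Self {
        ArrowIr {
            records: vec!["read1".into(), "read2".into()],
            format: ArrowSerializationFormat::Parquet,
        }
    }

    // Dispatches on self.format, so the same code serves Parquet and Feather.
    fn save_native(&self, path: &Path) -> String {
        let ext = match self.format {
            ArrowSerializationFormat::Parquet => "parquet",
            ArrowSerializationFormat::Feather => "feather",
        };
        format!("wrote {} records to {}.{}", self.records.len(), path.display(), ext)
    }
}

fn main() {
    // BAM -> Arrow IR -> Parquet, matching the diagram's edges.
    let ir = ArrowIr::load_bam(Path::new("sample.bam"));
    let msg = ir.save_native(Path::new("sample"));
    assert_eq!(msg, "wrote 2 records to sample.parquet");
}
```

Note that the caller never names Parquet or Feather explicitly; the `format` field carried by the IR decides, which is what lets one `save_native` cover both targets.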
-
For the past couple of days I've been iterating on #10, and I think I have some stuff to add to the conversation. Here is a playground with my best effort to serialize the proposed [...]. When that wasn't working, I went on to draft a more specific BED-to-BED conversion, to see whether concrete definitions would help analyze the architecture proposals, and it really feels like we will need a trait definition of which writers and readers we are prepared to deal with. It may just be that I heavily missed some points in the discussion; if there is something that I clearly missed, you are welcome to point it out.

I point out in the PR draft #11 that we still have the problem of dealing with heavy constraints in the particular crates around how the records to be written are inserted (in the noodles-bed case, it has to be a specific struct defined by them, and not any generic data that we could pass directly, which should be the case for the other modern formats as well). When I initially looked at the [...]

Looking forward to your feedback, @brainstorm, @multimeric.
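One way around the writer-specific record constraint is an explicit adapter trait: each writer target gets one conversion from our IR row into the struct that crate demands. The following is only a sketch of that pattern; all names (`IrRow`, `BedRecord`, `IntoWriterRecord`) are hypothetical stand-ins, not the actual noodles-bed API:

```rust
// Sketch of an adapter layer for writer-specific record types.
// Concrete writer crates (like noodles-bed) only accept their own record
// struct, so the IR exposes one conversion per target instead of trying
// to pass generic data straight through.

// A minimal stand-in for one row of our IR.
struct IrRow {
    chrom: String,
    start: u64,
    end: u64,
}

// Stand-in for the record struct a writer crate forces on us.
#[derive(Debug, PartialEq)]
struct BedRecord {
    reference_sequence_name: String,
    start_position: u64,
    end_position: u64,
}

// The adapter trait: each writer target gets one impl of this.
trait IntoWriterRecord<R> {
    fn into_writer_record(&self) -> R;
}

impl IntoWriterRecord<BedRecord> for IrRow {
    fn into_writer_record(&self) -> BedRecord {
        BedRecord {
            reference_sequence_name: self.chrom.clone(),
            start_position: self.start,
            end_position: self.end,
        }
    }
}

fn main() {
    let row = IrRow { chrom: "chr1".into(), start: 100, end: 200 };
    let rec: BedRecord = row.into_writer_record();
    assert_eq!(rec.reference_sequence_name, "chr1");
}
```

This keeps the per-crate quirks contained in one impl per target, which is roughly the "trait definition of which writers and readers we are prepared to deal with" mentioned above.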