Update to Marko's API #2
This reverts commit afb9130.
…BedRecordWrapper" This reverts commit bffa648987ab4c27f90a8d5efc1b3e864c7ba6b0.
This reverts commit 830441ca22426d52a3a89252c414b6f1ed00dd01.
Force-pushed from 2d3d413 to 658ebed.
Looking good so far!
Let me know if my comments make sense.
pub struct Serializer {
    // This string starts empty and JSON is appended as values are serialized.
pub struct Record3Serializer {
    output: String,
To avoid serializing records manually, you could keep track of a bed::record::builder::Builder state, and progressively build it up inside the serialize_* functions. For example, if a chromStart field is expected next, and a serialize_u64 (or another integer serialize function) is called, then the set_start_position builder function could be called. At the end (or intermittently throughout) the records could be converted to string output. This way all the string parsing is done by the noodles library.
This is similar to keeping track of the state for the Deserializer as shown below.
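A minimal sketch of that idea, using only the standard library. RecordBuilder, Expecting and SketchSerializer are hypothetical stand-ins invented for illustration; they are not the noodles builder or the serde Serializer trait, and the final formatting step would be left to noodles in the real design.

```rust
/// Hypothetical stand-in for a record builder. The real noodles builder
/// has its own API, which this sketch does not try to reproduce.
#[derive(Debug, Default)]
struct RecordBuilder {
    chrom: Option<String>,
    start: Option<u64>,
    end: Option<u64>,
}

impl RecordBuilder {
    /// Renders a tab-separated BED3 line once every field has been set.
    /// In the design discussed here, this step would belong to noodles.
    fn build(self) -> Result<String, String> {
        let chrom = self.chrom.ok_or("missing chrom")?;
        let start = self.start.ok_or("missing chromStart")?;
        let end = self.end.ok_or("missing chromEnd")?;
        Ok(format!("{}\t{}\t{}", chrom, start, end))
    }
}

/// Which field the serializer expects to receive next.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Expecting {
    Chrom,
    ChromStart,
    ChromEnd,
    Done,
}

/// Hypothetical serializer that fills the builder instead of writing text.
struct SketchSerializer {
    expecting: Expecting,
    builder: RecordBuilder,
}

impl SketchSerializer {
    fn new() -> Self {
        SketchSerializer {
            expecting: Expecting::Chrom,
            builder: RecordBuilder::default(),
        }
    }

    /// A string is only valid while the chrom field is expected.
    fn serialize_str(&mut self, v: &str) -> Result<(), String> {
        match self.expecting {
            Expecting::Chrom => {
                self.builder.chrom = Some(v.to_owned());
                self.expecting = Expecting::ChromStart;
                Ok(())
            }
            other => Err(format!("unexpected string while in state {:?}", other)),
        }
    }

    /// An integer fills chromStart first, then chromEnd.
    fn serialize_u64(&mut self, v: u64) -> Result<(), String> {
        match self.expecting {
            Expecting::ChromStart => {
                self.builder.start = Some(v);
                self.expecting = Expecting::ChromEnd;
                Ok(())
            }
            Expecting::ChromEnd => {
                self.builder.end = Some(v);
                self.expecting = Expecting::Done;
                Ok(())
            }
            other => Err(format!("unexpected integer while in state {:?}", other)),
        }
    }

    /// Converts the accumulated state into string output.
    fn finish(self) -> Result<String, String> {
        self.builder.build()
    }
}
```

Calling serialize_str("sq0"), then serialize_u64(8) and serialize_u64(13), and finally finish() yields one tab-separated line; an integer arriving before the chrom field is rejected.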
Actually, we are already using noodles functionality to parse the serialization for us, via the DisplayFromStr functionality of serde_with, used with the AuxiliarBedRecordWrapper:
#[serde_as]
#[derive(Deserialize, Serialize)]
pub struct AuxiliarBedRecordWrapper<T>
where
    T: BedN<3> + std::str::FromStr + fmt::Display,
    <T as std::str::FromStr>::Err: std::fmt::Display,
{
    #[serde_as(as = "DisplayFromStr")]
    pub record: T,
}
I think I tipped you in the wrong direction, since impl<'a> ser::Serializer for &'a mut Record3Serializer {} is really verbose and has a lot of implementation that looks like it's made to parse bed structs by itself. Is that correct?
This implementation could be heavily DRYed up now that we are using serde_with, but we still need some kind of plain serializer to grab things between each \n and send them directly to the Display method of record: T.
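To illustrate what delegating to Display and FromStr amounts to, here is a standard-library-only sketch. Bed3 and Wrapper are hypothetical types made up for this example; they do not use serde_with and are not the noodles record types, they only mimic the round trip that DisplayFromStr relies on.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical BED3-like record, standing in for a noodles record type.
#[derive(Debug, PartialEq)]
struct Bed3 {
    chrom: String,
    start: u64,
    end: u64,
}

impl fmt::Display for Bed3 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}\t{}\t{}", self.chrom, self.start, self.end)
    }
}

impl FromStr for Bed3 {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut fields = s.split('\t');
        let chrom = fields
            .next()
            .filter(|c| !c.is_empty())
            .ok_or("missing chrom")?
            .to_owned();
        let start = fields
            .next()
            .ok_or("missing start")?
            .parse::<u64>()
            .map_err(|e| e.to_string())?;
        let end = fields
            .next()
            .ok_or("missing end")?
            .parse::<u64>()
            .map_err(|e| e.to_string())?;
        Ok(Bed3 { chrom, start, end })
    }
}

/// Wrapper whose text form is exactly the text form of the inner record.
struct Wrapper<T>(T);

impl<T: fmt::Display> Wrapper<T> {
    /// Serializing is nothing more than calling Display on the record.
    fn to_text(&self) -> String {
        self.0.to_string()
    }
}

impl<T: FromStr> Wrapper<T> {
    /// Deserializing is nothing more than calling FromStr on the text.
    fn from_text(s: &str) -> Result<Self, T::Err> {
        s.parse::<T>().map(Wrapper)
    }
}
```

The wrapper holds no format knowledge of its own, which is why the surrounding serializer can stay small.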
Ahh yeah, I think I see what you mean. Do you think there is any way to avoid even having to parse the \n characters? Ideally there would be no need to touch a raw string, and just allow noodles to do all of this.
However, it looks like noodles might write records one by one anyway, so this might not be possible.
Does this mean that serde_with or serde_as is generating the Serializer implementations, or does that need to be implemented as well?
It's kind of weird: we need some Serializer- or Deserializer-specific implementation to exist to have a full pipeline of serialization and deserialization, but as soon as we arrive at our Serializer, all it needs to do is:
- Call Display from the specific Record implementation (which is being done by the serde_with derive)
- Insert \n on Vec<T> serializations.
So it's the most plain implementation of a Serializer trait ever. We probably can and should call unreachable!() in most of the functions (I haven't done this yet, in case serde_with fails).
Answering your question: I haven't seen a way to write a \n-separated string inside of noodles, but now that you mention it, implementing Display and FromStr for Vec<Record<N>> shouldn't be that hard. The weird part is that we would still need an AuxiliarWrapper for this Vec, and I can't see a way to use the same AuxiliarWrapper both for a single Record and for a collection of Records. In the past I've seen this solved by coercing single Records to a collection of one element, but maybe we don't want that.
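The "collection of one element" idea can be sketched with the standard library alone. Lines and single are hypothetical names used only for this example. A wrapper type is used because, outside the crate that defines Display, Rust's orphan rule rejects implementing Display directly for a Vec.

```rust
use std::fmt::{self, Display};

/// Hypothetical wrapper that displays every item on its own line.
struct Lines<'a, T>(&'a [T]);

impl<'a, T: Display> Display for Lines<'a, T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for item in self.0 {
            // Each record is written by its own Display, followed by \n.
            writeln!(f, "{}", item)?;
        }
        Ok(())
    }
}

/// Coerces a single record into a one-element collection, so the same
/// wrapper serves both the single and the collection case.
fn single<T>(record: &T) -> Lines<'_, T> {
    Lines(std::slice::from_ref(record))
}
```

With this shape a single record and a slice of records go through the same code path; whether a trailing \n after a lone record is acceptable is a design decision this sketch does not settle.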
noodles-bed/src/de.rs
Outdated
enum RecordState {
    ExpectingChrom,
    ExpectingChromStart(Record<3>),
    ExpectingChromEnd(Record<3>),
}
Keeping track of the Deserializer state is kind of like the inverse of keeping track of the bed::record::builder::Builder state in the Serializer.
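A rough, standard-library-only sketch of how such a state enum could drive field-by-field parsing. Partial is a hypothetical stand-in for Record<3>, and the Complete variant, advance and parse_line are additions made for this example, not part of the code under review.

```rust
/// Hypothetical stand-in for Record<3>, carried from state to state.
#[derive(Debug, Clone, PartialEq, Default)]
struct Partial {
    chrom: String,
    start: u64,
    end: u64,
}

#[derive(Debug, PartialEq)]
enum RecordState {
    ExpectingChrom,
    ExpectingChromStart(Partial),
    ExpectingChromEnd(Partial),
    /// Added for this sketch: all three fields have been consumed.
    Complete(Partial),
}

impl RecordState {
    /// Consumes one raw field and returns the state that follows it.
    fn advance(self, field: &str) -> Result<RecordState, String> {
        match self {
            RecordState::ExpectingChrom => Ok(RecordState::ExpectingChromStart(Partial {
                chrom: field.to_owned(),
                ..Partial::default()
            })),
            RecordState::ExpectingChromStart(mut partial) => {
                partial.start = field.parse::<u64>().map_err(|e| e.to_string())?;
                Ok(RecordState::ExpectingChromEnd(partial))
            }
            RecordState::ExpectingChromEnd(mut partial) => {
                partial.end = field.parse::<u64>().map_err(|e| e.to_string())?;
                Ok(RecordState::Complete(partial))
            }
            RecordState::Complete(_) => Err("record already complete".to_owned()),
        }
    }
}

/// Feeds the tab-separated fields of one line through the state machine.
fn parse_line(line: &str) -> Result<Partial, String> {
    let mut state = RecordState::ExpectingChrom;
    for field in line.split('\t') {
        state = state.advance(field)?;
    }
    match state {
        RecordState::Complete(partial) => Ok(partial),
        other => Err(format!("incomplete record, stopped at {:?}", other)),
    }
}
```

Each state owns the partially filled record, so a field arriving out of order or a line that ends early surfaces as an error instead of a half-built record.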
noodles-bed/src/de.rs
Outdated
// self.first = false;
// // Deserialize a map key.
// seed.deserialize(&mut *self.de).map(Some)
seed.deserialize(&mut *self).map(Some)
It seems to be getting stuck here, as this function is called over and over again without ever returning Ok(None). next_key_seed should return Ok(None) when there are no more entries left.
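The termination contract can be shown with a simplified, standard-library-only stand-in. FieldAccess, next_key and collect_keys are hypothetical and far simpler than serde's MapAccess, and the fixed key list is an assumption made for the example; the point is only that the caller's loop ends once Ok(None) is returned.

```rust
/// Simplified stand-in for a map accessor over a fixed list of field names.
struct FieldAccess {
    keys: std::vec::IntoIter<&'static str>,
}

impl FieldAccess {
    fn new() -> Self {
        FieldAccess {
            keys: vec!["chrom", "start", "end"].into_iter(),
        }
    }

    /// Mirrors the contract described above: Ok(Some(key)) while entries
    /// remain, then Ok(None) once they are exhausted.
    fn next_key(&mut self) -> Result<Option<&'static str>, String> {
        Ok(self.keys.next())
    }
}

/// Drives the accessor the way a visitor would. The loop only terminates
/// because next_key eventually yields Ok(None).
fn collect_keys(mut access: FieldAccess) -> Result<Vec<&'static str>, String> {
    let mut seen = Vec::new();
    while let Some(key) = access.next_key()? {
        seen.push(key);
    }
    Ok(seen)
}
```

An accessor that always answers Some(..), as the quoted line does, would keep this loop running forever, which matches the behavior described in the comment.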
Force-pushed from 6f2cfaf to 80f6b38.
* Adapted noodles-bed/src/main.rs to changes on the code
* Renamed AuxiliarBedRecordWrapper to SerdeRecordWrapper
* Renamed Record3Serializer to RecordSerializer
* Add documentation
* Remove useless comments
* Run clippy and fix warnings
I believe this is ready to review. A couple of things that I wanna add context to here:
Anyway, I feel like even if we wanted to implement it, we would have the problem of making the wrapper and the non-wrapper Record representation have the same result. But the Wrapper, standalone, would produce a new bracket level of indentation, I think (
If however, I find a way to make
Edit - Adding one other point of discussion that I replicated on the PR description for completeness: serde_with worked wonders, we are now reusing a lot of code directly from noodles, and this should be replicable to the other data formats. It's still needed to implement a
Looking great @GabrielSimonetto, well done 😄.
Ideally the ignored tests would work too, as converting it to a JSON representation is one of the reasons why using Serde is advantageous. And yes, I agree about the main.rs comment. It could be used to showcase the design.
I think we can still do that, but it will be our job to tell
But this conversation makes me realize that this communication between formats is important, so maybe I can already arrange one example of this (: (starting with a JSON serialization, finishing with a bed serialization, and vice versa)
Yea, that's fair enough. You could create an example in this PR or another. Feel free to merge this PR either way.
Following the ideas commented on the main PR, @mmalenic set up the environment where serde-json is derived, and I am implementing serde for the regular bed format.
Currently, the serializer is printing out the fields that belong to the other Bed formats, which makes it incompatible with the reader and writer already implemented by noodles.
The serde-json solution also behaves like this, but since we are creating that representation it's probably fine:
{"chrom":"sq0","start":8,"end":13,"name":null,"score":null,"strand":null,"thick_start":8,"thick_end":13,"color":null,"blocks":[]}
One problem that it currently has is that it uses a thick_start and thick_end, which is incompatible with the noodles definition of reusing the start and end values. I haven't found out how to make the default function in serde derive use an argument (the start value, for example).
Other than that, I am exposing the problem of our serializer here, but it seems pretty clear to me that the fields which don't belong to Bed<3> can't be present in the serialization. This might, however, introduce the problem of having specific serialization processes for each Bed format. Noodles already does this in record.rs to make every higher version of bed inherit the functionality from the lesser versions.
But at that point it becomes clear that all we want is for the serializer to call the Display method from Record. The question becomes: how do we represent that in the serializing entrypoints, and also, can we change it?
Notice that we now have to call record.to_string() inside the serializer, which is probably a problem, since the serializer only knows types from the data format, and probably doesn't have a way to know which struct it's dealing with.
At this point I realized that this couldn't be this hard, and to be fair, I hadn't realized that now that we are serializing to the actual bed format, Marko's previous suggestions like serde_with and serde_state would make a lot of sense.
Still I wanted to showcase what I have been working on so you guys can correct any big blunders I made in my decision process.
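On the thick_start and thick_end defaults mentioned above: serde's field-level default attribute takes a function without arguments, so it cannot see sibling fields. One possible workaround, sketched here with the standard library only, is to read into an intermediate type with optional fields and convert afterwards. RawRecord and Record are hypothetical types for illustration; with serde, a conversion like this could presumably be wired in through the container attribute #[serde(from = "RawRecord")], which is an assumption about how this project would use it.

```rust
/// Hypothetical raw form as read from the input: thick fields are optional.
#[derive(Debug)]
struct RawRecord {
    chrom: String,
    start: u64,
    end: u64,
    thick_start: Option<u64>,
    thick_end: Option<u64>,
}

/// Hypothetical final form: thick fields always hold a value.
#[derive(Debug, PartialEq)]
struct Record {
    chrom: String,
    start: u64,
    end: u64,
    thick_start: u64,
    thick_end: u64,
}

impl From<RawRecord> for Record {
    fn from(raw: RawRecord) -> Self {
        Record {
            // Missing thick values fall back to start and end, which a
            // zero-argument default function could not express.
            thick_start: raw.thick_start.unwrap_or(raw.start),
            thick_end: raw.thick_end.unwrap_or(raw.end),
            chrom: raw.chrom,
            start: raw.start,
            end: raw.end,
        }
    }
}
```

Converting a raw record with start 8, end 13 and no thick values gives thick_start 8 and thick_end 13, matching the JSON shown earlier.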
========= update ==========
serde_with worked wonders, we are now reusing a lot of code directly from noodles, and this should be replicable to the other data formats.
It's still needed to implement a Serializer and Deserializer which do the minimal work of basically parsing each element of a sequence, and calling Display or FromStr on each element. The Deserializer can be really lean thanks to the forward_to_deserialize_any! macro; the Serializer, however, doesn't have anything similar, and needs to implement a lot of unreachable! functions because of the trait signature. Maybe there is something in the ecosystem for such a bare-bones use case, but I am not aware of it.
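Stripped of the serde trait plumbing, the minimal work described above is roughly the following. write_records and read_records are hypothetical helper names, generic over any type with Display or FromStr; this is a standard-library sketch, not the Serializer and Deserializer from this PR.

```rust
use std::fmt::Display;
use std::str::FromStr;

/// Serializing a sequence: call Display on each element and end it with \n.
fn write_records<T: Display>(records: &[T]) -> String {
    records.iter().map(|record| format!("{}\n", record)).collect()
}

/// Deserializing a sequence: split on lines and call FromStr on each one,
/// stopping at the first element that fails to parse.
fn read_records<T: FromStr>(input: &str) -> Result<Vec<T>, T::Err> {
    input.lines().map(|line| line.parse::<T>()).collect()
}
```

Any record type that already implements both traits round-trips through these two functions without extra format code, which is the reuse from noodles that the update describes.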