Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to Marko's API #2

Merged
merged 35 commits into from
Sep 7, 2022

Conversation

GabrielSimonetto
Copy link

@GabrielSimonetto GabrielSimonetto commented Aug 14, 2022

Following the ideas commented on the main PR, @mmalenic set up the environment where serde-json is derived, and I am implementing serde for the regular bed format.

Currently, the serializer is printing out the fields that belong to the other Bed formats , which makes it incompatible with the reader and writer already implemented by noodles:

// Serializer's current output
"sq0\t8\t13\tnull\tnull\tnull\t8\t13\tnull\t\n"

// Noodles output
"sq0\t8\t13\n"

// Edit: solved

The serde-json solution also behaves like this, but since we are creating that representation it's probably fine:

{"chrom":"sq0","start":8,"end":13,"name":null,"score":null,"strand":null,"thick_start":8,"thick_end":13,"color":null,"blocks":[]}

One problem that it currently has is that it uses a thick_start and thick_end which is incompatible with the noodles definition of reusing the start and end values. I haven't found out how to make the default function in serde derive use an argument (the start value, for example):

// Current
fn default_position() -> Position {
    Position::new(1).unwrap()
}

// Probably desired?
fn default_position_thick_start(start: Position) -> Position {
    start
}

Other than that, I am exposing the problem of our serializer here, but it seems pretty clear to me that the fields which don't belong to Bed<3> can't be present in the serialization, this might however introduce the problem of having specific serialization processes for each Bed format. Noodles already does this in record.rs to make every higher version of bed inherit the fuctionality from the lesser versions

impl BedN<3> for Record<3> {}
impl BedN<3> for Record<4> {}
impl BedN<3> for Record<5> {}
...

But at that point it becomes clear that all we want is for the serializer to call the Display method from Record the question becomes: how do we represent that in the serializing entrypoints, and also, can we change it?

pub fn to_string<T>(value: &T) -> Result<String>
where
    // T: Serialize // Current signature
    T: Serialize + BedN<3> + std::fmt::Display, // This would prevent a Vec<Record<_>> from being allowed.
{
    // TODO: How to generalize Bed<N>
    let mut serializer = Record3Serializer {
        output: String::new(),
    };
    value.serialize(&mut serializer)?;
    Ok(serializer.output)
}

Notice that we now have to call record.to_string() inside the serializer. Which is probably a problem, since the serializer only knows types from the data-format, and probably doesn't have a way to know which struct its dealing with.

At this point I realized that this couldn't be this hard, and to be fair, I haven't realized that now that we are serializing to the actual bed format, marko's previous suggestions like serde_with and serde_state, would now make a lot of sense.

Still I wanted to showcase what I have been working on so you guys can correct any big blunders I made in my decision process.

========= update ==========

serde_with worked wonders, we are now reusing a lot of code directly from noodles, and this should be replicable to the other data formats.

It's still needed to implement a Serializer and Deserializer which does the minimal work of basically parsing each element on a sequence, and calling Display or FromStr on each element. The Deserializer can be really lean thanks to the forward_to_deserialize_any! macro, the Serializer however doesn't have anything similar, and needs to implement a lot of unreachable! functions because of the trait signature. Maybe there is something in the ecosystem for such a bare-bones use case, but I am not aware of it.

GabrielSimonetto and others added 26 commits July 3, 2022 16:23
…BedRecordWrapper"

This reverts commit bffa648987ab4c27f90a8d5efc1b3e864c7ba6b0.
This reverts commit 830441ca22426d52a3a89252c414b6f1ed00dd01.
@GabrielSimonetto GabrielSimonetto force-pushed the gabriel/example_serialization_deserialization branch from 2d3d413 to 658ebed Compare August 18, 2022 03:35
Copy link
Member

@mmalenic mmalenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far!

Let me know if my comments makes sense.


pub struct Serializer {
// This string starts empty and JSON is appended as values are serialized.
pub struct Record3Serializer {
output: String,
Copy link
Member

@mmalenic mmalenic Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid serializing records manually, you could keep track of a bed::record::builder::Builder state, and progressively build it up inside the serialize_* functions. For example, if a chromStart field is expected next, and a serialize_u64 (or another integer serialize function) is called, then the set_start_position builder function could be called. At the end (or intermittently throughout) the records could be converted to string output. This way all the string parsing is done by the noodles library.

This is similar to keeping track of the state for the Deserializer as shown below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we are already using noodles functionality to parse the serialization for us, via the DisplayFromStr functionality on serde_with used with the AuxiliarBedRecordWrapper

#[serde_as]
#[derive(Deserialize, Serialize)]
pub struct AuxiliarBedRecordWrapper<T>
where
    T: BedN<3> + std::str::FromStr + fmt::Display,
    <T as std::str::FromStr>::Err: std::fmt::Display,
{
    #[serde_as(as = "DisplayFromStr")]
    pub record: T,
}

I think I tipped you in the wrong direction since impl<'a> ser::Serializer for &'a mut Record3Serializer {} is really verbose and has lots of implementation that looks like its made to parse bed structs by itself, is that correct?

This implementation could be heavily dried out now that we are using serde_with, but we still need some kind of plain serializer to grab thinks between each \n and sending directly to record: T Display method.

Copy link
Member

@mmalenic mmalenic Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yeah, I think I see what you mean. Do you think there is any way to avoid even having to parse the \n characters? Ideally there would be no need to touch a raw string, and just allow noodles to do all of this.

However, it looks like noodles might write records one by one anyway, so this might not be possible.

Does this mean that serde_with or serde_as is generating the Serializer implementations, or does that need to be implemented as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of weird: we need some Serializer or Deserializer specific implementation to exist to have a full pipeline of serializaton and deserialization, but as soon as we arrive on our Serializer, all it needs to do is:

  1. Call Display from the specific Record implementation (which is being done by the serde_with derive)
  2. Insert \n on Vec<T> serializations.

So it's the most plain implementation of a Serializer trait ever, we probably can and should call unreachable!() on most of the functions (I didn't do this yet because maybe serde_with fails.)

Answering your question: I haven't seen a way to write a string separated \n inside of noodles, but now that you mention it, maybe implementing a Display and FromStr for Vec<Record<N>> shouldn't be that hard, the weird part is that we would still need an AuxiliarWrapper for this Vec, and I can't see a way to use the same AuxiliarWrapper both for a single and a collection of Records, in the past I've seen this solved by coercing single Records to a collection of one element, but maybe we don't want that.

Comment on lines 23 to 27
enum RecordState {
ExpectingChrom,
ExpectingChromStart(Record<3>),
ExpectingChromEnd(Record<3>),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping track of the Deserializer state is kind of like the inverse of keeping track of the bed::record::builder::Builder state in the Serializer.

// self.first = false;
// // Deserialize a map key.
// seed.deserialize(&mut *self.de).map(Some)
seed.deserialize(&mut *self).map(Some)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be getting stuck here, as this function is called over and over again, without returning Ok(None). next_key_seed should return Ok(None) when there are no more entries left.

@GabrielSimonetto GabrielSimonetto force-pushed the gabriel/example_serialization_deserialization branch from 6f2cfaf to 80f6b38 Compare September 1, 2022 18:18
 	* Adapted noodles-bed/src/main.rs to changes on the code
        * Renamed AuxiliarBedRecordWrapper to SerdeRecordWrapper
        * Renamed Record3Serializer to RecordSerializer
        * Add documentation
	* Remove useless comments
	* Run clippy and fix warnings
@GabrielSimonetto
Copy link
Author

GabrielSimonetto commented Sep 1, 2022

I believe this is ready to review.

A couple of things that I wanna add context to here:

  1. I removed from_bytes() e to_bytes() because I had the feeling that having only one function to analyze makes things clearer for now, and it's weird to understand bytes tests. If needed we can put them back.

  2. serde_json is not prepared to deal with SerdeRecordWrapper whatsoever, I feel like that's expected, if not, let me know . I left 2 ignored tests if anybody wants to see how it works

Anyway, I feel like even if we wanted to implement it, we would have the problem of making the wrapper and the non-wrapper Record representation have the same result. But the Wrapper, standalone, would produce a new bracket level of identation, I think ({{<record>}}, instead of {<record>})

# Errors from the ignored test:
failures:

---- record::tests::test_serde_json_to_string_for_serde_record_wrapper stdout ----
thread 'record::tests::test_serde_json_to_string_for_serde_record_wrapper' panicked at 'assertion failed: `(left == right)`
  left: `"\"sq0\\t7\\t13\""`,
 right: `"{\"chrom\":\"sq0\",\"start\":8,\"end\":13,\"thick_start\":8,\"thick_end\":13}"`', noodles-bed/src/record.rs:1519:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- record::tests::test_serde_json_from_str_for_serde_record_wrapper stdout ----
thread 'record::tests::test_serde_json_from_str_for_serde_record_wrapper' panicked at 'called `Result::unwrap()` on an `Err` value: Error("invalid type: map, expected a string", line: 1, column: 0)', noodles-bed/src/record.rs:1467:10

If however, I find a way to make serde_json functions behave by using a Into<Record> trait, we could force a wrapper to always be flattened before entering serde processes.

  1. I removed as many warnings as I could, but one still remains, I think we want to ignore it, if not, please let me know:
warning: method `from_str` can be confused for the standard trait method `std::str::FromStr::from_str`
  --> noodles-bed/src/de.rs:50:5
   |
50 | /     pub fn from_str(input: &'de str) -> Self {
51 | |         RecordDeserializer { input }
52 | |     }
   | |_____^
   |
   = note: `#[warn(clippy::should_implement_trait)]` on by default
   = help: consider implementing the trait `std::str::FromStr` or choosing a less ambiguous method name
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#should_implement_trait

warning: `noodles-bed` (lib) generated 1 warning
  1. I think our main.rs file will be the main file used to discuss with Michael Macias about the work we've done, if it's currently lackluster, we can keep improving it before merging. I was even thinking if we didn't need some kind of conversion from one format to the other before, since this is the BioSerDe proposal. But even if that is the case, we have a stopping point here which can be used in the discussion I think. (although, we do have the BED + BED_JSON to work with, if needed)

Edit - Adding one other point of discussion that I replicated on the PR description for completeness:

serde_with worked wonders, we are now reusing a lot of code directly from noodles, and this should be replicable to the other data formats.

It's still needed to implement a Serializer and Deserializer which does the minimal work of basically parsing each element on a sequence, and calling Display or FromStr on each element. The Deserializer can be really lean thanks to the forward_to_deserialize_any! macro, the Serializer however doesn't have anything similar, and needs to implement a lot of unreachable! functions because of the trait signature. Maybe there is something in the ecosystem for such a bare-bones use case, but I am not aware of it.

@GabrielSimonetto GabrielSimonetto marked this pull request as ready for review September 1, 2022 20:30
Copy link
Member

@mmalenic mmalenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great @GabrielSimonetto, well done 😄.

Ideally the ignored tests would work too, as converting it to a JSON representation is one of the reasons why using Serde is advantageous. And yes, I agree about the main.rs comment. It could be used to showcase the design.

noodles-bed/src/ser.rs Outdated Show resolved Hide resolved
noodles-bed/src/de.rs Outdated Show resolved Hide resolved
noodles-bed/src/de.rs Outdated Show resolved Hide resolved
noodles-bed/src/de.rs Outdated Show resolved Hide resolved
@GabrielSimonetto
Copy link
Author

GabrielSimonetto commented Sep 5, 2022

Ideally the ignored tests would work too, as converting it to a JSON representation is one of the reasons why using Serde is advantageous.

I think we can still do that, but it will be our job to tell serde_json that it should use a Record, and that our own serde should use the SerdeRecordWrapper. At least I can't see how to tell specifically serde_json that it should unnest the wrapper before starting it's operations.

But this conversation makes me realize that this communication between formats is important, so maybe I can already arrange one example of this (: (starting with a json serialization, finishing with a bed serialization, and vice-versa)

@mmalenic
Copy link
Member

mmalenic commented Sep 7, 2022

But this conversation makes me realize that this communication between formats is important, so maybe I can already arrange one example of this (: (starting with a json serialization, finishing with a bed serialization, and vice-versa)

Yea, that's fair enough. You could create an example in this PR or another. Feel free to merge this PR either way.

@GabrielSimonetto GabrielSimonetto changed the base branch from bed_serde to serde September 7, 2022 18:47
@GabrielSimonetto GabrielSimonetto merged commit 4590155 into serde Sep 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants