Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support semantic error messages by supporting deserializing spans #302

Closed
Tracked by #340
epage opened this issue Jan 24, 2022 · 6 comments · Fixed by #435
Closed
Tracked by #340

Support semantic error messages by supporting deserializing spans #302

epage opened this issue Jan 24, 2022 · 6 comments · Fixed by #435
Labels
A-error Area: Error reporting A-parse Area: Parsing TOML C-enhancement Category: Raise on the bar on expectations

Comments

@epage
Copy link
Member

epage commented Jan 24, 2022

See toml::Spanned for inspiration

@Gankra
Copy link

Gankra commented Jun 26, 2022

(I rolled by this issue because my project uses toml-edit for it's higher-quality serialization, but its low-quality spans for deserialization means I've been experimenting with preferring toml-rs for the other side of the equation, even though toml-edit/easy is explicitly designed to preserve more info, alas!)

I've been using this API in toml-rs in anger a lot this week, so I figured I'd chime in with some details on how it works, and some issues it has:

Spanned defines four magic strings:

pub(crate) const NAME: &str = "$__toml_private_Spanned";
pub(crate) const START: &str = "$__toml_private_start";
pub(crate) const END: &str = "$__toml_private_end";
pub(crate) const VALUE: &str = "$__toml_private_value";

And then uses a custom visitor and does the following:

        let visitor = SpannedVisitor(::std::marker::PhantomData);

        static FIELDS: [&str; 3] = [START, END, VALUE];
        deserializer.deserialize_struct(NAME, &FIELDS, visitor)

The deserializer then, at various points in the code, checks for that exact state and wraps itself in a SpannedDeserializer that emits the 3 magic fields, the first 2 being the span info:

        if name == spanned::NAME && fields == [spanned::START, spanned::END, spanned::VALUE] {
            let start = 0;
            let end = self.input.len();

            let res = visitor.visit_map(SpannedDeserializer {
                phantom_data: PhantomData,
                start: Some(start),
                value: Some(self),
                end: Some(end),
            });
            return res;
        }

That the SpannedVisitor expects:

            fn visit_map<V>(self, mut visitor: V) -> Result<Spanned<T>, V::Error>
            where
                V: de::MapAccess<'de>,
            {
                if visitor.next_key()? != Some(START) {
                    return Err(de::Error::custom("spanned start key not found"));
                }

                let start: usize = visitor.next_value()?;
                ...

I've been trying to use this in anger a fair bit this week and it seems to largely work for "normal" derived types but is more of a mess with custom deserialization. Because it relies on random places in the code detecting the magic fields, if you ever invoke it in a way it wasn't expecting (like variants on the "string or struct" pattern in serde's docs) then you get nasty "expected spanned" from it not creating the SpannedDeserializer and emitting the magic fields.

I'm not sure if those problems are "I'm bad at serde" or "this design is buggy" or "serde makes this impossible to do robustly".

The actual Spanned type also has a pretty suboptimal API. It's very opaque, and you have to explicitly call a method to get the wrapped value, so retroactively adding spans to your existing types is extremely disruptive to code. Thankfully because it communicates with magic fields, you're able to "fork" just Spanned. I did this to make it more like Box<T>, in that it does its best to forward everything from T via Deref and trait impls. All that actual span methods I made into associated functions so you have to do things like Spanned::start(&spanned_val), which is fine since you don't actually care about the span until you want to emit an error.

I am... incredibly bad at serde stuff so I'm not sure if I could help with bringing this stuff up, but here's a brain dump.

@epage
Copy link
Member Author

epage commented Jun 28, 2022

Thanks for the brain dump! These details will be important when implementing this!

@Gankra
Copy link

Gankra commented Jun 28, 2022

Some extra notes from landing this:

  • serde(from = "toml::Spanned<T>") seems to work well as an alternative to get a better API without importing the private details
  • It seemed that my woes were rooted more in #[serde(flatten)] than anything else. #[serde(untagged)] did also seem messed up. basically anything that requires the input to be "extra" buffered makes Spanned break. I worked around this by replacing these kinds of things with a #[serde(try_from = "OuterTypeWithEveryPossibleField")] that manually implements flatten/untagged logic after deserializing all the Spanned more naturally. Not sure if the problem is in serde or Spanned, and whether it's fundamental to the design.

@epage epage added A-parse Area: Parsing TOML A-error Area: Error reporting labels Sep 23, 2022
@olivia-fl
Copy link

I'm pretty sure the answer is "serde makes this impossible to do robustly". I looked into span support for serde things a bit ago in serde-rs/serde#1811 (comment), and as far as I can tell there's no way to propagate span info through serde without resorting to wild hacks like what toml::Spanned was doing previously.

It's possible to extend serde's API to better support spans. Unfortunately, with the designs I've found so far, you generally have two choices:

  • Support propagating spans in a way that allows the deserializer to choose an appropriate span representation, which requires a breaking change to serde's API
  • Bake a particular representation choice into serde's API, which will not be appropriate for all formats

@Kixunil
Copy link

Kixunil commented Dec 27, 2022

I wonder if it'd make sense to check that the magic strings are the same instead of checking if they are equal, that means comparing the address rather than contents. Unless I'm missing something this should make it more robust (if anyone for whatever reason uses these strings in the input it shouldn't confuse the parser). If there was some way to prevent the linker from merging same symbols they could all be empty strings which would be even nicer but I didn't find how to do that.

epage added a commit to epage/toml_edit that referenced this issue Jan 13, 2023
@epage
Copy link
Member Author

epage commented Jan 13, 2023

Note that I'm keeping the scope of this issue fairly limited to "do what toml did" so we can make progress towards merging the crates. I'd welcome people to open a new issue with their feedback for further improvement.

epage added a commit to epage/toml_edit that referenced this issue Jan 13, 2023
epage added a commit to epage/toml_edit that referenced this issue Jan 13, 2023
epage added a commit to epage/toml_edit that referenced this issue Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-error Area: Error reporting A-parse Area: Parsing TOML C-enhancement Category: Raise on the bar on expectations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants