Support for embedded raw binary #623

elrnv · 2023-07-16T05:15:12Z

There are use cases where embedding binary data into an XML document is convenient. Although not officially supported by the XML spec there are valid and popular uses that exist in the wild (e.g. VTK). Binary VTK files are quite common, in fact the tag (which can contain binary data) is the default when writing vtk files, and there are plenty of files like this in the wild.

In the past quick-xml was able to support binary data, but this has changed and in the current state requires some non-trivial changes for adding support for being able to write and read uninterpreted binary data.

My first attempt didn't really address the issue correctly since binary data can have valid '<' and '>' tags, which can trip up the parser, meaning that at least reading must be augmented from the lowest level (probably as part of Reader).

As discussed previously the implementation for this should work as an additional feature.

Reading

I think it is not difficult to add a customization point for the reader in quick-xml without being too invasive by adding a "context" field (supplied externally) to Reader that can call ask how many bytes should be read by feeding it context about what has already been read.

Writing

All types in the serialize part of the code is bound by fmt::Write. I haven't found a way to maintain that bound everywhere and also support for writing binary files to an optional or custom io::Write type. To enable this I think we have to switch back to io::Write. I understand that there are some performance concerns about having to write to a buffer if io::Write is used, but I don't see another way of adding support for binary writing without io::Write. I can really use some help from the maintainers to address this point.

The text was updated successfully, but these errors were encountered:

Mingun · 2023-07-16T17:24:04Z

It would be valuable, if you could link or attach some examples of files and provide links to the explanation, how VTK parser should understand that some byte is not a markup, but a part of a binary data.

I found an archive with some files in elrnv/vtkio#27 and fight now I suspect the following algorithm for resolving ambiguity:

when read opening tag with a special mark (<AppendedData encoding="raw"> in an example) the parser turned into special binary reading mode
in that mode any sequence of bytes are allowed until it contains the closing tag sequence (</AppendedData> in an example)
closing tag sequence should be interpreted in the document encoding (utf-8 in an example)

So I think we could work on this issue in a few steps:

Implement a way to read binary data. I think, it should be a special method(s) of a Reader that should be called after consuming Event::Start(r#"<AppendedData encoding="raw">"#) and fill a provided buffer with the data and return a Span. The special Event::Binary is unlikely to be needed at this point, because there is no way to generically understand that we should read binary data instead of a text data. But we could think about a callback that will take &BytesStart and return a parsing mode. I think, that this part could be available without feature flags
Think about API for a Writer. Actually, I think, that we need to split the current Event into reader::Event and writer::Event, because this could benefit for this and other cases
Think about serde API and implementation

elrnv · 2023-07-16T18:38:12Z

Thank you for the quick response!

It would be valuable, if you could link or attach some examples of files and provide links to the explanation, how VTK parser should understand that some byte is not a markup, but a part of a binary data.

Here are a few examples:

a file in the wild: https://github.com/elrnv/vtkio/files/11272859/fluid_1_91.zip
binary VTK exported with Paraview: https://github.com/elrnv/vtkio/blob/master/assets/hexahedron_binary.vtu
a binary VTK with compressed bytes: https://github.com/elrnv/vtkio/raw/master/assets/hexahedron_lz4.vtu

There doesn't seem to be any documentation on how to parse this kind of binary section, just on how to decode it. The official vtk file format is documented here
The binary section is always prefixed with an underscore, and the first few bytes determine how large the binary section is.

when read opening tag with a special mark ( in an example) the parser turned into special binary reading mode

in that mode any sequence of bytes are allowed until it contains the closing tag sequence ( in an example)

closing tag sequence should be interpreted in the document encoding (utf-8 in an example)

That makes sense to me, and should be robust enough to support practically all such VTK files with embedded binary blobs. Although, I think ideally this logic is best left in vtkio, here it seems we should just provide a way to do context dependent parsing that is general enough for this use case.

So I think we could work on this issue in a few steps ...

This makes sense to me, but for the writer, I am still in the dark as to how io::Writer should be supported throughout the serde modules.

It is easiest for me to prototype a complete solution for reading and writing and test it directly using the tests I already have in vtkio. I am happy to start a PR and try to separate the reading and writing changes though to make it easier to review.

This commit sadly breaks support for VTK files with binary AppendedData blobs. The code needed to support this is currently not functional and depends on the completion of this feature in quick-xml (tracked in tafia/quick-xml#623) All binary related code is now featured gated, which is a regression from previous version of vtkio, but will be re-enabled in the future.

Mingun · 2024-07-11T20:31:14Z

Probably reading binary data is possible today, if you get inner reader when you've ready to read binary and read from it. The possible drawback is that internal offsets in the Reader won't be updated, so you will need to track read bytes and correct result of .error_position() / .buffer_position() accordingly.

elrnv · 2024-07-13T20:05:49Z

Would this also work from the Serde API?

Mingun · 2024-07-13T20:33:57Z

No, serde requires more work. Suggested approach lefts to the user the decision where to stop reading binary data and return to reading XML events. In serde API you should somehow to tell the Reader these things. That is subject for design. I do not plan to design such API in the near time, so you can try to do that yourself.

ssokolow · 2024-08-01T02:01:04Z

Just to get some clarification, am I correct in my understanding that this would also allow proper handling of embedded DSLs or de facto embedded DSLs (eg. HTML's <textarea>, <script>) as long as the apparent lack of a call to push back/un-read data is compensated for by disabling check_end_names and matching the closing tag as if the it was part of the "binary data" and the "binary data" was designed to be self-terminating?

(I know that, given a restricted HTML subset that lacks those "embedded DSL elements" but contains self-closing tags like <br>, such as a Pidgin chatlog, older versions of quick-xml can be used to parse it an order of magnitude more quickly than with html5ever by turning off paired tag enforcement and enforcing expected well-formedness rules manually.)

Mingun · 2024-08-01T07:04:48Z

Yes, you understand correctly. You even don't need to disable check_end_names because when you read from .stream(), you are effectively cut some data from seeing by XML parser. As long as not cutted content represent valid XML with proper nesting everything will work smoothly.

ssokolow · 2024-08-01T07:12:35Z

Yes but, unless the cut content is length-prefixed, which embedded DSLs generally aren't, there'd need to be some way to "rewind" the stream to put the closing tag back into what the XML parser sees after the DSL parser has seen it. Hence, disabling check_end_names.

Mingun · 2024-08-01T15:44:58Z

While that may be needed in some cases, I think, that usually you shouldn't do that. .stream() returns an impl BufRead, so you can lookahead without consuming data. After all, it wraps the type that you pass to the from_reader method, so you can pass such a type which will able to buffer arbitrary amount of data without consuming them.

ssokolow · 2024-08-03T14:49:36Z

Huh. You're right... and it looks like I'm not the only person who was confused by the naming and documentation of fill_buf() into thinking that BufRead had no peek().

(Sorry for the delayed response. It took me a couple of days to realize that I was remembering a response I'd intended to send instead of one I'd actually sent.)

Mingun added enhancement help wanted labels Jul 16, 2023

Mingun mentioned this issue Jun 4, 2024

Allow  character references #750

Open

Mingun mentioned this issue Jul 12, 2024

Implement ability to read data directly from the underlying reader #783

Merged

elrnv mentioned this issue Jul 21, 2024

Serde support for serializing and deserializing binary blobs in XML files #788

Open

Mingun closed this as completed in #783 Jul 22, 2024

elrnv mentioned this issue Aug 2, 2024

Support for embedded raw binary with serde API #792

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for embedded raw binary #623

Support for embedded raw binary #623

elrnv commented Jul 16, 2023

Mingun commented Jul 16, 2023

elrnv commented Jul 16, 2023 •

edited

Loading

Mingun commented Jul 11, 2024

elrnv commented Jul 13, 2024

Mingun commented Jul 13, 2024

ssokolow commented Aug 1, 2024 •

edited

Loading

Mingun commented Aug 1, 2024

ssokolow commented Aug 1, 2024

Mingun commented Aug 1, 2024

ssokolow commented Aug 3, 2024

Support for embedded raw binary #623

Support for embedded raw binary #623

Comments

elrnv commented Jul 16, 2023

Reading

Writing

Mingun commented Jul 16, 2023

elrnv commented Jul 16, 2023 • edited Loading

Mingun commented Jul 11, 2024

elrnv commented Jul 13, 2024

Mingun commented Jul 13, 2024

ssokolow commented Aug 1, 2024 • edited Loading

Mingun commented Aug 1, 2024

ssokolow commented Aug 1, 2024

Mingun commented Aug 1, 2024

ssokolow commented Aug 3, 2024

elrnv commented Jul 16, 2023 •

edited

Loading

ssokolow commented Aug 1, 2024 •

edited

Loading