-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ergonomics & encodings #158
Comments
Thanks for reaching out, these kind of feedbacks are very useful! I agree this is not super ergonomic but in practice I don't feel this is really an issue. This is very subjective of course and i'd be more than happy to improve the current situation. Just to be sure we're on the same page we want to avoid both converting (checking) to str AND decoding when possible (we'd rather encode the matching keyword once, most of the time at compile time than decode all events). Having 2 types of events is definitely a good idea and could help. I prefer the |
Okay, glad to hear that you're receptive to this. I'm not sure if it's something I'm going to dig too deeply into, but if I do I'll propose a more thorough design first. Can you clarify precisely what you mean by checking / decoding? Are you talking only about the source xml data, or are you also thinking about the constants that keys are matched against? |
Rust performs those checks for a reason: failing to do so is a security vulnerability. Skipping decoding also makes it impossible to support ASCII-incompatible encodings like UTF-16 (which is required by the XML spec, by the way). A correct and secure XML parser must always decode. |
Right. I don't see why these checks wouldn't be performed here as well, thus I don't see how it'd impact security. Granted it is a manual process but an absolutely mandatory one: Event::Start(utf16_event) => {
let utf8_name = reader.decode(utf16_event.local_name())?;
match utf8_name {
"utf8 name1" => { ... }
"utf8 name2" => { ... }
} Please correct me if I'm wrong. As far as I know, quick-xml is the only one which actually lets you deal with non-utf8 data at all! The way the api is done is to provide extra perf to developers who know the xml they are dealing with, which is very common in my experience. Coming back to the security examples, quick-xml will not try to decode anything per default and will not try to delete or replace any character. True, it will NOT break per default if the data is invalid but I believe it will not let you insert anything malicious. |
I'm interested to know what your design goal is (both for this issue and general). Do you always want
Besides ergonomics, I think the current encoding design doesn't work. In particular, because of the assumption here that decoding is just something the caller does on text nodes as needed,
I see there's issue #322 open about UTF-16 not working, and a test of UTF-16 is I think the most straightforward approach would be (as well as fixing the detection) to decode non-UTF-8 encodings into a buffer and feed that into the XML parsing. Then everything afterward is UTF-8. And then it can give btw:
I'm not sure I agree with this. In particular, I suspect it's possible to do "interesting" things via non-canonical UTF-8 and the like. For example, I think the duplicate attribute detection works on the raw bytes now, so by not validating the UTF-8 it's possible to defeat it. Then maybe two implementations differ on which one they use, bypassing some validation logic or something. There are people who are really good at coming up with crazy scenarios vaguely like this that result in actual security holes in some system or another. I'm not really one of them but I'm doing my best to think like them. |
I've been thinking about this a bit lately (and writing my thoughts on this PR) so I will copy them here where they are more relevant. I agree with the OP and also, in particular, this comment
First, UTF-8 validation can be really really fast with the right libraries. If you look at the benchmarks in https://github.com/rusticstuff/simdutf8, even on years old hardware (Zen 2) it is possible to validate UTF-8 on large blocks of text at a speed of 150 gigabytes per second (for pure ascii) and 17 gigabtyes per second (for Cyrillic and Chinese text). In comparison, quick-xml even without doing any encoding checks only manages around 600-700 megabytes per second (measured using #418). So the cost could be pretty miniscule, 5% actually sounds roughly correct. One thing you should notice though is that this only applies to large blocks of data. Regardless of whether you're using SIMD or the standard library, validating many small strings is much slower, in terms of throughput, than validating longer ones. So if you take the assumption that users will need to convert some significant portion of their XML to a UTF-8 encoded type ( Another issue is that if the user does need to support multiple encodings, with the current API (assuming utf-16 support were eventually implemented), they would need to decode things everywhere, which means a lot of small allocations. The performance cost of those allocations is likely to be as bad or worse than simply decoding everything up-front and working with UTF-8 directly. So the performance case against this is actually not very clear cut. There are still probably scenarios where it would be a bit worse, but I would wager that those scenarios are not so common. Combine this with some of the other points brought up:
And I think, it actually does make sense to pre-decode all data as UTF-8. It would most certainly be simpler to use and maintain, we could quickly gain UTF-16 support mostly for "free", and it might actually be faster (but at least, probably not much slower). |
I also ran the
[0] UTF-8 being more compact on average - it can be worse for Asian / Cyrillic character sets, but with XML you can assume there are a lot of tags, attribute names, and sentinel characters which are ASCII, which diminishes the advantage |
I ran some additional benchmarks against quick-xml and it appears my assumptions are correct, at least as far as the utf-8 validation / Using the
On the existing UTF-8 test inputs. The results:
While I didn't do any testing with UTF-16 yet, I expect similar results, given the impact that the additional small allocations of @tafia @Mingun, does this sound sufficiently compelling to consider adjusting the current approach? Is there any particular additional pieces of data which you would like to collect before making that decision? The proposal, in summary:
|
About ergonomics: Is it possible to construct a Based on the previous comment of @dralley and for implementation of UTF-16 with reasonable effort, I would prefer to convert to or validate as UTF-8 before parsing. Maybe quick-xml could offer both strategies: For example, a const bool |
Yes. BytesStart::to_end(). |
Hi there,
I've just been playing around with quick-xml to handle reading and writing of Glyph Interchange Format files, and I had some observations I wanted to share.
[u8]
/Cow<[u8]>
vsCow<str>
My understanding is that
quick-xml
attempts to do the least work possible, and to this end exposes byte slices by default. I can understand the reasoning (don't allocate unless the user has asked for it explicitly) but there are two things I keep thinking about, here:utf-8 is common: the majority of XML I've been dealing with is encoded in utf-8. In this case, we should be able to provide direct access to string slices, allocating only if we need to escape something.
In a non-utf8 encoding, it is hard to use bytes correctly: exposing bytes directly encourages users to do things like
if attr.key == b"mykey"
(which is used in the readme). This is fine with an ascii-compatible encoding like utf-8 or Shift JIS, but will fail unexpectly on, say, utf-16; and utf-16 is an explicitly supported encoding according to the xml spec.API thoughts
I haven't thought about this too too much, but I do worry that the current behaviour has worse ergonomics than necessary in the general (utf-8) case, while also making it easy to write bugs in the exceptional (utf-16) case.
One possible alternative I could imagine would be having separate
Event
andEventRaw
types;Event
would work withCow<str>
, which would be slices of the underlying buffer where possible (when the encoding is utf-8) and which would be decoded and allocated during parsing, otherwise.EventRaw
would work like the currentevent
.Another option would be to have add some new type,
XmlSlice
, which would include a lazily-allocatingto_str
method, while also exposing access to the underlying bytes and even the encoding; this type would be used everywhere that[u8]
orCow<[u8]>
is, currently. This could also include information like the location of this text in the original document.thanks
I want to clarify that this is not a feature request or a concrete proposal for a new API; I just wanted to voice these thoughts and hear if anyone had anything to add. It may be that what I'm describing should just be an alternative crate, or that there are other design constraints I'm not aware of.
Thank you for the time you've put into this code so far!
The text was updated successfully, but these errors were encountered: