Implement JSON token stream deserializer #454
Conversation
Overall I like the direction, but we need to avoid allocating strings during parsing.
```rust
/// Reads a JSON string out of the stream.
fn read_string(&mut self) -> Result<String, Error> {
```
The XML parsing library I used didn't unescape in the tokenizer. That enabled the tokenizer to be allocation-free (and for input that you know won't need unescaping, like base64, you can skip the unescape step entirely).

I don't think we want to allocate for every string we read out of the input stream, especially since many of them (e.g. a key in a map) we never need to own.
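As a sketch of the deferred-unescape idea (names here are illustrative, not this PR's API): return a `Cow<str>` that borrows straight from the input when the raw string body contains no escapes, and only allocate on the slow path.

```rust
use std::borrow::Cow;

/// Illustrative sketch: unescape the raw body of a JSON string (the slice
/// between the quotes) only when it actually contains escapes.
fn unescape(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        // Fast path: no escapes, so we can borrow straight from the input.
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some('"') => out.push('"'),
            Some('\\') => out.push('\\'),
            Some('/') => out.push('/'),
            // \b, \f, and \uXXXX handling omitted from this sketch
            Some(other) => out.push(other),
            None => break,
        }
    }
    Cow::Owned(out)
}
```

With this shape, a map key that contains no escapes is never copied at all.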
This makes a lot of sense. Will refactor.
```rust
pub enum Token {
    StartArray,
    EndArray,
    ObjectKey(String),
```
These almost certainly want to be `&'a str`s that refer into the input.
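A sketch of what the borrowed version could look like (the variant set beyond the excerpt above is hypothetical):

```rust
/// Sketch only: a lifetime-parameterized token that borrows string data
/// from the input buffer instead of allocating a `String` per token.
#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    StartArray,
    EndArray,
    StartObject,
    EndObject,
    ObjectKey(&'a str),    // borrows from the input; no allocation
    ValueString(&'a str),  // likewise; unescaping can be deferred
}
```

The trade-off is that tokens cannot outlive the input buffer, which is exactly what a streaming deserializer wants: consumers that need ownership copy explicitly.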
Also, if there is a harness to attach this to the JSON test suite, we should definitely include that.
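For reference, JSONTestSuite encodes the expected outcome in each case's file name: `y_` files must parse, `n_` files must be rejected, and `i_` files are implementation-defined. A harness could look roughly like this (the `parse` callback and case names are stand-ins, not this crate's API):

```rust
/// JSONTestSuite prefixes: `y_` must accept, `n_` must reject,
/// `i_` may go either way.
#[derive(Debug, PartialEq)]
enum Expectation {
    Accept,
    Reject,
    Either,
}

fn expectation(file_name: &str) -> Expectation {
    match file_name.as_bytes().first() {
        Some(b'y') => Expectation::Accept,
        Some(b'n') => Expectation::Reject,
        _ => Expectation::Either, // `i_` cases: either outcome conforms
    }
}

/// Hypothetical harness loop: run every case and count mismatches
/// between the parser's verdict and the suite's expectation.
fn run_suite(cases: &[(&str, &[u8])], parse: impl Fn(&[u8]) -> bool) -> usize {
    let mut failures = 0;
    for (name, bytes) in cases {
        let accepted = parse(bytes);
        let ok = match expectation(name) {
            Expectation::Accept => accepted,
            Expectation::Reject => !accepted,
            Expectation::Either => true,
        };
        if !ok {
            failures += 1;
        }
    }
    failures
}
```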
Looks like the unescaping has a bug around UTF-16 surrogate pairs. I will look into fixing that in addition to decoupling parsing from unescaping.
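For context on the surrogate-pair issue: JSON escapes non-BMP characters as two `\uXXXX` sequences (e.g. `\uD83D\uDE00` for 😀), and the classic bug is decoding each half as a standalone character. The pair must be combined as `0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)`. A sketch of the combining step (the actual fix in this PR may be structured differently):

```rust
/// Combine a UTF-16 surrogate pair (as produced by `\uD83D\uDE00`)
/// into a single Unicode scalar value.
fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None; // not a valid high/low surrogate pair
    }
    let code_point =
        0x10000 + (((high as u32 - 0xD800) << 10) | (low as u32 - 0xDC00));
    char::from_u32(code_point)
}
```

An unpaired or out-of-order surrogate yields `None`, which the unescaper can turn into a parse error.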
LGTM! We should add a proper fuzzing harness, either here or in a follow-up PR. A benchmark suite would also be good, but it's not critical right now.
I will add fuzzing in a follow-up PR.
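Until a proper cargo-fuzz harness lands, a cheap stopgap is a deterministic mutation smoke test: mutate seed inputs with a fixed PRNG and assert the parser never panics. Here `parse` is a stand-in for the real tokenizer entry point (not this crate's API):

```rust
use std::panic;

/// Stand-in for the tokenizer entry point; the real harness would drive
/// the token stream to completion instead.
fn parse(input: &[u8]) -> Result<(), ()> {
    std::str::from_utf8(input).map(|_| ()).map_err(|_| ())
}

/// Not a substitute for cargo-fuzz: a reproducible smoke test that the
/// parser returns Ok/Err on mangled input rather than panicking.
fn smoke_fuzz(seed_inputs: &[&[u8]], rounds: u32) -> bool {
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for seed in seed_inputs {
        let mut buf = seed.to_vec();
        for _ in 0..rounds {
            // xorshift64: fixed seed keeps mutations reproducible
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            if !buf.is_empty() {
                let idx = (state as usize) % buf.len();
                buf[idx] = (state >> 32) as u8;
            }
            let input = buf.clone();
            if panic::catch_unwind(move || {
                let _ = parse(&input);
            })
            .is_err()
            {
                return false; // a panic here is a bug worth a regression test
            }
        }
    }
    true
}
```

A real harness would swap the mutation loop for libFuzzer via cargo-fuzz, which also handles corpus management and minimization.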
This adds a JSON token streaming deserializer to smithy-json for #161.

It passes all tests in JSONTestSuite, except for some notable ones:

- There are some test cases around invalid Unicode sequences that are expected to pass but always fail, since this implementation coerces all JSON strings into Rust strings, and Rust requires valid UTF-8. I don't think accepting these invalid sequences would be valuable for our SDK implementation, since the protocols call for valid UTF-8 in strings. I was wrong here.
- There is a test case for `[][]` that is expected to fail, which is totally valid for this implementation since it would result in `start_array, end_array, start_array, end_array`.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.