Escape control characters in JSON strings #427

Merged
jdisanti merged 4 commits into smithy-lang:main from json-escape-fix
May 27, 2021
jdisanti
Collaborator

This escapes the control characters in the range 0x00-0x1F (inclusive) and reworks the escaping so that it should be more performant. It now iterates over the input string exactly once, rather than once to check whether escaping is necessary and then again to escape, and it no longer converts anything to char.
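As a rough illustration of the single-pass approach described above, here is a minimal sketch. It is not the code from this PR: the names `escape_json_sketch` and `needs_escape` are hypothetical, and the choice of short escapes (`\n`, `\t`, and so on) versus `\u00XX` is an assumption based on the JSON grammar.

```rust
use std::borrow::Cow;

/// Hypothetical sketch (not the PR's implementation): escapes `"`, `\`,
/// and the control bytes 0x00-0x1F, borrowing the input when nothing changes.
fn escape_json_sketch(value: &str) -> Cow<'_, str> {
    let bytes = value.as_bytes();
    // Scan for the first byte that needs escaping; if there is none,
    // return the input unchanged without allocating.
    let first = match bytes.iter().position(|&b| needs_escape(b)) {
        Some(index) => index,
        None => return Cow::Borrowed(value),
    };
    let mut escaped = Vec::with_capacity(bytes.len() + 1);
    escaped.extend_from_slice(&bytes[..first]);
    // Continue from where the scan stopped, so each byte is visited once.
    for &byte in &bytes[first..] {
        match byte {
            b'"' => escaped.extend_from_slice(b"\\\""),
            b'\\' => escaped.extend_from_slice(b"\\\\"),
            0x08 => escaped.extend_from_slice(b"\\b"),
            0x0C => escaped.extend_from_slice(b"\\f"),
            b'\n' => escaped.extend_from_slice(b"\\n"),
            b'\r' => escaped.extend_from_slice(b"\\r"),
            b'\t' => escaped.extend_from_slice(b"\\t"),
            // Remaining control bytes become a six-character \u00XX escape.
            0x00..=0x1F => escaped.extend_from_slice(format!("\\u{:04x}", byte).as_bytes()),
            _ => escaped.push(byte),
        }
    }
    // Only ASCII bytes were replaced, and only with ASCII sequences,
    // so the buffer is still valid UTF-8.
    Cow::Owned(String::from_utf8(escaped).expect("escaping preserves UTF-8 validity"))
}

/// Hypothetical helper: true for bytes that must be escaped in a JSON string.
fn needs_escape(byte: u8) -> bool {
    matches!(byte, b'"' | b'\\' | 0x00..=0x1F)
}
```

The sketch uses the checked `String::from_utf8` to stay in safe Rust; whether the check is worth its cost is a separate design choice.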

To verify it correctly escapes everything, I ran the following additional test (not included in the PR since it takes approximately 100 seconds in a debug test run):

    #[test]
    fn all_of_them() {
        for value in 0..u32::MAX {
            if let Some(chr) = char::from_u32(value) {
                let string = String::from(chr);
                let escaped = escape_string(&string);
                let serde_escaped = serde_json::to_string(&string).unwrap();
                let serde_escaped = &serde_escaped[1..(serde_escaped.len() - 1)];
                assert_eq!(&escaped, serde_escaped);
            }
        }
    }

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Collaborator

rcoh left a comment

Can you bring this test back?

Just mark it #[ignore]
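For reference, a slow test can stay in the tree and be skipped by default with `#[ignore]`. The sketch below uses a hypothetical helper, `count_valid_scalars`, as a stand-in for an expensive check; it is not the test from this PR.

```rust
/// Hypothetical stand-in for an expensive exhaustive check:
/// counts the values below `limit` that are valid Unicode scalar values.
fn count_valid_scalars(limit: u32) -> usize {
    (0..limit).filter(|&v| char::from_u32(v).is_some()).count()
}

#[cfg(test)]
mod tests {
    use super::*;

    // Skipped by a plain `cargo test`; run it with `cargo test -- --ignored`.
    #[test]
    #[ignore]
    fn exhaustive_scalar_count() {
        // Every code point up to char::MAX except the 0x800 surrogates is valid.
        let limit = u32::from(char::MAX) + 1;
        assert_eq!(count_valid_scalars(limit), 0x110000 - 0x800);
    }
}
```

`cargo test -- --include-ignored` runs the ignored tests together with the regular ones.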

});
last = index + 1;
fn escape_string_inner(start: &[u8], rest: &[u8]) -> String {
let mut escaped = start.to_vec();
Collaborator

I would probably try to pre-allocate enough capacity here:

let mut escaped = Vec::with_capacity(start.len() + rest.len() + 1);
escaped.extend_from_slice(start);
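A self-contained sketch of the pre-allocation idea follows. The function name `preallocated` is hypothetical. It appends with `extend_from_slice` because a vector created by `Vec::with_capacity` still has length zero, and `copy_from_slice` panics when the source and destination lengths differ.

```rust
/// Hypothetical helper: copies `start` into a buffer that already has room
/// for `start`, `rest`, and one extra byte for a first escape sequence.
fn preallocated(start: &[u8], rest: &[u8]) -> Vec<u8> {
    // Reserve capacity up front so the common case needs at most one reallocation.
    let mut escaped = Vec::with_capacity(start.len() + rest.len() + 1);
    // The vector is empty (len 0) at this point, so append rather than overwrite.
    escaped.extend_from_slice(start);
    escaped
}
```

The `+ 1` only covers a single two-byte escape; inputs with many escapes will still grow the buffer.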

}
escaped.push_str(&value[last..end]);
Cow::Owned(escaped)
// Our input was originally valid UTF-8, and we didn't do anything to invalidate it
Collaborator

the safety argument is actually a little more subtle here:

  • We only escaped bytes that were already single-byte code points
  • We only replaced them with valid UTF-8
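The two points above rest on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so a byte below 0x80 is always a complete code point on its own. A small sketch with a hypothetical helper, `replace_control_bytes`, shows why byte-wise replacement of ASCII bytes cannot split a character:

```rust
/// Hypothetical demo: replaces every ASCII control byte with `?`,
/// working byte by byte without decoding characters.
fn replace_control_bytes(input: &str) -> Vec<u8> {
    input
        .bytes()
        // Bytes below 0x20 are single-byte code points, never part of a
        // multi-byte sequence, so swapping them for ASCII keeps the rest intact.
        .map(|b| if b < 0x20 { b'?' } else { b })
        .collect()
}
```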

let mut escaped = start.to_vec();
for byte in rest {
match byte {
b'"' => escaped.extend("\\\"".bytes()),
Collaborator

did you use bytes() here instead of a b"..." byte string literal to make it clear that these were valid strings before being turned into bytes?

Collaborator Author

No, I just didn't think to use a b"..." literal.
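For comparison, the two spellings produce the same bytes. The function names below are hypothetical and exist only to put the two forms side by side.

```rust
/// Appends the JSON escape for a double quote using `str::bytes()`.
fn push_quote_with_bytes(out: &mut Vec<u8>) {
    out.extend("\\\"".bytes());
}

/// Appends the same escape using a byte string literal.
fn push_quote_with_literal(out: &mut Vec<u8>) {
    out.extend_from_slice(b"\\\"");
}
```

The byte string literal avoids the iterator and lets `extend_from_slice` copy the bytes in one call.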

// - The original input was valid UTF-8 since it came in as a `&str`
// - Only single-byte code points were escaped
// - The escape sequences are valid UTF-8
debug_assert!(String::from_utf8(escaped.clone()).is_ok());
Collaborator

nit:

Suggested change
- debug_assert!(String::from_utf8(escaped.clone()).is_ok());
+ debug_assert!(std::str::from_utf8(&escaped).is_ok());
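A sketch of how the borrowed check might sit next to the final conversion. The function name `into_string_sketch` and the use of `String::from_utf8_unchecked` are assumptions for illustration, not code shown in this thread.

```rust
/// Hypothetical sketch: converts an escaped byte buffer into a `String`.
/// Callers must pass bytes that are already valid UTF-8.
fn into_string_sketch(escaped: Vec<u8>) -> String {
    // `std::str::from_utf8` only borrows the buffer, so no clone is needed.
    // The check runs in debug builds and is compiled out in release builds.
    debug_assert!(std::str::from_utf8(&escaped).is_ok());
    // SAFETY (assumed invariant): the input came from a `&str`, only
    // single-byte code points were replaced, and the replacements are ASCII.
    unsafe { String::from_utf8_unchecked(escaped) }
}
```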

}

use proptest::proptest;
proptest! {
#[test]
- fn matches_serde_json(s: String) {
+ fn matches_serde_json(s in ".*") {
assert_eq!(
serde_json::to_string(&s).unwrap(),
format!(r#""{}""#, escape_string(&s))
Collaborator

Suggested change
- format!(r#""{}""#, escape_string(&s))
+ escape_string(&s)

Cow<str> and String can be compared directly
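The standard library implements `PartialEq` between `Cow<str>` and `String` in both directions, so `assert_eq!` accepts the two types without a conversion. A minimal sketch, using a hypothetical function `underscore_spaces` that returns a `Cow`:

```rust
use std::borrow::Cow;

/// Hypothetical example: borrows the input when it has no spaces,
/// otherwise returns an owned copy with spaces replaced by underscores.
fn underscore_spaces(s: &str) -> Cow<'_, str> {
    if s.contains(' ') {
        Cow::Owned(s.replace(' ', "_"))
    } else {
        Cow::Borrowed(s)
    }
}
```

The comparison covers content only. `serde_json::to_string` on a string includes the surrounding double quotes, so any quoting difference between the two sides still has to be accounted for separately.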

@@ -70,10 +70,9 @@ mod test {
proptest! {
#[test]
fn matches_serde_json(s in ".*") {
assert_eq!(
serde_json::to_string(&s).unwrap(),
format!(r#""{}""#, escape_string(&s))
Collaborator

you probably want to enable rustfmt on save

Collaborator Author

Looks like the proptest macro hid the assert_eq from the formatter.

jdisanti merged commit 1b5453f into smithy-lang:main May 27, 2021
jdisanti deleted the json-escape-fix branch June 1, 2021 16:39