Escape control characters in JSON strings #427
Can you bring this test back?
Just mark it #[ignore]
});
last = index + 1;
fn escape_string_inner(start: &[u8], rest: &[u8]) -> String {
let mut escaped = start.to_vec();
I would probably try to pre-allocate enough capacity here:
let mut escaped = Vec::with_capacity(start.len() + rest.len() + 1);
escaped.extend_from_slice(start);
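A minimal sketch of the pre-allocation idea, assuming a hypothetical helper named `prefilled_buffer` (not part of the PR). It reserves room for the whole input plus one extra byte and then appends the already-scanned prefix:

```rust
/// Hypothetical helper: builds the output buffer from the already-scanned
/// prefix, reserving room for the remaining bytes up front.
fn prefilled_buffer(start: &[u8], rest: &[u8]) -> Vec<u8> {
    // `with_capacity` reserves space but leaves the length at 0.
    let mut escaped = Vec::with_capacity(start.len() + rest.len() + 1);
    // `extend_from_slice` appends to the vector. `copy_from_slice` would
    // panic on an empty vector, because it requires the destination and
    // source lengths to match.
    escaped.extend_from_slice(start);
    escaped
}
```

The capacity is only a lower bound on what the escaped output needs; the vector still grows if many bytes require escaping.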
}
escaped.push_str(&value[last..end]);
Cow::Owned(escaped)
// Our input was originally valid UTF-8, and we didn't do anything to invalidate it
The safety argument is actually a little more subtle here:
- We only escaped bytes that were already single-byte code points
- We only replaced them with valid UTF-8
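To make that argument concrete, here is a rough single-pass sketch under stated assumptions: the function names mirror the diff, but the bodies are illustrative and not the PR's actual code, and it uses the checked `String::from_utf8` rather than an unchecked conversion. Every escaped byte is below 0x80 (so it can never be part of a multi-byte sequence), and every replacement is ASCII.

```rust
use std::borrow::Cow;

/// Illustrative escaper: returns the input unchanged when nothing needs
/// escaping, otherwise hands off to `escape_string_inner`.
fn escape_string(value: &str) -> Cow<'_, str> {
    let bytes = value.as_bytes();
    for (index, byte) in bytes.iter().enumerate() {
        if needs_escape(*byte) {
            // Everything before `index` is known to be clean.
            return Cow::Owned(escape_string_inner(&bytes[..index], &bytes[index..]));
        }
    }
    Cow::Borrowed(value)
}

/// All bytes matched here are ASCII, i.e. single-byte code points.
fn needs_escape(byte: u8) -> bool {
    matches!(byte, b'"' | b'\\' | 0x00..=0x1F)
}

fn escape_string_inner(start: &[u8], rest: &[u8]) -> String {
    let mut escaped = Vec::with_capacity(start.len() + rest.len() + 1);
    escaped.extend_from_slice(start);
    for &byte in rest {
        match byte {
            b'"' => escaped.extend_from_slice(b"\\\""),
            b'\\' => escaped.extend_from_slice(b"\\\\"),
            0x08 => escaped.extend_from_slice(b"\\b"),
            0x0C => escaped.extend_from_slice(b"\\f"),
            b'\n' => escaped.extend_from_slice(b"\\n"),
            b'\r' => escaped.extend_from_slice(b"\\r"),
            b'\t' => escaped.extend_from_slice(b"\\t"),
            // Remaining control characters become \u00XX escapes.
            0x00..=0x1F => escaped.extend_from_slice(format!("\\u{:04x}", byte).as_bytes()),
            // Bytes >= 0x80 (parts of multi-byte code points) pass through untouched.
            _ => escaped.push(byte),
        }
    }
    // Checked conversion: valid input plus ASCII-only replacements stays valid.
    String::from_utf8(escaped).expect("escaping preserves UTF-8 validity")
}
```

For example, `escape_string("plain")` borrows its input, while `escape_string("a\nb")` allocates and yields `a\nb` with a literal backslash followed by `n`.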
let mut escaped = start.to_vec();
for byte in rest {
match byte {
b'"' => escaped.extend("\\\"".bytes()),
Did you use `bytes()` here instead of `b` to make it clear that these were valid strings before being turned into bytes?
No, I just didn't think to use `b`.
// - The original input was valid UTF-8 since it came in as a `&str`
// - Only single-byte code points were escaped
// - The escape sequences are valid UTF-8
debug_assert!(String::from_utf8(escaped.clone()).is_ok());
nit: validate the bytes in place instead of cloning. Suggested change (current line, then replacement):
debug_assert!(String::from_utf8(escaped.clone()).is_ok());
debug_assert!(std::str::from_utf8(&escaped).is_ok());
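A small sketch of why the borrow-based check avoids the clone, assuming a hypothetical helper named `is_valid_utf8`. `std::str::from_utf8` takes a `&[u8]` and only inspects the bytes, whereas `String::from_utf8` takes ownership of a `Vec<u8>`:

```rust
/// Hypothetical helper: checks validity without taking ownership.
fn is_valid_utf8(escaped: &[u8]) -> bool {
    // Borrows the bytes, so the caller's buffer is neither moved nor cloned.
    std::str::from_utf8(escaped).is_ok()
}
```

A caller holding a `Vec<u8>` would pass `&escaped` and can keep using the vector afterwards.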
}

use proptest::proptest;
proptest! {
#[test]
fn matches_serde_json(s: String) {
fn matches_serde_json(s in ".*") {
assert_eq!(
serde_json::to_string(&s).unwrap(),
format!(r#""{}""#, escape_string(&s))
Suggested change (current line, then replacement):
format!(r#""{}""#, escape_string(&s))
escape_string(&s)
`Cow` and `String` are comparable.
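A quick sketch of the comparability point, using a hypothetical function `underscored` that stands in for anything returning `Cow<'_, str>`. The standard library implements `PartialEq` between `String` and `Cow<str>` in both directions, so `assert_eq!` accepts the pair directly:

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a function that returns `Cow<'_, str>`.
fn underscored(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}

/// Compares a `String` against a `Cow<str>` without any conversion.
fn matches_expected(expected: String, input: &str) -> bool {
    expected == underscored(input)
}
```

This only shows that the types compare. Whether the surrounding quotes added by the `format!` call are still needed depends on what `escape_string` returns.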
@@ -70,10 +70,9 @@ mod test {
proptest! {
#[test]
fn matches_serde_json(s in ".*") {
assert_eq!(
serde_json::to_string(&s).unwrap(),
format!(r#""{}""#, escape_string(&s))
you probably want to enable rustfmt on save
Looks like the proptest macro hid the assert_eq from the formatter.
This escapes the control characters in the range 0x00-0x1F (inclusive) and reworks the escaping so that it should be more performant. It now iterates over the input string exactly once, rather than once to see if escaping is necessary and then again to escape, and there's no longer any conversion to `char`.

To verify it correctly escapes everything, I ran the following additional test (not included in the PR since it takes approximately 100 seconds in a debug test run):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.