-
-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(path): allow utf8 chars in path #178
base: master
Are you sure you want to change the base?
feat(path): allow utf8 chars in path #178
Conversation
507fc2a
to
6cb1b83
Compare
After playing with this, i think it should be a feature instead of a config, and it should be enabled by default. Even if it's not really compliant with the RFC there seems to be some browsers that include utf8 in path of a HTTP request and without this it failed, having a feature allow someone to create a strict server but IMO default use case should handle most of web browser. I also added a new Error if the given utf8 in path is not valid |
47380b3
to
6ff4cd6
Compare
src/lib.rs
Outdated
@@ -963,6 +989,14 @@ pub fn parse_uri<'a>(bytes: &mut Bytes<'a>) -> Result<&'a str> { | |||
return Err(Error::Token); | |||
} | |||
|
|||
#[cfg(feature = "utf8_in_path")] | |||
// SAFETY: all bytes up till `i` must have been `is_token` and therefore also utf-8. | |||
return match str::from_utf8(unsafe { bytes.slice_skip(1) }) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This induce an overhead in uri benches, so not sure if we want this ? but in this case we are not sure that the string is valid utf8 unless using it (see the relative test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
group no_utf8 utf8
----- ------- ----
uri/uri_0001b 1.00 1.1±0.01ns 1786.2 MB/sec 4.58 4.9±0.09ns 390.1 MB/sec
uri/uri_0002b 1.00 1.4±0.01ns 2.1 GB/sec 4.45 6.0±0.34ns 474.7 MB/sec
uri/uri_0004b 1.00 1.9±0.02ns 2.5 GB/sec 4.36 8.1±0.08ns 590.5 MB/sec
uri/uri_0008b 1.00 1.2±0.08ns 6.8 GB/sec 4.98 6.2±0.10ns 1390.9 MB/sec
uri/uri_0016b 1.00 1.8±0.01ns 9.0 GB/sec 2.80 4.9±0.08ns 3.2 GB/sec
uri/uri_0032b 1.00 2.8±0.20ns 11.1 GB/sec 2.05 5.7±0.12ns 5.4 GB/sec
uri/uri_0064b 1.00 5.5±0.75ns 11.1 GB/sec 1.30 7.1±0.04ns 8.5 GB/sec
uri/uri_0256b 1.11 19.1±0.09ns 12.5 GB/sec 1.00 17.2±2.03ns 14.0 GB/sec
uri/uri_0512b 1.36 40.4±0.26ns 11.8 GB/sec 1.00 29.8±3.51ns 16.0 GB/sec
uri/uri_1024b 1.35 80.6±0.53ns 11.8 GB/sec 1.00 59.9±0.34ns 15.9 GB/sec
uri/uri_2048b 1.69 163.7±1.62ns 11.7 GB/sec 1.00 97.1±0.45ns 19.7 GB/sec
uri/uri_4096b 1.95 310.9±22.99ns 12.3 GB/sec 1.00 159.4±0.92ns 23.9 GB/sec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also try to use simdutf8, but not sure we wants a dep for that, result are better on large string, but no really significant on small string :
group no_utf8 simd_utf8 utf8
----- ------- --------- ----
uri/uri_0001b 1.00 1.1±0.01ns 1786.2 MB/sec 5.42 5.8±0.07ns 329.6 MB/sec 4.58 4.9±0.09ns 390.1 MB/sec
uri/uri_0002b 1.00 1.4±0.01ns 2.1 GB/sec 4.80 6.5±0.02ns 440.1 MB/sec 4.45 6.0±0.34ns 474.7 MB/sec
uri/uri_0004b 1.00 1.9±0.02ns 2.5 GB/sec 4.34 8.0±0.44ns 593.1 MB/sec 4.36 8.1±0.08ns 590.5 MB/sec
uri/uri_0008b 1.00 1.2±0.08ns 6.8 GB/sec 5.58 6.9±0.06ns 1240.3 MB/sec 4.98 6.2±0.10ns 1390.9 MB/sec
uri/uri_0016b 1.00 1.8±0.01ns 9.0 GB/sec 3.47 6.1±0.07ns 2.6 GB/sec 2.80 4.9±0.08ns 3.2 GB/sec
uri/uri_0032b 1.00 2.8±0.20ns 11.1 GB/sec 2.48 6.9±0.12ns 4.5 GB/sec 2.05 5.7±0.12ns 5.4 GB/sec
uri/uri_0064b 1.00 5.5±0.75ns 11.1 GB/sec 1.23 6.7±0.13ns 9.0 GB/sec 1.30 7.1±0.04ns 8.5 GB/sec
uri/uri_0128b 1.21 10.3±0.06ns 11.7 GB/sec 1.00 8.5±1.10ns 14.1 GB/sec 1.19 10.2±0.90ns 11.8 GB/sec
uri/uri_0256b 1.72 19.1±0.09ns 12.5 GB/sec 1.00 11.1±0.30ns 21.5 GB/sec 1.54 17.2±2.03ns 14.0 GB/sec
uri/uri_0512b 2.02 40.4±0.26ns 11.8 GB/sec 1.00 20.0±0.36ns 23.9 GB/sec 1.49 29.8±3.51ns 16.0 GB/sec
uri/uri_1024b 2.48 80.6±0.53ns 11.8 GB/sec 1.00 32.5±1.46ns 29.4 GB/sec 1.84 59.9±0.34ns 15.9 GB/sec
uri/uri_2048b 2.70 163.7±1.62ns 11.7 GB/sec 1.00 60.6±1.06ns 31.5 GB/sec 1.60 97.1±0.45ns 19.7 GB/sec
uri/uri_4096b 2.77 310.9±22.99ns 12.3 GB/sec 1.00 112.4±1.09ns 33.9 GB/sec 1.42 159.4±0.92ns 23.9 GB/sec
@@ -2053,7 +2087,7 @@ mod tests { | |||
assert_eq!(parse_chunk_size(b"567f8a\rfoo"), Err(crate::InvalidChunkSize)); | |||
assert_eq!(parse_chunk_size(b"567f8a\rfoo"), Err(crate::InvalidChunkSize)); | |||
assert_eq!(parse_chunk_size(b"567xf8a\r\n"), Err(crate::InvalidChunkSize)); | |||
assert_eq!(parse_chunk_size(b"ffffffffffffffff\r\n"), Ok(Status::Complete((18, std::u64::MAX)))); | |||
assert_eq!(parse_chunk_size(b"ffffffffffffffff\r\n"), Ok(Status::Complete((18, u64::MAX)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated, can remove it, but without this test without default features would fail (since we don't have std)
37533c4
to
df0651e
Compare
I asked in the referenced issue to check with a few implementers if this is desirable :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think we can do this, but some changes commented inline.
Also, to test all SIMD variants, we can probably just have a unit test that loops all possible bytes of URI_MAP, making a long enough vec with that byte, and then checking the match value is the same as without SIMD. I believe we have a test like that for one of the other byte maps.
a00e411
to
3f068fd
Compare
I have apply your comments and squash, will add a test later that will try to iterate over all utf8 char so we are sure to not miss something EDIT : I don't think we can iterate over all uri_map since some bytes can be only one byte of a unicode char, so it will fail for some of them during the utf8 check, what i want to do is
|
I added a test for this, is this what you want or did i misunderstood ? |
Hey, do you need something more for this to be reviewed or merged ? |
+1 looking forward to progress on this! |
Ref #146
I'm not used to work with simd parser so i have done what i think is right but not sure if it's right in all use case.
Mainly the uri non compliant rfc3986 parser is the same as the current one but allow every char between 128 to 255 which sould make it utf8 compliant, if i recall correctly code point always are above or equal to 128
I did not make a feature but rather an option in parser, however i wonder if this should be true by default ? or if it should be a feature (enabled by default ?) ?