Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add normalize_line_end for unescape and test #807

Open
wants to merge 18 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions Changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,10 @@

### New Features

- [#806]: Perform normalization of line end during unescape process.

[#806]: https://github.com/tafia/quick-xml/issues/806

### Bug Fixes

### Misc Changes
Expand Down
20 changes: 10 additions & 10 deletions benches/microbenches.rs
Original file line number Diff line number Diff line change
Expand Up @@ -327,18 +327,18 @@ fn unescaping(c: &mut Criterion) {

group.bench_function("mixed", |b| {
let text =
"Lorem ipsum dolor sit amet, &consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. Hac habitasse platea dictumst vestibulum rhoncus est pellentesque.
Risus ultricies "tristique nulla aliquet enim tortor" at. Fermentum odio eu feugiat pretium
nibh ipsum. Volutpat sed cras ornare arcu dui. Scelerisque fermentum dui faucibus in ornare quam. Arcu
cursus euismod quis <viverra nibh cras pulvinar mattis. Sed viverra tellus in hac habitasse platea.
"Lorem ipsum dolor sit amet, &consectetur adipiscing elit, sed do eiusmod tempor incididunt\r
ut labore et dolore magna aliqua. Hac habitasse platea dictumst vestibulum rhoncus est pellentesque.\r
Risus ultricies "tristique nulla aliquet enim tortor" at. Fermentum odio eu feugiat pretium\r
nibh ipsum. Volutpat sed cras ornare arcu dui. Scelerisque fermentum dui faucibus in ornare quam. Arcu\r
cursus euismod quis <viverra nibh cras pulvinar mattis. Sed viverra tellus in hac habitasse platea.\r
Quis commodo odio aenean sed. Cursus in hac habitasse platea dictumst quisque sagittis purus.

Neque convallis a cras semper auctor. Sit amet mauris commodo quis imperdiet massa. Ac ut consequat
semper viverra nam libero justo # laoreet sit. Adipiscing commodo elit at imperdiet dui accumsan.
Enim lobortis scelerisque fermentum dui faucibus in ornare. Natoque penatibus et magnis dis parturient
montes nascetur ridiculus mus. At lectus urna !duis convallis convallis tellus id interdum. Libero
volutpat sed cras ornare arcu dui vivamus arcu. Cursus in hac habitasse platea dictumst quisque sagittis
Neque convallis a cras semper auctor. Sit amet mauris commodo quis imperdiet massa. Ac ut consequat\r
semper viverra nam libero justo # laoreet sit. Adipiscing commodo elit at imperdiet dui accumsan.\r
Enim lobortis scelerisque fermentum dui faucibus in ornare. Natoque penatibus et magnis dis parturient\r
montes nascetur ridiculus mus. At lectus urna !duis convallis convallis tellus id interdum. Libero\r
volutpat sed cras ornare arcu dui vivamus arcu. Cursus in hac habitasse platea dictumst quisque sagittis\r
purus. Consequat id porta nibh venenatis cras sed felis.";

b.iter(|| {
Expand Down
77 changes: 73 additions & 4 deletions src/escape.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
//! Manage xml character escapes

use memchr::memchr2_iter;
use memchr::{memchr2_iter, memchr3_iter};
use std::borrow::Cow;
use std::num::ParseIntError;
use std::ops::Range;
Expand Down Expand Up @@ -266,7 +266,9 @@ where
unescaped = Some(String::with_capacity(raw.len()));
}
let unescaped = unescaped.as_mut().expect("initialized");
unescaped.push_str(&raw[last_end..start]);
for normalized in normalize_line_end_iter(&raw[last_end..start]) {
unescaped.push_str(normalized);
}

// search for character correctness
let pat = &raw[start + 1..end];
Expand All @@ -290,14 +292,81 @@ where

if let Some(mut unescaped) = unescaped {
if let Some(raw) = raw.get(last_end..) {
unescaped.push_str(raw);
for normalized in normalize_line_end_iter(raw) {
unescaped.push_str(normalized);
}
}
Ok(Cow::Owned(unescaped))
} else {
Ok(Cow::Borrowed(raw))
let mut norm_iter = normalize_line_end_iter(raw);
let first = norm_iter.next();
match first {
Some(normalized) if normalized.len() != raw.len() => {
let mut s = String::with_capacity(raw.len());
s.push_str(normalized);
for normalized in normalize_line_end_iter(raw) {
s.push_str(normalized);
}
Ok(s.into())
}
_ => Ok(raw.into()),
}
}
}

/// Normalize the line end in input, replace \r or \r\n with \n, return iterator that can given normalized str.
/// link to [line end spec]https://www.w3.org/TR/xml11/#sec-line-ends somewhere near to the code that handles normalization also would be useful. Well, actually, it describes the 5 combinations of LOF characters, that should be normalized to \n:

/// \r\n
/// \r\u0085 (UTF-8: \r\xC2\x85)
/// \r
/// \u0085 (UTF-8: \xC2\x85)
/// \u2028 (UTF-8: \xE2\x80\xA8)
///
/// The reason to use iterator is to avoid allocation during normalizing line end.
/// If nothing to normalize of the input, it will give back whole input in the first iteration. Caller can know
/// there is nothing to normalize, so that it needs not to do normalization thus avoid allocation.
#[inline]
fn normalize_line_end_iter(input: &str) -> impl Iterator<Item = &str> {
let bytes = input.as_bytes();
let len = input.len();
let mut cursor = 0;
let mut iter = memchr3_iter(b'\r', b'\xC2', b'\xE2', bytes);
let mut temp = None;
std::iter::from_fn(move || {
if let Some(v) = temp.take() {
return Some(v);
}
loop {
if let Some(p) = iter.next() {
if p < cursor {
// already normalized in previous iteration, this position is invalid
continue;
}
let skips = match &bytes[p..] {
[b'\r', b'\n', ..] => 2,
[b'\r', b'\xC2', b'\x85', ..] => 3,
[b'\r', ..] => 1,
[b'\xC2', b'\x85', ..] => 2,
[b'\xE2', b'\x80', b'\xA8', ..] => 3,
_ => continue,
};
// normalized
temp = Some("\n");
let start = cursor;
cursor = p + skips;
break Some(&input[start..p]);
} else if cursor < len {
let start = cursor;
cursor = len;
break Some(&input[start..]);
} else {
break None;
}
}
})
}

/// Resolves predefined XML entities or all HTML5 entities depending on the feature
/// [`escape-html`](https://docs.rs/quick-xml/latest/quick_xml/#escape-html).
///
Expand Down
39 changes: 39 additions & 0 deletions tests/escape.rs
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,45 @@ fn unescape() {
);
}

#[test]
fn unescape_line_end() {
let unchanged = escape::unescape("test\n");
// assert_eq does not check that Cow is borrowed, but we explicitly use Cow
// because it influences diff
// TODO: use assert_matches! when stabilized and other features will bump MSRV
assert_eq!(unchanged, Ok(Cow::Borrowed("test\n")));
assert!(matches!(unchanged, Ok(Cow::Borrowed(_))));

assert_eq!(
escape::unescape("&lt;&amp;test&apos;\r&quot;\r\n&gt;\r\n\r"),
Ok("<&test'\n\"\n>\n\n".into())
);
assert_eq!(escape::unescape("&#x30;\r\r\n"), Ok("0\n\n".into()));
assert_eq!(escape::unescape("\r&#48;\n\r\r"), Ok("\n0\n\n\n".into()));
assert_eq!(
escape::unescape("\r\n&foo;\n"),
Err(EscapeError::UnrecognizedEntity(3..6, "foo".into()))
);

assert_eq!(
escape::unescape("&lt;&amp;test&apos;\u{0085}\r\r\u{0085}\u{2028}&quot;\r\n&gt;\r\n\r"),
Ok("<&test'\n\n\n\n\"\n>\n\n".into())
);
assert_eq!(
escape::unescape("&#x30;\r\r\n\u{0085}"),
Ok("0\n\n\n".into())
);
assert_eq!(
escape::unescape("\r&#48;\n\r\r\u{2028}"),
Ok("\n0\n\n\n\n".into())
);
assert_eq!(escape::unescape("\r\r\u{0085}\n\n"), Ok("\n\n\n\n".into()));
assert_eq!(
escape::unescape("\r\n&foo;\n\u{2028}"),
Err(EscapeError::UnrecognizedEntity(3..6, "foo".into()))
);
}

/// XML allows any number of leading zeroes. That is not explicitly mentioned
/// in the specification, but enforced by the conformance test suite
/// (https://www.w3.org/XML/Test/)
Expand Down
33 changes: 32 additions & 1 deletion tests/reader-attributes.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
use std::borrow::Cow;

use quick_xml::events::attributes::Attribute;
use quick_xml::events::{BytesEnd, Event::*};
use quick_xml::events::{
BytesEnd,
Event::*,
};
use quick_xml::name::QName;
use quick_xml::reader::Reader;

Expand Down Expand Up @@ -159,3 +162,31 @@ fn equal_sign_in_value() {
e => panic!("Expecting Empty event, got {:?}", e),
}
}

#[test]
fn line_ends() {
const XML: &str = "<root attribute=\"\r\r\u{0085}\n\n\u{0085}\u{2028}value1\r\r\u{0085}\n\n\u{0085}\u{2028}value2\r\r\u{0085}\n\n\u{0085}\u{2028}\">\r\r\u{0085}\n\n\u{0085}\u{2028}value3\r\r\u{0085}\n\n\u{0085}\u{2028}value4\r\r\u{0085}\n\n\u{0085}\u{2028}</root>";
let mut reader = Reader::from_str(XML);
match reader.read_event().unwrap() {
Start(event) => {
let mut iter = event.attributes();
let a = iter.next().unwrap().unwrap();
#[cfg(not(feature = "encoding"))]
assert_eq!(
a.unescape_value().unwrap(),
"\n\n\n\n\n\nvalue1\n\n\n\n\n\nvalue2\n\n\n\n\n\n"
);
assert_eq!(
a.decode_and_unescape_value(reader.decoder()).unwrap(),
"\n\n\n\n\n\nvalue1\n\n\n\n\n\nvalue2\n\n\n\n\n\n"
);
}
event => panic!("Expected Start, found {:?}", event),
}
match reader.read_event().unwrap() {
Text(event) => {
assert_eq!(event.unescape().unwrap(), "\n\n\n\n\n\nvalue3\n\n\n\n\n\nvalue4\n\n\n\n\n\n")
}
event => panic!("Expected Text, found {:?}", event),
}
}
5 changes: 3 additions & 2 deletions tests/serde-se.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1955,9 +1955,10 @@ mod with_root {
<root>3</root>");
serialize_as!(tuple:
// Use to_string() to get owned type that is required for deserialization
("<\"&'>".to_string(), "with\t\r\n spaces", 3usize)
// Note: \r cannot include in whitespace character because we performs line end normalization.
("<\"&'>".to_string(), "with\t\n spaces", 3usize)
=> "<root>&lt;\"&amp;'&gt;</root>\
<root>with\t\r\n spaces</root>\
<root>with\t\n spaces</root>\
<root>3</root>");
serialize_as!(tuple_struct:
Tuple(42.0, "answer")
Expand Down