-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add normalize_line_end for unescape and test #807
Open
yorkz1994
wants to merge
18
commits into
tafia:master
Choose a base branch
from
yorkz1994:master
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 10 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
4f5ce0d
Add normalize_line_end for unescape and test
45a63c5
Change normalize_line_end function parameter type to allow more type …
1c2ddc5
Update src/escape.rs
yorkz1994 139b2cb
change parameter type from generic to Cow
6962ae3
Add issue 806 to change log.
11320ad
Fix regression test failure due to line end normalization
0bb282e
Update Changelog.md
yorkz1994 f970370
Still need to normalize if nothing unescaped
d11c756
Add line ends test for attribute and text
d3ee1ad
roundtrip test cannot include \r
cbf200f
Update tests/reader-attributes.rs
yorkz1994 e4f83a5
Update tests/reader-attributes.rs
yorkz1994 cf99a45
Update src/escape.rs
yorkz1994 628dce7
Add comment why \r is not include in roundtrip test.
44df9bf
Remove unused import
4e1eefc
Missing b in byte literal
ecfeefc
Add some \r in unescape_text/mixed
d880edc
Normalizing more line ends, no allocation
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,10 @@ | ||
use std::borrow::Cow; | ||
|
||
use quick_xml::events::attributes::Attribute; | ||
use quick_xml::events::{BytesEnd, Event::*}; | ||
use quick_xml::events::{ | ||
BytesEnd, | ||
Event::{self, *}, | ||
}; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can be reverted, because of changes below. |
||
use quick_xml::name::QName; | ||
use quick_xml::reader::Reader; | ||
|
||
|
@@ -159,3 +162,31 @@ fn equal_sign_in_value() { | |
e => panic!("Expecting Empty event, got {:?}", e), | ||
} | ||
} | ||
|
||
#[test] | ||
fn line_ends() { | ||
const XML: &str = "<root attribute=\"\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n\">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>"; | ||
let mut reader = Reader::from_str(XML); | ||
match reader.read_event().unwrap() { | ||
Event::Start(event) => { | ||
yorkz1994 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
let mut iter = event.attributes(); | ||
let a = iter.next().unwrap().unwrap(); | ||
#[cfg(not(feature = "encoding"))] | ||
assert_eq!( | ||
a.unescape_value().unwrap(), | ||
"\n\n\nvalue1\n\n\nvalue2\n\n\n" | ||
); | ||
assert_eq!( | ||
a.decode_and_unescape_value(reader.decoder()).unwrap(), | ||
"\n\n\nvalue1\n\n\nvalue2\n\n\n" | ||
); | ||
} | ||
event => panic!("Expected Start, found {:?}", event), | ||
} | ||
match reader.read_event().unwrap() { | ||
Event::Text(event) => { | ||
yorkz1994 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
assert_eq!(event.unescape().unwrap(), "\n\n\nvalue3\n\n\nvalue4\n\n\n") | ||
} | ||
event => panic!("Expected Text, found {:?}", event), | ||
} | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Escaping may be on the hot path, so it is not fine that we can allocate new owned strings just to get a reference and put it to already allocated string. Would it possible instead use already managed
Cow
and pass it to thenormalize_line_end
? Then upgradingCow
from borrowed to owned could be performed either inunescape
, or innormalize_line_end
and that upgrade always will be performed only once.Also, do we really need two loops?
memchr
can search for three bytes at once and we search for three bytes:&
,;
and\r
. So it could be possible to perform unescaping and line end normalization in one loop.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to do it. What if there is nothing to unescape? You still need to call
normalize_line_end
for such input in the end, then what is the correct function parameter fornormalize_line_end
so that it can handle both situations?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, don't understand what the trouble?
unescape
will always be called when we request data via API methods. If cycling search of&
,;
, or\r
will not find anything, than it is really nothing to change in the data (I assume that cycles for unescaping and normalization merged in one).Even if did not merge loops,
normalize_line_end
can accept&mut Cow<str>
and change it instead of creating an ownCow
by taking reference to&str
and returning its ownCow<str>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am talking about the not merge loops, as you point out that there are more line end situations to consider, it is beyond my knowledge to do SIMD stuff myself to merge the loops.
Current function signature is:
fn normalize_line_end(input: &str) -> Cow<str>
Change to what? How to call it?
Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, read my
&mut Cow<str>
as&mut Option<String>
. So the whole code will look like:Actually, SIMD and merging loops are two separate conception, it is not necessary to explicitly apply SIMD techniques if you not aware of them, it's totally fine. I only tried to be helpful if you aware of it.
The requirement in only is not to make things worse. First of all we should not make it impossible to correctly process input (that is why we should at least process
\r\u0085
because if we turn it into\n\u0085
, the user will see two newlines in data instead of one, because specification allows\n
,\r\u0085
and\u0085
be served as (one) newline characters). The second, we should to try that without much impact to performance, so if we can avoid unnecessary allocations we should do that.If you're having a hard time figuring out how to get there, that's fine, it could be so. Thanks for opening PR anyway. I will consider it later when I return to active phase on the project (currently working on other project) and see how to improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for showing the code.
Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?
Then you still have to call normlize_line_end to normalize the original input. How to you put original input raw str as parameter to your current unescape function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unescape function decides where it is need to unescape something or not. It itself called always (well, because user requests unescaped and normalized data by demand, it will be called by demand, but that "demand" means "unesсape [and normalize] if there is anything")
Cannot understand what confuses you here. Calls to
unescape
will not be changed and it is already called whenever necessary.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing to escape means there is no
&
and;
in raw input. It will not go insidewhile
below. So normalize will not happen to input here.Will not go inside here either because nothing has been unescaped
Must run normalize here, so how to call your changed function
fn normalize_line_end(input: &mut Option<String>, raw_len: usize)
here with raw as parameter?