Allow attributes in the `Event::End` and fix `.error_position()` #780

Mingun · 2024-07-04T16:16:44Z

This PR allows us to have attributes in the Event::End (but they are not accessible via BytesEnd directly). As explained in #776 (comment) we anyway can produce an event which .name() does not valid, so it is better ar least keep it useful for some scenarios. When validation API would be added (I'm working on it) we can validate all events and report an error.

Closes #776

When changing parser I noticed, that in one arm we do not set last_error_offset. I checked if we really cover that lines -- yes, but we actually does not check anything. last_error_offset was checked for 0, but this is its default value, so was unclear is that code running or not. I improved check and fix the bug that was found due to that.

codecov-commenter · 2024-07-04T16:28:22Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 92.47312% with 7 lines in your changes missing coverage. Please review.

Project coverage is 60.01%. Comparing base (7558577) to head (6a48a28).
Report is 79 commits behind head on master.

Files	Patch %	Lines
src/reader/mod.rs	95.50%	4 Missing ⚠️
examples/custom_entities.rs	0.00%	1 Missing ⚠️
examples/read_buffered.rs	0.00%	1 Missing ⚠️
examples/read_texts.rs	0.00%	1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #780      +/-   ##
==========================================
- Coverage   61.81%   60.01%   -1.81%     
==========================================
  Files          41       41              
  Lines       16798    16048     -750     
==========================================
- Hits        10384     9631     -753     
- Misses       6414     6417       +3

Flag	Coverage Δ
unittests	`60.01% <92.47%> (-1.81%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

dralley · 2024-07-05T00:36:48Z

src/reader/mod.rs

+            //   do. This is still invalid tag but it also has no practical meaning
+            //   to parse it in a such way. End tag `</tag attr=">" >`, however, can
+            //   be produced by some real-world parsers, for example, by the parser
+            //   of the Macromedia Flash player.


We shouldn't allow this by default, we should put it behind an opt-in. Even non-validating parsers check for well-formedness, and this is malformed XML.

expat for example blows up

ExpatError: not well-formed (invalid token): line 1, column 11

as does Lxml and libxml2

XMLSyntaxError: expected '>', line 1, column 12

parserError: xmlParseDoc() failed

I'm OK with making this possible to support, but our default behavior should probably be throwing an IllFormedError just like we do for mismatched tags (and like other parsers do).

ExpatError: mismatched tag: line 1, column 7

Also line 358 should be rewritten - XML isn't "produced" by a parser. You can just say that such documents exist in the wild and are tolerated by some parsers such as the one used by Adobe Flash.

I want to evolve library in this direction:

rename Reader to RawReader (and Event to RawEvent)

make RawReader as liberal as possible. It should not make any validation checks

add a validate method to RawEvent which will perform all well-formedless and validity checks (I already grab list of all checks from the specification, 35 VC + 16 WFC references, actually most of them about DTD). All validation must be run manually by calling RawEvent::validate. Only one check cannot be performed in this way: check that start tag name matches end tag name

move existing checks (for example, IllFormedError::DoubleHyphenInComment) to the new validate method

introduce a new Reader which will perform validation checks when flag in config is set (and process many other things, like dealing with encodings and expanding entity references as shown in the custom_entities example in Rework handling general entity references (&entity;) #766). Add necessary config flags

I haven't thought through all the details yet and don't want to focus on them yet. There will be a separate PR where all this will be considered. It should be pretty soon (this year).

So, given that we are not currently performing most validity checks and I am planning a generic consistent solution for them, do you agree that it is not worth introducing them now?

Also line 358 should be rewritten - XML isn't "produced" by a parser. You can just say that such documents exist in the wild and are tolerated by some parsers such as the one used by Adobe Flash.

Any help are appreciated 😃 . I feel that my English is no good to writing simple-to-read explanations.

That all sounds good to me, but we probably shouldn't document it as a "new feature" given that we do want this to eventually fail for users of the Reader API.

This is not generic code and it is more clear when we construct them with explicit state

If test will fail due to another syntax error we get a nice diff

…er`.

Check position for zero actually does not check anything because this is start position before parse. We move position by 1 by using Text event failures (36): syntax::tag::unclosed3::ns_reader::async_tokio syntax::tag::unclosed3::ns_reader::borrowed syntax::tag::unclosed3::ns_reader::buffered syntax::tag::unclosed3::reader::async_tokio syntax::tag::unclosed3::reader::borrowed syntax::tag::unclosed3::reader::buffered syntax::tag::unclosed4::ns_reader::async_tokio syntax::tag::unclosed4::ns_reader::borrowed syntax::tag::unclosed4::ns_reader::buffered syntax::tag::unclosed4::reader::async_tokio syntax::tag::unclosed4::reader::borrowed syntax::tag::unclosed4::reader::buffered syntax::tag::unclosed5::ns_reader::async_tokio syntax::tag::unclosed5::ns_reader::borrowed syntax::tag::unclosed5::ns_reader::buffered syntax::tag::unclosed5::reader::async_tokio syntax::tag::unclosed5::reader::borrowed syntax::tag::unclosed5::reader::buffered syntax::tag::unclosed6::ns_reader::async_tokio syntax::tag::unclosed6::ns_reader::borrowed syntax::tag::unclosed6::ns_reader::buffered syntax::tag::unclosed6::reader::async_tokio syntax::tag::unclosed6::reader::borrowed syntax::tag::unclosed6::reader::buffered syntax::tag::unclosed7::ns_reader::async_tokio syntax::tag::unclosed7::ns_reader::borrowed syntax::tag::unclosed7::ns_reader::buffered syntax::tag::unclosed7::reader::async_tokio syntax::tag::unclosed7::reader::borrowed syntax::tag::unclosed7::reader::buffered syntax::tag::unclosed8::ns_reader::async_tokio syntax::tag::unclosed8::ns_reader::borrowed syntax::tag::unclosed8::ns_reader::buffered syntax::tag::unclosed8::reader::async_tokio syntax::tag::unclosed8::reader::borrowed syntax::tag::unclosed8::reader::buffered

…open or self-closed tag

…here error position is reported `.error_position()` is intended to report position where error start occurring, `.buffer_position()` is intended to show to what position reader read data tests/issues.rs leave untouched because this is the code from real problems in GH issues, it should remain as close to issue as possible

src/reader/mod.rs

dralley · 2024-07-07T12:22:07Z

src/reader/mod.rs

+            //   optional validate step which user should call manually
+            // - if we just look for `>` we will parse `</tag attr=">" >` as end tag
+            //   `</tag attr=">` and text `" >` which probably no one existing parser
+            //   do. This is still invalid tag but it also has no practical meaning


do -> does

This is malformed XML, however it is tolerated by some parsers
(e.g. the one used by Adobe Flash) and such documents do exist in the wild.

…e Flash parser)

Mingun · 2024-07-08T04:46:59Z

Task CIFuzz was failed by unknown reasons -- I asked google/oss-fuzz#12168 but no response yet... It not related to this changes, so merge

Mingun added bug enhancement labels Jul 4, 2024

Mingun requested a review from dralley July 4, 2024 16:16

dralley reviewed Jul 5, 2024

View reviewed changes

Mingun mentioned this pull request Jul 5, 2024

Start CDATA section only after uppercase <![CDATA[ #781

Merged

Mingun and others added 9 commits July 6, 2024 20:23

Sort options alphabetically

c195f5d

Move tests for allow_unmatched_ends to its own sub-module

a4c71fb

Remove incorrect example

28c031d

Do not use Default implementation to construct parsers

be86267

This is not generic code and it is more clear when we construct them with explicit state

Assert syntax errors via pretty_assertions

0d99a16

If test will fail due to another syntax error we get a nice diff

Move Parser, ElementParser and PiParser to the new module `pars…

d387ed7

…er`.

Fix incorrect .error_position() when encountering syntax error for …

df65be0

…open or self-closed tag

Mingun force-pushed the end-attributes branch from 522389d to 65d50a5 Compare July 6, 2024 16:03

dralley reviewed Jul 7, 2024

View reviewed changes

src/reader/mod.rs Outdated Show resolved Hide resolved

dralley reviewed Jul 7, 2024

View reviewed changes

Mingun force-pushed the end-attributes branch from 65d50a5 to 04842de Compare July 7, 2024 14:23

Allow to have attributes in closing tags (compatibility with the Adob…

6a48a28

…e Flash parser)

Mingun force-pushed the end-attributes branch from 04842de to 6a48a28 Compare July 7, 2024 14:23

dralley approved these changes Jul 7, 2024

View reviewed changes

Mingun merged commit 959eb55 into tafia:master Jul 8, 2024
6 of 7 checks passed

Mingun deleted the end-attributes branch July 8, 2024 04:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow attributes in the `Event::End` and fix `.error_position()` #780

Allow attributes in the `Event::End` and fix `.error_position()` #780

Mingun commented Jul 4, 2024

codecov-commenter commented Jul 4, 2024 •

edited

Loading

dralley Jul 5, 2024 •

edited

Loading

Mingun Jul 5, 2024

Mingun Jul 5, 2024

dralley Jul 7, 2024

dralley Jul 7, 2024 •

edited

Loading

Mingun Jul 7, 2024

Mingun commented Jul 8, 2024

Allow attributes in the Event::End and fix .error_position() #780

Allow attributes in the Event::End and fix .error_position() #780

Conversation

Mingun commented Jul 4, 2024

codecov-commenter commented Jul 4, 2024 • edited Loading

Codecov Report

dralley Jul 5, 2024 • edited Loading

Choose a reason for hiding this comment

Mingun Jul 5, 2024

Choose a reason for hiding this comment

Mingun Jul 5, 2024

Choose a reason for hiding this comment

dralley Jul 7, 2024

Choose a reason for hiding this comment

dralley Jul 7, 2024 • edited Loading

Choose a reason for hiding this comment

Mingun Jul 7, 2024

Choose a reason for hiding this comment

Mingun commented Jul 8, 2024

Allow attributes in the `Event::End` and fix `.error_position()` #780

Allow attributes in the `Event::End` and fix `.error_position()` #780

codecov-commenter commented Jul 4, 2024 •

edited

Loading

dralley Jul 5, 2024 •

edited

Loading

dralley Jul 7, 2024 •

edited

Loading