Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow attributes in the Event::End and fix .error_position() #780

Merged
merged 10 commits into from
Jul 8, 2024

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Jul 4, 2024

This PR allows us to have attributes in the Event::End (but they are not accessible via BytesEnd directly). As explained in #776 (comment) we anyway can produce an event which .name() does not valid, so it is better ar least keep it useful for some scenarios. When validation API would be added (I'm working on it) we can validate all events and report an error.

Closes #776

When changing parser I noticed, that in one arm we do not set last_error_offset. I checked if we really cover that lines -- yes, but we actually does not check anything. last_error_offset was checked for 0, but this is its default value, so was unclear is that code running or not. I improved check and fix the bug that was found due to that.

@codecov-commenter
Copy link

codecov-commenter commented Jul 4, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 92.47312% with 7 lines in your changes missing coverage. Please review.

Project coverage is 60.01%. Comparing base (7558577) to head (6a48a28).
Report is 79 commits behind head on master.

Files Patch % Lines
src/reader/mod.rs 95.50% 4 Missing ⚠️
examples/custom_entities.rs 0.00% 1 Missing ⚠️
examples/read_buffered.rs 0.00% 1 Missing ⚠️
examples/read_texts.rs 0.00% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #780      +/-   ##
==========================================
- Coverage   61.81%   60.01%   -1.81%     
==========================================
  Files          41       41              
  Lines       16798    16048     -750     
==========================================
- Hits        10384     9631     -753     
- Misses       6414     6417       +3     
Flag Coverage Δ
unittests 60.01% <92.47%> (-1.81%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// do. This is still invalid tag but it also has no practical meaning
// to parse it in a such way. End tag `</tag attr=">" >`, however, can
// be produced by some real-world parsers, for example, by the parser
// of the Macromedia Flash player.
Copy link
Collaborator

@dralley dralley Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't allow this by default, we should put it behind an opt-in. Even non-validating parsers check for well-formedness, and this is malformed XML.

expat for example blows up

ExpatError: not well-formed (invalid token): line 1, column 11

as does Lxml and libxml2

XMLSyntaxError: expected '>', line 1, column 12

parserError: xmlParseDoc() failed

I'm OK with making this possible to support, but our default behavior should probably be throwing an IllFormedError just like we do for mismatched tags (and like other parsers do).

ExpatError: mismatched tag: line 1, column 7

Also line 358 should be rewritten - XML isn't "produced" by a parser. You can just say that such documents exist in the wild and are tolerated by some parsers such as the one used by Adobe Flash.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to evolve library in this direction:

  • rename Reader to RawReader (and Event to RawEvent)
  • make RawReader as liberal as possible. It should not make any validation checks
  • add a validate method to RawEvent which will perform all well-formedless and validity checks (I already grab list of all checks from the specification, 35 VC + 16 WFC references, actually most of them about DTD). All validation must be run manually by calling RawEvent::validate. Only one check cannot be performed in this way: check that start tag name matches end tag name
  • move existing checks (for example, IllFormedError::DoubleHyphenInComment) to the new validate method
  • introduce a new Reader which will perform validation checks when flag in config is set (and process many other things, like dealing with encodings and expanding entity references as shown in the custom_entities example in Rework handling general entity references (&entity;) #766). Add necessary config flags

I haven't thought through all the details yet and don't want to focus on them yet. There will be a separate PR where all this will be considered. It should be pretty soon (this year).

So, given that we are not currently performing most validity checks and I am planning a generic consistent solution for them, do you agree that it is not worth introducing them now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also line 358 should be rewritten - XML isn't "produced" by a parser. You can just say that such documents exist in the wild and are tolerated by some parsers such as the one used by Adobe Flash.

Any help are appreciated 😃 . I feel that my English is no good to writing simple-to-read explanations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all sounds good to me, but we probably shouldn't document it as a "new feature" given that we do want this to eventually fail for users of the Reader API.

Mingun and others added 9 commits July 6, 2024 20:23
This is not generic code and it is more clear when we construct them with explicit state
If test will fail due to another syntax error we get a nice diff
Check position for zero actually does not check anything because this is start
position before parse. We move position by 1 by using Text event

failures (36):
    syntax::tag::unclosed3::ns_reader::async_tokio
    syntax::tag::unclosed3::ns_reader::borrowed
    syntax::tag::unclosed3::ns_reader::buffered
    syntax::tag::unclosed3::reader::async_tokio
    syntax::tag::unclosed3::reader::borrowed
    syntax::tag::unclosed3::reader::buffered
    syntax::tag::unclosed4::ns_reader::async_tokio
    syntax::tag::unclosed4::ns_reader::borrowed
    syntax::tag::unclosed4::ns_reader::buffered
    syntax::tag::unclosed4::reader::async_tokio
    syntax::tag::unclosed4::reader::borrowed
    syntax::tag::unclosed4::reader::buffered
    syntax::tag::unclosed5::ns_reader::async_tokio
    syntax::tag::unclosed5::ns_reader::borrowed
    syntax::tag::unclosed5::ns_reader::buffered
    syntax::tag::unclosed5::reader::async_tokio
    syntax::tag::unclosed5::reader::borrowed
    syntax::tag::unclosed5::reader::buffered
    syntax::tag::unclosed6::ns_reader::async_tokio
    syntax::tag::unclosed6::ns_reader::borrowed
    syntax::tag::unclosed6::ns_reader::buffered
    syntax::tag::unclosed6::reader::async_tokio
    syntax::tag::unclosed6::reader::borrowed
    syntax::tag::unclosed6::reader::buffered
    syntax::tag::unclosed7::ns_reader::async_tokio
    syntax::tag::unclosed7::ns_reader::borrowed
    syntax::tag::unclosed7::ns_reader::buffered
    syntax::tag::unclosed7::reader::async_tokio
    syntax::tag::unclosed7::reader::borrowed
    syntax::tag::unclosed7::reader::buffered
    syntax::tag::unclosed8::ns_reader::async_tokio
    syntax::tag::unclosed8::ns_reader::borrowed
    syntax::tag::unclosed8::ns_reader::buffered
    syntax::tag::unclosed8::reader::async_tokio
    syntax::tag::unclosed8::reader::borrowed
    syntax::tag::unclosed8::reader::buffered
…here error position is reported

`.error_position()` is intended to report position where error start occurring,
`.buffer_position()` is intended to show to what position reader read data

tests/issues.rs leave untouched because this is the code from real problems in GH issues,
it should remain as close to issue as possible
src/reader/mod.rs Outdated Show resolved Hide resolved
// optional validate step which user should call manually
// - if we just look for `>` we will parse `</tag attr=">" >` as end tag
// `</tag attr=">` and text `" >` which probably no one existing parser
// do. This is still invalid tag but it also has no practical meaning
Copy link
Collaborator

@dralley dralley Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do -> does

This is malformed XML, however it is tolerated by some parsers
(e.g. the one used by Adobe Flash) and such documents do exist in the wild.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@Mingun
Copy link
Collaborator Author

Mingun commented Jul 8, 2024

Task CIFuzz was failed by unknown reasons -- I asked google/oss-fuzz#12168 but no response yet... It not related to this changes, so merge

@Mingun Mingun merged commit 959eb55 into tafia:master Jul 8, 2024
6 of 7 checks passed
@Mingun Mingun deleted the end-attributes branch July 8, 2024 04:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parsing malformed closing tags
3 participants