Replace disallowed characters #15

Merged
merged 1 commit into from
Sep 7, 2023

bryanburgers

Make it possible for the parser to successfully parse any XML document that contains characters that are invalid according to the XML spec.

A document that contains a character entity reference that resolves to an invalid character according to the XML spec (for example `<doc>&#x10;</doc>`) is technically invalid according to the XML specification. However, replacing these characters with U+FFFD is better for the consuming application, because the application does not need to pre-process the document before parsing it. This takes previous work that already handled this for surrogate pairs and extends it to any invalid Unicode character.

Like the previous work, this technically makes the XML parser non-conformant, so the behavior is put behind the existing `replace_unknown_entity_references` field on `ParserConfig`.
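To illustrate the idea, here is a minimal standalone sketch of the replacement rule. It is not the code from this PR: `is_xml10_char` and `resolve_char_ref` are hypothetical helpers, and the `replace_invalid` parameter merely stands in for the `replace_unknown_entity_references` option. The validity check follows the `Char` production of XML 1.0.

```rust
/// Returns true if `c` matches the `Char` production of XML 1.0:
/// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
fn is_xml10_char(c: char) -> bool {
    matches!(
        c,
        '\u{9}' | '\u{A}' | '\u{D}'
            | '\u{20}'..='\u{D7FF}'
            | '\u{E000}'..='\u{FFFD}'
            | '\u{10000}'..='\u{10FFFF}'
    )
}

/// Hypothetical helper: resolves the numeric value of a character reference
/// such as `&#x10;`. `replace_invalid` is an assumption standing in for the
/// parser option discussed above.
fn resolve_char_ref(code_point: u32, replace_invalid: bool) -> Result<char, String> {
    // `char::from_u32` already rejects surrogates and values above U+10FFFF.
    match char::from_u32(code_point) {
        Some(c) if is_xml10_char(c) => Ok(c),
        // Lenient mode: substitute U+FFFD REPLACEMENT CHARACTER.
        _ if replace_invalid => Ok('\u{FFFD}'),
        // Strict mode: report the reference as an error.
        _ => Err(format!("invalid character reference: U+{:04X}", code_point)),
    }
}
```

Under these assumptions, `resolve_char_ref(0x10, true)` yields U+FFFD, while `resolve_char_ref(0x10, false)` returns an error, which is the conformant behavior.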


Previous context at:

@kornelski
Owner

Thanks for the PR.

Like the previous maintainer, I'll lament that the library should be moving to better XML conformance, not the other way. But since the option was already there, let's roll with it.

@kornelski kornelski merged commit aa1c1d6 into kornelski:main Sep 7, 2023
2 checks passed
@bryanburgers
Author

Thank you!
