Compiler: Fix and improve performance of entity decoding#12182
Merged
DustinCampbell merged 8 commits intomainfrom Sep 8, 2025
Merged
Compiler: Fix and improve performance of entity decoding#12182DustinCampbell merged 8 commits intomainfrom
DustinCampbell merged 8 commits intomainfrom
Conversation
This change refactors ComponentMarkupEncodingPass.TryGetHtmlEntity(...) to avoid allocations by using spans where possible instead of creating sub-strings. However, while doing this I found some very strange behavior. Numeric HTML character entities references allow both decimal and hexadecimal numbers in the form of `&#nnnn;` and `&#xhhhh;`, respectively. However, it seems that Razor's decoding never supported hexadecimal numbers properly. The original implementation uses `Convert.ToInt32(..., 16)` to parse hexadecimal numbers, which supports leading '0x' but not a leading 'x'. So, Razor *only* support an *illegal* variant of hex character entities: `�xhhhh;`. Surprisingly, there's even a test (Execute_MixedHtmlContent_MultipleHTMLEntities_DoesNotSetEncoded) that validates an illegal hex character entity! In addition, I found that the Razor compiler will fail to decode numeric character entities that do not appear in the ParserHelpers.HtmlEntityCodePoints dictionary. This is strange because the dictionary is far from complete! There are many basic character entities (like '/' -> '/') that are not present. Character entity decoding in the compiler is really a best effort. If it fails for any reason, the compiler will generate a call to RenderTreeBuilder.AddMarkupContent with the encoded text instead of RenderTreeBuilder.AddContent. However, when that happens, it means that character decoding will happen at runtime, which might(?) have performance implications. I've gone ahead and fixed both of these issues: 1. Added support to TryGetHtmlEntity for legal hex character entities in addition to the illegal variants. 2. Added support to TryGetHtmlEntity for legal printable code points that don't happen to appear in the ParserHelpers.HtmlEntityCodePoints dictionary.
On .NET 9+, TryGetHtmlEntity can use GetAlternateLookup<ReadOnlySpan<char>>() to avoid allocating a string when calling ParserHelpers.NamedHtmlEntities.TryGetValue(...).
TryDecodeHtmlEntities uses a pretty inefficient for building decoded text after HTML character entities are processed. It scans through the content and tries to decode character entities when it encounters a '&' character. It adds each character entity and its replacement to a dictionary. Then, when it's done scanning, it looks through the dictionary and calls string.Replace for each one. This results in a string allocation per different character entity found. So, if the content contains both '>' and '<'. There will be two string allocations to produce the decoded text. The change refactors TryDecodeHtmlEntities to avoid these string allocations. Instead of using a dictionary, it uses a MemoryBuilder<ReadOnlyMemory<char>> to track the chunks needed to build the final decoded chunks.
Add several helpers for efficiently creating strings with a MemoryBuilder<ReadOnlyMemory<char>>. 1. MemoryBuilderExtensions.CreateString: An instance extension method on MemoryBuilder<ReadOnlyMemory<char>> that concatenates the contents to a string. Note that care is taken for the cases where the builder is empty or contains just a single ReadOnlyMemory<char>. 2. StringBuilderExtensions.Build: A static extension method on string that takes a delegate that provides a builder for building up a string. 3. StringBuilderExtensions.TryBuild: A static extension method on string that takes a delegate that provides a builder for building up a string. The delegate can return true or false to indicate whether the string was built successfully or not. Comprehensive unit tests have been added for each. ComponentMarkupEncodingPass.TryDecodeHtmlEntities(...) has been updated to call the string.TryBuild(...) helper.
- Use a FrozenSet<char> to avoid Enumerable.Contains(...) call. - Use a PooledarrayBuilder to keep track of decoded content to avoid constructing an array up front. - Avoid decoding completely by counting ampersand characters up front.
- Enable nullability - Mark as sealed - Clean up comment
- Use FrozenSet<string> for VoidElementNames - Use FrozenDictionary<string, string> for HtmlEntityCodePoints and NamedHtmlEntities. - Move TryGetHtmlEntity from ComponentMarkupEncodingPass to ParserHelpers. - Make HtmlEntityCodePoints and NamedHitmlEntities private, since they're only used by TryGetHtmlEntity. - Add loads of tests for TryGetHtmlEntity.
davidwengier
approved these changes
Sep 5, 2025
Member
davidwengier
left a comment
There was a problem hiding this comment.
LGTM. Once again with your PRs, if you don't have to change test baselines, its an easy green tick.
Having said that, if we have any compiler integration tests that include �x... then perhaps its worth adding one for the legal, now-supported, form.
...Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Components/ComponentMarkupEncodingPass.cs
Outdated
Show resolved
Hide resolved
src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs
Outdated
Show resolved
Hide resolved
src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs
Outdated
Show resolved
Hide resolved
ToddGrun
reviewed
Sep 5, 2025
src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs
Outdated
Show resolved
Hide resolved
ToddGrun
reviewed
Sep 5, 2025
src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs
Outdated
Show resolved
Hide resolved
ToddGrun
reviewed
Sep 5, 2025
src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs
Outdated
Show resolved
Hide resolved
ToddGrun
reviewed
Sep 5, 2025
src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs
Show resolved
Hide resolved
ToddGrun
reviewed
Sep 5, 2025
src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs
Outdated
Show resolved
Hide resolved
ToddGrun
reviewed
Sep 5, 2025
...Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Components/ComponentMarkupEncodingPass.cs
Outdated
Show resolved
Hide resolved
Member
Author
|
@chsienki, PTAL |
- Add Debug.Asserts - Add more explanatory comments - Clean up logic a bit - Use PooledArrayBuilder<(HtmlIntermediateToken, string)> to clean up code that updates tokens with decoded content.
Member
Author
There's an existing test, which is how I discovered the issue during refactoring. |
chsienki
approved these changes
Sep 8, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tip
I strongly recommend reviewing this commit-by-commit. The commit history represents a progression of changes, and each commit has a detailed message describing its purpose.
This change overhauls the HTML character entity reference decoding performed by
ComponentMarkupEncodingPass. The Razor compiler makes a best effort to decode HTML content for code generation. If it determines that the content is suitable as raw text and all character entity references can be decoded (if any), the compiler's code gen will callRenderTreeBuilder.AddContent(...). If not, it'll callRenderTreeBuilder.AddMarkupContent(...), which can have an impact on the runtime performance of the compiled application. So, the goal is to callAddContent(...)as much as possible.While tuning the performance of
ComponentMarkupEncodingPassto reduce allocations and avoid unnecessary work, I encountered two bugs:&#xhhhh;. However, the compiler's implementation only supports an illegal form,�xhhhh;and doesn't actually support the legal form. So, if content contains legal hex numeric character entity references, they will fail to decode andRenderTreeBuilder.AddMarkupContent(...)will be code-gen'd.Ӓ, a lookup table is used to retrieve the code point value. However, the lookup table isn't complete and doesn't include very simple values. For example, it doesn't include/which translates to a/character. So, if the content contains decimal numeric character entity references that aren't included in the lookup table, they will fail to decode andRenderTreeBuilder.AddMarkupContent(...)will be code-gen'd.I've gone ahead and fixed both of these issues in 6f23ef0.
In addition, there are several performance and allocation fixes in this commit. These changes resulted in the creation of a handful of useful helpers for building and creating strings using a
MemoryBuilder<ReadOnlyMemory<char>>. Those helpers and tests are included in e266160.Many thanks to Copilot for help with writing all of the tests. 🤖🧪❤️
CI Build: https://dev.azure.com/dnceng/internal/_build/results?buildId=2786521&view=results
Test Insertion: https://dev.azure.com/devdiv/DevDiv/_git/VS/pullrequest/667364
Toolset Run: https://dev.azure.com/dnceng/internal/_build/results?buildId=2786524&view=results