Compiler: Fix and improve performance of entity decoding by DustinCampbell · Pull Request #12182 · dotnet/razor

DustinCampbell · 2025-09-04T21:09:10Z

Tip

I strongly recommend reviewing this commit-by-commit. The commit history represents a progression of changes, and each commit has a detailed message describing its purpose.

This change overhauls the HTML character entity reference decoding performed by ComponentMarkupEncodingPass. The Razor compiler makes a best effort to decode HTML content for code generation. If it determines that the content is suitable as raw text and that any character entity references it contains can be decoded, the compiler's code gen will call RenderTreeBuilder.AddContent(...). If not, it'll call RenderTreeBuilder.AddMarkupContent(...), which can have an impact on the runtime performance of the compiled application. So, the goal is to call AddContent(...) as much as possible.

While tuning the performance of ComponentMarkupEncodingPass to reduce allocations and avoid unnecessary work, I encountered two bugs:

1. The Razor compiler doesn't implement hexadecimal HTML character entity references correctly. Hex character entity references are supposed to be in the form &#xhhhh;. However, the compiler's implementation only supports an illegal form, &#0xhhhh;, and doesn't actually support the legal form. So, if content contains legal hex numeric character entity references, they will fail to decode and RenderTreeBuilder.AddMarkupContent(...) will be code-gen'd.
2. When decoding a decimal character entity reference, such as &#1234;, a lookup table is used to retrieve the code point value. However, the lookup table isn't complete and doesn't include very simple values. For example, it doesn't include &#47;, which translates to a / character. So, if the content contains decimal numeric character entity references that aren't included in the lookup table, they will fail to decode and RenderTreeBuilder.AddMarkupContent(...) will be code-gen'd.

I've gone ahead and fixed both of these issues in 6f23ef0.

In addition, there are several performance and allocation fixes in this commit. These changes resulted in the creation of a handful of useful helpers for building and creating strings using a MemoryBuilder<ReadOnlyMemory<char>>. Those helpers and tests are included in e266160.

Many thanks to Copilot for help with writing all of the tests. 🤖🧪❤️

CI Build: https://dev.azure.com/dnceng/internal/_build/results?buildId=2786521&view=results
Test Insertion: https://dev.azure.com/devdiv/DevDiv/_git/VS/pullrequest/667364
Toolset Run: https://dev.azure.com/dnceng/internal/_build/results?buildId=2786524&view=results

This change refactors ComponentMarkupEncodingPass.TryGetHtmlEntity(...) to avoid allocations by using spans where possible instead of creating sub-strings. However, while doing this I found some very strange behavior.

Numeric HTML character entity references allow both decimal and hexadecimal numbers in the form of `&#nnnn;` and `&#xhhhh;`, respectively. However, it seems that Razor's decoding never supported hexadecimal numbers properly. The original implementation uses `Convert.ToInt32(..., 16)` to parse hexadecimal numbers, which supports a leading '0x' but not a leading 'x'. So, Razor *only* supports an *illegal* variant of hex character entities: `&#0xhhhh;`. Surprisingly, there's even a test (Execute_MixedHtmlContent_MultipleHTMLEntities_DoesNotSetEncoded) that validates an illegal hex character entity!

In addition, I found that the Razor compiler will fail to decode numeric character entities that do not appear in the ParserHelpers.HtmlEntityCodePoints dictionary. This is strange because the dictionary is far from complete! There are many basic character entities (like '&#47;' -> '/') that are not present.

Character entity decoding in the compiler is really a best effort. If it fails for any reason, the compiler will generate a call to RenderTreeBuilder.AddMarkupContent with the encoded text instead of RenderTreeBuilder.AddContent. However, when that happens, it means that character decoding will happen at runtime, which might(?) have performance implications.

I've gone ahead and fixed both of these issues:

1. Added support to TryGetHtmlEntity for legal hex character entities in addition to the illegal variants.
2. Added support to TryGetHtmlEntity for legal printable code points that don't happen to appear in the ParserHelpers.HtmlEntityCodePoints dictionary.
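To make the two numeric forms concrete, here is a minimal sketch of span-based numeric decoding. It is not the compiler's TryGetHtmlEntity: the class and method names are hypothetical, it expects only the text between `&#` and `;`, it does not keep the illegal `&#0xhhhh;` variant, and it checks only for valid Unicode scalar values rather than "printable" code points. For comparison, `Convert.ToInt32("0x2F", 16)` returns 47, while `Convert.ToInt32("x2F", 16)` throws a FormatException, which matches the behavior described above.

```csharp
using System;
using System.Globalization;

public static class NumericEntitySketch
{
    // Hypothetical helper for illustration only.
    // 'body' is the text between "&#" and ";", e.g. "47" or "x2F".
    public static bool TryDecodeNumeric(ReadOnlySpan<char> body, out string decoded)
    {
        decoded = "";
        if (body.IsEmpty)
        {
            return false;
        }

        int codePoint;
        if (body[0] == 'x' || body[0] == 'X')
        {
            // Legal hex form: &#xhhhh; (parse the digits after the 'x', no substring needed).
            if (!int.TryParse(body.Slice(1), NumberStyles.AllowHexSpecifier,
                    CultureInfo.InvariantCulture, out codePoint))
            {
                return false;
            }
        }
        else if (!int.TryParse(body, NumberStyles.None,
                     CultureInfo.InvariantCulture, out codePoint))
        {
            // Decimal form: &#nnnn; (digits only, so "0x2F" is rejected here).
            return false;
        }

        // Reject anything that is not a Unicode scalar value.
        if (codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            return false;
        }

        decoded = char.ConvertFromUtf32(codePoint);
        return true;
    }
}
```

With this sketch, both `TryDecodeNumeric("47", ...)` and `TryDecodeNumeric("x2F", ...)` produce "/" without consulting a lookup table.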

On .NET 9+, TryGetHtmlEntity can use GetAlternateLookup<ReadOnlySpan<char>>() to avoid allocating a string when calling ParserHelpers.NamedHtmlEntities.TryGetValue(...).
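As a rough illustration of the alternate-lookup technique, the sketch below looks up a `ReadOnlySpan<char>` key in a string-keyed dictionary without first allocating a string. The table contents, key format, and type names are assumptions made for the example and are not the contents of ParserHelpers.NamedHtmlEntities.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class NamedEntityLookupSketch
{
    // Tiny illustrative table; the key format (name without '&' and ';') is an assumption.
    private static readonly Dictionary<string, string> s_entities = new(StringComparer.Ordinal)
    {
        ["amp"] = "&",
        ["lt"] = "<",
        ["gt"] = ">",
    };

    public static bool TryGetNamedEntity(ReadOnlySpan<char> name, out string? replacement)
    {
#if NET9_0_OR_GREATER
        // .NET 9+: query the dictionary directly with the span, no string allocation.
        var lookup = s_entities.GetAlternateLookup<ReadOnlySpan<char>>();
        return lookup.TryGetValue(name, out replacement);
#else
        // Older targets: fall back to allocating a string for the key.
        return s_entities.TryGetValue(name.ToString(), out replacement);
#endif
    }
}
```

The `#if` split mirrors the "on .NET 9+" wording above: the span path is only available where the alternate lookup API exists.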

TryDecodeHtmlEntities uses a pretty inefficient approach for building decoded text after HTML character entities are processed. It scans through the content and tries to decode character entities when it encounters a '&' character. It adds each character entity and its replacement to a dictionary. Then, when it's done scanning, it looks through the dictionary and calls string.Replace for each one. This results in a string allocation per distinct character entity found. So, if the content contains both '&gt;' and '&lt;', there will be two string allocations to produce the decoded text. The change refactors TryDecodeHtmlEntities to avoid these string allocations. Instead of using a dictionary, it uses a MemoryBuilder<ReadOnlyMemory<char>> to track the chunks needed to build the final decoded text.
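The chunk-tracking idea can be sketched as follows. Since MemoryBuilder is internal to the Razor codebase, this example substitutes a `List<ReadOnlyMemory<char>>`, and the three-entity decoder is a stand-in; all names here are hypothetical. Undecoded stretches are recorded as slices of the original string, so no intermediate strings are created while scanning.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkedDecodeSketch
{
    // Stand-in decoder for illustration: knows only three named entities.
    private static bool TryDecode(ReadOnlySpan<char> entity, out string replacement)
    {
        if (entity.SequenceEqual("&lt;".AsSpan())) { replacement = "<"; return true; }
        if (entity.SequenceEqual("&gt;".AsSpan())) { replacement = ">"; return true; }
        if (entity.SequenceEqual("&amp;".AsSpan())) { replacement = "&"; return true; }
        replacement = "";
        return false;
    }

    // Appends slices of 'content' and decoded replacements to 'chunks'.
    // Returns false if any entity cannot be decoded (caller keeps the encoded text).
    public static bool TryCollectChunks(string content, List<ReadOnlyMemory<char>> chunks)
    {
        ReadOnlyMemory<char> memory = content.AsMemory();
        ReadOnlySpan<char> span = memory.Span;
        int start = 0; // start of the pending run of ordinary text
        int i = 0;

        while (i < span.Length)
        {
            if (span[i] != '&')
            {
                i++;
                continue;
            }

            int end = span.Slice(i).IndexOf(';');
            if (end < 0 || !TryDecode(span.Slice(i, end + 1), out string replacement))
            {
                return false;
            }

            if (i > start)
            {
                chunks.Add(memory.Slice(start, i - start)); // text before the entity
            }

            chunks.Add(replacement.AsMemory());
            i += end + 1;
            start = i;
        }

        if (start < span.Length)
        {
            chunks.Add(memory.Slice(start)); // trailing text
        }

        return true;
    }
}
```

For "a &lt; b &gt; c", this records five chunks ("a ", "<", " b ", ">", " c"), and the final string can then be produced with a single allocation when the chunks are concatenated.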

Add several helpers for efficiently creating strings with a MemoryBuilder<ReadOnlyMemory<char>>.

1. MemoryBuilderExtensions.CreateString: An instance extension method on MemoryBuilder<ReadOnlyMemory<char>> that concatenates the contents to a string. Note that care is taken for the cases where the builder is empty or contains just a single ReadOnlyMemory<char>.
2. StringBuilderExtensions.Build: A static extension method on string that takes a delegate that provides a builder for building up a string.
3. StringBuilderExtensions.TryBuild: A static extension method on string that takes a delegate that provides a builder for building up a string. The delegate can return true or false to indicate whether the string was built successfully or not.

Comprehensive unit tests have been added for each.

ComponentMarkupEncodingPass.TryDecodeHtmlEntities(...) has been updated to call the string.TryBuild(...) helper.
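The shape of these helpers might look roughly like the sketch below. It is an approximation under stated assumptions: a `List<ReadOnlyMemory<char>>` stands in for MemoryBuilder<ReadOnlyMemory<char>>, the signatures are guesses rather than the PR's actual API, and the helpers are ordinary static methods instead of extensions on string.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkStringSketch
{
    // Concatenates chunks into one string, with fast paths for zero or one chunk.
    public static string CreateString(List<ReadOnlyMemory<char>> chunks)
    {
        if (chunks.Count == 0)
        {
            return string.Empty;
        }

        if (chunks.Count == 1)
        {
            return chunks[0].ToString();
        }

        int total = 0;
        foreach (var chunk in chunks)
        {
            total += chunk.Length;
        }

        // Allocate the final string once and copy each chunk into place.
        return string.Create(total, chunks, static (dest, state) =>
        {
            foreach (var chunk in state)
            {
                chunk.Span.CopyTo(dest);
                dest = dest.Slice(chunk.Length);
            }
        });
    }

    // Hypothetical TryBuild: the delegate fills the chunk list and reports success.
    public static bool TryBuild<TState>(
        TState state,
        Func<List<ReadOnlyMemory<char>>, TState, bool> build,
        out string result)
    {
        var chunks = new List<ReadOnlyMemory<char>>();
        if (!build(chunks, state))
        {
            result = "";
            return false;
        }

        result = CreateString(chunks);
        return true;
    }
}
```

Passing the state explicitly lets the delegate be a static lambda, which avoids a closure allocation at the call site.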

- Use a FrozenSet<char> to avoid Enumerable.Contains(...) call.
- Use a PooledArrayBuilder to keep track of decoded content to avoid constructing an array up front.
- Avoid decoding completely by counting ampersand characters up front.
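A small sketch of the first and last bullets, under assumptions: the character set shown is illustrative rather than the pass's actual set, and the names are hypothetical. The point is that a FrozenSet<char> membership check replaces a LINQ Contains over an array, and a cheap ampersand count lets the caller skip entity decoding when the count is zero.

```csharp
using System;
using System.Collections.Frozen;

public static class VisitHtmlSketch
{
    // Illustrative set of characters that would force markup output (an assumption).
    private static readonly FrozenSet<char> s_specialChars =
        new[] { '\r', '\n', '\t', '<', '>' }.ToFrozenSet();

    public static bool ContainsSpecialChar(ReadOnlySpan<char> content)
    {
        foreach (char ch in content)
        {
            if (s_specialChars.Contains(ch))
            {
                return true;
            }
        }

        return false;
    }

    // Counts '&' characters; zero means there is nothing to decode.
    public static int CountAmpersands(ReadOnlySpan<char> content)
    {
        int count = 0;
        foreach (char ch in content)
        {
            if (ch == '&')
            {
                count++;
            }
        }

        return count;
    }
}
```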

- Enable nullability
- Mark as sealed
- Clean up comment

- Use FrozenSet<string> for VoidElementNames.
- Use FrozenDictionary<string, string> for HtmlEntityCodePoints and NamedHtmlEntities.
- Move TryGetHtmlEntity from ComponentMarkupEncodingPass to ParserHelpers.
- Make HtmlEntityCodePoints and NamedHtmlEntities private, since they're only used by TryGetHtmlEntity.
- Add loads of tests for TryGetHtmlEntity.

davidwengier

LGTM. Once again with your PRs, if you don't have to change test baselines, it's an easy green tick.

Having said that, if we have any compiler integration tests that include &#0x... then perhaps it's worth adding one for the legal, now-supported, form.

...Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Components/ComponentMarkupEncodingPass.cs

src/Compiler/Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Legacy/ParserHelpers.cs

ToddGrun

...Microsoft.CodeAnalysis.Razor.Compiler/src/Language/Components/ComponentMarkupEncodingPass.cs

DustinCampbell · 2025-09-08T17:16:04Z

@chsienki, PTAL

- Add Debug.Asserts
- Add more explanatory comments
- Clean up logic a bit
- Use PooledArrayBuilder<(HtmlIntermediateToken, string)> to clean up code that updates tokens with decoded content.

DustinCampbell · 2025-09-08T18:18:03Z

> Having said that, if we have any compiler integration tests that include &#0x... then perhaps it's worth adding one for the legal, now-supported, form.

There's an existing test, which is how I discovered the issue during refactoring.

DustinCampbell added 7 commits September 4, 2025 10:10

Use GetAlternateLookup() to avoid string allocation on .NET 9+

6aabf18

Improve ComponentMarkupEncodingPass.VisitHtml

519d43b

Clean up ComponentMarkupEncodingPass

b0cad51

DustinCampbell requested a review from a team as a code owner September 4, 2025 21:09

DustinCampbell requested review from a team and ToddGrun September 4, 2025 21:09

davidwengier approved these changes Sep 5, 2025