Skip to content

Compiler: Fix and improve performance of entity decoding#12182

Merged
DustinCampbell merged 8 commits intomainfrom
dev/dustinca/improve-entity-decoding
Sep 8, 2025
Merged

Compiler: Fix and improve performance of entity decoding#12182
DustinCampbell merged 8 commits intomainfrom
dev/dustinca/improve-entity-decoding

Conversation

@DustinCampbell
Copy link
Member

@DustinCampbell DustinCampbell commented Sep 4, 2025

Tip

I strongly recommend reviewing this commit-by-commit. The commit history represents a progression of changes, and each commit has a detailed message describing its purpose.

This change overhauls the HTML character entity reference decoding performed by ComponentMarkupEncodingPass. The Razor compiler makes a best effort to decode HTML content for code generation. If it determines that the content is suitable as raw text and all character entity references can be decoded (if any), the compiler's code gen will call RenderTreeBuilder.AddContent(...). If not, it'll call RenderTreeBuilder.AddMarkupContent(...), which can have an impact on the runtime performance of the compiled application. So, the goal is to call AddContent(...) as much as possible.

While tuning the performance of ComponentMarkupEncodingPass to reduce allocations and avoid unnecessary work, I encountered two bugs:

  1. The Razor compiler doesn't implement hexadecimal HTML character entity references correctly. Hex character entity references are supposed to be in the form, &#xhhhh;. However, the compiler's implementation only supports an illegal form, &#0xhhhh; and doesn't actually support the legal form. So, if content contains legal hex numeric character entity references, they will fail to decode and RenderTreeBuilder.AddMarkupContent(...) will be code-gen'd.
  2. When decoding a decimal character entity reference, such as &#1234, a lookup table is used to retrieve the code point value. However, the lookup table isn't complete and doesn't include very simple values. For example, it doesn't include / which translates to a / character. So, if the content contains decimal numeric character entity references that aren't included in the lookup table, they will fail to decode and RenderTreeBuilder.AddMarkupContent(...) will be code-gen'd.

I've gone ahead and fixed both of these issues in 6f23ef0.

In addition, there are several performance and allocation fixes in this commit. These changes resulted in the creation of a handful of useful helpers for building and creating strings using a MemoryBuilder<ReadOnlyMemory<char>>. Those helpers and tests are included in e266160.

Many thanks to Copilot for help with writing all of the tests. 🤖🧪❤️


CI Build: https://dev.azure.com/dnceng/internal/_build/results?buildId=2786521&view=results
Test Insertion: https://dev.azure.com/devdiv/DevDiv/_git/VS/pullrequest/667364
Toolset Run: https://dev.azure.com/dnceng/internal/_build/results?buildId=2786524&view=results

This change refactors ComponentMarkupEncodingPass.TryGetHtmlEntity(...) to avoid allocations by using spans where possible instead of creating sub-strings. However, while doing this I found some very strange behavior.

Numeric HTML character entities references allow both decimal and hexadecimal numbers in the form of `&#nnnn;` and `&#xhhhh;`, respectively. However, it seems that Razor's decoding never supported hexadecimal numbers properly. The original implementation uses `Convert.ToInt32(..., 16)` to parse hexadecimal numbers, which supports leading '0x' but not a leading 'x'. So, Razor *only* support an *illegal* variant of hex character entities: `&#0xhhhh;`. Surprisingly, there's even a test (Execute_MixedHtmlContent_MultipleHTMLEntities_DoesNotSetEncoded) that validates an illegal hex character entity!

In addition, I found that the Razor compiler will fail to decode numeric character entities that do not appear in the ParserHelpers.HtmlEntityCodePoints dictionary. This is strange because the dictionary is far from complete! There are many basic character entities (like '&#47;' -> '/') that are not present.

Character entity decoding in the compiler is really a best effort. If it fails for any reason, the compiler will generate a call to RenderTreeBuilder.AddMarkupContent with the encoded text instead of RenderTreeBuilder.AddContent. However, when that happens, it means that character decoding will happen at runtime, which might(?) have performance implications.

I've gone ahead and fixed both of these issues:

1. Added support to TryGetHtmlEntity for legal hex character entities in addition to the illegal variants.
2. Added support to TryGetHtmlEntity for legal printable code points that don't happen to appear in the ParserHelpers.HtmlEntityCodePoints dictionary.
On .NET 9+, TryGetHtmlEntity can use GetAlternateLookup<ReadOnlySpan<char>>() to avoid allocating a string when calling ParserHelpers.NamedHtmlEntities.TryGetValue(...).
TryDecodeHtmlEntities uses a pretty inefficient for building decoded text after HTML character entities are processed. It scans through the content and tries to decode character entities when it encounters a '&' character. It adds each character entity and its replacement to a dictionary. Then, when it's done scanning, it looks through the dictionary and calls string.Replace for each one.

This results in a string allocation per different character entity found. So, if the content contains both '&gt;' and '&lt;'. There will be two string allocations to produce the decoded text.

The change refactors TryDecodeHtmlEntities to avoid these string allocations. Instead of using a dictionary, it uses a MemoryBuilder<ReadOnlyMemory<char>> to track the chunks needed to build the final decoded chunks.
Add several helpers for efficiently creating strings with a MemoryBuilder<ReadOnlyMemory<char>>.

1. MemoryBuilderExtensions.CreateString: An instance extension method on MemoryBuilder<ReadOnlyMemory<char>> that concatenates the contents to a string. Note that care is taken for the cases where  the builder is empty or contains just a single ReadOnlyMemory<char>.
2. StringBuilderExtensions.Build: A static extension method on string that takes a delegate that provides a builder for building up a string.
3. StringBuilderExtensions.TryBuild: A static extension method on string that takes a delegate that provides a builder for building up a string. The delegate can return true or false to indicate whether the string was built successfully or not.

Comprehensive unit tests have been added for each.

ComponentMarkupEncodingPass.TryDecodeHtmlEntities(...) has been updated to call the string.TryBuild(...) helper.
- Use a FrozenSet<char> to avoid Enumerable.Contains(...) call.
- Use a PooledarrayBuilder to keep track of decoded content to avoid constructing an array up front.
- Avoid decoding completely by counting ampersand characters up front.
- Enable nullability
- Mark as sealed
- Clean up comment
- Use FrozenSet<string> for VoidElementNames
- Use FrozenDictionary<string, string> for HtmlEntityCodePoints and NamedHtmlEntities.
- Move TryGetHtmlEntity from ComponentMarkupEncodingPass to ParserHelpers.
- Make HtmlEntityCodePoints and NamedHitmlEntities private, since they're only used by TryGetHtmlEntity.
- Add loads of tests for TryGetHtmlEntity.
@DustinCampbell DustinCampbell requested a review from a team as a code owner September 4, 2025 21:09
@DustinCampbell DustinCampbell requested review from a team and ToddGrun September 4, 2025 21:09
Copy link
Member

@davidwengier davidwengier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Once again with your PRs, if you don't have to change test baselines, its an easy green tick.

Having said that, if we have any compiler integration tests that include &#0x... then perhaps its worth adding one for the legal, now-supported, form.

Copy link
Contributor

@ToddGrun ToddGrun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@DustinCampbell
Copy link
Member Author

@chsienki, PTAL

- Add Debug.Asserts
- Add more explanatory comments
- Clean up logic a bit
- Use PooledArrayBuilder<(HtmlIntermediateToken, string)> to clean up code that updates tokens with decoded content.
@DustinCampbell
Copy link
Member Author

Having said that, if we have any compiler integration tests that include &#0x... then perhaps its worth adding one for the legal, now-supported, form.

There's an existing test, which is how I discovered the issue during refactoring.

@DustinCampbell DustinCampbell merged commit 132fbec into main Sep 8, 2025
11 checks passed
@DustinCampbell DustinCampbell deleted the dev/dustinca/improve-entity-decoding branch September 8, 2025 20:09
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Sep 8, 2025
@akhera99 akhera99 modified the milestones: Next, 18.0 P1 Sep 22, 2025
@akhera99 akhera99 added this to the 18.0 P2 milestone Sep 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants