Experiment: Run through esc_attr() in a single optimized pass. #5337

dmsnell · 2023-09-27T23:53:11Z

Important note

In a test, I had esc_attr() return its input unprocessed and could measure no significant impact on page runtime, so this PR is entirely exploratory and now focuses on these questions:

- could this be more maintainable?
- does this impose a performance penalty?
- could this lead to better sanitization?

Benchmark on the 12 MB single-page.html test document:

- trunk runs in 90 ms using 79.3 MB of memory.
- branch runs in 193 ms using 73.8 MB of memory.
- branch without $allowedentitynames runs in 184 ms using 73.8 MB of memory.
- branch matching legacy behavior runs in 156 ms using 73.9 MB of memory.

This branch is currently ~~around~~ less than twice as slow but uses noticeably less memory. Since calling everything but esc_attr() consumes 41 MB, we can estimate that the comparison is 39 MB against 34 MB, or that this branch uses only 86% as much memory as trunk.
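
As a rough check of that memory estimate, the arithmetic can be reproduced from the reported totals. This is only a sketch: the variable names are made up for illustration, and the 41 MB baseline is the figure quoted above for everything except esc_attr().

```php
<?php
// Totals reported in the benchmark above, in MB.
$trunk_total  = 79.3;
$branch_total = 73.8;

// Memory attributed to everything except esc_attr(), as quoted above.
$baseline = 41.0;

// Memory attributable to esc_attr() alone on each side.
$trunk_esc_attr  = $trunk_total - $baseline;
$branch_esc_attr = $branch_total - $baseline;

// Share of trunk's esc_attr() memory that the branch needs.
$ratio = $branch_esc_attr / $trunk_esc_attr;

printf( "branch uses %.0f%% as much memory as trunk\n", $ratio * 100 );
```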

Benchmark on a short string more typical of a real value:

- trunk runs in 4.0 µs / 40.5 MB
- branch runs in 4.6 µs / 40.6 MB
- branch without $allowedentitynames runs in 3.9 µs / 40.6 MB
- branch matching legacy behavior runs in 3.0 µs / 41.0 MB

On short or normal inputs the performance difference between the two is thus insignificant.

The existing implementation of esc_attr() runs a jumble of regular expression and other search passes over its input.

In this patch, if the site uses UTF-8, then an exploratory single-pass custom parser is used to escape the attribute values.
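
To make the single-pass idea concrete, here is a minimal sketch of what such a parser could look like. It is not the code from this patch: the function names `sketch_esc_attr_single_pass()` and `sketch_reference_length()` are hypothetical, the sketch assumes UTF-8 input without validating it, and it preserves anything shaped like a character reference without checking the name against an allowed list.

```php
<?php
/**
 * Hypothetical single-pass attribute escaper, for illustration only.
 */
function sketch_esc_attr_single_pass( string $text ): string {
	$replacements = array(
		'<' => '&lt;',
		'>' => '&gt;',
		'"' => '&quot;',
		"'" => '&#039;',
	);

	$output = '';
	$at     = 0;
	$end    = strlen( $text );

	while ( $at < $end ) {
		// Copy the run of bytes that need no escaping in one step.
		$plain   = strcspn( $text, '&<>"\'', $at );
		$output .= substr( $text, $at, $plain );
		$at     += $plain;

		if ( $at >= $end ) {
			break;
		}

		$char = $text[ $at ];
		if ( '&' !== $char ) {
			$output .= $replacements[ $char ];
			$at++;
			continue;
		}

		// Keep input that already looks like a character reference.
		$reference_length = sketch_reference_length( $text, $at );
		if ( $reference_length > 0 ) {
			$output .= substr( $text, $at, $reference_length );
			$at     += $reference_length;
		} else {
			$output .= '&amp;';
			$at++;
		}
	}

	return $output;
}

/**
 * Returns the byte length of a reference-shaped span starting at the
 * ampersand at $at, or 0 if there is none. Names are not validated.
 */
function sketch_reference_length( string $text, int $at ): int {
	$cursor = $at + 1;

	if ( '#' === ( $text[ $cursor ] ?? '' ) ) {
		$cursor++;
		$is_hex = 'x' === strtolower( $text[ $cursor ] ?? '' );
		if ( $is_hex ) {
			$cursor++;
		}
		$body = strspn( $text, $is_hex ? '0123456789abcdefABCDEF' : '0123456789', $cursor );
	} else {
		$body = strspn( $text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', $cursor );
	}

	$cursor += $body;
	if ( 0 === $body || ';' !== ( $text[ $cursor ] ?? '' ) ) {
		return 0;
	}

	return $cursor + 1 - $at;
}
```

The loop visits each byte at most once: `strcspn()` skips ahead to the next character of interest, and the cursor only moves forward. A real implementation would also need to decide which named references are allowed, which is where a lookup structure comes in.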

Trac ticket:

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.


In order to clarify the main loop of `_esc_attr_single_pass_utf8` I've moved the named character reference lookup outside of the function and into a new high-performance token set class dubbed `WP_Token_Set`. I created this class to retain the performance perks brought by the optimized data format. There are two lookup sets though because WordPress traditionally has its own custom set based on HTML4, but I would like to see us allow everything that HTML5 allows, including the common `&apos;` so we don't have to keep writing `&#39;` (because that doesn't stand out as clearly as the name does).

Performance in this change is even better than it was previously because I've removed the substitutions from the lookup table and that removes both iteration and working memory. In order to provide the reverse function, decoding these entities, it would probably be best to create two separate tables, or add a fixed byte length and offset value as a lookup into another table so that we can avoid reintroducing the double crawling scan that we had before.

dmsnell · 2024-03-06T21:41:01Z

src/wp-includes/class-wp-token-set.php

+		if ( self::KEY_LENGTH < $text_length ) {
+			$group_key = substr( $text, $offset, self::KEY_LENGTH );
+
+			if ( ! isset( $this->large_words[ $group_key ] ) ) {


Bug! If the group doesn't exist, that doesn't imply that the short word doesn't exist either. This should break but not return.
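
The fall-through described in this comment can be sketched with a simplified token set. This is a hypothetical illustration, not the `WP_Token_Set` class from the patch: the class name `Sketch_Token_Set`, the `read_token()` method, the storage layout, and the `KEY_LENGTH` of 2 are all assumptions chosen to keep the example small.

```php
<?php
/**
 * Hypothetical, simplified token set; not the class from the patch.
 */
final class Sketch_Token_Set {
	const KEY_LENGTH = 2;

	/** @var array<string, string[]> Longer words grouped by their first KEY_LENGTH bytes. */
	private $large_words = array();

	/** @var array<string, true> Words of at most KEY_LENGTH bytes. */
	private $small_words = array();

	public function __construct( array $words ) {
		foreach ( $words as $word ) {
			if ( strlen( $word ) > self::KEY_LENGTH ) {
				$this->large_words[ substr( $word, 0, self::KEY_LENGTH ) ][] = $word;
			} else {
				$this->small_words[ $word ] = true;
			}
		}

		// Longest candidates first so the longest match wins within a group.
		foreach ( $this->large_words as &$group ) {
			usort(
				$group,
				function ( $a, $b ) {
					return strlen( $b ) - strlen( $a );
				}
			);
		}
		unset( $group );
	}

	/** Returns the token found at $offset, or null if none matches. */
	public function read_token( string $text, int $offset = 0 ): ?string {
		$text_length = strlen( $text ) - $offset;

		if ( self::KEY_LENGTH < $text_length ) {
			$group_key = substr( $text, $offset, self::KEY_LENGTH );

			// A missing group only rules out the long words, so do not return here.
			if ( isset( $this->large_words[ $group_key ] ) ) {
				foreach ( $this->large_words[ $group_key ] as $word ) {
					if ( 0 === substr_compare( $text, $word, $offset, strlen( $word ) ) ) {
						return $word;
					}
				}
			}
		}

		// Always fall through to the short words.
		for ( $length = min( self::KEY_LENGTH, $text_length ); $length > 0; $length-- ) {
			$candidate = substr( $text, $offset, $length );
			if ( isset( $this->small_words[ $candidate ] ) ) {
				return $candidate;
			}
		}

		return null;
	}
}

$names = new Sketch_Token_Set( array( 'gt', 'lt', 'amp', 'apos', 'quot' ) );
```

With these words, `$names->read_token( 'gt;' )` has no `gt` group among the long words, yet `gt` is still a valid short word. Returning early on the missing group would wrongly report no match for that input.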

dmsnell force-pushed the experiment/single-pass-esc-attr branch 5 times, most recently from 947db01 to ede200b on October 1, 2023 06:56

dmsnell added 8 commits October 2, 2023 15:49

Put it in its own file.

43db5f0

Advance the pointer to avoid an infinite loop

1bc7b33

Final fixes

2450dd9

Small docs changes

41365a6

More small changes.

0a4d4c2

Add question about limiting the list of allowable names

804969a

Preserve more Core behaviors

35cc667

dmsnell force-pushed the experiment/single-pass-esc-attr branch from e2db788 to 35cc667 on October 2, 2023 22:49

dmsnell added 2 commits October 2, 2023 21:16

Skip HTML4 allowable entity names check.

49af2fd

dmsnell mentioned this pull request Apr 12, 2024

HTML API: Add custom text decoder #6387

Closed
