Experiment: Run through esc_attr() in a single optimized pass. #5337

Draft
wants to merge 10 commits into
base: trunk

@dmsnell dmsnell commented Sep 27, 2023

Trac ticket: Core-60841

Important note

In a test, I had esc_attr() return its input unprocessed and could measure no significant impact on page runtime. This PR is therefore entirely exploratory and now focuses on these questions:

  • could this be more maintainable?
  • does this impose a performance penalty?
  • could this lead to better sanitization?

Benchmark on single-page.html 12 MB test document:

  • trunk runs in 90 ms using 79.3 MB of memory.
  • branch runs in 193 ms using 73.8 MB of memory.
  • branch without $allowedentitynames runs in 184 ms using 73.8 MB of memory.
  • branch matching legacy behavior is at 156 ms using 73.9 MB of memory.
  • This branch currently runs roughly twice as slow but uses noticeably less memory. Since everything other than esc_attr() consumes about 41 MB, we can estimate that the comparison is 39 MB against 34 MB, or that this branch uses only about 86% as much memory as trunk.

Benchmark on a short string more typical of a real value:

  • trunk runs in 4.0 µs / 40.5 MB
  • branch runs in 4.6 µs / 40.6 MB
  • branch without $allowedentitynames runs in 3.9 µs / 40.6 MB
  • branch matching legacy behavior runs in 3.0 µs / 41.0 MB
  • On short or normal inputs the performance difference between the two is thus insignificant.

The existing implementation of esc_attr() runs a jumble of regular expression and other search passes over its input.

In this patch, if the site uses UTF-8, then an exploratory single-pass custom parser is used to escape the attribute values.
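
To illustrate what "single pass" means here, the following is a minimal sketch and not the code in this branch. The function name `sketch_esc_attr_single_pass`, the tiny `$allowed_names` list, and the chosen replacements are assumptions made for illustration. Hexadecimal references and UTF-8 validation are left out to keep it short.

```php
<?php
/**
 * Hypothetical sketch: escape an attribute value in one left-to-right scan.
 * Not the PR's implementation; names and the allow-list are assumptions.
 */
function sketch_esc_attr_single_pass( string $text ): string {
	$replacements  = array( '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&#039;' );
	$allowed_names = array( 'amp', 'lt', 'gt', 'quot', 'apos' ); // Illustrative subset only.
	$alnum         = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';

	$output = '';
	$at     = 0;
	$end    = strlen( $text );

	while ( $at < $end ) {
		// Copy everything up to the next byte that needs attention.
		$span    = strcspn( $text, '&<>"\'', $at );
		$output .= substr( $text, $at, $span );
		$at     += $span;
		if ( $at >= $end ) {
			break;
		}

		$char = $text[ $at ];
		if ( '&' !== $char ) {
			$output .= $replacements[ $char ];
			$at++;
			continue;
		}

		// An ampersand: keep it only if it starts an allowed named
		// reference or a decimal numeric reference ending in ";".
		$is_numeric = $at + 1 < $end && '#' === $text[ $at + 1 ];
		$body_at    = $is_numeric ? $at + 2 : $at + 1;
		$body_len   = strspn( $text, $is_numeric ? '0123456789' : $alnum, $body_at );
		$body       = substr( $text, $body_at, $body_len );
		$semi_at    = $body_at + $body_len;
		$terminated = $semi_at < $end && ';' === $text[ $semi_at ];

		if ( $body_len > 0 && $terminated && ( $is_numeric || in_array( $body, $allowed_names, true ) ) ) {
			$output .= substr( $text, $at, $semi_at - $at + 1 );
			$at      = $semi_at + 1;
			continue;
		}

		// Anything else is a bare ampersand and gets escaped.
		$output .= '&amp;';
		$at++;
	}

	return $output;
}

echo sketch_esc_attr_single_pass( 'a < b & c' ), "\n";      // prints "a &lt; b &amp; c"
echo sketch_esc_attr_single_pass( 'Tom &amp; Jerry' ), "\n"; // prints "Tom &amp; Jerry"
```

Each input byte is visited once, and the output is built by appending untouched spans between the characters that need escaping, rather than by running several search-and-replace passes over the whole string.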

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@dmsnell dmsnell force-pushed the experiment/single-pass-esc-attr branch 5 times, most recently from 947db01 to ede200b Compare October 1, 2023 06:56
@dmsnell dmsnell force-pushed the experiment/single-pass-esc-attr branch from e2db788 to 35cc667 Compare October 2, 2023 22:49
In order to clarify the main loop of `_esc_attr_single_pass_utf8` I've moved the
named character reference lookup outside of the function and into a new high-performance
token set class dubbed `WP_Token_Set`. I created this class to retain the performance
perks brought by the optimized data format.

There are two lookup sets, though, because WordPress traditionally has its own custom
set based on HTML4, but I would like to see us allow everything that HTML5 allows,
including the common `&apos;` so we don't have to keep writing `&#39;` (because
that doesn't stand out as clearly as the name does).

Performance in this change is even better than it was previously because I've removed
the substitutions from the lookup table and that removes both iteration and working
memory. In order to provide the reverse function, decoding these entities, it would
probably be best to create two separate tables, or add a fixed byte length and offset
value as a lookup into another table so that we can avoid reintroducing the double
crawling scan that we had before.
if ( self::KEY_LENGTH < $text_length ) {
	$group_key = substr( $text, $offset, self::KEY_LENGTH );

	if ( ! isset( $this->large_words[ $group_key ] ) ) {

Bug! If the group doesn't exist, that doesn't imply that the short word doesn't exist either. This should break but not return.
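
The fall-through described in this comment can be sketched roughly as follows. This is a hypothetical illustration, not the `WP_Token_Set` class from the branch: only `KEY_LENGTH` and `large_words` come from the snippet above, while `Sketch_Token_Set`, `small_words`, `read_token()`, and the key length of 2 are assumptions.

```php
<?php
/**
 * Hypothetical sketch of a grouped token lookup. Long words are grouped by
 * their first KEY_LENGTH bytes; short words live in their own table.
 */
final class Sketch_Token_Set {
	const KEY_LENGTH = 2; // Assumed value, for illustration only.

	/** @var array<string, string[]> Long words per group key, longest first. */
	private $large_words = array();

	/** @var array<string, true> Words no longer than KEY_LENGTH. */
	private $small_words = array();

	public function __construct( array $words ) {
		foreach ( $words as $word ) {
			if ( strlen( $word ) <= self::KEY_LENGTH ) {
				$this->small_words[ $word ] = true;
			} else {
				$this->large_words[ substr( $word, 0, self::KEY_LENGTH ) ][] = $word;
			}
		}

		// Longest first, so the longest match in a group wins.
		foreach ( $this->large_words as &$group ) {
			usort(
				$group,
				static function ( $a, $b ) {
					return strlen( $b ) - strlen( $a );
				}
			);
		}
		unset( $group );
	}

	/** Returns the longest known token starting at $offset, or null. */
	public function read_token( string $text, int $offset = 0 ): ?string {
		$text_length = strlen( $text ) - $offset;

		if ( self::KEY_LENGTH < $text_length ) {
			$group_key = substr( $text, $offset, self::KEY_LENGTH );

			// A missing group only rules out long words, so there is
			// no early return here: the loop simply has nothing to do.
			foreach ( $this->large_words[ $group_key ] ?? array() as $word ) {
				if ( substr( $text, $offset, strlen( $word ) ) === $word ) {
					return $word;
				}
			}
		}

		// Fall through to the short words, trying the longest first.
		for ( $length = min( self::KEY_LENGTH, $text_length ); $length > 0; $length-- ) {
			$candidate = substr( $text, $offset, $length );
			if ( isset( $this->small_words[ $candidate ] ) ) {
				return $candidate;
			}
		}

		return null;
	}
}

$set = new Sketch_Token_Set( array( 'lt', 'gt', 'amp', 'apos', 'not', 'notin' ) );
var_dump( $set->read_token( 'lt;' ) );    // string(2) "lt", found even though no "lt" group exists
var_dump( $set->read_token( 'notin;' ) ); // string(5) "notin"
```

With an early `return` in place of the fall-through, the `'lt;'` lookup would fail, because no long word starts with `lt` and the short-word table would never be consulted.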
