mirrored from git://develop.git.wordpress.org/
Experiment: Run through esc_attr() in a single optimized pass. #5337
Draft: dmsnell wants to merge 10 commits into `WordPress:trunk` from `dmsnell:experiment/single-pass-esc-attr`
dmsnell force-pushed the `experiment/single-pass-esc-attr` branch 5 times, most recently from `947db01` to `ede200b` (October 1, 2023 06:56)
The existing implementation of `esc_attr()` runs a jumble of regular expression and other search passes over its input. In this patch, if the site uses UTF-8, then an exploratory single-pass custom parser is used to escape the attribute values.
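To make the idea of a single pass concrete, here is a minimal sketch, not the PR's `_esc_attr_single_pass_utf8()`. The function name `sketch_escape_attr_single_pass` is hypothetical, and the sketch makes simplifying assumptions: it performs no UTF-8 validation and always re-encodes `&`, so existing character references would be double-escaped.

```php
<?php
/**
 * Hypothetical sketch: escape an attribute value in one left-to-right pass.
 * Assumptions: no UTF-8 validation, and every "&" is re-encoded.
 */
function sketch_escape_attr_single_pass( string $text ): string {
	static $replacements = array(
		'&' => '&amp;',
		'<' => '&lt;',
		'>' => '&gt;',
		'"' => '&quot;',
		"'" => '&#039;',
	);

	$output = '';
	$at     = 0;
	$end    = strlen( $text );

	while ( $at < $end ) {
		// Jump over the run of bytes that need no escaping.
		$safe_length = strcspn( $text, '&<>"\'', $at );
		$output     .= substr( $text, $at, $safe_length );
		$at         += $safe_length;

		// If the run stopped before the end, it stopped on a special byte.
		if ( $at < $end ) {
			$output .= $replacements[ $text[ $at ] ];
			++$at;
		}
	}

	return $output;
}
```

Each input byte is visited once: `strcspn()` finds the next byte of interest and the loop copies everything before it verbatim, instead of running several independent search-and-replace passes over the whole string.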
dmsnell force-pushed the `experiment/single-pass-esc-attr` branch from `e2db788` to `35cc667` (October 2, 2023 22:49)
In order to clarify the main loop of `_esc_attr_single_pass_utf8` I've moved the named character reference lookup outside of the function and into a new high-performance token set class dubbed `WP_Token_Set`. I created this class to retain the performance perks brought by the optimized data format.

There are two lookup sets, though, because WordPress traditionally has its own custom set based on HTML4. I would like to see us allow everything that HTML5 allows, including the common `&apos;`, so we don't have to keep writing `&#39;` (because that doesn't stand out as clearly as the name does).

Performance in this change is even better than it was previously because I've removed the substitutions from the lookup table, which removes both iteration and working memory. In order to provide the reverse function, decoding these entities, it would probably be best to create two separate tables, or to add a fixed byte length and offset value as a lookup into another table, so that we can avoid reintroducing the double crawling scan that we had before.
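As an illustration of what a prefix-grouped token lookup can look like, here is a minimal sketch. It is not the PR's `WP_Token_Set`; the names `SKETCH_KEY_LENGTH`, `sketch_pack_groups`, and `sketch_read_token` are hypothetical, and the packed format (one length byte followed by the suffix) is an assumption chosen for brevity. It stores only the words themselves, with no substitutions, and it ignores words shorter than the key length.

```php
<?php
const SKETCH_KEY_LENGTH = 2;

/**
 * Hypothetical sketch: group words by their first SKETCH_KEY_LENGTH bytes and
 * pack each group's suffixes into one string of "length byte + suffix" records.
 * Assumes suffixes are shorter than 256 bytes.
 */
function sketch_pack_groups( array $words ): array {
	// Longest first, so the first match found in a group is the longest one.
	usort( $words, static function ( $a, $b ) {
		return strlen( $b ) <=> strlen( $a );
	} );

	$groups = array();
	foreach ( $words as $word ) {
		if ( strlen( $word ) < SKETCH_KEY_LENGTH ) {
			continue; // This sketch keeps no table for shorter words.
		}
		$key            = substr( $word, 0, SKETCH_KEY_LENGTH );
		$suffix         = substr( $word, SKETCH_KEY_LENGTH );
		$groups[ $key ] = ( $groups[ $key ] ?? '' ) . chr( strlen( $suffix ) ) . $suffix;
	}
	return $groups;
}

/** Returns the longest packed word found at $offset in $text, or null. */
function sketch_read_token( array $groups, string $text, int $offset = 0 ): ?string {
	$key = substr( $text, $offset, SKETCH_KEY_LENGTH );
	if ( ! isset( $groups[ $key ] ) ) {
		return null; // Safe only because this sketch has no short-word table.
	}

	$group = $groups[ $key ];
	$at    = 0;
	$end   = strlen( $group );
	while ( $at < $end ) {
		$length = ord( $group[ $at ] );
		$suffix = substr( $group, $at + 1, $length );
		if ( substr( $text, $offset + SKETCH_KEY_LENGTH, $length ) === $suffix ) {
			return $key . $suffix;
		}
		$at += 1 + $length;
	}
	return null;
}
```

A possible use: `sketch_read_token( sketch_pack_groups( array( 'not;', 'notin;' ) ), 'notin; x' )` scans only the `no` group and returns `notin;`, because the longer suffix is stored first.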
dmsnell commented Mar 6, 2024
    if ( self::KEY_LENGTH < $text_length ) {
        $group_key = substr( $text, $offset, self::KEY_LENGTH );

        if ( ! isset( $this->large_words[ $group_key ] ) ) {
Bug! If the group doesn't exist, that doesn't also imply that the short word doesn't exist. This should `break` but not `return`.
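The control flow this comment asks for can be sketched as follows. This is a hypothetical illustration, not the PR's code: the class name `Sketch_Fallthrough_Token_Set` and the `small_words` property are assumptions, and the groups are plain arrays rather than the PR's optimized format. The point is only that a missing group rules out the long words, so the lookup falls through to the short words instead of returning.

```php
<?php
/**
 * Hypothetical sketch of the fall-through control flow only.
 */
final class Sketch_Fallthrough_Token_Set {
	const KEY_LENGTH = 2;

	/** @var array<string, string[]> Group key => full words, longest first. */
	private $large_words;

	/** @var string[] Words of at most KEY_LENGTH bytes, longest first. */
	private $small_words;

	public function __construct( array $large_words, array $small_words ) {
		$this->large_words = $large_words;
		$this->small_words = $small_words;
	}

	/** Returns the word found at $offset in $text, or null. */
	public function read_token( string $text, int $offset = 0 ): ?string {
		$text_length = strlen( $text ) - $offset;

		if ( self::KEY_LENGTH < $text_length ) {
			$group_key = substr( $text, $offset, self::KEY_LENGTH );

			// A missing group yields an empty list: no long word can match.
			foreach ( $this->large_words[ $group_key ] ?? array() as $word ) {
				if ( substr( $text, $offset, strlen( $word ) ) === $word ) {
					return $word;
				}
			}
			// Do not return here: a short word may still match below.
		}

		foreach ( $this->small_words as $word ) {
			if ( substr( $text, $offset, strlen( $word ) ) === $word ) {
				return $word;
			}
		}

		return null;
	}
}
```

With `large_words` set to `array( 'am' => array( 'amp;' ) )` and `small_words` set to `array( 'lt', 'gt' )`, reading from `'lt and more'` finds no `lt` group yet still returns `lt`; an early `return` in the missing-group branch would have reported no match.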
Trac ticket: Core-60841
Important note

In a test I had `esc_attr()` return its given input and do no processing, and I could measure no impact on page runtime, at least nothing significant, so this PR is entirely exploratory and more focused now on:

Benchmark on `single-page.html`, a 12 MB test document:

- `trunk` runs in 90 ms using 79.3 MB of memory.
- `$allowedentitynames` runs in 184 ms using 73.8 MB of memory.

That is roughly twice as slow (184 ms against 90 ms) but uses noticeably less memory. Since calling everything but `esc_attr()` consumes 41 MB, we can estimate that the comparison is 39 MB against 34 MB, or that this branch uses only 86% as much as `trunk`.

Benchmark on a short string more typical of a real value:

- `trunk` runs in 4.0 µs / 40.5 MB
- `$allowedentitynames` runs in 3.9 µs / 40.6 MB

The existing implementation of `esc_attr()` runs a jumble of regular expression and other search passes over its input. In this patch, if the site uses UTF-8, then an exploratory single-pass custom parser is used to escape the attribute values.
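Timing and peak-memory figures of the kind reported above could be collected with a small harness along these lines. This is a sketch under assumptions, not the method used for the numbers in this PR: the function name `sketch_benchmark` and the iteration count are hypothetical, and `htmlspecialchars` in the usage note is only a stand-in for the function under test.

```php
<?php
/**
 * Hypothetical micro-benchmark: average time per call and process peak memory.
 * Requires PHP 7.3+ for hrtime().
 */
function sketch_benchmark( callable $fn, string $input, int $iterations = 1000 ): array {
	$fn( $input ); // Warm up so one-time setup costs are not timed.

	$start = hrtime( true ); // Monotonic clock, in nanoseconds.
	for ( $i = 0; $i < $iterations; $i++ ) {
		$fn( $input );
	}
	$elapsed_ns = hrtime( true ) - $start;

	return array(
		'us_per_call' => $elapsed_ns / $iterations / 1000,
		// Peak for the whole process, so it includes everything else loaded.
		'peak_mb'     => memory_get_peak_usage() / 1024 / 1024,
	);
}
```

For example, `sketch_benchmark( 'htmlspecialchars', 'a < b & "c"' )` returns an array with `us_per_call` and `peak_mb`. Because the memory figure covers the whole process, a baseline run with a no-op function is needed to isolate the share attributable to the function under test, which is the kind of subtraction the 41 MB estimate above performs.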
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.