Dennis' list of broad and interesting things. #62437

Open
dmsnell opened this issue Jun 9, 2024 · 0 comments
Labels
[Feature] HTML API An API for updating HTML attributes in markup [Type] Enhancement A suggestion for improvement.


Overall values and goals.

  • Make processing occur in lazy, streamable, chunkable, single-pass, reentrant, and low-overhead ways.
  • Safety, reliability, and performance go hand-in-hand. Convenience is what leads people to use the better system.
  • No spec gets implemented without first asking "for what?" WordPress needs interfaces with well-defined behaviors where specifications meet real, observed need. It's not good enough to build a system in isolation: it must be designed so that developers find it more natural than the alternatives, which are probably unreliable, buggy, unsafe, and slow or bloated.
  • Calling code should be aware of all of the costs involved in executing its requests. Do not surprise callers with performance explosions. Prioritize latency, memory overhead, and raw throughput in that order.

Performance guidelines:

  • If it's not measured, it's neither faster nor slower.

    • Synthetic benchmarks are useless and should be rejected, especially when I propose them. Production code does not behave the way synthetic benchmarks behave.
    • Profile snapshots are almost useless in general and entirely useless for comparing performance. Profilers change the system so much that they dramatically shift where the runtime spends its time, and they can overreport changes or report runtime costs that shift instead of disappear. Only measurements from realistic systems and runs are valuable.
    • PHP lies about memory use in extensions. The only memory use worth believing is what Linux or the OS reports.
  • Modern CPUs are incredible machines. Take advantage of every abstraction leak. PHP does not run the way it looks like it should.

    • Optimizing cache reuse is king, and misuse can kill. Prefer algorithms which prioritize cache locality and reuse over ones with seemingly better complexity, within limits. Push changes upstream into PHP when possible.
    • Avoid allocations for as long as possible, until they become inevitable. Most are not bad, but surprise allocations can crash a system.
    • Avoid data dependencies when processing can be done in parallel. Many of PHP's built-in functions operate sequentially when they could fan out. While PHP doesn't expose this parallelism, if written in the right way, it's possible to get the CPU to do it for us.
  • Defer where possible.

    • If it's not required to parse an entire document or build a full parse tree all at once, don't. Stick to step() or next_thing() functions which communicate where they find their match and how long the match is. These functions can appear inside a loop to do a full parse, but they can also be used for finding the first of a thing in a document, or for analyzing a document with low overhead.
    • Make everything lazy where possible. This often implies creating new semantic classes for things that are more usually an array(). This carries the added benefit that it's possible to add semantics and avoid pushing out internal details to all of the call sites for a given thing. For example, WP_HTML_Decoder::attribute_starts_with() is much more efficient than str_starts_with( WP_HTML_Decoder::decode( ) ) because it stops parsing as soon as it finds the given prefix or asserts that it cannot be there. This can save processing and allocating megabytes of data when applied on data URLs which are the src of images pasted from other applications.
  • Static structures are much faster than array(), and they provide inline documentation too!
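
The early-exit idea behind WP_HTML_Decoder::attribute_starts_with() can be illustrated with a small standalone sketch. This is not the WordPress implementation: decoded_starts_with() is a hypothetical helper, and it assumes only four named character references exist. It decodes one unit at a time and stops as soon as the prefix is confirmed or ruled out, so a multi-megabyte data URL is never decoded in full.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reports whether the decoded
 * form of $encoded starts with $prefix, decoding only as much as needed.
 * Assumption: only the four character references below are recognized.
 */
function decoded_starts_with( string $encoded, string $prefix ): bool {
    static $references = array(
        '&amp;'  => '&',
        '&lt;'   => '<',
        '&gt;'   => '>',
        '&quot;' => '"',
    );

    $at         = 0; // Byte offset into the encoded input.
    $matched    = 0; // Bytes of the prefix confirmed so far.
    $prefix_len = strlen( $prefix );
    $end        = strlen( $encoded );

    while ( $matched < $prefix_len ) {
        if ( $at >= $end ) {
            return false; // The input ran out before the prefix did.
        }

        // By default one encoded byte stands for itself.
        $decoded = $encoded[ $at ];
        $step    = 1;

        // A recognized character reference decodes to a single byte.
        if ( '&' === $decoded ) {
            foreach ( $references as $name => $char ) {
                if ( 0 === substr_compare( $encoded, $name, $at, strlen( $name ) ) ) {
                    $decoded = $char;
                    $step    = strlen( $name );
                    break;
                }
            }
        }

        if ( $decoded !== $prefix[ $matched ] ) {
            return false; // Mismatch: stop without touching the rest.
        }

        $matched += 1;
        $at      += $step;
    }

    return true;
}
```

The loop body runs at most once per byte of the prefix, so the cost depends on the prefix length and not on the length of the attribute value.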

Block Parser

Replace the default everything-at-once block parser with a lazy low-overhead parser.

  • next_delimiter() as a low-level utility. [#6760]
  • A unified block parser. [#6381]
  • Lazy parser as a replacement for Core's existing parser. [#5705]

The current block parser has served WordPress well, but it must parse the entire document into a block tree in memory all at once, and it's not particularly efficient. In one damaged post that was 3 MB in size, it took 14 GB to fully parse the document. This should not happen.

Core needs to be able to view blocks in isolation and only store in memory as much as it needs to properly render and process blocks. The need for less block structure has been highlighted by projects and needs such as:

  • Block Hooks only need to find specific blocks and lexically insert content before, after, or inside them.
  • Many features only examine block attributes and determine if a single JSON attribute is present, then apply a CSS class to the rendered HTML. They don't need to know about inner block structure or the HTML beyond what's rendered.
  • Many features analyze a document or look for the first block matching a given query. There's no need to load anything into memory beyond the block under inspection.
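
A rough sketch of what such a low-overhead scan could look like follows. It is not the code from the linked pull requests: Delimiter_Match, next_block_opener(), and find_first_block() are hypothetical names, and the sketch assumes that "-->" never appears inside a delimiter's JSON attributes. Each call reports where an opening delimiter starts and how long it is, so a caller can stop at the first match without building a block tree.

```php
<?php
/** Hypothetical value object: a small typed structure instead of an array(). */
final class Delimiter_Match {
    public string $block_name;
    public int $offset;
    public int $length;

    public function __construct( string $block_name, int $offset, int $length ) {
        $this->block_name = $block_name;
        $this->offset     = $offset;
        $this->length     = $length;
    }
}

/**
 * Finds the next opening (or void) block delimiter at or after $offset.
 * Assumption: "-->" does not occur inside the delimiter's JSON attributes.
 */
function next_block_opener( string $html, int $offset = 0 ): ?Delimiter_Match {
    $pattern = '~<!--\s+wp:([a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?)[\s/]~';

    if ( 1 !== preg_match( $pattern, $html, $m, PREG_OFFSET_CAPTURE, $offset ) ) {
        return null;
    }

    $start = $m[0][1];
    $close = strpos( $html, '-->', $start );
    if ( false === $close ) {
        return null; // Unterminated comment: nothing more to report.
    }

    return new Delimiter_Match( $m[1][0], $start, $close + 3 - $start );
}

/** Steps through openers until one has the requested name, then stops. */
function find_first_block( string $html, string $block_name ): ?Delimiter_Match {
    $at = 0;
    while ( null !== ( $match = next_block_opener( $html, $at ) ) ) {
        if ( $block_name === $match->block_name ) {
            return $match;
        }
        $at = $match->offset + $match->length;
    }
    return null;
}
```

The same stepping function could sit inside a loop to perform a full parse, which is the appeal of the design: the cheap case and the complete case share one primitive.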

Block API

  • Add the ability to read block attributes on the server which are "sourced" from the block's HTML, as described in a block.json file. [#6388]

Block Hooks

  • Optimize block hooks by avoiding a full block-tree parse. [#5753]

HTML API

Overall Roadmap for the HTML API

There is no end in sight to the development of the HTML API, but development work largely falls into two categories: developing the API itself; and rewriting Core to take advantage of what the HTML API offers.

Further developing the HTML API.

New features and functionality.

  • Introduce safe-by-default HTML templating. [#5949]

    • For creating just a tag and its markup. [#5884]
  • Properly parse and normalize URLs. [#6666]

  • Introduce Bits, for server-replacement of dynamic tokens. [Make, Discussion]

    • We never lost the need for Shortcodes, but we didn't have the mechanisms to bring them back safely.

Encoding and Decoding of Text Spans

There is so much in Core that would benefit from clarifying all of these boundaries, or from creating a clear point of demarcation between encoded and decoded content.

  • Provide a normal set of string functions that operate on the raw encoded HTML, making replacements and updates possible without rewriting the entire thing. So far there's attribute_starts_with() which is akin to str_starts_with() but only for attributes.

Decoding GET and POST args.

There is almost no consistency in how code decodes the values from $_GET and $_POST. Yet, there is and can be incredible confusion over some basic transformations that occur:

  • In what character set are the arguments encoded?
  • If they are percent-escaped, are they percent-escaping UTF-8?
  • Are there HTML character references in the values?
  • Are slashes escaped?

Prior art

The HTML API can help here in coordination with other changes in core. Notably:

  • Declare that UTF-8 is the only acceptable character set for inbound arguments.
  • This should demand that all FORM elements add the accept-charset="utf-8" attribute, which overrides a user-preferred charset for a webpage (meaning that this is still necessary even if the <meta charset=utf-8> tag is present).

With these new specifications, the HTML API can ensure that whatever is decoded from $_GET and $_POST is what was intended to be communicated from a browser or other HTTP request. In addition, it can provide helpers not present with existing WordPress idioms, like default values.

$search_query = request_arg( 'GET', 'q' );
$search_index = request_arg( 'GET', 'i', 'posts' );
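
One possible shape for such a helper is sketched below. The issue only shows the call sites, so the body is an assumption: this version accepts only string values, rejects anything that is not valid UTF-8, and falls back to the default in both cases. It does not address the questions above about slashes or character references.

```php
<?php
/**
 * Hypothetical sketch of request_arg(); the signature is inferred from the
 * usage above and the validation rules are assumptions.
 */
function request_arg( string $method, string $name, ?string $default = null ): ?string {
    $source = 'POST' === strtoupper( $method ) ? $_POST : $_GET;

    // Missing values and non-strings (such as `q[]=1` arrays) use the default.
    if ( ! isset( $source[ $name ] ) || ! is_string( $source[ $name ] ) ) {
        return $default;
    }

    $value = $source[ $name ];

    // An empty pattern with the `u` flag only matches valid UTF-8 subjects.
    if ( 1 !== preg_match( '//u', $value ) ) {
        return $default;
    }

    return $value;
}
```

Whether invalid input should silently fall back to the default or raise an error is a design decision this sketch does not settle.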

Rewriting Core to take advantage of the HTML API.

Big Picture Changes

  • Create a final pass over the fully-rendered HTML for global filtering and processing. [#5662]

    • Core currently runs many different processing stages on the output, but each processing stage runs over the full contents, and often those contents are processed repeatedly as strings are stitched together and passed around. A final global HTML filter powered by the HTML API could give an opportunity for all of these processing stages to run only once and they could all run together while traversing the document, for a single pass through the HTML that minimizes allocations and duplicated work.
  • Mandate HTML5 and UTF-8 output everywhere. [#6536]

    • Character encodings are blurry and confused all throughout Core. HTML5 and UTF-8 are everywhere so WordPress could simplify much of its logic if it converts other formats at the boundaries. Instead of loading content from a database table into memory as raw bytes, or as text in whatever encoding it happens to be stored, all database queries should request the UTF-8 encoding from the database. All input should be validated as UTF-8.
    • Every theme should assume HTML5 output and stop sending any <meta charset="…"> other than UTF-8. All escaping and encoding should occur as needed for HTML5. XML parsing, encoding, and decoding must take a completely different path. [See the section on the XML API].
  • Create a new fundamental Search infrastructure for WordPress.

    • Search should default to only searching text nodes in posts. It shouldn't return matches on HTML syntax or attribute values; it shouldn't return matches for block comment delimiters.
    • Search should be fast and relevant. A new search index will likely need to be maintained. This can be built on top of the sync state table from the Sync Protocol.

Confusion of encoded and decoded text.

There's a dual nature to encoded text in HTML. WordPress itself frequently conflates the encoded domain and the decoded domain.

Consider, for example, wp_space_regexp(), which by default returns the following pattern: [\r\n\t ]|\xC2\xA0|&nbsp;. There are multiple things about this pattern that reflect the legacy of conflation:

  • The pattern is checking for the UTF-8 bytes 0xC2 0xA0, which correspond to the non-breaking space (U+00A0). It also checks for &nbsp;. So if the text is encoded we may find either, but if the text is decoded then this pattern will erroneously match on &nbsp; which presumably started as &amp;nbsp; and might have been someone trying to write about the non-breaking space.
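
The failure mode described above can be reproduced without WordPress. The sketch below copies the pattern quoted earlier into a local variable (it does not call wp_space_regexp() itself) and applies it to text that has already been decoded.

```php
<?php
// The pattern quoted above, copied here so the sketch runs without WordPress.
$space_pattern = '[\r\n\t ]|\xC2\xA0|&nbsp;';

// In the HTML source, an author writing *about* the entity has to escape it.
$encoded = 'Type &amp;nbsp; to insert a non-breaking space.';

// One decoding pass turns "&amp;nbsp;" into the literal text "&nbsp;".
$decoded = html_entity_decode( $encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Collapsing "whitespace" in the decoded text now swallows the author's words.
$collapsed = preg_replace( '/(?:' . $space_pattern . ')+/', ' ', $decoded );

echo $decoded, "\n";   // Type &nbsp; to insert a non-breaking space.
echo $collapsed, "\n"; // Type to insert a non-breaking space.
```

The literal "&nbsp;" the author intended to show has vanished, because the pattern cannot tell which domain its input is in.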

Parsing and performance.

In addition to confused and corrupted content, Core also stands to make significant performance improvements by adopting the values of the HTML API and the streaming parser interfaces. Some functions are themselves extremely susceptible to catastrophic backtracking or memory bloat.

  • convert_smilies(). [#6762]

    • This can crash a page through PCRE backtracking.
    • This change highlights the need to perform string search and replace in place without extracting, decoding, modifying, encoding, and replacing.
    • The unit tests fail because every character reference is decoded in the process of updating the smilies.
  • force_balance_tags(). [#5562]

    • The HTML Processor provides a way to obviate this function with a new normalize() method for constructing fully-normative HTML. But even this may not be necessary given the fact that the HTML Processor can properly navigate through a document structurally.
  • wp_html_split(). [#6651]

  • wp_kses_hair() and friends. [#6572]

    • These functions attempt to parse HTML attributes and return structural results, but the results are in a custom array format. It would be nice if we could pass around HTML API processors and interact directly with the HTML structure.
  • wp_replace_in_html_tags(). [#6651]

    • Is this function even necessary anymore? It seems dangerous and its name is confusing. It attempts to perform string replace operations inside of HTML tag tokens.
  • wp_strip_tags().

  • wp_strip_all_tags(). [#6196]

  • wp_targeted_link_rel(). [#5590]

    • This relies on exposing the "link HTML" to filters, which corresponds to the part of the opening tag where the attributes are.
    • It relies on custom HTML parsing, wp_kses_hair(), and passes around PCRE results.
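
The in-place search and replace mentioned above for convert_smilies() might look something like the following minimal sketch. replace_in_place() is a hypothetical helper, not a Core function, and it assumes the input is a single text span with no tags in it. It scans the raw encoded text, copies untouched bytes through verbatim, and never decodes or re-encodes character references.

```php
<?php
/**
 * Hypothetical helper: replaces needles in raw encoded text in one pass.
 * Assumption: $text is a text span only (no tags), and needles are non-empty.
 */
function replace_in_place( string $text, array $replacements ): string {
    $output = '';
    $at     = 0;
    $end    = strlen( $text );

    while ( $at < $end ) {
        // Find whichever needle occurs nearest to the current offset.
        $next_at     = null;
        $next_needle = null;
        foreach ( $replacements as $needle => $unused ) {
            $needle = (string) $needle;
            if ( '' === $needle ) {
                continue;
            }
            $found = strpos( $text, $needle, $at );
            if ( false !== $found && ( null === $next_at || $found < $next_at ) ) {
                $next_at     = $found;
                $next_needle = $needle;
            }
        }

        if ( null === $next_at ) {
            break; // No more matches: the remainder is copied below.
        }

        // Copy the untouched span verbatim, then the replacement.
        $output .= substr( $text, $at, $next_at - $at ) . $replacements[ $next_needle ];
        $at      = $next_at + strlen( $next_needle );
    }

    return $output . substr( $text, $at );
}

echo replace_in_place( 'Tom &amp; Jerry :) &lt;3', array( ':)' => '🙂' ) ), "\n";
// Tom &amp; Jerry 🙂 &lt;3
```

A real implementation would also need to avoid matching across the end of a character reference: the trailing ";" of "&amp;" followed by ")" looks like a ";)" smiley in the raw bytes.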

Database

Sync Protocol

WordPress needs the ability to reliably synchronize data with other WordPresses and internal services. This depends on having two things:

  • A secure two-way communication channel from WordPress to WordPress, where one is publicly reachable. This is to be accomplished via HTTP long-polling.
  • A sync state tracking table that stores a vector clock for each database row and table. It tracks the last-confirmed clock value for each resource in every connected sync target.
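
To make the bookkeeping concrete, here is a heavily simplified in-memory sketch. The Sync_State class and its method names are hypothetical, and it substitutes a single monotonic counter per resource for the vector clock described above; a real design would also persist this in a database table.

```php
<?php
/**
 * Hypothetical, simplified sync state tracker.
 * Assumption: one integer counter per resource stands in for a vector clock.
 */
final class Sync_State {
    /** @var array<string,int> Current clock value for each resource. */
    private array $clock = array();

    /** @var array<string,array<string,int>> Last-confirmed values per target. */
    private array $confirmed = array();

    /** Records a local change to a resource and returns its new clock value. */
    public function touch( string $resource ): int {
        $this->clock[ $resource ] = ( $this->clock[ $resource ] ?? 0 ) + 1;
        return $this->clock[ $resource ];
    }

    /** Records that a target has acknowledged a resource at a clock value. */
    public function confirm( string $target, string $resource, int $clock_value ): void {
        $previous = $this->confirmed[ $target ][ $resource ] ?? 0;

        $this->confirmed[ $target ][ $resource ] = max( $previous, $clock_value );
    }

    /** Lists resources whose clock is ahead of what the target confirmed. */
    public function pending( string $target ): array {
        $pending = array();
        foreach ( $this->clock as $resource => $value ) {
            if ( $value > ( $this->confirmed[ $target ][ $resource ] ?? 0 ) ) {
                $pending[] = $resource;
            }
        }
        return $pending;
    }
}
```

A target here could be a remote WordPress or a local consumer such as a search index or page cache, since either only needs to ask which resources have moved past the value it last confirmed.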

While this works to synchronize resources between WordPresses, it also serves interesting purposes within a single WordPress, for any number of processes that rely on invalidating data or caches:

  • A search index can be kept in sync with this. This includes not only the text searches, but also an index of which posts contain which block types.
  • Static page caches could rely on this.
  • Cleanup and migration work can use this to efficiently pause and resume migration work.

XML API

Overall Roadmap for the XML API

While less prominent than the HTML API, WordPress also needs to reliably read, modify, and write XML. XML handling appears in a number of places:

  • Parsing pingbacks and inbound XML-based API calls.
  • Handling XML-RPC requests.
  • Writing WXR exports.
  • Reading WXR imports.
  • Generating RSS feeds.
@dmsnell dmsnell mentioned this issue Jun 10, 2024
10 tasks
@jordesign jordesign added [Type] Enhancement A suggestion for improvement. [Feature] HTML API An API for updating HTML attributes in markup labels Jun 11, 2024