Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WP_HTML_Tag_Processor: Allow non-attribute lexical updates #47068

Closed
wants to merge 1 commit into from

Conversation

ockham
Copy link
Contributor

@ockham ockham commented Jan 11, 2023

What?

Part of #44410.

Allow updating of arbitrary HTML (rather than just tag attributes) by derived classes by making $lexical_updates protected, and adding some required logic.

Why?

This is needed in order to be able to implement methods like set_content_inside_balanced_tags(). See #47036 and #46680 (comment) for more context.

How?

As noted elsewhere:

[W]hile $attribute_updates is keyed by ("comparable", i.e. lowercase) attribute name, those keys are only relevant for set_attribute() (to check if an attribute of the same name already exists). They are however entirely ignored by apply_attributes_updates -- which is why we can use that mechanism for set_content_inside_balanced_tags without much modification 🎉

This PR thus:

These changes only affect existing private methods and members, so we're not breaking any APIs.

TODO

With "arbitrary" lexical updates, there's more potential for collisions:

  • There's a chance that existing bookmarks become invalid -- e.g. when an update overwrites an existing tag that had a bookmark attached. We thus need to add some logic to apply_lexical_updates that invalidates bookmarks, if necessary.
  • Even worse, there's a chance that lexical updates themselves collide. Imagine that in the aforementioned case, we have an attribute update for the tag that gets overwritten by the lexical update. The only way to avoid this might be to flush attribute updates before adding the lexical update (see). Edit: Or maybe this isn't really a problem after all, see WP_HTML_Tag_Processor: Allow non-attribute lexical updates #47068 (comment).

Testing Instructions

See unit tests.

@ockham ockham added the [Type] Experimental Experimental feature or API. label Jan 11, 2023
@ockham ockham self-assigned this Jan 11, 2023
@github-actions
Copy link

github-actions bot commented Jan 11, 2023

Flaky tests detected in 41a3919.
Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/4007383888
📝 Reported issues:

* @param string $text The replacement.
* @return void
*/
protected function add_lexical_update( $start, $end, $text ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to flush updates here to avoid collisions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also might need to change the signature to use bookmarks instead of offsets for $start and $end, to guarantee that they're moved correctly upon update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to flush updates here to avoid collisions.

I'd really like to avoid any possibility of adding colliding updates here. I think at that point we've exposed too much of the internal logic of the system or we have failed to maintain our internal consistency.

This is something I've worried about as we look at making sure get_attribute() returns the updated values for enqueued set_attribute() calls.

Put another way, the fact that we are raising these concerns might be more evidence that we're trying to stuff too much in when we could call get_updated_html() before some of these operations and then not worry about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I was being too pessimistic here. The example(s) I was thinking of all involved moving to a different tag (via next_tag() or seek()), but I had forgotten that we flush in those cases anyway, since our updates are per-tag.

Is it really that simple? Does the updating mechanism Just Work™️, even for more generic changes? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the updating mechanism Just Work™️, even for more generic changes?

Right now attribute updates have a name corresponding to the attribute name, so successive attribute updates overwrite previous ones. Other textual updates are appended without a name (so probably get some generic integer key).

If we add overlapping textual updates we could collide. If we add successive attribute updates they replace the earlier ones and things are safe.

On this note we might be able to run a variant of parse_next_attribute() on get_attribute() if we have updates enqueued for the given attribute. That could save us the complexity of tracking attribute updates separately in #46680

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now attribute updates have a name corresponding to the attribute name, so successive attribute updates overwrite previous ones.

Yeah. With the changes in this PR, we'd be retaining this behavior (which is happening entirely inside set_attribute). Eventually, when they're carried over to $lexical_updates by attribute_updates_to_lexical_updates, those keys are dismissed.

Other textual updates are appended without a name (so probably get some generic integer key).

Yeah -- I'm implementing this behavior more explicitly in this PR, where other textual updates go straight to $lexical_updates, which is a non-associative array.

If we add overlapping textual updates we could collide. If we add successive attribute updates they replace the earlier ones and things are safe.

I'm starting to think that things really might work in our favor -- at least as long as we only support set_content_inside_balanced_tags: By its nature, that's an operation that shouldn't collide with set_attribute when applied to the same tag, e.g. the section in the following:

<section id="foobar">
   <div>Text</div>
</section>

Right now I'm thinking to maybe just throw a ton of test coverage at the problem with every possible scenario I can think of, to see if I'm missing something 😅


One thing I'm pretty sure is that we will need to invalidate (i.e. release?) bookmarks. Anything that points to a tag that's overwritten by set_content_inside_balanced_tags is going to get lost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now I'm thinking to maybe just throw a ton of test coverage at the problem with every possible scenario I can think of, to see if I'm missing something 😅

Adding those to #47036, since that PR has the actual, public set_content_inside_balanced_tags method. (I've updated that PR to implement that method now using add_lexical_update from this PR.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this method – $this->lexical_updates[] = new WP_HTML_Text_Replacement( $start, $end, $text ); can be done in-place where its needed.

I'd really like to avoid any possibility of adding colliding updates here. I think at that point we've exposed too much of the internal logic of the system or we have failed to maintain our internal consistency.

+1 to that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this method – $this->lexical_updates[] = new WP_HTML_Text_Replacement( $start, $end, $text ); can be done in-place where its needed.

I'm exposing it (protected) for use by set_content_inside_bookmarks (which is a method of WP_HTML_Processor, which in turn is derived from WP_HTML_Tag_Processor). I don't mind removing the method, but it means I'll have to make $lexical_updates protected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ockham ockham force-pushed the rename/attribute-updates-to-lexical-updates branch from 08d1380 to 4ea8fa6 Compare January 12, 2023 13:03
@ockham ockham force-pushed the rename/attribute-updates-to-lexical-updates branch from 7f84941 to b24d6c7 Compare January 16, 2023 13:35
Base automatically changed from rename/attribute-updates-to-lexical-updates to trunk January 16, 2023 18:54
@ockham
Copy link
Contributor Author

ockham commented Jan 19, 2023

As I've now implemented bookmark invalidation for bookmarks that are "overwritten", I'll tentatively open this for review.

Note that since the method itself is private, I haven't added any unit test coverage here. Instead, please refer to #47036, where set_content_inside_balanced_tags is implemented using the add_lexical_update method from this PR, and I've added some unit test coverage for set_content_inside_balanced_tags in that PR.

@ockham ockham marked this pull request as ready for review January 19, 2023 17:58
@adamziel
Copy link
Contributor

The changes look good to me. Let's observe how we end up using this API – same as Dennis I am worried about the possibility of overlapping updates. I'll leave the final approval to @dmsnell as he clearly gave this code more though.

@ockham
Copy link
Contributor Author

ockham commented Jan 23, 2023

The changes look good to me. Let's observe how we end up using this API – same as Dennis I am worried about the possibility of overlapping updates.

Thank you, Adam! Yeah, that was my major point of concern as well -- and the ultimate goal for this PR to avoid them! 😅 Per this convo, I've tried to ascertain that mostly by adding test coverage to #47036 😊 LMK if you can think of any other scenarios that aren't covered by those tests 😄

@ockham ockham force-pushed the add/add-lexical-update branch 2 times, most recently from 05bc7fe to 41a3919 Compare January 25, 2023 15:25
@ockham ockham changed the title WP_HTML_Tag_Processor: Add add_lexical_update() method WP_HTML_Tag_Processor: Allow non-attribute lexical updates Jan 26, 2023
@dmsnell
Copy link
Member

dmsnell commented Jan 26, 2023

What does the addition of lexical_updates and attribute_updates buy us? We can already distinguish these because attribute updates have a string array key while non-attribute updates are appended with [] = , giving them numeric keys. I find the distinction a bit confusing.

It seems like we now have two PRs here: one which unlocks a private method to subclassing and one which addresses some accounting issues for bookmarks that are eliminated by updates. Would it take much work to extract the bookmarking PR? I understand that before this PR the bookmark issue isn't really an issue at all, but I don't think it would be wrong to add the safety into the system before it's needed, and the additional integer comparison checks shouldn't appreciably slow anything down, particularly since we're only ever theoretically taking the same branch until more work is made available, such as proposed in this PR

@ockham
Copy link
Contributor Author

ockham commented Jan 30, 2023

What does the addition of lexical_updates and attribute_updates buy us? We can already distinguish these because attribute updates have a string array key while non-attribute updates are appended with [] = , giving them numeric keys. I find the distinction a bit confusing.

For context, this PR originally also contained a add_lexical_update() method, which I removed per Adam's suggestion 😊

It seems like we now have two PRs here: one which unlocks a private method to subclassing and one which addresses some accounting issues for bookmarks that are eliminated by updates. Would it take much work to extract the bookmarking PR? I understand that before this PR the bookmark issue isn't really an issue at all, but I don't think it would be wrong to add the safety into the system before it's needed, and the additional integer comparison checks shouldn't appreciably slow anything down, particularly since we're only ever theoretically taking the same branch until more work is made available, such as proposed in this PR

I can file a PR to extract the bookmarking logic 👍

@adamziel adamziel mentioned this pull request Feb 9, 2023
26 tasks
@gziolo gziolo added the [Feature] Block API API that allows to express the block paradigm. label Feb 21, 2023
@ockham
Copy link
Contributor Author

ockham commented Feb 21, 2023

Rebased, now that the bookmark invalidation PR (#47559) has been merged.

@dmsnell
Copy link
Member

dmsnell commented Feb 21, 2023

with $this->lexical_updates already being protected, is this required now? I'm still wondering what separating attribute from lexical updates gain us that we can't already do, apart from needing to check in two places for an update rather than one.

also, how do you feel about moving this PR into WordPress/WordPress-develop?

@adamziel
Copy link
Contributor

adamziel commented Mar 2, 2023

Now I came around again @ockham @dmsnell .

The name lexical_update suggests the ability to apply a diff to any part of the markup. However, WP_HTML_Tag_Processor assumes that diffs are applied only to the currently processed tag.

I've been using this PR for the last week and struggled to:

  • Insert a tag or text before the currently processed tag
  • Replace the current tag's outer HTML

I'd add support for these use-cases before merging this PR. Otherwise, we're replacing a name that reflects the limitations reasonably well (attribute updates) with one that doesn't.

@dmsnell
Copy link
Member

dmsnell commented Mar 2, 2023

suggests the ability to apply a diff to any part of the markup

It shouldn't suggest this. It should only suggest that at the point we have the diff it's purely lexical, based on the text and void of semantic information about the structure of the document.

WP_HTML_Tag_Processor assumes that diffs are applied only to the currently processed tag

Does it though? It doesn't in apply_attribute_updates()? It does overlook diffs inside of seek(), which seems like it would be a problem, but one we could also resolve with the same bookmark delta/shifting algorithm employed in apply_attribute_updates().


For both of the situations you listed I kind of envisioned we could create a zero-width bookmark before a tag starts and use that as a reference. I have not tested this out in practice, but that bookmark wouldn't interfere with any syntax and it wouldn't be affected by set_outer upates.

@ockham
Copy link
Contributor Author

ockham commented Mar 6, 2023

with $this->lexical_updates already being protected, is this required now?

Probably not anymore!

I'm still wondering what separating attribute from lexical updates gain us that we can't already do, apart from needing to check in two places for an update rather than one.

Yeah, it would've been needed mostly if $lexical_updates were private.

There's one more reason I find somewhat compelling: For the update handling logic, the update array keys (which happen to be attribute names) are purely circumstantial. This PR basically reflects that updates can exist without such keys, and that attribute updates are basically just one class of such updates.

also, how do you feel about moving this PR into WordPress/WordPress-develop?

Happy to -- but would you say it's even still worthwhile pursuing? 😊

@ockham
Copy link
Contributor Author

ockham commented Mar 6, 2023

suggests the ability to apply a diff to any part of the markup

It shouldn't suggest this. It should only suggest that at the point we have the diff it's purely lexical, based on the text and void of semantic information about the structure of the document.

Yeah, that's how I read "lexical" as well 👍😊

WP_HTML_Tag_Processor assumes that diffs are applied only to the currently processed tag

Does it though? It doesn't in apply_attribute_updates()? It does overlook diffs inside of seek(), which seems like it would be a problem, but one we could also resolve with the same bookmark delta/shifting algorithm employed in apply_attribute_updates().

For both of the situations you listed I kind of envisioned we could create a zero-width bookmark before a tag starts and use that as a reference. I have not tested this out in practice, but that bookmark wouldn't interfere with any syntax and it wouldn't be affected by set_outer upates.

Seems like a reasonable approach to me 👍

@dmsnell
Copy link
Member

dmsnell commented Mar 6, 2023

the update array keys (which happen to be attribute names) are purely circumstantial

this wasn't circumstantial, because those keys were chosen so that subsequent updates would automatically replace earlier updates without us needing to track them.

that we can use numeric keys for non-attribute updates isn't entirely chance either. 😄

read "lexical"

the difference is "the ability to apply a diff to any part of the markup"

the rules are still supposed to be maintained about where an update can occur to an input document. that we've stripped those semantic rules away at the point we enqueue an update doesn't imply that those rules don't exist.

my point is that there's nothing in this class at all that should suggest that we can or should do arbitrary edits to the HTML. opening up the ability to do this is an unfortunate requirement of extending the class, and we should be on the lookout for code trying to apply diffs to "any part of the markup" and learn how we can better enforce the rules that it shouldn't be able to. never should it suggest you can change <div><p>Test</p></div> into <diTes>.

@adamziel
Copy link
Contributor

adamziel commented Mar 8, 2023

my point is that there's nothing in this class at all that should suggest that we can or should do arbitrary edits to the HTML.

To me, there still is. Lexical update only has a start and an end; nothing about it communicates these semantic rules. An "attribute update" may be just a WP_HTML_Text_Replacement as well, but the name tells me specifically what it is about. With a lexical update, I need to dig in the docstrings. That's not a deal breaker, but I find it a worse DX.

That being said, we need more than attribute updates here or else we'll reimplement the same logic in Directive processor, HTML processor, etc. Perhaps this PR is for the best then – we'll likely find ourselves adjusting the flushing logic to cater more use-cases anyway. In other words, I'm fine with this name change that I think is a bit misleading because I believe the behavior will change in a way I'll find more intuitive.

never should it suggest you can change

Test

into .

Oh I didn't mean that. I just meant changing the current <div> into <span>. I like the zero-width bookmark idea!

@dmsnell
Copy link
Member

dmsnell commented Mar 9, 2023

if we're getting "this allows arbitrary changes" out of "lexical updates" then I think we should scrap lexical updates, but find something better than "attribute updates"

"semantic_patches" ?

I'm still struggling with why having text-based diffs implies the rules go out the window. these aren't and never were semantic updates, and from the day we created attribute updates we said "it's up to anyone creating these to maintain the semantic rules" and clearly we haven't found a good name yet to communicate that.

$lexical_updates_that_must_respect_html_syntax_boundaries

@gziolo gziolo added [Feature] HTML API An API for updating HTML attributes in markup and removed [Feature] Block API API that allows to express the block paradigm. labels Apr 18, 2023
@ockham
Copy link
Contributor Author

ockham commented Jul 10, 2023

Closing this PR as obsolete, now that we have WordPress/wordpress-develop#4519.

@ockham ockham closed this Jul 10, 2023
@ockham ockham deleted the add/add-lexical-update branch July 10, 2023 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] HTML API An API for updating HTML attributes in markup [Type] Experimental Experimental feature or API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants