
Use TinyMCE utilities to parse post content #426

Merged 15 commits into master on Apr 19, 2017

Conversation


@nylen nylen commented Apr 14, 2017

This PR is intended to start addressing #391 and is also closely related to the discussion at #413. I've replaced the PegJS grammar parsing with a way of parsing post content into a tree using TinyMCE's parsing utilities.

TinyMCE already runs its internal parsing utilities in our implementation whenever setContent is called, so this shouldn't cause issues related to multiple parsing implementations having conflicting behavior.

For now, to preserve compatibility with existing blocks, this code parses the content, serializes blocks to strings, then passes these strings off to the individual blocks, much like what happened before.
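
For orientation, here is a minimal sketch of the round-trip described above, using TinyMCE's tinymce.html.Schema, DomParser, and Serializer utilities (postContent and the block hand-off are placeholders, not the PR's actual code):

// Minimal sketch, assuming the global tinymce bundled with WordPress (as
// discussed further down): parse the post into a node tree, then serialize
// each top-level chunk back to an HTML string for the matching block.
const schema = new tinymce.html.Schema();
const parser = new tinymce.html.DomParser( {}, schema );
const serializer = new tinymce.html.Serializer( {}, schema );

const tree = parser.parse( postContent ); // postContent: raw post HTML (placeholder)

let node = tree.firstChild;
while ( node ) {
    // Hand the serialized chunk off to the block implementation, much like
    // the PegJS parser handed off strings before (hand-off is a placeholder).
    const rawContent = serializer.serialize( node );
    node = node.next;
}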

Things that need to be done so that this PR does not break existing functionality:

  • Fix the tests
  • Send chunks of HTML content (not wrapped in comment delimiters) to the freeform block

Next steps using this work as a foundation:

  • Find a less-hacky way to parse our comment delimiters (we will either need to extend/modify parts of TinyMCE's parser, which doesn't currently appear to be possible, or use another library).
  • Make blocks accept a tree representation instead of a string. This can be opt-in at first, until all blocks are converted to the new format, then we can remove the ability for a block to accept a content string. (Part of this task should be figuring out an alternative method to TinyMCE's setContent that accepts a tree rather than a string; this will mean that we are only parsing content once.)
  • Implement schemas for the kinds of markup that blocks can accept, rather than simply querying the DOM for attributes.
  • Validate these schemas against the markup received for a block. Fall back to freeform (or the next block that validates against the schema?) if there is unexpected markup present.
  • Use the same schema mechanism to identify sections of existing post content that represent valid blocks.

@nylen nylen requested review from youknowriad and aduth April 14, 2017 14:29
@nylen nylen force-pushed the update/use-tinymce-parsing branch from e4087e3 to 11dd68c on April 17, 2017 at 12:47

nylen commented Apr 17, 2017

There is a lot that can be significantly improved here, but this PR has reached parity with the existing parsing code.


@aduth aduth left a comment


A bit dense to digest (as parsing tends to be), but this reads and looks to work quite well 👍 What would you say is the main advantage of this method of parsing vs. the current PegJS grammar? Consistency with TinyMCE behavior of the Editable component? Is the use of TinyMCE for parsing utilities merely for convenience since it's already available?

Just trying to clarify boundaries between the role of TinyMCE and the concept of a block generally i.e. a post could consist of blocks of which none depend on TinyMCE, at least in the current implementation.

Implement schemas for the kinds of markup that blocks can accept, rather than simply querying the DOM for attributes.

Do you think there's a path here where we have a query API like the one that exists currently, but operating on the node produced by TinyMCE's parser? Trying to think of ways we can balance requirements of parsing robustness, flexibility of supporting variations of markup (e.g. with/without caption, citation), and developer experience generally.

If not, what else do you have in mind? Fine to defer this discussion to a separate issue also. Maybe composing an object by hooking into specific node types in a parser stream? Accessing values on the parsed tree à la XPath / E4X? The latter is a neat case since React's (JSX's) object-to-HTML model is partly based on E4X, so this could serve as a natural completion of the other half (HTML-to-object). This last link also leads to JXON, which has some ideas around representing XML as a JavaScript object, which could be combined with a Lodash _.get-like path lookup for transforming HTML to an object.
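
To make the _.get-like idea concrete, a loose sketch (the object shape and paths below are purely illustrative, not a proposed format):

// Illustration only: a JXON-ish plain-object form of some markup, queried
// with a Lodash-style path lookup.
import { get } from 'lodash';

const figure = {
    img: { src: 'https://example.com/image.jpg', alt: 'Alternative text.' },
    figcaption: 'This is some caption.',
};

get( figure, 'img.src' );    // 'https://example.com/image.jpg'
get( figure, 'figcaption' ); // 'This is some caption.'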

);

// Create a custom HTML schema
const schema = new tinymce.html.Schema();
Member

I forget why we chose to reference the global directly for Editable, but do you think it'd make sense to import TinyMCE as a module here to avoid the need for ProvidePlugin in the tests? I suppose in order to avoid including TinyMCE into the generated bundle we'd still need to modify the Webpack configuration to externalize TinyMCE (as we do with React currently).

Member Author

I assumed the idea behind referencing the global directly in Editable was that as a plugin we wanted to use the TinyMCE version that's bundled with WordPress. I wanted to preserve that behavior here.

We could import it as a module here, but as you say, it looks like we'll need to modify the Webpack configuration either way, and I'd prefer to keep usage consistent in the code (currently, use the global version).

Member

I assumed the idea behind referencing the global directly in Editable was that as a plugin we wanted to use the TinyMCE version that's bundled with WordPress.

Right, and we can do that with Webpack externals. I'm fine to leave this for a separate pull request.

// In : <!-- wp:core/embed url:youtube.com/xxx& -->
// Out : <wp-block slug="core/embed" attributes="url:youtube.com/xxx&amp;">
content = content.replace(
/<!--\s*(\/?)wp:([a-z0-9/-]+)((?:\s+[a-z0-9_-]+:[^\s]+)*)\s*-->/g,
Member

TIL about ?:. Handy.

Do you have any thoughts on if and how we would support spaces in an attribute value?

Member Author

Yes - on that front, here is what I would like to see happen next:

  • Make it possible to extend TinyMCE's parser to recognize our comment delimiters. Get rid of this hack to turn them into temporary tags.
  • Change the attribute format in our comment delimiters to match regular HTML tags: <!-- wp:core/block attr="something with spaces" -->. This is more standard and will be much easier for TinyMCE's parser to deal with.
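
For instance, once the delimiter uses HTML-style attributes, quoted values containing spaces become straightforward to pick out. A hypothetical sketch (not the PR's code):

// Hypothetical sketch: extract name="value" pairs from a delimiter such as
// <!-- wp:core/block attr="something with spaces" align="wide" -->
function parseDelimiterAttributes( attrText ) {
    const attrs = {};
    const pattern = /([a-z0-9_-]+)="([^"]*)"/g;
    let match;
    while ( ( match = pattern.exec( attrText ) ) ) {
        attrs[ match[ 1 ] ] = match[ 2 ];
    }
    return attrs;
}

parseDelimiterAttributes( 'attr="something with spaces" align="wide"' );
// => { attr: 'something with spaces', align: 'wide' }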

Member

What do you think about just using JSON for the comment attributes? Seems even easier to parse.
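
For illustration, a JSON-attribute delimiter could look something like this (the exact format is hypothetical, not anything specified in this thread):

// Hypothetical: attributes carried as a JSON object inside the delimiter.
const comment = '<!-- wp:core/embed {"url":"https://youtube.com/watch?v=xxx"} -->';
const match = /^<!--\s*wp:(\S+)\s+(\{.*\})\s*-->$/.exec( comment );
const slug = match[ 1 ];                     // 'core/embed'
const attributes = JSON.parse( match[ 2 ] ); // { url: 'https://youtube.com/watch?v=xxx' }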

Member Author

JSON would definitely be cleaner, and perhaps easier to parse when starting from nothing, but I'm not sure it helps when we're reusing TinyMCE's parsing, which already works well for HTML-style tags.

@@ -45,31 +44,150 @@ export function getBlockAttributes( blockNode, blockSettings ) {
return attrs;
}

function htmlEscape( str ) {
Member

Since we have it, maybe defer to Lodash's equivalents (also in Underscore if we end up going that route)?

Member Author

Done in 9866944.


nylen commented Apr 18, 2017

What would you say is the main advantage of this method of parsing vs. the current PegJS grammar?

Ability to deal with more kinds of markup gracefully. For one example, if block delimiters get inserted into the middle of a block, we will be able to recover from that situation much better because we now parse the entire tree of post content in a single step. Previously, this would lead to block markup being split; TinyMCE parsing would then kick in when we called setContent and "correct" each "half" of the broken markup, including stray opening and closing tags. It's hard to predict exactly what would happen next, but it doesn't seem good.

Consistency with TinyMCE behavior of the Editable component?

This is also a big benefit - we're now using an algorithm/library to parse WP post content that has years of extensive real-world usage and testing behind basically this exact task (minus our new delimiters).

Just trying to clarify boundaries between the role of TinyMCE and the concept of a block generally i.e. a post could consist of blocks of which none depend on TinyMCE, at least in the current implementation.

I don't see a problem with this. If needed, it should be possible to make just the parsing logic available separately from the rest of TinyMCE, and we should continue to make sure the actual implementation of a block doesn't depend on TinyMCE.


nylen commented Apr 18, 2017

Do you think there's a path here where we have a query API like the one that exists currently, but operating on the node produced by TinyMCE's parser? Trying to think of ways we can balance requirements of parsing robustness, flexibility of supporting variations of markup (e.g. with/without caption, citation), and developer experience generally.

It's hard for me to see how this will work out in practice as well. I am definitely open to suggestions. I'll illustrate one of the main reasons I wanted to pursue this approach via an example:

A core/text block contains a series of p tags. One of these p tags has a custom HTML structure (say, a few images in a specific arrangement) that a user entered into it, which the text block type does not support because it is intended to offer only basic formatting like bold and italics.

What happens when we edit such a block? Can we guarantee that we'll allow a reasonable editing experience and preserve the HTML content on save?

Why is this important? Granted, you probably shouldn't directly edit something inside a core/text block to contain markup the block doesn't support, but people aren't going to know this without a pretty deep understanding of how the editor works.

There is also a bigger reason: if we can make this work really well, we can apply the same logic and user experience (validate markup, otherwise fall back to freeform block) to the task of upgrading existing post content to blocks. Ideally this would require little to no extra work on the part of the block implementer.

what else do you have in mind?

Here is a very rough idea that would cover the example above.

import { OneOrMore, HTML, Param } from 'blocks/schema';

registerBlock( 'core/text', {
    schema: (
        <OneOrMore>
            <p styleAlign={ <Param name="align" /> }>
                <HTML
                    paramName="content"
                    allowedTags={ [ 'strong', 'em', 'b', 'i', 'span', ... ] }
                />
            </p>
        </OneOrMore>
    ), ...
} );

I'm not crazy about the styleAlign thing; maybe we would just use style instead and leave CSS parsing up to the block. However, a data structure like this could potentially be used for all of the following cases:

  • Validating markup against the schema.
  • Initializing the block from a content tree. In this example, the block would receive an array of align and content properties.
  • Serializing the edited block content back to a tree. Instead of blocks having to handle this themselves, they could just pass the same align and content parameters back to the next layer of the code.
  • With the addition of an array of possible schemas, we also get support for legacy markup structures, with the advantage that we know exactly which schema we matched against and can preserve that shape of markup if desired.

I think it also has the potential to be a very nice developer experience: rather than having to implement their own querying/validation/serialization logic, block implementers can specify declaratively what kinds of markup they support and how to parse it into the parameters they need, and the syntax looks a lot like the HTML that the blocks are designed to output.

General thoughts? Where do you see something like this working well or not? There is a lot to be done here, and it probably won't work for all use cases. I would also probably start implementing this with a simpler block, as handling arrays of tags will probably be fairly tricky.

@nylen nylen force-pushed the update/use-tinymce-parsing branch from 04739ae to 4554985 on April 18, 2017 at 14:38

nylen commented Apr 18, 2017

I've rebased and done some further refactoring as of 4554985:

  • Fixed a bug parsing raw HTML content at the beginning of posts
  • Introduced a new createBlockWithFallback function to remove some duplication between the different places we need to create blocks (rough sketch below)
  • Removed the concept of a blockNode object containing both content and attributes - this was imposed by the PegJS parser.
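
A rough sketch of the fallback idea mentioned above (helper names and signatures are assumptions, not necessarily the PR's exact API):

// Hypothetical sketch: try to create the named block; if the block type is
// unknown (or the delimiter is missing), hand the raw HTML to the registered
// freeform/unknown-type block instead.
function createBlockWithFallback( slug, rawContent, attributes ) {
    let blockSettings = slug ? getBlockSettings( slug ) : null; // hypothetical registry lookup
    if ( ! blockSettings ) {
        slug = getUnknownTypeHandler();       // e.g. the freeform block's slug
        attributes = { content: rawContent };
        blockSettings = getBlockSettings( slug );
    }
    return blockSettings ? createBlock( slug, attributes ) : null; // createBlock: hypothetical factory
}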


aduth commented Apr 18, 2017

In your proposal, when covering the case that a text block has foreign content (images in your example), what would happen? I suppose if it fails to pass the schema, it simply defers to the fallback handler?

I do quite like how it unifies the parsing and serialization steps.

A few aspects I'm not so certain of:

  • If there's more than one paragraph, how are the align and content params determined? That they're declared as children of the OneOrMore makes this unclear, since the params themselves seem to be singular values, but OneOrMore implies an array.
  • Along these lines, how does OneOrMore know how to operate on multiple values of align and content in the serialization?
  • Mapping attributes of an element is not entirely clear. I suppose this is what you're trying to achieve with styleAlign in your example? I'm having a hard time understanding the convention though. For an image would this look like <img src={ <Param name="url" /> }>?
  • What type is the value of content in your example? Some related discussion to this point at Propose block APIs for backwards compatibility #413 (comment).

I see the appeal in being explicit about where variations can occur, rather than blindly querying into the DOM / markup and potentially finding a value. Using JSX is neat but maybe not entirely necessary (might be an alternative with a plain JavaScript object description or string paths).

I've avoided it mainly because of XML, but perhaps if we are to go the JSX route, there's some inspiration to be taken from XSLT (mapping between JSX and a JS object instead of HTML and XML).

@mtias mtias added the Framework label (Issues related to broader framework topics, especially as it relates to javascript) on Apr 18, 2017

aduth commented Apr 18, 2017

Another interesting variation is on the node types themselves. For example, the heading block will need to have its level as an attribute, and the list whether it is ordered or unordered. These are driven by the nodeName in the markup, so we'd need some way of schematizing variable child types.


nylen commented Apr 18, 2017

when covering the case that a text block has foreign content (images in your example), what would happen? I suppose if it fails to pass the schema, it simply defers to the fallback handler?

Yes.

As far as the other points you raised, you've definitely picked up on some of the concerns I've had with this as well...

If there's more than one paragraph, how are the align and content params determined?

In this case they would have to be arrays. Maybe a bit counterintuitive.

How does OneOrMore know how to operate on multiple values of align and content in the serialization?

I would expect that it would accept arrays and verify that they have the same length.

Mapping attributes of an element is not entirely clear.

Ignore the styleAlign thing; I meant to write styleTextAlign, but that is pretty bad too.
Your <img src={ <Param name="url" /> }> is a much better example of what I was going for.

What type is the value of content in your example?

This could be any of the types you mentioned in the comment you linked (string, object tree...), this is definitely something that needs further thought.

might be an alternative with a plain JavaScript object description or string paths

Definitely, @iseulde has suggested something like this too (though we still need to figure out how to parameterize it, and block implementers would have to learn more in order to use it):

[ 'figure', {},
    [ 'img', { alt: 'Alternative text.', src: 'https://example.com' } ],
    [ 'figcaption', {}, 'This is ', [ 'em', {}, 'some' ], ' caption.' ]
]

Another interesting variation is on the node types themselves.

Yes, maybe something like <Tag choices={ [ 'ul', 'ol' ] }> would work there, but I do see this getting a bit clunky with real-world use cases. If we were using something like the JS notation above then ul|ol would be more concise.


mtias commented Apr 18, 2017

@nylen would it be possible to build this and merge it without replacing the existing one? It'd be great if we could just switch between one and the other at will to compare them.


nylen commented Apr 18, 2017

@mtias good idea, done. Both parsers are present and can be switched out by editing the export default parseWithTinyMCE line in blocks/api/parser.js. Both parsers pass the test suite.

The PegJS-based approach is very simplistic and will not support any of the follow-up steps we've discussed here unless it receives significant enhancements along the lines of https://github.com/Automattic/wp-post-grammar. Even then, TinyMCE already does much of the work we will need.


dmsnell commented Apr 18, 2017

The PegJS-based approach is very simplistic and will not support any of the follow-up steps we've discussed here unless it receives significant enhancements

can you link to the follow-up steps you think it can't handle? it is a very simple approach but it's also incredibly capable.


nylen commented Apr 19, 2017

I didn't say "can't handle", rather "will not support without significant enhancements". A couple of the things we will have to build:

  • Turning a lexed sequence of tokens into a tree of tags (rough sketch below). Even if simplistic, I do think this should include some handling of invalid nesting cases, ideally based on what TinyMCE does to match current behavior.
  • Handling of invalid nesting of block comment tags. I'm more concerned about this one because it will be pretty easy to accidentally transpose these tags.
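
For the first point, a hypothetical stack-based sketch of turning a token stream into a tree (token shapes are invented for illustration; real recovery from invalid nesting, TinyMCE-style, would be more involved):

// Hypothetical sketch: build a tree from lexed tokens with a stack; an
// unmatched close token is dropped rather than corrupting the tree.
function buildTree( tokens ) {
    const root = { name: '#root', children: [] };
    const stack = [ root ];

    for ( const token of tokens ) {
        const parent = stack[ stack.length - 1 ];
        if ( token.type === 'open' ) {
            const node = { name: token.name, attributes: token.attributes, children: [] };
            parent.children.push( node );
            stack.push( node );
        } else if ( token.type === 'close' ) {
            if ( stack.length > 1 && parent.name === token.name ) {
                stack.pop();
            }
        } else {
            parent.children.push( { name: '#text', value: token.value } );
        }
    }
    return root;
}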

I'd like to merge this and keep moving.


dmsnell commented Apr 19, 2017

invalid nesting cases…

this is where I'm uneasy. we have to make a choice when it's invalid and right now we just leave it up to TinyMCE, whose behavior is completely unspecified. regardless, maybe the markup is intended to be bad (terrible idea, but who knows… maybe they are trying to demonstrate a browser quirk) - well, try as hard as they might, TinyMCE botches up what they actually intended. this behavior is present with auto_p where it turns plain text into HTML…

I'm happy to preserve backwards compatibility, but I don't think we should bring over the worst part of that: implicitly making judgments about code whose meaning is impossible to infer. if the text comes into a block it must be valid to whatever extent valid means. I also want a no-edit session to be an identity of the input.

first = edit( input, [ /* no edits */ ] )
second = edit( first, [ /* no edits */ ] )
first === second

what's so good about unspecified behavior anyway? wouldn't it be better to simply notify that something has become invalid and provide an undo button to revert the breaking change? just not allow it to be possible to create invalid content


nylen commented Apr 19, 2017

"completely unspecified" - "impossible to infer". These statements are a bit too strong: https://github.com/tinymce/tinymce/blob/542c74d/src/core/src/main/js/html/SaxParser.js#L163

Remember also that we're building a WordPress editor that we plan to have millions of people use to edit their existing content. We don't necessarily have the luxury of making choices like that...


dmsnell commented Apr 19, 2017

"completely unspecified" - "impossible to infer". These statements are a bit too strong:

they aren't, and they aren't a jab or insult either

completely unspecified

having an implementation in code is just that - an implementation. it tells us what it does but not what the goal is. even in the linked code most of it is dealing with a stack structure. it's very difficult to build an alternative implementation matching the TinyMCE parser because it doesn't distinguish between "this is how we happen to be doing it" and "this is what it needs to do"

impossible to infer

what is the meaning of <b>just <i>some</b> text</i>

  • <b>just <i>some</i></b> text
  • <b>just <i>some</i></b><i> text</i> (more likely?)
  • <b>just <i>some</i> text</b>

the problem is that because the document is ambiguous there is no possible way to determine which of several alternatives was meant. we have to make assumptions and make choices but those choices aren't specified (unless we lay out formal rules that say what to do in all of these invalid situations)

that's just a simple example though

how about <img>where do I go?</img> or <b>bold or italic typo?</i> or what<p>horror<p>is this?

the point is that if we're going for blocks and have the opportunity to push all edits inside a defined editor (leaving the old TinyMCE block as a type of block for handling backwards compatibility) then why enable storing ambiguity and non-determinism into a post?

Remember also that we're building a WordPress editor that we plan to have millions of people use to edit their existing content. We don't necessarily have the luxury of making choices like that...

Not sure which choices in particular you are referring to, but none of us are trying to take away existing content. Even proper schemas won't help existing content because it's just HTML + Shortcode stew. The only meaningful way to upgrade a block from soup to something specific is if we have a valid parse, be it via a schema or JSON structure or whatever…


nylen commented Apr 19, 2017

Let's back up a bit... I would consider a grammar-based parser to be a superior solution. However what I implemented here was simply the easiest way to get a parse tree out of post content in a way that I know matches existing WP behavior. There are undoubtedly a ton of other edge cases there, and unfortunately I'm not the person to ask about what they are. @azaozz and @iseulde come to mind as people who would know about what kinds of content we need to be especially careful with when parsing.

This PR is a first (second?) draft that will serve well for building on top of. Even after this, it's not set in stone and it's really easy to switch out the parser. Let's replace it with something more robust as soon as we can. We could use your help there as I'm also not an expert in this area.

@nylen nylen merged commit 0c2d06f into master Apr 19, 2017
@nylen nylen deleted the update/use-tinymce-parsing branch April 19, 2017 16:47
@ellatrix
Member

There are undoubtedly a ton of other edge cases there, and unfortunately I'm not the person to ask about what they are.

@spocke probably knows best which things TinyMCE's SAX parser does handle that others (even DOM parser) do not.

return attrs;
}, {} ) }

WP_Block_Attribute
= name:WP_Block_Attribute_Name ":" value:WP_Block_Attribute_Value
{ return [ name, value ] }
{ return { name, value }; }
Member

There's an issue where this file isn't transpiled, and UglifyJS chokes on this line. You can see the error when running npm run build. The workaround was just to use ES5 in the file; otherwise, we need to investigate how to apply Babel to the file.
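
For reference, the difference amounts to this (illustrative only):

var name = 'url';
var value = 'https://example.com/';

// ES6 shorthand property names: this is what an ES5-only UglifyJS cannot
// minify when the grammar file isn't run through Babel.
var es6 = { name, value };

// ES5 equivalent, i.e. the workaround of writing the file in plain ES5.
var es5 = { name: name, value: value };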
