Skip to content

Commit

Permalink
block-serialization-default-parser: set up auto-generated API docs (#…
Browse files Browse the repository at this point in the history
  • Loading branch information
nosolosw committed Mar 8, 2019
1 parent bf1dd41 commit e66738b
Show file tree
Hide file tree
Showing 3 changed files with 112 additions and 12 deletions.
2 changes: 1 addition & 1 deletion bin/update-readmes.js
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ const packages = [
'blob',
'block-editor',
'block-library',
//'block-serialization-default-parser',
'block-serialization-default-parser',
'blocks',
'compose',
//'data',
Expand Down
45 changes: 34 additions & 11 deletions packages/block-serialization-default-parser/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,20 @@ npm install @wordpress/block-serialization-default-parser --save

_This package assumes that your code will run in an **ES2015+** environment. If you're using an environment that has limited or no support for ES2015+ such as lower versions of IE then using [core-js](https://github.com/zloirock/core-js) or [@babel/polyfill](https://babeljs.io/docs/en/next/babel-polyfill) will add support for these methods. Learn more about it in [Babel docs](https://babeljs.io/docs/en/next/caveats)._

## Usage
## API

<!-- START TOKEN(Autogenerated API docs) -->

### parse

[src/index.js#L150-L162](src/index.js#L150-L162)

Parser function, that converts input HTML into a block based structure.

**Usage**

Input post:

```html
<!-- wp:columns {"columns":3} -->
<div class="wp-block-columns has-3-columns"><!-- wp:column -->
Expand All @@ -36,6 +47,7 @@ Input post:
```

Parsing code:

```js
import { parse } from '@wordpress/block-serialization-default-parser';

Expand Down Expand Up @@ -84,6 +96,17 @@ parse( post ) === [
];
```

**Parameters**

- **doc** `string`: The HTML document to parse.

**Returns**

`Array`: A block-based representation of the input HTML.


<!-- END TOKEN(Autogenerated API docs) -->

## Theory

### What is different about this one from the spec-parser?
Expand All @@ -98,26 +121,26 @@ Every serialized Gutenberg document is nominally an HTML document which, in addi

This parser attempts to create a state-machine around the transitions triggered from those delimiters -- the "tokens" of the grammar. Every time we find one we should only be doing either of:

- enter a new block;
- exit out of a block.
- enter a new block;
- exit out of a block.

Those actions have different effects depending on the context; for instance, when we exit a block we either need to add it to the output block list _or_ we need to append it as the next `innerBlock` on the parent block below it in the block stack (the place where we track open blocks). The details are documented below.

The biggest challenge in this parser is making the right accounting of indices required to construct the `innerHTML` values for each block at every level of nesting depth. We take a simple approach:

- Start each newly opened block with an empty `innerHTML`.
- Whenever we push a first block into the `innerBlocks` list, add the content from where the content of the parent block started to where this inner block starts.
- Whenever we push another block into the `innerBlocks` list, add the content from where the previous inner block ended to where this inner block starts.
- When we close out an open block, add the content from where the last inner block ended to where the closing block delimiter starts.
- If there are no inner blocks then we take the entire content between the opening and closing block comment delimiters as the `innerHTML`.
- Start each newly opened block with an empty `innerHTML`.
- Whenever we push a first block into the `innerBlocks` list, add the content from where the content of the parent block started to where this inner block starts.
- Whenever we push another block into the `innerBlocks` list, add the content from where the previous inner block ended to where this inner block starts.
- When we close out an open block, add the content from where the last inner block ended to where the closing block delimiter starts.
- If there are no inner blocks then we take the entire content between the opening and closing block comment delimiters as the `innerHTML`.

### I meant, how does it perform?

This parser operates much faster than the generated parser from the specification. Because we know more about the parsing than the PEG does we can take advantage of several tricks to improve our speed and memory usage:

- We only have one or two distinct tokens, depending on how you look at it, and they are all readily matched via a regular expression. Instead of parsing on a character-per-character basis we can allow the PCRE RegExp engine to skip over large swaths of the document for us in order to find those tokens.
- Since `preg_match()` takes an `offset` parameter we can crawl through the input without passing copies of the input text on every step. We can track our position in the string and only pass a number instead.
- Not copying all those strings means that we'll also skip many memory allocations.
- We only have one or two distinct tokens, depending on how you look at it, and they are all readily matched via a regular expression. Instead of parsing on a character-per-character basis we can allow the PCRE RegExp engine to skip over large swaths of the document for us in order to find those tokens.
- Since `preg_match()` takes an `offset` parameter we can crawl through the input without passing copies of the input text on every step. We can track our position in the string and only pass a number instead.
- Not copying all those strings means that we'll also skip many memory allocations.

Further, tokenizing with a RegExp brings an additional advantage. The parser generated by the PEG provides predictable performance characteristics in exchange for control over tokenization rules -- it doesn't allow us to define RegExp patterns in the rules so as to guard against _e.g._ cataclysmic backtracking that would break the PEG guarantees.

Expand Down
77 changes: 77 additions & 0 deletions packages/block-serialization-default-parser/src/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,83 @@ function Frame( block, tokenStart, tokenLength, prevOffset, leadingHtmlStart ) {
};
}

/**
* Parser function, that converts input HTML into a block based structure.
*
* @param {string} doc The HTML document to parse.
*
* @example
* Input post:
* ```html
* <!-- wp:columns {"columns":3} -->
* <div class="wp-block-columns has-3-columns"><!-- wp:column -->
* <div class="wp-block-column"><!-- wp:paragraph -->
* <p>Left</p>
* <!-- /wp:paragraph --></div>
* <!-- /wp:column -->
*
* <!-- wp:column -->
* <div class="wp-block-column"><!-- wp:paragraph -->
* <p><strong>Middle</strong></p>
* <!-- /wp:paragraph --></div>
* <!-- /wp:column -->
*
* <!-- wp:column -->
* <div class="wp-block-column"></div>
* <!-- /wp:column --></div>
* <!-- /wp:columns -->
* ```
*
* Parsing code:
* ```js
* import { parse } from '@wordpress/block-serialization-default-parser';
*
* parse( post ) === [
* {
* blockName: "core/columns",
* attrs: {
* columns: 3
* },
* innerBlocks: [
* {
* blockName: "core/column",
* attrs: null,
* innerBlocks: [
* {
* blockName: "core/paragraph",
* attrs: null,
* innerBlocks: [],
* innerHTML: "\n<p>Left</p>\n"
* }
* ],
* innerHTML: '\n<div class="wp-block-column"></div>\n'
* },
* {
* blockName: "core/column",
* attrs: null,
* innerBlocks: [
* {
* blockName: "core/paragraph",
* attrs: null,
* innerBlocks: [],
* innerHTML: "\n<p><strong>Middle</strong></p>\n"
* }
* ],
* innerHTML: '\n<div class="wp-block-column"></div>\n'
* },
* {
* blockName: "core/column",
* attrs: null,
* innerBlocks: [],
* innerHTML: '\n<div class="wp-block-column"></div>\n'
* }
* ],
* innerHTML: '\n<div class="wp-block-columns has-3-columns">\n\n\n\n</div>\n'
* }
* ];
* ```
* @return {Array} A block-based representation of the input HTML.
*/
export const parse = ( doc ) => {
document = doc;
offset = 0;
Expand Down

0 comments on commit e66738b

Please sign in to comment.