
doc2 is a parser in lexer’s clothing #5187

Merged · 23 commits · Aug 3, 2024

Conversation

@sellout (Contributor) commented Jul 5, 2024

Overview

This exposes the Doc parser and then converts directly from the Doc AST to Unison Terms.

This makes the permitted structure of Docs much clearer, and should have no impact on users. It should also make working on the Doc parser much easier in the future.

Part of the separation is that the Doc parser has no notion that it is wrapped in {{/}}. That is handled by the Unison lexer, and Doc just processes the part of the stream that it is allowed to.

The Doc parser is also completely independent of the Unison lexer/parser, so this could be used to add support for Unison Doc to any PL that has a Megaparsec parser.

Fixes #5076.

Implementation notes

The parsing already worked this way, but previously the parser directly emitted a flat list of Lexemes as it went. It now preserves the tree structure created by the parser, and adds a new Lexeme case called Doc, which passes that structure to the Unison parser to be converted directly to Unison Terms.

Doc can contain Unison code in a number of places, and those are stored as [Token Lexeme] in the Doc structure. The Unison parser then runs “sub-parses” on these chunks as it encounters them.
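A minimal Haskell sketch of the shape this describes (all names here are illustrative, not the actual types in the Unison codebase): the new `Doc` lexeme case carries the parsed tree, and embedded code is kept as raw token chunks so the host parser can sub-parse them later.

```haskell
-- Illustrative sketch only; the real Unison Lexeme/Doc types differ.

-- A token annotated with (simplified) source positions.
data Token a = Token { payload :: a, start :: Int, end :: Int }
  deriving (Eq, Show)

data Lexeme
  = Textual String       -- ordinary lexer output
  | Doc (Tree Lexeme)    -- the new case: a whole Doc tree in one lexeme
  deriving (Eq, Show)

-- The Doc tree, parameterized by the code-token type so it stays
-- independent of the host language's lexer.
data Tree code
  = Word String
  | Paragraph [Tree code]
  | EmbeddedCode [Token code]  -- raw chunks sub-parsed by the host parser
  deriving (Eq, Show)

-- Collect the raw code chunks that the host parser would sub-parse.
embeddedCode :: Tree code -> [[Token code]]
embeddedCode (Word _)            = []
embeddedCode (Paragraph ts)      = concatMap embeddedCode ts
embeddedCode (EmbeddedCode toks) = [toks]
```

The point of the `[Token code]` representation is that the Doc parser never interprets the code itself; it just delimits it and hands the chunks back.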

It also makes sense to review this commit-by-commit. E.g., the fourth commit separated the parsing (in Lexer.hs) from creating the Lexeme stream, which allows you to see that the parsing logic is almost entirely untouched, and that we only switch from building a stream to building a tree. Then the fifth commit removes the stream production in favor of handing the tree to the Unison parser and allowing it to produce the Terms directly from the tree.

Interesting/controversial decisions

Probably a number of things. One is just the data representation of Doc, which is more complicated than would be ideal, but is informed by what is possible to parse. Simplifying the data model requires changing the parser to match it.

I had also considered making it a GADT, which would simplify some things, but that would make it harder to mirror in Unison’s data model of Doc, and would hide some of the complexities that should be eliminated.

Test coverage

All transcripts which include {{ }} test this change.

Loose ends

There are a lot of improvements to the data model to be made, but this change was intended strictly as a refactor, to clarify some things.

@pchiusano (Member)

Oh, cool. I don't know why I didn't think of this.

The whole time I was doing the doc support, I found it annoying that the lexer and parser were separated (they were previously combined, but we moved away from that because dealing with layout and whitespace issues in the parser was overly complicated and hard to debug), but this PR seems to get the best of both worlds.

`doc2` is a Unison lexer that traverses a `Doc`.

`docBody` is the actual `Doc` lexer that is ignorant of the fact that Unison wraps `Doc` blocks in `{{`/`}}`.

This is in preparation for using `Ann` in the `Lexer` module, as that module actually does some parsing.

`doc2` was a parser in lexer’s clothing. It would parse recursively, but then return the result as a flat list of tokens.

This separates the parsing from the “unparsing” (which returns the tokens), so now we have a parser to a recursive `Doc` structure. This currently applies the unparser immediately, and should produce a stream of tokens identical to the previous version’s. Eventually, we should be able to avoid unparsing the `Doc` structure.

This removes the layer that makes the `Doc` parser look like a lexer and replaces it with a function that converts the `Doc` structure directly to Unison Terms.

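A toy sketch of the parse/unparse split described in these commits (the types are illustrative, not the real `Lexer.hs` definitions): the parser now builds a recursive structure, and a separate "unparser" flattens it back into the flat token stream the old code emitted directly.

```haskell
-- Illustrative toy types; the real Doc/Lexeme definitions are richer.

data Doc
  = Word String
  | Paragraph [Doc]
  deriving (Eq, Show)

data Lexeme = Txt String | Open String | Close
  deriving (Eq, Show)

-- The "unparser": flatten the recursive structure back into the flat
-- stream of tokens, so downstream consumers see the same output as
-- before the refactor.
unparse :: Doc -> [Lexeme]
unparse (Word w)       = [Txt w]
unparse (Paragraph ds) = Open "paragraph" : concatMap unparse ds ++ [Close]
```

Once nothing downstream needs the flat stream, `unparse` can be dropped and the tree consumed directly, which is exactly what the next commit does.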
After running the core of the lexer, the `lexer` function then does some
work to turn the stream into a tree, and reorder some lexemes. It then
throws away the tree structure.

This is the first step of preserving the tree structure for the parser.
It extracts the “pre-parser” from `lexer` so
that it can eventually be used _after_ the lexer, rather than internally.

This also moves `fixup` to be applied on each block as we reorder it,
rather than across the entire stream at the end (since the goal is to
not _have_ an entire stream any more).

This removes the need to pad the lexer stream with trailing `Close` lexemes. If EOF is reached, the parser will automatically close any layout blocks (but not context-free blocks).

We now build the stanzas at the same time as the tree, and don’t discard them after reordering.

This also changes the closing element of `Block` to be `Maybe` instead of `[]`.
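A sketch of the shape change this describes, combined with the EOF behavior from the earlier commit (field and type names here are made up for illustration): a block has at most one closer, so `Maybe` models it more precisely than a list that only ever held zero or one elements, and `Nothing` naturally encodes a layout block closed implicitly at EOF.

```haskell
import Data.Maybe (isJust)

-- Illustrative sketch; not the actual Block type in the Unison lexer.

data BlockKind = Layout | ContextFree
  deriving (Eq, Show)

data Block a = Block
  { kind  :: BlockKind
  , body  :: [a]
  , close :: Maybe a  -- was [a]; Nothing = closed implicitly at EOF
  } deriving (Eq, Show)

-- At EOF, only layout blocks may be left without an explicit closer;
-- context-free blocks must have been closed in the source.
wellClosed :: Block a -> Bool
wellClosed b = kind b == Layout || isJust (close b)
```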
These are needed for the new Doc types, but had been stubbed out. Moving the Doc types to their own module forced the changes that got in the way of generating these with Template Haskell.

It’s only used inside `local`, so its attempts to restore the layout are for naught.

In general, they map to the constructors of the Doc types, with some wiggle room for now.

It’s probably beneficial to review this commit by ignoring whitespace.

It is now completely[^1] independent of the Unison language. The parser takes a few parsers as arguments: one for identifiers, one for code, and one to indicate the end of the Doc block.

[^1]: There is one last bit to be removed in the next commit – Doc still looks for `type` or `ability` to identify type
links.
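A tiny sketch of this parameterization, using a toy combinator type in place of Megaparsec (everything here is hypothetical and greatly simplified): the Doc parser mentions nothing about Unison, because the caller supplies the identifier parser, the code parser, and the end-of-doc parser.

```haskell
-- Toy combinator type standing in for Megaparsec; illustrative only.
type Parser a = String -> Maybe (a, String)

data DocElem ident code = Ident ident | Code code
  deriving (Eq, Show)

-- Parse doc elements until the host-supplied end marker succeeds.
-- The Doc parser itself is host-language agnostic: all three pieces
-- come in as arguments.
docBody :: Parser ident -> Parser code -> Parser () -> Parser [DocElem ident code]
docBody ident code end = go
  where
    go s = case end s of
      Just (_, rest) -> Just ([], rest)  -- end of doc: stop consuming
      Nothing -> case code s of
        Just (c, rest) -> cons (Code c) (go rest)
        Nothing -> case ident s of
          Just (i, rest) -> cons (Ident i) (go rest)
          Nothing        -> Nothing      -- nothing matched: fail
    cons x = fmap (\(xs, rest) -> (x : xs, rest))
```

With this shape, a different host language could reuse the Doc parser just by supplying its own three parsers, which is the portability claim made in the PR description.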
The Doc parser shouldn’t know how Unison terminates Doc blocks.

This was the last thing tying Doc to Unison.

Some handling of blocks without final newlines was improved in the course of this PR.

Fixes unisonweb#5076.
@sellout sellout marked this pull request as ready for review August 2, 2024 05:47
@aryairani aryairani merged commit c049c65 into unisonweb:trunk Aug 3, 2024
20 checks passed
@sellout sellout deleted the doc-lexer branch August 15, 2024 05:50
Successfully merging this pull request may close these issues.

parser bug on docs with double backticks
3 participants