
doc2 is a parser in lexer’s clothing #5187

Merged · 23 commits · Aug 3, 2024

Conversation

@sellout (Contributor) commented Jul 5, 2024

Overview

This exposes the Doc parser and then converts directly from the Doc AST to Unison Terms.

This makes the permitted structure of Docs much clearer, and should have no impact on users. It should also make working on the Doc parser much easier in the future.

Part of the separation is that the Doc parser has no notion that it is wrapped in {{/}}. That is handled by the Unison lexer, and Doc just processes the part of the stream that it is allowed to.

The Doc parser is also completely independent of the Unison lexer/parser, so this could be used to add support for Unison Doc to any PL that has a Megaparsec parser.

Fixes #5076.

Implementation notes

The parsing already worked this way, but previously the parser directly emitted a flat list of Lexemes as it went. It now preserves the tree structure created by the parser, and adds a new Lexeme case called Doc, which passes that structure to the Unison parser to be converted directly to Unison Terms.

Doc can contain Unison code in a number of places, and those are stored as [Token Lexeme] in the Doc structure. The Unison parser then runs “sub-parses” on these chunks as it encounters them.
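A minimal Haskell sketch of the shape this describes (all names here are illustrative, not the actual types in the Unison codebase): the new `Doc` lexeme case carries the parsed tree, and embedded code is kept as raw token chunks so the host parser can sub-parse them later.

```haskell
-- Illustrative sketch only; the real Unison Lexeme/Doc types differ.

-- A token annotated with (simplified) source positions.
data Token a = Token { payload :: a, start :: Int, end :: Int }
  deriving (Eq, Show)

data Lexeme
  = Textual String       -- ordinary lexer output
  | Doc (Tree Lexeme)    -- the new case: a whole Doc tree in one lexeme
  deriving (Eq, Show)

-- The Doc tree, parameterized by the code-token type so it stays
-- independent of the host language's lexer.
data Tree code
  = Word String
  | Paragraph [Tree code]
  | EmbeddedCode [Token code]  -- raw chunks sub-parsed by the host parser
  deriving (Eq, Show)

-- Collect the raw code chunks that the host parser would sub-parse.
embeddedCode :: Tree code -> [[Token code]]
embeddedCode (Word _)            = []
embeddedCode (Paragraph ts)      = concatMap embeddedCode ts
embeddedCode (EmbeddedCode toks) = [toks]
```

The point of the `[Token code]` representation is that the Doc parser never interprets the code itself; it just delimits it and hands the chunks back.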

It also makes sense to review this commit-by-commit. E.g., the fourth commit separated the parsing (in Lexer.hs) from creating the Lexeme stream, which allows you to see that the parsing logic is almost entirely untouched, and that we only switch from building a stream to building a tree. Then the fifth commit removes the stream production in favor of handing the tree to the Unison parser and allowing it to produce the Terms directly from the tree.

Interesting/controversial decisions

Probably a number of things. One is just the data representation of Doc, which is more complicated than would be ideal, but is informed by what is possible to parse. Simplifying the data model requires changing the parser to match it.

I had also considered making it a GADT, which would simplify some things, but that would make it harder to mirror in Unison’s data model of Doc, and would hide some of the complexities that should be eliminated.

Test coverage

All transcripts which include {{ }} test this change.

Loose ends

There are a lot of improvements to the data model to be made, but this change was intended strictly as a refactor, to clarify some things.

@pchiusano (Member)

Oh, cool. I don't know why I didn't think of this.

The whole time I was doing the doc support, I found it annoying that the lexer and parser were separated (they were previously combined, but we moved away from that because dealing with layout and whitespace issues in the parser was overly complicated and hard to debug), but this PR seems to get the best of both worlds.

`doc2` is a Unison lexer that traverses a `Doc`.

`docBody` is the actual `Doc` lexer that is ignorant of the fact that Unison wraps `Doc` blocks in `{{`/`}}`.

This is in preparation for using `Ann` in the `Lexer` module, as that module actually does some parsing.

`doc2` was a parser in lexer’s clothing. It would parse recursively, but then return the result as a flat list of tokens.

This separates the parsing from the “unparsing” (which returns the tokens), so now we have a parser to a recursive `Doc` structure. This currently applies the unparser immediately, and should produce a stream of tokens identical to the previous version’s. Eventually, we should be able to avoid unparsing the `Doc` structure.

This removes the layer that makes the `Doc` parser look like a lexer and replaces it with a function that converts the `Doc` structure directly to Unison Terms.

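A toy sketch of the parse/unparse split described in these commits (the types are illustrative, not the real `Lexer.hs` definitions): the parser now builds a recursive structure, and a separate "unparser" flattens it back into the flat token stream the old code emitted directly.

```haskell
-- Illustrative toy types; the real Doc/Lexeme definitions are richer.

data Doc
  = Word String
  | Paragraph [Doc]
  deriving (Eq, Show)

data Lexeme = Txt String | Open String | Close
  deriving (Eq, Show)

-- The "unparser": flatten the recursive structure back into the flat
-- stream of tokens, so downstream consumers see the same output as
-- before the refactor.
unparse :: Doc -> [Lexeme]
unparse (Word w)       = [Txt w]
unparse (Paragraph ds) = Open "paragraph" : concatMap unparse ds ++ [Close]
```

Once nothing downstream needs the flat stream, `unparse` can be dropped and the tree consumed directly, which is exactly what the next commit does.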
After running the core of the lexer, the `lexer` function then does some
work to turn the stream into a tree, and reorder some lexemes. It then
throws away the tree structure.

This is the first step of preserving the tree structure for the parser.
It extracts the “pre-parser” from `lexer` so
that it can eventually be used _after_ the lexer, rather than internally.

This also moves `fixup` to be applied on each block as we reorder it,
rather than across the entire stream at the end (since the goal is to
not _have_ an entire stream any more).

This removes the need to pad the lexer stream with trailing `Close` lexemes. If EOF is reached, the parser will automatically close any layout blocks (but not context-free blocks).

We now build the stanzas at the same time as the tree, and don’t discard them after reordering.

This also changes the closing element of `Block` to be `Maybe` instead of `[]`.
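A sketch of the shape change this describes, combined with the EOF behavior from the earlier commit (field and type names here are made up for illustration): a block has at most one closer, so `Maybe` models it more precisely than a list that only ever held zero or one elements, and `Nothing` naturally encodes a layout block closed implicitly at EOF.

```haskell
import Data.Maybe (isJust)

-- Illustrative sketch; not the actual Block type in the Unison lexer.

data BlockKind = Layout | ContextFree
  deriving (Eq, Show)

data Block a = Block
  { kind  :: BlockKind
  , body  :: [a]
  , close :: Maybe a  -- was [a]; Nothing = closed implicitly at EOF
  } deriving (Eq, Show)

-- At EOF, only layout blocks may be left without an explicit closer;
-- context-free blocks must have been closed in the source.
wellClosed :: Block a -> Bool
wellClosed b = kind b == Layout || isJust (close b)
```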
These are needed for the new Doc types, but had been stubbed out. Moving the Doc types to their own module forced the changes that got in the way of generating these with Template Haskell.

It’s only used inside `local`, so its attempts to restore the layout are for naught.

In general, they map to the constructors of the Doc types, with some wiggle room for now.

It’s probably beneficial to review this commit by ignoring whitespace.

It is now completely[^1] independent of the Unison language. The parser takes a few parsers as arguments: one for identifiers, one for code, and one to indicate the end of the Doc block.

[^1]: There is one last bit to be removed in the next commit – Doc still looks for `type` or `ability` to identify type
links.
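A tiny sketch of this parameterization, using a toy combinator type in place of Megaparsec (everything here is hypothetical and greatly simplified): the Doc parser mentions nothing about Unison, because the caller supplies the identifier parser, the code parser, and the end-of-doc parser.

```haskell
-- Toy combinator type standing in for Megaparsec; illustrative only.
type Parser a = String -> Maybe (a, String)

data DocElem ident code = Ident ident | Code code
  deriving (Eq, Show)

-- Parse doc elements until the host-supplied end marker succeeds.
-- The Doc parser itself is host-language agnostic: all three pieces
-- come in as arguments.
docBody :: Parser ident -> Parser code -> Parser () -> Parser [DocElem ident code]
docBody ident code end = go
  where
    go s = case end s of
      Just (_, rest) -> Just ([], rest)  -- end of doc: stop consuming
      Nothing -> case code s of
        Just (c, rest) -> cons (Code c) (go rest)
        Nothing -> case ident s of
          Just (i, rest) -> cons (Ident i) (go rest)
          Nothing        -> Nothing      -- nothing matched: fail
    cons x = fmap (\(xs, rest) -> (x : xs, rest))
```

With this shape, a different host language could reuse the Doc parser just by supplying its own three parsers, which is the portability claim made in the PR description.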
The Doc parser shouldn’t know how Unison terminates Doc blocks.

This was the last thing tying Doc to Unison.

Some handling of blocks without final newlines was improved in the course of this PR.

Fixes unisonweb#5076.
@sellout sellout marked this pull request as ready for review August 2, 2024 05:47
@aryairani aryairani merged commit c049c65 into unisonweb:trunk Aug 3, 2024
20 checks passed
@sellout sellout deleted the doc-lexer branch August 15, 2024 05:50
Successfully merging this pull request may close these issues.

parser bug on docs with double backticks
3 participants