doc2 is a parser in lexer’s clothing #5187
Merged
Conversation
Oh, cool. I don't know why I didn't think of this. The whole time I was doing the doc support, I was finding it annoying that the lexer and parser were separated (they were previously combined, but we moved away from that because handling layout and whitespace issues in the parser made it overly complicated and hard to debug), but this PR seems like it gets the best of both worlds.
`doc2` is a Unison lexer that traverses a `Doc`. `docBody` is the actual `Doc` lexer that is ignorant of the fact that Unison wraps `Doc` blocks in `{{`/`}}`.
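That layering can be pictured with a minimal sketch (hypothetical names and types, not the real lexer API): the outer `doc2`-style wrapper strips the `{{`/`}}` delimiters, and the inner `docBody`-style lexer never sees them.

```haskell
import Data.List (isPrefixOf, isSuffixOf)

-- Stand-in for the real docBody lexer: here it just yields the raw body.
docBody :: String -> Maybe String
docBody = Just

-- doc2 peels off the Unison-specific "{{" / "}}" wrapper before
-- delegating to the delimiter-ignorant docBody.
doc2 :: String -> Maybe String
doc2 s
  | length s >= 4 && "{{" `isPrefixOf` s && "}}" `isSuffixOf` s =
      docBody (drop 2 (take (length s - 2) s))
  | otherwise = Nothing
```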
This is in preparation for using `Ann` in the `Lexer` module, as that module actually does some parsing.
`doc2` was a parser in lexer’s clothing. It would parse recursively, but then return the result as a flat list of tokens. This separates the parsing from the “unparsing” (which returns the tokens), so now we have a parser to a recursive `Doc` structure. This currently immediately applies the unparser, and should result in an identical stream of tokens as the previous version. Eventually, we should be able to avoid unparsing the `Doc` structure.
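The parse/unparse split might look like this in miniature (toy types, not Unison’s real `Doc` or `Lexeme`): the parser builds a recursive structure, and a separate unparser flattens it back into the token stream the old combined code emitted directly.

```haskell
-- Toy stand-ins for the real Doc and Lexeme types.
data Doc = Word String | Paragraph [Doc] deriving (Eq, Show)
data Token = Open String | Lex String | Close deriving (Eq, Show)

-- The "unparser": recovers the flat token stream from the tree, so
-- downstream consumers see exactly what the old code produced.
unparse :: Doc -> [Token]
unparse (Word w)       = [Lex w]
unparse (Paragraph ds) = Open "Paragraph" : concatMap unparse ds ++ [Close]
```

Once consumers accept the tree directly, `unparse` can be deleted without touching the parser.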
This removes the layer that makes the `Doc` parser look like a lexer and replaces it with a function that converts the Doc structure directly Unison Terms.
After running the core of the lexer, the `lexer` function then does some work to turn the stream into a tree, and reorder some lexemes. It then throws away the tree structure. This is the first step of preserving the tree structure for the parser. It extracts the “pre-parser” from `lexer` so that it can eventually be used _after_ the lexer, rather than internally. This also moves `fixup` to be applied on each block as we reorder it, rather than across the entire stream at the end (since the goal is to not _have_ an entire stream any more).
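The core of that stream-to-tree step can be sketched as follows (illustrative types, not the actual pre-parser): matching `Open`/`Close` markers in a flat stream are rebuilt into nesting, which is where a per-block `fixup` could then be applied.

```haskell
-- Flat stream elements and the tree they encode.
data Tok = TOpen | TLex Char | TClose deriving (Eq, Show)
data Tree = Leaf Char | Node [Tree] deriving (Eq, Show)

-- Returns the trees parsed at the current nesting level, plus the
-- unconsumed remainder of the stream (what follows the enclosing Close).
toTree :: [Tok] -> ([Tree], [Tok])
toTree (TOpen : ts)  = let (kids, rest)  = toTree ts
                           (sibs, rest') = toTree rest
                       in (Node kids : sibs, rest')
toTree (TLex c : ts) = let (sibs, rest) = toTree ts in (Leaf c : sibs, rest)
toTree (TClose : ts) = ([], ts)
toTree []            = ([], [])
```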
This removes the need to pad the lexer stream with trailing `Close` lexemes. If EOF is reached, the parser will automatically close any layout blocks (but not context-free blocks).
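A sketch of that EOF rule, with hypothetical block kinds standing in for the lexer’s real bookkeeping: layout blocks are closed implicitly at end of input, while an open context-free block (one with explicit delimiters) is an error.

```haskell
-- Open blocks recorded by the parser; kinds are illustrative only.
data BlockKind = Layout | ContextFree deriving (Eq, Show)

-- On EOF, report how many implicit Closes to synthesize, or fail if
-- the user left an explicitly delimited block unclosed.
closeAtEof :: [BlockKind] -> Either String Int
closeAtEof open
  | ContextFree `elem` open = Left "unexpected end of input in explicit block"
  | otherwise               = Right (length open)
```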
We now build the stanzas at the same time as the tree, and don’t discard them after reordering. This also changes the closing element of `Block` to be `Maybe` instead of `[]`.
These are needed for the new Doc types, but had been stubbed out. Moving the Doc types to their own module forced the changes that got in the way of generating these with Template Haskell.
It’s only used inside `local`, so its attempts to restore the layout have no effect.
In general, they map to the constructors of the Doc types, with some wiggle room for now. It’s probably beneficial to review this commit by ignoring whitespace.
It is now completely[^1] independent of the Unison language. The parser takes a few parsers as arguments: one for identifiers, one for code, and one to indicate the end of the Doc block.

[^1]: There is one last bit to be removed in the next commit – Doc still looks for `type` or `ability` to identify type links.
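One way to picture that parameterization (the record and its field names are assumptions for illustration, not the real signature): the host language supplies the pieces that vary, and the Doc parser threads them through.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical bundle of host-language parsers handed to the Doc parser.
data DocParsers i c = DocParsers
  { parseIdent :: String -> Maybe (i, String)  -- host-language identifiers
  , parseCode  :: String -> Maybe (c, String)  -- embedded host-language code
  , atDocEnd   :: String -> Bool               -- host-defined end of the block
  }

-- Plain text runs until the host language says the Doc block is over.
docText :: DocParsers i c -> String -> (String, String)
docText ps = go
  where
    go s | null s || atDocEnd ps s = ("", s)
         | otherwise = let (t, rest) = go (tail s) in (head s : t, rest)
```

Because nothing here mentions Unison, the same parser could run inside any host that can supply these three pieces.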
The Doc parser shouldn’t know how Unison terminates Doc blocks.
This was the last thing tying Doc to Unison.
Some handling of blocks without final newlines was improved in the course of this PR. Fixes unisonweb#5076.
aryairani approved these changes Aug 3, 2024
Overview
This exposes the `Doc` parser and then converts directly from the Doc AST to Unison `Term`s. This makes the permitted structure of Docs much clearer, and should have no impact on users. It should make working on the Doc parser much easier in the future.
Part of the separation is that the Doc parser has no notion that it is wrapped in `{{`/`}}`. That is handled by the Unison lexer, and Doc just processes the part of the stream that it is allowed to.

The Doc parser is also completely independent of the Unison lexer/parser, so this could be used to add support for Unison Doc to any PL that has a Megaparsec parser.
Fixes #5076.
Implementation notes
This was already the case, but previously the parser directly emitted a list of `Lexeme`s as it went. This now preserves the tree structure created by the parser, and adds a new `Lexeme` case called `Doc`, which passes that structure to the Unison parser to be converted directly to Unison `Term`s.

Doc can contain Unison code in a number of places, and those are stored as `[Token Lexeme]` in the Doc structure. The Unison parser then runs “sub-parses” on these chunks as it encounters them.

It also makes sense to review this commit-by-commit. E.g., the fourth commit separated the parsing (in Lexer.hs) from creating the `Lexeme` stream, which allows you to see that the parsing logic is almost entirely untouched, and that we only switch from building a stream to building a tree. Then the fifth commit removes the stream production in favor of handing the tree to the Unison parser and allowing it to produce the `Term`s directly from the tree.

Interesting/controversial decisions
Probably a number of things. One is just the data representation of `Doc`, which is more complicated than would be ideal, but is informed by what is possible to parse. Simplifying the data model requires changing the parser to match it.

I had also considered making it a GADT, which would simplify some things, but that would make it harder to parallel Unison’s data model of `Doc`, and would hide some of the complexities that should be eliminated.

Test coverage
All transcripts which include `{{ }}` test this change.

Loose ends
There are a lot of improvements to the data model to be made, but this change was intended strictly as a refactor, to clarify some things.