WIP - POC with Chevrotain Parser. #142

bd82 · 2017-08-28T18:38:20Z

**DO NOT MERGE, WIP**

bd82 · 2017-08-28T18:43:50Z

Hi. I'm playing around with re-implementing the JDL parser using the Chevrotain parsing library.

This is related to the technology choice discussion in #141 and
the preceding discussion in jhipster/generator-jhipster#6275
This may need its own separate issue...

This PR currently contains a lexer for JDL implemented using Chevrotain's RegExp-based lexer engine.

There are some minor changes from the original JDL pegjs implementation documented in the comments.

The next step is to try and convert the grammar itself (just syntax, no AST building).

deepu105 · 2017-08-29T06:15:00Z

@bd82 this looks interesting and personally, I like it as it seems more readable than PegJS syntax.

bd82 · 2017-08-29T18:45:17Z

Thanks @deepu105 .

Being an internal DSL the grammar itself is imho a little uglier than pure EBNF style syntax, but still highly readable by using vertical spacing.

Example:

    // comments will be handled outside(after) the parser in this implementation.
    $.RULE('entityBody', () => {
      $.CONSUME(t.LCURLY);
      $.AT_LEAST_ONE(() => {
        $.SUBRULE($.fieldDec);
      });
      $.CONSUME(t.RCURLY);
    });

    $.RULE('fieldDec', () => {
      $.CONSUME(t.NAME);
      $.SUBRULE($.type);
      // Short form for: "(X(,X)*)?"
      $.MANY_SEP({
        SEP: t.COMMA,
        DEF: () => {
          $.SUBRULE($.validation);
        }
      });
    });

With pegjs it would look something like this (taken from the existing grammar),
which is a lot more "horizontal":

    entityBody
      = '{' SPACE* fdl:fieldDeclList SPACE* '}' { return fdl; }
      / '' { return []; }

    fieldDeclList
      = SPACE* com:comment? SPACE* f:FIELD_NAME SPACE_WITHOUT_NEWLINE* t:type SPACE_WITHOUT_NEWLINE* vl:validationList? SPACE_WITHOUT_NEWLINE* com2:comment? SPACE_WITHOUT_NEWLINE* ','? SPACE* fdl:fieldDeclList {
        return addUniqueElements([{ name: f, type: t, validations: vl, javadoc: com || com2 }], fdl );
      }
      / '' { return []; }

In theory this could be prettier, but many things had to be added:

- "SPACE*" everywhere, because pegjs cannot ignore tokens.
- Labels (e.g. **vl:**validationList?).
- JS code snippets to execute (semantic actions).
  - With Chevrotain you can optionally implement the semantic actions outside the grammar.

The end result is much less readable imho as there is no separation of concerns...

But it is not just about readability, it is also about maintainability.
With Chevrotain you can place a breakpoint anywhere in your grammar and just debug it
like any other JavaScript code you write. :)
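
The separation of concerns mentioned above (syntax in the grammar, semantic actions somewhere else) can be sketched in plain JavaScript. This is a hand-written illustration and not Chevrotain's API: the shape of the `cst` object and the `buildField` / `buildEntityBody` helpers are hypothetical and only mirror the `entityBody` / `fieldDec` rules from the example.

```javascript
// Hypothetical syntax tree for `{ name String required }`.
// It only records what was parsed; no semantics are attached to it.
const cst = {
  name: 'entityBody',
  children: {
    fieldDec: [
      {
        name: 'fieldDec',
        children: {
          NAME: [{ image: 'name' }],
          type: [{ image: 'String' }],
          validation: [{ image: 'required' }]
        }
      }
    ]
  }
};

// Semantic actions live in ordinary functions, separate from the grammar.
function buildField(fieldCst) {
  const c = fieldCst.children;
  return {
    name: c.NAME[0].image,
    type: c.type[0].image,
    // A field may have no validations at all.
    validations: (c.validation || []).map(v => ({ key: v.image }))
  };
}

function buildEntityBody(bodyCst) {
  return (bodyCst.children.fieldDec || []).map(buildField);
}

const ast = buildEntityBody(cst);
```

Because the tree builder is ordinary code, it can be changed, debugged or replaced (for example by a formatter) without touching the grammar rules.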

bd82 · 2017-08-29T18:53:04Z

Please have a look at a subset of the grammar implemented in latest commit.

The next step would be to add a few tests to demonstrate capabilities on this JDL grammar subset.

- Autocomplete support.
- Building an AST using external semantic actions (not embedded in the grammar as with pegjs).
- Linking jsdocs comments back to the AST.
- Multiple Syntax Errors (for a single input text).
- Error Recovery.
- Extracting data required to implement a JDL code formatter.
- Syntax Diagrams.

Hopefully I will have time to implement some of these tomorrow.

MathieuAA · 2017-08-29T19:37:35Z

lib/dsl/chev_grammar.js

+
+// HIGHLIGHT:
+// "MIN_MAX_KEYWORD" is an "abstract" token which other concrete tokens inherit from.
+// This can be used to reduce verbosity in the parser.


It is used in the validation rule instead of specifying the six different keywords.
https://github.com/jhipster/jhipster-core/pull/142/files#diff-802ee05eaf770a8bbbc2fe7ef13a3efaR233

Here is the corresponding section in the existing grammar:

jhipster-core/lib/dsl/grammar.txt

Lines 511 to 522 in fd8f712

    / MINLENGTH SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'minlength', value: int }; }
    / MINLENGTH SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'minlength', value: constantName, constant: true }; }
    / MAXLENGTH SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'maxlength', value: int }; }
    / MAXLENGTH SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'maxlength', value: constantName, constant: true }; }
    / MINBYTES SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'minbytes', value: int }; }
    / MINBYTES SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'minbytes', value: constantName, constant: true }; }
    / MAXBYTES SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'maxbytes', value: int }; }
    / MAXBYTES SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'maxbytes', value: constantName, constant: true }; }
    / MIN SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'min', value: int };}
    / MIN SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'min', value: constantName, constant: true }; }
    / MAX SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'max', value: int };}
    / MAX SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'max', value: constantName, constant: true }; }

The token inheritance does not have to be used.
It is an example of what is possible and could be considered...
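
To make the idea of an "abstract" token more concrete, here is a rough sketch in plain JavaScript. It only illustrates the concept: the `parent` field and the `matches` helper are hypothetical and say nothing about how Chevrotain implements token inheritance internally.

```javascript
// The "abstract" token: never produced by the lexer itself.
const MIN_MAX_KEYWORD = { name: 'MIN_MAX_KEYWORD', parent: null };

// The six concrete keywords all inherit from the abstract token.
const keywordNames = ['MINLENGTH', 'MAXLENGTH', 'MINBYTES', 'MAXBYTES', 'MIN', 'MAX'];
const concrete = {};
for (const name of keywordNames) {
  concrete[name] = { name, parent: MIN_MAX_KEYWORD };
}

// An unrelated token, for comparison.
const REQUIRED = { name: 'REQUIRED', parent: null };

// A token matches the expected type if it is that type or inherits from it.
function matches(actual, expected) {
  for (let t = actual; t !== null; t = t.parent) {
    if (t === expected) return true;
  }
  return false;
}
```

With this, a validation rule can ask for `MIN_MAX_KEYWORD` once instead of listing six alternatives: `matches(concrete.MINBYTES, MIN_MAX_KEYWORD)` is true, while `matches(REQUIRED, MIN_MAX_KEYWORD)` is false.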

MathieuAA · 2017-08-29T20:02:50Z

lib/dsl/chev_grammar.js

+    // very important to call this after all the rules have been defined.
+    // otherwise the parser may not work correctly as it will lack information
+    // derived during the self analysis phase.
+    Parser.performSelfAnalysis(this);


All of this happens at the object's construction time. Why not have another way?
One file is enough, but putting everything in the constructor isn't really something I look forward to maintaining, even if the improvement of using Chevrotain over PegJS is obvious. Why not, for instance, use a factory of some sort (a function that calls other functions to build the parser instance)?

Why not have another way?

Answer

The syntax I prefer relies on the ESNext class fields syntax.
https://github.com/tc39/proposal-class-fields
But this is not yet supported afaik (currently stage 3 proposal).
I suppose Babel will support this at some point:
babel/proposals#12

TypeScript has something similar which already works now.
See this example:
This is similar to the "official" API I'm aiming for, but it may need to wait for ES2018. :(

Alternative

Anyhow as it is all just plain JavaScript you can define it (mostly) however you want...
An extreme example would be this completely different DSL for specifying Chevrotain grammars
https://github.com/kristianmandrup/chevrotain-rule-dsl

I am a bit too tired to think of a concrete alternative syntax right now.
But I believe one should be possible even with ES6, perhaps you have a suggestion?
The constraints are:

- Parser.performSelfAnalysis must be called after the rules have been defined,
  as it relies on side effects of creating the rules.
- The RULE calls must be made in the context of the parser instance (this).

And if it helps: normally you only use a single parser instance and reset its internal state before each use.
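
As a starting point for discussion, the constraints above could be met by a factory along these lines. This is a sketch only: `MiniParser` is a tiny stand-in written for this example and is not Chevrotain's `Parser` class, and `createParser`, `defineEntityRules` and `defineFieldRules` are hypothetical names.

```javascript
// Stand-in for a parser base class; NOT Chevrotain's API.
class MiniParser {
  constructor() {
    this.rules = {};
    this.analyzed = false;
  }

  RULE(name, impl) {
    if (this.analyzed) {
      throw new Error('rules must be defined before self analysis');
    }
    // Bind the implementation so it always runs in the context of this instance.
    this.rules[name] = impl.bind(this);
    this[name] = this.rules[name];
  }

  performSelfAnalysis() {
    this.analyzed = true;
  }
}

// Rule groups are plain functions that receive the parser instance,
// so they could live in separate files.
function defineEntityRules($) {
  $.RULE('entityBody', function () {
    return 'entityBody via ' + this.constructor.name;
  });
}

function defineFieldRules($) {
  $.RULE('fieldDec', function () {
    return 'fieldDec';
  });
}

// The factory: define every rule first, run the self analysis last.
function createParser(ruleGroups) {
  const parser = new MiniParser();
  ruleGroups.forEach(define => define(parser));
  parser.performSelfAnalysis();
  return parser;
}

const parser = createParser([defineEntityRules, defineFieldRules]);
```

The constructor stays small, the ordering constraint is enforced in exactly one place, and the single `parser` instance can be reused between inputs.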

Future / Long term.

There is also an open issue for better support of custom APIs for building Chevrotain parsers.
And I'm hoping in the long term to support three different API styles (same as Mocha/Chai have different APIs using the same underlying engine).

- Low Level Hand-Built style.
- Combinator Style, fluent DSL.
- EBNF generator style (Like pegjs).

Here is a really quick and dirty factory style hack.
https://github.com/SAP/chevrotain/blob/5235a12da1818aaf2ac075cd4326d46e46da15fc/examples/grammars/json/json.js#L95-L126

And here are the rules defined outside the constructor.
https://github.com/SAP/chevrotain/blob/5235a12da1818aaf2ac075cd4326d46e46da15fc/examples/grammars/json/json.js#L129-L180

I don't think this should be part of Chevrotain's official API
as I would rather wait for class fields proposal, but it can be cleaned up and reused
by end users if needed...

Also note that this factory mixes in the rules, so they could easily be split up
to multiple files for large grammars.

Hope this example demonstrates how Chevrotain being a library
instead of a code generator makes it much more amenable to customization. 😄

deepu105 · 2017-08-30T05:24:34Z

Wow great work @bd82 and thanks

bd82 · 2017-08-30T18:58:54Z

Happy to help @deepu105 😄

Latest commit cleaned up a bit and has a small parser test (happy path) which you can debug.
I plan to add more scenarios (as specs) tomorrow.

* Lexer, Parser and APIs in different files.
* A single test which parses a simple valid input and outputs a CST.

MathieuAA · 2017-08-31T06:55:12Z

Wow. Nice!

* Automatic Error recovery.
* Syntactic content assist.

bd82 · 2017-08-31T21:15:50Z

Added some more examples for both syntactic content assist
and for error recovery / fault tolerance.

Additionally syntax diagrams can be generated from the grammar.
This can be useful both for development purposes and as part of a documentation site.
Diagrams of the current sub-grammar

bd82 · 2017-09-08T21:58:52Z

I think there is now enough content and E2E flows in this POC to be worth discussing and reviewing.
I will create a separate Issue for this (tomorrow) with specific highlights and links to the source code
to help review this fairly large number of lines.

bd82 · 2017-09-09T20:19:43Z

Added the discussion issue:
#144

MathieuAA · 2017-09-10T15:07:01Z

lib/dsl/poc/lexer.js

+const NAME = chevrotain.createToken({ name: 'NAME', pattern: namePattern });
+
+
+function createToken(config) {


Two functions could be created and the type tests could be avoided, e.g. createPatternToken and createStringToken.

This is a subjective style question; choose whichever you prefer...
There are actually four/five possible argument types documented in the Docs

In addition using strings is just a convenience style, if you prefer conformity
and no runtime type checks you can replace the strings with regExps, eg:

    // both of these are equivalent.
    createToken({ name: 'ENTITY', pattern: "entity" });
    createToken({ name: 'ENTITY', pattern: /entity/ });
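
One caveat when swapping strings for regExps: for plain keywords the two forms match the same text, but a string that contains RegExp metacharacters (such as the '.' used for the DOT token) has to be escaped first. A minimal sketch in plain JavaScript, assuming string patterns are meant to be matched literally; the `toRegExp` helper is hypothetical and not part of Chevrotain:

```javascript
// Hypothetical helper: turn a literal string pattern into an equivalent RegExp.
function toRegExp(pattern) {
  if (pattern instanceof RegExp) {
    return pattern;
  }
  // Escape RegExp metacharacters so the string is matched literally.
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(escaped);
}

const entity = toRegExp('entity'); // behaves like /entity/
const dot = toRegExp('.');         // behaves like /\./, not like /./
```

So `dot.test('.')` is true but `dot.test('a')` is false, whereas an unescaped `/./` would accept any character.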

MathieuAA · 2017-09-10T15:08:25Z

lib/dsl/poc/lexer.js

+createToken({ name: 'DOT', pattern: '.' });
+
+// Imperative the "NAME" token will be added after all the keywords to resolve keywords vs identifier conflict.
+tokens.NAME = NAME;


If order matters, are there other rules to remember when writing a parser?

There are many rules, as with any complex domain.
Most of these rules are not specific to Chevrotain but apply to writing parsers in general.
E.g: Keywords vs Identifiers in Antlr4

In general Chevrotain tries to detect these issues and provide useful error messages
or even links with detailed instructions[1] [2] on how to resolve those.

But not everything can (currently) be automatically detected (such as keywords vs identifiers). But you just gave me an idea how to automatically detect keywords vs identifiers!
👍
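
To show why the order matters, here is a minimal first-match-wins lexer in plain JavaScript. It is an illustration only and not Chevrotain's lexer; the `tokenize` function and the identifier pattern used for `NAME` are assumptions made for this sketch (the real `namePattern` lives in lexer.js).

```javascript
// Minimal first-match-wins lexer: token types are tried in the given order.
function tokenize(text, tokenTypes) {
  const result = [];
  let offset = 0;
  while (offset < text.length) {
    const rest = text.slice(offset);
    // Skip whitespace between tokens.
    const ws = /^\s+/.exec(rest);
    if (ws) {
      offset += ws[0].length;
      continue;
    }
    let matched = false;
    for (const type of tokenTypes) {
      // Anchor the pattern at the current position; the first type that matches wins.
      const m = new RegExp('^(?:' + type.pattern.source + ')').exec(rest);
      if (m) {
        result.push({ type: type.name, image: m[0] });
        offset += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error('Unexpected character at offset ' + offset);
    }
  }
  return result;
}

const ENTITY = { name: 'ENTITY', pattern: /entity/ };
// Assumed identifier pattern, for illustration only.
const NAME = { name: 'NAME', pattern: /[a-zA-Z_][a-zA-Z_0-9]*/ };

const keywordsFirst = tokenize('entity Foo', [ENTITY, NAME]);
const nameFirst = tokenize('entity Foo', [NAME, ENTITY]);
```

With the keyword listed first, `entity` is lexed as ENTITY and `Foo` as NAME. With NAME listed first, `entity` is swallowed as a NAME and the keyword is never produced. Note that even with the keyword first, this naive sketch would split an identifier such as `entityFoo` into ENTITY followed by NAME; handling that case is outside the scope of the example.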

deepu105 · 2017-09-15T08:54:49Z

Awesome work. I have no idea how the formatting could be utilized with the CodeMirror editor we use in JDL Studio, though. Personally, formatting is not a requirement at this point, but if it is easy to integrate it would be cool as well.

bd82 · 2017-09-15T09:04:29Z

have no idea how the formatting could be utilized with codeMirror we use in JDL studio.

Neither do I 😄.
The important point is that this approach enables future Editor tooling extensions
by keeping all the syntactic data, and more importantly it enables those without needing to modify the parser.

deepu105 · 2017-09-24T08:33:38Z

That escalated quickly 😄

POC with Chevrotain Parser.

9b5363b

Added a subset of the JDL grammar to demonstrates

be0544d

How a parser implemented using Chevrotain would look like. Next step would be some tests to demonstrate capabilities.

bd82 force-pushed the chev branch from 0383126 to be0544d Compare August 29, 2017 18:46

MathieuAA reviewed Aug 29, 2017

Cleaned up and single "happy path spec".

2bf7903

* Lexer, Parser and APIs in different files.
* A single test which parses a simple valid input and outputs a CST.

bd82 force-pushed the chev branch from bc117f7 to 2bf7903 Compare August 30, 2017 18:59

A couple of more specs for syntax errors.

23b41c6

bd82 and others added 2 commits August 31, 2017 19:36

Additional examples specs.

33df20d

* Automatic Error recovery.
* Syntactic content assist.

Added diagrams for the grammar.

b2d50a7

Shahar Soel added 2 commits September 2, 2017 01:22

WIP on naive formatter example using CST.

1dcee28

Finished formatted example.

0c27008

bd82 force-pushed the chev branch 4 times, most recently from 81a81dd to 6df5348 Compare September 8, 2017 12:07

Ast Builder.

b5ef657

bd82 force-pushed the chev branch from 6df5348 to b5ef657 Compare September 8, 2017 21:53

MathieuAA reviewed Sep 11, 2017

MathieuAA merged commit b81bb09 into jhipster:master Sep 23, 2017

MathieuAA mentioned this pull request Sep 23, 2017

Revert "WIP - POC with Chevrotain Parser." #145

Merged
