Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JavaScript] New structure for expressions #1009

Merged
merged 55 commits into from
Nov 8, 2017

Conversation

Thom1729
Copy link
Collaborator

Background

In order to correctly parse JavaScript expressions, you must always know whether you are at the beginning or end of an expression. Examples:

/abc/ /def/ /ghi/;
// <- string.regexp
//    ^ keyword.operator.arithmetic
//     ^^^ variable.other.readwrite
//          ^ keyword.operator.arithmetic
//          ^^^^^ string.regexp

x // expression continues
/abc/ /def/ /ghi/ y;
// <- keyword.operator.arithmetic
//^^ variable.other.readwrite
//  ^ keyword.operator.arithmetic
//    ^^^^^ string.regexp
//          ^ keyword.operator.arithmetic
//           ^^^ variable.other.readwrite
//              ^ keyword.operator.arithmetic

{} // semicolon not required
// <- meta.block

+ // expression continues
{};
// <- meta.object-literal

function foo() {} // semicolon not required after function declaration
/42/;
// <- string.regexp

+ // expression continues
function foo() {} // expression continues
/42/ z;
// <- keyword.operator.arithmetic

All of these cases can be handled exactly correctly by sublime's parsing engine. This is not a case like arrow functions where perfect parsing is impossible due to engine limitations. The premise for this work is that the JavaScript syntax therefore should handle these distinctions perfectly.

Current state

The existing syntax tries to maintain these distinctions, but it does not do so systematically. For instance, the distinction between division and regexps is in many cases determined using a lookahead. This approach is buggy (#874, #885, #894).

In addition, the distinction is not made in some cases where it may be desirable. For instance, the syntax does not distinguish between array literals and bracketed property accesses (#324):

[item];
// <- meta.brackets

x
[prop];
// <- meta.brackets

As a result, it is impossible to distinguish the rare comma operator from the more common comma separator punctuation (#831). An implementation of the distinction between arrays and property accesses has been attempted, but that implementation is vulnerable to the same sorts of bugs as the regex/division implementation and for the same reason (https://github.com/jcberquist/Packages/commit/3240bcb5ea4162e5acc2a4a8c0efbc93adde9088).

Since the current syntax was adopted for the core package, there have been a number of tweaks and patches to this behavior. Most notable is the addition of the after-operator and after-identifier contexts (62d142a). These contexts attempt to maintain the critical distinction between the beginning and end of an expression. They do so in a way that is relatively noninvasive. Unfortunately, because they are noninvasive, they do not and cannot solve the problem in the general case. While these contexts are a substantial improvement, bugs remain, and so do various patches that attempt to handle regexp corner cases.

The solution

The core principle of the solution is that there must be separate states for the beginning and end of an expression. That is, when parsing x / y;, the parser must move back and forth between a state that expects constructs like numbers and regexps and a state that expects constructs like binary operators. When the parser is in one state, a slash will be interpreted as the beginning of a regexp, and when it is in the other state, it will be interpreted as the division operator.

The after-operator and after-identifier contexts represent such a system, but as implemented they are insufficient. The entire expression syntax must be written in this tick-tock fashion; it cannot be bolted on or it will not work in the general case. On the other hand, if expressions are designed in this way from the ground up, then these knotty syntax problems become simple and no patches or special cases will be needed.

After much trial and error, I have come up with a framework that handles this efficiently and effectively. To demonstrate this framework, here is a toy example:

%YAML 1.2
---
name: Toy Syntax
scope: source.toy

contexts:
  main:
    - match: ';'
      scope: punctuation.terminator
    - match: (?=\S)
      push: [ expression-end, expression-begin ]

  expression-begin:
    - match: \d+
      scope: constant.numeric
      pop: true

    - match: /
      set:
        - meta_scope: string.regexp
        - match: /
          pop: true

  expression-end:
    - match: (?=;)
      pop: true
    - match: '[-+*/]'
      scope: keyword.operator.arithmetic
      push: expression-begin

This example has numeric literals, regexp literals, and the four common arithmetic operators. It will perfectly distinguish between division and regexps. (Try /abc/ /def/ /ghi/;, and add line breaks wherever you please.)

There are several key points that deserve explanation.

First, the expression contexts parse a single expression. The contexts are highly stack-oriented, pushing and popping after nearly every match. In the end, when a complete expression has been consumed, the contexts will both pop off the stack. This means that whatever parent context uses expressions must push them rather than including them.

Second, they parse a single expression. Terminating semicolons are not part of an expression. Nor are JavaScript constructs like throw and return. Nor, in JavaScript, are function declarations. This is critical: a function declaration is a complete statement that does not need to be terminated by a semicolon, but an expression does need termination. The examples at the top include an illustration of this distinction.

Third, we are pushing and popping rather than setting back and forth between expression-begin and expression-end. There are several reasons for this. One is that in this scheme, expression-begin does not have to refer explicitly to expression-end; it merely pops to get there. This is very important, because it allows us to have specialized versions of expression-end. In JavaScript, I use one of these for expressions that may not contain commas at top level (such as for function arguments, list items, or lambda concise body expressions). Another is used to implement the automatic semicolon insertion algorithm for expression statements. If set were used instead of push and pop, then we would need duplicate copies of expression-begin. Compare Ecmascript Sublime, which uses many redundant almost-copies of contexts and in doing so exceeds 5,000 lines.

Fourth, every contexts that wants expressions does have to push both expression-end and expression-begin in the correct order. However, this can be easily shortcut for convenience:

  main:
    - match: ';'
      scope: punctuation.terminator
    - match: (?=\S)
      push: expression

  expression:
    - match: (?=\S)
      set: [ expression-end, expression-begin ]

Fifth, we don't have any error handling. This is easily remedied in practice; each of expression-begin and expression-end should pop if no match succeeds. In practice, I have found pushing and popping in this manner to be far more resilient to invalid input than setting.

This toy example does little, but I hope that it can be seen how this framework can trivially distinguish blocks from objects, arrays from property accesses, function invocations from parenthesized expressions.

Implementation

In order to implement the above approach in the existing JavaScript syntax, I made a number of changes.

Zeroth, I moved comment logic to a prototype (#1007). This cleaned up the other scopes and fixed at least one bug.

First, I separated statements from expressions by handling semicolons in statements and popping expressions whenever one was encountered.

Second, I moved non-expression constructs like variable declarations, throw, and so forth from expressions to statements. In the process, I implemented restricted productions in a general way and removed the existing implementations (that failed in some cases).

Third, I implemented "expression statements", which are basically just expressions that handle automatic semicolon insertion. I added class and function declarations to statements.

Fourth, I made the large change of implementing the two-level expression framework. I:

  • Created the expression-begin and expression-end contexts, and made expressions set them.
  • Split the former contents of expressions between them as appropriate.
  • Caused those contents to push or pop as appropriate.
  • Broke out unary-operators into prefix-operators and postfix-operators.
  • Broke out round-brackets into round-brackets and parenthesized-expression.
  • Broke out square-brackets into square-brackets and bracketed-property-access.
  • Added expression-no-comma for various uses.
  • Removed after-identifier and after-operator.
  • Changed various places that had included expressions to instead push it.
  • Fixed various tests that specified incorrect behavior.

Fifth, I renamed expressions to singular expression, reflecting its new behavior.

As a result of these changes, expressions universally use the two-context framework. All tests currently pass, including some that needed modification because the previous behavior was incorrect. Issues #874, #885, and #894 are fixed.

Next steps

I write this PR early, as soon as the change is fundamentally implemented and before refactoring clutters the diffs. More work should be done to clean up the syntax and take advantage of the new expressions framework:

  • Most of the match anything/push expressions can and should be refactored to something cleaner, generally by using the stack more.
  • The match anything/pop behavior should probably be factored out and included.
  • Much of the lookahead logic for constructs like x = function() {}; should be refactored for simplicity. I would not be surprised if bugs were hiding there after the big change.
  • Arrays should be rescoped ([JavaScript] possible enhancement: distinguish uses of square brackets #324). This is a trivial change.
  • The line continuation lookahead probably has bugs and should be carefully reviewed.

I anticipate that these changes will decrease the size and complexity of the syntax as well as likely fixing bugs. In addition, I see opportunities for future improvement:

  • Better scoping for the inside of an object literal.
  • Specific-enough meta scopes that we may be able to provide lexically-aware completions (e.g. all variables in lexical scope).
  • Easy JSX support. (It would be simple enough to provide versions both with and without JSX. The key difference would be a single include in expression-begin.)
  • ([JavaScript] Should the comma operator be highlighted as an operator? #831) Marking the comma operator as an operator (leaving comma punctuation as punctuation).

For now, I would appreciate feedback on the general approach and the specific implementation. I am aware that this is a major change to the structure of the syntax and that we will certainly want to add more tests. I will be available to fix any bugs that may come out of it in the forseeable future.

@wbond
Copy link
Member

wbond commented Jun 8, 2017

It seems in the process of merging your other PRs, this one got conflicted. Would you mind resolving the conflict for me?

@Thom1729
Copy link
Collaborator Author

Thom1729 commented Jun 8, 2017

I'll try to resolve the conflicts and wrap this up by the end of Saturday. All tests pass (at least before the recent merges), but there are a couple of oddities in the test file that aren't covered.

@Thom1729
Copy link
Collaborator Author

That's probably correct; I'll verify after lunch. The only new change is fixing a test that had a missing semicolon.

@Thom1729
Copy link
Collaborator Author

That should be it, I think. I've fixed the remaining weirdness; it turns out there was a tiny regex typo that interacted badly with the new structure.

@wbond
Copy link
Member

wbond commented Oct 18, 2017

Can you check and see what the filesize of the {data}/Cache/JavaScript/JavaScript.sublime-syntax.rcache file is on your machine?

@wbond
Copy link
Member

wbond commented Oct 18, 2017

There appear to be 29 anonymous nested contexts. I don't think that is too much to maintain. I'd much rather that than layer a macro system over the whole thing.

@Thom1729
Copy link
Collaborator Author

31 KB (versus 78 KB for the original). The new version also takes about 58 ms on my machine to parse the complete jQuery source, down from 79 ms.

Myself, I find the anonymous contexts to be easier to read, but then I've been staring at this file for an awfully long time and become accustomed to its quirks. If you think it looks better with named contexts instead, I can certainly do that.

@Thom1729
Copy link
Collaborator Author

I've removed all nested lists from the syntax.

@rwols
Copy link
Contributor

rwols commented Oct 19, 2017

The new version also takes about 58 ms on my machine to parse the complete jQuery source, down from 79 ms.

I've removed all nested lists from the syntax.

What can you report about the performance now?

@wbond
Copy link
Member

wbond commented Oct 20, 2017

@rwols The change was strictly syntax, so there should be a 0% change in performance.

@Thom1729
Copy link
Collaborator Author

Thom1729 commented Nov 8, 2017

@wbond Have you had a chance to take a look at the syntax now that the nested lists are gone? I'm anxious to get to work on #1269, but I don't want to pile yet more changes on top of an unmerged PR.

Copy link
Member

@wbond wbond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall things look good for a merge. I'm sure there are some things that could be tweaked, but I'm confident we'll get feedback if so. I like it that the tests changed minimally.

@@ -198,8 +234,12 @@ not_a_comment;
'// /* not a comment'() {},
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -comment() {}

'// /* not a comment'() {},
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -comment() {}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different than the assertions right above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It almost certainly isn't. Looks like it was added four months ago, probably by mistake.

@wbond wbond merged commit 00305fc into sublimehq:master Nov 8, 2017
@wbond
Copy link
Member

wbond commented Nov 8, 2017

Thanks for all of your work on this @Thom1729, and thanks for the reviews @rwols and @jcberquist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants