Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linting the spec #173

Open
bakkot opened this issue Feb 24, 2020 · 25 comments
Open

Linting the spec #173

bakkot opened this issue Feb 24, 2020 · 25 comments

Comments

@bakkot
Copy link
Contributor

bakkot commented Feb 24, 2020

I would like to start enforcing some basic sanity checks for the spec. Some of them will end up implemented in other projects, but I would like to use this issue to track possible lint rules everywhere.

The main goal of this is to reduce editorial churn and correctness issues, but I know that it would also be helpful to people who build tooling based on the spec, like https://github.com/jscert/jsexplain (so cc @brabalan in case you have any ideas).

Currently these are mostly enforced by editors noticing them or @jmdyck submitting after-the-fact PRs, which isn't sustainable.

Here's a few possibilities to get started. Please suggest more.

  • Output is valid HTML
  • No trailing whitespace
  • In algorithms:
    • Consistent spacing, e.g. foo ( a, b ) in algorithm headers and foo(a, b) in algorithm steps.
    • Parameter names are surrounded by _
    • If an If has substeps, the If line ends with , then
    • If an If has an Else which has substeps, it is spelled Else, not Otherwise
    • If an If has an else on the same line, it is spelled ; else rather than , else or ; else, or ; otherwise (we are inconsistent about else vs otherwise here, and I am OK not enforcing it, but we should at least enforce the semicolon and lack of following space).
    • Lines end in ., :, ,, do, or then
    • Any line not ending in . is followed by an indented sequences of steps.
    • Consistent casing for variable names - currently we're very bad about this one
    • If we say "A and B" or "A, B, and C", the commas are as they are in those examples (rather than "A, and B" or "A, B and C")
    • Consistently "let x be" and "set x to" (e.g. not "let x to", "set x be", or "set x to be")
    • Ban If _foo_ is present, let _bar_ be _foo_; else let _bar_ be *undefined*. (Editorial: Treat not present parameters as undefined ecma262#1411) [edit: actually that PR only applied to functions, not AOs, so we can only do this if we first sort out the treatment of missing parameters in AO e.g. by banning optional parameters in AOs)
    • "... to a new empty List" rather than "to a new List", "to be a new List", "to be a new empty List"
      • Probably something similar for records
    • The last step should not be Return., since that's implicit (as of Editorial: describe behavior for algorithms without a "Return" ecma262#2397).
  • Grammar lookahead restrictions and flags are omitted in early error definitions and syntax-directed operations
  • In the grammar:
    • Lookahead restrictions and flags do not have spaces between brackets (so [lookahead != `let`] or [+Await] rather than [ lookahead != `let` ] or [ +Await ], etc)
    • One space before the : on the LHS
    • One space between each terminal or nonterminal on the LHS
    • Grammar flags are consistent: every ?-prefixed flag on the RHS appears as a parameter flag on the LHS, and every LHS flag is passed down at least one nonterminal in one production on the RHS, or is used to gate a production (see more about what grammar flags mean here)
  • A bunch of HTML consistency stuff:
    • tags are lowercase auto-formatter #367
    • attribute values are quoted (or unquoted; I don't care as long as we're consistent)
    • tags don't have any unexpected attributes (e.g.)
    • tags have all the expected attributes (e.g. <emu-xref="foo>" should be caught, since they meant <emu-xref href="foo">)
    • for tags for which the closing tag is optional, they are always included (or not included; again, as long as we're consistent)
    • is spelled &ge; (etc) formatter: render HTML entities #481
    • no unknown emu tags lint: some checks for legal tags/attributes #279
    • sec- is only a prefix for an ID when attached to a clause (cf Editorial: Don't use "sec-" prefix for dfn ids ecma262#2103)
    • all rows in a table have the same number of cells (account for colspan)
    • consistent indentation
    • emu-grammar and emu-alg tags are not adjacent with others of their kind
    • no more than one blank line in a row
      • and maybe consistent rules about where blank lines go, at least in some cases: e.g., never between <emu-clause> and <h1>.
    • Consistent spacing for records: exactly the spacing in { [[key]]: value, [[key2]]: value2 }
  • Consistent spelling
    • the *this* value, not *this* value or the *this* object
      • we have a lint for "this object" but not no-"the" "this value" because the latter form is still in use (generally as an argument to an AO call)
    • British vs American spelling for words where it's an issue, like "behaviour"
    • "one's complement" and "two's complement", not "1's complement" or "2's complement"
    • "uppercase" and "lowercase", not "upper case" or "upper-case" (Editorial: Standardize the spelling of "uppercase" and "lowercase" ecma262#2598)
  • Annex A ("Grammar Summary") has emu-prodrefs to all productions (or, ideally, is automatically generated)
  • As of Editorial: use consistent wording for abstract operation preambles ecma262#1914, every abstract operation has a preamble in the correct format (though, what is an abstract operation from the perspective of ecmarkup? - I guess an emu-clause with an AOID is a reasonable heuristic.)
  • When the steps or prose for a syntax-directed operation refer to the name of a nonterminal, it is surrounded with |, as in |UnicodeLeadSurrogate|.
  • In syntax-directed operations,
    • All referenced nonterminals occur in the production for the SDO.
    • In the algorithm steps or prose, opt subscripts and grammar parameters are not included.
  • Miscellaneous stuff:
    • ~~Always *+0*<sub>𝔽</sub> or *-0*<sub>𝔽</sub>, never *0*<sub>𝔽</sub>. ~~(add some lint rules for numbers #257)
    • "be the Record {", not "be a new Record {"
    • Never *+1* for any string of digits except 0.
    • Exactly one space between sentences.
    • <p> is not followed by a linebreak, and </p> is not preceded by a linebreak (even with intervening whitespace).
    • [Cc]lauses? \d should be forbidden.
    • An inline if does not have a then: If foo, return false. not If foo, then return false.
    • "ECMAScript language value", not "ECMAScript value"
  • No namespace collisions between constants and AOs (and maybe other namespaces)
  • All AOs have structured headers (once people have had time to get used to the new syntax)
  • No unnecessary explicit suppressions / annotations for can-call-user-code
  • Every algorithm returns.
  • consistently "a number or a bigint", in that order, in types: Editorial: Use consistent phrasing for parameters that are Number or BigInt ecma262#2622
  • no unused Let bindings or captured variables in AOs: Editorial: remove unused capture in %TypedArray%.prototype.sort ecma262#2836 lint for used-but-not-declared and declared-but-not-used variables in algorithms #483
  • If _x_ is *null*, return *null* rather than If _x_ is *null*, return _x_, per Editorial: Prefer returning static values after alias comparison ecma262#3122 - anywhere there's a comparison against a thing in *s or ~s, or a literal number or +/-∞, and then the alias is returned, we should return the constant instead.

It would also be nice to have a few more static-analysis-y checks, like

  • The grammar should be unambiguous
    • And LR(k)
  • Typechecking for abstract operations
    • All used operations are defined
    • They are invoked with the right number of arguments
    • They don't reference any values which are not defined lint for used-but-not-declared and declared-but-not-used variables in algorithms #483
      • And when updating an already defined value, this is done with set rather than let, increase, increment, add, etc.
    • Their return value is treated as a completion record or not as a completion record, as appropriate (pending Sorting out completion records ecma262#1796)
    • In an ideal world, actual typechecking for values
      • Given that, enforce the * vs ~ vs _ vs " rules for referring to different kinds of values
      • As a particular case, algorithms should say If _x_ is *true*, not If _x_ is true.
  • Grammar productions have all and only the appropriate flags
  • All syntax-directed operations correspond to actual productions

Edit: some of these are done in #199, #205, #207, #209, #210, and #239. I've struck them from the above list. I'm keeping this issue open to track remaining items.

@syg
Copy link
Contributor

syg commented Feb 24, 2020

All used operations are defined

This would need to be amended for AOs that are only called upstream, but at least it would force us to enumerate such AOs.

@michaelficarra
Copy link
Member

As an additional step here, I would like for us to define a bunch of "pattern matchers" of some sort that identify idiomatic spec phrasing, and use a special rendering of the spec that highlights/styles these idiomatic sections of text so that we can identify algorithm steps and whatnot that (unexpectedly) do not conform. We might even want to track/relate the idiomatic phrasing in the same way we do AO references.

@jmdyck
Copy link
Contributor

jmdyck commented Feb 25, 2020

I'll point out that ecmaspeak-py does almost all of what @bakkot listed, and more. Not that that helps ecmarkup a whole lot, but you're welcome to adapt my code.

@michaelficarra
Copy link
Member

michaelficarra commented Feb 25, 2020

@jmdyck Can you expand on the "and more"? What other things does it do that aren't listed in @bakkot's comment?

edit: I see you posted a link. I'll check out the code.

@jmdyck
Copy link
Contributor

jmdyck commented Feb 25, 2020

In addition to the things @bakkot listed, ...

analyze_spec.py checks that...

  • Only expected elements are used, and that their content conforms to expected content models.
  • re 'id' attributes:
    • There are no duplicates.
    • They conform to certain expectations (e.g., the 'id' of an emu-table should start with "table-").
    • A name isn't both an 'id' and an 'oldid'.
    • There's a defining 'id' for every id-reference.
    • Defined ids (with many exceptions) are referenced somewhere in the spec.
  • Some of the above similarly for 'aoid' attributes.
  • emu-tables with certain kinds of caption have expected column-heads.
  • The content of the well-known intrinsics table conforms to certain expectations.
  • Each %Foo% in the spec (outside the intrinsics table) resolves to one in the intrinsics table.

Section.py checks that...

  • clause-headers conform to certain expectations.
  • Within certain clauses (e.g., the properties of object Foo), sub-clauses are in 'alphabetical order'.

emu_grammars.py checks that...

  • Every emu-grammar element is syntactically well-formed.
  • Every production in a 'definition' emu-grammar "makes sense" in various ways, especially in its use of grammatical parameters.
  • The total set of defining productions makes sense as a whole.
  • Every production in a non-definition emu-grammar is a version of some defining production. (In general, it needn't be an exact copy of one, for various reasons.)
  • Outside of emu-grammar elements, references to nonterminals are well-defined.
  • (It also generates a good approximation to Annex A.)
  • (It also has in-progress code to generate a parser.)

Pseudocode.py ...

static_type_analysis.py...

@michaelficarra
Copy link
Member

@bakkot

Grammar lookahead restrictions and flags are omitted in early error definitions and syntax-directed operations

I think you mean included, right?

@bakkot
Copy link
Contributor Author

bakkot commented Mar 30, 2020

If we decide they should be universally included, sure.

@michaelficarra
Copy link
Member

@bakkot I thought that's what we decided in the editor call. We even have it on our editor update slides.

@bakkot
Copy link
Contributor Author

bakkot commented Mar 30, 2020

I only remembered that we were vaguely in support of it, not that we necessarily intended to actually change it. But if we did, sure, sounds good.

@michaelficarra
Copy link
Member

michaelficarra commented Mar 30, 2020

I'm not good at remembering those sorts of things without notes. We can confirm at the next call.

edit: In a later editor call, we did in fact confirm that universal inclusion is desired.

@rbuckton
Copy link
Contributor

rbuckton commented Apr 3, 2020

emu_grammars.py checks that...

  • Every emu-grammar element is syntactically well-formed.
  • Every production in a 'definition' emu-grammar "makes sense" in various ways, especially in its use of grammatical parameters.
  • The total set of defining productions makes sense as a whole.
  • Every production in a non-definition emu-grammar is a version of some defining production. (In general, it needn't be an exact copy of one, for various reasons.)
  • Outside of emu-grammar elements, references to nonterminals are well-defined.
  • (It also generates a good approximation to Annex A.)
  • (It also has in-progress code to generate a parser.)

grammarkdown already does a lot of this, ecmarkup just doesn't employ those features.

@rbuckton
Copy link
Contributor

rbuckton commented Apr 3, 2020

Feel free to file bugs against grammarkdown for anything you think needs to be added to the parser, checking, etc.: https://github.com/rbuckton/grammarkdown

@michaelficarra
Copy link
Member

Nesting more than one AO invocation as parameters to another AO should be disallowed. See tc39/ecma262#1573 (comment)

@ljharb
Copy link
Member

ljharb commented Apr 6, 2020

To confirm, it seems like they should be disallowed only because the current macro doesn't handle them, not because they're inherently something we don't want?

@bakkot
Copy link
Contributor Author

bakkot commented Apr 7, 2020

The issue there isn't about the macro, it's the fact that there's no evaluation order for the arguments defined, and when they're abstract operations they can have side effects which must be specified to happen in a specific order.

@michaelficarra
Copy link
Member

@michaelficarra
Copy link
Member

In algorithm steps, each alias should only be introduced with Let at most once for any possible trace through the algorithm.

@bakkot
Copy link
Contributor Author

bakkot commented May 10, 2020

In algorithm steps, all aliases should be introduced using Let or with a parameter list before they are used.

There are other ways we introduce aliases. Other than those Let and parameter lists (including parameter lists of abstract closures), we also have various permutations on:

  • For each _x_ (For each element _k_ of _keys_, For each integer _k_ that satisfies ..., etc)
  • Evaluate |Foo| to obtain _x_, which is used only in regexes and is complicated by the fact that some of the regex operations return multiple results
  • If there exists _x_ (e.g.)
  • the smallest possible integer _k_ not smaller than (e.g.)
  • For yield we say something like when evaluation is resumed with a Completion _resumptionValue_ the following steps will be performed, which introduces _resumptionValue_ for use in those steps
  • For MakeDay we say Find a value _t_ such that

Also, we occasionally refer to aliases from other algorithms. For example, the Function constructor says Let _args_ be the _argumentsList_ that was passed to this function by [[Call]].

Also also, some places we have algorithms (especially but not uniquely in Annex B, e.g. in the Note at the end of Number.prototype.toExponential) we're actually defining replacements to existing steps in to other algorithms, in which case it is not possible to locally determine which aliases are in scope.

@bakkot
Copy link
Contributor Author

bakkot commented Jun 14, 2020

Also, the things which are currently console.log warnings should become linting errors, ideally.

Edit: done in #229.

@bakkot
Copy link
Contributor Author

bakkot commented Jul 14, 2020

We should figure out if we want to use html entities for non-ASCII stuff, and enforce that decision. Currently we use both &laquo; and «.

@ljharb
Copy link
Member

ljharb commented Jul 14, 2020

Since I'm on a Mac, it's easy to type the actual characters, so I'd prefer that; but if others editing the spec aren't, it might not be as easy.

@michaelficarra
Copy link
Member

It's 2020, embrace the Unicode.

@NotWearingPants
Copy link

We should figure out if we want to use html entities for non-ASCII stuff, and enforce that decision.

Why not use ASCII quotes and convert automatically in build time? Here is how markdown-it does this.

If an If has an else on the same line, it is spelled ; else rather than , else or ; else, or ; otherwise

Instead of enforcing consistency, how about adding tags for this, e.g.

1. <emu-if><cond>x is 1 and y is 1</cond><then>return 1</then></emu-if>
1. <emu-else>return 0</emu-else>

and

1. <emu-if-multiline><cond>_exponent_ is *+&infin;*<sub>𝔽</sub></cond><then>
  1. <emu-if><cond>abs(ℝ(_base_)) &gt; 1</cond><then>return *+&infin;*<sub>𝔽</sub></then></emu-if>
  1. <emu-if><cond>abs(ℝ(_base_)) is 1</cond><then>return *NaN*</then></emu-if>
  1. <emu-if><cond>abs(ℝ(_base_)) &lt; 1</cond><then>return *+0*<sub>𝔽</sub></then></emu-if>
</then></emu-if-multiline>
1. <emu-else-multiline>...</emu-else-multiline>

This would allow enforcing a consistent style, would also allow authors to not have to remember where commas/semicolons go, and would also let you change the style across the entire document anytime (after all ifs have been converted of course).

@jmdyck
Copy link
Contributor

jmdyck commented May 6, 2021

Instead of enforcing consistency, how about adding tags for this

The ecmarkdown page says:

Some of Ecmarkdown's biggest benefits are when using it to write algorithm steps, without the many formalities HTML requires for list items and inline formatting of common algorithmic constructs.

Adding tags as you suggest would make algorithms harder to write + edit (and read, in source).

@michaelficarra
Copy link
Member

When For each iterates over the elements of a List, we say "of", not "in".

tc39/ecma262#3382 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants