Syntax highlighting splits unicode characters #5069

eric-wieser · 2020-10-26T10:36:10Z

Preliminary Steps

Please confirm you have...

reviewed How Linguist Works,
reviewed the Troubleshooting docs,
considered implementing an override,
verified an issue has not already been logged for your issue (linguist issues).

Problem Description

Syntax highlighting is splitting a unicode codepoint into two garbled halves:

https://github.com/leanprover-community/mathlib/blob/d4477fa7f79beea1058f72fc3741c88a1832d9a1/src/group_theory/ore_localization.lean#L29

To eliminate browser interference, you can reproduce the issue with

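import requests

# Fetch the rendered page as raw bytes so that no browser decoding is involved.
w = requests.get("https://github.com/leanprover-community/mathlib/blob/d4477fa7f79beea1058f72fc3741c88a1832d9a1/src/group_theory/ore_localization.lean#L29")
c = w.content
# Print the 65 bytes starting at the affected declaration.
i = c.find(b'has_coe_to_sort')
print(c[i:][:65])
# For comparison, print the UTF-8 bytes of the opening angle bracket.
print('⟨'.encode('utf8'))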

which prints

b'has_coe_to_sort (ore_set M) := \xe2\x9f<span class="pl-k">\xa8Type</span>*'
b'\xe2\x9f\xa8'

\xe2\x9f\xa8 is the utf8 encoding of ⟨, which has somehow ended up with a span tag right in the middle of it. ~~The hypothesis over at https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/github.20syntax.20highlighting is that some kind of chunking is going on in the file.~~
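One way to check rendered output for this kind of damage is to scan the raw bytes for a tag that opens while a UTF-8 sequence is still incomplete. The following is only a sketch: the helper names `expected_continuations` and `find_split_codepoints` are made up for illustration, and the two sample byte strings are shortened from the output above (`intact` is an assumption about what a correct rendering would look like).

```python
def expected_continuations(lead: int) -> int:
    """Number of continuation bytes that should follow a UTF-8 lead byte."""
    if lead >> 5 == 0b110:
        return 1
    if lead >> 4 == 0b1110:
        return 2
    if lead >> 3 == 0b11110:
        return 3
    return 0  # ASCII, or a stray continuation byte


def find_split_codepoints(html: bytes) -> list:
    """Return byte offsets where '<' interrupts a multi-byte UTF-8 sequence."""
    offsets = []
    pending = 0  # continuation bytes still owed by the current codepoint
    for i, byte in enumerate(html):
        if pending and byte == ord("<"):
            offsets.append(i)  # a tag opened in the middle of a codepoint
            pending = 0
        elif pending and byte >> 6 == 0b10:
            pending -= 1  # a valid continuation byte
        else:
            pending = expected_continuations(byte)
    return offsets


# Shortened from the output printed above.
broken = b'(ore_set M) := \xe2\x9f<span class="pl-k">\xa8Type</span>*'
# Hypothetical correct rendering, with the tag after the whole codepoint.
intact = b'(ore_set M) := \xe2\x9f\xa8<span class="pl-k">Type</span>*'

print(find_split_codepoints(broken))  # [17]
print(find_split_codepoints(intact))  # []
```

The scan works on bytes rather than decoded text because the damaged output is no longer valid UTF-8, so decoding it strictly would fail before the split could be located.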

An even simpler reproduction is https://gist.github.com/eric-wieser/caef77bc87edc0feae06bd91b0d241f2/756f85e2f06618ef2b7261e7ec3fca0aa0d73e2f:

#check ⟨Type⟩

While I know that the "How Linguist Works" page says that highlighting issues belong in upstream repos, this looks like a highlighting issue with how the grammar files themselves are processed.

URL of the affected repository:

As above, https://github.com/leanprover-community/mathlib/blob/d4477fa7f79beea1058f72fc3741c88a1832d9a1/src/group_theory/ore_localization.lean#L29

Last modified on:

lildude · 2020-10-26T13:51:58Z

🤔 I think this really is a problem with the grammar rather than with the way grammars are processed: the old Lean TextMate grammar we used to use works as expected with exactly the same parsing/processing:

Old Textmate grammar: lightshow results

... versus ...

Current VSCode grammar: lightshow results

We switched grammars in #4546.

eric-wieser · 2020-10-26T13:58:09Z

Thanks for doing that test.

What does the conversion between the various types of grammar file? Could it be a bug that applies only to json grammars (https://github.com/leanprover/vscode-lean/blob/master/syntaxes/lean.json) and not to tmLanguage grammars (https://github.com/leanprover/Lean.tmbundle/blob/master/Syntaxes/Lean.tmLanguage)?

lildude · 2020-10-26T14:40:36Z

What does the conversion between the various types of grammar file?

There are two, but in this case it's an internal library called PrettyLights, which is heavily based on TextMate's grammar processing. It's not open source because of various licensing requirements.

The other is the grammar processor in this repo which produces the JSON files used by PrettyLights in production. These files are attached to each release.

Could it be a bug that applies only to json grammars and not to tmLanguage grammars

It's possible, but I don't think so as ultimately we convert all grammars to JSON.

I've taken the source.lean.json file from v7.5.0, which still uses the TextMate grammar, and the same file from v7.11.1, which uses the new VSCode grammar. Both were produced by the grammar compiler in this repo, and I've placed them in this gist. You can see the problem stays with the newer grammar:

Old grammar lightshow results
New grammar lightshow results

I've taken a look at the history of the new grammar and I can see it switched from TextMate to JSON in this commit. Using that JSON file produces the same syntax highlighting as the old TextMate grammar, as can be seen here, so I think this confirms this is definitely an issue with the grammar itself.

I'm not very good with writing grammars, but I know @Alhadis is quite the dab hand; he may be able to spot where things are going wrong in the current version of the grammar.

eric-wieser · 2020-10-26T15:02:59Z

Thanks for the further investigation. I think you're right that the tmLanguage vs json format is irrelevant, but I still think this is a bug in the grammar application, not the grammar itself. This extremely minimal example fails:

{
  "name": "Lean",
  "scopeName": "source.lean",
  "patterns": [
    {
      "name": "storage.type.lean",
      "match": "\\b(Prop|Type|Sort)\\b"
    }
  ]
}

https://gist.github.com/eric-wieser/c5a9efea2581d65fda99ec2816177fde

https://github-lightshow.herokuapp.com/?utf8=%E2%9C%93&scope=from-url&grammar_format=auto&grammar_url=https%3A%2F%2Fgist.github.com%2Feric-wieser%2Fc5a9efea2581d65fda99ec2816177fde%2Fraw%2F58eb1e47a0086a3d5ab405ebeb4cd909ba7dc59e%2Flean-tiny.json&grammar_text=&code_source=from-text&code_url=&code=x+%E2%9F%A8Type%E2%9F%A9
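For reference, here is a small sketch of where a correct match of that pattern should begin and end on the test input `x ⟨Type⟩`, in both character and byte offsets. It uses Python's `re` module, which is not the engine PrettyLights uses, so it only shows the expected result; the helper name `match_offsets` is made up for illustration.

```python
import re

# The pattern from the minimal grammar above.
PATTERN = re.compile(r"\b(Prop|Type|Sort)\b")


def match_offsets(line: str):
    """Return the first match as character and UTF-8 byte offsets, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    char_span = m.span()
    # Convert each character offset to a byte offset by encoding the prefix.
    byte_span = tuple(len(line[:pos].encode("utf-8")) for pos in char_span)
    return {"chars": char_span, "bytes": byte_span}


line = "x ⟨Type⟩"
result = match_offsets(line)
print(result)

encoded = line.encode("utf-8")
start, end = result["bytes"]
print(encoded[start:end])      # the bytes a correct span would wrap
print(encoded[start - 1:end])  # the same span opened one byte too early
```

A span opened one byte too early wraps `\xa8Type`, the same bytes found inside the span tag in the original report. That is consistent with an off-by-one in the start offset but does not prove it, since the highlighter's internals are not visible here.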

Alhadis · 2020-10-26T15:25:18Z

Okay, this is weird. leanprover/vscode-lean@deb64b0 appears to be the commit that broke the syntax highlighting, according to the Lightshow results:

leanprover/vscode-lean@8690a57 (Commit prior to deb64b0)
leanprover/vscode-lean@deb64b0

However, that doesn't make sense, because the commit in question only added a single keyword to an unrelated pattern:

-     "match": "\\b(Prop|Type)\\b",
+     "match": "\\b(Prop|Type|Sort)\\b",

I agree with @eric-wieser; I don't believe the grammar file is to blame.

eric-wieser · 2020-10-26T15:27:36Z

@Alhadis is spot on - I found that if I removed either Prop| or |Sort from the example in #5069 (comment), everything works.

lildude · 2020-10-26T15:39:50Z

🤔 Interesting. Good work peeps. I've reached out to the internal maintainers of the syntax highlighter to get some 👀 on this.

eric-wieser · 2021-06-09T08:52:40Z

Did the internal maintainers of the "PrettyLights" syntax highlighter make any progress on this?

lildude · 2021-06-09T09:09:35Z

No, and they're not likely to in the near future, as work is prioritized on the replacement for PrettyLights, which uses Tree-sitter-based grammars. There are no plans at the moment to allow Linguist to supply Tree-sitter-based grammars, but this might become a possibility in the future.

lildude · 2022-11-24T12:00:05Z

Closing as "won't fix" as there is no more funding for the ancient PrettyLights highlighter, so this will never be fixed for TextMate-based grammars.

eric-wieser · 2022-12-15T01:08:29Z

Is there an alternative to TextMate-based grammars, one that uses the new system, that we could contribute?

Alhadis · 2022-12-15T01:40:57Z

@eric-wieser Yes, but it requires external tooling to generate a several-megabyte C file from a weird-looking dialect of Scheme (that you need to track with version control…). However… I'm looking through Lean's grammar now, and I notice you're not tokenising the Unicode brackets in ⟨Type*, λ S, S.carrier⟩. Try adding this to lean.json:

--- syntaxes/lean.json	2022-12-15 12:22:52.000000000 +1100
+++ grammars/lean.json	2022-12-15 12:36:21.000000000 +1100
@@ -76,11 +76,18 @@
     },
     { "match": "\\b(?<!\\.)(variable|variables|parameter|parameters|constants)(?!\\.)\\b",
       "name": "keyword.other.lean"
     },
+    { "include": "#brackets" },
     { "include": "#expressions" }
   ],
   "repository": {
+    "brackets": {
+      "patterns": [
+        {"match": "⟨", "name": "punctuation.definition.bracket.angle.begin.lean"},
+        {"match": "⟩", "name": "punctuation.definition.bracket.angle.end.lean"}
+      ]
+    },
     "expressions": {
       "patterns": [
         { "match": "\\b(Prop|Type|Sort)\\b", "name": "storage.type.lean" },
         { "match": "\\b(sorry)\\b", "name": "invalid.illegal.lean" },

You can and should scope all significant punctuation like brackets, operators, separators (etc) as punctuation, as it greatly improves readability in themes that style these characters with less distracting colours (notice the dark grey parentheses in Seti's highlighting, for example). Aside from readability and aesthetics, this should solve your issue with character splitting (as the Unicode characters will have already been tokenised).
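To illustrate what "already tokenised" means, here is a toy Python model of rule matching with the bracket rules listed ahead of the type rule, as in the diff above. It is a deliberate simplification (earliest match wins, ties go to the earlier rule) and not PrettyLights or a full TextMate engine. The `RULES` table and `tokenize` function are hypothetical names, and whether the change actually avoids the split on GitHub is not verified here.

```python
import re

# Rules in priority order: brackets first, then the type keywords.
RULES = [
    ("punctuation.definition.bracket.angle.begin.lean", re.compile("⟨")),
    ("punctuation.definition.bracket.angle.end.lean", re.compile("⟩")),
    ("storage.type.lean", re.compile(r"\b(Prop|Type|Sort)\b")),
]


def tokenize(line: str) -> list:
    """Split a line into (scope, text) pairs; scope is None for plain text."""
    tokens = []
    pos = 0
    while pos < len(line):
        best = None
        for scope, pattern in RULES:
            m = pattern.search(line, pos)
            # Keep the earliest match; on a tie the earlier rule stays.
            if m and (best is None or m.start() < best[1].start()):
                best = (scope, m)
        if best is None:
            tokens.append((None, line[pos:]))  # nothing left to match
            break
        scope, m = best
        if m.start() > pos:
            tokens.append((None, line[pos:m.start()]))  # unscoped gap
        tokens.append((scope, m.group(0)))
        pos = m.end()
    return tokens


for scope, text in tokenize("#check ⟨Type⟩"):
    print(repr(text), scope)
```

In this model each bracket becomes a token of its own, so the span for `Type` starts at a position that a previous token has already fixed.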

lildude added Bug Syntax Highlighting labels Oct 28, 2020

lildude closed this as not planned Nov 24, 2022

lildude mentioned this issue May 5, 2023

TOML syntax highliting breaks with strings longer than 1024 characters #6402

Closed

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
