Add @injection.shebang for setting the language in injections.scm #26939

Closed
tgross35 opened this issue Jan 7, 2024 · 5 comments

tgross35 commented Jan 7, 2024

Problem

Parsing shebangs in tree-sitter is possible, but extremely difficult to get correct without a fairly complex external scanner. Being able to do this is useful for nested languages if a direct injection specifier isn't available.

Expected behavior

Helix provides an @injection.shebang capture: the editor parses the captured shebang and extracts the language from it. Adding this to treesitter would be great!

Docs: https://docs.helix-editor.com/guides/injection.html#capture-types. Related discussion: helix-editor/helix#3970

Based on a quick search, they use this for Nix, Markdown (as a fallback), and Typst.

A field could be added to parsers.lua for common shebanged languages, but a simple fallback that extracts #!\S*bin\S*[/ ](?P<lang>[^-'"]\S*) would probably work for most cases.
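
As a rough illustration of the fallback idea, here is a minimal Python sketch that applies the regex above to a first line. The function name `shebang_language` is hypothetical and is not part of Neovim, Helix, or tree-sitter; Python is used only because the pattern is written with Python-style `(?P<lang>...)` named groups.

```python
import re

# The fallback pattern quoted in the issue, unchanged.
SHEBANG_RE = re.compile(r"""#!\S*bin\S*[/ ](?P<lang>[^-'"]\S*)""")


def shebang_language(first_line):
    """Return the text captured as `lang`, or None if the line does not match.

    Hypothetical helper for illustration only.
    """
    match = SHEBANG_RE.match(first_line)
    return match.group("lang") if match else None


print(shebang_language("#!/usr/bin/env python3"))    # python3
print(shebang_language("#!/bin/bash"))               # bash
print(shebang_language("print('no shebang here')"))  # None
```

One limitation worth knowing: for `#!/usr/bin/env -S deno run` this pattern captures `env` rather than `deno`, because the `[^-'"]` guard rejects `-S` and the match backtracks to the path component after `bin/`.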

I originally opened this in nvim-treesitter but I learned that this is the correct repo.

I also opened an issue for upstream tree-sitter: tree-sitter/tree-sitter#2851

Member

clason commented Jan 7, 2024

I think that's way too niche. What would be the benefit of having this as a general "magic capture" rather than an explicit, language-specific directive (like we already support)?

Author

tgross35 commented Jan 7, 2024

It's strictly for convenience and reusability for any file types with embedded code. Parsing the shebang to extract the right information can be tricky, and it doesn't make much sense to repeat its logic and internal grammar across a variety of file formats.

Conversely, it is easy to find a simple #! and add an @injection.shebang to markdown, yaml, nix, just, toml, typst, tex, etc. tree-sitter also has the first-line-regex field to help, but I don't believe this can be accessed from the .so.

But yes, it is just a convenience. Maybe it would be better to have a regex #replace? predicate that is more generally useful.
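
To make the "tricky" part concrete, here is a minimal Python sketch of token-based extraction that also handles `env` wrappers. It is an assumption about what editor-side logic could look like, not how Helix or Neovim actually implement it; the name `shebang_executable` is hypothetical.

```python
import posixpath
import shlex


def shebang_executable(first_line):
    """Return the interpreter name from a shebang line, or None.

    Hypothetical helper for illustration only.
    """
    if not first_line.startswith("#!"):
        return None
    try:
        words = shlex.split(first_line[2:])
    except ValueError:
        # Unbalanced quotes: treat the line as unrecognized.
        return None
    if not words:
        return None

    name = posixpath.basename(words[0])
    if name != "env":
        return name

    # After `env`, skip options such as -S and VAR=value assignments.
    # Known gap: options that take a separate argument (e.g. `-u NAME`)
    # are not handled and would return the argument instead.
    for word in words[1:]:
        if word.startswith("-") or "=" in word:
            continue
        return posixpath.basename(word)
    return None


print(shebang_executable("#!/usr/bin/env -S deno run"))  # deno
print(shebang_executable("#! /usr/bin/perl -w"))         # perl
```

Even this short version needs special cases for `env`, its flags, and quoting, which is the logic that would otherwise be repeated in each grammar.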

Member

clason commented Jan 7, 2024

This just shifts the burden from the language maintainers to the Neovim devs without any net reduction or performance gain. Also note that there could be variations among file formats, so the generality is dubious. (For example, TeX has a very different "shebang".)

> Maybe it would be better to have a regex #replace? predicate that is more generally useful.

We already have that: #gsub!. (Not everyone likes that, though; having that standardized upstream, like #match?, would go some way.)

Author

tgross35 commented Jan 7, 2024

Fair enough, thanks for the feedback

> We already have that: #gsub!. (Not everyone likes that, though; having that standardized upstream -- like #match? would go some way.)

Perhaps nvim-treesitter/nvim-treesitter#3944 will help with this at some point

tgross35 closed this as completed Jan 7, 2024
tgross35 closed this as not planned Jan 7, 2024
Author

tgross35 commented Jan 8, 2024

In lieu of this, does anyone have an idea for how to reliably extract the executable from a shebang in TS? I have tried various variants of this slightly messy grammar:

    shebang: ($) =>
      prec.left(
        seq(
          choice($.shebang_executable, field("unrecognized_shebang", /#!.*/)),
          optional($._newline),
        ),
      ),

    shebang_executable: ($) =>
      token.immediate(seq(
        "#!",
        /\S*[/ ]/,
        field("cmd", /\S+/),
        /.*/,
      )),

This is an attempt to extract the capture group from #!\S*bin\S*[/ ](?P<lang>\S+).*. Unfortunately, I cannot query anything of the form

(shebang_executable
    cmd: _)

since TS complains that cmd is an invalid field. And I can't make it a named node because you can't have nonterminal nodes within token.immediate (tree-sitter/tree-sitter#474).

My fallback is to generate two different injections.scm files that use either #gsub! or @injection.shebang, but it would be nice to be cross-editor somehow without a handwritten scanner.
