Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reintroduce header patterns for filetype detection #3208

Merged
merged 12 commits into from
Mar 25, 2024

Conversation

dmaluka
Copy link
Collaborator

@dmaluka dmaluka commented Mar 24, 2024

Replacing header patterns with signature patterns in #2819 was a mistake, since both are quite different from each other, and both have their uses. In fact, this caused a serious regression: for such files as shell scripts without *.sh extension but with #!/bin/sh inside (and other similar common use cases that have no filename matches and thus were relying on header matches), filetype detection does not work at all anymore.

Since both header and signature patterns are useful, reintroduce support for header patterns while keeping support for signature patterns as well, and make both work nicely together.

Also, unlike in the old implementation (before signatures were introduced), ensure that filename matches take precedence over header matches, i.e. if there is at least one filename match found, all header matches are ignored. This makes the behavior more deterministic and prevents previously observed issues like #2894 and #3054: wrongly detected filetypes caused by some overly general header patterns.

Precisely, the new behavior is:

  1. if there is at least one filename match, use filename matches only
  2. if there are no filename matches, use header matches
  3. in both cases, try to use signatures to find the best match among
    multiple filename or header matches

Changed signature patterns back to header patterns in almost all syntax files, except C++ and Objective-C (for which signature was actually introduced).

Also done a bit of refactoring and bugfixing of the code in UpdateRules() (see commit messages for the details).

Fixes #3201

The original meaning of foundDef was: "we already found the final syntax
definition in a user's custom syntax file". After introducing signatures
its meaning became: "we found some potential syntax definition in a
user's custom syntax file, but we don't know yet if it's the final one".
This makes the code confusing and actually buggy.

At least one bug is that if we found some potential filename matches in
the user's custom syntax files, we don't search for more matches in the
built-in syntax files. Which is wrong: we should keep searching for as
many potential matches as possible, in both user's and built-in syntax
files, to select the best one among them.

Fix that by restoring the original meaning of foundDef and updating the
logic accordingly.
No need to parse a syntax YAML file if we are not going to use it,
it's a waste of CPU cycles.
As a preparation for reintroducing header matches.
Replacing header patterns with signature patterns was a mistake, since
both have their own uses. So restore support for header regex, while
keeping support for signature regex as well.
Replacing header patterns with signature patterns was a mistake, since
both are quite different from each other, and both have their uses. In
fact, this caused a serious regression: for such files as shell scripts
without *.sh extension but with #!/bin/sh inside, filetype detection
does not work at all anymore.

Since both header and signature patterns are useful, reintroduce support
for header patterns while keeping support for signature patterns as well
and make both work nicely together.

Also, unlike in the old implementation (before signatures were
introduced), ensure that filename matches take precedence over header
matches, i.e. if there is at least one filename match found, all header
matches are ignored. This makes the behavior more deterministic and
prevents previously observed issues like zyedidia#2894 and zyedidia#3054: wrongly
detected filetypes caused by some overly general header patterns.

Precisely, the new behavior is:

1. if there is at least one filename match, use filename matches only
2. if there are no filename matches, use header matches
3. in both cases, try to use signatures to find the best match among
multiple filename or header matches
Turning `header` patterns into `signature` patterns in all syntax files
was a mistake. The two are different things. In almost all syntax files
those patterns are things like shebangs or <?xml ... ?> or
<!DOCTYPE html5> i.e. things that:

1. can be (and should be) used for detecting the filetype when there is
   no `filename` match (and that is actually the purpose of those
   patterns, so it's a regression that it doesn't work anymore).

2. should only occur in the first line of the file, not in the first
   100 lines or so.

In other words, the old `header` semantics was exactly what was needed
for those filetypes, while the new `signature` semantics makes little
sense for them.

So replace `signature` back with `header` in most syntax files. Keep
`signature` only in C++ and Objective-C syntax files, for which it was
actually introduced.
Purely cosmetic change: make the code a bit more readable by reducing
its visual "density".
runtime/syntax/PowerShell.yaml Outdated Show resolved Hide resolved
runtime/syntax/bat.yaml Outdated Show resolved Hide resolved
@JoeKar
Copy link
Collaborator

JoeKar commented Mar 24, 2024

Thank you for taking care and cleaning it up that far!

Copy link
Collaborator

@JoeKar JoeKar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't find a problem so far and my test files are supported in the expected way.

@dmaluka dmaluka merged commit 2ab1b31 into zyedidia:master Mar 25, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error detecting file type by signature
2 participants