Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error detecting file type by signature #3201

Closed
taconi opened this issue Mar 20, 2024 · 16 comments · Fixed by #3208
Closed

Error detecting file type by signature #3201

taconi opened this issue Mar 20, 2024 · 16 comments · Fixed by #3208

Comments

@taconi
Copy link
Contributor

taconi commented Mar 20, 2024

Description of the problem or steps to reproduce

When opening the Pipfile.lock file it looks like this:

{
    "_meta": {}
}

The type shown is unknown, but as specified in the signature of the file:

filetype: json
detect:
filename: "\\.json$"
signature: "^\\{$"

I did the same test with the file named index with the following content:

<!DOCTYPE html5>

From what I understand, both should be of type json and html5 given the signature of html5:

filetype: html5
detect:
filename: "\\.htm[l]?5$"
signature: "<!DOCTYPE html5>"

I didn't get to test the other types of files that can be defined with signature but as both files should match with the regex I believe the same thing must be happening.

This does not happen in version 2.0.13.
The file detection was changed with #2819.

Specifications

Commit hash: b518bda
OS: Ubuntu 22.04.3 LTS
Terminal: terminator

@JoeKar
Copy link
Collaborator

JoeKar commented Mar 20, 2024

With #2819 it was changed in that way that it considers the signature-check only in case more than one file type check hit, to concretize the result (C* file types). Previously it has led to to false positive results that the header-check hit (#2894), before the possible later file type check could hit.

It can be discussed to try this guess as well in the moment no file type check hit at all, but this again could lead to false positives in the moment the signature-check identifies some lines by the "wrong" signature of a different syntax definition than expected due to an much to generalized wording within the signature definition. For example it detects a buzz word within a comment of your unspecified/unknown file.

So from that point there is no guarantee to hit exactly, it would be a guess and it could be wrong (with a chance for new issues).
The correct way would be to add an concrete file type in a new or existing syntax definition.

@taconi
Copy link
Contributor Author

taconi commented Mar 20, 2024

In my opinion, the fields within the detect key should be used to detect the file type, as well as the header in 2.0.13.

But in view of the behavior of the signature field, I think it would be worth removing the signatures that don't make sense, for example, in what scenario would the signature of json files be used? I can't think of any filenames for this functionality to be used.
The change was made as if the behavior of the header and signature were the same, but some signatures will never be used and could be removed.

@JoeKar
Copy link
Collaborator

JoeKar commented Mar 20, 2024

...for example, in what scenario would the signature of json files be used?

But wasn't the same question legit for the old json header too?

I can't think of any filenames for this functionality to be used.

Currently me too, but this doesn't mean that they don't exist. I "broke" some expected default handling in the past.
But besides ^{$ is very unspecific so far and doesn't really help the initial problem still persists.

The signature check can help as fallback and heavily relies on sufficient rules.

the fields within the detect key should be used to detect the file type

So you'd recommend to consider to provide this as a fallback method and remove insufficient signatures, right?

@taconi
Copy link
Contributor Author

taconi commented Mar 20, 2024

...for example, in what scenario would the signature of json files be used?

But wasn't the same question legit for the old json header too?

I don't think so, all files that start with { in the first line will be considered json files, but this is very vague and can generate a lot of false positives.

The point is that the header has been changed as if it were for an equivalent, which is not true.
The json signature will only be used if another syntax file matches the name of the file, currently json files are those that end with .json and no other syntax file matches a similar pattern, so the way it is in file types like json, this field is unusable.

I think the right thing to do in this case would be to remove it from the file, but that's not my decision.

the fields within the detect key should be used to detect the file type

So you'd recommend to consider to provide this as a fallback method and remove insufficient signatures, right?

No, actually the signature is used to define the type of the file in cases where more than one detect.filename match (as far as I understand, correct me if I'm wrong) and the header that was used to define the type of the file by the content of the first line of the file has been removed, this was an approved decision and merged.

I was just thinking that the signature was the same as the header, I hadn't correctly understood the function of signature, I'll close the subject.

@taconi taconi closed this as completed Mar 20, 2024
@taconi
Copy link
Contributor Author

taconi commented Mar 21, 2024

But besides ^{$ is very unspecific so far and doesn't really help the initial problem still persists.

But the problem still persists but it will only happen if the file name matches more than one file type, the problem would be the signature patterns, they must be fixed, improved and added so that false positives can be corrected. The signature should be used in case the file type is unknown too and that false positives are reported by whoever finds it and that the signatures that would avoid these errors are implemented.

@taconi taconi reopened this Mar 21, 2024
@dustdfg
Copy link
Contributor

dustdfg commented Mar 21, 2024

Maybe user can be asked about filetype if it was deduced from patterns or can be added one more option to control if patterns are used at all

@taconi
Copy link
Contributor Author

taconi commented Mar 21, 2024

Placing an option that activates or deactivates the use of signature would be trying to sweep the problem of detecting the file type from the file content down the table, in the same way as placing these bugs only in files that match with more than one filename.

Maybe warn that signature may present errors and ask the user to report bugs found with this option activated in the option documentation, something like:

* `detectsignature`:  If it is `true`, the `signature` field of the syntax files
   will be used to try to find the `filetype` when it is `unknown` or when
   more than one `filetype` is found (this function may end up generating false positives, if this happens please report the bug so we can improve its functionality).

	default value: `false`

But I still believe that signature should be used to try to find the type of files of type unknown by default, so signature will have a greater frequency of use than it is today, like #3183 (comment) that shows an example of use of the signature of the json files and in which scenarios this could happen unexpectedly.

@JoeKar
Copy link
Collaborator

JoeKar commented Mar 21, 2024

But I still believe that signature should be used to try to find the type of files of type unknown by default [...]

I'm open for that. We just need to keep in mind, that this will have downsides.

Just one example:
Any (new) language which has types (structs, classes, etc.) and allows to "open" them at the next line with {, but no file extension is given yet or the language is yet unknown, because it's new.
Do we want to have it highlighted as json? 😉

So proper highlighting should be produced with explicit file extensions or whole file names in the syntax definition. Everything else is some (better) guessing, especially in cases with more or less "wildcard" signatures.

@dustdfg
Copy link
Contributor

dustdfg commented Mar 21, 2024

Ok. What vim does? You said in one of the PR's vim uses similar approach so do they have the issue?

@JoeKar
Copy link
Collaborator

JoeKar commented Mar 21, 2024

Vim detects it as json, but it does this by this simple fact:
https://github.com/vim/vim/blob/master/runtime/filetype.vim#L1627

So the file name "Pipfile.lock" is explicit tracked to be of a json file type. 😉

For *.h files for example Vim does that dynamic lookup...
https://github.com/vim/vim/blob/master/runtime/filetype.vim#L364 -> https://github.com/vim/vim/blob/master/runtime/autoload/dist/ft.vim#L183
...from which micros signature check was inspired. It doesn't do this by default.

@taconi
Copy link
Contributor Author

taconi commented Mar 21, 2024

From what I understand, the FTheader function is only used for *.h files, just as the FTbas function is only used for *.sys\c files. Here signature can (?) run for all files definitions.

I believe that the solution for the Pipfile.lock file is to update the filename regex of json to match this file.

The question is, what does the signature field look like?

You may also provide an optional signature regex that will check a certain
amount of lines of a file to find specific marks. For example:

detect:
    filename: "\\.ya?ml$"
    signature: "%YAML"

If I put the signature with "%YAML", any file that has this content of this type?
We found out not with Pipfile.lock and signature from json, but that's not what the documentation says.

What is this field for?
How is it used by the micro to carry out detection?
In which scenarios would I have which behaviors?
In what cases could there be an error?
How to make proper use of it?

How do I find this information without having to review PR's and issues?

@JoeKar
Copy link
Collaborator

JoeKar commented Mar 21, 2024

If I put the signature with "%YAML", any file that has this content of this type?

No, as described in #3183 the signature check is currently used as last resort to sort out multiple detected filenames (more or less as Vim does)

What is this field for?

See above.

How is it used by the micro to carry out detection?

It will iterate over the amount of detectlimit lines to search for e.g. keywords and or patterns with which the file type can be recognized.

In which scenarios would I have which behaviors?

As written, it's currently used to define the chosen one of multiple filename matches, depending on the given/defined patterns.

In what cases could there be an error?

In case the pattern is insufficient or filename will be changed without considering signature once again.
That's definitely something to keep in mind when we decide to add more filename patterns into already existing definitions and don't create new dedicated definitions.

How to make proper use of it?

As of now, I agree that it most probably was a fault to keep them available (with the same pattern as before with header) for everything else other than the C* headers. Because this caused unintended confusion.
But as already written, there is still some room to use it as a absolute last fallback.

How do I find this information without having to review PR's and issues?

Well, what shall I say/write?
We figured out by discussion, since confusion came up due to the lack of proper documentation.

As already often said:
This project is already larger than 2-3 maintainers can handle (which was the reason why I vote for you too). 😃
At least labeling should be possible for volunteers, but unfortunately it isn't.

Help is appreciated with support to prepare the proper doc update. 😉

@dmaluka
Copy link
Collaborator

dmaluka commented Mar 22, 2024

I've read the discussion and I come to a conclusion that we should admit that replacing "headers" with "signatures" was a hasty decision. It is not just a matter of a proper doc update, it is a matter of breakage of useful functionality. header and signature are different things, with quite different purposes, both are useful, we should have added the latter without obstructing the former.

If we have a file named foo with the following content:

#!/bin/sh

echo "hello world"

with v2.0.13 it is recognized and highlighted as a shell script. With the newest master it is not.

To me, this is a regression, and a rather serious one.

The majority of micro's syntax files have no header/signature patterns. In those syntax files that do, those patterns are mainly shebangs (#!/bin/sh, #!/usr/bin/awk etc) and things like <?xml ... ?> or <!DOCTYPE html5>, i.e. things that: 1. can be quite reliably used for detecting the filetype independently of filename, not just in addition to it. 2. should only occur in the first line of the file, not in the first 100 lines or so. In other words, the old header semantics was exactly what was needed for those filetypes, while the new signature semantics makes little sense for them.

So I think we should restore the header functionality, restore its usage instead of signature for most filetypes, and keep using signature probably just for distinguishing C/C++/Objective-C headers.

When it comes to false positives caused by header in the past, were #2894 and #3054 the only known cases of such? If so, in all cases json was guilty, so I think we should fix that simply by removing this "^{$" pattern from json.yaml, since this pattern is indeed too general and probably was causing more harm than good.

@dmaluka
Copy link
Collaborator

dmaluka commented Mar 22, 2024

So I think we should restore the header functionality

Although perhaps we should make header have lower priority than filename, like we now already do with signature (and unlike in the old implementation, where it was non-deterministic whether filename match or header match would take precedence). Perhaps then we won't even need to fix json?

@dustdfg
Copy link
Contributor

dustdfg commented Mar 22, 2024

So proper highlighting should be produced with explicit file extensions or whole file names in the syntax definition. Everything else is some (better) guessing, especially in cases with more or less "wildcard" signatures.

Off topic sorry... It is a good place for a tuned and lightweight AI similar to googles magika which weights 1MB and it would be really perfect for such a problem.

It is even open source

@JoeKar
Copy link
Collaborator

JoeKar commented Mar 22, 2024

To me, this is a regression, and a rather serious one.

Ok, then we keep it short and header will return. signature is and can be used to solve multiple hits with the same filename and header will step up in the moment no filename hit at all.

When it comes to false positives caused by header in the past, were #2894 and #3054 the only known cases of such?

Yep, grep tells that the json definition is right now the only one with a very generic pattern.

Last but not least the doc of signature can then be extended/optimized too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants