Improve license detection for wrong SPDX license identifiers #3912

Open
AyanSinhaMahapatra opened this issue Sep 9, 2024 · 3 comments

@AyanSinhaMahapatra
Member

Consider the following text:

SPDX-License-Identifier: (GPL-2.0+ OR BSD)

Here BSD is not a valid SPDX license identifier, so the expression as a whole is invalid. Even adding a rule is insufficient, because the SPDX-License-Identifier based detection was moved before the hash license detection.

We should either:

  1. Do the hash license detection first so we can catch these with rules, and then do the SPDX identifier based detection.
  2. If we get unknown-spdx, fall back to license detection with rules.
  3. Optionally, if nothing else works, also consider license detection with required phrase rules (though this would potentially lose the license expression info)?
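
As a rough illustration of option 2, here is a minimal sketch of a fallback from SPDX identifier parsing to rule-based detection. Everything in it is an assumption made for illustration: `KNOWN_SPDX_KEYS` is a tiny made-up subset of the SPDX list, and `unknown_symbols`, `detect`, and the `rule_lookup` callable are hypothetical names, not ScanCode APIs.

```python
import re

# Illustrative subset only (assumption); the real SPDX license list is much larger.
KNOWN_SPDX_KEYS = {"GPL-2.0+", "GPL-2.0-only", "BSD-3-Clause", "MIT"}
OPERATORS = {"AND", "OR", "WITH"}


def unknown_symbols(expression):
    """Return the license symbols in expression that are not known SPDX keys."""
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    return [
        token for token in tokens
        if token.upper() not in OPERATORS and token not in KNOWN_SPDX_KEYS
    ]


def detect(line, rule_lookup):
    """
    Hypothetical option 2: trust the SPDX expression when every symbol is
    known, otherwise ask rule_lookup (a stand-in for rule-based detection)
    for a curated expression, and only then report unknown-spdx.
    """
    expression = line.split("SPDX-License-Identifier:", 1)[1].strip()
    if not unknown_symbols(expression):
        return ("spdx-id", expression)
    curated = rule_lookup(line)
    if curated is not None:
        return ("rule", curated)
    return ("unknown-spdx", expression)


line = "SPDX-License-Identifier: (GPL-2.0+ OR BSD)"
print(unknown_symbols("(GPL-2.0+ OR BSD)"))
print(detect(line, lambda text: None))
```

With a `rule_lookup` that knows this exact line, `detect` would return the curated expression instead of `unknown-spdx`.
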
@pombredanne
Member

Create a rule for gpl-2.0-plus AND bsd-new with this text:

SPDX-License-Identifier: (GPL-2.0+ OR BSD)

and give it a relevance of 99.

That's the approach for BSDs: the rule will be picked over the SPDX detection, or at least it should be.

pombredanne added a commit that referenced this issue Sep 12, 2024
Add a new matcher_order attribute to LicenseMatch and use it for sorting
matches rather than the matcher string.
This way we can ensure that there is a proper precedence between
matchers when two matches are matching exactly the same text.

The new sort order for matchers is:
- 0: 1-hash
- 1: 2-aho
- 2: 1-spdx-id
- 3: 3-seq
- 4: 5-undetected
- 5: 5-aho-frag
- 6: 6-unknown

The outcome is that a hash or aho match for the same text at the same
position will take precedence over the SPDX id match, making it possible
to curate and correct some incorrect license expressions if needed.

Reference: #3912
Reported-by: Ayan Sinha Mahapatra <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member

I pushed a fix in c581828

The default sort order of LicenseMatch was based on the "matcher" string, hence "1-spdx-id" would always beat a "2-aho" match. Now we have a new "matcher_order" integer attribute that is used to sort instead, so hash and aho matches always take precedence over SPDX.
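
To make the difference concrete, here is a small sketch comparing the two sort keys. The precedence table is copied from the commit message above; the `Match` class and the sample expressions are hypothetical stand-ins, not ScanCode's actual `LicenseMatch` implementation.

```python
from dataclasses import dataclass

# Precedence table as listed in the commit message.
MATCHER_ORDER = {
    "1-hash": 0,
    "2-aho": 1,
    "1-spdx-id": 2,
    "3-seq": 3,
    "5-undetected": 4,
    "5-aho-frag": 5,
    "6-unknown": 6,
}


@dataclass
class Match:
    """Hypothetical stand-in for a license match; not ScanCode's LicenseMatch."""
    matcher: str
    expression: str

    @property
    def matcher_order(self):
        return MATCHER_ORDER[self.matcher]


# Two matches covering exactly the same text at the same position.
matches = [
    Match("2-aho", "expression from a curated rule"),
    Match("1-spdx-id", "unknown-spdx"),
]

# Old behavior: sorting on the matcher string puts "1-spdx-id" before "2-aho".
by_string = sorted(matches, key=lambda m: m.matcher)

# New behavior: sorting on the integer order puts the aho match first.
by_order = sorted(matches, key=lambda m: m.matcher_order)

print(by_string[0].matcher)  # 1-spdx-id
print(by_order[0].matcher)   # 2-aho
```

The string sort only worked while the numeric prefix in each matcher name happened to reflect the desired precedence; a separate integer decouples the display name from the ordering.
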

pombredanne added a commit that referenced this issue Sep 24, 2024