No way to obtain LicenseMatch origin #3608

Closed
srehm opened this issue Nov 24, 2023 · 6 comments

srehm commented Nov 24, 2023

Description

I was scanning the subversion package from Debian Bullseye (1.14.1-3+deb11u1) with scancode 32.0.8. After the scan completed, I noticed bogus license detections in at least one file: build/ac-macros/swig.m4
There are 51 matches (if I counted correctly :)), of which only the first one is correct. There are also matches in line ranges that don't even exist in the file (e.g. the file has 360 lines and there are detections on lines 400+).

What makes this even stranger is the fact that in my tests I could not reproduce the error by scanning that file separately. It only happens if I scan the complete source tree. Almost as if the matches are pulled in from other files.

For your convenience I have attached both the source code and the result JSON.

sourcecode.zip
result.zip
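
The out-of-range symptom described above can be checked mechanically. The following is only a rough sketch, not part of scancode: it assumes the result JSON has a top-level "files" list whose entries carry "path" and "license_detections", each detection holding "matches" with "start_line" and "end_line". The helper names `count_lines` and `out_of_bounds_matches` are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count lines in a file; bytes mode avoids decoding errors on odd encodings."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def out_of_bounds_matches(scan, source_root):
    """Yield (path, start_line, end_line, line_count) for every match
    that ends past the last line of the file it is reported on.

    `scan` is the parsed result JSON (assumed layout, see lead-in);
    `source_root` is the directory that was scanned.
    """
    for resource in scan.get("files", []):
        file_path = Path(source_root) / resource["path"]
        # Skip directories and anything not present on disk.
        if not file_path.is_file():
            continue
        line_count = count_lines(file_path)
        for detection in resource.get("license_detections", []):
            for match in detection.get("matches", []):
                if match["end_line"] > line_count:
                    yield (
                        resource["path"],
                        match["start_line"],
                        match["end_line"],
                        line_count,
                    )
```

To use it, load result.json with `json.load` and pass the directory the scan was run on. Since the command below uses --strip-root, the paths in the result should be relative to that directory.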

How To Reproduce

  • Download scancode v32.0.8 package for python 3.9 from the release page
  • Extract and configure
  • Download and extract the attached source code
  • Change directory to the extracted scancode package
  • Run: .\scancode.bat -cli --license-references --license-score 65 --strip-root -n 6 --verbose --json-pp result.json /path/to/extracted/source

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? Windows + Linux
  • What version of scancode-toolkit was used to generate the scan file? v32.0.8
  • What installation method was used to install/run scancode? Downloaded from release page

srehm added the bug label Nov 24, 2023

AyanSinhaMahapatra (Member) commented Nov 24, 2023

@srehm Thanks for reporting! This is indeed a bug.

> Almost as if the matches are pulled in from other files.

Yeah, that's actually the case here. 😅

> I could not reproduce the error by scanning that file separately.

Yes, in that case the referenced NOTICE file was not in the scanned codebase, so we could not find it and could not add licenses from there.

From the file you referenced, see line 2:

> See the NOTICE file distributed with this work for additional information

Since this file contains a reference to the NOTICE file, we took the license detections from the NOTICE file and added them here. But as you've rightly pointed out, this is a bit weird. Originally we only wanted to do this for unknown references, but here a proper license notice is present, so we would be doing things a bit differently.
This is discussed in more detail in #3547 (comment).

srehm (Author) commented Nov 24, 2023

Thanks for the explanation. I think I get the point. In my case swig.m4 references the NOTICE file, which in turn references the LICENSE file. That explains the additional licenses that apply to the file.
However, when checking the matches it is very confusing that the line numbers actually refer to a different file. Either an attribute like 'referenced_file', or alternatively remapping the start/end lines of the match to the reference in the original file (for swig.m4 that would be lines 3-5), would make things much clearer.
For context, we have a tool that displays the files and visualizes the matches according to the scancode result, and currently I don't see how I can filter those referenced matches.

pombredanne (Member) commented

@srehm as @AyanSinhaMahapatra pointed out, the resolution may be described in #3547

You wrote:

> For context, we have a tool that displays the files and visualizes the matches according to the scancode result, and currently I don't see how I can filter those referenced matches.

You piqued my curiosity. Tell me more!

That said, there are really two issues:

AyanSinhaMahapatra (Member) commented

@pombredanne wrt.

> tracking the file where a license match originated when we follow references so other tools can leverage this easily, that we can track here.

Yes! We've discussed this too (to make sure we can distinguish the matches that come from other files in SCIO), and I had started working on this in a branch. To summarise what we discussed (this still needs further discussion/review to make sure the design is correct):

To pinpoint which file a match is coming from, we were going to add an attribute, something like from_file, to the matches, with two possible states:

  • None: the default case, where the match is from the present file and does not originate from some other file. (Adding the actual path of the present file here would add information that is not really required.)
  • path_to_file: the match originated from some other file, and this is the path to that file.

We point to paths and not to a LicenseDetection id because, when following a reference, we carry over all the matches from the referenced file, so the path is enough.

pombredanne added a commit that referenced this issue Nov 26, 2023
These rules improve the accuracy of the license detection in
Subversion

Reference: #3608
Reported-by: Stefan Rehm @srehm
Signed-off-by: Philippe Ombredanne <[email protected]>

AyanSinhaMahapatra changed the title from "Out of bounds license detections" to "No way to obtain LicenseMatch origin" Dec 7, 2023

sschuberth (Collaborator) commented

> To pinpoint which file a match is coming from we were going to add an attribute somewhat like from_file to the matches

As the from_file field is available now with ScanCode 32.1.0, can this issue be closed?
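
For tools that consume the result JSON, separating local matches from referenced ones could look roughly like the sketch below. It is not scancode code and rests on assumptions: the "files" / "license_detections" / "matches" layout of the JSON output, and a from_file value on each match that is either empty, equal to the file's own path, or the path of the referenced file. The function name `split_matches_by_origin` is hypothetical.

```python
def split_matches_by_origin(scan):
    """Split all matches in a parsed scan result into two lists of
    (path, match) pairs: matches found in the file itself, and matches
    carried over from another file via a followed reference.
    """
    local, referenced = [], []
    for resource in scan.get("files", []):
        path = resource.get("path")
        for detection in resource.get("license_detections", []):
            for match in detection.get("matches", []):
                origin = match.get("from_file")
                # Assumption: a missing/empty from_file, or one equal to the
                # file's own path, means the match belongs to this file.
                if origin and origin != path:
                    referenced.append((path, match))
                else:
                    local.append((path, match))
    return local, referenced
```

The comparison only works if from_file and path are written in the same form (for example both with or without the root stripped), which is worth verifying against an actual result file. A viewer could then draw only the local matches on a file and list the referenced ones separately, together with their origin path.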

AyanSinhaMahapatra (Member) commented

Yes, this can be closed, thanks!
