feat: Add license enrichment from pypi to python packages #4295
Merged: spiffcs merged 4 commits into anchore:main from timols:feat/add-python-license-enrichment on Nov 6, 2025
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| package python | ||
|
|
||
| const pypiBaseURL = "https://pypi.org/pypi" | ||
|
|
||
| type CatalogerConfig struct { | ||
| // GuessUnpinnedRequirements attempts to infer package versions from version constraints when no explicit version is specified in requirements files. | ||
| // app-config: python.guess-unpinned-requirements | ||
| GuessUnpinnedRequirements bool `yaml:"guess-unpinned-requirements" json:"guess-unpinned-requirements" mapstructure:"guess-unpinned-requirements"` | ||
| // SearchRemoteLicenses enables querying the PyPI registry API to retrieve license information for packages that are missing license data in their local metadata. | ||
| // app-config: python.search-remote-licenses | ||
| SearchRemoteLicenses bool `json:"search-remote-licenses" yaml:"search-remote-licenses" mapstructure:"search-remote-licenses"` | ||
| // PypiBaseURL specifies the base URL for the PyPI registry API used when searching for remote license information. | ||
| // app-config: python.pypi-base-url | ||
| PypiBaseURL string `json:"pypi-base-url" yaml:"pypi-base-url" mapstructure:"pypi-base-url"` | ||
| } | ||
|
|
||
| func DefaultCatalogerConfig() CatalogerConfig { | ||
| return CatalogerConfig{ | ||
| GuessUnpinnedRequirements: false, | ||
| SearchRemoteLicenses: false, | ||
| PypiBaseURL: pypiBaseURL, | ||
| } | ||
| } | ||
|
|
||
| func (c CatalogerConfig) WithSearchRemoteLicenses(input bool) CatalogerConfig { | ||
| c.SearchRemoteLicenses = input | ||
| return c | ||
| } | ||
|
|
||
| func (c CatalogerConfig) WithGuessUnpinnedRequirements(input bool) CatalogerConfig { | ||
| c.GuessUnpinnedRequirements = input | ||
| return c | ||
| } | ||
|
|
||
| func (c CatalogerConfig) WithPypiBaseURL(input string) CatalogerConfig { | ||
| if input != "" { | ||
| c.PypiBaseURL = input | ||
| } | ||
| return c | ||
| } |
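The builder methods above use value receivers, so each call returns a modified copy and leaves the original config untouched. A minimal sketch of that behavior follows, assuming a trimmed local copy of `CatalogerConfig` (struct tags omitted). The `describe` helper and the mirror URL are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// CatalogerConfig is a trimmed local copy of the struct in the diff,
// reproduced here only so the sketch compiles on its own.
type CatalogerConfig struct {
	GuessUnpinnedRequirements bool
	SearchRemoteLicenses      bool
	PypiBaseURL               string
}

func DefaultCatalogerConfig() CatalogerConfig {
	return CatalogerConfig{PypiBaseURL: "https://pypi.org/pypi"}
}

func (c CatalogerConfig) WithSearchRemoteLicenses(input bool) CatalogerConfig {
	c.SearchRemoteLicenses = input
	return c
}

// WithPypiBaseURL ignores an empty input, so the default survives.
func (c CatalogerConfig) WithPypiBaseURL(input string) CatalogerConfig {
	if input != "" {
		c.PypiBaseURL = input
	}
	return c
}

// describe is a hypothetical helper used only to print the config.
func describe(c CatalogerConfig) string {
	return fmt.Sprintf("remote=%t base=%s", c.SearchRemoteLicenses, c.PypiBaseURL)
}

func main() {
	base := DefaultCatalogerConfig()
	// The mirror URL below is a made-up example value.
	mirror := base.WithSearchRemoteLicenses(true).WithPypiBaseURL("https://mirror.example.internal/pypi")
	unchanged := base.WithPypiBaseURL("")

	fmt.Println(describe(base))
	fmt.Println(describe(mirror))
	fmt.Println(describe(unchanged))
}
```

Because the receiver is a copy, `base` still has remote lookups disabled after `mirror` is built, and passing an empty string to `WithPypiBaseURL` keeps the default URL.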
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,131 @@ | ||
| package python | ||
|
|
||
| import ( | ||
| "context" | ||
| "encoding/json" | ||
| "fmt" | ||
| "io" | ||
| "net/http" | ||
| "net/url" | ||
| "strings" | ||
| "time" | ||
|
|
||
| "github.com/anchore/syft/internal/cache" | ||
| "github.com/anchore/syft/internal/log" | ||
| "github.com/anchore/syft/syft/pkg" | ||
| ) | ||
|
|
||
| type pythonLicenseResolver struct { | ||
| catalogerConfig CatalogerConfig | ||
| licenseCache cache.Resolver[[]pkg.License] | ||
| } | ||
|
|
||
| func newPythonLicenseResolver(config CatalogerConfig) pythonLicenseResolver { | ||
| return pythonLicenseResolver{ | ||
| licenseCache: cache.GetResolverCachingErrors[[]pkg.License]("python", "v1"), | ||
| catalogerConfig: config, | ||
| } | ||
| } | ||
|
|
||
| func (lr *pythonLicenseResolver) getLicenses(ctx context.Context, packageName string, packageVersion string) pkg.LicenseSet { | ||
| var licenseSet pkg.LicenseSet | ||
|
|
||
| if lr.catalogerConfig.SearchRemoteLicenses { | ||
| licenses, err := lr.getLicensesFromRemote(ctx, packageName, packageVersion) | ||
| if err == nil && licenses != nil { | ||
| licenseSet = pkg.NewLicenseSet(licenses...) | ||
| } | ||
| if err != nil { | ||
| log.Debugf("unable to extract licenses from pypi registry for package %s:%s: %+v", packageName, packageVersion, err) | ||
| } | ||
| } | ||
| return licenseSet | ||
| } | ||
|
|
||
| func (lr *pythonLicenseResolver) getLicensesFromRemote(ctx context.Context, packageName string, packageVersion string) ([]pkg.License, error) { | ||
| return lr.licenseCache.Resolve(fmt.Sprintf("%s/%s", packageName, packageVersion), func() ([]pkg.License, error) { | ||
| license, err := getLicenseFromPypiRegistry(lr.catalogerConfig.PypiBaseURL, packageName, packageVersion) | ||
| if err == nil && license != "" { | ||
| licenses := pkg.NewLicensesFromValuesWithContext(ctx, license) | ||
| return licenses, nil | ||
| } | ||
| if err != nil { | ||
| log.Debugf("unable to extract licenses from pypi registry for package %s:%s: %+v", packageName, packageVersion, err) | ||
| } | ||
| return nil, err | ||
| }) | ||
| } | ||
|
|
||
| func formatPypiRegistryURL(baseURL, packageName, version string) (requestURL string, err error) { | ||
| if packageName == "" { | ||
| return "", fmt.Errorf("unable to format pypi request for a blank package name") | ||
| } | ||
|
|
||
| urlPath := []string{packageName, version, "json"} | ||
| requestURL, err = url.JoinPath(baseURL, urlPath...) | ||
| if err != nil { | ||
| return requestURL, fmt.Errorf("unable to format pypi request for pkg:version %s:%s; %w", packageName, version, err) | ||
| } | ||
| return requestURL, nil | ||
| } | ||
|
|
||
| func getLicenseFromPypiRegistry(baseURL, packageName, version string) (string, error) { | ||
| // "https://pypi.org/pypi/%s/%s/json", packageName, version | ||
| requestURL, err := formatPypiRegistryURL(baseURL, packageName, version) | ||
| if err != nil { | ||
| return "", fmt.Errorf("unable to format pypi request for pkg:version %s:%s; %w", packageName, version, err) | ||
| } | ||
| log.WithFields("url", requestURL).Info("downloading python package from pypi") | ||
|
|
||
| pypiRequest, err := http.NewRequest(http.MethodGet, requestURL, nil) | ||
| if err != nil { | ||
| return "", fmt.Errorf("unable to format remote request: %w", err) | ||
| } | ||
|
|
||
| httpClient := &http.Client{ | ||
| Timeout: time.Second * 10, | ||
| } | ||
|
|
||
| resp, err := httpClient.Do(pypiRequest) | ||
| if err != nil { | ||
| return "", fmt.Errorf("unable to get package from pypi registry: %w", err) | ||
| } | ||
| defer func() { | ||
|
Review comment (Contributor): Should this precede the status code check? There can be response bodies that need to be closed with other statuses, I think.
Reply (Contributor, Author): You're right. |
||
| if err := resp.Body.Close(); err != nil { | ||
| log.Errorf("unable to close body: %+v", err) | ||
| } | ||
| }() | ||
|
|
||
| if resp.StatusCode != http.StatusOK { | ||
| return "", fmt.Errorf("unable to get package from pypi registry: unexpected status %d", resp.StatusCode) | ||
| } | ||
|
|
||
| bytes, err := io.ReadAll(resp.Body) | ||
| if err != nil { | ||
| return "", fmt.Errorf("unable to parse package from pypi registry: %w", err) | ||
| } | ||
|
|
||
| dec := json.NewDecoder(strings.NewReader(string(bytes))) | ||
|
|
||
| // Read "license" from the response | ||
| var pypiResponse struct { | ||
| Info struct { | ||
| License string `json:"license"` | ||
| LicenseExpression string `json:"license_expression"` | ||
| } `json:"info"` | ||
| } | ||
|
|
||
| if err := dec.Decode(&pypiResponse); err != nil { | ||
| return "", fmt.Errorf("unable to parse license from pypi registry: %w", err) | ||
| } | ||
|
|
||
| var license string | ||
| if pypiResponse.Info.LicenseExpression != "" { | ||
| license = pypiResponse.Info.LicenseExpression | ||
| } else { | ||
| license = pypiResponse.Info.License | ||
| } | ||
| log.Tracef("Retrieved License: %s", license) | ||
|
|
||
| return license, nil | ||
| } | ||
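The request flow in this file can be exercised without touching the real registry. The sketch below is an approximation under stated assumptions, not the code from the PR: `fetchLicense`, `newClient`, and `newFakeRegistry` are hypothetical names, and the fake registry is a local `httptest` server that serves only the two `info` fields the diff reads. It shows the body being closed before the status check, as discussed in the review thread, and `license_expression` taking precedence over `license`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// pypiInfo mirrors only the two fields read from the registry response.
type pypiInfo struct {
	Info struct {
		License           string `json:"license"`
		LicenseExpression string `json:"license_expression"`
	} `json:"info"`
}

// newClient returns a client with the same 10 second timeout used in the diff.
func newClient() *http.Client {
	return &http.Client{Timeout: 10 * time.Second}
}

// fetchLicense is a hypothetical stand-in for the registry lookup.
func fetchLicense(client *http.Client, baseURL, name, version string) (string, error) {
	requestURL, err := url.JoinPath(baseURL, name, version, "json")
	if err != nil {
		return "", fmt.Errorf("unable to format pypi request for %s:%s: %w", name, version, err)
	}
	resp, err := client.Get(requestURL)
	if err != nil {
		return "", fmt.Errorf("unable to get package from pypi registry: %w", err)
	}
	// Deferred before the status check so non-200 response bodies are closed too.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d from %s", resp.StatusCode, requestURL)
	}
	var parsed pypiInfo
	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
		return "", fmt.Errorf("unable to parse license from pypi registry: %w", err)
	}
	// Prefer the SPDX-style expression when the registry provides one.
	if parsed.Info.LicenseExpression != "" {
		return parsed.Info.LicenseExpression, nil
	}
	return parsed.Info.License, nil
}

// newFakeRegistry serves canned responses for two made-up packages.
func newFakeRegistry() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/pypi/demo/1.0.0/json", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"info":{"license":"MIT License","license_expression":"MIT"}}`)
	})
	mux.HandleFunc("/pypi/legacy/0.1/json", func(w http.ResponseWriter, _ *http.Request) {
		// A JSON null leaves the Go string field empty, so "license" is used.
		fmt.Fprint(w, `{"info":{"license":"BSD","license_expression":null}}`)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeRegistry()
	defer srv.Close()

	client := newClient()
	for _, p := range [][2]string{{"demo", "1.0.0"}, {"legacy", "0.1"}, {"missing", "9.9"}} {
		license, err := fetchLicense(client, srv.URL+"/pypi", p[0], p[1])
		fmt.Printf("%s:%s license=%q err=%v\n", p[0], p[1], license, err)
	}
}
```

Decoding straight from `resp.Body` avoids the intermediate `io.ReadAll` and string conversion; whether that trade-off suits the cataloger is a design choice left to the maintainers.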
We should have some common configuration for timeout options, possibly retry behavior, rate limiting, etc. This is becoming more important as we add more online resolution. We also might think about adding these features in a way that could be used in parallel and to enhance existing SBOMs rather than only at creation time. I don't think that has to happen as part of this PR, but I think we're getting to the point where we need to start being conscious of problems people will run into more frequently, like the Maven rate limiting.
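One possible shape for such shared settings is sketched below. This is an illustration of the idea only, not an existing syft API: `remoteOptions`, `defaultRemoteOptions`, `getWithRetry`, and `newFlakyServer` are hypothetical names, the default values are arbitrary, and the retry policy (retry on transport errors, 429, and 5xx with linear backoff) is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// remoteOptions is a hypothetical bundle of settings that several
// online resolvers could share.
type remoteOptions struct {
	Timeout    time.Duration // per-request timeout
	MaxRetries int           // retries after the first attempt
	Backoff    time.Duration // base delay, multiplied by the attempt number
}

// defaultRemoteOptions returns arbitrary example defaults.
func defaultRemoteOptions() remoteOptions {
	return remoteOptions{Timeout: 10 * time.Second, MaxRetries: 3, Backoff: 100 * time.Millisecond}
}

// getWithRetry issues a GET and retries on transport errors, 429, and 5xx
// responses. It returns the final status code of a non-retryable response.
func getWithRetry(o remoteOptions, target string) (int, error) {
	client := &http.Client{Timeout: o.Timeout}
	var lastErr error
	for attempt := 0; attempt <= o.MaxRetries; attempt++ {
		if attempt > 0 {
			time.Sleep(o.Backoff * time.Duration(attempt))
		}
		resp, err := client.Get(target)
		if err != nil {
			lastErr = err
			continue
		}
		status := resp.StatusCode
		resp.Body.Close()
		if status == http.StatusTooManyRequests || status >= 500 {
			lastErr = fmt.Errorf("retryable status %d", status)
			continue
		}
		return status, nil
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", o.MaxRetries+1, lastErr)
}

// newFlakyServer answers 429 for the first `failures` requests, then 200.
func newFlakyServer(failures int32) *httptest.Server {
	var seen atomic.Int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if seen.Add(1) <= failures {
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := newFlakyServer(2)
	defer srv.Close()

	status, err := getWithRetry(defaultRemoteOptions(), srv.URL)
	fmt.Println(status, err)
}
```

A production version would likely also honor `Retry-After` headers and accept a `context.Context` for cancellation; both are omitted here to keep the sketch short.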
Yep, that makes sense.