Skip to content

Add filters to package cataloger#1021

Merged
luhring merged 18 commits intoanchore:mainfrom
jonasagx:add-catalog-package-filters
Jun 3, 2022
Merged

Add filters to package cataloger#1021
luhring merged 18 commits intoanchore:mainfrom
jonasagx:add-catalog-package-filters

Conversation

@jonasagx
Copy link
Copy Markdown
Contributor

@jonasagx jonasagx commented May 31, 2022

This PR adds filters so a package without name or version doesn't go in the list of all discovered packages. Two catalogers were modified:

  • generic cataloger: used by most specific catalogers, acts as a central-ish way of guaranteeing packages have a name.
  • python package cataloger: doesn't use the generic cataloger, therefore needed its own package filter.

This PR addresses items 1, 2 and 3 from #780 (comment).

Integration and units tests were added to validate behavior.

Fix: #780

Signed-off-by: Jonas Xavier jonasx@anchore.com

@jonasagx jonasagx marked this pull request as draft May 31, 2022 21:16
@jonasagx jonasagx marked this pull request as ready for review May 31, 2022 22:05
@jonasagx
Copy link
Copy Markdown
Contributor Author

jonasagx commented May 31, 2022

SPDX doesn't require package version, while CycloneDX does.
We should be compatible with both formats and only require package name.

@stevespringett
Copy link
Copy Markdown

As of CDX v1.4, component version is optional, not required.

Copy link
Copy Markdown
Contributor

@luhring luhring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this started! This isn't ready for merging just yet.

At the highest level, we need to figure out the best place to filter out unacceptable packages. As implemented currently, this filtering happens fairly downstream in the process. This is good in that there's a central place where filtering happens, but it's not good in that it means unusable packages are flowing through more parts of the logic. This makes the code a bit trickier to debug and makes it easier to miss problems — such as that we haven't yet accounted for package relationships properly (I commented on this below). So I think we need to figure out a better place to do this filtering.

I see that you've added CLI tests, but it's not yet clear to me why these tests needed to be created at the "CLI" level. I've asked about this in a code comment below.

The PR description says that you've added integration tests, but I don't see new integration tests. Is there more code to check into this branch? (I do see new test fixture files added.)

See the other code comments for the rest of my thoughts — although some may become moot depending on how the implementation ends up pivoting.

Let me know if you have any questions!

Comment thread syft/pkg/cataloger/catalog.go Outdated
Comment on lines +112 to +125
type filterPkgFn func(pkg.Package) bool

func filterOutPkg(p pkg.Package, filters ...filterPkgFn) bool {
for _, f := range filters {
if f(p) {
return true
}
}
return false
}

func hasEmptyName(p pkg.Package) bool {
return p.Name == ""
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly, we just want to test to see if a package has a name, is that right? If so, it seems like we don't need all these pieces, and we could just use the p.Name == "" directly above.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this code no longer exists. Generic and Python Package catalogers do the equivalent filtering.

Comment thread syft/pkg/cataloger/catalog.go Outdated
Comment on lines +89 to +91
if filterOutPkg(p, hasEmptyName) {
continue
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to discard this package from processing, we should do this check as early as possible. Putting the check here is problematic for a couple of reasons:

  1. We're doing work above this point (e.g. creating CPEs) that might ultimately be wasted. So this would point to at least moving this check to the top of the for ... packages loop. But more importantly,
  2. This implementation doesn't deal with relationships that may have been created for packages that we'll end up excluding from the catalog, so we may end up with relationships that reference packages that don't exist, which is a big problem. We generate relationships here in this loop, but catalogers also can return relationships. So this filtering probably needs to be done at least as upstream as in the catalogers themselves, so that we never generate relationships for packages we're going to discard.

Copy link
Copy Markdown
Contributor Author

@jonasagx jonasagx Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed items 1 and 2 with the generic cataloger and the python package cataloger. In the generic cataloger I remove all relationships From and To a dropped package.

So this filtering probably needs to be done at least as upstream as in the catalogers themselves, so that we never generate relationships for packages we're going to discard.

Having package and relationships dropped by the generic cataloger still a (centralized) last barrier for invalid packages.

@@ -0,0 +1 @@
Name: broken-2.0.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test fixture called "broken..."? It looks like this fixture is intended to lead to a correct cataloging of a package, is that right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it's a bit confusing to reason about this entry in the Syft output, since this name actually includes a version-looking piece of data. I'd suggest dropping the -2.0.0, so it's really clear that this is a case where we did not find any version information.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I moved it to the integration tests folder of fixtures, under a new file: no-version.

Comment thread test/cli/package_missing_data_test.go Outdated
@jonasagx
Copy link
Copy Markdown
Contributor Author

jonasagx commented Jun 2, 2022

Thank you for the review @luhring!

At the highest level, we need to figure out the best place to filter out unacceptable packages. As implemented currently, this filtering happens fairly downstream in the process. This is good in that there's a central place where filtering happens, but it's not good in that it means unusable packages are flowing through more parts of the logic. This makes the code a bit trickier to debug and makes it easier to miss problems — such as that we haven't yet accounted for package relationships properly (I commented on this below). So I think we need to figure out a better place to do this filtering.

I added package filtering validation to the generic cataloger, used by most specific catalogers. I added code removing impacted relationships.

The Python Package Cataloger doesn't use the generic cataloger, I had to change it to filter out packages without name. This cataloger doesn't gather relationships so the change is pretty simple.

The PR description says that you've added integration tests, but I don't see new integration tests. Is there more code to check into this branch? (I do see new test fixture files added.)

I removed the cli tests in favor of unit and integration tests. Before I had added only integration test fixtures, sorry for the confusion.

pkgs = append(pkgs, *p)
}
}
return pkgs, nil, nil
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice this cataloger doesn't return relationships

@jonasagx
Copy link
Copy Markdown
Contributor Author

jonasagx commented Jun 2, 2022

Other than the Python Package Cataloger I've found a couple others that don't depend on generic cataloger:

  • deb.NewDpkgdbCataloger() requires version and name to make a package 👍🏼
  • rpmdb.NewRpmdbCataloger() doesn't enforce non-empty name, but it depends on data from rpmDBs, I don't know how reliable they are 🤞🏼
  • golang.NewGoModuleBinaryCataloger() relies on go mod path, which is always present 👍🏼

@jonasagx jonasagx requested a review from luhring June 2, 2022 02:08
Copy link
Copy Markdown
Contributor

@luhring luhring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer... 👍

I left some specific comments, but I'll also provide some high-level feedback here:

  1. There's still an integration test being leveraged here, but again, we don't need an integration test here — we can just use a unit test. Using a unit test is preferable wherever possible.

  2. We should have a central function that checks is a package is viable (i.e. has a name). A function like this is defined in the python package, but it should be defined instead in a central place, and then used by all the catalogers, so that we're applying the same criteria to packages from all catalogers. I saw your comment about some of the catalogers maybe being okay without this check, but I'd strongly encourage the use of this check for all code paths, so that we can centrally control what the requirements are for a viable package. It's not clear to me that the assumptions about other catalogers like RPM and Go are sufficient here.

continue
}

pkgIDsForRemoval := make([]artifact.Identifiable, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Using the var pkgIDsForRemoval []artifact.Identifiable syntax for statements like this is a bit cleaner, and also has the advantage of avoiding a wasted array allocation in the event that no items are ever appended to this slice.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this comment can apply to other similar lines like this added in this PR — but again, this is just a "nit" comment)

p.SetID()
// doing it here so all packages have an ID,
// IDs are used to remove relationships
if p.Name == "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a need for this check across multiple catalogers — both the generic cataloger and others — we should extract this check out as a centralized function that can be used by any catalogers that need it

Copy link
Copy Markdown
Contributor Author

@jonasagx jonasagx Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the validation to a func in the pkg package (:P words...), which is then reused by other catalogers.

return packages, relationships, nil
}

func removePkgsFromRelationships(pkgs []artifact.Identifiable, relationships []artifact.Relationship) []artifact.Relationship {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make the name removePkgsFromRelationships more clear. This function isn't removing packages, it's removing relationships.

Additionally, the logic here isn't specific to packages, since you've used the type artifact.Identifiable. I think it's a good choice to use this type, since all you need are IDs. 👍

I'd suggest a function name like removeRelationshipsWithArtifactIDs, and I'd suggest renaming the pkgs parameter to something like artifactsToExclude.

Copy link
Copy Markdown
Contributor Author

@jonasagx jonasagx Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I renamed the function and param with your suggestions.

Comment on lines +87 to +92
for _, r := range relationships {
for _, p := range pkgs {
if r.To.ID() == p.ID() || r.From.ID() == p.ID() {
goto skip
}
}
cleanedRelationships = append(cleanedRelationships, r)
skip:
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify and optimize this implementation. Specifically, we can avoid the nested for loops and the goto. And we can also avoid looping through all packages for every single relationship.

My suggestion here is to use a map to store the artifact IDs to exclude, so that we can determine whether a given ID is in the set with an O(1) lookup. You have two options for doing this: 1) you could build the map inside this function, using the given []artifact.Identifiable` value, or 2) you could just ask the function caller for this map in the first place (instead of a slice). Given how this function is called above, I'd lean toward option 2, since it looks like it'd be just as easy to construct this set as a map as it is now to construct it as a slice.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I implemented option 2.

}
}

func Test_removePkgsFromRelationships(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice testing here! 👍

return pkgs, nil, nil
}

func pkgHasName(p *pkg.Package) bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the kind of function that should exist centrally for all catalogers to leverage, as opposed to defined just in the python package.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, this function takes a pointer (*pkg.Package) but then doesn't check for a nil pointer condition, which could lay a trap for future developers to enable a nil pointer dereference panic. I'd suggest adjusting this function signature so that it doesn't take a pointer (it would just take a pkg.Package), since the caller is already checking for p != nil and could just as easily pass a dereferenced pkg.Package to this function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new pkg validation check for nil values as well.

jonasagx added 12 commits June 2, 2022 15:01
This PR adds filters so a package without name or version doesn't go in
the list of all discovered packages.

Integration and cli tests were added to validate the feature.

Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
also removes cli tests in favor of integration and unit tests

Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
@jonasagx jonasagx force-pushed the add-catalog-package-filters branch from e50c421 to 838c328 Compare June 2, 2022 22:07
jonasagx added 3 commits June 2, 2022 15:48
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
@jonasagx
Copy link
Copy Markdown
Contributor Author

jonasagx commented Jun 2, 2022

I saw your comment about some of the catalogers maybe being okay without this check, but I'd strongly encourage the use of this check for all code paths, so that we can centrally control what the requirements are for a viable package. It's not clear to me that the assumptions about other catalogers like RPM and Go are sufficient here.

I've added pkg validation to deb/rpm/golang catalogers, I expanded the unit tests to ensure correct dropping of invalid packages.

Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Copy link
Copy Markdown
Contributor

@luhring luhring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're close... a couple of things to clear up

Comment thread syft/pkg/cataloger/common/generic_cataloger.go Outdated
Comment thread syft/pkg/cataloger/common/generic_cataloger.go Outdated

globParsers := map[string]ParserFn{
"**/a-path.txt": parser,
"**/*.txt": parser, // this glob should capture all files, including the empty one
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? I want to understand your thinking here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this glob (**/*.txt) will find a file named empty.txt, that will turn into an invalid package, and relationship, later on dropped from the final list. The previous glob (**/a-path.txt) wouldn't find empty.txt.

Signed-off-by: Jonas Xavier <jonasx@anchore.com>
@jonasagx jonasagx requested a review from luhring June 3, 2022 04:32
Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Copy link
Copy Markdown
Contributor

@luhring luhring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@luhring luhring merged commit caff672 into anchore:main Jun 3, 2022
spiffcs added a commit to jonasagx/syft that referenced this pull request Jun 27, 2022
* main: (70 commits)
  fix: add php catalogers to all catalogers (anchore#1065)
  feat: add use-all-catalogers flag (anchore#1050)
  Updates parsing of `yarn.lock` to use `resolved` URLs that are pulled from yarn and npm registries (anchore#926)
  remove OSS Meetup message (anchore#1057)
  add pom.xml cataloger (anchore#1055)
  Add support for CBL-Mariner distroless images (anchore#1045)
  Add catalogers configuration (anchore#1038)
  add template output (anchore#1051)
  update stereoscope to latest version (anchore#1052)
  update zip_read_closer to incorporate zip64 support (anchore#1041)
  Add pacman (alpm) parser support (anchore#943)
  Update of README.md (anchore#1027)
  bump cosign to v1.9.0 to resolve reporting of GHSA-66x3-6cw3-v5gj (anchore#1025)
  add workflows to test new project automation (anchore#1023)
  improve LanguageByName and add unit tests (anchore#1034)
  Read Description from dpkg status files (anchore#996)
  Add announcement for Anchore OSS Virtual Meetup (anchore#1033)
  add main module field to go bin metadata (anchore#1026)
  Add filters to package cataloger (anchore#1021)
  change draft to false for release process (anchore#1016)
  ...

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
aiwantaozi pushed a commit to aiwantaozi/syft that referenced this pull request Oct 20, 2022
* Add filters to package cataloger

This PR adds filters so a package without name or version doesn't go in
the list of all discovered packages.

Integration and cli tests were added to validate the feature.

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* add nolint:funlen to cataloger/catalog.go

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* don't require package version

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* add package filtering to generic and python cataloger

also removes cli tests in favor of integration and unit tests

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* drop nolint:funlen

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* check for no-removal operation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* remove unused fixtures

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* rename no-version file to hide semantic version

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* drop integration tests and add pkg func for validation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* python cataloger use global pkg validation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* check for valid packages on deb/go/rpm catalogers

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* update rpm cataloger after rebase

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* nit with pointers

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* simpler use of package validation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* remmove double pkg validations

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* rename func param to artifactsToExclude

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* add test for relationships and bug fix

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* feedback changes

Signed-off-by: Jonas Xavier <jonasx@anchore.com>
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* Add filters to package cataloger

This PR adds filters so a package without name or version doesn't go in
the list of all discovered packages.

Integration and cli tests were added to validate the feature.

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* add nolint:funlen to cataloger/catalog.go

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* don't require package version

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* add package filtering to generic and python cataloger

also removes cli tests in favor of integration and unit tests

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* drop nolint:funlen

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* check for no-removal operation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* remove unused fixtures

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* rename no-version file to hide semantic version

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* drop integration tests and add pkg func for validation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* python cataloger use global pkg validation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* check for valid packages on deb/go/rpm catalogers

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* update rpm cataloger after rebase

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* nit with pointers

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* simpler use of package validation

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* remmove double pkg validations

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* rename func param to artifactsToExclude

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* add test for relationships and bug fix

Signed-off-by: Jonas Xavier <jonasx@anchore.com>

* feedback changes

Signed-off-by: Jonas Xavier <jonasx@anchore.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python package Name and pURL have errors in CycloneDX output

3 participants