Add data-driven approach to v6 distro search #2265

wagoodman · 2024-11-18T22:24:53Z

Today we have a few hard coded elements that affect which namespace is selected for given distro information:

debian unstable: look for indications of sid in the pretty-name then explicitly set the version to unstable which is acted upon downstream to map to sid vulnerability data
rolling versions: manually list which distros via code that should match against a single vulnerability stream, regardless of the version.
special versions: in alpine processing, we map alpha versions to a specially named edge vulnerability stream manually
major vs major.minor versions: we search multiple combinations of version segment combinations to find the most accurate vulnerability stream

The upside of this logic is that it makes matching more accurate, the downside is that this kind of code requires a grype release to update (thus users need to update grype itself to get these updates). This is not ideal since a distro-related version change or vulnerability data change can happen at any time.

Ideally this should be fundamentally data driven and flow in from the grype database, which is updated nightly -- this PR nudges us heavily in that direction by migrating this kind of logic to a DB lookup table, which is populated with grype-db, and paired with more agnostic processing of distro data (for example, we still have version specificity searching, but it is unaware of specific ).

Open questions:

debian@sid: today we look for "sid" in the PRETTY_NAME for os-release, which is something that is fixed to the artifact, but as I understand it sid is a moving target... so wouldn't it be invalid to match non-trixie sid vuln data (foxy is the next codename) with trixie artifacts? Doesn't that mean we should exclusively look at the codename and ignore sid status altogether? We're going to need to appropriate pkg source lists into syft / grype / grype-db / vunnel. That is outside of the scope of this PR
alpine@edge: it looks like this would be a single version with multiple pre-releases, but the vunnel data is specifically hard coding this to "edge", is this fictitious? (should we just use versions as we find them in the vuln data) this is correct, we should be using the alpha versions to correlate with edge, no matter the major/minor version
should this logic be lifted from the DB into memory for faster continual access of OS specific questions? That is, much of this is querying the OS table for each distro search, but that table is relatively small and could be replaced with a single datastructure that is read into memory upon the first query. deferred; I don't want to optimize too much in advance though, so elected to not do this first since it could always be done at a later date since that could be implemented in a non-breaking way.

kzantow · 2024-11-19T16:58:38Z

To offer opinions:

debian@sid: today we look for "sid" in the PRETTY_NAME for os-release, which is something that is fixed to the artifact, but as I understand it sid is a moving target... so wouldn't it be invalid to match non-trixie sid vuln data (foxy is the next codename) with trixie artifacts? Doesn't that mean we should exclusively look at the codename and ignore sid status altogether?

I was under the impression that there is no vuln data for old unstable versions -- that the unstable feed is only for the current unstable branch, is this correct? Do we know which codename (probably mixing up terms here) is currently in unstable at any given point? If so, I think considering something with an old codename + sid status as a prerelease for that codename would be the right thing to do, if we have enough info to do so.

alpine@edge: it looks like this would be a single version with multiple pre-releases, but the vunnel data is specifically hard coding this to "edge", is this fictitious? (should we just use versions as we find them in the vuln data)

IIRC there was a difficulty mapping the unstable alpine feed without this behavior. Maybe just that the data for the distro isn't published in the normal feed until the release becomes stable? But if the versions align without doing some mapping to "edge", I would agree it sounds better to just use the version name directly.

should this logic be lifted from the DB into memory for faster continual access of OS specific questions? That is, much of this is querying the OS table for each distro search, but that table is relatively small and could be replaced with a single datastructure that is read into memory upon the first query. I don't want to optimize too much in advance though, so elected to not do this first since it could always be done at a later date since that could be implemented in a non-breaking way.

Probably, yes. Agree it doesn't have to be done here -- only optimize things that need it. I worked on a configuration-heavy app that did something like this: always keep configuration loaded in memory and it worked great. If nothing else, it kept a lot of the programming model easier.

wagoodman · 2024-11-19T20:51:30Z

It looks like there are a couple of answers here:

debian sid will be dealt with later as we'll need to add another enhancement: searching by source list. This is out of scope for this PR and will be added at a later time
I've confirmed that alpine edge processing is simply looking for alpha in the version

grype/db/v6/affected_package_store.go

Signed-off-by: Alex Goodman <[email protected]>

westonsteimel · 2024-11-21T15:17:24Z

grype/db/v6/affected_package_store.go

-		return query
+	// search for aliases for the given distro; we intentionally map some OSs to other OSs in terms of
+	// vulnerability (e.g. `centos` is an alias for `rhel`). If an alias is found always use that alias in
+	// searches (there will never be anything in the DB for aliased distros).


So what happens here when we do add affected entries for a distro that was previous an alias (like when we bring in data for almalinux)? I think what would happen is that the alias would no longer exist in the grype db that is published that now contains the almalinux data, but just want to validate that assumption.

And all prior clients would adjust correctly to it becoming a comprehensive vulnerability dataset rather than an alias to rhel data

wagoodman added the changelog-ignore Don't include this issue in the release changelog label Nov 18, 2024

wagoodman added this to the DB v6 milestone Nov 18, 2024

wagoodman requested a review from a team November 18, 2024 22:24

wagoodman self-assigned this Nov 18, 2024

wagoodman marked this pull request as ready for review November 19, 2024 20:52

kzantow approved these changes Nov 19, 2024

View reviewed changes

westonsteimel reviewed Nov 20, 2024

View reviewed changes

grype/db/v6/affected_package_store.go Outdated Show resolved Hide resolved

westonsteimel reviewed Nov 20, 2024

View reviewed changes

grype/db/v6/affected_package_store.go Outdated Show resolved Hide resolved

allow distro search to be entirely data driven

8d38337

Signed-off-by: Alex Goodman <[email protected]>

wagoodman force-pushed the v6-data-driven-distro branch from ed7834a to 8d38337 Compare November 21, 2024 15:05

wagoodman enabled auto-merge (squash) November 21, 2024 15:05

westonsteimel reviewed Nov 21, 2024

View reviewed changes

wagoodman merged commit df54421 into main Nov 21, 2024
10 checks passed

wagoodman deleted the v6-data-driven-distro branch November 21, 2024 15:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add data-driven approach to v6 distro search #2265

Add data-driven approach to v6 distro search #2265

wagoodman commented Nov 18, 2024 •

edited

Loading

kzantow commented Nov 19, 2024 •

edited

Loading

wagoodman commented Nov 19, 2024

westonsteimel Nov 21, 2024

westonsteimel Nov 21, 2024

Add data-driven approach to v6 distro search #2265

Add data-driven approach to v6 distro search #2265

Conversation

wagoodman commented Nov 18, 2024 • edited Loading

kzantow commented Nov 19, 2024 • edited Loading

wagoodman commented Nov 19, 2024

westonsteimel Nov 21, 2024

Choose a reason for hiding this comment

westonsteimel Nov 21, 2024

Choose a reason for hiding this comment

wagoodman commented Nov 18, 2024 •

edited

Loading

kzantow commented Nov 19, 2024 •

edited

Loading