Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Supporting Maven registries #208

Closed
oliverchang opened this issue Oct 24, 2023 · 16 comments · Fixed by #231
Closed

Proposal: Supporting Maven registries #208

oliverchang opened this issue Oct 24, 2023 · 16 comments · Fixed by #231

Comments

@oliverchang
Copy link
Contributor

The current Maven ecosystem definition is "The Maven Java package ecosystem. The name field is a Maven package name.", which is a little vague.

We should clarify that this is referring to Maven Central by default, and if a different registry is required, allow the ecosystem to be "Maven:", similar to the Linux distro definitions (e.g. "Debian:7").

The default (Maven Central) should always be used/preferred where it makes sense.

@oliverchang
Copy link
Contributor Author

@darakian @cuixq thoughts?

@cuixq
Copy link

cuixq commented Oct 26, 2023

I think this makes sense, and here are a few things about this in my mind:

  • There are nearly 2000 Maven repositories (see mvnrepository), we may need to have a way to define how we identify a Maven repository, maybe their URL?
  • What if a vulnerable package exists in multiple registries? Do we include all repositories or just Maven Central if the package can be found in Central?
  • Related to above, we have seen the same Maven package version in different repositories have different contents (looking at their hashes). For these versions, do we need to check which version is actually the vulnerable one? Good news is that this case is rare, at least for the repositories that we did analysis.

@rhalar
Copy link

rhalar commented Oct 26, 2023

Note that there may also be different targets per version for one repository, on some Maven packages, e.g.

https://mvnrepository.com/artifact/co.fs2/fs2-io

@darakian
Copy link
Collaborator

The default (Maven Central) should always be used/preferred where it makes sense.

100% agree. Perhaps it makes sense to add an optional field to the OSV payload to define a registry. eg.

{
      "package": {
        "ecosystem": "Maven",
        "name": "io.netty:netty-handler"
        "registry": "mycoolregistry.com"
      },

Where the field missing would be equivalent to "registry": "https://repo.maven.apache.org/maven2"
I think the registry term should refer to the directory listing of packages, but maybe that's up for discussion.

So, the downside to this approach would be that we could not have a single advisory which may apply to a coherent version range which was uploaded to multiple registries.
eg. https://mvnrepository.com/artifact/org.apache.activemq/activemq-core

@joshbressers
Copy link

It might be wise to think about this in larger context. It's not hard to imagine a universe where there are multiple registries (not just Java) hosting the same artifacts. Containers are another easy example here, there are many registries serving up the same content.

It would be fair to set some boundaries, like the artifact name and version should be consistent across repositories

I agree with @darakian, defining a registry as a URL would allow for flexibility and some future proofing

@joshbressers
Copy link

I would like to give this a bump as it's become stale

@oliverchang
Copy link
Contributor Author

oliverchang commented Jan 8, 2024

Thanks for the reminder @joshbressers ! Let me write out a more detailed consideration of all the possibilities here, and their implications for both vulnerability database maintainers and vulnerability scanners.

Please let me know what everyone thinks on this!

The main two alternatives are:

1. Add a registry parameter to the package

    {
      "package": {
        "ecosystem": "Maven",
        "name": "io.netty:netty-handler"
        "registry": "mycoolregistry.com"
      },

2. Use ":" to define registries, with ecosystem-specific rules.

   {
      "package": {
        "ecosystem": "Maven:https://mycoolregistry.com",
        "name": "io.netty:netty-handler"
      },

There is a third alternative, or defining every single Maven registry out there as its own ecosystem, but this is infeasible as @cuixq points out in #208 (comment).

Handling overlaps between registries

In both cases, one can of worms that multiple registries open up is conflicting packages across different registries (i.e. the same package name across different registries). This makes it difficult to determine which registry to specify in an advisory for a database maintainer, and how vulnerability scanners should behave.

Most of the ecosystems that OSV supports today have a single, default/canonical repository (i.e. Maven Central in Maven's case) where the vast vast majority of open source packages live. This should be the default and preferred registry used by both database maintainers and vulnerability scanners.

Using Maven as an example, where a registry is a (possible subset) mirror of Maven Central, users of that registry can still directly use OSV advisories that key on Maven Central. These mirrors may also contain a small number of packages where vulnerabilities are fixed faster than Maven Central (or have fixes backported to older release branches), and in these cases VEX statements should be issued by the registry to communicate these to vulnerability scanners.

When a registry diverges from Maven Central in that it can no longer be considered a mirror (i.e. naming and version conflicts exist), then that should be considered separately. There will need to be a separate vulnerability database for this registry, and vulnerability scanners will need to explicitly opt out of Maven Central scanning for these registries. This will be very rare -- @cuixq also did some analysis a while back on Maven that found that package conflicts against Maven Central in other registries was extremely minimal.

Which alternative?

Now, which alternative should we choose?

Option 1 is attractive because it's a structured field that makes parsing trivial. However, this has a major downside: Most OSV ecosystems tie the registry into the ecosystem definition. So far, Maven has been the only case where we need to support multiple registries inside the ecosystem, so this field will be redundant/confusing for most ecosystems.

@joshbressers brought up containers as another example where a registry field would be useful, but this is just one more example out of the 22 ecosystems we have today that don't need it.

There's also a separate discussion to be had whether or not it makes sense for advisories (as opposed to VEX) be issued against containers themselves -- when most vulnerability scanners today traverse the container to discover vulnerabilities inside it already.

Option 2 has the downside that it's a little bit ugly, but it is consistent with existing OSV conventions (e.g. Linux distros use ":" to qualify their releases in the ecosystem string -- "ecosystem": "Debian:10"). Regarding extensibility, if there are more general qualifiers required in the future that apply to most ecosystems, we can still consider separate fields for them.

Conclusion?

Option 2. seems like the most consistent with what we have today, and what we should proceed with.

@rhalar
Copy link

rhalar commented Jan 8, 2024

Another alternative which seems very in line with 'Option 2' would be to encode this information in purls perhaps? It is a field which already exists in package and has qualifiers which could be used for purposes such as these. They would work even for the Debian case, and maybe for things such as #202?

It's unfortunate that purl fields are optional but the default could just be Maven Central. But it would have to be enforced that entries with the same ecosystem/name must have purls with a different registry qualifier.

@joshbressers
Copy link

@oliverchang I think Option 2 sounds sane. There's nothing we can pick that will make everyone happy, and option 2 should catch a lot of the current examples

@darakian
Copy link
Collaborator

I think I prefer option 1 based on the aesthetics of it, but I don't have any real complaint about option 2. I do agree that we probably want to avoid having multiple registries on a single ecosystem where possible too.

@chrisbloom7 do you have any thoughts?

@pombredanne
Copy link

@oliverchang

Summary:

  • I would prefer the missing 3. Use PURL approach over the suggested 1. and 2.
  • Option 2. has issues wrt. stuffing multiple data in a single field with yet another syntax to parse and is a problem.
  • Option 1. using "repository_url" instead of "registry" would be the least problematic.

Details:

FWIW, the PURL way is to use a qualifier with the repository_url for things off Maven central. It has worked nicely so far.

repository_url is an extra URL for an alternative, non-default package repository or registry. When a package does not come from the default public package repository for its type a purl may be qualified with this extra URL. The default repository or registry of a type is documented in the "Known purl types" section.

If you are not adopting PURL, I would suggest at the minimum to keep similar names for similar concepts and use "repository_url" and not "registry".

I would strongly advise against stuffing multiple attributes in a single field as in your solution 2. "ecosystem": "Maven:https://mycoolregistry.com" as this is forcing yet another layer of parsing on downstream users and tools.

I would also suggest to look beyond Maven, as alternative repositories are common for RPMs, NuGet and other package types.

But again, I would strongly suggest that you adopt PURL as a base, as many of the questions you have raised here and in #202 (for sources) may have been addressed in the spec and would best addressed in PURL otherwise. Each time you adopt a different naming convention and attribute names, you are effectively making it harder for users and the community: their SCA tools and SBOMs return PURLs and they need to add a (likely lossy) translation layer between these PURL data and OSV's. Not a happy thing IMHO.

@darakian

I do agree that we probably want to avoid having multiple registries on a single ecosystem where possible too.

The way you interact with any Maven repository is always the same, and each Maven repository does not define a new type of package, just a different base package repository URL, and IMHO never a different, new "ecosystem".
While the bulk of things come from the default repo

@oliverchang
Copy link
Contributor Author

oliverchang commented Feb 12, 2024

Thanks for the feedback @pombredanne !

Re adopting PURL as the base, it would be a rather large breaking change for the schema. PURL is absolutely supported for interop, and https://osv.dev provides conversion between PURL and OSV types in both its re-exported entries and the API, to avoid conversion pain for users.

Re fitting the attribute inside "ecosystem", it fits with our existing conventions in OSV for e.g. Linux distro ecosystems such as

      "package": {
        "ecosystem": "Debian:11",
        "name": "libgit2"
      },

Re alternate repositories in general, as discussed in #208 (comment) it seems like for the vast majority of ecosystems, vulnerability databases would be unlikely to encode vulnerabilities outside of the main default registries. Maven happens to be an outlier, because of large alternative registries such as https://repo.jenkins-ci.org/releases/ or https://maven.google.com/

Are there examples in other such ecosystems? That would make a stronger case for adding e.g. repository_url. For cases like rpm, OSV's approach has to been to encode the individual distros as separate ecosystems, as an "ecosystem" in OSV just refers to a defined namespace (typically both the packaging mechanism and the actual registry).

@rhalar
Copy link

rhalar commented Feb 12, 2024

I also advocated for PURLs in #208 (comment) and I agree that would perhaps be a good course of action, it would help with a lot of things down the line too, e.g. #202, and my note in the same issue about different builds of e.g pypi packages (wheel, egg, sdist) or different builds on conan.io.

We've already encountered situations where only one them is actually malicious/vulnerable while the others were not. Vulnerabilities for specific architectures or OSs also come up.

Also, some Maven packages for Scala have builds for different targets, e.g https://mvnrepository.com/artifact/co.fs2/fs2-io
But vulnerabilities are reported on the base package, e.g. https://osv.dev/vulnerability/GHSA-2cpx-6pqp-wf35.

Though all of that could get resolved with a binary artifact field the as discussed in #202.

PURLs would perhaps allow encoding ecosystem specific information such as affected functions for crates, or affected imports and paths for Go, if that seems of interest.

@darakian
Copy link
Collaborator

I'm aligned with @oliverchang here. I don't see a reason to make defining an alternate source of packages more complicated than it needs to be. PURLs would also be inconsistent with any advisories which apply to a range of packages rather than to a single specific version.

The way you interact with any Maven repository is always the same, and each Maven repository does not define a new type of package, just a different base package repository URL, and IMHO never a different, new "ecosystem".
While the bulk of things come from the default repo

Each Maven package registry does define a new package namespace though and that needs to be accounted for.

@rhalar
Copy link

rhalar commented Feb 12, 2024

I'm not too opposed to other solutions but I guess what we are worried about is that any new issues that pop up will necessitate further additions and new fields to the format. The PURL seems like a natural choice since it is designed to be 'ecosystem specific' in a way and it already exists in the schema.

Note that to resolve both this issue and #202, alterations need to be made to the package object, that's exactly the place where the purl resides :)

PURLs would also be inconsistent with any advisories which apply to a range of packages rather than to a single specific version

A PURL can identify a package only, the version component is optional. It is effectively the same solution as Option 1 or 2, just encoded differently.
If you meant multiple binaries per source package, then yes, each package would need to be a separate entry.

There are other challenges in using the purl field tho, since it is currently optional. And there would need to be some kind of implicit defaults for every ecosystem.

@oliverchang
Copy link
Contributor Author

Thank you all for the feedback re PURL. While the OSV-Schema absolutely supports PURL for interop (as discussed in #208 (comment)), it's not possible to move to PURL as the source of truth for identifying packages without a large breaking change requiring migration from the current data providers. There are also certain underspecified parts in the PURL spec (e.g. package-url/purl-spec#247) that would have to be resolved.

I'll open a PR for option 2 of #208 (comment) shortly. I'm not seeing any answers/examples of other ecosystems that would point to the necessity of a repository_url/registry (given that OSV ecosystems define a namespace that includes the registry in almost all cases).

oliverchang added a commit that referenced this issue Mar 20, 2024
oliverchang added a commit that referenced this issue Mar 21, 2024
Fixes #208.

Signed-off-by: Oliver Chang <[email protected]>
oliverchang added a commit that referenced this issue Apr 4, 2024
Fixes #208.

---------

Signed-off-by: Oliver Chang <[email protected]>
Signed-off-by: Oliver Chang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants