Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable GHSA to consolidate affected entries #35

Open
oliverchang opened this issue Mar 3, 2022 · 15 comments
Open

Enable GHSA to consolidate affected entries #35

oliverchang opened this issue Mar 3, 2022 · 15 comments

Comments

@oliverchang
Copy link
Contributor

oliverchang commented Mar 3, 2022

Context

Currently, GitHub's Advisory Database encodes pairs of (introduced, fixed) events in separate affected objects. This is because GitHub needs to preserve these pairs when converting the OSV entry back into their internal database represenation.

An example: https://github.com/github/advisory-database/blob/d6004eb8de91ad341605da869ab1b9f1e4abe433/advisories/github-reviewed/2017/10/GHSA-xgr2-v94m-rc9g/GHSA-xgr2-v94m-rc9g.json#L14

"affected": [
    {
      "package": {
        "ecosystem": "RubyGems",
        "name": "activesupport"
      },
      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            { "introduced": "2.3.2" },
            { "fixed": "2.3.16" }
          ]
        }
      ]
    },
    {
      "package": {
        "ecosystem": "RubyGems",
        "name": "activesupport"
      },
      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            { "introduced": "3.0.0" },
            { "fixed": "3.0.20" }
          ]
        }
      ]
    }
  ],

One obvious alternative here is to use a single affected entry and have two separate ECOSYSTEM ranges in there:

      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            { "introduced": "2.3.2" },
            { "fixed": "2.3.16" }
          ]
        }, 
        {
          "type": "ECOSYSTEM",
          "events": [
            { "introduced": "3.0.0" },
            { "fixed": "3.0.20" }
          ]
        }
      ]

However, I believe this wasn't done, because GitHub needs to record version metadata using a <= operator, and relies on database_specific to this. One example: https://github.com/github/advisory-database/blob/d6004eb8de91ad341605da869ab1b9f1e4abe433/advisories/github-reviewed/2020/03/GHSA-gww7-p5w4-wrfv/GHSA-gww7-p5w4-wrfv.json

This has a bunch of affected entries that look like:

GHSA-gww7-p5w4-wrfv.json

    {
      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.0.0"
            }
          ]
        }
      ],
      "database_specific": {
        "last_known_affected_version_range": "<= 2.0.6"
      }
    },

Problems

There are a number of problems here:

  1. Using the the current evaluation pseudo-code, the semantics of the GHSA-gww7-p5w4-wrfv.json entry above would mean: everything after 2.0.0 is affected, regardless of any fixes that are encoded afterwards in other affected entries.

    This advisory has fixes later specified in 2.8.11.5 and 2.9.10.1, but they're not going to be considered under the current evaluation algorithm. Any version after 2.0.0 will be incorrectly considered to be vulnerable.

  2. Having duplicate package entries makes the entry a lot less lean than it should be

Proposed solutions

Add database_specific to affected[].ranges[]

If we instead add database_specific to affected[].ranges[], we could encode GHSA-gww7-p5w4-wrfv as

{
      "package": {
        "ecosystem": "Maven",
        "name": "com.fasterxml.jackson.core:jackson-databind"
      },
      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.0.0"
            },
          ],
          "database_specific": {
            "last_known_affected_version_range": "<= 2.0.6"
          }
        }, 
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.1.0"
            },
          ],
          "database_specific": {
            "last_known_affected_version_range": "<= 2.1.5"
          }
        }, 
        ...
      ],

The pseudo-code as is should work as intended with this encoding (with minor changes re sorting), and the resulting JSON has far less duplication.

Support a <= event

Similar to above, rather than encoding this in database_specific, we support a "last_affected" event or something similar:

      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.0.0",
              "last_affected": "2.0.6"
            },
          ],
        }, 
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.1.0",
              "last_affected": "2.1.5"
            },
          ],
        }, 

In the vast majority of cases, GitHub does actually know the exact fix version (example), but they would still like to record this <= information for use in tools such as Dependabot.

Supporting this would mean we potentially complicate our version specification, and potentially encourage entries that only use last_affected when we really want fixed in all cases. On the other hand, CVE 5.0 does support a lessThanOrEqual, and perhaps OSV was missing this.

Fix the evaluation algorithm to consider all affected entries for the same package.

We could alternatively make the pseudo-code handle these problematic cases, but it doesn't seem ideal as it:

  • Does not address the duplication issue. The intention of having multiple duplicate affected entries for the same package was to encode things like platform configurations where the set of affected versions are not the same.
  • Complicates the algorithm.
@oliverchang
Copy link
Contributor Author

@rsc @chrisbloom7 WDYT? I'm starting to think supporting an event that encodes "<=version" makes sense, especially since it's supported in CVE 5.0.

@melkikh
Copy link

melkikh commented Mar 4, 2022

Add database_specific to affected[].ranges[]

But in this case, affected[].ranges[].database_specific.last_known_affected_version_range doesn't correspond to the versioning format relevant for ECOSYSTEM (in your example, <= 2.0.6 is semver, but the native versioning format for Maven is different).

+1 for supporting <= event (with the caveat that fixed and last_known_affected events cannot be used simultaneously).

@chrisbloom7
Copy link
Collaborator

@oliverchang Thanks for explaining why the current approach could be problematic and for iterating through a number of potential solutions.

For starters, I would love to remove our dependence on custom database_specific hint fields like last_known_affected_version_range as it adds hidden requirements to contributing to our repo that are not immediately clear outside of the curation pipeline and would be easy to get wrong and difficult to explain how to get it right - a classic anti-pattern that we must unfortunately rely on at the moment. So, adding support for a last_known_affected event type (and validating that either it or fixed can be present but not both) sounds really appealing. I share your conern that this could end up being used where fixed would be more appropriate. Maybe naming it last_known_unfixed etc might offer a built-in hint that it is less preferred than fixed? Just a suggestion.

If that potential for misuse makes option 2 less attractive, then your first suggestion for Add database_specific to affected[].ranges[] does resolve one problem we would otherwise have with rehydration (the last_known_affected_version_range hinting key). There is another potential issue with mixing affected[].versions fields with affected[].ranges[] where the versions don't match (we use this to determine when a range should be rehydrated using the = operator), but I have some ideas for working around that problem.

@chrisbloom7
Copy link
Collaborator

chrisbloom7 commented Mar 7, 2022

grumble ... thinking about this further, we use last_known_affected_version_range to retain enough context about an upper bound that we can determine if it originated as a < or as a <=, in the absence of a fixed version. We would still need a way to do that even if last_known_affected event was introduced.

@oliverchang
Copy link
Contributor Author

oliverchang commented Mar 8, 2022

Thanks for the comments!

Perhaps we can start with the range-specific database_specific, since that would be required either way?

@rsc, what do you think of all of this?

@chrisbloom7
Copy link
Collaborator

Maybe we can do both options 1 and 2? They serve different purposes, and having both options available would be useful on our side to clean up our round-trip transformation process.

@rsc
Copy link

rsc commented Mar 15, 2022

It is a design goal for OSV for all clients to be able to determine in an algorithmically precise way whether a particular version is affected. Having "database_specific" in the affected object would harm that goal, because now the meaning depends on knowing about each particular database. So I'd very very much like to avoid doing that.

Changing the algorithm has similar problems.

That leaves the potential for adding a <= operator in some form. Since we did that for CVE JSON 5.0 and we ideally want to be able to convert losslessly between CVE and OSV, it does seem reasonable to add <=. Today we have "introduced" and "fixed". If we added <=, I assume the way it would work is that we'd add "last_affected", and an entry can have either "last_affected" or "fixed" but not both. And then the test would be < fixed or else <= last_effected. I think that matches CVE JSON 5.0's lessThan and lessThanOrEqual.

Do I have that right?

@oliverchang
Copy link
Contributor Author

oliverchang commented Mar 15, 2022

It is a design goal for OSV for all clients to be able to determine in an algorithmically precise way whether a particular version is affected. Having "database_specific" in the affected object would harm that goal, because now the meaning depends on knowing about each particular database. So I'd very very much like to avoid doing that.

Right, in this case though, I don't think database_specific would necessarily change the algorithm. <= isn't completely necessary for GitHub. It's mostly providing necessary metadata for GitHub to convert entries back to their internal DB representation (and may still be necessary even if we add last_affected).

Changing the algorithm has similar problems.

That leaves the potential for adding a <= operator in some form. Since we did that for CVE JSON 5.0 and we ideally want to be able to convert losslessly between CVE and OSV, it does seem reasonable to add <=. Today we have "introduced" and "fixed". If we added <=, I assume the way it would work is that we'd add "last_affected", and an entry can have either "last_affected" or "fixed" but not both. And then the test would be < fixed or else <= last_effected. I think that matches CVE JSON 5.0's lessThan and lessThanOrEqual.

I'm also leaning toward this given the CVE JSON 5.0 support and the closer alignment with GHSA. I can draft up a PR for this if we all agree here.

Do I have that right?

@chrisbloom7
Copy link
Collaborator

chrisbloom7 commented Mar 15, 2022

Might I suggest that if we're willing to add last_affected we also add last_known_good, and potentially last_known_bad? With those in place we can easily represent >, >= (i.e. introduced), = (introduced and/or specific versions), <= (this proposal), and < operators, all of which can be used to describe certain problems, eliminating the need for special database specific context hints that are not obvious how to use or consistent across databases. We've discussed the possibility of a last_known_good operator in the past (#31) and I could not make a strong argument for it at the time, but realistically sometimes exclusive operators are the best way to describe a vulnerability when a curation team cannot identify an inclusive version for any number of reasons. The same constraints would apply, i.e. only one of ..., and it's still machine parseable. I will end this suggestion with what is likely obvious to all of us, which is that curation is difficult because humans are difficult and humans create version strings that are difficult to parse, even for machines. The more flexible the spec can be, the better we can represent all classes of vulnerabilities from all databases. /soapbox

@chrisbloom7
Copy link
Collaborator

chrisbloom7 commented Mar 15, 2022

(Also, selfishly, that would solve soooo many problems on our translation side)

@oliverchang
Copy link
Contributor Author

I do think we want to optimise more for the consumers of these entries to be able to consume/edit them more simply. As @rsc also explored in #31 (comment), there are other reasons to not add operators like "last_known_good", including lack of CVE 5.0 support and inaccurate encodings.

Given the compatibility constraints of our schema, adding new fields/semantics is expensive and I think we should be a bit cautious here in introducing too many new fields at once.

@rsc
Copy link

rsc commented Mar 16, 2022

I agree with @oliverchang about being very selective in how many different constraints we add, for all the reasons he enumerated. I suggest we start with just "last_affected".

oliverchang added a commit that referenced this issue Mar 17, 2022
This is intended only for metadata that enables databases to losslessly
convert OSV entries back into their original representation.

Part of #35.
oliverchang added a commit that referenced this issue Mar 17, 2022
oliverchang added a commit that referenced this issue Mar 17, 2022
oliverchang added a commit that referenced this issue Mar 24, 2022
* Add `last_affected` event type.

Part of #35.

* Update docs/schema.md

Co-authored-by: Chris Bloom <[email protected]>

* JSON validation

Co-authored-by: Chris Bloom <[email protected]>
oliverchang added a commit that referenced this issue Mar 24, 2022
* Add database_specific to `affected[].ranges[]`.

This is intended only for metadata that enables databases to losslessly
convert OSV entries back into their original representation.

Part of #35.

* Update docs/schema.md

Co-authored-by: Chris Bloom <[email protected]>

Co-authored-by: Chris Bloom <[email protected]>
oliverchang added a commit that referenced this issue Mar 28, 2022
* Add `last_affected` event type. (#38)

* Add `last_affected` event type.

Part of #35.

* Update docs/schema.md

Co-authored-by: Chris Bloom <[email protected]>

* JSON validation

Co-authored-by: Chris Bloom <[email protected]>

* Add database_specific to `affected[].ranges[]`. (#37)

* Add database_specific to `affected[].ranges[]`.

This is intended only for metadata that enables databases to losslessly
convert OSV entries back into their original representation.

Part of #35.

* Update docs/schema.md

Co-authored-by: Chris Bloom <[email protected]>

Co-authored-by: Chris Bloom <[email protected]>

* Bump version and add change log.

Co-authored-by: Chris Bloom <[email protected]>
@G-Rath
Copy link
Contributor

G-Rath commented Jul 5, 2022

@oliverchang I'm just in the process of implementing support for last_affected in osv-detector and wanted to check if the last example in your OP is correct or not:

      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.0.0",
              "last_known_affected": "2.0.6"
            },
          ],
        }, 
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.1.0",
              "last_known_affected": "2.0.6"
            },
          ],
        }, 

Should that last last_known_affected really be 2.0.6, or should it be 2.1.5 (based on the example above it)?

@oliverchang
Copy link
Contributor Author

@oliverchang I'm just in the process of implementing support for last_affected in osv-detector and wanted to check if the last example in your OP is correct or not:

      "ranges": [
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.0.0",
              "last_known_affected": "2.0.6"
            },
          ],
        }, 
        {
          "type": "ECOSYSTEM",
          "events": [
            {
              "introduced": "2.1.0",
              "last_known_affected": "2.0.6"
            },
          ],
        }, 

Should that last last_known_affected really be 2.0.6, or should it be 2.1.5 (based on the example above it)?

Good catch, thanks! That was indeed a bad copy/paste. I've fixed up the original post, and renamed this to the actual field name we went with (last_affected).

@G-Rath
Copy link
Contributor

G-Rath commented Jul 9, 2022

@oliverchang awesome, thanks for that! I've landed support for last_affected in v0.7.0 of osv-detector

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants