Skip to content

define handling of plus and space#261

Closed
matt-phylum wants to merge 3 commits intopackage-url:mainfrom
matt-phylum:plus-space
Closed

define handling of plus and space#261
matt-phylum wants to merge 3 commits intopackage-url:mainfrom
matt-phylum:plus-space

Conversation

@matt-phylum
Copy link
Contributor

@matt-phylum matt-phylum commented Nov 7, 2023

This is a big change (in terms of impact, not lines), and I'm not entirely sure its the correct change, but something needs to be done in this area.

Problem

The PURL spec describes qualifiers as being an & delimited sequenced of = delimited key value pairs where the value is percent encoded, and the section on encoding describes a minimal set of characters that are supposed to be percent encoded in different contexts. This looks a lot like x-www-formurlencoded, but x-www-formurlencoded encodes almost all characters besides the ascii alphanumeric set, and has a special behavior where ' ' is encoded as '+'.

I am aware of thirteen PURL implementations:

Implementation ' ' decodes as ' ' ' ' encodes as %20 '+' decodes as '+' '+' encodes as %2B
althonos/packageurl.rs ⚠️
anchore/packageurl-go ⚠️ implemented in prerelease
giterlizzi/perl-URI-PackageURL ☠️ parses incorrectly
maennchen/purl 💣 does not parse ⚠️ sometimes encodes as ' ' ☠️ '+' is encoded as '+' which decodes as ' '
package-url/packageurl-dotnet
package-url/packageurl-go
package-url/packageurl-java 💣 does not parse
package-url/packageurl-js
package-url/packageurl-php
package-url/packageurl-python
package-url/packageurl-ruby
package-url/packageurl-swift
phylum-dev/purl
sonatype/package-url-java ☠️ '+' is sometimes encoded as %20

That means if you're working with qualifiers that have spaces or plus signs in their values, it's fairly likely that software using a different implementation of PURL will interpret the PURLs differently. This may also happen if your PURLs have plus signs in their versions (deb) or spaces in their names (swid) and the implementation incorrectly decodes the name and version as if they were qualifier values (see below).

6/14 of the implementations are decoding '+' as ' ', so here's why I think it's better for the spec to specify '+' is '+':

  1. The spec doesn't say it should be x-www-formurlencoded, although it could be argued that it is implied. Using x-www-formurlencoded would make PURLs more compatible with regular URL parsers and utilities (but still not 100% compatible IIRC), at the expensive of making the specification more complicated.
  2. Three of those six implementations that encode and decode qualifier values as if they were x-www-formurlencoded also encode and decode other parts of the PURL, such as the version, as if it were an x-www-formurlencoded value. This is not standard URL behavior and it is not part of the PURL spec. This seems like an implementation mistake, incorrect regardless of how the qualifier values are encoded. (package-url/packageurl-dotnet, package-url/packageurl-ruby, sonatype/package-url-java)
  3. Having a different encoding scheme for the qualifiers vs the rest of the PURL, while common for regular URLs, can lead to confusion, both for implementors (see above) and for users when dealing with PURLs that contain plus signs in and out of qualifiers.

Proposal

  1. ' ' decodes as ' ' and '+' decodes as '+' (as in 8/14 implementations)
  2. In qualifiers, ' ' encodes as %20 and '+' encodes as %2B (as in 8/14 implementations) to avoid ambiguity when PURLs are parsed by implementations pre this change (unfortunately, there's nothing we can do about parsing PURLs created by implementations pre this change)
  3. In the rest of the PURL, ' ' encodes as %20 and '+' encodes as '+' because that's how the spec was written, it's not ambiguous that the rest of the PURL might be x-www-formurlencoded, and plus signs seem like they're common enough with some package types that there's benefit in keeping them unencoded.

The spec is updated to be specific, new tests are added to the test suite, and incorrectly escaped examples in the package types spec are updated to be consistent.

Unfortunately, this requires changes to at least 7/14 implementations to get everything aligned, but they should be minor changes.

@pombredanne
Copy link
Member

@matt-phylum this is more than nice research and work you did there! 🙇 Let me review this in details and come with feedback!

@matt-phylum
Copy link
Contributor Author

I found links to three more implementations in the readme file of this repository and updated the description to include them. packageurl-java merged some PRs that fixed most of the problems with spaces and plus signs so it's been updated here.

@matt-phylum
Copy link
Contributor Author

I've published a slightly cleaned up version of the code used to collect the data for the above table in https://github.com/phylum-dev/purl-survey. Hopefully it's useful for future cross-implementation testing like this.

@tiegz
Copy link

tiegz commented Jan 23, 2024

+1 to this.

I just discovered that syft will not encode + in Golang versions even though + is a valid character in SemVer, so PURL parsers of Syft-generated PURLs will decode as v1.2.3 incompatible instead of v1.2.3+incompatible.

I think you could probably add this https://github.com/anchore/packageurl-go library to this list since that's what syft is using.

jkowalleck
jkowalleck previously approved these changes Oct 31, 2024
@johnmhoran johnmhoran added the Ecma standard Part of the Ecma standard for PURL label Nov 5, 2024
@johnmhoran
Copy link
Member

johnmhoran commented Nov 21, 2024

Hi @matt-phylum . Thank you for your PR. When you have the chance, could you please resolve the conflicts referred to below?

@pombredanne
Copy link
Member

@matt-phylum Hey, now that we clarified the encoding, this is still relevant... do you want to get a quick stab at updating this PR to the latest spec? Otherwise @johnmhoran or I can handle it.
Thanks for being so patient and consistently helpful.

@johnmhoran
Copy link
Member

johnmhoran commented Jul 1, 2025

@matt-phylum Thanks for your PR and thoughtful analysis. It looks like your proposed character-encoding changes concerning the plus '+' and the space ' ' have been addressed by PRs we've merged in the last month or so. I'm going to go ahead and close this PR, but please let me know if you spot anything that still needs to be addressed.

Two of your related proposed changes concern PURL-TYPES.rst and test-suite-data.json. We've split our work between the PURL standard -- the core -- and the larger PURL specification, of which the standard is a part. Since we're handling work on the types and the test suite as part of the spec, not the standard, please open a new PR for each of these so we can address them individually.

[edit] I forgot to mention -- I'll take care of the issue @pombredanne opened concerning the plus '+' in the type definition. See #436 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

1 high priority Ecma standard Part of the Ecma standard for PURL PURL component: qualifiers PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL encoding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants