Skip to content

Conversation

@matt-phylum
Copy link
Contributor

About colons, the spec says:

- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere
- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
It is unambiguous unencoded everywhere

Therefore, the colon should not be encoded. I've read #364 and I can't understand the reasoning for changing the unencoded colon to be encoded. #364 also quotes the spec and references another issue, contradicting its own implementation.

9/14 implementations agree that this colon is not to be encoded.

…ing"

This reverts commit e2c8eec, reversing
changes made to 2d06019.
@johnmhoran
Copy link
Member

johnmhoran commented Mar 10, 2025

@matt-phylum @pombredanne These test changes are putting the cart before the horse. Let's first clarify the qualifiers component section and the related "Character encoding" section (see #398), and we can then simply apply the clarified rules to this and other tests. That would address not just the long-discussed colon but all other characters that are permitted and need to be addressed. For example, if the colon MUST be unencoded throughout the purl, we can say so, clearly.

@johnmhoran
Copy link
Member

@pombredanne What did you mean by the language @matt-phylum cites at the top: It is unambiguous unencoded everywhere? The colon separator, i.e., the colon when (and only when) used as that defined separator? Or the colon, wherever used in the purl? Or something else? Why refer at all to "the ':' scheme and type separator" if we mean the literal colon character wherever used?

@matt-phylum
Copy link
Contributor Author

This PR is just changing the test case back to be consistent with the existing documentation and other tests. The test used to be this way for 7 years from 2017-11-16 when the test suite was introduced until 2025-02-05. As it currently stands, you cannot pass the tests without a special case to encode : in the version but not in the name or qualifiers, which is not part of the spec.

The colon character is unambiguous unencoded everywhere. Not the colon separator. Colons only have meaning when used after pkg.

The wording of these rules has always been problematic. The first rule says that the separator must not be encoded, but that the : may need to be encoded when it's not a separator, and then second rule immediately clarifies that the case where the character :, not the separator, requires encoding is never.

Compare to the other rules:

- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator
does not need to and must NOT be percent-encoded. It is unambiguous unencoded
everywhere

This one is technically wrong. There is one time when slashes are ambiguous and do need to be encoded: when they are within the name. However, it follows the same pattern of referencing the character / when not used as a separator by calling it the separator. "It is unambiguous unencoded everywhere" means slashes within qualifiers are not encoded, and the tests agree.

- the '@' ``version`` separator must be encoded as ``%40`` elsewhere
- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere
- the '=' ``qualifiers`` key/value separator must NOT be encoded
- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere

For these ones it is more obvious what's going on. @ is only the version separator when it is used as the version separator, and it must never be encoded when it is the version separator. However, the rule says that the @ character must be encoded as %40 by saying that the version separator character must be encoded when it's not the version separator, and the same pattern is used for the other characters that must be encoded.

These tests are in contradiction to the change being reverted:

{
"description": "Hugging Face model with staging endpoint",
"purl": "pkg:huggingface/microsoft/deberta-v3-base@559062ad13d311b87b2c455e67dcd5f1c8f65111?repository_url=https://hub-ci.huggingface.co",
"canonical_purl": "pkg:huggingface/microsoft/deberta-v3-base@559062ad13d311b87b2c455e67dcd5f1c8f65111?repository_url=https://hub-ci.huggingface.co",
"type": "huggingface",
"namespace": "microsoft",
"name": "deberta-v3-base",
"version": "559062ad13d311b87b2c455e67dcd5f1c8f65111",
"qualifiers": {"repository_url": "https://hub-ci.huggingface.co"},
"subpath": null,
"is_invalid": false
},

The colon in the qualifier is not encoded.

{
"description": "cpan module name are case sensitive",
"purl": "pkg:cpan/URI::[email protected]",
"canonical_purl": "pkg:cpan/URI::[email protected]",
"type": "cpan",
"namespace": null,
"name": "URI::PackageURL",
"version": "2.11",
"qualifiers": null,
"subpath": null,
"is_invalid": false
},

The colon in the name is not encoded.

@pombredanne
Copy link
Member

@matt-phylum @johnmhoran Thanks for all the research and details!

@pombredanne
Copy link
Member

@matt-phylum re:

9/14 implementations agree that this colon is not to be encoded.

So my intent, albeit looking weird to me in hindsight, was indeed to ensure that we would NOT encode the colon anywhere indeed. Let's update the spec accordingly and will can merge these tests afterwards.

Thank you++

Copy link
Contributor

@jeremylong jeremylong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - but I agree with @pombredanne, the wording in the spec should be updated

@pombredanne
Copy link
Member

@matt-phylum re:

So my intent, albeit looking weird to me in hindsight, was indeed to ensure that we would NOT encode the colon anywhere indeed. Let's update the spec accordingly and will can merge these tests afterwards.

We are getting close to merge state in #398 for this

@johnmhoran
Copy link
Member

Given the changes we made yesterday to the "purl separators" section -- "When percent-encoding is required, all characters MUST be encoded except for the colon ':'." -- the changes proposed for the test (leaving the ':' unencoded in the version) look good.

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Especially now that #439 is merged!
Thanks.

@pombredanne pombredanne merged commit b751238 into package-url:main Apr 2, 2025
@jkowalleck jkowalleck mentioned this pull request Apr 2, 2025
2 tasks
jkowalleck added a commit to jkowalleck/fork_purl-spec that referenced this pull request Apr 2, 2025
incorporate package-url#416

Signed-off-by: Jan Kowalleck <[email protected]>
iseki0 added a commit to iseki0/PUrlKt that referenced this pull request Apr 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants