-
Notifications
You must be signed in to change notification settings - Fork 221
Clarify encoding of colon between scheme and type #361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The colon separator between scheme and type must not be encoded. This is also using, a new, and not yet adopted wording for MUST according to its definition in RFC2119 as clarified in RFC8174 Suggested-by: @gernot-h Reference: #39 (comment) Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
|
Great, thx for addressing this! I think this makes very clear that special characters must not be encoded as separators. |
| @@ -247,8 +247,9 @@ Use these rules for percent-encoding and decoding ``purl`` components: | |||
| - the '#', '?', '@' and ':' characters must NOT be encoded when used as | |||
| separators. They may need to be encoded elsewhere | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fully address the confusion of #39, I would probably also change the last sentence in the para before:
| separators. They may need to be encoded elsewhere | |
| separators. Some of them need to be encoded elsewhere as specified in the rules below. |
I think, this would also make clearer where they need to be encoded.
| - The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded. | ||
| For example, in the PURL snippet ``pkg:npm`` the colon ':' MUST NOT be encoded, | ||
| and the PURL snippet ``pkg%3Anpm`` is invalid. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gernot-h @pombredanne Consider adding at the top of the file, perhaps as a new one-line paragraph following the current first paragraph, something along the lines of the following:
This specification uses RFC 2119 (https://datatracker.ietf.org/doc/html/rfc2119), as clarified in RFC 8174 (https://datatracker.ietf.org/doc/html/rfc8174), for the interpretation of certain terms, e.g.,
MUST NOT.
Or perhaps a slight modification to the example provided by RFC 2119:
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119, as clarified in RFC 8174.
(Note that the core spec currently contains a great deal of language that will need to be modified to implement RFC 2119/8174.)
PURL-SPECIFICATION.rst
Outdated
|
|
||
| - the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded. | ||
| It is unambiguous unencoded everywhere | ||
| - The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that following @JimFuller-RedHat suggestion in today's triage call #360 ... this could reworded as:
| - The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded. | |
| - The colon ':' separator between ``scheme`` and ``type`` MUST be used as-is, unencoded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, I was (badly) trying say that specs typically define grammar, schema and behavior (ex. processing rules) ... in this case I think how you had it is ok though it is easier to define a 'thing' precisely which has the side effect of defining all the things it cannot be ;)
From grammar pov we could define (in EBNF) as a literal colon ... any parser built with such an EBNF would bork on encoded input ... we might suggest defining pre processing rules to normalise odd input into canonical purl. or define a looser grammar allowing encoding though a separate matter altogether.
I would propose:
"Thescheme and type MUST be separated by a colon ':' "
Armed with a grammar (EBNF) and the spec one can discriminate exactly what is intended.
FWIW, wherever encoding is allowed in a pURL we can explicitly state that in the spec and adjust grammar to reflect that though I think the challenge is unpicking pURL set of encoding rules against the general set of encoding rules on URLs (because people use url stuff on pURL).
Maybe with v1 we are bound to explicitly define encoding behaviour - that is a set of processing rules combined with grammar to get what we have today. It might be that we define the grammar in separate section from processing rules in the spec would help untangle - eg. often a spec (with grammar, processing) is not very readable (and spawns 'annotated spec' or tutorials).
Sorry for not being very helpful!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose:
"The
schemeandtypeMUST be separated by a colon ':' "
clean and neat.
Yet someone would likely immediately bet a nickel that someone else will open an issue asking whether the colon can or must or must not be encoded. ;)
Here is the way out:
We can keep these negative assertions as comments in a new FAQ document. I think that we can create a leaner spec this way.
I think it makes sense for the spec to avoid specifying all the things that it does not support as you rightly suggest, because there is an infinity of these. And we can push "negative" support questions in that FAQ.
So here:
-
In the spec:
The scheme and type MUST be separated by a colon ':'
-
In the FAQ:
- Q: Is the colon between scheme and type encoded? Or can it be encoded? If yes, how?
- A: The spec has no mention of encoding there, so the colon should be used as-is, never encoded and never requiring any decoding since this is not encoded. It is a parsing error if the colon does not come directly after
pkg; Lenient parsing tools are welcome to recover from this error to help process and sanitize damaged purls, but that's not a requirement, and not part of the spec.
That approach can become the standard. The spec will be leaner and crisper and a the FAQ comprehensive to address the questions that will come up for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool ! fantastic suggestion - brevity and conciseness enables implementators to focus ... a good example is the json spec - https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf , which only contains normative text ;)
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
|
I added a test for the |
- This includes rewriting scheme-related language to remove the use of "not", now addressed in the newly-added FAQ (faq.rst). Signed-off-by: John M. Horan <johnmhoran@gmail.com>
|
I've just pushed a commit that (1) updates the scheme-related sections of the core spec (PURL-SPECIFICATION.rst) inter alia to remove the use of "not" and to address questions that have been raised in the open issues/PRs and (2) provides added details and clarifications (including the use of "not" where helpful) in a new faq.rst. I look forward to comments and questions, with the goal of finalizing the changes (including added modifications, tests etc. as needed) as part of the 1.0 release. This PR is intended to address only the scheme-related updates, and other updates will be addressed in a similar narrowly-focused manner. |
Reference: #376 Reference: #39 (comment) Signed-off-by: John M. Horan <johnmhoran@gmail.com>
Reference: #376 Reference: #39 (comment) Signed-off-by: John M. Horan <johnmhoran@gmail.com>
|
@matt-phylum @JimFuller-RedHat @pombredanne I've pushed my updates to this |
Reference: #39 (comment) Reference: #376 Signed-off-by: John M. Horan <johnmhoran@gmail.com>
Reference: #39 (comment) Reference: #376 Signed-off-by: John M. Horan johnmhoran@gmail.com
pombredanne
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is some feedback
PURL-SPECIFICATION.rst
Outdated
| pkg:gem/ruby-advisory-db-check@0.12.4 | ||
| pkg://gem/ruby-advisory-db-check@0.12.4 | ||
| - The ``scheme`` is a constant with the value "pkg". | ||
| - The ``scheme`` and ``type`` MUST be separated by a colon ':'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - The ``scheme`` and ``type`` MUST be separated by a colon ':'. | |
| - The ``scheme`` MUST be followed by a unencoded colon ':'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion -- change made.
PURL-SPECIFICATION.rst
Outdated
| pkg://gem/ruby-advisory-db-check@0.12.4 | ||
| - The ``scheme`` is a constant with the value "pkg". | ||
| - The ``scheme`` and ``type`` MUST be separated by a colon ':'. | ||
| - ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are | |
| - ``purl`` parsers MUST accept URLs where the ``scheme`` and colon ':' are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change made.
PURL-SPECIFICATION.rst
Outdated
| - The ``scheme`` and ``type`` MUST be separated by a colon ':'. | ||
| - ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are | ||
| followed by one or more slash '/' characters, such as 'pkg://', and MUST | ||
| ignore -- i.e., normalize by removing -- all such '/' characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ignore -- i.e., normalize by removing -- all such '/' characters. | |
| ignore and remove all such '/' characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good clarification, change made, and I'm making a corresponding change to related language in the FAQ.
PURL-SPECIFICATION.rst
Outdated
| For clarity and simplicity a ``purl`` is always an ASCII string. To ensure that | ||
| there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII | ||
| characters must be UTF-encoded and then percent-encoded as defined at:: | ||
| For clarity and simplicity, a ``purl`` is always an ASCII string. To ensure that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to refactor this section separately. @mprpic volunteered to make this PR.
Let's remove all changes to this Character encoding section from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pombredanne I've backed out all of my prior changes to the "Character encoding" section as requested. In light of @mprpic 's PR #389, I've also deleted a reference to the "Character encoding" section in the FAQ and made a few other adjustments.
|
@pombredanne @mprpic I've made the requested changes and some other needed updates, including deleting a reference in the FAQ to the "Character encoding" section in light of @mprpic 's PR #389 (not yet merged as of right now), which deletes the "Character encoding" section. One thought -- we might want to defer review of this PR until PR #389 has been merged and I can merge the updated @pombredanne I tried to request a review by you, but given the unusual parentage of this PR, which you opened, GH won't allow me to request your approving review. 🙂 |
mprpic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just some minor nitpicks around consistency.
faq.rst
Outdated
| neatly into a component-focused category. | ||
|
|
||
| If you have a question about the PURL specification and don't find an answer | ||
| below, you can open an issue `here <https://github.com/package-url/purl-spec/issues/new?template=Blank+issue>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably ditch this paragraph since it's stating something that is hopefully obvious to anyone interacting with this repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mprpic I wholeheartedly agree. 🙂 Will do as soon as we clarify whether to retain (temporarily or otherwise) the current lowercase purl usage in the core spec. (See my comment further below to you and @pombredanne re Ecma TC54-TG2's approval of 'PURL'/'Package-URL'.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed para re opening an issue.
faq.rst
Outdated
|
|
||
| **QUESTION**: Can the ``scheme`` component be followed by a colon and two slashes, like a URI? | ||
|
|
||
| No. Since a Package-URL, or PURL, never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| No. Since a Package-URL, or PURL, never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3:: | |
| No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 'purl'.
faq.rst
Outdated
|
|
||
| This rule applies to all slash '/' characters between the ``scheme``'s colon separator and the ``type`` component, e.g., ':/', '://', ':///' et al. | ||
|
|
||
| In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that: | |
| In its canonical form, a purl must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 'purl'.
faq.rst
Outdated
|
|
||
| In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that: | ||
|
|
||
| - PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters. | |
| - purl parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 'purl'.
faq.rst
Outdated
| In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that: | ||
|
|
||
| - PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters. | ||
| - PURL builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - PURL builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component. | |
| - purl builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 'purl'.
faq.rst
Outdated
| - PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters. | ||
| - PURL builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component. | ||
|
|
||
| For example, although these two PURLs are strictly equivalent, the first is in canonical form, while the second -- with a '//' between 'pkg:' and the ``type`` 'gem' -- is an acceptable PURL but is an invalid URI/URL per RFC 3986:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| For example, although these two PURLs are strictly equivalent, the first is in canonical form, while the second -- with a '//' between 'pkg:' and the ``type`` 'gem' -- is an acceptable PURL but is an invalid URI/URL per RFC 3986:: | |
| For example, although these two purls are strictly equivalent, the first is in canonical form, while the second -- with a '//' between 'pkg:' and the ``type`` 'gem' -- is an acceptable purl but is an invalid URI/URL per RFC 3986:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed as suggested.
faq.rst
Outdated
|
|
||
| The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'. | ||
|
|
||
| In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with damaged PURLs, but that's not a requirement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with damaged PURLs, but that's not a requirement. | |
| In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with malformed purls, but that's not a requirement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion -- done.
|
@mprpic Thanks for your comments. While I'm happy to incorporate them, before I do I want to check with you and @pombredanne to make sure that despite Ecma TC54-TG2's 2024-10-11 approval of |
|
@johnmhoran Ah, interesting, I didn't realize the decision was to use the upper-case variant. Won't that clash with https://en.wikipedia.org/wiki/Persistent_uniform_resource_locator, which too uses upper-case PURL? Those meeting notes are missing the explanation of why PURL was chosen. |
|
Thanks for sharing that link @mprpic . I've adopted your suggestions and made a few other changes, will make corresponding changes to my 'type' PR as well. |
We do not want to update this here. See-also: #389 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking ready to merge IMHO.
|
I only reverted minor change to the encoding section, so I consider this still approved and will merge. |
The colon separator between scheme and type must not be encoded.
This is also using, a new, and not yet adopted wording for MUST according to its definition in RFC2119 as clarified in RFC8174
Suggested-by: @gernot-h
Reference: #39 (comment)
JMH: Reference: #376