Skip to content

Conversation

@pombredanne
Copy link
Member

@pombredanne pombredanne commented Dec 5, 2024

The colon separator between scheme and type must not be encoded.

This is also using, a new, and not yet adopted wording for MUST according to its definition in RFC2119 as clarified in RFC8174

Suggested-by: @gernot-h
Reference: #39 (comment)

JMH: Reference: #376

The colon separator between scheme and type must not be encoded.

This is also using, a new, and not yet adopted wording for MUST according to its definition in RFC2119 as clarified in RFC8174

Suggested-by: @gernot-h
Reference: #39 (comment)
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne marked this pull request as draft December 5, 2024 18:26
@johnmhoran johnmhoran added the Ecma standard Part of the Ecma standard for PURL label Dec 6, 2024
@pombredanne pombredanne mentioned this pull request Dec 8, 2024
@gernot-h
Copy link
Contributor

Great, thx for addressing this! I think this makes very clear that special characters must not be encoded as separators.

@@ -247,8 +247,9 @@ Use these rules for percent-encoding and decoding ``purl`` components:
- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere
Copy link
Contributor

@gernot-h gernot-h Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fully address the confusion of #39, I would probably also change the last sentence in the para before:

Suggested change
separators. They may need to be encoded elsewhere
separators. Some of them need to be encoded elsewhere as specified in the rules below.

I think, this would also make clearer where they need to be encoded.

- The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded.
For example, in the PURL snippet ``pkg:npm`` the colon ':' MUST NOT be encoded,
and the PURL snippet ``pkg%3Anpm`` is invalid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gernot-h @pombredanne Consider adding at the top of the file, perhaps as a new one-line paragraph following the current first paragraph, something along the lines of the following:

This specification uses RFC 2119 (https://datatracker.ietf.org/doc/html/rfc2119), as clarified in RFC 8174 (https://datatracker.ietf.org/doc/html/rfc8174), for the interpretation of certain terms, e.g., MUST NOT.

Or perhaps a slight modification to the example provided by RFC 2119:

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119, as clarified in RFC 8174.

(Note that the core spec currently contains a great deal of language that will need to be modified to implement RFC 2119/8174.)


- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
It is unambiguous unencoded everywhere
- The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that following @JimFuller-RedHat suggestion in today's triage call #360 ... this could reworded as:

Suggested change
- The colon ':' separator between ``scheme`` and ``type`` MUST NOT be encoded.
- The colon ':' separator between ``scheme`` and ``type`` MUST be used as-is, unencoded.

@JimFuller-RedHat ?

Copy link

@JimFuller-RedHat JimFuller-RedHat Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, I was (badly) trying say that specs typically define grammar, schema and behavior (ex. processing rules) ... in this case I think how you had it is ok though it is easier to define a 'thing' precisely which has the side effect of defining all the things it cannot be ;)

From grammar pov we could define (in EBNF) as a literal colon ... any parser built with such an EBNF would bork on encoded input ... we might suggest defining pre processing rules to normalise odd input into canonical purl. or define a looser grammar allowing encoding though a separate matter altogether.

I would propose:

"Thescheme and type MUST be separated by a colon ':' "

Armed with a grammar (EBNF) and the spec one can discriminate exactly what is intended.

FWIW, wherever encoding is allowed in a pURL we can explicitly state that in the spec and adjust grammar to reflect that though I think the challenge is unpicking pURL set of encoding rules against the general set of encoding rules on URLs (because people use url stuff on pURL).

Maybe with v1 we are bound to explicitly define encoding behaviour - that is a set of processing rules combined with grammar to get what we have today. It might be that we define the grammar in separate section from processing rules in the spec would help untangle - eg. often a spec (with grammar, processing) is not very readable (and spawns 'annotated spec' or tutorials).

Sorry for not being very helpful!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JimFuller-RedHat

I would propose:

"Thescheme and type MUST be separated by a colon ':' "

clean and neat.

Yet someone would likely immediately bet a nickel that someone else will open an issue asking whether the colon can or must or must not be encoded. ;)

Here is the way out:
We can keep these negative assertions as comments in a new FAQ document. I think that we can create a leaner spec this way.

I think it makes sense for the spec to avoid specifying all the things that it does not support as you rightly suggest, because there is an infinity of these. And we can push "negative" support questions in that FAQ.

So here:

  • In the spec:

    • The scheme and type MUST be separated by a colon ':'
  • In the FAQ:

    • Q: Is the colon between scheme and type encoded? Or can it be encoded? If yes, how?
    • A: The spec has no mention of encoding there, so the colon should be used as-is, never encoded and never requiring any decoding since this is not encoded. It is a parsing error if the colon does not come directly after pkg; Lenient parsing tools are welcome to recover from this error to help process and sanitize damaged purls, but that's not a requirement, and not part of the spec.

That approach can become the standard. The spec will be leaner and crisper and a the FAQ comprehensive to address the questions that will come up for sure.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool ! fantastic suggestion - brevity and conciseness enables implementators to focus ... a good example is the json spec - https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf , which only contains normative text ;)

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran
Copy link
Member

I added a test for the : between scheme and type. More to follow tomorrow.

- This includes rewriting scheme-related language to remove the use of "not",
  now addressed in the newly-added FAQ (faq.rst).
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran
Copy link
Member

I've just pushed a commit that (1) updates the scheme-related sections of the core spec (PURL-SPECIFICATION.rst) inter alia to remove the use of "not" and to address questions that have been raised in the open issues/PRs and (2) provides added details and clarifications (including the use of "not" where helpful) in a new faq.rst.

I look forward to comments and questions, with the goal of finalizing the changes (including added modifications, tests etc. as needed) as part of the 1.0 release. This PR is intended to address only the scheme-related updates, and other updates will be addressed in a similar narrowly-focused manner.

Reference: #376
Reference: #39 (comment)
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
Reference: #376
Reference: #39 (comment)
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran johnmhoran marked this pull request as ready for review January 20, 2025 03:02
@johnmhoran
Copy link
Member

@matt-phylum @JimFuller-RedHat @pombredanne I've pushed my updates to this scheme-focused PR -- ready for your (and others') reviews and comments. scheme-focused changes made to PURL-SPECIFICATION.rst and test-suite-data.json and a new faq.rst created.

@johnmhoran johnmhoran mentioned this pull request Jan 20, 2025
@jkowalleck jkowalleck mentioned this pull request Jan 22, 2025
8 tasks
Reference: #39 (comment)
Reference: #376
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
Reference: #39 (comment)
Reference: #376
Signed-off-by: John M. Horan johnmhoran@gmail.com
Copy link
Member Author

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is some feedback

pkg:gem/ruby-advisory-db-check@0.12.4
pkg://gem/ruby-advisory-db-check@0.12.4
- The ``scheme`` is a constant with the value "pkg".
- The ``scheme`` and ``type`` MUST be separated by a colon ':'.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The ``scheme`` and ``type`` MUST be separated by a colon ':'.
- The ``scheme`` MUST be followed by a unencoded colon ':'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion -- change made.

pkg://gem/ruby-advisory-db-check@0.12.4
- The ``scheme`` is a constant with the value "pkg".
- The ``scheme`` and ``type`` MUST be separated by a colon ':'.
- ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are
- ``purl`` parsers MUST accept URLs where the ``scheme`` and colon ':' are

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change made.

- The ``scheme`` and ``type`` MUST be separated by a colon ':'.
- ``purl`` parsers MUST accept URLs in which the ``scheme`` and colon ':' are
followed by one or more slash '/' characters, such as 'pkg://', and MUST
ignore -- i.e., normalize by removing -- all such '/' characters.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ignore -- i.e., normalize by removing -- all such '/' characters.
ignore and remove all such '/' characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good clarification, change made, and I'm making a corresponding change to related language in the FAQ.

For clarity and simplicity a ``purl`` is always an ASCII string. To ensure that
there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII
characters must be UTF-encoded and then percent-encoded as defined at::
For clarity and simplicity, a ``purl`` is always an ASCII string. To ensure that
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to refactor this section separately. @mprpic volunteered to make this PR.
Let's remove all changes to this Character encoding section from this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne I've backed out all of my prior changes to the "Character encoding" section as requested. In light of @mprpic 's PR #389, I've also deleted a reference to the "Character encoding" section in the FAQ and made a few other adjustments.

Reference: #39
Reference: #376
Signed-off-by: John M. Horan johnmhoran@gmail.com
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran johnmhoran requested a review from mprpic February 6, 2025 02:54
@johnmhoran
Copy link
Member

@pombredanne @mprpic I've made the requested changes and some other needed updates, including deleting a reference in the FAQ to the "Character encoding" section in light of @mprpic 's PR #389 (not yet merged as of right now), which deletes the "Character encoding" section.

One thought -- we might want to defer review of this PR until PR #389 has been merged and I can merge the updated master into this branch and push again.

@pombredanne I tried to request a review by you, but given the unusual parentage of this PR, which you opened, GH won't allow me to request your approving review. 🙂

mprpic
mprpic previously approved these changes Feb 6, 2025
Copy link
Contributor

@mprpic mprpic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some minor nitpicks around consistency.

faq.rst Outdated
neatly into a component-focused category.

If you have a question about the PURL specification and don't find an answer
below, you can open an issue `here <https://github.com/package-url/purl-spec/issues/new?template=Blank+issue>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably ditch this paragraph since it's stating something that is hopefully obvious to anyone interacting with this repository.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mprpic I wholeheartedly agree. 🙂 Will do as soon as we clarify whether to retain (temporarily or otherwise) the current lowercase purl usage in the core spec. (See my comment further below to you and @pombredanne re Ecma TC54-TG2's approval of 'PURL'/'Package-URL'.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed para re opening an issue.

faq.rst Outdated

**QUESTION**: Can the ``scheme`` component be followed by a colon and two slashes, like a URI?

No. Since a Package-URL, or PURL, never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
No. Since a Package-URL, or PURL, never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::
No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 'purl'.

faq.rst Outdated

This rule applies to all slash '/' characters between the ``scheme``'s colon separator and the ``type`` component, e.g., ':/', '://', ':///' et al.

In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that:
In its canonical form, a purl must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 'purl'.

faq.rst Outdated

In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that:

- PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters.
- purl parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 'purl'.

faq.rst Outdated
In its canonical form, a PURL must not use any such ':/' ``scheme`` suffix and may only use ':' as a ``scheme`` suffix. This means that:

- PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters.
- PURL builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- PURL builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component.
- purl builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 'purl'.

faq.rst Outdated
- PURL parsers must accept URLs such as 'pkg://' and must ignore and remove all such '/' characters.
- PURL builders should not create invalid URLs with one or more slash '/' characters between 'pkg:' and the ``type`` component.

For example, although these two PURLs are strictly equivalent, the first is in canonical form, while the second -- with a '//' between 'pkg:' and the ``type`` 'gem' -- is an acceptable PURL but is an invalid URI/URL per RFC 3986::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For example, although these two PURLs are strictly equivalent, the first is in canonical form, while the second -- with a '//' between 'pkg:' and the ``type`` 'gem' -- is an acceptable PURL but is an invalid URI/URL per RFC 3986::
For example, although these two purls are strictly equivalent, the first is in canonical form, while the second -- with a '//' between 'pkg:' and the ``type`` 'gem' -- is an acceptable purl but is an invalid URI/URL per RFC 3986::

Copy link
Member

@johnmhoran johnmhoran Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed as suggested.

faq.rst Outdated

The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.

In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with damaged PURLs, but that's not a requirement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with damaged PURLs, but that's not a requirement.
In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with malformed purls, but that's not a requirement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion -- done.

@johnmhoran
Copy link
Member

@mprpic Thanks for your comments. While I'm happy to incorporate them, before I do I want to check with you and @pombredanne to make sure that despite Ecma TC54-TG2's 2024-10-11 approval of Package-URL and PURL as the proper capitalization, we want nonetheless to use lowercase purl in the spec. See https://github.com/Ecma-TC54/tg2/blob/main/meetings/2024-10-11.md.

- Adam, Steve, Philippe discussed capitalization (or not) of the word “purl”.
  - Adam: for non-technical writing, we propose Package-URL and PURL.
  - Steve: in favor of this. This group can lead the standardization of the name.
  - Decided: Package-URL and PURL

@mprpic
Copy link
Contributor

mprpic commented Feb 7, 2025

@johnmhoran Ah, interesting, I didn't realize the decision was to use the upper-case variant. Won't that clash with https://en.wikipedia.org/wiki/Persistent_uniform_resource_locator, which too uses upper-case PURL? Those meeting notes are missing the explanation of why PURL was chosen.

Reference: #376
Reference: #39
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
- Also updated the canonical_purl value for the new colon ':' test object
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran johnmhoran requested a review from mprpic February 9, 2025 01:56
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran
Copy link
Member

Thanks for sharing that link @mprpic . I've adopted your suggestions and made a few other changes, will make corresponding changes to my 'type' PR as well.

mprpic
mprpic previously approved these changes Feb 10, 2025
jkowalleck
jkowalleck previously approved these changes Feb 11, 2025
We do not want to update this here.

See-also: #389
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne dismissed stale reviews from jkowalleck and mprpic via ef81a58 February 12, 2025 17:20
Copy link
Member Author

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking ready to merge IMHO.

@pombredanne
Copy link
Member Author

I only reverted minor change to the encoding section, so I consider this still approved and will merge.

@pombredanne pombredanne merged commit 7317d59 into master Feb 12, 2025
@pombredanne pombredanne deleted the 39-colon-quoting branch February 12, 2025 17:30
@pombredanne pombredanne restored the 39-colon-quoting branch February 12, 2025 17:30
@pombredanne pombredanne deleted the 39-colon-quoting branch February 12, 2025 17:30
@johnmhoran johnmhoran mentioned this pull request Apr 3, 2025
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants