From 1c1fb315de714a2a1c9e05ce9ae4e37eb4e1d816 Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Tue, 25 Feb 2025 18:52:25 -0800
Subject: [PATCH 1/7] Update 'qualifiers' rule in core spec #382

Reference: https://github.com/package-url/purl-spec/issues/382
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 127 ++++++++++++++++++++++++++++++++++-------
 faq.rst                |   7 ++-
 2 files changed, 110 insertions(+), 24 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index 32aa48f8..51b4c5f5 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -1,5 +1,5 @@
-Package URL specification v1.0.X
-================================
+Package URL specification v1.0.X (from qualifiers PR)
+=====================================================

 The Package URL core specification defines a versioned and formalized format,
 syntax, and rules used to represent and validate ``purl``.
@@ -136,6 +136,8 @@ The rules for each component are:
   - The ``type`` MUST start with an ASCII letter.
   - The ``type`` MUST NOT be percent-encoded.
   - The ``type`` is case insensitive. The canonical form is lowercase.
+  - ``purl`` parsers MUST return an error when the ``type`` contains a prohibited
+    character.

 - **namespace**:

@@ -176,25 +178,106 @@ The rules for each component are:

 - **qualifiers**:

-  - The ``qualifiers`` string is prefixed by a '?' separator when not empty
-  - This '?' is not part of the ``qualifiers``
-  - This is a query string composed of zero or more ``key=value`` pairs each
-    separated by a '&' ampersand. A ``key`` and ``value`` are separated by the equal
-    '=' character
-  - These '&' are not part of the ``key=value`` pairs.
-  - ``key`` must be unique within the keys of the ``qualifiers`` string
-  - ``value`` cannot be an empty string: a ``key=value`` pair with an empty ``value``
-    is the same as no key/value at all for this key
-  - For each pair of ``key`` = ``value``:
-
-    - The ``key`` must be composed only of ASCII letters and numbers, '.', '-' and
-      '_' (period, dash and underscore)
-    - A ``key`` cannot start with a number
-    - A ``key`` must NOT be percent-encoded
-    - A ``key`` is case insensitive. The canonical form is lowercase
-    - A ``key`` cannot contain spaces
-    - A ``value`` must be a percent-encoded string
-    - The '=' separator is neither part of the ``key`` nor of the ``value``
+  - The ``qualifiers`` component MUST be prefixed by a '?' separator when not empty.
+  - The '?' separator is not part of the ``qualifiers`` component.
+  - The ``qualifiers`` component is a query string composed of one or more ``key=value``
+    pairs, each of which is separated by an ampersand '&'. A ``key`` and ``value``
+    are separated by the equal '=' character.
+  - The '&' separator is not part of the ``key`` or the ``value``.
+  - The '=' separator is not part of the ``key`` or the ``value``.
+  - The ``key`` MUST be unique among the keys of the ``qualifiers`` string.
+  - The ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``
+    is the same as no ``key=value`` pair at all for this ``key``.
+
+  - For each ``key=value`` pair:
+
+    - ``key``
+
+      - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
+        '_' (period, dash and underscore).
+      - A ``key`` MUST start with an ASCII letter.
+      - A ``key`` MUST NOT be percent-encoded.
+      - A ``key`` is case insensitive. The canonical form is lowercase.
+
+    - ``value``
+
+      - The ``value`` MUST be composed only of the following characters, encoded
+        as described below and in keeping with RFC 3986 Section 2.2:
+
+          "If data for a URI component would conflict with a reserved character's
+          purpose as a delimiter, then the conflicting data must be percent-encoded
+          before the URI is formed."
+          https://datatracker.ietf.org/doc/html/rfc3986#section-2.2
+
+        1. **All US-ASCII characters defined as "unreserved"** in RFC 3986 Section 2.3
+           (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3):
+
+           .. code-block:: none
+
+               'A'-'Z', 'a'-'z', '0'-'9', '-', '.', '_', '~'
+
+           - These 66 characters do not need to be percent-encoded.
+
+        2. **All US-ASCII characters defined as "sub-delims"**, a subset of
+           the "reserved" characters, in RFC 3986 Section 2.2
+           (https://datatracker.ietf.org/doc/html/rfc3986#section-2.2):
+
+           .. code-block:: none
+
+               '!', '$', '&', ''', '(', ')', '*', '+', ',', ';', '='
+
+           - The '&' MUST be percent-encoded to avoid being incorrectly parsed
+             as a separator between multiple key-value pairs. See "How to parse
+             a purl string in its components" ("Split the qualifiers on '&'.
+             Each part is a key=value pair").
+
+           - The other 10 characters do not need to be percent-encoded, including
+             the '=' -- the parser will not mistakenly treat a '=' in the value
+             as a separator because it splits each key-value pair just once,
+             from left-to-right, on the first '=' it encounters, and thus there
+             is no conflict:
+
+             .. code-block:: none
+
+                 - For each pair, split the key=value once from left on '=':
+                   - The key is the lowercase left side
+                   - The value is the percent-decoded right side
+
+        3. **Four additional US-ASCII characters** identified in the "query"
+           definition in RFC 3986 Section 3.4 (https://datatracker.ietf.org/doc/html/rfc3986#section-3.4)
+           and the "pchar" definition in RFC 3986 Appendix A (https://datatracker.ietf.org/doc/html/rfc3986#appendix-A):
+
+           .. code-block:: none
+
+               ':', '@', '/', '?'
+
+           - The '?' MUST be percent-encoded to avoid being incorrectly parsed
+             as a ``qualifiers`` separator -- in the right-to-left parsing
+             (see "How to parse a purl string in its components"), an unencoded
+             '?' in the ``value`` would be the first '?' encountered by the
+             parser and incorrectly treated as a ``qualifiers`` separator.
+
+           - The other three characters do not need to be percent-encoded.
+
+        4. **All other US-ASCII characters**.
+
+           .. code-block:: none
+
+               - 33 control characters (ASCII codes 0-31 and 127)
+
+               - the 14 US-ASCII characters not covered in the preceding groups of US-ASCII characters:
+
+                   ' ' [space], '"', '#', '%', '<', '>', '[', '\', ']', '^', '`', '{', '|', '}'
+
+           - Each of these 47 US-ASCII characters MUST be percent-encoded.
+
+        5. **Any character that is not a US-ASCII character**
+           (i.e., characters with ASCII code > 127 and non-ASCII characters).
+
+           - All such characters MUST be UTF-8 encoded and then percent-encoded.
+
+  - ``purl`` parsers MUST return an error when the ``key`` or ``value`` contains
+    a prohibited character.

 - **subpath**:

@@ -206,9 +289,11 @@ The rules for each component are:
     in the canonical form
   - Each ``subpath`` segment MUST be a percent-encoded string
   - When percent-decoded, a segment:
+
     - MUST NOT contain a '/'
     - MUST NOT be any of '..' or '.'
     - MUST NOT be empty
+
   - The ``subpath`` MUST be interpreted as relative to the root of the package


diff --git a/faq.rst b/faq.rst
index 41307f76..50a272e5 100644
--- a/faq.rst
+++ b/faq.rst
@@ -6,7 +6,7 @@ Scheme

 **QUESTION**: Can the ``scheme`` component be followed by a colon and two slashes, like a URI?

-No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::
+**ANSWER**: No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::

     If a URI does not contain an authority component, then the path cannot begin
     with two slash characters ("//").
@@ -24,9 +24,10 @@ For example, although these two purls are strictly equivalent, the first is in c
     pkg://gem/ruby-advisory-db-check@0.12.4


+
 **QUESTION**: Is the colon between ``scheme`` and ``type`` encoded? Can it be encoded? If yes, how?

-The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
+**ANSWER**: The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.

 In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with malformed purls, but that's not a requirement.

@@ -37,7 +38,7 @@ Type

 **QUESTION**: What behavior is expected from a purl spec implementation if a ``type`` contains a character like a slash '/' or a colon ':'?

-The "Rules for each purl component" section provides that
+**ANSWER**: The "Rules for each purl component" section provides that

     [t]he package ``type`` MUST be composed only of ASCII letters and numbers,
     '.', '+' and '-' (period, plus, and dash)

From 17685b2d50adbdbd1c113b628201a7247c27280c Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Sun, 2 Mar 2025 17:02:48 -0800
Subject: [PATCH 2/7] Simplify 'qualifiers', revise 'Character encoding' #382

Reference: https://github.com/package-url/purl-spec/issues/382
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 163 +++++++++++++----------------------------
 1 file changed, 51 insertions(+), 112 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index 51b4c5f5..30baaa89 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -1,5 +1,5 @@
-Package URL specification v1.0.X (from qualifiers PR)
-=====================================================
+Package URL specification v1.0.X
+================================

 The Package URL core specification defines a versioned and formalized format,
 syntax, and rules used to represent and validate ``purl``.
@@ -136,8 +136,6 @@ The rules for each component are:
   - The ``type`` MUST start with an ASCII letter.
   - The ``type`` MUST NOT be percent-encoded.
   - The ``type`` is case insensitive. The canonical form is lowercase.
-  - ``purl`` parsers MUST return an error when the ``type`` contains a prohibited
-    character.

 - **namespace**:

@@ -181,100 +179,26 @@ The rules for each component are:
   - The ``qualifiers`` component MUST be prefixed by a '?' separator when not empty.
   - The '?' separator is not part of the ``qualifiers`` component.
   - The ``qualifiers`` component is a query string composed of one or more ``key=value``
-    pairs, each of which is separated by an ampersand '&'. A ``key`` and ``value``
-    are separated by the equal '=' character.
-  - The '&' separator is not part of the ``key`` or the ``value``.
-  - The '=' separator is not part of the ``key`` or the ``value``.
-  - The ``key`` MUST be unique among the keys of the ``qualifiers`` string.
-  - The ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``
+    pairs. Multiple ``key=value`` pairs MUST be separated by an ampersand '&'.
+    A ``key`` and ``value`` MUST be separated by the equal '=' character.
+  - Neither the '&' nor the '=' separator is part of the ``key`` or the ``value``.
+  - Each ``key`` MUST be unique among the keys of the ``qualifiers`` string.
+  - A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``
     is the same as no ``key=value`` pair at all for this ``key``.

   - For each ``key=value`` pair:

-    - ``key``
+    - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
+      '_' (period, dash and underscore).
+    - A ``key`` MUST start with an ASCII letter.
+    - A ``key`` MUST NOT be percent-encoded.
+    - A ``key`` is case insensitive. The canonical form is lowercase.
+    - A ``value`` MAY be composed of any ASCII or non-ASCII character, and
+      MUST be percent-encoded as described in the "Character encoding"
+      section below[.] [, except that:

-      - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
-        '_' (period, dash and underscore).
-      - A ``key`` MUST start with an ASCII letter.
-      - A ``key`` MUST NOT be percent-encoded.
-      - A ``key`` is case insensitive. The canonical form is lowercase.
-
-    - ``value``
-
-      - The ``value`` MUST be composed only of the following characters, encoded
-        as described below and in keeping with RFC 3986 Section 2.2:
-
-          "If data for a URI component would conflict with a reserved character's
-          purpose as a delimiter, then the conflicting data must be percent-encoded
-          before the URI is formed."
-          https://datatracker.ietf.org/doc/html/rfc3986#section-2.2
-
-        1. **All US-ASCII characters defined as "unreserved"** in RFC 3986 Section 2.3
-           (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3):
-
-           .. code-block:: none
-
-               'A'-'Z', 'a'-'z', '0'-'9', '-', '.', '_', '~'
-
-           - These 66 characters do not need to be percent-encoded.
-
-        2. **All US-ASCII characters defined as "sub-delims"**, a subset of
-           the "reserved" characters, in RFC 3986 Section 2.2
-           (https://datatracker.ietf.org/doc/html/rfc3986#section-2.2):
-
-           .. code-block:: none
-
-               '!', '$', '&', ''', '(', ')', '*', '+', ',', ';', '='
-
-           - The '&' MUST be percent-encoded to avoid being incorrectly parsed
-             as a separator between multiple key-value pairs. See "How to parse
-             a purl string in its components" ("Split the qualifiers on '&'.
-             Each part is a key=value pair").
-
-           - The other 10 characters do not need to be percent-encoded, including
-             the '=' -- the parser will not mistakenly treat a '=' in the value
-             as a separator because it splits each key-value pair just once,
-             from left-to-right, on the first '=' it encounters, and thus there
-             is no conflict:
-
-             .. code-block:: none
-
-                 - For each pair, split the key=value once from left on '=':
-                   - The key is the lowercase left side
-                   - The value is the percent-decoded right side
-
-        3. **Four additional US-ASCII characters** identified in the "query"
-           definition in RFC 3986 Section 3.4 (https://datatracker.ietf.org/doc/html/rfc3986#section-3.4)
-           and the "pchar" definition in RFC 3986 Appendix A (https://datatracker.ietf.org/doc/html/rfc3986#appendix-A):
-
-           .. code-block:: none
-
-               ':', '@', '/', '?'
-
-           - The '?' MUST be percent-encoded to avoid being incorrectly parsed
-             as a ``qualifiers`` separator -- in the right-to-left parsing
-             (see "How to parse a purl string in its components"), an unencoded
-             '?' in the ``value`` would be the first '?' encountered by the
-             parser and incorrectly treated as a ``qualifiers`` separator.
-
-           - The other three characters do not need to be percent-encoded.
-
-        4. **All other US-ASCII characters**.
-
-           .. code-block:: none
-
-               - 33 control characters (ASCII codes 0-31 and 127)
-
-               - the 14 US-ASCII characters not covered in the preceding groups of US-ASCII characters:
-
-                   ' ' [space], '"', '#', '%', '<', '>', '[', '\', ']', '^', '`', '{', '|', '}'
-
-           - Each of these 47 US-ASCII characters MUST be percent-encoded.
-
-        5. **Any character that is not a US-ASCII character**
-           (i.e., characters with ASCII code > 127 and non-ASCII characters).
-
-           - All such characters MUST be UTF-8 encoded and then percent-encoded.
+      - *list exceptions to the updated "Character encoding" rules, e.g., if a
+        colon ':' does not need to, or SHOULD NOT, or MUST NOT, be percent-encoded.*]

   - ``purl`` parsers MUST return an error when the ``key`` or ``value`` contains
     a prohibited character.
@@ -300,35 +224,50 @@ The rules for each component are:
 Character encoding
 ~~~~~~~~~~~~~~~~~~

-For clarity and simplicity a ``purl`` is always an ASCII string. To ensure that
+For clarity and simplicity, a ``purl`` is always an ASCII string. To ensure that
 there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII
-characters must be UTF-encoded and then percent-encoded as defined at::
+characters MUST be UTF-encoded and then percent-encoded as defined at
+https://en.wikipedia.org/wiki/Percent-encoding and as further defined below.
+
+Use these rules for percent-encoding and decoding ``purl`` components. Except
+as otherwise provided in the "Rules for each ``purl`` component" section above:
+
+- A character used in a ``purl`` component MUST be percent-encoded unless it is:

-    https://en.wikipedia.org/wiki/Percent-encoding
+  (1) an unreserved character as defined in RFC 3986 section 2.3 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3),

-Use these rules for percent-encoding and decoding ``purl`` components:
+  (2) expressly defined in this PURL-SPECIFICATION.rst as a ``purl`` separator (and only when used as such a separator), or

-- the ``type`` must NOT be encoded and must NOT contain separators
+  (3) expressly permitted in that ``purl`` component.

-- the '#', '?', '@' and ':' characters must NOT be encoded when used as
-  separators. They may need to be encoded elsewhere
+- All non-ASCII characters MUST be encoded as UTF-8 and then percent-encoded.

-- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
-  It is unambiguous unencoded everywhere
+- The characters used as ``purl`` separators are listed below. These characters:

-- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator
-  does not need to and must NOT be percent-encoded. It is unambiguous unencoded
-  everywhere
+  - MUST NOT be percent-encoded when used as separators.
+  - MUST be percent-encoded when not used as separators unless expressly permitted
+    by a ``purl`` component.

-- the '@' ``version`` separator must be encoded as ``%40`` elsewhere
-- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere
-- the '=' ``qualifiers`` key/value separator must NOT be encoded
-- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere
+.. list-table::
+   :widths: 1 4
+   :header-rows: 1

-- All non-ASCII characters must be encoded as UTF-8 and then percent-encoded
+   * - Character
+     - Use as separator
+   * - ':'
+     - between ``scheme`` and ``type``
+   * - '@'
+     - ``version`` prefix
+   * - '?'
+     - ``qualifiers`` prefix
+   * - '#'
+     - ``subpath`` prefix
+   * - '/'
+     - ``type``/``namespace``/``name`` and ``subpath`` segments separator
+   * - '='
+     - ``qualifiers`` ``key``/``value`` separator

-It is OK to percent-encode ``purl`` components otherwise except for the ``type``.
-Parsers and builders must always percent-decode and percent-encode ``purl``
+Parsers and builders MUST always percent-decode and percent-encode ``purl``
 components and component segments as explained in the "How to parse" and "How to
 build" sections.

From 9d849b9e0b9dd7dfd1859e04ad68fef73679ea55 Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Mon, 10 Mar 2025 18:51:32 -0700
Subject: [PATCH 3/7] Adjust qualifiers and character encoding sections #382

Reference: https://github.com/package-url/purl-spec/issues/382
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index 30baaa89..2d221c3e 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -193,15 +193,8 @@ The rules for each component are:
     - A ``key`` MUST start with an ASCII letter.
     - A ``key`` MUST NOT be percent-encoded.
     - A ``key`` is case insensitive. The canonical form is lowercase.
-    - A ``value`` MAY be composed of any ASCII or non-ASCII character, and
-      MUST be percent-encoded as described in the "Character encoding"
-      section below[.] [, except that:
-
-      - *list exceptions to the updated "Character encoding" rules, e.g., if a
-        colon ':' does not need to, or SHOULD NOT, or MUST NOT, be percent-encoded.*]
-
-  - ``purl`` parsers MUST return an error when the ``key`` or ``value`` contains
-    a prohibited character.
+    - A ``value`` MAY be composed of any character. A ``value`` MUST be
+      percent-encoded as described in the "Character encoding" section.

 - **subpath**:

@@ -266,6 +259,8 @@ as otherwise provided in the "Rules for each ``purl`` component" section above:
      - ``type``/``namespace``/``name`` and ``subpath`` segments separator
    * - '='
      - ``qualifiers`` ``key``/``value`` separator
+   * - '&'
+     - ``qualifiers`` ``key=value`` separator

 Parsers and builders MUST always percent-decode and percent-encode ``purl``
 components and component segments as explained in the "How to parse" and "How to
From ccf07c5fa68b4dfc005be616629f6f6e88956c41 Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Mon, 17 Mar 2025 13:17:30 -0700
Subject: [PATCH 4/7] Update character encoding section and replace table with list #382

Reference: https://github.com/package-url/purl-spec/issues/382
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 82 +++++++++++++++++++++++++++---------------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index 2d221c3e..418b5ecb 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -184,7 +184,7 @@ The rules for each component are:
   - Neither the '&' nor the '=' separator is part of the ``key`` or the ``value``.
   - Each ``key`` MUST be unique among the keys of the ``qualifiers`` string.
   - A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``
-    is the same as no ``key=value`` pair at all for this ``key``.
+    is the same as if no ``key=value`` pair exists for this ``key``.

   - For each ``key=value`` pair:

@@ -217,50 +217,65 @@ The rules for each component are:
 Character encoding
 ~~~~~~~~~~~~~~~~~~

-For clarity and simplicity, a ``purl`` is always an ASCII string. To ensure that
-there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII
-characters MUST be UTF-encoded and then percent-encoded as defined at
+A canonical ``purl`` is always an ASCII string composed only of these characters:
+
+- ``A to Z``,
+- ``a to z``,
+- ``0 to 9`` and
+- the punctuation marks ``:/@?#%.-_~`` .
+
+To ensure that there is no ambiguity when parsing a ``purl``, separator characters
+and non-ASCII characters MUST be UTF-encoded and then percent-encoded as defined at
 https://en.wikipedia.org/wiki/Percent-encoding and as further defined below.

-Use these rules for percent-encoding and decoding ``purl`` components. Except
-as otherwise provided in the "Rules for each ``purl`` component" section above:
+----
+
+Use these rules for percent-encoding and decoding the characters that comprise
+a ``purl`` string. Except as otherwise provided in the "Rules for each
+``purl`` component" section above:

 - A character used in a ``purl`` component MUST be percent-encoded unless it is:

-  (1) an unreserved character as defined in RFC 3986 section 2.3 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3),
+  - an unreserved character as defined in RFC 3986 section 2.3 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3),

-  (2) expressly defined in this PURL-SPECIFICATION.rst as a ``purl`` separator (and only when used as such a separator), or
+  - expressly defined in this PURL-SPECIFICATION.rst as a ``purl`` separator (and only when used as such a separator), or

-  (3) expressly permitted in that ``purl`` component.
+  - expressly permitted in that ``purl`` component.

 - All non-ASCII characters MUST be encoded as UTF-8 and then percent-encoded.

 - The characters used as ``purl`` separators are listed below. These characters:

   - MUST NOT be percent-encoded when used as separators.
+
   - MUST be percent-encoded when not used as separators unless expressly permitted
     by a ``purl`` component.

-.. list-table::
-   :widths: 1 4
-   :header-rows: 1
-
-   * - Character
-     - Use as separator
-   * - ':'
-     - between ``scheme`` and ``type``
-   * - '@'
-     - ``version`` prefix
-   * - '?'
-     - ``qualifiers`` prefix
-   * - '#'
-     - ``subpath`` prefix
-   * - '/'
-     - ``type``/``namespace``/``name`` and ``subpath`` segments separator
-   * - '='
-     - ``qualifiers`` ``key``/``value`` separator
-   * - '&'
-     - ``qualifiers`` ``key=value`` separator
+  - ``purl`` separators:
+
+    ':' (colon)
+      - between ``scheme`` and ``type``
+
+    '@' (at sign)
+      - ``version`` prefix
+
+    '?' (question mark)
+      - ``qualifiers`` prefix
+
+    '#' (number sign)
+      - ``subpath`` prefix
+
+    '/' (slash)
+      - ``type``/``namespace``/``name`` separator
+      - ``subpath`` segments separator
+
+    '=' (equals)
+      - ``qualifiers`` ``key``/``value`` separator
+
+    '&' (ampersand)
+      - ``qualifiers`` ``key=value`` separator
+
+----

 Parsers and builders MUST always percent-decode and percent-encode ``purl``
 components and component segments as explained in the "How to parse" and "How to
@@ -505,3 +520,12 @@ License
 ~~~~~~~

 This document is licensed under the MIT license
+
+Definitions
+~~~~~~~~~~~
+
+[ASCII] See, e.g.,
+
+  - American National Standards Institute, "Coded Character Set -- 7-bit
+    American Standard Code for Information Interchange", ANSI X3.4, 1986.
+  - https://en.wikipedia.org/wiki/ASCII.
From 9a7089bd39b5f430323a5dfd52d3c46b7be1a2ff Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Wed, 26 Mar 2025 15:45:22 -0700
Subject: [PATCH 5/7] Updates the qualifiers and percent-encoding sections #382

Reference: https://github.com/package-url/purl-spec/issues/382
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 117 ++++++++++++++++++-----------------------
 1 file changed, 52 insertions(+), 65 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index 418b5ecb..06732bad 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -176,25 +176,30 @@ The rules for each component are:

 - **qualifiers**:

-  - The ``qualifiers`` component MUST be prefixed by a '?' separator when not empty.
-  - The '?' separator is not part of the ``qualifiers`` component.
-  - The ``qualifiers`` component is a query string composed of one or more ``key=value``
-    pairs. Multiple ``key=value`` pairs MUST be separated by an ampersand '&'.
-    A ``key`` and ``value`` MUST be separated by the equal '=' character.
-  - Neither the '&' nor the '=' separator is part of the ``key`` or the ``value``.
-  - Each ``key`` MUST be unique among the keys of the ``qualifiers`` string.
-  - A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``
-    is the same as if no ``key=value`` pair exists for this ``key``.
+  - The ``qualifiers`` component MUST be prefixed by an unencoded question
+    mark '?' separator when not empty. This '?' separator is not part of the
+    ``qualifiers`` component.
+  - The ``qualifiers`` component is a query string composed of one or more
+    ``key=value`` pairs. Multiple ``key=value`` pairs MUST be separated by an
+    unencoded ampersand '&'. This '&' separator is not part of the
+    ``qualifiers`` component.
+
+  - A ``key`` and ``value`` MUST be separated by the unencoded equal sign '='
+    character. This '=' separator is not part of the ``key`` or ``value``.
+  - A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an
+    empty ``value`` is the same as if no ``key=value`` pair exists for this
+    ``key``.

   - For each ``key=value`` pair:

-    - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
-      '_' (period, dash and underscore).
+    - The ``key`` MUST be composed only of lowercase ASCII letters and numbers,
+      period '.', dash '-' and underscore '_'.
     - A ``key`` MUST start with an ASCII letter.
     - A ``key`` MUST NOT be percent-encoded.
-    - A ``key`` is case insensitive. The canonical form is lowercase.
-    - A ``value`` MAY be composed of any character. A ``value`` MUST be
-      percent-encoded as described in the "Character encoding" section.
+    - Each ``key`` MUST be unique among all the keys of the ``qualifiers``
+      component.
+    - A ``value`` MAY be composed of any character and all characters MUST be
+      encoded as described in the "Character encoding" section.

 - **subpath**:

@@ -217,69 +222,51 @@ The rules for each component are:
 Character encoding
 ~~~~~~~~~~~~~~~~~~

-A canonical ``purl`` is always an ASCII string composed only of these characters:
+Permitted characters
+--------------------

-- ``A to Z``,
-- ``a to z``,
-- ``0 to 9`` and
-- the punctuation marks ``:/@?#%.-_~`` .
+A canonical ``purl`` is an ASCII string composed of these characters:

-To ensure that there is no ambiguity when parsing a ``purl``, separator characters
-and non-ASCII characters MUST be UTF-encoded and then percent-encoded as defined at
-https://en.wikipedia.org/wiki/Percent-encoding and as further defined below.
+- alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
+- the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
+  question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and
+- these punctuation marks ``%.-_~`` (percent sign '%', period '.', dash '-',
+  underscore '_' and tilde '~').

-----
+All other characters MUST be encoded as UTF-8 and then percent-encoded.
+In addition, each component specifies its permitted characters and
+its percent-encoding rules.

-Use these rules for percent-encoding and decoding the characters that comprise
-a ``purl`` string. Except as otherwise provided in the "Rules for each
-``purl`` component" section above:

-- A character used in a ``purl`` component MUST be percent-encoded unless it is:
+``purl`` separators
+-------------------

-  - an unreserved character as defined in RFC 3986 section 2.3 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3),
+These ``purl`` separator characters MUST NOT be percent-encoded when used as
+``purl`` separators:

-  - expressly defined in this PURL-SPECIFICATION.rst as a ``purl`` separator (and only when used as such a separator), or
+- ':' (colon) is the separator between ``scheme`` and ``type``
+- '/' (slash) is the separator between ``type``, ``namespace`` and ``name``
+- '/' (slash) is the separator between ``subpath`` segments
+- '@' (at sign) is the separator between ``name`` and ``version``
+- '?' (question mark) is the separator before ``qualifiers``
+- '=' (equals) is the separator between a ``key`` and a ``value`` of a
+  ``qualifier``
+- '&' (ampersand) is the separator between ``qualifiers`` (each being a
+  ``key=value`` pair)
+- '#' (number sign) is the separator before ``subpath``

-  - expressly permitted in that ``purl`` component.

-- All non-ASCII characters MUST be encoded as UTF-8 and then percent-encoded.
+Percent-encoding rules
+----------------------

-- The characters used as ``purl`` separators are listed below. These characters:
+When applying percent-encoding or decoding to a string, use the rules of RFC
+3986 section 2 (https://datatracker.ietf.org/doc/html/rfc3986#section-2).

-  - MUST NOT be percent-encoded when used as separators.
+Each component defines when and how to apply percent-encoding and decoding to
+its content.

-  - MUST be percent-encoded when not used as separators unless expressly permitted
-    by a ``purl`` component.
-
-  - ``purl`` separators:
-
-    ':' (colon)
-      - between ``scheme`` and ``type``
-
-    '@' (at sign)
-      - ``version`` prefix
-
-    '?' (question mark)
-      - ``qualifiers`` prefix
-
-    '#' (number sign)
-      - ``subpath`` prefix
-
-    '/' (slash)
-      - ``type``/``namespace``/``name`` separator
-      - ``subpath`` segments separator
-
-    '=' (equals)
-      - ``qualifiers`` ``key``/``value`` separator
-
-    '&' (ampersand)
-      - ``qualifiers`` ``key=value`` separator
-
-----
-
-Parsers and builders MUST always percent-decode and percent-encode ``purl``
-components and component segments as explained in the "How to parse" and "How to
-build" sections.
+When percent-encoding is required, all characters MUST be encoded except for
+the colon ':'.


 How to build ``purl`` string from its components
From 898c64b32fca6f59ed81b3ccd68998238cd3fa30 Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Thu, 27 Mar 2025 10:31:49 -0700
Subject: [PATCH 6/7] Fine-tune faq.rst #382

Reference: https://github.com/package-url/purl-spec/issues/382
Signed-off-by: John M. Horan
---
 faq.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/faq.rst b/faq.rst
index 50a272e5..d917a1ad 100644
--- a/faq.rst
+++ b/faq.rst
@@ -27,7 +27,7 @@ For example, although these two purls are strictly equivalent, the first is in c

 **QUESTION**: Is the colon between ``scheme`` and ``type`` encoded? Can it be encoded? If yes, how?

-**ANSWER**: The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
+**ANSWER**: The "Rules for each ``purl`` component" section provides that the ``scheme`` MUST be followed by an unencoded colon ':'.

 In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with malformed purls, but that's not a requirement.

@@ -38,10 +38,11 @@ Type **QUESTION**: What behavior is expected from a purl spec implementation if a ``type`` contains a character like a slash '/' or a colon ':'? -**ANSWER**: The "Rules for each purl component" section provides that +**ANSWER**: The "Rules for each purl component" section provides that the +package ``type`` - [t]he package ``type`` MUST be composed only of ASCII letters and numbers, - '.', '+' and '-' (period, plus, and dash) + MUST be composed only of ASCII letters and numbers, '.', '+' and '-' + (period, plus, and dash). As a result, a purl spec implementation must return an error when encountering a ``type`` that contains a prohibited character. From a99029077975b0a1b2954c857901c49442c05b4f Mon Sep 17 00:00:00 2001 From: "John M. Horan" Date: Fri, 28 Mar 2025 16:10:54 -0700 Subject: [PATCH 7/7] Move character-encoding work to new issue, fine-tune qualifiers #382 - Changes to "Character encoding" moved to new issue 438. - `qualifiers` section further updated and clarified. Reference: https://github.com/package-url/purl-spec/issues/382 Signed-off-by: John M. Horan --- PURL-SPECIFICATION.rst | 68 +++++++++++++++++------------------------- faq.rst | 4 +-- 2 files changed, 29 insertions(+), 43 deletions(-) diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst index 06732bad..490dac4c 100644 --- a/PURL-SPECIFICATION.rst +++ b/PURL-SPECIFICATION.rst @@ -132,7 +132,7 @@ The rules for each component are: - **type**: - The package ``type`` MUST be composed only of ASCII letters and numbers, - '.', '+' and '-' (period, plus, and dash). + period '.', plus '+', and dash '-'. - The ``type`` MUST start with an ASCII letter. - The ``type`` MUST NOT be percent-encoded. - The ``type`` is case insensitive. The canonical form is lowercase. @@ -179,10 +179,10 @@ The rules for each component are: - The ``qualifiers`` component MUST be prefixed by an unencoded question mark '?' separator when not empty. This '?' 
separator is not part of the ``qualifiers`` component. - - The ``qualifiers`` component is a query string composed of one or more - ``key=value`` pairs. Multiple ``key=value`` pairs MUST be separated by an - unencoded ampersand '&'. This '&' separator is not part of the - ``qualifiers`` component. + - The ``qualifiers`` component is composed of one or more ``key=value`` + pairs. Multiple ``key=value`` pairs MUST be separated by an + unencoded ampersand '&'. This '&' separator is not part of an + individual ``qualifier``. - A ``key`` and ``value`` MUST be separated by the unencoded equal sign '=' character. This '=' separator is not part of the ``key`` or ``value``. @@ -222,51 +222,37 @@ The rules for each component are: Character encoding ~~~~~~~~~~~~~~~~~~ -Permitted characters --------------------- - -A canonical ``purl`` is an ASCII string composed of these characters: - -- alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``, -- the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@', - question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and -- these punctuation marks ``%.-_~`` (percent sign '%', period '.', dash '-', - underscore '_' and tilde '~'). - -All other characters MUST be encoded as UTF-8 and then percent-encoded. -In addition, each component specifies its permitted characters and -its percent-encoding rules. +For clarity and simplicity a ``purl`` is always an ASCII string. 
To ensure that +there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII +characters must be UTF-encoded and then percent-encoded as defined at:: + https://en.wikipedia.org/wiki/Percent-encoding -``purl`` separators -------------------- +Use these rules for percent-encoding and decoding ``purl`` components: -These ``purl`` separator characters MUST NOT be percent-encoded when used as -``purl`` separators: +- the ``type`` must NOT be encoded and must NOT contain separators -- ':' (colon) is the separator between ``scheme`` and ``type`` -- '/' (slash) is the separator between ``type``, ``namespace`` and ``name`` -- '/' (slash) is the separator between ``subpath`` segments -- '@' (at sign) is the separator between ``name`` and ``version`` -- '?' (question mark) is the separator before ``qualifiers`` -- '=' (equals) is the separator between a ``key`` and a ``value`` of a - ``qualifier`` -- '&' (ampersand) is the separator between ``qualifiers`` (each being a - ``key=value`` pair) -- '#' (number sign) is the separator before ``subpath`` +- the '#', '?', '@' and ':' characters must NOT be encoded when used as + separators. They may need to be encoded elsewhere +- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded. + It is unambiguous unencoded everywhere -Percent-encoding rules ----------------------- +- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator + does not need to and must NOT be percent-encoded. It is unambiguous unencoded + everywhere -When applying percent-encoding or decoding to a string, use the rules of RFC -3986 section 2 (https://datatracker.ietf.org/doc/html/rfc3986#section-2). +- the '@' ``version`` separator must be encoded as ``%40`` elsewhere +- the '?' 
``qualifiers`` separator must be encoded as ``%3F`` elsewhere +- the '=' ``qualifiers`` key/value separator must NOT be encoded +- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere -Each component defines when and how to apply percent-encoding and decoding to -its content. +- All non-ASCII characters must be encoded as UTF-8 and then percent-encoded -When percent-encoding is required, all characters MUST be encoded except for -the colon ':'. +It is OK to percent-encode ``purl`` components otherwise except for the ``type``. +Parsers and builders must always percent-decode and percent-encode ``purl`` +components and component segments as explained in the "How to parse" and "How to +build" sections. How to build ``purl`` string from its components diff --git a/faq.rst b/faq.rst index d917a1ad..23989d23 100644 --- a/faq.rst +++ b/faq.rst @@ -41,8 +41,8 @@ Type **ANSWER**: The "Rules for each purl component" section provides that the package ``type`` - MUST be composed only of ASCII letters and numbers, '.', '+' and '-' - (period, plus, and dash). + MUST be composed only of ASCII letters and numbers, period '.', plus '+', + and dash '-'. As a result, a purl spec implementation must return an error when encountering a ``type`` that contains a prohibited character.
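The ``type`` rules quoted in the FAQ answer above (only ASCII letters and numbers, period '.', plus '+', and dash '-'; MUST start with an ASCII letter; MUST NOT be percent-encoded; canonical form is lowercase) can be sketched as a small validator. This is a minimal illustration, not the reference implementation: the names ``normalize_type``, ``InvalidPurlType`` and ``_TYPE_PATTERN`` are hypothetical and do not come from the spec or from any existing purl library.

```python
import re

# Hypothetical pattern: one ASCII letter, then any mix of ASCII letters,
# digits, '.', '+' and '-'. A '%' is not in the set, so a percent-encoded
# ``type`` is rejected as well.
_TYPE_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.+-]*")


class InvalidPurlType(ValueError):
    """Raised when a ``type`` contains a prohibited character."""


def normalize_type(raw_type: str) -> str:
    """Validate a purl ``type`` and return its canonical lowercase form.

    Assumption: the caller has already split the ``type`` out of the
    purl string; this helper does no parsing of the other components.
    """
    if not _TYPE_PATTERN.fullmatch(raw_type):
        raise InvalidPurlType(f"invalid purl type: {raw_type!r}")
    return raw_type.lower()
```

Under these assumptions, ``normalize_type("NPM")`` returns ``"npm"``, while a ``type`` such as ``"np/m"`` or ``"np:m"`` raises ``InvalidPurlType`` instead of being silently repaired. Whether a tool also offers a lenient recovery mode is left open by the text above.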