From 0beba16ec9824e026e4191358988353e69f72c8d Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Mon, 21 Apr 2025 21:37:50 -0700
Subject: [PATCH 1/3] Update "Character encoding" and related provisions #438

Reference: https://github.com/package-url/purl-spec/issues/438
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 47 ++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index cb29f9cc..825b413c 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -114,9 +114,11 @@ Rules for each ``purl`` component
 
 A ``purl`` string is an ASCII URL string composed of seven components.
 
-Some components are allowed to use other characters beyond ASCII: these
-components must then be UTF-8-encoded strings and percent-encoded as defined in
-the "Character encoding" section.
+Except as expressly stated otherwise in this section, each component:
+
+- MAY be composed of any of the characters defined as "Permitted Characters" in
+  the "Character encoding" section
+- MUST be encoded as defined in the "Character encoding" section
 
 The rules for each component are:
 
@@ -225,17 +227,13 @@ Character encoding
 Permitted characters
 --------------------
 
-A canonical ``purl`` is an ASCII string composed of these characters:
+A canonical ``purl`` is composed of these characters ("Permitted Characters"):
 
 - alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
 - the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
   question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and
-- these punctuation marks ``%.-_~`` (percent sign '%', period '.', dash '-',
-  underscore '_' and tilde '~').
-
-All other characters MUST be encoded as UTF-8 and then percent-encoded.
-In addition, each component specifies its permitted characters and
-its percent-encoding rules.
+- the ASCII characters ``+%.-_~`` (plus '+', percent sign '%', period '.',
+  dash '-', underscore '_' and tilde '~').
 
 
 ``purl`` separators
@@ -259,14 +257,31 @@ These ``purl`` separator characters MUST NOT be percent-encoded when used as
 Percent-encoding rules
 ----------------------
 
-When applying percent-encoding or decoding to a string, use the rules of RFC
-3986 section 2 (https://datatracker.ietf.org/doc/html/rfc3986#section-2).
+Unless otherwise provided in this specification, when applying percent-encoding
+or decoding to a string, use the rules of RFC 3986 section 2
+(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any
+conflict between this specification and RFC 3986 section 2, this specification
+governs.
+
+In the "Rules for each ``purl`` component" section above, each component
+defines when and how to apply percent-encoding and decoding to its content.
+
+When percent-encoding is required, all Permitted Characters MUST be encoded as
+UTF-8 and then percent-encoded except for the following:
+
+- the alphanumeric characters,
+
+- the ASCII characters ``.-_~`` (period '.', dash '-', underscore
+  '_' and tilde '~'),
+
+- the percent sign '%' when used to represent a percent-encoded character,
+
+- a ``purl`` separator when being used as a ``purl`` separator, and
 
-Each component defines when and how to apply percent-encoding and decoding to
-its content.
+- the colon ':', whether used as a ``purl`` separator or otherwise.
 
-When percent-encoding is required, all characters MUST be encoded except for
-the colon ':'.
+In addition, where the space ' ' is permitted, it MUST be percent-encoded as
+'%20'.
 
 
 How to build ``purl`` string from its components

From 91f07a03987cd2da26107051dde2cfa3922c41cc Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Sun, 27 Apr 2025 16:07:33 -0700
Subject: [PATCH 2/3] Clarify percent-encoding references #438

Reference: https://github.com/package-url/purl-spec/issues/438
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 47 +++++++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index 825b413c..188f8483 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -227,7 +227,7 @@ Character encoding
 Permitted characters
 --------------------
 
-A canonical ``purl`` is composed of these characters ("Permitted Characters"):
+A canonical ``purl`` is composed of these Permitted Characters:
 
 - alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
 - the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
@@ -257,31 +257,26 @@ These ``purl`` separator characters MUST NOT be percent-encoded when used as
 Percent-encoding rules
 ----------------------
 
-Unless otherwise provided in this specification, when applying percent-encoding
-or decoding to a string, use the rules of RFC 3986 section 2
-(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any
-conflict between this specification and RFC 3986 section 2, this specification
-governs.
-
-In the "Rules for each ``purl`` component" section above, each component
-defines when and how to apply percent-encoding and decoding to its content.
-
-When percent-encoding is required, all Permitted Characters MUST be encoded as
-UTF-8 and then percent-encoded except for the following:
-
-- the alphanumeric characters,
-
-- the ASCII characters ``.-_~`` (period '.', dash '-', underscore
-  '_' and tilde '~'),
-
-- the percent sign '%' when used to represent a percent-encoded character,
-
-- a ``purl`` separator when being used as a ``purl`` separator, and
-
-- the colon ':', whether used as a ``purl`` separator or otherwise.
-
-In addition, where the space ' ' is permitted, it MUST be percent-encoded as
-'%20'.
+- In the "Rules for each ``purl`` component" section above, each component
+  defines when and how to apply percent-encoding and decoding to its content,
+  including which characters to percent-encode and when percent-encoding is
+  required.
+- When percent-encoding is required by a component definition, each
+  codepoint MUST be replaced by the percent-encoded bytes of the codepoint's
+  UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986
+  section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
+- With the exception of the percent-encoding mechanism, the rules regarding
+  percent-encoding are defined by this specification alone.
+- Where the space ' ' is permitted, it MUST be percent-encoded as
+  '%20'.
+- The following characters do not need to be percent-encoded:
+
+  - the alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
+  - the ASCII characters ``.-_~`` (period '.', dash '-', underscore
+    '_' and tilde '~'),
+  - the percent sign '%' when used to represent a percent-encoded character,
+  - a ``purl`` separator when being used as a ``purl`` separator, and
+  - the colon ':', whether used as a ``purl`` separator or otherwise.
 
 
 How to build ``purl`` string from its components

From e7119e8cd579cb8463a977a775bd7f6dd35167fd Mon Sep 17 00:00:00 2001
From: "John M. Horan"
Date: Thu, 8 May 2025 18:37:10 -0700
Subject: [PATCH 3/3] Restructure and clarify character-encoding section #438

Reference: https://github.com/package-url/purl-spec/issues/438
Signed-off-by: John M. Horan
---
 PURL-SPECIFICATION.rst | 73 ++++++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 35 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index 188f8483..c823dd30 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -116,8 +116,8 @@ A ``purl`` string is an ASCII URL string composed of seven components.
 
 Except as expressly stated otherwise in this section, each component:
 
-- MAY be composed of any of the characters defined as "Permitted Characters" in
-  the "Character encoding" section
+- MAY be composed of any of the characters defined in the "Permitted
+  characters" section
 - MUST be encoded as defined in the "Character encoding" section
 
 The rules for each component are:
@@ -221,26 +221,24 @@ The rules for each component are:
   - The ``subpath`` MUST be interpreted as relative to the root of the package
 
 
-Character encoding
-~~~~~~~~~~~~~~~~~~
-
 Permitted characters
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
-A canonical ``purl`` is composed of these Permitted Characters:
+A canonical ``purl`` is composed of these permitted ASCII characters:
 
-- alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
-- the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
-  question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and
-- the ASCII characters ``+%.-_~`` (plus '+', percent sign '%', period '.',
-  dash '-', underscore '_' and tilde '~').
+- the Alphanumeric Characters: ``A to Z``, ``a to z``, ``0 to 9``,
+- the Punctuation Characters: ``.-_~`` (period '.',
+  dash '-', underscore '_' and tilde '~'),
+- the Plus Character: ``+`` (plus '+'),
+- the Percent Character: ``%`` (percent sign '%'), and
+- the Separator Characters ``:/@?=&#`` (colon ':', slash '/', at sign '@',
+  question mark '?', equal sign '=', ampersand '&' and pound sign '#').
 
 
 ``purl`` separators
--------------------
+~~~~~~~~~~~~~~~~~~~
 
-These ``purl`` separator characters MUST NOT be percent-encoded when used as
-``purl`` separators:
+This is how each of the Separator Characters is used:
 
 - ':' (colon) is the separator between ``scheme`` and ``type``
 - '/' (slash) is the separator between ``type``, ``namespace`` and ``name``
@@ -254,29 +252,34 @@ These ``purl`` separator characters MUST NOT be percent-encoded when used as
 - '#' (number sign) is the separator before ``subpath``
 
 
-Percent-encoding rules
-----------------------
+Character encoding
+~~~~~~~~~~~~~~~~~~
 
-- In the "Rules for each ``purl`` component" section above, each component
-  defines when and how to apply percent-encoding and decoding to its content,
-  including which characters to percent-encode and when percent-encoding is
-  required.
-- When percent-encoding is required by a component definition, each
-  codepoint MUST be replaced by the percent-encoded bytes of the codepoint's
-  UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986
-  section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
+- In the "Rules for each ``purl`` component" section, each component
+  defines when and how to apply percent-encoding and decoding to its content.
+- When percent-encoding is required by a component definition, the component
+  string MUST first be encoded as UTF-8.
+- In the component string, each "data octet" MUST be replaced by the
+  percent-encoded "character triplet" applying the percent-encoding mechanism
+  defined in RFC 3986 section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1),
+  including the RFC definition of "data octet" and "character triplet",
+  and using these definitions for RFC's "allowed set" and "delimiters":
+
+  - "allowed set" is composed of the Alphanumeric Characters and the
+    Punctuation Characters
+  - "delimiters" is composed of the Separator Characters
+
+- The following characters MUST NOT be percent-encoded:
+
+  - the Alphanumeric Characters,
+  - the Punctuation Characters,
+  - the Separator Characters when being used as ``purl`` separators,
+  - the colon ':', whether used as a Separator Character or otherwise, and
+  - the percent sign '%' when used to represent a percent-encoded character.
+
+- Where the space ' ' is permitted, it MUST be percent-encoded as '%20'.
 - With the exception of the percent-encoding mechanism, the rules regarding
   percent-encoding are defined by this specification alone.
-- Where the space ' ' is permitted, it MUST be percent-encoded as
-  '%20'.
-- The following characters do not need to be percent-encoded:
-
-  - the alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
-  - the ASCII characters ``.-_~`` (period '.', dash '-', underscore
-    '_' and tilde '~'),
-  - the percent sign '%' when used to represent a percent-encoded character,
-  - a ``purl`` separator when being used as a ``purl`` separator, and
-  - the colon ':', whether used as a ``purl`` separator or otherwise.
 
 
 How to build ``purl`` string from its components
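
Reviewer note: the "Character encoding" rules as they stand after PATCH 3/3 can be read as a small per-component encoding routine. The following is a minimal sketch of one such reading, not text from the spec or from any purl library. The function name `percent_encode_component` and the constant `UNENCODED` are hypothetical. The sketch assumes the input is a single, already-decoded component value, so a literal '%' is treated as data and the separators are not acting as ``purl`` separators. It also assumes '+' is encoded, since the patch lists it as permitted but not in the MUST NOT list. Uppercase hex digits follow the recommendation in RFC 3986 section 2.1.

```python
import string

# Assumption: characters left as-is are the Alphanumeric Characters, the
# Punctuation Characters ".-_~", and the colon ':' (per the MUST NOT list).
UNENCODED = frozenset(string.ascii_letters + string.digits + ".-_~" + ":")


def percent_encode_component(value: str) -> str:
    """Hypothetical helper: percent-encode one decoded purl component value."""
    out = []
    # Step 1: encode the component string as UTF-8.
    for octet in value.encode("utf-8"):
        ch = chr(octet)
        if ch in UNENCODED:
            # Allowed-set character or colon: emit unchanged.
            out.append(ch)
        else:
            # Step 2: replace the data octet with a "%HH" character triplet.
            # Octets >= 0x80 (multi-byte UTF-8) always land here.
            out.append(f"%{octet:02X}")
    return "".join(out)
```

Under these assumptions a space becomes '%20', a colon is left alone, and separator characters that appear as data inside a component (for example '/' or '@') are encoded. Whether a given component actually requires this encoding is still decided by that component's own rules.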