From a97fff5e45f9635d8f13cafd88037b8afb3df34c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Martin=20Prpi=C4=8D?=
Date: Wed, 5 Feb 2025 14:08:23 -0500
Subject: [PATCH] Remove section on Character Encoding

This section contained duplicate information that is already present for
each component in the "Rules" section and included confusing sentences
that applied too generally to certain purl components. Each component's
set of rules should clearly define how the content of that component
should (or should not) be encoded. These improvements are being made in
PRs such as #383 (more will be coming for other component sections).
---
 PURL-SPECIFICATION.rst | 47 +++++++-----------------------------------
 1 file changed, 7 insertions(+), 40 deletions(-)

diff --git a/PURL-SPECIFICATION.rst b/PURL-SPECIFICATION.rst
index ef28d4cb..837b592d 100644
--- a/PURL-SPECIFICATION.rst
+++ b/PURL-SPECIFICATION.rst
@@ -112,11 +112,14 @@ A ``purl`` is a URL
 Rules for each ``purl`` component
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-A ``purl`` string is an ASCII URL string composed of seven components.
+A ``purl`` string is an ASCII URL string composed of seven components. Non-ASCII
+characters MUST be UTF-8-encoded.
 
-Some components are allowed to use other characters beyond ASCII: these
-components must then be UTF-8-encoded strings and percent-encoded as defined in
-the "Character encoding" section.
+A ``purl``follows the percent-encoding rules defined in RFC 3986. When percent
+encoding is required for ambiguous or special characters in a ``purl`` component,
+implementers should refer to
+[RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)
+for the proper encoding methods.
 
 The rules for each component are:
 
@@ -231,42 +234,6 @@ The rules for each component are:
 - The ``subpath`` must be interpreted as relative to the root of the package
 
 
-Character encoding
-~~~~~~~~~~~~~~~~~~
-
-For clarity and simplicity a ``purl`` is always an ASCII string. To ensure that
-there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII
-characters must be UTF-encoded and then percent-encoded as defined at::
-
-    https://en.wikipedia.org/wiki/Percent-encoding
-
-Use these rules for percent-encoding and decoding ``purl`` components:
-
-- the ``type`` must NOT be encoded and must NOT contain separators
-
-- the '#', '?', '@' and ':' characters must NOT be encoded when used as
-  separators. They may need to be encoded elsewhere
-
-- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
-  It is unambiguous unencoded everywhere
-
-- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator
-  does not need to and must NOT be percent-encoded. It is unambiguous unencoded
-  everywhere
-
-- the '@' ``version`` separator must be encoded as ``%40`` elsewhere
-- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere
-- the '=' ``qualifiers`` key/value separator must NOT be encoded
-- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere
-
-- All non-ASCII characters must be encoded as UTF-8 and then percent-encoded
-
-It is OK to percent-encode ``purl`` components otherwise except for the ``type``.
-Parsers and builders must always percent-decode and percent-encode ``purl``
-components and component segments as explained in the "How to parse" and "How to
-build" sections.
-
-
 How to build ``purl`` string from its components
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
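
The percent-encoding that the patch text delegates to RFC 3986 can be
illustrated with a small Python sketch. It is not part of this patch or of the
purl specification: ``encode_component`` and ``decode_component`` are
hypothetical helper names, and encoding every character outside the RFC 3986
unreserved set is an assumption made for illustration only. The per-component
rules in the specification decide which characters actually need encoding.

```python
from urllib.parse import quote, unquote


def encode_component(value: str) -> str:
    """Hypothetical helper: UTF-8 encode, then percent-encode.

    Assumption: every character outside the RFC 3986 unreserved set
    (letters, digits, '-', '.', '_', '~') is percent-encoded.
    """
    return quote(value, safe="", encoding="utf-8")


def decode_component(value: str) -> str:
    """Hypothetical helper: reverse of encode_component."""
    return unquote(value, encoding="utf-8")


if __name__ == "__main__":
    # Separator characters and one non-ASCII string, round-tripped.
    for raw in ["@", "?", "#", "caf\u00e9"]:
        encoded = encode_component(raw)
        print(repr(raw), "->", encoded)
        assert decode_component(encoded) == raw
```

For '@', '?' and '#' this sketch produces the same ``%40``, ``%3F`` and ``%23``
values that the removed section listed.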