-
Notifications
You must be signed in to change notification settings - Fork 221
Remove section on Character Encoding #389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mprpic all these could be moved to the FAQ maybe? We are losing details otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we're actually losing anything. Most of the information is present for each of the purl components in the sections above. For example, the
typesection notes:while we have the same information below:
The guidance on how to encode certain character is, in my opinion, just confusing since those specific characters when used as field separators should not be encoded. When used as values, users should just refer to the percent-encoding rules linked in the wiki page or the RFC. Also, I don't think anyone out there will be manually encoding these values anyway, most people rely on things like Python's
urllib.parse.If you think some of this information is still important to retain, can you point out which exact parts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. The RFC does not specify when which characters are supposed to be encoded in a PURL, and the rules are different for PURL than for URL (eg
@).urllib.parsecan decode PURLs correctly, but it cannot encode a canonical PURL. packageurl-python uses urllib.parse and a string replace, which is right most of the time. It fails if the package name contains a slash, as proposed in thepkg:goPR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mprpic You make good points, but I have a slightly different take on the situation and our goals.
Re your
typeexample, you can see what I've added to our new faq.rst fortypein PR Update 'type' component in core spec and add faq.rst #378 #383. I'm currently updating that PR and will also review the test-suite-data.json tests to update/add type-related tests as needed (which we want to do as part of all of our core spec updating efforts).If by "wiki page" you mean https://github.com/package-url/purl-spec/wiki/FAQ, that hasn't been updated since late 2017. The whole point of the faq.rst we've begun working on is to provide a place to flesh out encoding and other relevant details, including addressing the questions raised and differing views expressed by users in the numerous open issues and PRs. We can add sections in the faq.rst as part of each of the updating PRs. (Note that the component-by-component approach we've started with is good but can't cover everything we need to address -- see the encoding gSheet for some examples.)
Simply referring a user to that wiki or the RFC (I assume you mean RFC 3986) is not ideal -- after all, one reason for the issue/PR backlog is the number of encoding questions posed by members of the community. Providing details that address actual and reasonably anticipated questions in faq.rst would be far more helpful, particularly since even RFC 3986 is somewhat ambiguous and open to conflicting interpretations. (Section 2.2, for example, is helpful re encoding but remains somewhat ambiguous, e.g., check the conflicting use of "must" and "should":
urllib.parsequote(), I just ran a group of 32 reserved and other non-alphanumeric characters throughquote()and all but-,.,/,_and~returned percent-encoded values. This suggests that unless a character is a member of the unreserved group or defined as a purl separator (or "delimiter" to use RFC 3986's terminology), that character must be percent-encoded. And clarity would benefit us all. ;-)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the attached
ascii-encoding-examples-2025-02-13.txtfor an example of the output.ascii-encoding-examples-2025-02-13.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://datatracker.ietf.org/doc/html/rfc3986#section-2.1 spells out how to encode every character, no? And
%40is the standard way to encode the@sign, so why do we need it spelled out in the purl spec?There is a single replacement that I can see here:
https://github.com/package-url/packageurl-python/blob/main/src/packageurl/__init__.py#L64
and it has been a point of confusion for quite a while:
package-url/packageurl-python#152 (comment)
So why not just let standard tools handle this instead of the magic that is currently in place? :)
@johnmhoran By wiki page I meant https://en.wikipedia.org/wiki/Percent-encoding, but I see I left that out of the paragraph I added at the top of the Rules section. I added it now. I'm also happy to move some of the remove content into an FAQ, I just don't think it's worth repeating information that is made clear in other sources. So if there is something in particular that you think should be further discussed in the FAQ, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mprpic .
Re the FAQs, since we're doing the component updating component-by-component atm, it makes most sense for whoever handles a particular component PR add to the FAQs as that person sees fit. No need for you to do anything on that score imho since you're just removing the "Character encoding" section in this PR.
A purl user should be able to easily determine, for example, when/where in a purl string a
{or:is permitted and when/where it MUST be percent-encoded. Neither the core spec nor RFC 3986 currently enable that. We can revise the core spec to provide language that clearly states the intended meaning of the rules and requirements reflected in the spec. I'm not recommending the substance of the following, but I wonder whether we could use this sort of approach --At or near the top of the "Rules for each purl component" section we could add something like this:
Each of the seven component subsections would contain additional details and exceptions, clearly identify which characters are permitted (as several already do), and state that no other characters are permitted for that component.
BTW, back in 2018 @jgustie and @pombredanne discussed a similar approach to percent-encoding. See Clarifications, in particular about version encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RFC spells out how to encode any character, including characters that are not to be encoded, and the rules for when characters are to be encoded are different for PURL. For example, because URI has no concept of a package version, it specifies that
@signs in paths do not require escaping.There is no one standard tool for this.
Python's urllib.parse defaults to not escaping
/, which must be escaped in many cases.The set of characters encoded varies by implementations, and sometimes a single implementation provides multiple methods with different behaviors.
Python's urllib.parse.quote escapes [0, 0x2c], [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f.
Dotnet's Uri.EscapeDataString escapes [0, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f.
Dotnet's Uri.EscapeUriString escapes [0, 0x20], 0x22, 0x25, 0x3c, 0x3e, 0x5c, 0x5e, 0x60, [0x7b, 0x7d], 0x7f.
Dotnet's HttpUtility.UrlEncode escapes [0, 0x1f], 0x20 as +, [0x22, 0x27], [0x2b, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7f].
Ruby's CGI::escape escapes [0, 0x1f], 0x20 as +, [0x21, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f.
ECMAScript's escape escapes [0, 0x29], 0x2c, [0x3a, 0x3f], [0x5b, 0x5e], 0x60, [0x7b, 0x7f]
ECMAScript's encodeURI escapes [0, 0x20], 0x22, 0x25, 0x3c, 0x3e, [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f
ECMAScript's encodeURIComponent escapes [0, 0x20], [0x22, 0x26], [0x2b, 0x2c], 0x2f, [0x3a, 0x40], [0x5b, 0x5e], 0x60, [0x7b, 0x7d], 0x7f
The confusion is because
purlcomponents otherwise" (it's not if you're building a canonical PURL).Most readers don't realize that the set of encoded characters even when building a URI is different for each component of the URI (the "reserved" characters in RFC3986 language are only sometimes encoded).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "it specifies that
@signs in paths do not require escaping." mean here, specifically "paths do not require escaping"? I assume that the at sign should not be encoded when used to separate the name and version component, but it should be encoded when used as part of the version or any qualifier value for example. That is in line with the guidance of percent-encode the values of each field, but ignore encoding of separators as noted in the "how to build/parse purl" sections.Do you not feel we should make this a bit more easy to accomplish rather than requiring some obscure rules for encoding this, but not that? If I as a user have to read through the nitty-gritty parts of the spec and dig into the source code of the purl library of my choice to figure out if I need to encode or not encode a certain value using some special rules, I feel like I'm not set up for success. My intention behind simplifying this was to remove complexity, but I see there are a lot of opinions for keeping it as is.
How can we solve the problem of character encoding for purl once and for all? If you have time to join #377 (comment) tomorrow, we could try to find a better solution perhaps than my change here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@is 0x40, which is encoded by 5/8 of the above example implementations of URL encoding. That means if you build a PURL using one of the other three implementations you need a special case for encoding@. For ECMAScript, if you useencodeURIComponentit encodes@, but it also encodes/, which is correct for names but incorrect for namespaces.encodeURIis better for the namespace because it doesn't encode/, but it doesn't encode@.:is 0x3a, which is encoded by 6/8 when it should never be encoded in a canonical PURL with a valid package type.Removing the encoding section certainly does not make it easier to implement encoding. This makes the PURL spec underspecified on which characters are to be encoded when, making it impossible to implement PURL canonicalization. PURL needs more clarity around character encoding rules for canonicalization, not less.
The encoding section needs to be rewritten, and possibly the information in the encoding section needs to be moved to the "how to build" section instead. I think that the WHATWG URL spec does a much better job of defining character encoding than RFC3986, which does a much better job than PURL. RFC3986 defines a set of characters that are not escaped, a set of characters that may be escaped, and then defines the set of characters that are escaped for a given URI component by specifying the characters that are not escaped. WHATWG URL defines a constructive hierarchy of named sets of characters to be encoded and references those sets when describing how to build the URL. If the PURL spec said something like "The qualifier percent-encode set is the C0 control percent-encode set and U+0020 SPACE, U+0022 ("), U+0023 (#), U+0026(&), U+003C (<), U+003E (>), U+003F (?)" (based on WHATWG URL's query percent-encode set) and "Percent encode the qualifier value using the qualifier percent-encode set", or even just listed the characters without naming the final set, there wouldn't be any confusion about whether colons are supposed to be encoded in qualifier values.