-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: net/url: adhere to RFC3986 #16127
Comments
Please be more specific about precisely which functions you want to change, and how. Thanks. |
Add a new const (
encodePath encoding = 1 + iota
encodeHost
encodeZone
encodeUserPassword
encodeQueryComponent
escapeQueryComponent
encodeFragment
) Adhere properly to RFC3986 in the // Return true if the specified character should be escaped when
// appearing in a URL string, according to RFC 3986.
//
// Please be informed that for now shouldEscape does not check all
// reserved characters correctly. See golang.org/issue/5684.
func shouldEscape(c byte, mode encoding) bool {
// §2.3 Unreserved characters (alphanum)
if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
return false
}
if mode == encodeHost || mode == encodeZone {
// §3.2.2 Host allows
// sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
// as part of reg-name.
// We add : because we include :port as part of host.
// We add [ ] because we include [ipv6]:port as part of host.
// We add < > because they're the only characters left that
// we could possibly allow, and Parse will reject them if we
// escape them (because hosts can't use %-encoding for
// ASCII bytes).
switch c {
case '!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '=', ':', '[', ']', '<', '>', '"':
return false
}
}
switch c {
case '-', '_', '.', '~': // §2.3 Unreserved characters (mark)
return false
// §2.2 Reserved characters (reserved)
// reserved = gen-delims / sub-delims
// gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
// sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*"
// sub-delims (contd.) = "+" / "," / ";" / "="
case '$', '&', '!', '#', '+', ',', '\'', '/', ':', ';', '=', '?', '@',
'(', ')', '[', ']', '*':
// Different sections of the URL allow a few of
// the reserved characters to appear unescaped.
switch mode {
case encodePath: // §3.3
// The RFC allows sub-delims and : @ but saves / ; , for assigning
// meaning to individual path segments. This package
// only manipulates the path as a whole, so we allow those
// last two as well. That leaves only ? to escape.
return c == '?' || c == '#' || c == '[' || c == ']'
case encodeUserPassword: // §3.2.1
// The RFC allows ';', ':', '&', '=', '+', '$', and ',' in
// userinfo, so we must escape only '@', '/', and '?'.
// The parsing of userinfo treats ':' as special so we must escape
// that too.
return c == '@' || c == '/' || c == '?' || c == ':'
case encodeQueryComponent: // §3.4
// The RFC allows : @ and sub-delims. Escape all gen-delims except
// : and @. Additionally escape any superfluous &.
// pchar = unreserved / pct-encoded / sub-delims / ":" / "@".
// RFC 3986, Appendix A.
return c == '&' || c == '?' || c == '/' || c == '#' || c == '[' || c == ']'
case encodeFragment: // §4.1
// The RFC text is silent but the grammar allows
// everything, so escape nothing.
return false
case escapeQueryComponent:
// As a special case, escape everything by default when
// constructing URLs.
return true
}
}
// Everything else must be escaped.
return true
} Use the // QueryEscape escapes the string so it can be safely placed
// inside a URL query.
func QueryEscape(s string) string {
return escape(s, escapeQueryComponent)
} Add support for the func escape(s string, mode encoding) string {
spaceCount, hexCount := 0, 0
for i := 0; i < len(s); i++ {
c := s[i]
if shouldEscape(c, mode) {
if c == ' ' && mode == encodeQueryComponent {
spaceCount++
} else if c == ' ' && mode == escapeQueryComponent {
spaceCount++
} else {
hexCount++
}
}
}
if spaceCount == 0 && hexCount == 0 {
return s
}
t := make([]byte, len(s)+2*hexCount)
j := 0
for i := 0; i < len(s); i++ {
switch c := s[i]; {
case c == ' ' && (mode == encodeQueryComponent || mode == escapeQueryComponent):
t[j] = '+'
j++
case shouldEscape(c, mode):
t[j] = '%'
t[j+1] = "0123456789ABCDEF"[c>>4]
t[j+2] = "0123456789ABCDEF"[c&15]
j += 3
default:
t[j] = s[i]
j++
}
}
return string(t)
} Remove the work-around for RFC3986 characters in the // validEncodedPath reports whether s is a valid encoded path.
// It must not contain any bytes that require escaping during path encoding.
func validEncodedPath(s string) bool {
for i := 0; i < len(s); i++ {
switch s[i] {
case '[', ']':
// ok - not specified in RFC 3986 but left alone by modern browsers
case '%':
// ok - percent encoded, will decode
default:
if shouldEscape(s[i], encodePath) {
return false
}
}
}
return true
} |
Thanks. I apologize if I am missing something, but the only change I see to an exported function is that you want to change Are you suggesting that any changes to the current tests? |
I did leave out // QueryUnescape does the inverse transformation of QueryEscape, converting
// %AB into the byte 0xAB and '+' into ' ' (space). It returns an error if
// any % is not followed by two hexadecimal digits.
func QueryUnescape(s string) (string, error) {
return unescape(s, escapeQueryComponent)
}
// unescape unescapes a string; the mode specifies
// which section of the URL string is being unescaped.
func unescape(s string, mode encoding) (string, error) {
// Count %, check that they're well-formed.
n := 0
hasPlus := false
for i := 0; i < len(s); {
switch s[i] {
case '%':
n++
if i+2 >= len(s) || !ishex(s[i+1]) || !ishex(s[i+2]) {
s = s[i:]
if len(s) > 3 {
s = s[:3]
}
return "", EscapeError(s)
}
// Per https://tools.ietf.org/html/rfc3986#page-21
// in the host component %-encoding can only be used
// for non-ASCII bytes.
// But https://tools.ietf.org/html/rfc6874#section-2
// introduces %25 being allowed to escape a percent sign
// in IPv6 scoped-address literals. Yay.
if mode == encodeHost && unhex(s[i+1]) < 8 && s[i:i+3] != "%25" {
return "", EscapeError(s[i : i+3])
}
if mode == encodeZone {
// RFC 6874 says basically "anything goes" for zone identifiers
// and that even non-ASCII can be redundantly escaped,
// but it seems prudent to restrict %-escaped bytes here to those
// that are valid host name bytes in their unescaped form.
// That is, you can use escaping in the zone identifier but not
// to introduce bytes you couldn't just write directly.
// But Windows puts spaces here! Yay.
v := unhex(s[i+1])<<4 | unhex(s[i+2])
if s[i:i+3] != "%25" && v != ' ' && shouldEscape(v, encodeHost) {
return "", EscapeError(s[i : i+3])
}
}
i += 3
case '+':
if mode == encodeQueryComponent || mode == escapeQueryComponent {
hasPlus = true
}
i++
default:
if (mode == encodeHost || mode == encodeZone) && s[i] < 0x80 && shouldEscape(s[i], mode) {
return "", InvalidHostError(s[i : i+1])
}
i++
}
}
if n == 0 && !hasPlus {
return s, nil
}
t := make([]byte, len(s)-2*n)
j := 0
for i := 0; i < len(s); {
switch s[i] {
case '%':
t[j] = unhex(s[i+1])<<4 | unhex(s[i+2])
j++
i += 3
case '+':
if mode == encodeQueryComponent || mode == escapeQueryComponent {
t[j] = ' '
} else {
t[j] = '+'
}
j++
i++
default:
t[j] = s[i]
j++
i++
}
}
return string(t), nil
}
type shouldEscapeTest struct {
in byte
mode encoding
escape bool
}
var shouldEscapeTests = []shouldEscapeTest{
// Unreserved characters (§2.3)
{'a', encodePath, false},
{'a', encodeUserPassword, false},
{'a', encodeQueryComponent, false},
{'a', encodeFragment, false},
{'a', escapeQueryComponent, false},
{'a', encodeHost, false},
{'z', encodePath, false},
{'A', encodePath, false},
{'Z', encodePath, false},
{'0', encodePath, false},
{'9', encodePath, false},
{'-', encodePath, false},
{'-', encodeUserPassword, false},
{'-', encodeQueryComponent, false},
{'-', encodeFragment, false},
{'-', escapeQueryComponent, false},
{'.', encodePath, false},
{'_', encodePath, false},
{'~', encodePath, false},
// Reserved characters (§2.2)
{'?', encodePath, true},
{'#', encodePath, true},
{'[', encodePath, true},
{']', encodePath, true},
{'$', encodePath, false},
{'&', encodePath, false},
{'!', encodePath, false},
{'+', encodePath, false},
{',', encodePath, false},
{'\'', encodePath, false},
{':', encodePath, false},
{';', encodePath, false},
{'=', encodePath, false},
{'@', encodePath, false},
{'(', encodePath, false},
{')', encodePath, false},
{'*', encodePath, false},
// Query component (§3.4)
{'&', encodeQueryComponent, true},
{'?', encodeQueryComponent, true},
{'/', encodeQueryComponent, true},
{'#', encodeQueryComponent, true},
{'[', encodeQueryComponent, true},
{']', encodeQueryComponent, true},
{'$', encodeQueryComponent, false},
{'!', encodeQueryComponent, false},
{'+', encodeQueryComponent, false},
{',', encodeQueryComponent, false},
{'\'', encodeQueryComponent, false},
{':', encodeQueryComponent, false},
{';', encodeQueryComponent, false},
{'=', encodeQueryComponent, false},
{'@', encodeQueryComponent, false},
{'(', encodeQueryComponent, false},
{')', encodeQueryComponent, false},
{'*', encodeQueryComponent, false},
// User information (§3.2.1)
{':', encodeUserPassword, true},
{'/', encodeUserPassword, true},
{'?', encodeUserPassword, true},
{'@', encodeUserPassword, true},
{'$', encodeUserPassword, false},
{'&', encodeUserPassword, false},
{'+', encodeUserPassword, false},
{',', encodeUserPassword, false},
{';', encodeUserPassword, false},
{'=', encodeUserPassword, false},
// Host (IP address, IPv6 address, registered name, port suffix; §3.2.2)
{'!', encodeHost, false},
{'$', encodeHost, false},
{'&', encodeHost, false},
{'\'', encodeHost, false},
{'(', encodeHost, false},
{')', encodeHost, false},
{'*', encodeHost, false},
{'+', encodeHost, false},
{',', encodeHost, false},
{';', encodeHost, false},
{'=', encodeHost, false},
{':', encodeHost, false},
{'[', encodeHost, false},
{']', encodeHost, false},
{'0', encodeHost, false},
{'9', encodeHost, false},
{'A', encodeHost, false},
{'z', encodeHost, false},
{'_', encodeHost, false},
{'-', encodeHost, false},
{'.', encodeHost, false},
}
func TestShouldEscape(t *testing.T) {
for _, tt := range shouldEscapeTests {
if shouldEscape(tt.in, tt.mode) != tt.escape {
t.Errorf("shouldEscape(%q, %v) returned %v; expected %v", tt.in, tt.mode, !tt.escape, tt.escape)
}
}
} |
Maybe it's best if you just send a change (in Gerrit) so we can see differences and comment there, since large code dumps in comments aren't easy to read. See https://golang.org/doc/contribute.html But be warned that changes to |
Also, note that I removed the RFC mention in 83676d6, specifically because we've accepted that we can't fully implement the RFC at this point for compatibility reasons. |
Closing due to timeout waiting for reply. |
As detailed in #15891 ,
net/url
does not properly follow RFC3986 when encoding URL strings. Certain characters, such as(
and)
are URL-encoded when they are not required to be by the RFC.I propose updating the logic in
net/url
to allow for valid URL characters to remain un-encoded, possibly adding a newcase
statement for differentiating between escaping URL strings and encoding URL components.The text was updated successfully, but these errors were encountered: