proposal: net/url: adhere to RFC3986 #16127

gmccue · 2016-06-20T10:43:49Z

As detailed in #15891 , net/url does not properly follow RFC3986 when encoding URL strings. Certain characters, such as ( and ) are URL-encoded when they are not required to be by the RFC.

I propose updating the logic in net/url to allow for valid URL characters to remain un-encoded, possibly adding a new case statement for differentiating between escaping URL strings and encoding URL components.

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2016-06-20T15:36:48Z

Please be more specific about precisely which functions you want to change, and how. Thanks.

gmccue · 2016-06-21T14:02:22Z

Add a new const called escapeQueryComponent which would be used to escape "everything" by default when constructing safe URL strings.

const (
    encodePath encoding = 1 + iota
    encodeHost
    encodeZone
    encodeUserPassword
    encodeQueryComponent
    escapeQueryComponent
    encodeFragment
)

Adhere properly to RFC3986 in the shouldEscape() function.

// Return true if the specified character should be escaped when
// appearing in a URL string, according to RFC 3986.
//
// Please be informed that for now shouldEscape does not check all
// reserved characters correctly. See golang.org/issue/5684.
func shouldEscape(c byte, mode encoding) bool {
    // §2.3 Unreserved characters (alphanum)
    if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
        return false
    }

    if mode == encodeHost || mode == encodeZone {
        // §3.2.2 Host allows
        // sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
        // as part of reg-name.
        // We add : because we include :port as part of host.
        // We add [ ] because we include [ipv6]:port as part of host.
        // We add < > because they're the only characters left that
        // we could possibly allow, and Parse will reject them if we
        // escape them (because hosts can't use %-encoding for
        // ASCII bytes).
        switch c {
        case '!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '=', ':', '[', ']', '<', '>', '"':
            return false
        }
    }

    switch c {
    case '-', '_', '.', '~': // §2.3 Unreserved characters (mark)
        return false

    // §2.2 Reserved characters (reserved)
    // reserved = gen-delims / sub-delims
    // gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    // sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*"
    // sub-delims (contd.) =  "+" / "," / ";" / "="
    case '$', '&', '!', '#', '+', ',', '\'', '/', ':', ';', '=', '?', '@',
        '(', ')', '[', ']', '*':
        // Different sections of the URL allow a few of
        // the reserved characters to appear unescaped.
        switch mode {
        case encodePath: // §3.3
            // The RFC allows sub-delims and : @ but saves / ; , for assigning
            // meaning to individual path segments. This package
            // only manipulates the path as a whole, so we allow those
            // last two as well. That leaves only ? to escape.
            return c == '?' || c == '#' || c == '[' || c == ']'

        case encodeUserPassword: // §3.2.1
            // The RFC allows ';', ':', '&', '=', '+', '$', and ',' in
            // userinfo, so we must escape only '@', '/', and '?'.
            // The parsing of userinfo treats ':' as special so we must escape
            // that too.
            return c == '@' || c == '/' || c == '?' || c == ':'

        case encodeQueryComponent: // §3.4
            // The RFC allows : @ and sub-delims. Escape all gen-delims except
            // : and @. Additionally escape any superfluous &.
            // pchar = unreserved / pct-encoded / sub-delims / ":" / "@".
            // RFC 3986, Appendix A.
            return c == '&' || c == '?' || c == '/' || c == '#' || c == '[' || c == ']'

        case encodeFragment: // §4.1
            // The RFC text is silent but the grammar allows
            // everything, so escape nothing.
            return false

        case escapeQueryComponent:
            // As a special case, escape everything by default when
            // constructing URLs.
            return true
        }
    }

    // Everything else must be escaped.
    return true
}

Use the escapeQueryComponent case for the QueryEscape() function.

// QueryEscape escapes the string so it can be safely placed
// inside a URL query.
func QueryEscape(s string) string {
    return escape(s, escapeQueryComponent)
}

Add support for the escapeQueryComponent case to the escape() function.

func escape(s string, mode encoding) string {
    spaceCount, hexCount := 0, 0
    for i := 0; i < len(s); i++ {
        c := s[i]
        if shouldEscape(c, mode) {
            if c == ' ' && mode == encodeQueryComponent {
                spaceCount++
            } else if c == ' ' && mode == escapeQueryComponent {
                spaceCount++
            } else {
                hexCount++
            }
        }
    }

    if spaceCount == 0 && hexCount == 0 {
        return s
    }

    t := make([]byte, len(s)+2*hexCount)
    j := 0
    for i := 0; i < len(s); i++ {
        switch c := s[i]; {
        case c == ' ' && (mode == encodeQueryComponent || mode == escapeQueryComponent):
            t[j] = '+'
            j++
        case shouldEscape(c, mode):
            t[j] = '%'
            t[j+1] = "0123456789ABCDEF"[c>>4]
            t[j+2] = "0123456789ABCDEF"[c&15]
            j += 3
        default:
            t[j] = s[i]
            j++
        }
    }
    return string(t)
}

Remove the work-around for RFC3986 characters in the validEncodedPath() function

// validEncodedPath reports whether s is a valid encoded path.
// It must not contain any bytes that require escaping during path encoding.
func validEncodedPath(s string) bool {
    for i := 0; i < len(s); i++ {
        switch s[i] {
        case '[', ']':
            // ok - not specified in RFC 3986 but left alone by modern browsers
        case '%':
            // ok - percent encoded, will decode
        default:
            if shouldEscape(s[i], encodePath) {
                return false
            }
        }
    }
    return true
}

ianlancetaylor · 2016-06-21T17:48:41Z

Thanks. I apologize if I am missing something, but the only change I see to an exported function is that you want to change QueryEscape, although actually I don't quite see how its behavior will change. And it seems that you are suggesting that QueryEscape and QueryUnescape will no longer be inverse transforms.

Are you suggesting that any changes to the current tests?

gmccue · 2016-06-24T10:13:31Z

I did leave out QueryUnescape as well, but the only exported functions that would change are QueryEscape and QueryUnescape.

// QueryUnescape does the inverse transformation of QueryEscape, converting
// %AB into the byte 0xAB and '+' into ' ' (space). It returns an error if
// any % is not followed by two hexadecimal digits.
func QueryUnescape(s string) (string, error) {
    return unescape(s, escapeQueryComponent)
}

// unescape unescapes a string; the mode specifies
// which section of the URL string is being unescaped.
func unescape(s string, mode encoding) (string, error) {
    // Count %, check that they're well-formed.
    n := 0
    hasPlus := false
    for i := 0; i < len(s); {
        switch s[i] {
        case '%':
            n++
            if i+2 >= len(s) || !ishex(s[i+1]) || !ishex(s[i+2]) {
                s = s[i:]
                if len(s) > 3 {
                    s = s[:3]
                }
                return "", EscapeError(s)
            }
            // Per https://tools.ietf.org/html/rfc3986#page-21
            // in the host component %-encoding can only be used
            // for non-ASCII bytes.
            // But https://tools.ietf.org/html/rfc6874#section-2
            // introduces %25 being allowed to escape a percent sign
            // in IPv6 scoped-address literals. Yay.
            if mode == encodeHost && unhex(s[i+1]) < 8 && s[i:i+3] != "%25" {
                return "", EscapeError(s[i : i+3])
            }
            if mode == encodeZone {
                // RFC 6874 says basically "anything goes" for zone identifiers
                // and that even non-ASCII can be redundantly escaped,
                // but it seems prudent to restrict %-escaped bytes here to those
                // that are valid host name bytes in their unescaped form.
                // That is, you can use escaping in the zone identifier but not
                // to introduce bytes you couldn't just write directly.
                // But Windows puts spaces here! Yay.
                v := unhex(s[i+1])<<4 | unhex(s[i+2])
                if s[i:i+3] != "%25" && v != ' ' && shouldEscape(v, encodeHost) {
                    return "", EscapeError(s[i : i+3])
                }
            }
            i += 3
        case '+':
            if mode == encodeQueryComponent || mode == escapeQueryComponent {
                hasPlus = true
            }

            i++
        default:
            if (mode == encodeHost || mode == encodeZone) && s[i] < 0x80 && shouldEscape(s[i], mode) {
                return "", InvalidHostError(s[i : i+1])
            }
            i++
        }
    }

    if n == 0 && !hasPlus {
        return s, nil
    }

    t := make([]byte, len(s)-2*n)
    j := 0
    for i := 0; i < len(s); {
        switch s[i] {
        case '%':
            t[j] = unhex(s[i+1])<<4 | unhex(s[i+2])
            j++
            i += 3
        case '+':
            if mode == encodeQueryComponent || mode == escapeQueryComponent {
                t[j] = ' '
            } else {
                t[j] = '+'
            }
            j++
            i++
        default:
            t[j] = s[i]
            j++
            i++
        }
    }
    return string(t), nil
}

net/url_test.go would additionally change.

type shouldEscapeTest struct {
    in     byte
    mode   encoding
    escape bool
}

var shouldEscapeTests = []shouldEscapeTest{
    // Unreserved characters (§2.3)
    {'a', encodePath, false},
    {'a', encodeUserPassword, false},
    {'a', encodeQueryComponent, false},
    {'a', encodeFragment, false},
    {'a', escapeQueryComponent, false},
    {'a', encodeHost, false},
    {'z', encodePath, false},
    {'A', encodePath, false},
    {'Z', encodePath, false},
    {'0', encodePath, false},
    {'9', encodePath, false},
    {'-', encodePath, false},
    {'-', encodeUserPassword, false},
    {'-', encodeQueryComponent, false},
    {'-', encodeFragment, false},
    {'-', escapeQueryComponent, false},
    {'.', encodePath, false},
    {'_', encodePath, false},
    {'~', encodePath, false},

    // Reserved characters (§2.2)
    {'?', encodePath, true},
    {'#', encodePath, true},
    {'[', encodePath, true},
    {']', encodePath, true},
    {'$', encodePath, false},
    {'&', encodePath, false},
    {'!', encodePath, false},
    {'+', encodePath, false},
    {',', encodePath, false},
    {'\'', encodePath, false},
    {':', encodePath, false},
    {';', encodePath, false},
    {'=', encodePath, false},
    {'@', encodePath, false},
    {'(', encodePath, false},
    {')', encodePath, false},
    {'*', encodePath, false},

    // Query component (§3.4)
    {'&', encodeQueryComponent, true},
    {'?', encodeQueryComponent, true},
    {'/', encodeQueryComponent, true},
    {'#', encodeQueryComponent, true},
    {'[', encodeQueryComponent, true},
    {']', encodeQueryComponent, true},
    {'$', encodeQueryComponent, false},
    {'!', encodeQueryComponent, false},
    {'+', encodeQueryComponent, false},
    {',', encodeQueryComponent, false},
    {'\'', encodeQueryComponent, false},
    {':', encodeQueryComponent, false},
    {';', encodeQueryComponent, false},
    {'=', encodeQueryComponent, false},
    {'@', encodeQueryComponent, false},
    {'(', encodeQueryComponent, false},
    {')', encodeQueryComponent, false},
    {'*', encodeQueryComponent, false},

    // User information (§3.2.1)
    {':', encodeUserPassword, true},
    {'/', encodeUserPassword, true},
    {'?', encodeUserPassword, true},
    {'@', encodeUserPassword, true},
    {'$', encodeUserPassword, false},
    {'&', encodeUserPassword, false},
    {'+', encodeUserPassword, false},
    {',', encodeUserPassword, false},
    {';', encodeUserPassword, false},
    {'=', encodeUserPassword, false},

    // Host (IP address, IPv6 address, registered name, port suffix; §3.2.2)
    {'!', encodeHost, false},
    {'$', encodeHost, false},
    {'&', encodeHost, false},
    {'\'', encodeHost, false},
    {'(', encodeHost, false},
    {')', encodeHost, false},
    {'*', encodeHost, false},
    {'+', encodeHost, false},
    {',', encodeHost, false},
    {';', encodeHost, false},
    {'=', encodeHost, false},
    {':', encodeHost, false},
    {'[', encodeHost, false},
    {']', encodeHost, false},
    {'0', encodeHost, false},
    {'9', encodeHost, false},
    {'A', encodeHost, false},
    {'z', encodeHost, false},
    {'_', encodeHost, false},
    {'-', encodeHost, false},
    {'.', encodeHost, false},
}

func TestShouldEscape(t *testing.T) {
    for _, tt := range shouldEscapeTests {
        if shouldEscape(tt.in, tt.mode) != tt.escape {
            t.Errorf("shouldEscape(%q, %v) returned %v; expected %v", tt.in, tt.mode, !tt.escape, tt.escape)
        }
    }
}

bradfitz · 2016-06-24T13:36:00Z

Maybe it's best if you just send a change (in Gerrit) so we can see differences and comment there, since large code dumps in comments aren't easy to read. See https://golang.org/doc/contribute.html

But be warned that changes to net/url have become increasingly hard (nearly impossible) for compatibility reasons. There's a large chance the change won't make it in or will be reverted. Let's see what the diff looks like, though.

bradfitz · 2016-06-24T13:36:56Z

Also, note that I removed the RFC mention in 83676d6, specifically because we've accepted that we can't fully implement the RFC at this point for compatibility reasons.

bradfitz · 2016-08-22T23:40:35Z

Closing due to timeout waiting for reply.

ianlancetaylor added the Proposal label Jun 20, 2016

ianlancetaylor added this to the Proposal milestone Jun 20, 2016

ianlancetaylor changed the title ~~proposal: adhere to RFC3986 in net/url~~ proposal: net/url: adhere to RFC3986 Jun 20, 2016

bradfitz added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 29, 2016

bradfitz closed this as completed Aug 22, 2016

spacewander mentioned this issue Jul 28, 2017

ngx.escape_uri should escape RFC 3986 section 2.2 Reserved Characters openresty/lua-nginx-module#1124

Open

golang locked and limited conversation to collaborators Aug 22, 2017

gopherbot added the FrozenDueToAge label Aug 22, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: net/url: adhere to RFC3986 #16127

proposal: net/url: adhere to RFC3986 #16127

gmccue commented Jun 20, 2016

ianlancetaylor commented Jun 20, 2016

gmccue commented Jun 21, 2016 •

edited

Loading

ianlancetaylor commented Jun 21, 2016

gmccue commented Jun 24, 2016 •

edited

Loading

bradfitz commented Jun 24, 2016

bradfitz commented Jun 24, 2016

bradfitz commented Aug 22, 2016

proposal: net/url: adhere to RFC3986 #16127

proposal: net/url: adhere to RFC3986 #16127

Comments

gmccue commented Jun 20, 2016

ianlancetaylor commented Jun 20, 2016

gmccue commented Jun 21, 2016 • edited Loading

ianlancetaylor commented Jun 21, 2016

gmccue commented Jun 24, 2016 • edited Loading

bradfitz commented Jun 24, 2016

bradfitz commented Jun 24, 2016

bradfitz commented Aug 22, 2016

gmccue commented Jun 21, 2016 •

edited

Loading

gmccue commented Jun 24, 2016 •

edited

Loading