Skip to content

Conversation

bsittler
Copy link

@bsittler bsittler commented Dec 6, 2017

Specify HTML numeric character reference fallback encoding for multipart upload filename characters not representable in form acceptCharset/form charset.

Rationale:

  • Consistency: this makes filename fallback character replacement consistent with encoding of form element names and values in multipart uploads when a source character is not representable in the acceptCharset/form charset. @annevk points out that this is exactly the "html" error handling of the Encoding Standard
  • Predictability: this is consistent with existing behavior in at least three browsers: Firefox, Edge, and Chrome (as of a few weeks ago)
  • Reduced data loss: this change reduces the risk of user confusion and website malfunction when multiple uploaded files with distinct local filenames but identical representation after user agent-specific fallback character replacement are uploaded using <input type=file multiple>; with this behavior standardized, web pages may even be able to portably recover useful user-visible representations of the original filenames, though some ambiguity remains with that approach as a local file could actually contain name parts matching numeric character references (moving to UTF-8 for the form submission of course resolves the ambiguity and should be the only recommended solution for newly-built web pages).

Closes #3223

WPT coverage already added (as ".tentative.") in web-platform-tests/wpt@cbe8c8e


/form-control-infrastructure.html ( diff )

…art upload filename characters not representable in form charset
@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

@annevk @domenic is this the right place to change? I wasn't sure whether to attempt to unify the name and value fallback encoding with the file name fallback encoding either. Also, should this instead use the "html" error handling of the Encoding Standard by reference?

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

/cc @inexorabletash for Encoding Standard question

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

/cc @pwnall for still-open interop questions - should this specify %22 replacement for " and/or %26 or &amp; replacement for &?

Also: should newline elision or replacement be specified, either normatively or non-normatively?

All of these special cases still need WPT coverage and interop testing too, I think. Strictly speaking they are distinct from this issue, however (at least as it was originally described.)

Here's what we do in Chrome for the rest of the ASCII range right now:

  • \n -> %0A (quote LF)
  • \r -> %0D (quote CR)
  • " -> %22 (quote double-quote)
    - \x7F -> (empty) (remove DEL)
    - If the filename contains / only the part after the last / is used.
    - Otherwise: if the filename does not contain / but does contain \ only the part after the last \ is used
    edit: I was wrong about \x7F, / and \

@annevk
Copy link
Member

annevk commented Dec 7, 2017

The PR as-is seems like a reasonable improvement over the status quo, but what I'd like to have eventually is some kind of definition that takes a data structure as input and produces the correct byte sequence; making use of https://encoding.spec.whatwg.org/#encode to convert code points in the data structure to bytes.

That would require a more complete definition of the multipart/form-data format. If you're interested in working on that I'm happy to help out.

@annevk
Copy link
Member

annevk commented Dec 7, 2017

https://github.com/masinter/multipart-form-data/tree/master/test-cases has some tests and test ideas for this format btw. Might be worth looking through.

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

That looks useful, thank you! I had actually run across https://hixie.ch/tests/adhoc/html/forms/submission/multipart_form-data/ on which some of the tests you link seem to be based yesterday when looking for existing coverage.

When you write “some kind of definition that takes a data structure as input and produces the correct byte sequence”, do you mean that it would be inlined in the HTML standard, or in a separate specification that is used by reference (like text encoding would be, I guess?) If "by reference" is suitable, would it be reasonable to accomplish this by working with the editors of the multipart RFC to add any needed extension points/clarifications there, or is the RFC process (edit: or the goal of the RFC) in some way unsuited to HTML's purposes?

Also, does that mean this PR could be merged as it is? If so, how does that work/what do I need to do? If not, what is still needed?

@annevk
Copy link
Member

annevk commented Dec 7, 2017

I think it would be most natural to define the format alongside https://xhr.spec.whatwg.org/#interface-formdata, but the RFC process might work (just like https://url.spec.whatwg.org/#application/x-www-form-urlencoded is defined alongside its API). If you are willing to do the work there I certainly wouldn't oppose an updated RFC.

Looking at this PR it does seem to remove some advice that might be needed, such as mapping " to %22. If we go with some kind of intermediate PR (and @domenic is okay with it) I think we should at least preserve that.

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

@annevk would you be OK with specification rather than advice? Indeed, failing to quote ", CR or LF in parameter values (that is, field names or file names) would lead to unparseable multipart data. I have pushed a new version that specifies %-hex-hex encoding for these three characters in those contexts. I also notice that this was missing for field names before. Perhaps an oversight?

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

I've started work on WPT coverage for ASCII punctuation and controls (so far .tentative.) over in https://chromium-review.googlesource.com/c/chromium/src/+/814400 edit: a.k.a. web-platform-tests/wpt#8618

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

We don't have interop. Should we? So far I've tested in Firefox and Chrome.

Firefox converts " to \", and replaces CR and LF with spaces.
Chrome converts all three of those to %-hex-hex (that is, %22, %0D and %0A)

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

Also in Firefox a lone \ is untouched, but \" becomes \\". & and ; are not given special treatment.

edit: right, that's not special handling for \", that's just regular pass-through for \ followed by \" for " :feeling sheepish:

edit 2: Firefox results are confirmed in FirefoxNightly

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

Safari version 11.0.1 (13604.3.5) seems (somewhat unsurprisingly, given shared heritage) to behave identically to Chrome for ", CR and LF - all are %-hex-hex encoded.

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

AFAIK IE and Edge are not relevant here (at least in the absence of JS-constructed File objects in uploads) because Windows filesystems do not permit any of these characters in filenames.

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

Ok, IE does actually have a way to do this for XHR using FormData.prototype.append(name, blob, filename) - and it turns out IE does not quote or replace the problematic characters ", CR and LF at all, so the result is multipart that cannot actually be parsed by the recipient.

@domenic
Copy link
Member

domenic commented Dec 7, 2017

It sounds to me like this is a case where there's enough divergence where you get to choose the model. Congratulations! :)

We should definitely have interop here. Your help on tests, spec, and filing bugs on all browsers to get up to snuff would definitely be great.

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

https://jsfiddle.net/huysk24t/18/ will reveal how your browser does with CR, LF, ", NUL, &, and ;

edit: https://jsfiddle.net/huysk24t/26/ https://jsfiddle.net/huysk24t/28/ is easier to read and covers more controls too

@bsittler
Copy link
Author

bsittler commented Dec 7, 2017

(and somewhere inside IE it's not expecting a NUL inside the "file" name, and truncates. oops.)
edit: firefox too
edit 2: also, firefox turns CR LF in a filename into a single space

@bsittler
Copy link
Author

bsittler commented Dec 8, 2017

Results from Chrome and Safari:

"REVERSE SOLIDUS [\\]
QUOTATION MARK [%22]
CR [%0D]
LF [%0A]
CR LF [%0D%0A]
HT [\t]
VT [\u000b]
FF [\f]
BS [\b]
ESC [\u001b]
BEL [\u0007]
DEL [\u007f]
CSI [\u009b]
NUL [\u0000]
.txt"

@bsittler
Copy link
Author

bsittler commented Dec 8, 2017

Results from IE and Edge: (thanks for running this on Edge, @pwnall !)

"REVERSE SOLIDUS [\\]
 QUOTATION MARK [\"]
 CR [\r]
 LF [\n]
 CR LF [\r\n]
 HT [\t]
 VT [\u000b]
 FF [\f]
 BS [\b]
 ESC [\u001b]
 BEL [\u0007]
 DEL [\u007f]
 CSI [\u009b]
 NUL ["

@bsittler
Copy link
Author

bsittler commented Dec 8, 2017

Results from Firefox:

"REVERSE SOLIDUS [\\]
QUOTATION MARK [\\\"]
CR [ ]
LF [ ]
CR LF [ ]
HT [\t]
VT [\u000b]
FF [\f]
BS [\b]
ESC [\u001b]
BEL [\u0007]
DEL [\u007f]
CSI [\u009b]
NUL ["

@bsittler
Copy link
Author

bsittler commented Dec 8, 2017

Newer fiddle, also touches on /->: substitution from https://w3c.github.io/FileAPI/#file-constructor which the browsers edit: other than Firefox don't actually seem to follow (yet?) /cc @mkruisselbrink

https://jsfiddle.net/huysk24t/36/

Chrome

ua: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3287.0 Safari/537.36
file: [object File]

"REVERSE SOLIDUS [\\]\
 COLON [:]\
 SOLIDUS [/]\
 QUOTATION MARK [%22]\
 CR [%0D]\
 LF [%0A]\
 CR LF [%0D%0A]\
 HT [\t]\
 VT [\u000b]\
 FF [\f]\
 BS [\b]\
 ESC [\u001b]\
 BEL [\u0007]\
 DEL [\u007f]\
 CSI [\u009b]\
 NUL [\u0000]\
.txt"

Safari

ua: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5
file: [object File]

"REVERSE SOLIDUS [\\]\
 COLON [:]\
 SOLIDUS [/]\
 QUOTATION MARK [%22]\
 CR [%0D]\
 LF [%0A]\
 CR LF [%0D%0A]\
 HT [\t]\
 VT [\u000b]\
 FF [\f]\
 BS [\b]\
 ESC [\u001b]\
 BEL [\u0007]\
 DEL [\u007f]\
 CSI [\u009b]\
 NUL [\u0000]\
.txt"

Firefox Nightly

ua: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0
file: [object File]

"REVERSE SOLIDUS [\\]\
 COLON [:]\
 SOLIDUS [:]\
 QUOTATION MARK [\\\"]\
 CR [ ]\
 LF [ ]\
 CR LF [ ]\
 HT [\t]\
 VT [\u000b]\
 FF [\f]\
 BS [\b]\
 ESC [\u001b]\
 BEL [\u0007]\
 DEL [\u007f]\
 CSI [\u009b]\
 NUL ["

IE

ua: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; .NET4.0E; .NET4.0C; rv:11.0) like Gecko
file: [object Blob]

"REVERSE SOLIDUS [\\]\
 COLON [:]\
 SOLIDUS [/]\
 QUOTATION MARK [\"]\
 CR [\r]\
 LF [\n]\
 CR LF [\r\n]\
 HT [\t]\
 VT [\u000b]\
 FF [\f]\
 BS [\b]\
 ESC [\u001b]\
 BEL [\u0007]\
 DEL [\u007f]\
 CSI [\u009b]\
 NUL ["

Edge (thanks again, @pwnall !)

ua: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393
file: [object Blob]

"REVERSE SOLIDUS [\\]\
 COLON [:]\
 SOLIDUS [/]\
 QUOTATION MARK [\"]\
 CR [\r]\
 LF [\n]\
 CR LF [\r\n]\
 HT [\t]\
 VT [\u000b]\
 FF [\f]\
 BS [\b]\
 ESC [\u001b]\
 BEL [\u0007]\
 DEL [\u007f]\
 CSI [\u009b]\
 NUL ["

@domenic
Copy link
Member

domenic commented Dec 8, 2017

Please keep in mind that every comment here reaches 229 inboxes :). As such, consider consolidating and updating a single comment, instead of using new comments as a stream of consciousness. Then ping the thread again when you have a concrete question or proposal that needs people's attention.

That said, I'm happy you're enthusiastic, and don't want to discourage the great work you're doing! So feel free to err on the side of too many posts instead of too few; a few extra emails is not a big price to pay :).

(Note: editing posts to add @-mentions will work to send them a new notification, last time I tested.)

@bsittler
Copy link
Author

bsittler commented Dec 8, 2017

@domenic apologies for the spam, I switched to edits after that and also I think the noisy characterization of existing behavior is done.

@annevk I have pushed an updated snapshot that proposes the following replacements for controls and punctuation:

Field names and file names included in the generated multipart/form-data resource must undergo multipart parameter-value character replacement for syntactically significant characters: for each character in the parameter value that is one of U+0000 <control> (NUL), U+0022 QUOTATION MARK ("), U+000A LINE FEED (LF), or U+000D CARRIAGE RETURN (CR), replace the character by a string consisting of a U+0025 PERCENT SIGN character (%) and two ASCII hex digits representing the code point of the character in base sixteen.

Does this address the missing-advice issue you raised? It adds one new behavior not found in existing browsers (NUL to %00) however to me that replacement seems preferable to surprising string truncation.

I see elsewhere in the HTML spec that \ should not be possible in a file name selected for <input type=file>, but also notice that none of the browsers actually enforce that restriction. Should we drop it?

@mkruisselbrink over in the File API spec's constructor I also see that / should be transformed to : in the file name however only Firefox actually does that. Given that none of the other browsers do, should we drop it?

@annevk
Copy link
Member

annevk commented Dec 8, 2017

@bsittler it does, thank you. And yeah, it makes sense to ask browsers to change their behavior when the current behavior is suboptimal.

I'm assuming by \ you're referring to the "path components" language. I don't have a good suggestion there. It might be worth testing on Windows. The main thing this wants to ensure is that we don't leak directory names, but perhaps there's a better way to do that.

@bsittler
Copy link
Author

bsittler commented Dec 8, 2017

Yes, @annevk — I refer to https://html.spec.whatwg.org/multipage/input.html#file-upload-state-(type=file) which states

File names must not contain path components, even in the case that a user has selected an entire directory hierarchy or multiple files with the same name from different directories. Path components, for the purposes of the File Upload state, are those parts of file names that are separated by U+005C REVERSE SOLIDUS character (\) characters.

Edit: reading this with a little more context, I believe the intent here is to avoid breaking filenames that otherwise would confuse applications attempting to remove the C:\Fakepath\ prefix in the value property of InputElement. I suspect the wording in the File API's File constructor mandating replacement of / by : may be for the same reason. I think a better fix would be to perform those replacements at the time the value property is set, but leave the unmodified name in the File object's name property. What do you think, @mkruisselbrink ?

@bsittler
Copy link
Author

bsittler commented Dec 8, 2017

@annevk I did already test it on Windows in Chrome (I do not know of a way to test it in non-Blink browsers on Windows, though) — and in Chrome \ in a JS-constructed File's name is preserved even when it is attached to <input type=file> and uploaded in a multipart form submission, see https://crrev.com/c/814400 which passes even on Windows (win7_chromium_rel_ng
) http://build.chromium.org/p/tryserver.chromium.win/builders/win7_chromium_rel_ng/builds/61717 . In IE and Edge it is likewise preserved in the similar-yet-distinct FormData upload path, see #3276 (comment) above.

edit: specifically, that's the test case for file-for-upload-in-form-REVERSE-SOLIDUS-[\\].txt here https://github.com/w3c/web-platform-tests/pull/8618/files#diff-4a98331ea0cce009c9ebe1fa292b1e01R168

@bsittler
Copy link
Author

bsittler commented Dec 9, 2017

@bzbarsky @smaug---- @cdumez @rniwa @travisleithead this HTML PR changes and standardizes how outside-of-form charset characters (somewhat rare, and only possible on non-UTF-8 form posts) and ASCII control characters (very rare) are encoded in name="" and filename="" for multipart form uploads.

EDIT: this is more complicated than needed. See #3276 (comment) below for a newer, simpler proposal

TL;DR: in multipart form data name="field-name" and filename="file-name":

  • "%22
  • \\\ (open question: should it be %5C instead?)
  • U+001B (ESC) → U+001B (ESC) (unchanged to not break ISO-2022-JP)
  • U+0000…U+001A, U+0011…U+001F, U+007F -> %00%1A, %1C%1F, %7F
  • character outside of (necessarily non-UTF-8) form charset/accept-charset → &#NNN…;

Full version with "why each change" and "what exactly changes":

  • characters outside the form charset/accept-charset are replaced with numeric character references
    • why: make charset data loss semi-recoverable; standardizes de-facto browser interop (4½ out of the 5 tested);
    • this already happens in Chrome, Firefox, IE and Edge;
    • in Safari it already happens in name="" but instead uses ?-replacement in filename=""
    • this is only possible when accept-charset/form charset is not UTF-8, so it does not apply to script-constructed FormData posts via fetch() or XMLHttpRequest
  • Double quote " is replaced with %22
    • why: avoid breaking syntactic structure of quoted parameter values (also how 2 of the 3 "-quoting browsers already do it);
    • this already happens in Safari and Chrome;
    • Firefox currently uses \" instead, and
    • IE and Edge do not currently quote or escape it at all;
    • this change prevents parameter-injection errors
    • this will break non-%22-decoding ISO-2022-JP multipart handlers (they were already broken in Safari and Chrome, I think; for instance, 壁哂窟 will become corrupted as 壁咼臆轡臆)
  • Backslash \ is replaced with \\
    • why: this prevents spurious interpretation of a final \ and the following " as a quoted " in the style used up to now by Firefox, and
    • open question: should it be %5C instead? that way byte recovery requires only one decoder
    • this represents a change from all tested browsers
    • this will break non-\\-decoding ISO-2022-JP multipart handlers (for instance, 楞楞 will become corrupted as 樛捜楞 — however, ISO-2022-JP is fairly rare these days)
  • 7-bit control characters other than U+001B (ESC) are replaced by %-hex-hex sequences
    • why: avoid breaking syntactic structure of quoted parameter values (also how 2 of 3 CR- and LF-quoting browsers already do it for those specific characters);
      • ... but U+001B (ESC) is preserved (passed through unquoted) for Web-compatible ISO-2022-JP
    • why: this change prevents server-side parsing and confusion:
      • NUL-termination,
      • header-injection, and
      • body-injection errors too;
    • for two specific cases this matches existing behavior in Safari and Chrome:
      • U+000A (LF) → %0A (no change in Safari and Chrome) and
      • U+000D (CR) → %0D (no change in Safari and Chrome), but
    • for LF and CR this is a change for Firefox (which currently converts LF, CR and CR LF to space in filename) and
    • for LF and CR this is a change for IE and Edge (which currently pass CR and LF verbatim);
    • for other control characters and for all controls in other browsers this is new behavior, but
    • this is necessary to produce multipart that is actually parsed successfully by many server-side frameworks;
    • note however that IE, Edge and Firefox also have client-side NUL-termination in their FormData implementations
  • why: (all changes) improve web platform consistency and interoperability and remove some broken edges
    open question: should / be quoted as %2F when it occurs in a field name or the basename of a file name (the latter is only possible in the script-constructed case, I believe)

Tentative WPT for numeric character reference fallback is already added in web-platform-tests/wpt@cbe8c8e

Tentative WPT coverage for some punctuation and control characters (excluding a few of the control characters Chromium's wptserve is currently unhappy with) is under review in https://chromium-review.googlesource.com/c/chromium/src/+/814400 and web-platform-tests/wpt#8618

Thanks to JS-constructed FormData the control characters and punctuation have been technically possible in multipart for a while now, and many non-Windows OSes also allow them in filenames but they are not widely used. Recently DataTransfer became script-constructable (and this is implemented in Blink), meaning these can occur in regular non-fetch()/non-XHR form submissions from other platforms too.

edit: Brief quote from the Blink change with links to RFCs:

// Append a string as a quoted value, escaping quotes, line breaks,
// and other ASCII control characters except U+001B (ESC)
// (which needs to pass through unquoted for Web-compatible
// ISO-2022-JP) using the "percent-encoding" option of RFC 7578.
// https://tools.ietf.org/html/rfc5322#section-3.2.4
// https://tools.ietf.org/html/rfc7578#section-2
// https://encoding.spec.whatwg.org/#iso-2022-jp-encoder
//
// Also quote backslash by doubling it; otherwise it will confuse
// RFC 5322-style multipart processors which would interpret the
// backslash as the start of a quoted-pair.
// https://tools.ietf.org/html/rfc5322#section-3.2.1

What do you think?

@bsittler
Copy link
Author

bsittler commented Dec 9, 2017

@masinter this (see #3276 (comment) for an overlong attempt at a summary) is an attempt to bring HTML's use of multipart form data encoding close enough to RFC 7578 that more exotic inputs will still be parsed by RFC-compliant processors. What do you think?

@bzbarsky
Copy link
Contributor

bzbarsky commented Dec 9, 2017

It's worth explaining in a note why ESC doesn't get escaped, so someone doesn't come along and "fix" it later.

@bsittler
Copy link
Author

bsittler commented Dec 9, 2017

Good point, @bzbarsky - done, and also updated the summary comment above to highlight this

@masinter
Copy link

masinter commented Dec 9, 2017

It's been a while. Isn't this the kind of problem file: URLs have too? If the browser encodes the filename as the last segment of a file: URL, then it's up to the form-data recipient to turn that into a local URL. and then into a local filename. THere'd be no need to look at the form-charset.
Maybe too late for that, or maybe file: is too broken.

I don't think I ever finished the test suite but i'd like to hand it off.

If the RFC needs to change, submit an errata or better still update the RFC.
Now that browsers do PDF, you might check on PDF forms.

@bsittler
Copy link
Author

bsittler commented Dec 9, 2017

@masinter good point! I had neglected that parallel part of the platform.

Unfortunately form recipients are at this point used to getting the filename="" mostly not encoded, for instance with literal spaces rather than %20 and expecting non-ASCII parts of the filename to be unencoded, e.g. with 🌐 (for a UTF-8 form submission, as is likely now the most common sort) rather than %F0%9F%8C%90, and &127760; instead of 🌐 when the form charset cannot hold the 🌐 (identically to how it is replaced in non-multipart form submissions with a non-UTF-8 form character set), so I think sharing exactly the same rules will break too many recipients.

However, some of the rules for path delimiter normalization and quoting of syntax-critical punctuation may be re-usable. Fortunately (from the point of view of standards, at least) native file system syntax diversity is now much lower than it was when file: was specified 😄

@masinter
Copy link

masinter commented Dec 9, 2017

https://tools.ietf.org/html/rfc8089 Feb 2017 seems pretty recent.
Perhaps @phluid61 has an opinion

@bsittler
Copy link
Author

bsittler commented Dec 9, 2017

Agreed, and that's much newer than the one I was looking at. The vertical line character handling is strangely similar to the /: replacement in the File constructor. That RFC and the file:-relevant parts of https://url.spec.whatwg.org/ both seem to cover a lot of similar ground.

I guess %2F and %5C might be sane replacements when / or \ occurs in the filename="" (actually a basename), similar to what would happen in file: were those able to appear in basenames referring to actual files on the host file system (I think \ does occur but / does not - but our script-constructed filenames make both possible now in uploads.)

Unfortunately each one of these ASCII characters we modify will break yet-larger parts of the few remaining ISO-2022-JP multipart form handlers when they do not do %-decoding prior to character set interpretation, but I guess we already have that problem to some extent with ", and also ISO-2022-JP is an ever-smaller fraction of the web as migrated sites and new sites move to UTF-8 and SJIS.

We also have a similar problem in the download attribute which suggests a filename to use on the local file system.

I will also take a look at PDF forms, thanks for pointing that out!

@bsittler
Copy link
Author

bsittler commented Dec 9, 2017

@annevk @bzbarsky what do you think about using %5C (file:-style) rather than \\ when \ occurs in the source field name or file name (basename)? It would certainly simplify byte recovery, and that is needed for any ISO-2022-JP upload handlers to recover the original (non-ASCII) characters of the field name or file name.

edit: same question for /, but that is only possible in the script-constructed case so far as I am aware. Any opinion on that one, @mkruisselbrink @inexorabletash ?

@bzbarsky
Copy link
Contributor

bzbarsky commented Dec 9, 2017

I haven't really thought enough about this to have an opinion.

What do browsers do right now?

@phluid61
Copy link

phluid61 commented Dec 9, 2017

FWIW I don't have many/strong opinions. I think it's fair to say that file: is probably too broken to deal with encoding-related edge cases -- it was a cop-out to say that "characters SHOULD be converted to UTF-8 before percent-encoding" but that was as close as we could get for that type of RFC. Piggy-backing on RFC 3986 for percent-encoding "bad" ASCII characters (slashes, quotes, controls,..) is very convenient, though.

Encoding out-of-charset characters as &#NNN...; is a handy charset-agnostic ASCII-compatible way of encoding what is essentially an IRI, so 👍 to that.

Turning slashes into colons sounds funny, but I guess nobody is using HFS anymore. As long as everybody does the same thing, I think it doesn't matter if it matches the URL/URI specs or not.

@smaug----
Copy link

@hsivonen

@bsittler
Copy link
Author

bsittler commented Dec 11, 2017

@bzbarsky right now \ is not escaped by browsers, in some cases producing multipart that cannot be parsed by the recipient. However, thinking more about this I don't thing adding a third escaping mode can possible be the right answer. What if we were instead to do this:

  • convert U+001B (ESC) to %1B in field names and file names (alongside all the other controls already so escaped),
  • convert \ to %5C in field names and file names (alongside all the other punctuators so escaped),
  • convert % not followed by two hexadecimal digits to %25 in field names and file names, and
  • do all such %-conversions in field names and field values prior to encoding in the form charset.

I believe this would prevent mangling of ESC-sequences and encoded characters for ISO-2022-JP (since ISO-2022-JP encoding would happen after %-encoding), and this would also prevent mangling of trailing bytes (e.g. \) in Shift JIS, GBK, GB 18030, EUC KR, and Big5. For UTF-8 and all other text encodings usable in form submissions the ordering of %-encoding and charset encoding does not actually matter and is not observable by a recipient, since in those encodings ASCII-range bytes always represent themselves. It does mean that a recipient of form submissions in any of these multibyte charsets must be aware of the form character set by the time it attempts to process the query to avoid syntactic ambiguity (for instance, \ trailing byte before final enclosing ", or even % or " in an ISO-2022-JP sequences,) however this is already the case today, and is made easier by either knowing the charset in advance or processing a _charset_ hidden field before processing the remaining fields of the form.

edit: /cc @inexorabletash - what's your take on the wisdom (or lack thereof) in doing %-encoding prior to charset encoding?

@masinter
Copy link

There are 5 charsets involved: the charset of the form, the charset of the file system of the encoder, the charset used to encode the file name, the charset used to decode the file name, and the charset of the file system used to store the uploaded file.
It does no good to send an ISO-2022-JP filename to a Unicode file system unambiguously, because you can't store it that way -- the situation is very different from the form's data fields since you can have a single server supporting multiple encodings of data in text fields.

If you always convert file names to UTF8 and %xx encode only as necessary for a 'file:' URL (IRI) then the receiver can translate. Use case to consider is uploading a web page including HTML and images where the names of the image files aren't ascii, and the web server has unicode file names.
you want relative links to the images in the HTML file to match the file names of the image files.

@bsittler
Copy link
Author

@masinter good points!

To clarify, this is from the point of view of a browser (encoder) that has already previously converted the local filenames (if they are indeed names of local files and not just already-Unicode "filenames" constructed by JavaScript) to Unicode, and it assumes the form encoding and decoding character encodings match (well, use the same-named codecs from the text encoder spec - in practice the encoder produces a subset of what the decoder understands and a few characters are normalized in transit), out-of-form charset filename characters are replaced with SGML/XML-style decimal numeric character references in the non-UTF-8 form charset case (most of the browsers already do this; some upload handlers try to recover these characters but most do not, and this is an acceptable loss when dealing with uploads in legacy character encodings incapable of representing the full Unicode gamut), and makes few assumptions about whether the uploaded file will be stored or what character encoding or file system, application or database restrictions may apply when storing it.

We also need to ensure we don't break the web's widespread assumptions.

For most modern parts of the web the form character set (both for encoding and for decoding) is UTF-8, non-ASCII parts of the already-converted-to-Unicode filenames do not get %-escaped at all (in other words, we send UTF-8 "on the wire"), and the only potentially problematic parts here are control character and delimiter injections (so C0, DEL, \, and " need special handling.) \ and " are likely very rare in the non-JavaScript-constructed case because it is only the file's basename that is used here, and on Windows those characters are not allowed in basenames. In order to avoid side-effects with security consequences when a decoder has been implemented naïvely expecting no delimiters in this "basename", I believe we should also %-encode the other common modern path delimiter, /.

A slightly more in-depth defense would make special allowances for basenames matching ., .., `` (empty string), or (case-insensitive) matches for any of the following: rsrc, `..namedfork`, `CON`, `PRN`, `AUX`, `NUL`, `COM1` … `COM9`, `LPT1` … `LPT9`. It would also require `%`-encoding for all of the following punctuators: `<`, `>`, `:`, `"`, `/`, ``, `|`, `?`, and `*`. We could probably even just do this - again, these are all either uncommon or impossible names, depending on the host OS and filesystem.

@pwnall
Copy link
Contributor

pwnall commented Dec 11, 2017

The last proposal looks like progress to me. 2 encoding passes > 3 encoding passes. The only better thing I can think about is having a single pass, which I think would entail using SGML numerical character entities for control characters and characters with special meaning in MIME multipart with our encoding (I think that's &, ;, %, \, /, ".)

I think we shouldn't assume / worry about the uploaded files ending up on a real filesystem. Many systems I know about end up storing files as attachment metadata in a database and a blob in a separate storage system (S3 / block store / content-addressable filesystem). The relevant concern here is having a simple algorithm that parses the encoded filename into a string to be stored in the database and used for display / in URLs.

I don't think we should make any provisions for systems that attempt to use uploaded filenames directly to address a filesystem. This is a known bad security practice, and systems that have this design issue are already vulnerable to attacks. As long as there's an easy way to decode the encoded filename, servers should be responsible for re-encoding the name in a way that makes it suitable for their storage layer.

@bsittler
Copy link
Author

bsittler commented Dec 12, 2017

Actually if we assume processors won't shoot themselves in the foot we can get away with more minimal escaping:

Step one (still in the Unicode domain, so e.g. host file basenames are already decoded) replaces syntactically significant characters:

  • %00%1F, %7F for controls in the C0 range
  • "%22
  • \%5C

Step two (only for non-Unicode form charsets; this fallback replacement happens during transformation from Unicode to a non-Unicode form charset):

  • Replace unrepresentables: Unicode character with ordinal N&# N ;

Not using %-encoding for the second step is actually important, because existing %-decoding is likely often happening in the bytes domain before decoding (if any) of the multipart to equivalent Unicode.

I think that's it. What do you think?

edit: also, I think SGML-style numeric character references don't actually work for some control characters. Specifically, I think some of the decoders throw on e.g. &#0; and &#12;.

@domenic
Copy link
Member

domenic commented Dec 12, 2017

I like the simplicity of the last round, but I think everyone would enjoy learning how well it matches current browsers. Perhaps web platform tests are the best way to communicate that, instead of making you reproduce your test results in Markdown :)

@bsittler
Copy link
Author

bsittler commented Dec 13, 2017

@domenic I know I would, and I agree. It turns out reordering the %-encoding to before the charset-translation is a slightly more involved change to Blink, so this change is taking slightly longer than I had anticipated. I'm hoping to get it working soon and update https://chromium-review.googlesource.com/c/chromium/src/+/814400 with matching WPT coverage and Blink changes

@annevk annevk added topic: forms needs tests Moving the issue forward requires someone to write tests labels Dec 17, 2017
@andreubotella
Copy link
Member

andreubotella commented Nov 18, 2020

I've opened a PR in wpt including the changes from @bsittler's closed web-platform-tests/wpt#8618, which tests multipart/form-data payloads including controls and punctuation, as well as some FormData tests to make sure that the newline normalization in append an entry takes place before constructing the entry list in order to weed out some of the newline-related test failures.

https://wpt.fyi/results/?label=pr_head&max-count=1&pr=26556

The wpt.fyi results have Safari TP on macOS crashing on the FileAPI/file/send-file-form* tests, but that seems to be a macOS-only crash (https://bugs.webkit.org/show_bug.cgi?id=218112) and all of those tests pass on a build of WebKit on Linux.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs tests Moving the issue forward requires someone to write tests topic: forms

Development

Successfully merging this pull request may close these issues.

Specify HTML numeric character reference fallback encoding for multipart upload filename characters not representable in form charset

9 participants