Specify HTML numeric character reference fallback encoding for filenames #3276
…art upload filename characters not representable in form charset
/cc @inexorabletash for Encoding Standard question |
/cc @pwnall for still-open interop questions - should this specify Also: should newline elision or replacement be specified, either normatively or non-normatively? All of these special cases still need WPT coverage and interop testing too, I think. Strictly speaking they are distinct from this issue, however (at least as it was originally described.) Here's what we do in Chrome for the rest of the ASCII range right now:
|
The PR as-is seems like a reasonable improvement over the status quo, but what I'd like to have eventually is some kind of definition that takes a data structure as input and produces the correct byte sequence; making use of https://encoding.spec.whatwg.org/#encode to convert code points in the data structure to bytes. That would require a more complete definition of the |
https://github.com/masinter/multipart-form-data/tree/master/test-cases has some tests and test ideas for this format btw. Might be worth looking through. |
That looks useful, thank you! Yesterday, while looking for existing coverage, I ran across https://hixie.ch/tests/adhoc/html/forms/submission/multipart_form-data/, on which some of the tests you link seem to be based.
When you write “some kind of definition that takes a data structure as input and produces the correct byte sequence”, do you mean that it would be inlined in the HTML standard, or placed in a separate specification that is used by reference (as text encoding would be, I guess)? If "by reference" is suitable, would it be reasonable to accomplish this by working with the editors of the multipart RFC to add any needed extension points or clarifications there, or is the RFC process (edit: or the goal of the RFC) in some way unsuited to HTML's purposes?
Also, does that mean this PR could be merged as it is? If so, how does that work, and what do I need to do? If not, what is still needed? |
I think it would be most natural to define the format alongside https://xhr.spec.whatwg.org/#interface-formdata, but the RFC process might work (just like https://url.spec.whatwg.org/#application/x-www-form-urlencoded is defined alongside its API). If you are willing to do the work there I certainly wouldn't oppose an updated RFC. Looking at this PR it does seem to remove some advice that might be needed, such as mapping " to %22. If we go with some kind of intermediate PR (and @domenic is okay with it) I think we should at least preserve that. |
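A minimal sketch of why the `"` to `%22` mapping mentioned above matters. The helper names (`content_disposition`, `naive_filename`) are hypothetical, and the regular-expression scanner only stands in for a simple receiver; it is not taken from any browser, server, or RFC.

```python
import re


def content_disposition(field: str, filename: str, escape: bool) -> str:
    """Build a part header line; `escape` toggles the %22 mapping for quotes."""
    if escape:
        filename = filename.replace('"', "%22")
    return f'Content-Disposition: form-data; name="{field}"; filename="{filename}"'


def naive_filename(header: str) -> str:
    """Extract filename the way a simple quoted-string scanner might."""
    match = re.search(r'filename="([^"]*)"', header)
    return match.group(1) if match else ""


# A name containing a quotation mark (hypothetical input).
tricky = 'a"; name="other.txt'

# Without escaping, the quote ends the parameter early and the rest of the
# name is read as extra header syntax.
print(naive_filename(content_disposition("upload", tricky, escape=False)))  # a

# With the %22 mapping the parameter stays in one piece.
print(naive_filename(content_disposition("upload", tricky, escape=True)))
# a%22; name=%22other.txt
```

The receiver still has to decide whether to turn `%22` back into a quotation mark, which is part of what the rest of the thread discusses.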
@annevk would you be OK with specification rather than advice? Indeed, failing to quote |
I've started work on WPT coverage for ASCII punctuation and controls (so far |
We don't have interop. Should we? So far I've tested in Firefox and Chrome. Firefox converts |
Also in Firefox a lone edit: right, that's not special handling for edit 2: Firefox results are confirmed in FirefoxNightly |
Safari version 11.0.1 (13604.3.5) seems (somewhat unsurprisingly, given shared heritage) to behave identically to Chrome for |
AFAIK IE and Edge are not relevant here (at least in the absence of JS-constructed File objects in uploads) because Windows filesystems do not permit any of these characters in filenames. |
Ok, IE does actually have a way to do this for XHR using FormData.prototype.append(name, blob, filename) - and it turns out IE does not quote or replace the problematic characters |
It sounds to me like this is a case where there's enough divergence where you get to choose the model. Congratulations! :) We should definitely have interop here. Your help on tests, spec, and filing bugs on all browsers to get up to snuff would definitely be great. |
https://jsfiddle.net/huysk24t/18/ will reveal how your browser does with CR, LF, edit: |
(and somewhere inside IE it's not expecting a NUL inside the "file" name, and truncates. oops.) |
Results from Chrome and Safari: "REVERSE SOLIDUS [\\] |
Results from IE and Edge: (thanks for running this on Edge, @pwnall !) "REVERSE SOLIDUS [\\] |
Results from Firefox: "REVERSE SOLIDUS [\\] |
Newer fiddle, also touches on https://jsfiddle.net/huysk24t/36/
Chrome ua: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3287.0 Safari/537.36
"REVERSE SOLIDUS [\\]\
COLON [:]\
SOLIDUS [/]\
QUOTATION MARK [%22]\
CR [%0D]\
LF [%0A]\
CR LF [%0D%0A]\
HT [\t]\
VT [\u000b]\
FF [\f]\
BS [\b]\
ESC [\u001b]\
BEL [\u0007]\
DEL [\u007f]\
CSI [\u009b]\
NUL [\u0000]\
.txt"
Safari ua: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5
"REVERSE SOLIDUS [\\]\
COLON [:]\
SOLIDUS [/]\
QUOTATION MARK [%22]\
CR [%0D]\
LF [%0A]\
CR LF [%0D%0A]\
HT [\t]\
VT [\u000b]\
FF [\f]\
BS [\b]\
ESC [\u001b]\
BEL [\u0007]\
DEL [\u007f]\
CSI [\u009b]\
NUL [\u0000]\
.txt"
Firefox Nightly ua: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0
"REVERSE SOLIDUS [\\]\
COLON [:]\
SOLIDUS [:]\
QUOTATION MARK [\\\"]\
CR [ ]\
LF [ ]\
CR LF [ ]\
HT [\t]\
VT [\u000b]\
FF [\f]\
BS [\b]\
ESC [\u001b]\
BEL [\u0007]\
DEL [\u007f]\
CSI [\u009b]\
NUL ["
IE ua: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; .NET4.0E; .NET4.0C; rv:11.0) like Gecko
"REVERSE SOLIDUS [\\]\
COLON [:]\
SOLIDUS [/]\
QUOTATION MARK [\"]\
CR [\r]\
LF [\n]\
CR LF [\r\n]\
HT [\t]\
VT [\u000b]\
FF [\f]\
BS [\b]\
ESC [\u001b]\
BEL [\u0007]\
DEL [\u007f]\
CSI [\u009b]\
NUL ["
Edge (thanks again, @pwnall !) ua: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393
"REVERSE SOLIDUS [\\]\
COLON [:]\
SOLIDUS [/]\
QUOTATION MARK [\"]\
CR [\r]\
LF [\n]\
CR LF [\r\n]\
HT [\t]\
VT [\u000b]\
FF [\f]\
BS [\b]\
ESC [\u001b]\
BEL [\u0007]\
DEL [\u007f]\
CSI [\u009b]\
NUL [" |
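For readers who want the results above in executable form, here is a rough model of the three behaviour families. It is only a reading of the pasted output, not derived from any browser's source; the function names are hypothetical, and the Firefox solidus-to-colon mapping was observed only in the macOS Nightly run shown above.

```python
def chrome_safari(name: str) -> str:
    """Chrome 65 / Safari 11 as reported: percent-escape quote, CR and LF only."""
    return name.replace('"', "%22").replace("\r", "%0D").replace("\n", "%0A")


def firefox(name: str) -> str:
    """Firefox Nightly 59 (macOS) as reported."""
    name = name.split("\x00", 1)[0]        # name is cut off at the first NUL
    name = name.replace("\r\n", " ")       # CR LF collapses to one space
    name = name.replace("\r", " ").replace("\n", " ")
    name = name.replace('"', '\\"')        # quote gets a backslash prefix
    return name.replace("/", ":")          # SOLIDUS showed up as COLON


def ie_edge(name: str) -> str:
    """IE 11 / Edge 14 as reported: no escaping, truncation at NUL."""
    return name.split("\x00", 1)[0]


sample = 'q"r\rs\nt\r\nu/v\x00w'
for model in (chrome_safari, firefox, ie_edge):
    print(model.__name__, repr(model(sample)))
```

Other control characters (HT, VT, FF, BS, ESC, BEL, DEL, CSI) passed through unchanged in every run above, so the model leaves them alone.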
Please keep in mind that every comment here reaches 229 inboxes :). As such, consider consolidating and updating a single comment, instead of using new comments as a stream of consciousness. Then ping the thread again when you have a concrete question or proposal that needs people's attention. That said, I'm happy you're enthusiastic, and don't want to discourage the great work you're doing! So feel free to err on the side of too many posts instead of too few; a few extra emails is not a big price to pay :). (Note: editing posts to add @-mentions will work to send them a new notification, last time I tested.) |
@domenic apologies for the spam, I switched to edits after that and also I think the noisy characterization of existing behavior is done. @annevk I have pushed an updated snapshot that proposes the following replacements for controls and punctuation:
Does this address the missing-advice issue you raised? It adds one new behavior not found in existing browsers (NUL to %00) however to me that replacement seems preferable to surprising string truncation. I see elsewhere in the HTML spec that @mkruisselbrink over in the File API spec's constructor I also see that |
@bsittler it does, thank you. And yeah, it makes sense to ask browsers to change their behavior when the current behavior is suboptimal. I'm assuming by \ you're referring to the "path components" language. I don't have a good suggestion there. It might be worth testing on Windows. The main thing this wants to ensure is that we don't leak directory names, but perhaps there's a better way to do that. |
Yes, @annevk — I refer to https://html.spec.whatwg.org/multipage/input.html#file-upload-state-(type=file) which states
Edit: reading this with a little more context, I believe the intent here is to avoid breaking filenames that otherwise would confuse applications attempting to remove the |
@annevk I did already test it on Windows in Chrome (I do not know of a way to test it in non-Blink browsers on Windows, though) — and in Chrome edit: specifically, that's the test case for |
@bzbarsky @smaug---- @cdumez @rniwa @travisleithead this HTML PR changes and standardizes how outside-of-form charset characters (somewhat rare, and only possible on non-UTF-8 form posts) and ASCII control characters (very rare) are encoded in name="" and filename="" for multipart form uploads.
EDIT: this is more complicated than needed. See #3276 (comment) below for a newer, simpler proposal.
TL;DR: in multipart form data name="field-name" and filename="file-name":
Full version with "why each change" and "what exactly changes":
Tentative WPT for numeric character reference fallback is already added in web-platform-tests/wpt@cbe8c8e.
Tentative WPT coverage for some punctuation and control characters (excluding a few of the control characters Chromium's wptserve is currently unhappy with) is under review in https://chromium-review.googlesource.com/c/chromium/src/+/814400 and web-platform-tests/wpt#8618.
Thanks to JS-constructed FormData, the control characters and punctuation have been technically possible in multipart for a while now, and many non-Windows OSes also allow them in filenames, but they are not widely used. Recently DataTransfer became script-constructable (and this is implemented in Blink), meaning these can occur in regular non-fetch()/non-XHR form submissions from other platforms too.
edit: Brief quote from the Blink change with links to RFCs:
What do you think? |
@masinter this (see #3276 (comment) for an overlong attempt at a summary) is an attempt to bring HTML's use of multipart form data encoding close enough to RFC 7578 that more exotic inputs will still be parsed by RFC-compliant processors. What do you think? |
It's worth explaining in a note why ESC doesn't get escaped, so someone doesn't come along and "fix" it later. |
Good point, @bzbarsky - done, and also updated the summary comment above to highlight this |
It's been a while. Isn't this the kind of problem file: URLs have too? If the browser encodes the filename as the last segment of a file: URL, then it's up to the form-data recipient to turn that into a local URL, and then into a local filename. There'd be no need to look at the form-charset. I don't think I ever finished the test suite, but I'd like to hand it off. If the RFC needs to change, submit an erratum or, better still, update the RFC. |
@masinter good point! I had neglected that parallel part of the platform. Unfortunately form recipients are at this point used to getting the filename="" mostly not encoded, for instance with literal spaces However, some of the rules for path delimiter normalization and quoting of syntax-critical punctuation may be re-usable. Fortunately (from the point of view of standards, at least) native file system syntax diversity is now much lower than it was when |
https://tools.ietf.org/html/rfc8089 Feb 2017 seems pretty recent. |
Agreed, and that's much newer than the one I was looking at. The vertical line character handling is strangely similar to the I guess Unfortunately each one of these ASCII characters we modify will break yet-larger parts of the few remaining ISO-2022-JP multipart form handlers when they do not do %-decoding prior to character set interpretation, but I guess we already have that problem to some extent with We also have a similar problem in the download attribute which suggests a filename to use on the local file system. I will also take a look at PDF forms, thanks for pointing that out! |
@annevk @bzbarsky what do you think about using edit: same question for |
I haven't really thought enough about this to have an opinion. What do browsers do right now? |
FWIW I don't have many/strong opinions. I think it's fair to say that Encoding out-of-charset characters as Turning slashes into colons sounds funny, but I guess nobody is using HFS anymore. As long as everybody does the same thing, I think it doesn't matter if it matches the URL/URI specs or not. |
@bzbarsky right now
I believe this would prevent mangling of ESC-sequences and encoded characters for ISO-2022-JP (since ISO-2022-JP encoding would happen after %-encoding), and this would also prevent mangling of trailing bytes (e.g. edit: /cc @inexorabletash - what's your take on the wisdom (or lack thereof) in doing |
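A small sketch of why the order of the two steps matters for ISO-2022-JP. It assumes, purely for illustration, that the only escape is `"` to `%22`, and it uses Python's `iso2022_jp` codec, which may differ in details from the Encoding Standard's ISO-2022-JP encoder. The sample name is hypothetical.

```python
# U+3001 IDEOGRAPHIC COMMA is encoded in ISO-2022-JP as a two-byte sequence
# whose second byte is 0x22, the same byte value as an ASCII quotation mark.
name = "a\u3001b.txt"
encoded = name.encode("iso2022_jp")
print(b'"' in encoded)  # True, although the name contains no quotation mark


def escape_text(text: str) -> str:
    """Escape in the Unicode domain, before charset encoding."""
    return text.replace('"', "%22")


def escape_bytes(data: bytes) -> bytes:
    """Escape after charset encoding, operating on raw bytes."""
    return data.replace(b'"', b"%22")


escape_first = escape_text(name).encode("iso2022_jp")
encode_first = escape_bytes(name.encode("iso2022_jp"))

# Escaping first leaves the multi-byte sequence intact and round-trips.
print(escape_first.decode("iso2022_jp") == name)

# Escaping the encoded bytes rewrites half of a two-byte character, so the
# receiver decodes something other than the original name.
print(encode_first.decode("iso2022_jp", errors="replace") == name)
```

The same reasoning applies to the escape sequences themselves, which also consist of ASCII-range bytes.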
There are 5 charsets involved: the charset of the form, the charset of the file system of the encoder, the charset used to encode the file name, the charset used to decode the file name, and the charset of the file system used to store the uploaded file. If you always convert file names to UTF-8 and %xx-encode only as necessary for a 'file:' URL (IRI), then the receiver can translate. A use case to consider is uploading a web page including HTML and images where the names of the image files aren't ASCII, and the web server has Unicode file names. |
@masinter good points! To clarify, this is from the point of view of a browser (encoder) that has already previously converted the local filenames (if they are indeed names of local files and not just already-Unicode "filenames" constructed by JavaScript) to Unicode, and it assumes the form encoding and decoding character encodings match (well, use the same-named codecs from the text encoder spec - in practice the encoder produces a subset of what the decoder understands and a few characters are normalized in transit), out-of-form charset filename characters are replaced with SGML/XML-style decimal numeric character references in the non-UTF-8 form charset case (most of the browsers already do this; some upload handlers try to recover these characters but most do not, and this is an acceptable loss when dealing with uploads in legacy character encodings incapable of representing the full Unicode gamut), and makes few assumptions about whether the uploaded file will be stored or what character encoding or file system, application or database restrictions may apply when storing it. We also need to ensure we don't break the web's widespread assumptions. For most modern parts of the web the form character set (both for encoding and for decoding) is UTF-8, non-ASCII parts of the already-converted-to-Unicode filenames do not get A slightly more in-depth defense would make special allowances for basenames matching |
The last proposal looks like progress to me. 2 encoding passes > 3 encoding passes. The only better thing I can think about is having a single pass, which I think would entail using SGML numerical character entities for control characters and characters with special meaning in MIME multipart with our encoding (I think that's
I think we shouldn't assume / worry about the uploaded files ending up on a real filesystem. Many systems I know about end up storing files as attachment metadata in a database and a blob in a separate storage system (S3 / block store / content-addressable filesystem). The relevant concern here is having a simple algorithm that parses the encoded filename into a string to be stored in the database and used for display / in URLs.
I don't think we should make any provisions for systems that attempt to use uploaded filenames directly to address a filesystem. This is a known bad security practice, and systems that have this design issue are already vulnerable to attacks. As long as there's an easy way to decode the encoded filename, servers should be responsible for re-encoding the name in a way that makes it suitable for their storage layer. |
Actually if we assume processors won't shoot themselves in the foot we can get away with more minimal escaping: Step one (still in the Unicode domain, so e.g. host file basenames are already decoded) replaces syntactically significant characters:
Step two (only for non-Unicode form charsets; this fallback replacement happens during transformation from Unicode to a non-Unicode form charset):
Not using I think that's it. What do you think? edit: also, I think SGML-style numeric character references don't actually work for some control characters. Specifically, I think some of the decoders throw on e.g. |
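The two steps described above could be sketched roughly as follows. The exact set of characters replaced in step one is not preserved in this thread, so the `SYNTAX_ESCAPES` table (quotation mark, CR, LF) is an assumption borrowed from the Chrome results earlier; all names here are hypothetical, and Python's `xmlcharrefreplace` handler stands in for the Encoding Standard's "html" error mode.

```python
# Assumed escape set; the real list from the proposal is not shown above.
SYNTAX_ESCAPES = {'"': "%22", "\r": "%0D", "\n": "%0A"}


def escape_syntax(name: str) -> str:
    """Step one: replace syntactically significant characters (Unicode domain)."""
    return "".join(SYNTAX_ESCAPES.get(ch, ch) for ch in name)


def encode_filename(name: str, form_charset: str = "utf-8") -> bytes:
    """Step two: encode to the form charset, falling back to decimal numeric
    character references for characters the charset cannot represent."""
    return escape_syntax(name).encode(form_charset, errors="xmlcharrefreplace")


def part_header(field: str, filename: str, form_charset: str = "utf-8") -> bytes:
    """Assemble the Content-Disposition line of one multipart part."""
    return (
        b'Content-Disposition: form-data; name="'
        + encode_filename(field, form_charset)
        + b'"; filename="'
        + encode_filename(filename, form_charset)
        + b'"\r\n'
    )


print(part_header("file", 'x"\n\u2603.txt', "cp1252"))
print(part_header("file", 'x"\n\u2603.txt', "utf-8"))
```

With a UTF-8 form charset the second step never falls back, so only the first step changes the name.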
I like the simplicity of the last round, but I think everyone would enjoy learning how well it matches current browsers. Perhaps web platform tests are the best way to communicate that, instead of making you reproduce your test results in Markdown :) |
@domenic I know I would, and I agree. It turns out reordering the %-encoding to before the charset-translation is a slightly more involved change to Blink, so this change is taking slightly longer than I had anticipated. I'm hoping to get it working soon and update https://chromium-review.googlesource.com/c/chromium/src/+/814400 with matching WPT coverage and Blink changes |
I've opened a PR in wpt including the changes from @bsittler's closed web-platform-tests/wpt#8618, which tests multipart/form-data payloads including controls and punctuation, as well as some https://wpt.fyi/results/?label=pr_head&max-count=1&pr=26556 The wpt.fyi results have Safari TP on macOS crashing on the |
Specify HTML numeric character reference fallback encoding for multipart upload filename characters not representable in form acceptCharset/form charset.
Rationale:
acceptCharset/form charset. @annevk points out that this is exactly the "html" error handling of the Encoding Standard
<input type=file multiple>; with this behavior standardized, web pages may even be able to portably recover useful user-visible representations of the original filenames, though some ambiguity remains with that approach as a local file could actually contain name parts matching numeric character references (moving to UTF-8 for the form submission of course resolves the ambiguity and should be the only recommended solution for newly-built web pages).
Closes #3223
WPT coverage already added (as ".tentative.") in web-platform-tests/wpt@cbe8c8e
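A short illustration of the fallback and of the ambiguity mentioned in the description, using Python's `xmlcharrefreplace` error handler, which emits the same decimal `&#...;` form. The function name and sample filenames are hypothetical, and Python's `cp1252` codec is only an approximation of the Encoding Standard's windows-1252.

```python
def html_fallback_encode(name: str, form_charset: str) -> bytes:
    """Encode a filename, replacing unrepresentable characters with decimal
    numeric character references."""
    return name.encode(form_charset, errors="xmlcharrefreplace")


# U+2603 SNOWMAN is not representable in windows-1252, so it falls back.
snowman = html_fallback_encode("\u2603.txt", "cp1252")
print(snowman)  # b'&#9731;.txt'

# A file whose name literally contains that reference yields the same bytes,
# so the receiver cannot tell the two apart.
literal = html_fallback_encode("&#9731;.txt", "cp1252")
print(snowman == literal)  # True

# With a UTF-8 form charset the two names stay distinct.
print(html_fallback_encode("\u2603.txt", "utf-8") == html_fallback_encode("&#9731;.txt", "utf-8"))  # False
```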