RFC: Stabilize OS string to bytes conversions #1255
Conversation
Stabilize convenience methods for converting between OS strings and Rust's UTF-8 string types, specifically stabilize the following APIs: * `OsString::from_platform_bytes` (renamed from `OsString::from_bytes`) * `OsStr::from_platform_bytes` (added API to mirror `OsString`) * `OsStr::to_bytes` * `OsStr::to_cstring`
Worth noting rust-lang/rust#27390 contains an extensive discussion between two camps.
Ignoring our disagreement about lossy vs non-lossy for now (see rust-lang/rust#27390 for that), I still think "platform_bytes" is a terrible name for these functions. On windows, "platform bytes" means the encoding determined by the system code-page, ie. the encoding expected by

Even from your own description, it's clear this function is not intended to convert from a "platform specific encoding" to an

In this case, the source encoding is "potentially ill-formed utf8" (borrowing terminology from the WTF8 spec), so why not just call it

See this PR: rust-lang/rust#27657 for how I would like this API to look.

@alexcrichton I did try messaging you on IRC about actually taking up @aturon's offer.
Note that I don't have strong feelings about this, I was just providing the

My opinion is that, since the only use-case we can think of is paths, this is probably worth dropping and replacing by specific methods on

It also makes documentation easier, since we can clearly state what a
* On Unix, return `Some` always on `to_bytes` and call `CString::new(..).ok()`
  for `to_cstring`.
* On Windows, attempt to interpret the `&[u16]` as utf-16, and if successful
UTF-16 is not actually involved here, is it? On Windows, this tries to interpret WTF-8 bytes as UTF-8.
Implementation-wise, yes, but conceptually an `OsString` on Windows is a `&[u16]`, so this is more a description of what's metaphorically happening rather than literally.
I agree that the name

I definitely wanted to take some time to consider the name as part of this RFC, but I didn't want to flesh it out 100% ahead of time as it seems like the meat of your disagreement was not with the name but rather the functionality itself.
Gah sorry, I missed this! Was it a PM? Regardless, though, if you PM me on IRC I'd be more than willing to discuss this! You're also more than welcome to send a PR to this RFC.
Unfortunately this approach isn't very composable, however. By adding these methods to
What I care about most is that authors of new software who use rust do not treat windows as a second-class citizen without even realising it. I don't expect everyone to care about all platforms equally, but it should at least be obvious when you're doing something that's platform specific. Given that programmers are a very special kind of lazy, that also means providing more portable alternatives wherever possible. There's more than one way to satisfy these requirements: with better names, clear documentation on these methods explaining the problems with them, and the addition of more portable alternatives (such as lossy conversions) which can be suggested instead, I'd be willing to compromise by accepting their inclusion as well.
I think it's better to name methods according to their spec rather than their implementation details. This is a conversion method:

What you said was missing falls out as a natural consequence: the reason this is a transmute on linux is because "From == To", the reason it may fail on windows is because "From != To", and the conversion is not injective. This sort of reasoning is common to all abstract conversions.
I disagree: unix may not enforce that its paths be valid utf8, but it (and by 'it' I mean the general unix ecosystem, not necessarily the OS) will try to interpret paths as utf8 wherever they need to be converted to a string or displayed. Imagine that rust used utf-16 throughout, including for its String class: the conversion from OsString to String on linux would still try to interpret the OsString as utf8. Doesn't that suggest to you that OsStrings (and by extension, filenames/paths) on linux should be thought of as more than just bytes with no associated encoding?
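The Unix side of this point can be demonstrated with today's stable `std::os::unix::ffi::OsStrExt` extension trait: an `OsStr` can hold arbitrary bytes, the strict conversion to `str` refuses ill-formed sequences, and the lossy conversion interprets them as UTF-8 with replacement characters. A minimal sketch (Unix-only; `lossy_display` is a hypothetical helper, not a std API):

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes OsStr as raw bytes

// Decode a filename's bytes the way the Unix ecosystem conventionally does:
// treat them as UTF-8, substituting U+FFFD for ill-formed sequences.
fn lossy_display(raw: &[u8]) -> String {
    OsStr::from_bytes(raw).to_string_lossy().into_owned()
}

fn main() {
    // "foo" followed by 0x80, which is never valid UTF-8.
    let raw: &[u8] = &[0x66, 0x6f, 0x6f, 0x80];
    let s = OsStr::from_bytes(raw);

    // The strict conversion fails on the ill-formed byte...
    assert!(s.to_str().is_none());

    // ...while the lossy conversion renders it as a replacement character.
    assert_eq!(lossy_display(raw), "foo\u{FFFD}");
}
```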
This is actually bugging me as well. Shouldn't the system locale (or codepage) be used there, the way any other program would? There is no reason for a latin-1 system to show replacement characters everywhere just because the Rust language decided to represent strings as UTF-8 internally.
@remram44 Rust chose long ago not to support encodings other than UTF-8 for displaying

There are reasons for this, but this is out of scope for this RFC. If you want to discuss it more, please file a separate issue (and feel free to CC me).
I don't think that it is. Whether the locale-aware methods should make it into std or a separate crate, it is an argument against the current version of these methods.
Could you elaborate a little more on why you think these functions are treating Windows as a second class citizen? Performing a conversion between

Also, I disagree that having a lossy conversion is a more portable alternative. The conversions will only be lossy on Windows, and the API does not indicate this at all in terms of the types coming in and out, so it's even easier to write code assuming the conversion always succeeds and is faithful to the original data (which isn't true). In the
I think it's very important to try to not conflate concerns with these methods. The purpose of an

As a result, despite most Unix implementations attempting to interpret bytes as utf8, it's the job of
Why would lossiness need to be expressed in terms of the types? (Potential) lossiness is a property of the conversion itself, not the input or output. Under the compromise I described before, I would expect the lossy versions to be named

The lossy conversion is more portable because it's impossible to write code which is simultaneously portable and relies on this conversion being non-lossy: that reliance is itself non-portable. By not providing the non-lossy version, or by providing it along with a lossy version, and documenting these problems clearly, it's less likely that code which has this reliance will be written by accident.
If that was true, we wouldn't bother with strong typing at all, and could just make OsString an alias of

It's not, and that's because the purpose of OsString is much wider than you're suggesting. I'd even go so far as to say that the main reason for OsString taking the form it does is to enable efficient conversions to/from String (otherwise why bother with WTF8 at all?).
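For what it's worth, std already leans on that efficiency. A small illustration (not part of this RFC's proposed API) of the `String`/`OsString` conversions that the WTF-8 representation is meant to make cheap; `round_trip` is a hypothetical helper for demonstration:

```rust
use std::ffi::OsString;

// String -> OsString -> String round-trip. With the WTF-8 representation this
// is cheap in both directions, because valid UTF-8 is already valid WTF-8.
fn round_trip(s: &str) -> String {
    let os: OsString = OsString::from(s.to_string());
    // `into_string` returns Err(self) if the contents weren't valid Unicode;
    // data that started life as a Rust `String` always succeeds.
    os.into_string().expect("valid UTF-8 round-trips losslessly")
}

fn main() {
    assert_eq!(round_trip("héllo wörld"), "héllo wörld");
}
```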
That's not incompatible with what I was saying: it's quite possible (even usual, outside of rust) for strings to have an encoding which should be used when interpreting them, and also "ferry along bytes completely independently of that assumption". To reiterate, I'm not saying that invalid utf8 in an OsString should be considered an error on linux, I'm saying that OsStrings are more than just arbitrary bytes: they are arbitrary bytes intended to be interpreted as utf8. (Otherwise it would make no sense to provide conversions to String without specifying an encoding)
Another possible name would be
Also, I am not really sure why we would create a function that can fail on Windows and can't fail on Linux - I would prefer either to fail on both platforms or output valid Unicode (possibly inserting replacement characters) on both platforms (are there functions for doing that?). In most cases I would prefer programs to at least issue a warning rather than silently create files with replacement characters in their filename.
And in the reverse direction?
@arielb1 @alexcrichton's motivation for these additional functions is for interoperability with existing C code which assumes paths are represented as byte arrays on every platform. (for example, git stores all paths internally as byte arrays). If you look at how git for windows works when trying to check out a repository which contains invalid UTF8 in its paths, you'll see that it succeeds, by using a lossy conversion. Since the git repository itself is the "source of truth" it doesn't matter that the conversion is not reversible. (The reverse conversion only ever occurs when checking in new files). This is, IMO, an infinitely better outcome than being completely unable to check out certain repositories on windows, which work fine on linux.
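That lossy-checkout behavior can be sketched in a few lines (the function name is hypothetical, not git's actual code): the repository's stored path bytes are decoded as potentially ill-formed UTF-8, so the checkout proceeds on a Unicode-only filesystem at the cost of reversibility.

```rust
// Hypothetical model of the lossy checkout described above: stored path bytes
// are decoded as (potentially ill-formed) UTF-8, replacing bad sequences with
// U+FFFD so the checkout can proceed on a Unicode-only filesystem.
fn checkout_name(stored: &[u8]) -> String {
    String::from_utf8_lossy(stored).into_owned()
}

fn main() {
    // Valid UTF-8 passes through untouched...
    assert_eq!(checkout_name(b"src/main.rs"), "src/main.rs");
    // ...while a Linux-legal but ill-formed name is degraded, not rejected.
    assert_eq!(checkout_name(b"notes\xFF.txt"), "notes\u{FFFD}.txt");
}
```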
Then why can't you use

Also, what does git do when there are files named
```rust
OsStr::new<S: AsRef<OsStr> + ?Sized>(s: &S) -> &OsStr
impl AsRef<OsStr> for str
```
This would needlessly prevent things like non-utf8 paths on unix systems, which are perfectly fine to pass through to the underlying system APIs. FWIW, I agree with @Diggsey that including both lossy and non-lossy variants of this functionality is useful and matches our existing handling of this kind of situation. I also agree that

In any case, it would be helpful to separate out the questions of:
It's helpful to work toward consensus on the first two parts before nailing down the name; for the time being, let's just take
So to make this concrete, I believe having both lossy and non-lossy versions would look like:

```rust
impl OsString {
    fn from_bytes(v: Vec<u8>) -> Option<OsString>;
    fn from_bytes_lossy(v: &[u8]) -> Cow<OsStr>;
}

impl OsStr {
    fn from_bytes(v: &[u8]) -> Option<&OsStr>;
    fn to_bytes(&self) -> Option<&[u8]>;
    fn to_bytes_lossy(&self) -> Cow<[u8]>;
    fn to_cstring(&self) -> Option<CString>;
    fn to_cstring_lossy(&self) -> CString;
}
```

All of these would be an infallible transmute on Unix. On Windows the non-lossy methods would fail on invalid utf16, where the lossy versions are all semantically equivalent to the behavior in

To mirror the existing situation of
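On Unix, the non-lossy half of this sketch is easy to model today with the stable `OsStrExt` extension trait. The free functions below are hypothetical stand-ins for the proposed methods, not real std APIs:

```rust
use std::ffi::{CString, OsStr};
use std::os::unix::ffi::OsStrExt; // Unix-only view of OsStr as bytes

// Stand-in for the proposed `OsStr::to_bytes`: infallible on Unix.
fn to_bytes(s: &OsStr) -> Option<&[u8]> {
    Some(s.as_bytes())
}

// Stand-in for the proposed `OsStr::to_cstring`: on Unix this is just
// `CString::new(..).ok()`, failing only on an interior NUL byte.
fn to_cstring(s: &OsStr) -> Option<CString> {
    CString::new(s.as_bytes()).ok()
}

fn main() {
    let s = OsStr::new("hello");
    assert_eq!(to_bytes(s), Some(&b"hello"[..]));
    assert!(to_cstring(s).is_some());

    // An interior NUL is the one failure mode on Unix.
    assert!(to_cstring(OsStr::new("a\0b")).is_none());
}
```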
That looks mostly correct (modulo naming), with a few caveats:
Yeah that's a good point about recovering the vector in

Could you elaborate a bit on why wide conversions should also be available? The motivation for byte conversions is that it comes up quite a bit on all platforms, but wide strings seem like they'll only ever be requested on Windows, in which case you can always do it in a non-lossy fashion.
Ok, I've pushed an update with the lossy versions in the API along with a few associated error types coming out of these APIs. I've also updated the names to use "narrow" as the "platform bytes" term, but I'm not super in love with the terminology either, so I'm certainly open to suggestions!
While I admit it's less common, lots of things use (unchecked) UTF16 as their primary encoding: Mono (which replaces invalid UTF16 sequences with the replacement character, so that's a good precedent), Java, Javascript (sometimes), some windows file formats (which it may be useful to read/write on other platforms). My current preference for naming is:
Hm ok, even with those sorts of conversions in mind I think I'd still want to err on the conservative side for now and hold off on the wide conversions. I'd be fine considering the name (e.g. I'd personally shy away from What kind of confusion are you worried about with narrow strings on windows? Aren't byte sequences basically just narrow strings?
No, I thought we'd already established: this particular way of treating byte strings is entirely unix specific: on windows a byte string is normally in the system codepage, eg. the input to

The only basis for including these conversions in the portable subset of rust is to support interop with unix-based tools (hence the 'unix' part of the name). I don't know how many ways I can say this, but utf8 is not specific to windows: literally nothing in the OS uses it. The only reason utf8 is used in this case is because they are unix paths, and rust has made the decision (rightly or wrongly) to use utf8 when converting from a unix path to a unicode string.

You can't have it both ways: either this function is unix specific, and so it makes sense to use utf8 on windows, or the function is not unix specific, in which case it doesn't make sense to use utf8. However, for the use-cases you've described (which all involve using unix-style paths on windows) it sounds like you want the former.
`fn to_cstring_lossy(&self) -> CString;`

It's not obvious what this is supposed to do if the OsStr contains NUL.
Ah yeah that was a mistake in my comment, the RFC has the correct signature:

`fn to_cstring_lossy(&self) -> Result<CString, NulError>;`
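The failure mode in question comes from `CString::new` itself, which is stable std API; a quick demonstration of the `NulError` that the corrected signature surfaces:

```rust
use std::ffi::CString;

fn main() {
    // No interior NUL: conversion succeeds.
    assert!(CString::new("fine").is_ok());

    // Interior NUL: `CString::new` returns a `NulError` carrying the
    // position of the offending byte. No amount of lossy Unicode handling
    // removes this failure mode, hence the `Result` in the signature.
    let err = CString::new("bad\0byte").unwrap_err();
    assert_eq!(err.nul_position(), 3);
}
```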
🔔 This RFC is now entering its week-long final comment period 🔔
I'm not sure the comment about these functions being Unix-specific in some sense was ever fully resolved. The justification for the API in this RFC is that 8-bit strings are most likely UTF-8 because they came from something like a tar file created on Linux, or a git clone. That's a dangerous notion on Windows. If you're dealing with a C library on Windows which takes a
The comments on this were not really positive -- are you accepting it anyway?
A final comment period does not mean it's going to be accepted, but rather means we're going to move forward with it (in one direction or another).
To review this, I decided to bring myself back up to speed with
Scary! Turns out this is only about buffer overruns which we should be able to fix in Rust, but still a little off-putting.
Fascinating! Seems like they're literally recommending not using this function? Along these lines, it seems like it would be best if we avoided these functions for now? They could perhaps be dealt with at a later date with some

Another point I'd like to mention is that if you're passing an

Regardless, though, after rereading some discussion here, it seems like the point of contention may be largely the naming of these functions rather than the implementation itself. Sticking with unicode on Windows provides predictable behavior at least, and we just need to clearly signal that in the name. Unfortunately we may not have quite found the best name yet:
I seem to always lean a different direction every time I come to thinking about these names, right now we could consider biting the bullet and just using "bytes" and taking the hit for
TL;DR: revisiting this discussion for FCP, I'm increasingly skeptical about adding this API to std. To me, the main strikes against it are:
The argument in favor seems to be:
On reflection, the design here cuts somewhat against the way Rust normally tries to deal with portable strings -- i.e., by working with UTF-8 everywhere. But of course, if you want to do that, there are already APIs that can help you. I don't normally argue for ergonomic "punishments" for doing questionable things, but the conclusion for me is that the functionality being proposed here is hiding enough portability pitfalls that it shouldn't be offered as a "default" way of dealing with byte sequences, and probably shouldn't be offered directly in std at all. It makes it look like you're doing something portable while in fact masking potential portability issues.
The libs team discussed this RFC today and the decision was to close for now. Many of the concerns @aturon pointed out seem to be the consensus, so it's relatively clear that this is something that should bake externally before moving into the standard library if so.