Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Stabilize OS string to bytes conversions #1255

Closed

Conversation

alexcrichton
Copy link
Member

Stabilize convenience methods for converting between OS strings and Rust's utf-8
string types, specifically stabilize the following APIs:

  • OsString::from_platform_bytes (renamed from OsString::from_bytes)
  • OsStr::from_platform_bytes (added API to mirror OsString)
  • OsStr::to_bytes
  • OsStr::to_cstring

Rendered

Stabilize convenience methods for converting between OS strings and Rust's utf-8
string types, specifically stabilize the following APIs:

* `OsString::from_platform_bytes` (renamed from `OsString::from_bytes`)
* `OsStr::from_platform_bytes` (added API to mirror `OsString`)
* `OsStr::to_bytes`
* `OsStr::to_cstring`
@alexcrichton alexcrichton self-assigned this Aug 14, 2015
@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Aug 14, 2015
@alexcrichton
Copy link
Member Author

@nagisa
Copy link
Member

nagisa commented Aug 14, 2015

Worth noting rust-lang/rust#27390 contains an extensive discussion between two camps.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 14, 2015

Ignoring our disagreement about lossy vs non-lossy for now (see rust-lang/rust#27390 for that), I still think "platform_bytes" is a terrible name for these functions.

On windows, "platform bytes" means the encoding determined by the system code-page, ie. the encoding expected by MessageBoxA and other narrow versions of OS functions.

Even from your own description, it's clear this function is not intended to convert from a "platfrom specific encoding" to an OsString, so why name it as such?

In this case, the source encoding is "potentially ill-formed utf8" (borrowing terminology from the WTF8 spec), so why not just call it from_ill_formed_utf8?

See this PR: rust-lang/rust#27657 for how I would like this API to look.

@alexcrichton I did try messaging you on IRC about actually taking up @aturon's offer.

@remram44
Copy link

Note that I don't have strong feelings about this, I was just providing the OsStr version of a method that already existed on OsString.

My opinion is that, since the only use-case we can think of is paths, this is probably worth dropping and replacing by specific methods on Path/PathBuf, that would fill the specific scenarios users are going to encounter, rather than keeping stuff with weird behavior on OsString.

It also makes documentation easier, since we can clearly state what a Path method is for, rather than explaining what the OsString method does and hoping that the user understands that he is to use it for his specific Path need.


* On Unix, return `Some` always on `to_bytes` and call `CString::new(..).ok()`
for `to_cstring`.
* On Windows, attempt to interpret the `&[u16]` as utf-16, and if successful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-16 is not actually involved here, is it? On Windows, this tries to interpret WTF-8 bytes as UTF-8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation-wise, yes, but conceptually OsString on Windows is &[u16] so this is more a description of what's metaphorically happening rather than literally.

@alexcrichton
Copy link
Member Author

@Diggsey

I agree that the name from_platform_bytes isn't precisely the best, I was having trouble coming up with a better name, however. A name like from_ill_formed_utf8 also seems not quite appropriate because I feel it doesn't convey the fact that on Unix it's just a transmute and on Windows it's just a str::from_utf8. For example on Unix there's no actual utf8 involved, and on Windows ill-formed utf8 doesn't actually succeed. Overall from_bytes seems like one of the best options, but that's already taken by the stabilized <OsStr as OsStrExt>::from_bytes.

I definitely wanted to take some time to consider the name as part of this RFC, but I didn't want to flesh it out 100% ahead of time as it seems like the meat of your disagreement was not with the name but rather the functionality itself.

I did try messaging you on IRC about actually taking up @aturon's offer.

Gah sorry, I missed this! Was it a PM? Regardless, though, if you PM me on IRC I'd be more than willing to discuss this! You're also more than welcome to send a PR to this RFC.


@remram44

My opinion is that, since the only use-case we can think of is paths, this is probably worth dropping and replacing by specific methods on Path/PathBuf, that would fill the specific scenarios users are going to encounter, rather than keeping stuff with weird behavior on OsString.

Unfortunately this approach isn't very composable, however. By adding these methods to OsStr we avoid an explosion of various methods here and there. I also think that paths are not necessarily the only use case of OS strings, for example almost all APIs on Windows will take OS strings instead of normal strings, and not necessarily all of them are paths.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 17, 2015

it seems like the meat of your disagreement was not with the name but rather the functionality itself.

What I care about most is that authors of new software who use rust do not treat windows as a second-class citizen without even realising it. I don't expect everyone to care about all platforms equally, but it should at least be obvious when you're doing something that's platform specific.

Given that programmers are a very special kind of lazy, that also means providing more portable alternatives wherever possible.

There's more than one ways to satisfy these requirements: with better names, clear documentation on these methods explaining the problems with them, and the addition of more portable alternatives (such as lossy conversions) which can be suggested instead, I'd be willing to compromise by accepting their inclusion as well.

A name like from_ill_formed_utf8 also seems not quite appropriate because I feel it doesn't convey the fact that on Unix it's just a transmute and on Windows it's just a str::from_utf8.

I think it's better to name methods according their spec rather than their implementation details. This is a conversion method:
To: Self (OsString, whatever encoding that may have, that's inherent to OsString and therefore need not be part of the method name)
From: potentially ill-formed utf8 (a byte array: the type says nothing about what it contains, and so it's important to specify in the method name how we are interpretting it)

What you said was missing falls out as a natural consequence: the reason this is a transmute on linux is because "From == To", the reason it may fail on windows is because "From != To", and the conversion is not injective. This sort of reasoning is common to all abstract conversions.

on Unix there's no actual utf8 involved

I disagree: unix may not enforce that its paths be valid utf8, but it (and by 'it' I mean the general unix ecosystem, not necessarily the OS) will try to interpret paths as utf8 wherever they need to be converted to a string or displayed.

Imagine that rust used utf-16 throughout, including for its String class: the conversion from OsString to String on linux would still try to interpret the OsString as utf8. Doesn't that suggest to you that OsStrings (and by extension, filenames/paths) on linux should be thought of as more than just bytes with no associated encoding?

@remram44
Copy link

I disagree: unix may not enforce that its paths be valid utf8, but it [...] will try to interpret paths as utf8 wherever they need to be converted to a string or displayed.

This is actually bugging me as well. Shouldn't the system locale (or codepage) be used there, the way any other program would? There is no reason for a latin-1 system to show replacement characters everywhere just because the Rust language decided to represent strings as UTF-8 internally.

@SimonSapin
Copy link
Contributor

@remram44 Rust chose long ago not to support encodings other than UTF-8 for displaying std::path::Path values. This was mentioned in RFC 474 and in RFC 517, but admittedly not discussed specifically that I could find since the introduction of the RFC process.

There are reasons for this, but this is out of scope for this RFC. If you want to discuss it more, please file a separate issue (and feel free to CC me).

@remram44
Copy link

this is out of scope for this RFC

I don't think that it is. Whether the locale-aware methods should make it into std or a separate crate, it is an argument against the current version of these methods.

@SimonSapin
Copy link
Contributor

locale-aware methods

OsStr::to_str, Path::to_str, and Path::display are stable, available on all platforms, and assume UTF-8 on Unix. This is not gonna change.

@alexcrichton
Copy link
Member Author

@Diggsey

Could you elaborate a little more on why you think these functions are treating Windows as a second class citizen? Performing a conversion between [u16] and [u8] simply cannot be done in a lossless fashion on Windows, but that doesn't mean that these functions cannot be provided. I would reason that if these functions didn't exist then Unix would be "second class" because it would be more ergonomic to return an error on valid data than to use #[cfg] to work in more situations.

Also, I disagree that having a lossy conversion is a more portable alternative. The conversions will only be lossy on Windows, and the API does not indicate this at all in terms of the types coming in and out, so it's even easier to write code assuming the conversion always succeeds and is faithful to the original data (which isn't true). In the Option-returning versions you are forced to handle the fact that the conversion can fail, whereas in the lossy situations you can easily start relying on the lossiness, and these functions will probably want to be avoided in robust applications anyway.

I disagree: unix may not enforce that its paths be valid utf8, but it (and by 'it' I mean the general unix ecosystem, not necessarily the OS) will try to interpret paths as utf8 wherever they need to be converted to a string or displayed.

I think it's very important to try to not conflate concerns with these methods. The purpose of an OsString isn't to be displayed, but rather to be handed back to the OS (or earned from the OS at some point). Consequently I think it's orthogonal to consider how these bytes are normally interpreted because that's a separable question from creating the OS string in the first place.

As a result, despite most Unix implementations attempting to interpret bytes as utf8, it's the job of OsString to ferry along bytes completely independently of this assumption.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 18, 2015

Also, I disagree that having a lossy conversion is a more portable alternative. The conversions will only be lossy on Windows, and the API does not indicate this at all in terms of the types coming in and out, so it's even easier to write code assuming the conversion always succeeds and is faithful to the original data (which isn't true).

Why would lossiness need to be expressed in terms of the types? (potential) lossiness is a property of the conversion itself, not the input or output. Under the compromise I described before, I would expect the lossy versions to be named xxx_lossy to mirror those on String, so in that case there's even less scope for confusion.

The lossy conversion is more portable because it's impossible to write code which is simultaneously portable, and relies on this conversion being non-lossy: that reliance is itself non-portable. By not providing the non-lossy version, or by providing it along with a lossy version, and documenting these problems clearly, it's less likely that code which has this reliance will be written by accident.

I think it's very important to try to not conflate concerns with these methods. The purpose of an OsString isn't to be displayed, but rather to be handed back to the OS (or earned from the OS at some point).

If that was true, we wouldn't bother with strong typing at all, and could just make OsString an alias of Vec<u8> or Vec<u16> depending on the OS.

It's not, and that's because the purpose of OsString is much wider than you're suggesting. I'd even go as far to say that the main reason for OsString taking the form it does, is for enabling efficient conversions to/from String (otherwise why bother with WTF8 at all?).

As a result, despite most Unix implementations attempting to interpret bytes as utf8, it's the job of OsString to ferry along bytes completely independently of this assumption.

That's not incompatible with what I was saying: it's quite possible (even usual, outside of rust) for strings to have an encoding which should be used when interpretting them, and also "ferry along bytes completely independently of that assumption".

To reiterate, I'm not saying that invalid utf8 in an OsString should be considered an error on linux, I'm saying that OsStrings are more than just arbitrary bytes: they are arbitrary bytes intended to be interpretted as utf8. (Otherwise it would make no sense to provide conversions to String without specifying an encoding)

@Diggsey
Copy link
Contributor

Diggsey commented Aug 19, 2015

Another possible name would be from_unix_path_lossy/from_windows_path_lossy.

@arielb1
Copy link
Contributor

arielb1 commented Aug 19, 2015

from_platform_bytes is a terrible name for these functions. To me it suggest that the function is the equivalent of Python's lambda s: s.decode(sys.getfilesystemencoding()).

Also, I am not really sure why we would create a function that can fail on Windows and can't fail on Linux - I would prefer either to fail on both platforms or output valid Unicode (possibly inserting replacement characters) on both platforms (are there functions for doing that?). In most cases I would prefer programs to at least issue a warning rather than silently create files with replacement characters in their filename.

@SimonSapin
Copy link
Contributor

I would prefer either to fail on both platforms or output valid Unicode (possibly inserting replacement characters) on both platforms (are there functions for doing that?)

OsStr already has two methods: to_str() -> Option<&str> and to_string_lossy() -> Cow<str>

@arielb1
Copy link
Contributor

arielb1 commented Aug 19, 2015

@SimonSapin

And in the reverse direction?

@Diggsey
Copy link
Contributor

Diggsey commented Aug 19, 2015

@arielb1
str/String trivially converts to OsString losslessly on every platform (it's just a transmute).

@alexcrichton's motivation for these additional functions is for interoperability with existing C code which assumes paths are represented as byte arrays on every platform. (for example, git stores all paths internally as byte arrays).

If you look at how git for windows works when trying to check out a repository which contains invalid UTF8 in its paths, you'll see that it succeeds, by using a lossy conversion. Since the git repository itself is the "source of truth" it doesn't matter that the conversion is not reversible. (The reverse conversion only ever occurs when checking in new files).

This is IMO, an infinitely better outcome than being completely unable to check out certain repositories on windows, which work fine on linux.

@arielb1
Copy link
Contributor

arielb1 commented Aug 19, 2015

@Diggsey

then why can't you use str::from_utf8_lossy instead of this function if you want to be cross-platform?

also, what does git do when there are files named README and readme in the same repository?

@SimonSapin
Copy link
Contributor

@arielb1

OsStr::new<S: AsRef<OsStr> + ?Sized>(s: &S) -> &OsStr

impl AsRef<OsStr> for str

@aturon
Copy link
Member

aturon commented Aug 19, 2015

@arielb1

then why can't you use str::from_utf8_lossy instead of this function if you want to be cross-platform?

This would needlessly prevent things like non-utf8 paths on unix systems, which are perfectly fine to pass through to the underlying system APIs.

FWIW, I agree with @Diggsey that including both lossy and non-lossy variants of this functionality is useful and matches our existing handling of this kind of situation. I also agree that from_platform_bytes is not an ideal name. Perhaps something like interpret_bytes and interpret_bytes_lossy would be enough to call attention to the odd behavior here? (Just a thought.)

In any case, it would be helpful to separate out the questions of:

  • Should we have functions like this at all?
  • Which variants? Lossy, non-lossy, both?
  • Naming?

It's helpful to work toward consensus on the first two parts before nailing down the name; for the time being, let's just take from_platform_bytes as a standin for an eventual, better name.

@alexcrichton
Copy link
Member Author

So to make this concrete, I believe having both lossy and non-lossy versions would look like:

impl OsString {                                            
    fn from_bytes(v: Vec<u8>) -> Option<OsString>;         
    fn from_bytes_lossy(v: &[u8]) -> Cow<OsStr>;           
}                                                          

impl OsStr {                                               
    fn from_bytes(v: &[u8]) -> Option<&OsStr>;             
    fn to_bytes(&self) -> Option<&[u8]>;                   
    fn to_bytes_lossy(&self) -> Cow<[u8]>;                 
    fn to_cstring(&self) -> Option<CString>;               
    fn to_cstring_lossy(&self) -> CString;                 
}                                                          

All of these would be an infallible transmute on Unix and on Windows the non-lossy methods would fail on invalid utf16 where the lossy versions are all semantically equivalent to the behavior in String::from_utf16_lossy.

To mirror the existing situation of [u8] and str conversions I'm ok with this set of methods. So separating the concern about naming are there thoughts on this? Should some of these not be provided? Should more be provided?

@Diggsey
Copy link
Contributor

Diggsey commented Aug 20, 2015

That looks mostly correct (modulo naming), with a few caveats:

  • OsString::from_bytes should return Result<OsString, FromUtf8Error>, otherwise you lose the Vec in a failed conversion.
  • Conversions to/from windows paths would also be useful (from_wide, from_wide_lossy, to_wide, to_wide_lossy)
  • CString has its own problems, although they're not integral to this particular issue.

@alexcrichton
Copy link
Member Author

Yeah that's a good point about recovering the vector in OsString::from_bytes, I'd probably prefer the error type to not be FromUtf8Error exactly but a new wrapper type could certainly be made.

Could you elaborate a bit on why wide conversions should also be available? The motivation for byte conversions is that it comes up quite a bit on all platforms, but wide strings seem like they'll only ever be requested on Windows in which case you can always do it in a non-lossy fashion.

@alexcrichton
Copy link
Member Author

Ok, I've pushed an update with the lossy versions in the API along with a few associated error types coming out of these APIs. I've also updated the names to use "narrow" as the "platform bytes" term, but I'm not super in love with the terminology either, so I'm certainly open to suggestions!

@Diggsey
Copy link
Contributor

Diggsey commented Aug 21, 2015

Could you elaborate a bit on why wide conversions should also be available? The motivation for byte conversions is that it comes up quite a bit on all platforms, but wide strings seem like they'll only ever be requested on Windows in which case you can always do it in a non-lossy fashion.

While I admit it's less common, lots of things use (unchecked) UTF16 as their primary encoding: Mono (which replaces invalid UTF16 sequences with the replacement character, so that's a good precedent), Java, Javascript (sometimes), some windows file formats (which it may be useful to read/write on other platforms).

My current preference for naming is:
from_unix_path, followed by from_unix:

  • They won't be confused with narrow strings on windows
  • They are more convenient than from_ill_formed_utf8
  • The first one makes it clear that this only really makes sense for dealing with paths
  • If windows path conversion functions are added, the naming scheme is clear and consistent
  • It's clear that it will be a no-op on unix-like platforms
  • The potential for failure on non-unixy platforms is more obvious

@alexcrichton
Copy link
Member Author

Hm ok, even with those sorts of conversions in mind I think I'd still want to err on the conservative side for now and hold off on the wide conversions. I'd be fine considering the name (e.g. from_wide and friends), however.

I'd personally shy away from unix_path as the component here because it's not always having to do with paths (sometimes it's just strings going back and forth), and unix as well seems a little too unix-specific to me because you can generate byte sequences and consume them solely on Windows (and there's nothing unix-specific about what happens on Windows).

What kind of confusion are you worried about with narrow strings on windows? Aren't byte sequences basically just narrow strings?

@Diggsey
Copy link
Contributor

Diggsey commented Aug 24, 2015

What kind of confusion are you worried about with narrow strings on windows? Aren't byte sequences basically just narrow strings?

No, I thought we'd already established: this particular way of treating byte strings is entirely unix specific: on windows a byte string is normallly in the system codepage, eg. the input to MultiByteToWideChar. All OS functions which take a byte string expect to receive strings in this encoding.

The only basis for including these conversions in the portable subset of rust is to support interop with unix-based tools (hence the 'unix' part of the name).

I don't know how many ways I can say this, but utf8 is not specific to windows: literally nothing in the OS uses it. The only reason utf8 is used in this case is because they are unix paths, and rust has made the decision (rightly or wrongly) to use utf8 when converting from a unix path to a unicode string.

You can't have it both ways: either this function is unix specific, and so it makes sense to use utf8 on windows, or the function is not unix specific, in which case it doesn't make sense to use utf8. However, for the use-cases you've described (which all involve using unix-style paths on windows) it sounds like you want the former.

@eefriedman
Copy link
Contributor

    fn to_cstring_lossy(&self) -> CString;                 

It's not obvious what this is supposed to do if the OsStr contains NUL.

@alexcrichton
Copy link
Member Author

Ah yeah that was a mistake in my comment, the RFC has the correct signature:

    fn to_cstring_lossy(&self) -> Result<CString, NulError>;

@alexcrichton
Copy link
Member Author

🔔 This RFC is now entering its week-long final comment period 🔔

@alexcrichton alexcrichton added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Oct 8, 2015
@eefriedman
Copy link
Contributor

I'm not sure the comment about these functions being Unix-specific in some sense was ever fully resolved. The justification for the API in this RFC is that 8-bit strings are most likely UTF-8 because they came from something like a tar file created on Linux, or a git clone. That's a dangerous notion on Windows. If you're dealing with a C library on Windows which takes a char*, your first guess in most cases should be that it expects 8-bit strings encoded using the system locale, not UTF-8.

@remram44
Copy link

remram44 commented Oct 8, 2015

The comments on this were not really positive -- are you accepting it anyway?

@sfackler
Copy link
Member

sfackler commented Oct 8, 2015

@remram44

A final comment period does not mean it's going to be accepted, but rather means we're going to move forward with it (in one direction or another).

@alexcrichton
Copy link
Member Author

@eefriedman

To review this, I decided to bring myself back up to speed with MultiByteToWideChar and those functions, so I took a look at the docs MSDN has, at which I was met with some interesting points:

Caution Using the MultiByteToWideChar function incorrectly can compromise the security of your application.

Scary! Turns out this is only about buffer overruns which we should be able to fix in Rust, but still a little off-putting.

For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode.

Fascinating! Seems like they're literally recommending not using this function? Along these lines, it seems like it would be best if we avoided these functions for now? They could perhaps be dealt with at a later date with some _locale functions that also perhaps do crazy things on Unix.

Another point I'd like to mention is that if you're passing an OsString to a library on Windows (like you mentioned), any robust library for windows will accept a wide string. Otherwise other robust libraries will have documentation saying what it means to have a char* on Windows, and otherwise you're a bit in the weeds.


Regardless, though, after rereading some discussion here, it seems like the point of contention may be largely the naming of these functions rather than the implementation itself. Stick with unicode on Windows provides a predictable behavior at least, and we just need to clearly signal that in the name.

Unfortunately we may not have quite found the best name yet:

  • bytes - taken up by unix::OsStringExt::from_bytes partially, but also a tad bland
  • utf8 - unfortunately this isn't quite true on Unix because nothing is actually guaranteed to be utf8
  • ill_formed_utf8 - better than "utf8", but a bit wordy...
  • unix - it seems a little odd to call "let's think this is utf8 on Windows" a "unix artifact", overall it seems to me at least that the connection between these functions and the name "unix" isn't quite there.
  • narrow - also probably not quite true if "narrow string" means "system code page" on Windows. Also doesn't have much meaning on Unix.

I seem to always lean a different direction every time I come to thinking about these names, right now we could consider biting the bullet and just using "bytes" and taking the hit for unix::OsStringExt::from_bytes duplication (e.g. mandating UFCS there), but that seems like not a great place to be in.

@aturon
Copy link
Member

aturon commented Oct 13, 2015

TL;DR: revisiting this discussion for FCP, I'm increasingly skeptical about adding this API to std.

To me, the main strikes against it are:

  • The requirements it places on the byte sequence are very different across platforms, which means you can't portably use these functions to e.g. decode bytes from a file -- you will end up with imposing different encoding requirements that depend solely on your build target.
  • The behavior on Windows seems to be a compromise that no-one wants: it's not actually "unix-like", since on unix systems we allow arbitrary byte sequences, and it's not "Windows-like" either (for reasons that have been discussed at length on thread).
  • The fact that we can't seem to find good names for these functions I think underscores the lack of clarity about what they're actually doing.

The argument in favor seems to be:

  • Sometimes, you just have a byte sequence that you need to feed into an OsString. This API gives you a very easy way to do that that provides maximal flexibility on unix systems, and something that will often work in unix-centric software running on Windows (i.e. if you assume that things like paths get serialized via UTF-8).

On reflection, the design here cuts somewhat against the way Rust normally tries to deal with portable strings -- i.e., by working with UTF-8 everywhere. But of course, if you want to do that, there are already APIs that can help you.

I don't normally argue for ergonomic "punishments" for doing questionable things, but the conclusion for me is that the functionality being proposed here is hiding enough portability pitfalls that it shouldn't be offered as a "default" way of dealing with byte sequences, and probably shouldn't be offered directly in std at all. It makes it look like you're doing something portable while in fact masking potential portability issues.

@alexcrichton
Copy link
Member Author

The libs team discussed this RFC today and the decision was to close for now. Many of the concerns @aturon pointed out seem to be the consensus, so it's relatively clear that this is something that should bake externally before moving into the standard library if so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants