-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: conventions for placement of unsafe APIs #240
Conversation
string. This method makes it easy to work with the byte-based representation | ||
of the string, but thereby also allows violation of the utf8 guarantee. | ||
|
||
* A `raw` submodule with a number of free functions, like `from_parts`, that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are problems with using a raw
module that are not addressed. There are modules are btree
treemap
, trie
exposing more than one type and a raw
module would fail to distinguish between these. A raw
module also interacts very poorly with re-exports and adds verbosity to usage of the API.
The motivation for making unsafe
code uglier in these cases is not laid out. The point of unsafe
is to draw a boundary between safe and unsafe
code without the need for conventions to distinguish between these. The rule about construction from raw pointer representation is very obscure, and doesn't change the fact that it makes the API less consistent and more painful to use without gaining anything in return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are problems with using a raw module that are not addressed. There are modules are btree treemap, trie exposing more than one type and a raw module would fail to distinguish between these. A raw module also interacts very poorly with re-exports and adds verbosity to usage of the API. The motivation for making unsafe code uglier in these cases is not laid out.
I'm a little confused by this comment. The RFC recommends making unsafe functions into methods or static functions in the same cases you'd do so for safe APIs. It cites as an explicit benefit that importing/using these APIs becomes easier by doing so:
The benefit to moving unsafe APIs into methods (resp. static functions) is the usual one: you can gain easy access to these APIs merely by having a value of the type (resp. importing the type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You keep asserting that this makes the API harder to use, but that's extremely subjective. I've always found the raw
modules in str
/slice
/string
/vec
make the API much easier to use, because any time I need to construct a value from raw pointers, I know precisely where to look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aturon: Sorry, I misinterpreted the proposed guidelines. The reason I didn't find the details about how to handle those edge cases and a detailed rationale is because it's not what you're proposing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think much of the "know where to look" problem would be fixed by segmenting the things in the source, and also allowing rustdoc to divide functions/methods into user-defined subsections. This would mean it is clear both in the .rs file and in the docs.
This description of Arguably, "raw pointers" could be considered "low-level representation", but I'm not comfortable making that assumption, because only some of the functions actually treat the raw pointer as the low-level representation. Other functions, such as There's also a couple of functions that don't even use raw pointers or low-level representations. I'm thinking here of Given these last two functions, I think the more general description of |
* Use `raw` submodules to group together *all* manipulation of low-level | ||
representations. No module in `std` currently does this; existing modules | ||
provide some free functions in `raw`, and some unsafe methods, without a clear | ||
driving principle. The ergonomics of moving *everything* into free functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ergonomics of moving anything into free functions is poor. Downgrading more methods is worse, but doing it to a few is still more verbose and harder to navigate - especially due to re-exports. It also means introducing type prefixes in modules exposing more than one type.
@kballard Thanks for the clarifications. I will try to incorporate them into the RFC to more accurately convey the current state of affairs. That said, I'm curious what you think about the proposed conventions? One thing I'm not yet happy with is the recommendation for when |
underlying representation of a data structure (which is otherwise private), | ||
the API should use `raw` in its name. Specifically, `from_raw_parts` is the | ||
typical name used for constructing a value from e.g. a pointer-based | ||
representation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused here. Vec::from_raw_parts
seems to be covered under the convention for raw
submodules, as it is constructing a Vec
from the low-level representation (pointer, length, capacity). If it's put in the raw
submodule then it shouldn't have raw
in the name, because that's unnecessarily verbose (raw::from_raw_parts
?),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kballard It seems that the RFC text is unclear. I meant for from_raw_parts
to be a static function (according to the first bullet). The second bullet was just a clarification about the name.
I'm going to revise the RFC text and push; will ping when done.
Based on the With this rule, the usage because pretty simple. Do you want to construct a value and bypass invariant checks? Look in the |
|
||
# Unresolved questions | ||
|
||
The `core::raw` module provides structs with public representations equivalent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Box
, Closure
and Procedure
will end up being removed since they're becoming obsolete. That will only leave behind Slice
(could just be in the slice
module) and TraitObject
so it probably does need to be replaced.
@aturon I'd also like to see this RFC recommend that |
That's a great idea! |
@kballard I like the suggestion of making unsafe stuff internally separate and at the end, as this matches the current convention of seperating private/public methods into seperate impls as well. I'd rather just see an additional subheading or maybe a red coloured "unsafe operations" box, than any kind of hide/show unsafe functionality though. |
@kballard: Your definition of |
@thestinger I assume you mean methods would need to become functions. And no they wouldn't. Neither of those methods are value constructors. I even explicitly stated that methods that modify existing values (such as |
@kballard: There's no functional difference between the |
@kballard @thestinger I've pushed a significant revision, which I hope clarifies what's being proposed here. @kballard Specifically regarding naming conventions, I don't think that requiring a |
Just spitballing: if the convention is going to be that |
@jfager What benefit would that provide? Language rules != conventions. |
|
||
# Alternatives | ||
|
||
There are a few alternatives: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One unlisted alternative is the design I gave in another issue (I forget where) that uses associated items (RFC PR #195) to provide e.g. String::raw::from_parts()
. This solves the ergonomics of needing to import something to get access to the values. This would look like
impl String {
/// Raw operations
static raw: StringRaw = StringRaw;
}
struct StringRaw;
impl StringRaw {
unsafe fn from_parts(buf: *mut u8, length: uint, capacity: uint) -> String;
// ...
}
Personally, I think the ergonomics of this are nice. It's basically providing a tiny module structure for the static methods of a type, similar to how we use a module structure in general for types/functions. The biggest downside from a usability perspective is that it's not immediately obvious in rustdoc how this is supposed to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kballard I'm sorry I left this out -- I'll add a reference to it from the RFC text.
@kballard Of course convention != language rule; you weren't able to infer that the question was whether the convention should be promoted to a rule? If not a full-blown rule, then perhaps a lint detecting safe and unsafe methods in the same impl block. |
@jfager Of course I was, you don't need to be condescending. I'm telling you that there's no need to put hard rules into the language in order to enforce a convention. I don't believe this even qualifies for a stylistic lint. This is a recommendation that improves documentation, not a universal style. More importantly, while our standard libraries are going under heavy API review, most third-party code won't have the same level of API review and won't benefit from having compiler-enforced rules around the structure of their impl blocks. |
@kballard Hey, you're the one who felt the need to inform me that a language rule and a convention aren't the same thing. Anyways, that's a much better answer to the original question, thank you. It wasn't clear that this was intended primarily for the standard lib and not as a declaration of Idiomatic Rust Code that people should feel bad for ignoring in their own code. |
Can we stop focusing on documentation? We can equally improve |
@aturon I agree with these conventions in principle, though I am very curious about how this specifically affects the various |
That's right. Concretely, I think we'd be left with only |
I was thinking that a |
+1 For this proposal, unsafe methods/functions can already be grouped together by being unsafe and naming conventions, no point in making them more annoying to use with free functions in a raw module. |
# Rationale When dealing with strings, many functions deal with either a `char` (unicode codepoint) or a byte (utf-8 encoding related). There is often an inconsistent way in which methods are referred to as to whether they contain "byte", "char", or nothing in their name. There are also issues open to rename *all* methods to reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or byte_len()). The current state of String seems to largely be what is desired, so this PR proposes the following rationale for methods dealing with bytes or characters: > When constructing a string, the input encoding *must* be mentioned (e.g. > from_utf8). This makes it clear what exactly the input type is expected to be > in terms of encoding. > > When a method operates on anything related to an *index* within the string > such as length, capacity, position, etc, the method *implicitly* operates on > bytes. It is an understood fact that String is a utf-8 encoded string, and > burdening all methods with "bytes" would be redundant. > > When a method operates on the *contents* of a string, such as push() or pop(), > then "char" is the default type. A String can loosely be thought of as being a > collection of unicode codepoints, but not all collection-related operations > make sense because some can be woefully inefficient. # Method stabilization The following methods have been marked #[stable] * The String type itself * String::new * String::with_capacity * String::from_utf16_lossy * String::into_bytes * String::as_bytes * String::len * String::clear * String::as_slice The following methods have been marked #[unstable] * String::from_utf8 - The error type in the returned `Result` may change to provide a nicer message when it's `unwrap()`'d * String::from_utf8_lossy - The returned `MaybeOwned` type still needs stabilization * String::from_utf16 - The return type may change to become a `Result` which includes more contextual information like where the error occurred. * String::from_chars - This is equivalent to iter().collect(), but currently not as ergonomic. * String::from_char - This method is the equivalent of Vec::from_elem, and has been marked #[unstable] becuase it can be seen as a duplicate of iterator-based functionality as well as possibly being renamed. * String::push_str - This *can* be emulated with .extend(foo.chars()), but is less efficient because of decoding/encoding. Due to the desire to minimize API surface this may be able to be removed in the future for something possibly generic with no loss in performance. * String::grow - This is a duplicate of iterator-based functionality, which may become more ergonomic in the future. * String::capacity - This function was just added. * String::push - This function was just added. * String::pop - This function was just added. * String::truncate - The failure conventions around String methods and byte indices isn't totally clear at this time, so the failure semantics and return value of this method are subject to change. * String::as_mut_vec - the naming of this method may change. * string::raw::* - these functions are all waiting on [an RFC][2] [2]: rust-lang/rfcs#240 The following method have been marked #[experimental] * String::from_str - This function only exists as it's more efficient than to_string(), but having a less ergonomic function for performance reasons isn't the greatest reason to keep it around. Like Vec::push_all, this has been marked experimental for now. The following methods have been #[deprecated] * String::append - This method has been deprecated to remain consistent with the deprecation of Vec::append. While convenient, it is one of the only functional-style apis on String, and requires more though as to whether it belongs as a first-class method or now (and how it relates to other collections). * String::from_byte - This is fairly rare functionality and can be emulated with str::from_utf8 plus an assert plus a call to to_string(). Additionally, String::from_char could possibly be used. * String::byte_capacity - Renamed to String::capacity due to the rationale above. * String::push_char - Renamed to String::push due to the rationale above. * String::pop_char - Renamed to String::pop due to the rationale above. * String::push_bytes - There are a number of `unsafe` functions on the `String` type which allow bypassing utf-8 checks. These have all been deprecated in favor of calling `.as_mut_vec()` and then operating directly on the vector returned. These methods were deprecated because naming them with relation to other methods was difficult to rationalize and it's arguably more composable to call .as_mut_vec(). * String::as_mut_bytes - See push_bytes * String::push_byte - See push_bytes * String::pop_byte - See push_bytes * String::shift_byte - See push_bytes # Reservation methods This commit does not yet touch the methods for reserving bytes. The methods on Vec have also not yet been modified. These methods are discussed in the upcoming [Collections reform RFC][1] [1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
# Rationale When dealing with strings, many functions deal with either a `char` (unicode codepoint) or a byte (utf-8 encoding related). There is often an inconsistent way in which methods are referred to as to whether they contain "byte", "char", or nothing in their name. There are also issues open to rename *all* methods to reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or byte_len()). The current state of String seems to largely be what is desired, so this PR proposes the following rationale for methods dealing with bytes or characters: > When constructing a string, the input encoding *must* be mentioned (e.g. > from_utf8). This makes it clear what exactly the input type is expected to be > in terms of encoding. > > When a method operates on anything related to an *index* within the string > such as length, capacity, position, etc, the method *implicitly* operates on > bytes. It is an understood fact that String is a utf-8 encoded string, and > burdening all methods with "bytes" would be redundant. > > When a method operates on the *contents* of a string, such as push() or pop(), > then "char" is the default type. A String can loosely be thought of as being a > collection of unicode codepoints, but not all collection-related operations > make sense because some can be woefully inefficient. # Method stabilization The following methods have been marked #[stable] * The String type itself * String::new * String::with_capacity * String::from_utf16_lossy * String::into_bytes * String::as_bytes * String::len * String::clear * String::as_slice The following methods have been marked #[unstable] * String::from_utf8 - The error type in the returned `Result` may change to provide a nicer message when it's `unwrap()`'d * String::from_utf8_lossy - The returned `MaybeOwned` type still needs stabilization * String::from_utf16 - The return type may change to become a `Result` which includes more contextual information like where the error occurred. * String::from_chars - This is equivalent to iter().collect(), but currently not as ergonomic. * String::from_char - This method is the equivalent of Vec::from_elem, and has been marked #[unstable] becuase it can be seen as a duplicate of iterator-based functionality as well as possibly being renamed. * String::push_str - This *can* be emulated with .extend(foo.chars()), but is less efficient because of decoding/encoding. Due to the desire to minimize API surface this may be able to be removed in the future for something possibly generic with no loss in performance. * String::grow - This is a duplicate of iterator-based functionality, which may become more ergonomic in the future. * String::capacity - This function was just added. * String::push - This function was just added. * String::pop - This function was just added. * String::truncate - The failure conventions around String methods and byte indices isn't totally clear at this time, so the failure semantics and return value of this method are subject to change. * String::as_mut_vec - the naming of this method may change. * string::raw::* - these functions are all waiting on [an RFC][2] [2]: rust-lang/rfcs#240 The following method have been marked #[experimental] * String::from_str - This function only exists as it's more efficient than to_string(), but having a less ergonomic function for performance reasons isn't the greatest reason to keep it around. Like Vec::push_all, this has been marked experimental for now. The following methods have been #[deprecated] * String::append - This method has been deprecated to remain consistent with the deprecation of Vec::append. While convenient, it is one of the only functional-style apis on String, and requires more though as to whether it belongs as a first-class method or now (and how it relates to other collections). * String::from_byte - This is fairly rare functionality and can be emulated with str::from_utf8 plus an assert plus a call to to_string(). Additionally, String::from_char could possibly be used. * String::byte_capacity - Renamed to String::capacity due to the rationale above. * String::push_char - Renamed to String::push due to the rationale above. * String::pop_char - Renamed to String::pop due to the rationale above. * String::push_bytes - There are a number of `unsafe` functions on the `String` type which allow bypassing utf-8 checks. These have all been deprecated in favor of calling `.as_mut_vec()` and then operating directly on the vector returned. These methods were deprecated because naming them with relation to other methods was difficult to rationalize and it's arguably more composable to call .as_mut_vec(). * String::as_mut_bytes - See push_bytes * String::push_byte - See push_bytes * String::pop_byte - See push_bytes * String::shift_byte - See push_bytes # Reservation methods This commit does not yet touch the methods for reserving bytes. The methods on Vec have also not yet been modified. These methods are discussed in the upcoming [Collections reform RFC][1] [1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
# Rationale When dealing with strings, many functions deal with either a `char` (unicode codepoint) or a byte (utf-8 encoding related). There is often an inconsistent way in which methods are referred to as to whether they contain "byte", "char", or nothing in their name. There are also issues open to rename *all* methods to reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or byte_len()). The current state of String seems to largely be what is desired, so this PR proposes the following rationale for methods dealing with bytes or characters: > When constructing a string, the input encoding *must* be mentioned (e.g. > from_utf8). This makes it clear what exactly the input type is expected to be > in terms of encoding. > > When a method operates on anything related to an *index* within the string > such as length, capacity, position, etc, the method *implicitly* operates on > bytes. It is an understood fact that String is a utf-8 encoded string, and > burdening all methods with "bytes" would be redundant. > > When a method operates on the *contents* of a string, such as push() or pop(), > then "char" is the default type. A String can loosely be thought of as being a > collection of unicode codepoints, but not all collection-related operations > make sense because some can be woefully inefficient. # Method stabilization The following methods have been marked #[stable] * The String type itself * String::new * String::with_capacity * String::from_utf16_lossy * String::into_bytes * String::as_bytes * String::len * String::clear * String::as_slice The following methods have been marked #[unstable] * String::from_utf8 - The error type in the returned `Result` may change to provide a nicer message when it's `unwrap()`'d * String::from_utf8_lossy - The returned `MaybeOwned` type still needs stabilization * String::from_utf16 - The return type may change to become a `Result` which includes more contextual information like where the error occurred. * String::from_chars - This is equivalent to iter().collect(), but currently not as ergonomic. * String::from_char - This method is the equivalent of Vec::from_elem, and has been marked #[unstable] becuase it can be seen as a duplicate of iterator-based functionality as well as possibly being renamed. * String::push_str - This *can* be emulated with .extend(foo.chars()), but is less efficient because of decoding/encoding. Due to the desire to minimize API surface this may be able to be removed in the future for something possibly generic with no loss in performance. * String::grow - This is a duplicate of iterator-based functionality, which may become more ergonomic in the future. * String::capacity - This function was just added. * String::push - This function was just added. * String::pop - This function was just added. * String::truncate - The failure conventions around String methods and byte indices isn't totally clear at this time, so the failure semantics and return value of this method are subject to change. * String::as_mut_vec - the naming of this method may change. * string::raw::* - these functions are all waiting on [an RFC][2] [2]: rust-lang/rfcs#240 The following method have been marked #[experimental] * String::from_str - This function only exists as it's more efficient than to_string(), but having a less ergonomic function for performance reasons isn't the greatest reason to keep it around. Like Vec::push_all, this has been marked experimental for now. The following methods have been #[deprecated] * String::append - This method has been deprecated to remain consistent with the deprecation of Vec::append. While convenient, it is one of the only functional-style apis on String, and requires more though as to whether it belongs as a first-class method or now (and how it relates to other collections). * String::from_byte - This is fairly rare functionality and can be emulated with str::from_utf8 plus an assert plus a call to to_string(). Additionally, String::from_char could possibly be used. * String::byte_capacity - Renamed to String::capacity due to the rationale above. * String::push_char - Renamed to String::push due to the rationale above. * String::pop_char - Renamed to String::pop due to the rationale above. * String::push_bytes - There are a number of `unsafe` functions on the `String` type which allow bypassing utf-8 checks. These have all been deprecated in favor of calling `.as_mut_vec()` and then operating directly on the vector returned. These methods were deprecated because naming them with relation to other methods was difficult to rationalize and it's arguably more composable to call .as_mut_vec(). * String::as_mut_bytes - See push_bytes * String::push_byte - See push_bytes * String::pop_byte - See push_bytes * String::shift_byte - See push_bytes # Reservation methods This commit does not yet touch the methods for reserving bytes. The methods on Vec have also not yet been modified. These methods are discussed in the upcoming [Collections reform RFC][1] [1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
This is a conventions RFC for settling the location of
unsafe
APIs relativeto the types they work with, as well as the use of
raw
submodules.The brief summary is:
that safe APIs would be.
raw
submodules should be used only to provide APIs directly on low-levelrepresentations.
Rendered