Tracking Issue for `ascii::Char` (ACP 179) #110998
I guess I'll kick off the discussion about how to actually define the `ascii::Char` type:

```rust
#[repr(u8)]
enum Char {
    Null = 0,
    …
    Tilde = 127,
}
```

I'd like to first make sure I understand the pros of using this representation. Do I have them right?
Are there more? Am I mistaken about the above? And now, the mitigations:
In terms of negatives, why not do the enum in the first place? Good question. I'm not entirely sure. It feels to me like it stresses my weirdness budget. It's also a generally much bigger enum than most, and I wonder whether that will provoke any surprises in places. Maybe not. |
Do we need both […] |
Add `ascii::Char` (ACP#179) ACP second: rust-lang/libs-team#179 (comment) New tracking issue: rust-lang#110998 For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag. There are lots more things that could be added here, and room for further doc updates, but this seems like a plausible starting point PR. I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`. As a demonstration, I made a commit updating some formatting code to use this: scottmcm@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)
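For illustration, the `as_ascii` additions described in that PR can be used roughly like this on nightly (a sketch assuming the unstable `ascii_char` feature; exact signatures may differ):

```rust
#![feature(ascii_char)]

use std::ascii;

fn main() {
    // Single values: Option<ascii::Char>.
    let a: Option<ascii::Char> = b'A'.as_ascii();
    let e: Option<ascii::Char> = '€'.as_ascii();
    assert!(a.is_some());
    assert!(e.is_none());

    // Slices and strings: Option<&[ascii::Char]>, Some only if every element is ASCII.
    assert!(b"hello".as_ascii().is_some());
    assert!("héllo".as_ascii().is_none());
}
```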
Regarding the enum, @kupiakos said the following in […]

And mentioned icu4x's enum […], though given those names it's possible it's just doing that for the niche and not for direct use. (Man, I wish we had private variants as a possibility.)

To your questions, @BurntSushi, […]

```rust
'\0' => EscapeDebug::backslash(Null),
'\r' => EscapeDebug::backslash(CarriageReturn),
'\n' => EscapeDebug::backslash(LineFeed),
```

because without the variants it ends up being more like

```rust
'\0' => EscapeDebug::backslash(b'\0'.as_ascii().unwrap()),
'\r' => EscapeDebug::backslash(b'\r'.as_ascii().unwrap()),
'\n' => EscapeDebug::backslash(b'\n'.as_ascii().unwrap()),
```

which is nice in that it keeps the escape sequences lining up, but having to copy-paste […]. I guess this is yet another situation in which I'd like custom literals. Or something like […]

EDIT a while later: The PR to update […] |
Constify `[u8]::is_ascii` (unstably) UTF-8 checking was `const fn`-stabilized back in 1.63 (rust-lang#97367), but apparently ASCII checking was never const-ified, despite being simpler. New constness-tracking issue for `is_ascii`: rust-lang#111090. I noticed this while working on `ascii::Char`: rust-lang#110998
I wasn't in the discussion when this type was created (which I fully support), but is there a reason to explicitly use an enum instead of a newtype + internal niche attribute like […]? If we decide to go the pattern types route (#107606), then […]. IMHO, explicitly going with an enum kind of prevents these kinds of optimisations later down the road, since it forces an enum as the representation. |
@clarfonthey I think the discussion above answers your question about why we went with the enum for now. In summary, I think its principal upside is that it gives nice names to each character. Not particularly compelling, but nice. My general feeling is that nobody is stuck on using an enum here. My best guess is that unless something compelling comes up, I wouldn't be surprised if we changed to an opaque representation such that we could change to an enum later in a compatible manner. (I think that's possible?) Purely because it's the conservative route. I don't grok the second two paragraphs of your comment. Probably because I don't know what "pattern types" are. I also don't know what […] |
I mean, having nice names is still achievable with constants, and this still would make match expressions work as you'd expect. Personally, the benefits I'm seeing here are the ability to have more-performant code without the need for extra unsafe code, since any slice of ASCII characters is automatically valid UTF-8. Otherwise, a type invariant would mostly be unnecessary, and you could just have constants for the names. Explaining the two proposals:
The pattern type version is very enticing because it very naturally allows exhaustive matching, while still just being a compile-time constraint. And like I said, we could still have constants for the different names for characters without really affecting things. |
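As a concrete illustration of the constants-instead-of-variants idea mentioned above (a minimal sketch; `AsciiByte` and its constants are made-up names, not anything proposed for std):

```rust
// Hypothetical newtype with a private field; the invariant (0..=127) is
// enforced by the constructor rather than by the type's variants.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct AsciiByte(u8);

impl AsciiByte {
    pub const LINE_FEED: Self = AsciiByte(b'\n');
    pub const TILDE: Self = AsciiByte(b'~');

    pub const fn new(b: u8) -> Option<Self> {
        if b <= 127 { Some(AsciiByte(b)) } else { None }
    }
}

fn describe(c: AsciiByte) -> &'static str {
    // Associated constants still work in patterns, but unlike an enum the
    // match can never be exhaustive without a wildcard arm.
    match c {
        AsciiByte::LINE_FEED => "line feed",
        AsciiByte::TILDE => "tilde",
        _ => "something else",
    }
}

fn main() {
    assert_eq!(describe(AsciiByte::TILDE), "tilde");
}
```

The forced wildcard arm is exactly the exhaustiveness trade-off being discussed in this thread.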
Yes, as I said:
I don't know what the stabilization story is for pattern types or generic integers, but I would prefer we focus on finding a path to stabilization that doesn't require being blocked on hypothetical language features. Or at least, focusing discussion on whether we can utilize them later in a backwards-compatible fashion. I still don't really know what pattern types are (I just don't have time right now to grok it), but generic integer types seem like an implementation detail that's consistent with defining […] |
Right, the point here is that any opaque type would be equivalent to the primary hypothetical future implementations. So, explicitly not blocking on any particular path forward, but making sure that they're compatible. |
Though you can […], I agree that this could absolutely be done with the magic attributes. But is there anything in particular that you think is better with that approach? The enum version still gets the niche, notably:
|
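The demonstration referenced above did not survive this capture; a minimal check of the niche looks roughly like this (assuming nightly with the unstable `ascii_char` feature; the single-byte `Option` layout is observed behaviour, not a documented guarantee):

```rust
#![feature(ascii_char)]

use std::ascii;
use std::mem::size_of;

fn main() {
    // The 128-variant enum leaves 128 unused bit patterns in a u8,
    // so Option<ascii::Char> currently fits in a single byte.
    assert_eq!(size_of::<ascii::Char>(), 1);
    assert_eq!(size_of::<Option<ascii::Char>>(), 1);
}
```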
I will admit that the lack of an ability to […]. As far as benefits -- the main ones I see are the ability to unify this type with any one of the potential future features I mentioned. If we stabilise an enum now, we are explicitly locked out of this. That said, there is precedent for separating out the […] |
I think it's important that this stay a separate type. To me, it's an advantage that this not just be a […]. For example, if we have pattern types, I don't want to […]. Should it be something else to be able to stabilize a subset sooner? Maybe; I don't know. I think I'll leave questions like that more to libs-api. |
That's fair, and in that case I'll concede that the enum works here. I still am not sure if it's the best-performing version (I use the […] |
`ascii::Char`-ify the escaping code in `core` This means that `EscapeIterInner::as_str` no longer needs unsafe code, because the type system ensures the internal buffer is only ASCII, and thus valid UTF-8. Come to think of it, this also gives it a (non-guaranteed) niche. cc `@BurntSushi` as potentially interested `ascii::Char` tracking issue: rust-lang#110998
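To illustrate the idea behind that PR: a buffer that is `ascii::Char` by construction is valid UTF-8 by construction, so viewing it as `&str` needs no `unsafe` and no runtime check. A sketch, assuming nightly with the unstable `ascii_char`/`ascii_char_variants` features; the variant names follow the current unstable enum and may change:

```rust
#![feature(ascii_char, ascii_char_variants)]

use std::ascii::Char;

fn main() {
    // The type system guarantees every element is ASCII, so the conversion
    // to &str is infallible and check-free.
    let buf: [Char; 4] = [Char::ReverseSolidus, Char::SmallN, Char::CapitalO, Char::CapitalK];
    let s: &str = buf.as_str();
    assert_eq!(s, "\\nOK");
}
```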
I don't like the name […] |
If you don't like the name […] |
I don't think […] |
Honestly, the name […]. In terms of precedent for the […]. The only real downside I can see to using the name […]. Personally, I've thought that the […] |
Using […] |
I mean, it does feel weird, but then again, so does calling a vector a "vec" until you say it enough times. If you think of "ASCII" as shorthand for "ASCII character", I don't think that's too far-fetched. Oh, and even "char" is an abbreviation. |
I see no issue with this;
This message is composed of 420 ASCIIs. |
I also support calling this type […]. Also, rustdoc currently doesn't play well with types that are exported under a different name from their declaration; for instance, this method shows the return type as […]. (Note: I dislike the convention of having multiple types named […] |
With […], we're about to stabilize

```rust
pub struct NonZero<T>(/* private fields */)
where T: ZeroablePrimitive;
```

where […]. So here we could do something similar:

```rust
pub struct Ascii<T: ?Sized>(/* private fields */)
where T: SupersetOfAscii;
```

Then for […]. So internally that might be

```rust
pub struct Ascii<T: ?Sized + SupersetOfAscii>(<T as SupersetOfAscii>::StorageType);
```

Then we could implement that trait for various types -- […]

(That would allow […])

Thoughts, @BurntSushi? Would you like to see that as a PR or an ACP (or nothing)? |
If we have […] |
On […]: but thinking about it more as I type, maybe it's just a bad idea. We don't have […]. Maybe I should propose […] |
Heh, that actually reminds me of the API I came up with for a toy hex formatting crate of mine: https://docs.rs/easy-hex/1.0.0/easy_hex/struct.Hex.html Basically, the whole API revolves around being able to cast […]. That said, I feel like a basic […]. For the […] |
Will it be possible to backwards-compatibly redefine […]? Currently the returned iterator provides a […] |
Yes, but likely only if Rust implements something like edition-dependent name resolution, where the same name resolves to two different […]. I previously proposed something like that: https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Effect.20of.20fma.20disabled/near/274199236 |
This is a random thought I had, but seeing the progress on this makes me wonder if we could make […]. Figuring out what would have to be done to avoid any breakages there could probably be useful here, since they would also need to be applied to ASCII chars too. |
@clarfonthey Practically I think no, because […]. Maybe eventually, with fancy enough matching-transparent pattern types, but I'm not sure it'd ever really be worth it. (For something like […] |
Getting back to this feature a bit. Genuine question: instead of the common named variants, now that the precedent has been set for […]? Not sure if implementing this as a POC in unstable would require an RFC or not. |
Is it possible to hide the variants in the sidebar of the documentation, or even better, the entire section? |
core: optimise Debug impl for ascii::Char Rather than writing character at a time, optimise Debug implementation for core::ascii::Char such that it writes the entire representation with a single write_str call. With that, add tests for Display and Debug. Issue: rust-lang#110998
Rollup merge of rust-lang#120314 - mina86:i, r=Mark-Simulacrum core: optimise Debug impl for ascii::Char Rather than writing character at a time, optimise Debug implementation for core::ascii::Char such that it writes the entire representation with a single write_str call. With that, add tests for Display and Debug. Issue: rust-lang#110998
Re: ASCII Literals

As an end user of Rust who wants good ASCII support, I believe the best way to improve this feature is to add literals such as […]. Currently, the best solution for arriving at the ASCII subset is this: […]

Being forced to write out all of the […]. On the other hand, byte string literals can be used in this manner: […]

This forces me to use runtime checks, or simply cross my fingers, rather than compile-time checks to ensure that the ASCII subset is not being violated. I'm perfectly fine with doing runtime checks when I'm parsing what could very well be arbitrary data, but source code is not arbitrary, and violations should be able to be caught at compile time. |
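For comparison, the two styles being contrasted look roughly like this today (a sketch assuming nightly with the unstable `ascii_char`/`ascii_char_variants` features; this is not the literal syntax being requested):

```rust
#![feature(ascii_char, ascii_char_variants)]

use std::ascii::Char;

fn main() {
    // Spelling an ASCII string out variant by variant is verbose but fully
    // checked at compile time:
    const OK: [Char; 2] = [Char::CapitalO, Char::CapitalK];

    // A byte-string literal is terse, but its ASCII-ness is only checked at
    // runtime (or trusted via unsafe):
    let ok = b"OK".as_ascii().unwrap();

    assert_eq!(OK.as_str(), ok.as_str());
}
```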
@NathanielHardesty If the functions for constructing […] |
Hello, do you plan to add a `char::to_digit(self, radix: u32)` equivalent? |
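There is no such method on `ascii::Char` today; a user-side helper can simply delegate to `char::to_digit` (a sketch assuming the unstable `ascii_char` feature and its `from_u8`/`to_char` accessors):

```rust
#![feature(ascii_char)]

use std::ascii::Char;

// Hypothetical helper; not a proposed API. Delegating to char::to_digit
// keeps the radix handling identical to char.
fn to_digit(c: Char, radix: u32) -> Option<u32> {
    c.to_char().to_digit(radix)
}

fn main() {
    let seven = Char::from_u8(b'7').unwrap();
    assert_eq!(to_digit(seven, 10), Some(7));
    assert_eq!(to_digit(seven, 16), Some(7));
}
```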
@Marcondiro you will need to open an issue in rust-lang/libs-team, which is called an ACP. The libs-api team would then provide you with feedback there. |
Interestingly, it looks like […]. I had been about to say "just send a PR since it's something `char` has", but then I looked at the […] |
|
@CodesInChaos |
|
It's worth adding that, if these were available in const context, you could verify they're valid at compile-time by wrapping them in a |
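For example, the kind of compile-time validation presumably meant here can already be expressed with a `const` item (a sketch assuming `ascii::Char::from_u8` remains callable in const context on nightly):

```rust
#![feature(ascii_char)]

use std::ascii::Char;

// Evaluated at compile time: a non-ASCII byte here becomes a build error
// rather than a runtime panic.
const COLON: Char = match Char::from_u8(b':') {
    Some(c) => c,
    None => panic!("not ASCII"),
};

fn main() {
    assert_eq!(COLON.to_u8(), b':');
}
```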
A drive-by bikeshed comment that I don't think has been made before: |
I don't think these names are ever intended to be stable. IIRC, using an enum is a bit of a hack to get all the niches. The intent is for it to appear more like […]. That said, the official Unicode™ name is "Latin Capital letter A". |
Feature gate:

```rust
#![feature(ascii_char)]
#![feature(ascii_char_variants)]
```

This is a tracking issue for the `ascii::Char` type from rust-lang/libs-team#179.

https://doc.rust-lang.org/nightly/std/ascii/enum.Char.html

Public API
Steps / History

- Add `ascii::Char` (ACP#179) #111009
- Constify `[u8]::is_ascii` (unstably) #111222
- Rename `ascii::Char` methods from `as_` to `to_` #114641
- Implement `Default` for `AsciiChar` #121024

Unresolved Questions

- `char` and `Char` might be too confusing.
- How should the `Debug` impl work? Make `Debug` impl for `ascii::Char` match that of `char` #115434
- Is `as`-casting from it a good or bad feature? (There's no `char::to_u32`, just `as u32` for it.)
- The `as_ascii` methods take `&self` for consistency with `is_ascii`. Should they take `self` instead where possible, as the usually-better option, or stick with `&self` for the consistency?

Footnotes
https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html ↩