
Tracking Issue for ascii::Char (ACP 179) #110998

Open
1 of 7 tasks
scottmcm opened this issue Apr 29, 2023 · 104 comments
Labels
A-Unicode Area: Unicode C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@scottmcm
Member

scottmcm commented Apr 29, 2023

Feature gate: #![feature(ascii_char)] #![feature(ascii_char_variants)]

This is a tracking issue for the ascii::Char type from rust-lang/libs-team#179

https://doc.rust-lang.org/nightly/std/ascii/enum.Char.html

Public API

// core::ascii

#[repr(u8)]
enum Char {
    Null = 0,
    // … one variant per ASCII character …
    Tilde = 127,
}

impl Debug for Char {}
impl Display for Char {}
impl Default for Char { ... }

impl Step for Char { ... } // so `Range<Char>` is an Iterator

impl Char {
    const fn from_u8(x: u8) -> Option<Self>;
    const unsafe fn from_u8_unchecked(x: u8) -> Self;
    const fn digit(d: u8) -> Option<Self>;
    const unsafe fn digit_unchecked(d: u8) -> Self;
    const fn as_u8(self) -> u8;
    const fn as_char(self) -> char;
    const fn as_str(&self) -> &str;
}

impl [Char] {
    const fn as_str(&self) -> &str;
    const fn as_bytes(&self) -> &[u8];
}

impl From<Char> for u8 {}
impl From<Char> for char {}
impl From<&[Char]> for &str {}

// core::array

impl<const N: usize> [u8; N] {
    const fn as_ascii(&self) -> Option<&[ascii::Char; N]>;
    const unsafe fn as_ascii_unchecked(&self) -> &[ascii::Char; N];
}

// core::char

impl char {
    const fn as_ascii(&self) -> Option<ascii::Char>;
}

// core::num

impl u8 {
    const fn as_ascii(&self) -> Option<ascii::Char>;
}

// core::slice

impl [u8] {
    const fn as_ascii(&self) -> Option<&[ascii::Char]>;
    const unsafe fn as_ascii_unchecked(&self) -> &[ascii::Char];
}

// core::str

impl str {
    const fn as_ascii(&self) -> Option<&[ascii::Char]>;
}

Steps / History

Unresolved Questions

  • What should it be named? Code mixing char and Char might be too confusing.
  • How should its Debug impl work?
  • Is allowing as-casting from it a good or bad feature?
    • FWIW, there's no char::to_u32, just as u32 for it.
  • Some of the as_ascii methods take &self for consistency with is_ascii. Should they take self instead where possible, as the usually-better option, or stick with &self for consistency?

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@scottmcm scottmcm added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC labels Apr 29, 2023
@BurntSushi
Member

I guess I'll kick off the discussion about how to actually define the ascii::Char type. Right now, it's an enum like this:

#[repr(u8)]
enum Char {
    Null = 0,
    // … one variant per ASCII character …
    Tilde = 127,
}

I'd like to first make sure I understand the pros of using this representation. Do I have them right?

  1. It provides a niche optimization such that Option<ascii::Char> is the same size as ascii::Char.
  2. It provides a way to write ch as u8 where ch has type ascii::Char.
  3. For "free," you can get cute and write the names of ASCII characters instead of their visual or numerical representation. (I don't mean to trivialize this by calling it "cute." I think it's a nice benefit, especially for non-printable characters.)

Are there more? Am I mistaken about the above?

And now, the mitigations:

  1. I think the niche optimization can still be obtained in this case because std can use unstable features? And I think there's a way to say, "make this specific value the niche" without any other additional cost.
  2. We can provide an ascii::Char::as_u8() method. (Which is already done in the implementation PR.) Is there anything else that being able to write ch as u8 buys us?
  3. I don't think there is a mitigation here. It feels like a "nice to have" aspect of using an enum that we probably wouldn't go out of our way to re-create via other means.

In terms of negatives, why not do the enum in the first place? Good question. I'm not entirely sure. It feels to me like it stresses my weirdness budget. It's also a generally much bigger enum than most, and I wonder whether that will provoke any surprises in places. Maybe not.

@ChayimFriedman2
Contributor

Do we need both u8::as_ascii() and ascii::Char::from_u8()?

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue May 4, 2023
Add `ascii::Char` (ACP#179)

ACP second: rust-lang/libs-team#179 (comment)
New tracking issue: rust-lang#110998

For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag.

There's lots more things that could be added here, and place for further doc updates, but this seems like a plausible starting point PR.

I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`.

As a demonstration, made a commit updating some formatting code to use this: scottmcm@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)
@scottmcm
Member Author

scottmcm commented May 4, 2023

Regarding the enum, @kupiakos said the following in the ACP (rust-lang/libs-team#179):

The names or properties of ASCII characters will never change and there aren't many of them, and so we might as well expose them as variants that can participate in pattern matching.

And mentioned icu4x's enum,

https://github.com/unicode-org/icu4x/blob/b6c4018a736e79790898c5b91ff3ab25d33192c2/utils/tinystr/src/asciibyte.rs#L8-L11

Though given those names it's possible it's just doing that for the niche and not for direct use.

(Man I wish we had private variants as a possibility.)

To your questions, @BurntSushi ,

  1. Yeah, there are at least three other possibilities here: a) put the enum in a newtype (like ptr::Alignment) to still get the niches without exposing the variants, b) use the rustc-internal magic, or c) just don't get a niche for now, since that's not needed for its core scenario of convertibility to UTF-8 (though it's certainly nice to have)

  2. Given that we're overall not fond of as casts, and it also allows weirder things like as i64, I actually wish the cast could be disabled, and have people use as_u8 as you mention. But that might also just happen by deprecating or warning on these casts, and eventually having ascii::Char: AsRepr<Repr = u8> does seem reasonable.

  3. Now that Refactor core::char::EscapeDefault and co. structures #105076 has landed, I've been thinking of redoing the internals of those using ascii::Char to try out how it feels using the names. That would mean code like

            '\0' => EscapeDebug::backslash(Null),
            '\r' => EscapeDebug::backslash(CarriageReturn),
            '\n' => EscapeDebug::backslash(LineFeed),

because without the variants it ends up being more like

            '\0' => EscapeDebug::backslash(b'\0'.as_ascii().unwrap()),
            '\r' => EscapeDebug::backslash(b'\r'.as_ascii().unwrap()),
            '\n' => EscapeDebug::backslash(b'\n'.as_ascii().unwrap()),

which is nice in that it keeps the escape sequences lining up, but having to copy-paste .as_ascii().unwrap() all over like that doesn't fill me with joy.

I guess this is yet another situation in which I'd like custom literals. Or something like a'\n', I guess, but that's way too far a jump to propose right now.

EDIT a while later: The PR to update EscapeDebug like that is #111524

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue May 6, 2023
Constify `[u8]::is_ascii` (unstably)

UTF-8 checking in `const fn`-stabilized back in 1.63 (rust-lang#97367), but apparently somehow ASCII checking was never const-ified, despite being simpler.

New constness-tracking issue for `is_ascii`: rust-lang#111090

I noticed this working on `ascii::Char`: rust-lang#110998
bors added a commit to rust-lang-ci/rust that referenced this issue May 7, 2023
Constify `[u8]::is_ascii` (unstably)

UTF-8 checking in `const fn`-stabilized back in 1.63 (rust-lang#97367), but apparently somehow ASCII checking was never const-ified, despite being simpler.

New constness-tracking issue for `is_ascii`: rust-lang#111090

I noticed this working on `ascii::Char`: rust-lang#110998
@clarfonthey
Contributor

I wasn't in the discussion when this type was created (which I fully support), but is there a reason to explicitly use an enum instead of a newtype + internal niche attribute like NonZeroU8?

If we decide to go the pattern types route (#107606) then NonZeroU8 could become u8 is 1.. and this could become u8 is ..128. Alternatively, if we go the generic integers route (#2581) then this could become uint<7> or just u7.

IMHO, explicitly going with an enum kind of prevents these kinds of optimisations later down the road, since it forces an enum as the representation.

@BurntSushi
Member

@clarfonthey I think the discussion above answers your question about why we went with the enum for now. In summary, I think its principal upside is that it gives nice names to each character. Not particularly compelling, but nice.

My general feeling is that nobody is stuck on using an enum here. My best guess is that unless something compelling comes up, I wouldn't be surprised if we changed to an opaque representation such that we could change to an enum later in a compatible manner. (I think that's possible?) Purely because it's the conservative route.

I don't grok the second two paragraphs of your comment. Probably because I don't know what "pattern types" are. I also don't know what uint<7> means. (Don't have time to follow your links at the moment.)

@clarfonthey
Contributor

clarfonthey commented May 13, 2023

I mean, having nice names is still achievable with constants, and this still would make match expressions work as you'd expect. Personally, the benefits I'm seeing here are the ability to have more-performant code without the need for extra unsafe code, since any slice of ASCII characters is automatically valid UTF-8. Otherwise, a type invariant would mostly be unnecessary, and you could just have constants for the names.

Explaining the two proposals:

  • Pattern types just allow you to write T is pat which is a new type that means T, but only values which match pat. So, Option<T> is Some(_) means "options which are some" and u8 is 1.. is effectively NonZeroU8.
  • Generic integers is a separate proposal which "converts" all the existing uN and iN types into aliases for more generic uint<N> and int<N> types, which has the benefit of allowing non-power-of-two sizes and sizes larger than 128. The idea behind uint<7> is that it would be equivalent to u8 internally with the top bit always zero, while still allowing all the normal operations on integers like pattern-matching and being castable to u8.

The pattern type version is very enticing because it very naturally allows exhaustive matching, while still just being a compile-time constraint. And like I said, we could still have constants for the different names for characters without really affecting things.

@BurntSushi
Member

having nice names is still achievable

Yes, as I said:

I don't think there is a mitigation here. It feels like a "nice to have" aspect of using an enum that we probably wouldn't go out of our way to re-create via other means.

I don't know what the stabilization story is for pattern types or generic integers, but I would prefer we focus on finding a path to stabilization that doesn't require being blocked on hypothetical language features. Or at least, focusing discussion on whether we can utilize them later in a backwards compatible fashion.

I still don't really know what pattern types are (I just don't have time right now to grok it), but generic integer types seem like an implementation detail that's consistent with defining ascii::Char as an opaque type?

@clarfonthey
Contributor

Right, the point here is that any opaque type would be equivalent to the primary hypothetical future implementations. So, explicitly not blocking on any particular path forward, but making sure that they're compatible.

@scottmcm
Member Author

I mean, having nice names is still achievable with constants, and this still would make match expressions work as you'd expect.

Note, though, that you can import a variant, but you can't import an associated constant. Thus use ascii::Char::*; works with an enum, but not with constants for the names.

I agree that this could absolutely be done with the magic attributes. But is there anything in particular that you think is better with that approach?

The enum version still gets the niche, notably:

[src/main.rs:3] std::alloc::Layout::new::<Option<Option<Option<Option<Option<std::ascii::Char>>>>>>() = Layout {
    size: 1,
    align: 1 (1 << 0),
}

@clarfonthey
Contributor

I will admit that the lack of an ability to use associated constants is an immediate downgrade, although I do think that'll eventually be allowed, which would let us deprecate the std::fN::consts modules.

As far as benefits -- the main ones I see are the ability to unify this type with any one of the potential future features I mentioned. If we stabilise an enum now, we are explicitly locked out of this.

That said, there is precedent for separating out the char type from the other integer types, and I guess that ascii::Char fits right in with that. I'm just not convinced that locking in this as an enum is worth losing the benefits we could get later, especially considering how it could be converted from an opaque type into an enum publicly later if we eventually decide that is better than the alternatives.

@scottmcm
Member Author

I think it's important that this stay a separate type. To me, it's an advantage that this not just be a u7, but that it have the semantic intent behind it of actually being ASCII.

For example, if we have pattern types, I don't want to impl Display for &[u8 is ..128], but I think it's possible that we'd add impl Display for &[ascii::Char]. (Like how Debug for char is very different from Debug for a hypothetical u32 is ..=1114111.)

Should it be something else to be able to stabilize a subset sooner? Maybe; I don't know. I think I'll leave questions like that more to libs-api.

@clarfonthey
Contributor

That's fair, and in that case I'll concede that the enum works here. I'm still not sure it's the best-performing version (I use the bounded-integer crate for all sorts of things, and it uses enums for its codegen), but I feel convinced enough that we can stabilise this as a separate type.

bors added a commit to rust-lang-ci/rust that referenced this issue May 20, 2023
`ascii::Char`-ify the escaping code in `core`

This means that `EscapeIterInner::as_str` no longer needs unsafe code, because the type system ensures the internal buffer is only ASCII, and thus valid UTF-8.

Come to think of it, this also gives it a (non-guaranteed) niche.

cc `@BurntSushi` as potentially interested
`ascii::Char` tracking issue: rust-lang#110998
@safinaskar
Contributor

I don't like the name Char. It is too similar to char, even though they are totally different things. I propose ASCII instead.

@safinaskar
Contributor

If you don't like the name ASCII, then name it Septet, but please not Char.

@BurntSushi
Member

I don't think ASCII or Septet are going to work. Personally, I can't imagine those being the names we end up with.

ascii::Char and char are not totally different things. They both represent abstract characters. One is just a more limited definition than the other, but both are in fact quite limited! ascii::Char is namespaced under the ascii module, which provides pretty adequate context for what kind of character it represents IMO.

@clarfonthey
Contributor

Honestly, the name Ascii for a type would make sense since it lives in the ascii module, and it would mirror other existing Rust types. It doesn't seem out of place to refer to an ASCII character as "an Ascii", especially considering how many existing terms use "ASCII" as a placeholder for "text", as in "ASCII art." Nowadays, people even refer to Unicode art as ASCII art.

In terms of precedent for the ascii::Char name, I would point toward the various Result types in modules offered across libstd and the ecosystem, where io::Result and fmt::Result are two common examples. While the base Result is in prelude, these separate Result types are in separate modules and often used exclusively as module-specific imports, although in rare cases code will directly import them. I don't think this is a good analogy, however, since ascii::Char and char actually are completely different types with entirely different properties, and not just shorthands for each other, even though you can convert one into the other.

The only real downside I can see to using the name Ascii is that it's converting an acronym into a word and lowercasing most of the letters, which… didn't stop us for stuff like Arc and Rc and OsStr.

Personally, I've thought that the core::ascii module has been mostly useless, now that AsciiExt is deprecated, until the addition of this character type; revitalising it to introduce a type called Ascii would make the most sense to me. Plus, it would even open a path toward potentially adding this type to the prelude if it were considered useful there, or maybe to the preludes of other crates that rely on it.

@BurntSushi
Member

Using Ascii as a singular noun sounds very strange to me. "ASCII art" is using "ASCII" as an adjective. I'm not aware of any wide usage of "Ascii" as a singular noun referring to a single character. Its usage as a noun, in my experience, refers to a character set and/or encoding. Not a member of that set/encoding.

@clarfonthey
Contributor

clarfonthey commented May 22, 2023

I mean, it does feel weird, but then again, so does calling a vector a "vec" until you say it enough times. If you think of "ASCII" as shorthand for "ASCII character", I don't think that's too far-fetched.

Oh, and even "char" is an abbreviation.

@kupiakos
Contributor

2. Given that we're overall not fond of as casts, and it also allows weirder things like as i64, I actually wish the cast could be disabled

I see no issue with this; ascii::Char, or whatever it's called, can be fully represented by every numeric type except bool. The issue with as casts changing the value is irrelevant for this case.

I mean, it does feel weird

This message is composed of 420 ASCIIs.

@bluebear94
Contributor

I also support calling this type Ascii because calling it Char is likely to give worse diagnostics if you mix up char and Char. The Char name would also result in confusion to readers unless you qualify it as ascii::Char, which is more verbose than Ascii.

Also, rustdoc currently doesn’t play well with types that are exported under a different name from their declaration; for instance, this method shows the return type as AsciiChar, which is the name used in the declaration of this type.

(Note: I dislike the convention of having multiple types named Error and Result in the standard library and would rather have had FmtError, IoError, and so on, so my opinions might be biased.)

@scottmcm
Member Author

With NonZero<T> in FCP (#120257 (comment)), I was inspired by it and jake's previous comment to come back to this and propose something that I think might be reasonable.

We're about to stabilize

pub struct NonZero<T>(/* private fields */)
where T: ZeroablePrimitive;

where ZeroablePrimitive is a forever-internal implementation detail, but there's always a get: NonZero<T> -> T.

So here we could do something similar:

pub struct Ascii<T: ?Sized>(/* private fields */)
where T: SupersetOfAscii;

Then for T: Sized we'd have the same get: Ascii<T> -> T, but we'd also have, for everything, an as_str: &Ascii<T> -> &str, as well as allowing it to deref (perhaps indirectly) from &Ascii<T> to &T.

So internally that might be

pub struct Ascii<T: ?Sized + SupersetOfAscii>(<T as SupersetOfAscii>::StorageType);

Then we could implement that trait for various types -- u8 and u32 where the internal StorageType is a private thing with #[rustc_layout_scalar_valid_range_end(0x7F)], but also implement it for [u8; N] and [u8] and such to cover more general things.

(That would allow Ascii<u32>: AsRef<char> + AsRef<str> for example, since you can get references to either from it. Might not be worth bothering having that, though, since I've never seen anything that cares about AsRef<char>.)

Thoughts, @BurntSushi ? Would you like to see that as a PR or an ACP (or nothing)?


@programmerjake
Member

programmerjake commented Feb 22, 2024

If we have Ascii<T: ?Sized>: Deref<Target = T>, I think we'll need both Ascii<str> and Ascii<[u8]> as well as the corresponding owned types. We should have Ascii<[u8]>: AsRef<Ascii<str>> and vice versa, and other corresponding conversions for owned/borrowed types.

@scottmcm
Member Author

On Deref: yeah, I haven't quite thought this one through all the way. I added the "(perhaps indirectly)" parenthetical to try to add some space for it -- like maybe we don't always have Ascii<T: ?Sized>: Deref<Target = T> because we deref the ascii one to something else that then derefs to the original thing.

But thinking about it more as I type, maybe it's just a bad idea. We don't have &String -> &Vec<u8> as a deref coercion -- not even indirectly -- so maybe trying to have it here would be wrong too.

Maybe I should propose Ascii<T: ?Sized>: AsRef<T> as the general thing instead, since that we can definitely do, and we'll be more limited in which things we allow to Deref at all.

@Kimundi
Member

Kimundi commented Feb 25, 2024

Heh, that actually reminds me of the API I came up with for a toy hex formatting crate of mine: https://docs.rs/easy-hex/1.0.0/easy_hex/struct.Hex.html

Basically, the whole API revolves around being able to cast T to/from Hex<T> just to get different trait impl semantics. In my case it's just a repr(transparent) wrapper, so I don't change the representation, but the idea still seems similar.

That said, I feel like a basic AsciiChar type is still the best course of action, otherwise it seems like the whole API discussion here has to be started from scratch :D

For the "[AsciiChar] does not implement Debug right" problem, could we maybe provide a struct AsciiSlice([AsciiChar]); and just make it easy to convert to/from the basic slice type? I could imagine that becoming useful for more things than just the debug formatting impl.

@jongiddy
Contributor

Will it be possible to backwards-compatibly redefine [u8]::escape_ascii to return an Iterator<Item=AsciiChar> instead of the current Iterator<Item=u8>?

Currently the returned iterator provides a to_string() method that collects the characters into a string, but if any other iterators are chained, we lose the info that the bytes are safe to collect into a String.

@programmerjake
Member

Will it be possible to backwards-compatibly redefine [u8]::escape_ascii to return an Iterator<Item=AsciiChar> instead of the current Iterator<Item=u8>?

yes, but likely only if rust implements something like edition-dependent name resolution where the same name resolves to two different escape_ascii functions depending on the edition.

I previously proposed something like that: https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Effect.20of.20fma.20disabled/near/274199236

@clarfonthey
Contributor

This is a random thought I had, but seeing the progress on this makes me wonder if we could make char itself implemented as a lang item struct instead of a primitive in the language without breaking anything.

Figuring out what would have to be done to avoid any breakage there could probably be useful here, since the same considerations would also apply to ASCII chars.

@scottmcm
Member Author

scottmcm commented May 1, 2024

@clarfonthey Practically I think no, because char has exhaustive matching for its particular disjoint range, which isn't something exposable today.

Maybe eventually, with fancy enough matching-transparent pattern types, but I'm not sure it'd ever really be worth it. (For something like str, sure, but char has so much special stuff that I'm skeptical.)

@clarfonthey
Contributor

Getting back to this feature a bit. Genuine question: instead of the common named variants, now that the precedent has been set for CStr having dedicated c"..." literals, would it be reasonable to add a'x' and a"..." literals? This could also solve the issue with ascii_char_variants being unstable, preferring that you use the literals instead of the variants.

Not sure if implementing this as a POC in unstable would require an RFC or not.

@leb-kuchen

Is it possible to hide the variants in the sidebar of the documentation, or even better, the entire section?

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 11, 2024
core: optimise Debug impl for ascii::Char

Rather than writing character at a time, optimise Debug implementation
for core::ascii::Char such that it writes the entire representation
with a single write_str call.

With that, add tests for Display and Debug.

Issue: rust-lang#110998
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 11, 2024
core: optimise Debug impl for ascii::Char

Rather than writing character at a time, optimise Debug implementation
for core::ascii::Char such that it writes the entire representation
with a single write_str call.

With that, add tests for Display and Debug.

Issue: rust-lang#110998
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Aug 11, 2024
Rollup merge of rust-lang#120314 - mina86:i, r=Mark-Simulacrum

core: optimise Debug impl for ascii::Char

Rather than writing character at a time, optimise Debug implementation
for core::ascii::Char such that it writes the entire representation
with a single write_str call.

With that, add tests for Display and Debug.

Issue: rust-lang#110998
RalfJung pushed a commit to RalfJung/miri that referenced this issue Aug 12, 2024
core: optimise Debug impl for ascii::Char

Rather than writing character at a time, optimise Debug implementation
for core::ascii::Char such that it writes the entire representation
with a single write_str call.

With that, add tests for Display and Debug.

Issue: rust-lang/rust#110998
@NathanielHardesty

Re: ASCII Literals

As an end user of Rust who wants good ASCII support, I believe the best way to improve this feature is to add literals such as a'A' and a"ASCII" as others have suggested. Doing fancy manipulations on strings is far less important (to me anyways) than just being able to write strings so that they can be stored in structs or passed into functions.

Currently, the best solution to arriving at the ASCII subset is this:

  • [Char::SmallA, Char::SmallS, Char::SmallC, Char::SmallI, Char::SmallI]

Being forced to write out all of the Char variants just to arrive at a [Char] is cumbersome and it seems as though these variants are an unstable feature unto themselves which is best avoided.

On the other hand, byte string literals can be used in this manner:

  • b"ascii".as_ascii().unwrap()
  • unsafe {b"ascii".as_ascii_unchecked()}

This forces me to use runtime checks, or simply cross my fingers, rather than use compile time checks in order to assure that the ASCII subset is not being violated. I'm perfectly fine with doing runtime checks when I'm parsing what very well could be arbitrary data, but source code is not arbitrary and violations should be able to be caught at compile time.

@ChayimFriedman2
Contributor

@NathanielHardesty If the functions for constructing &[AsciiChar] from &[u8] are const, you can create a macro for it (it's still worse than compiler support, but not by much).

@Marcondiro
Contributor

Hello, do you plan to add a char::to_digit(self, radix: u32) equivalent?
This method was proposed in #95969 / #96110 for u8 but was rejected.
Thanks!

@NobodyXu
Contributor

@Marcondiro you will need to open an issue in rust-lang/libs-team, which is called an ACP.

The libs-api team would then provide you with feedback there.

@scottmcm
Member Author

Interestingly, it looks like char::from(a).to_digit(10) actually optimizes well: https://rust.godbolt.org/z/K5nebha3a

I had been about to say "just send a PR since it's something char has", but then I looked at the to_digit signature, and now I want an ACP to discuss u32 vs some other width for the argument and the return :/

@CodesInChaos

CodesInChaos commented Nov 18, 2024

  1. Why do the fallible functions return Option instead of Result<T, AsciiError>? Returning Result would be consistent with utf8 conversion, and would produce a reasonable error when unwrapping. Plus TryFrom already needs an error type.
  2. Would it make sense to add a family of const fn as_ascii_mut[_unchecked](&mut [u8]) -> Option<&mut [ascii::Char]> functions (for u8, [u8], str, [u8; N]) or is TryFrom enough?
  3. Or should some of the conversion functions proposed in the original post be removed in favour of TryFrom?
  4. NonZero<ascii::Char> support could be useful since it can be infallibly converted to CString.
  5. Concerning the naming bikeshed, I like Ascii. It's short, and meaningful even outside the module context (e.g. if it becomes part of the prelude at some point). But I'd be fine with AsciiChar as well.

@zopsicle
Contributor

@CodesInChaos &[NonZero<ascii::Char>] cannot be converted to &CStr because there is no terminating nul. It can be infallibly converted to CString though.

@clarfonthey
Contributor

NonZero<ascii::Char> actually could be implemented with just a library change, although I'm not sure whether libs-api would rather make that go through an ACP first. With this being the first non-primitive type to support NonZero, it would be nice to ensure that precedent is okay.

@clarfonthey
Contributor

On the other hand, byte string literals can be used in this manner:

* `b"ascii".as_ascii().unwrap()`

* `unsafe {b"ascii".as_ascii_unchecked()}`

This forces me to use runtime checks, or simply cross my fingers, rather than use compile time checks in order to assure that the ASCII subset is not being violated. I'm perfectly fine with doing runtime checks when I'm parsing what very well could be arbitrary data, but source code is not arbitrary and violations should be able to be caught at compile time.

It's worth adding that, if these were available in const context, you could verify they're valid at compile-time by wrapping them in a const block. Although I agree that adding literals is still a good idea.

@nnethercote
Contributor

A drive-by bikeshed comment that I don't think has been made before: UpperCaseA and LowerCaseA would be preferable to CapitalA and SmallA, for consistency with char::is_{upper,lower}case, and because they are just clearer names.

@ChrisDenton
Member

I don't think these names are ever intended to be stable. IIRC using an enum is a bit of a hack to get all the niches. The intent is for it to appear more like char to users.

That said, the official Unicode™ name is "Latin Capital letter A"
