Remove string [] indexing #12710

Closed
huonw opened this issue Mar 5, 2014 · 18 comments · Fixed by #15085
Comments

@huonw
Member

huonw commented Mar 5, 2014

String `[]` indexing is byte indexing (not character indexing), which encourages poor UTF-8 hygiene. The behaviour can be regained either with the `.bytes()` iterator or with `s.as_bytes()`, which gives a `&[u8]` view at zero cost.
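
As an illustration of the two replacements named above, here is a minimal sketch in current Rust syntax (which postdates this thread). The helper names `byte_at` and `all_bytes` are hypothetical; only `as_bytes()` and `bytes()` come from the discussion.

```rust
/// Returns the byte at `index`, roughly what the old `s[index]` produced,
/// but with `None` instead of a panic when `index` is out of range.
fn byte_at(s: &str, index: usize) -> Option<u8> {
    // `as_bytes` reborrows the string's buffer as `&[u8]`; nothing is copied.
    s.as_bytes().get(index).copied()
}

/// Collects every byte of the string through the `.bytes()` iterator.
fn all_bytes(s: &str) -> Vec<u8> {
    s.bytes().collect()
}
```

For a string such as `"h\u{e9}llo"`, index 1 yields the first byte of the two-byte encoding of `é`, not a character, which is the hazard the issue describes.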

@huonw
Member Author

huonw commented Mar 5, 2014

nominating

@bstrie
Contributor

bstrie commented Mar 5, 2014

Seconded. In a UTF-8 world, what we think of as a "string" is emphatically not an array.

@thestinger
Contributor

+1, it's rarely correct to index strings at all, and especially not by bytes

@thestinger
Contributor

I would also be for removing the len method and using as_bytes().len() instead. Strings don't need to implement the Container trait as they're not actually a container of a specific type.
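
To make the `len` point concrete, here is a small sketch in current Rust (newer than this thread; the `Container` trait mentioned above no longer exists). The function names are hypothetical. It shows that the byte length and the number of Unicode scalar values differ for non-ASCII text.

```rust
/// Length in bytes, spelled the way the comment above proposes.
fn byte_len(s: &str) -> usize {
    s.as_bytes().len()
}

/// Number of Unicode scalar values (`char`s); this walks the whole string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

In today's Rust, `str::len` was kept and returns the same value as `byte_len`.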

@sfackler
Member

sfackler commented Mar 5, 2014

+1

@pongad
Contributor

pongad commented Mar 5, 2014

I volunteer to work on this. By the way, we do have a function to decode code points, right?
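
On the decoding question: in current Rust (an assumption about the modern standard library, not the 2014 API this thread refers to), decoding is done through the `chars()` and `char_indices()` iterators. A minimal sketch, with hypothetical helper names:

```rust
/// Decodes the first code point and reports how many bytes it occupies.
fn first_char(s: &str) -> Option<(char, usize)> {
    let c = s.chars().next()?;
    Some((c, c.len_utf8()))
}

/// Pairs each decoded `char` with the byte offset at which it starts.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```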

@pongad
Contributor

pongad commented Mar 6, 2014

~str in patterns seems to be blocking this. I'll try to work on that and come back to this later.

@huonw
Member Author

huonw commented Mar 6, 2014

Why would string pattern matching affect string indexing?

@pongad
Contributor

pongad commented Mar 6, 2014

I'm not sure myself. After removing string indexing, I got "internal compiler errors". Tracked it to #[lang=uniq_str_eq].


@nikomatsakis
Contributor

This will probably fall out of the changes we plan for DST, or at least could easily do so.

@pnkfelix
Member

Accepted for 1.0, P-backcompat-libs.

@lilyball
Contributor

lilyball commented May 5, 2014

@thestinger Removal of [] indexing makes some small sense, but removal of len() is too far. If you remove len() then you have to remove slice() as well, and that's just too painful (going through .as_bytes() to slice loses all information about whether the result is valid utf-8 and is extremely wasteful to get back to &str).
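
The cost described here can be sketched in current Rust (newer syntax than this thread; the function names are hypothetical). Slicing the string directly only checks that the endpoints fall on character boundaries, while going through bytes discards the UTF-8 guarantee and requires `std::str::from_utf8` to rescan every byte of the slice.

```rust
/// Slices the string directly; only the two endpoints are checked.
/// Returns `None` if an endpoint is out of range or splits a character.
fn slice_direct(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Slices the bytes, then revalidates the whole slice to get a `&str` back.
fn slice_via_bytes(s: &str, start: usize, end: usize) -> Option<&str> {
    let bytes = s.as_bytes().get(start..end)?;
    // `from_utf8` walks every byte of `bytes` to confirm it is valid UTF-8.
    std::str::from_utf8(bytes).ok()
}
```

Both functions return the same results; the second simply does linear work that the first avoids.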

@brson
Contributor

brson commented Jun 17, 2014

Any hints what has to change in the code to fix this? I took a peek but got lost in trait lookups, while I thought slice indexing was built-in.

@nikomatsakis
Contributor

@brson hmm, I think the PRIMARY thing that has to change is middle::ty::ty_index(), which says that a str is indexable to type u8

@nikomatsakis
Contributor

the function middle::mem_categorization::element_kind could also be updated to remove that case

@brson
Contributor

brson commented Jun 18, 2014

Thanks, @nikomatsakis. I've started a patch.

brson added a commit to brson/rust that referenced this issue Jul 2, 2014
Being able to index into the bytes of a string encourages
poor UTF-8 hygiene. To get a view of `&[u8]` from either
a `String` or `&str` slice, use the `as_bytes()` method.

Closes rust-lang#12710.

[breaking-change]
@alexchandel

Rust strings feel much less intuitive to me than Python's str. I'm fine with UTF-8, but in Python str/strings are unicode and indexing/slicing them yields a unicode character. Python has a separate bytes class for encoded strings, whether they're UTF-8, 16, or 32 encoded, and uses the b"" syntax to designate byte-string literals versus u"" for unicode string literals.

bors added a commit that referenced this issue Jul 2, 2014
Being able to index into the bytes of a string encourages
poor UTF-8 hygiene. To get a view of `&[u8]` from either
a `String` or `&str` slice, use the `as_bytes()` method.

Closes #12710.

[breaking-change]

If the diffstat is any indication, this shouldn't have a huge impact, but it will have some. Most changes are in the `str` and `path` modules. Many of the existing usages were in tests where ASCII is expected. There are a number of other legitimate uses where the characters are known to be ASCII.

@huonw
Member Author

huonw commented Jul 2, 2014

> in Python str/strings are unicode

So are Rust strings; the major difference is that we do not disguise the underlying representation.

(Also, it's not exactly clear what your point is? If it is that indexing should be removed, then @brson's patch #15085 is being tested as we speak.)
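
For comparison with the Python behaviour described above, here is a rough sketch in current Rust (newer than this thread; helper names are hypothetical). Character access is available but spelled as an explicit iterator walk, which makes its linear cost visible, and Rust also has `b""` byte-string literals.

```rust
/// Returns the `n`-th Unicode scalar value, similar in spirit to Python's `s[n]`.
/// This decodes from the start of the string, so it takes time linear in `n`.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// A byte-string literal: the bytes of "abc" as a `&[u8]`, not a `&str`.
fn byte_literal() -> &'static [u8] {
    b"abc"
}
```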
