Rename stdlib Unicode to Strings? #25394
Comments
I would argue that these functions shouldn't have left Base, though that ship has sailed.
I very much agree that the name |
The definition of what it means to lowercase something or to be lowercase is specific to Unicode and it changes with different versions of the standard. It might make sense to also have an |
I think "return the lowercase character corresponding to the given character" is precise enough by our usual standards of generic function meanings. If I made up my own encoding where "A" is 9999 and "a" is 9998, defining |
That's a particularly unhelpful example since the question is not how the letters |
Here's what I would suggest we do: keep these functions in Base. For each Julia version x.y.z, document which Unicode standard the functions implement. People who want compliance with bleeding-edge or old Unicode standards could use a separate package as needed.
If we call the package |
OK, I'll submit a PR soon to do the renaming.
Here is the full list of exports, grouped by similar functionality:
Grapheme iteration and Unicode validity are inherently Unicode-related. Other character sets could define notions of these. Character class predicates could be extended to other character sets, as could character transformations. Having |
I think that's why we were also going to move a bunch of other string functionality (search, chomp, isvalid, that sort of thing) in too.
At least for me, one of the main motivations for moving all these functions to stdlib was to clean the Base namespace from all the |
@StefanKarpinski German added an uppercase ß in 2017. Unicode 5.1: U+1E9E
That was a rhetorical question, @jrklasen. My point is that the answer to the question is not self-evident – that you referred to the Unicode version in which it was added makes the point quite well.
Sorry, I was not quite clear. I just wanted to point out that the uppercase and lowercase of ß are not defined simply by Unicode, but by the "Rat für deutsche Rechtschreibung" (Council for German Orthography), which changed it last year from (ss|ß) <-> SS to ss <-> SS and ß <-> ẞ (U+1E9E). If the package had a name other than Unicode, this could be adopted.
But the package specifically implements Unicode uppercasing, not Council for German Orthography uppercasing. That can be implemented by another package, but it's not what this one implements.
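To make the ß point concrete, here is a minimal sketch in Python (not Julia, and not the code of the package under discussion) of what Unicode-defined case mapping does with ß. The helper name case_mapping_report is hypothetical. It assumes only Python's built-in str.upper and str.lower, which follow the Unicode case mappings, and unicodedata.unidata_version, which reports the Unicode version the interpreter was built against.

```python
import unicodedata


def case_mapping_report(text: str) -> dict:
    """Hypothetical helper: collect the case mappings for `text`.

    The results depend on the Unicode data bundled with the running
    interpreter, so the data version is recorded alongside them.
    """
    return {
        "unicode_version": unicodedata.unidata_version,
        "upper": text.upper(),
        "lower": text.lower(),
        # Uppercasing and then lowercasing need not give back the input.
        "round_trip": text.upper().lower(),
    }


report = case_mapping_report("Straße")
print(report)

# Unicode's full uppercase mapping expands ß to "SS" rather than
# mapping it to the capital letter ẞ (U+1E9E).
print("ß".upper())

# The capital ẞ (U+1E9E) still lowercases to ß.
print("\u1e9e".lower())
```

Whether a given language's uppercase function produces "SS" or ẞ for ß is a decision of that implementation, so this sketch shows Python's behavior only and says nothing about what Julia's function returns.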
There seem to be three common objections:
While I can understand 1, we moved these things out because it seemed unfortunate to have names like
I find 2 a bit whiny. Is the name "Unicode" really that scary? It's 2018 and Unicode is here to stay; being afraid of the name seems silly. If you're doing case transformations on arbitrary code points, you are doing something that is defined in terms of the Unicode standard, and it seems not so unreasonable that you be aware of it.
Point 3 seems to be the most compelling technical argument, but as I said, I am also having a hard time being convinced that it's important to support character sets that can't be mapped into Unicode.
I wonder if there's a solution here similar to the Pkg3 "This package is not installed. Install? [Y/n]" prompt.
Case changing is so basic that I think it's really silly to require |
Other recent languages adopt a variety of approaches:
Ruby does the same thing as Python. Matlab and R have basic string functions such as case changing readily available without loading a library.
I kind of like the character set object approach but that is somewhat orthogonal to whether those objects are in Base or not.
I'm not sure that Matlab and R are the languages to follow when it comes to modularity or strings.
Some languages allow/encourage interactive use, where others prefer a more programmatic workflow. Better to look at those languages with similar usage patterns...
Continuing from #25416 (comment) since we should have the conversation here, not on the PR. @JeffBezanson, what is your proposal specifically? Case transforms in |
It's unfortunate that some of the predicate names like
A couple of the predicates are redundant and can maybe be deprecated:
I find it kind of confusing that the standard library for string processing is called Unicode instead of something like Strings. Isn't it just an implementation detail of Julia that strings happen to be Unicode? Seems weird to force end-users, many of whom are probably scientific users who might not know anything about character encodings, to know that Julia strings are stored as Unicode in order to call common string processing functions.
For example, functions like Unicode.islower don't seem intrinsically tied to Unicode (if anything, that function is only meaningful for Latin encodings).