Rename stdlib `Unicode` to `Strings`? #25394

malmaud · 2018-01-04T20:08:35Z

I find it kind of confusing that the standard library for string processing is called Unicode instead of something like Strings. Isn't it just an implementation detail of Julia that strings happen to be unicode? Seems weird to force end-users, many of whom are probably scientific users who might not know anything about character encodings, to know that Julia strings are stored as Unicode in order to call common string processing functions.

For example, functions like Unicode.islower don't seem intrinsically tied to Unicode (if anything, that function is only meaningful for Latin encodings).

ararslan · 2018-01-04T20:14:18Z

I would argue that these functions shouldn't have left Base, though that ship has sailed.

KristofferC · 2018-01-04T20:17:29Z

I very much agree that the name Unicode feels like an implementation detail that gets exposed. I realize that perhaps technically it isn't, but splitting functions up into different packages depending on whether they depend on the Unicode version is not very user-friendly. I would also want to put them back into Base purely from a user-friendliness perspective, spoken as someone who knows almost nothing of implementation details...

StefanKarpinski · 2018-01-04T20:17:59Z

The definition of what it means to lowercase something or to be lowercase is specific to Unicode and it changes with different versions of the standard. It might make sense to also have an ASCII module and have ASCII-specific predicates and transformations since often one does not care about case transforming Elvish. If someone wants to propose a definition of these functions that is not dependent on the Unicode standard, I think the onus is on them to provide a precise definition of what the functions mean without referring to that standard.

JeffBezanson · 2018-01-04T20:34:26Z

I think "return the lowercase character corresponding to the given character" is precise enough by our usual standards of generic function meanings. If I made up my own encoding where "A" is 9999 and "a" is 9998, defining lowercase(JeffChar(9999)) to return JeffChar(9998) seems legit to me.
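
A minimal sketch of that hypothetical, assuming Julia 1.x, where lowercase is a Base function. JeffChar and its numbering are the made-up encoding from the comment above, and only the single pair it mentions is mapped:

```julia
# Hypothetical character type in a made-up encoding: "A" is 9999, "a" is 9998.
struct JeffChar
    code::UInt32
end

# Add a method to the generic function for the new type.
# Only the one pair above is mapped; every other code is returned unchanged.
Base.lowercase(c::JeffChar) = c.code == 9999 ? JeffChar(9998) : c

# Compare the result with the expected lowercase character.
println(lowercase(JeffChar(9999)) == JeffChar(9998))
```

Nothing in this method refers to the Unicode standard; the meaning of "lowercase" comes entirely from the type's own definition.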

StefanKarpinski · 2018-01-04T20:59:15Z

That's a particularly unhelpful example since the question is not how the letters A and a are encoded – that does not matter at all. The question is how to transform other characters whose case transformation is non-obvious according to obvious ASCII rules. What are the uppercase versions of ÿ or ß, for example?
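
One way to see what a given Julia build answers is to ask it. The sketch below assumes Julia 1.x, where uppercase is available without imports. The result for characters such as ß depends on the Unicode tables bundled with that build, so no expected output is shown:

```julia
# Print each character, its uppercase mapping, and the mapping's code point in hex.
for c in ('a', 'ÿ', 'ß')
    u = uppercase(c)
    println(c, " => ", u, " (U+", string(UInt32(u), base = 16, pad = 4), ")")
end
```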

ararslan · 2018-01-04T21:08:04Z

Here's what I would suggest we do: Keep these functions in Base. For each Julia version x.y.z, document that the functions implement/correspond to a particular Unicode standard. People who want compliance with bleeding edge or old Unicode standards could use a separate package as needed.

JeffBezanson · 2018-01-04T22:33:59Z

If we call the package Strings, it seems to give us license to move a bunch more functions there from Base, which might be nice. We could have both Strings and Unicode (for truly unicode-specific things like graphemes) or Strings.Unicode.

malmaud · 2018-01-04T22:34:48Z

OK, I'll submit a PR soon to do the renaming.

StefanKarpinski · 2018-01-04T22:34:52Z

Here is the full list of exports, grouped by similar functionality:

Grapheme iteration: graphemes
Text width: textwidth
Unicode validity: isvalid
Character class predicates: islower, isupper, isalpha, isdigit, isxdigit, isnumeric, isalnum, iscntrl, ispunct, isspace, isprint, isgraph
Case transformations: lowercase, uppercase, titlecase, lcfirst, ucfirst

Grapheme iteration and Unicode validity are inherently Unicode-related. Other character sets could define notions of these. Character class predicates could be extended to other character sets, as could character transformations. Having Strings.isspace etc. not operate on strings may be a bit surprising.
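
To illustrate why grapheme iteration is tied to Unicode, here is a small sketch assuming Julia 1.x, where graphemes is exported by the Unicode stdlib module. It compares the code point count with the grapheme count for a string built from a base letter plus a combining accent:

```julia
using Unicode  # stdlib module; assumed to export graphemes (Julia 1.x)

s = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

n_codepoints = length(s)                      # counts code points
n_graphemes  = length(collect(graphemes(s)))  # counts user-perceived characters

println((n_codepoints, n_graphemes))
```

The two counts differ because the rules for grouping code points into one user-perceived character come from the Unicode standard itself.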

ararslan · 2018-01-04T22:35:47Z

Strings and Strings.Unicode seems fine, though I really think that some of the basic functionality, e.g. lowercase, uppercase, etc., should really live in Base.

vtjnash · 2018-01-04T22:36:43Z

I think that's why we were also going to move a bunch of other string functionality (search, chomp, isvalid, that sort of thing) in too.

nalimilan · 2018-01-04T22:41:21Z

At least for me, one of the main motivations for moving all these functions to stdlib was to clean the Base namespace of all the is* predicates and group them under a common module (see discussion at #14347). I don't really care what the module is called.

jrklasen · 2018-01-05T17:16:03Z

> What are the uppercase versions of ÿ or ß, for example?

@StefanKarpinski German added an uppercase ß in 2017. Unicode 5.1: U+1E9E

StefanKarpinski · 2018-01-05T20:36:45Z

That was a rhetorical question, @jrklasen. My point is that the answer to the question is not self-evident – that you referred to the Unicode version in which it was added makes the point quite well.

jrklasen · 2018-01-05T21:39:46Z

Sorry, I was not quite clear. I just wanted to point out that uppercase and lowercase of ß isn't simply defined by Unicode, but by the "Rat für deutsche Rechtschreibung" (Council for German Orthography), which changed it last year from (ss|ß) <-> SS to ss <-> SS and ß <-> ẞ (U+1E9E). If the package has a name other than Unicode, this could be adopted.

StefanKarpinski · 2018-01-05T22:05:00Z

But the package specifically implements Unicode uppercasing, not Council for German Orthography uppercasing. That can be implemented by another package, but it's not what this one implements.

StefanKarpinski · 2018-01-05T23:03:57Z

There seem to be three common objections:

1. Some people find it annoying to have to write using to use functions like lowercase.
2. Some people don't mind writing using but find writing using Unicode specifically annoying.
3. There are other character sets besides Unicode and it makes some sense to have a generic function that can operate on representations of character sets besides Unicode.

While I can understand 1, we moved these things out because it seemed unfortunate to have names like isvalid or isspace exported from Base – they're not terribly generic or general. We could move them back, but I don't think that's great. We could move some of them back and leave others in the Unicode package, but where do you draw the line?

I find 2 a bit whiny. Is the name "Unicode" really that scary? It's 2018 and Unicode is here to stay; being afraid of the name seems silly. If you're doing case transformations on arbitrary code points, you are doing something that is defined in terms of the Unicode standard and it seems not so unreasonable that you be aware of it.

Point 3 seems to be the most compelling technical argument, but as I said, I also am having a hard time being convinced that it's important to support character sets that can't be mapped into Unicode.

yurivish · 2018-01-05T23:15:18Z

I wonder if there's a solution here similar to the Pkg3 "This package is not installed. Install? [Y/n]" prompt.

ararslan · 2018-01-05T23:18:11Z

Case changing is so basic that I think it's really silly to require using anything to get it. I think the answer to "where do you stop" would be with a survey of what other languages provide out of the box without requiring any imports. The lowest common denominator of those seems like it would be a good candidate for Base, while the rest can live in Unicode (or whatever people want to call it).

nalimilan · 2018-01-06T09:47:11Z

Other recent languages adopt a variety of approaches:

Go basically does what we do now.
Rust puts only a few fancy functions in its unicode stdlib module, and offers most of the functions we have in Unicode as methods of its char type. Of course that's a bit different from Julia since methods are attached to a type, which is almost like having them in a separate module (though the advantage is that you don't need to load it explicitly).
Python 3 basically does the same thing as Rust, with most functions as methods of the string type.
Swift offers uppercase/lowercase functions in its core library. The character category predicates are not provided under the form of is* functions, but as a series of character set objects.

ararslan · 2018-01-06T19:55:56Z

Ruby does the same thing as Python. Matlab and R have basic string functions such as case changing readily available without loading a library.

StefanKarpinski · 2018-01-06T23:07:36Z

I kind of like the character set object approach but that is somewhat orthogonal to whether those objects are in Base or not.

> Matlab and R have basic string functions such as case changing readily available without loading a library.

I'm not sure that Matlab and R are the languages to follow when it comes to modularity or strings.

cormullion · 2018-01-07T09:05:02Z

> Other recent languages adopt a variety of approaches

Some languages allow/encourage interactive use, where others prefer a more programmatic workflow. Better to look at those languages with similar usage patterns...

StefanKarpinski · 2018-01-09T14:29:49Z

Continuing from #25416 (comment) since we should have the conversation here, not on the PR.

@JeffBezanson, what is your proposal specifically? Case transforms in Base and everything else in Unicode? Or some of the character class predicates in Base (some of them do not depend on Unicode at all), and some in Unicode? One of the original motivations was getting isvalid and isnumeric and such into their own namespace.

JeffBezanson · 2018-01-09T16:15:55Z

isvalid is now a pretty important function so I think we'll have to keep it.

It's unfortunate that some of the predicate names like isspace and isdigit are pretty obvious while others like isnumeric are more generic. I don't know where to draw the line so I'd just keep all of them in Base (except isassigned).

A couple of the predicates are redundant and can maybe be deprecated:

isalnum is identical to isalpha(c) || isnumeric(c)
isgraph is identical to isprint(c) && !isspace(c)
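
The two identities can be written out as follows. This is a sketch, not the Base implementation: the helper names alnum and graph are hypothetical, and isletter is used on the assumption that isalpha goes by that name in Julia 1.x.

```julia
# Hypothetical stand-in for isalnum: a letter or a numeric character.
# Assumption: isletter is the Julia 1.x name for what the thread calls isalpha.
alnum(c::AbstractChar) = isletter(c) || isnumeric(c)

# Hypothetical stand-in for isgraph: printable but not whitespace.
graph(c::AbstractChar) = isprint(c) && !isspace(c)

# Try both helpers on a letter, a digit, and a space.
for c in ('a', '7', ' ')
    println(repr(c), ": alnum=", alnum(c), " graph=", graph(c))
end
```

If the redundant predicates were deprecated, callers could spell out the combination they need in this way.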

malmaud added the triage label Jan 4, 2018

ararslan added the stdlib and strings labels Jan 4, 2018

JeffBezanson changed the title ~~Rename stdlib Unicode to String?~~ Rename stdlib Unicode to Strings? Jan 4, 2018

malmaud self-assigned this Jan 4, 2018

vtjnash removed the triage label Jan 4, 2018

malmaud mentioned this issue Jan 5, 2018

Rename Unicode package to Strings. #25416

Closed

JeffBezanson added a commit that referenced this issue Jan 9, 2018

move case functions and char predicates back to Base

2c22005

fixes #25394

JeffBezanson mentioned this issue Jan 9, 2018

move case functions and char predicates back to Base #25479

Merged

JeffBezanson added a commit that referenced this issue Jan 10, 2018

move case functions and char predicates back to Base

c4d46fd

fixes #25394

JeffBezanson added a commit that referenced this issue Jan 10, 2018

move case functions and char predicates back to Base

234ad9e

fixes #25394

JeffBezanson added a commit that referenced this issue Jan 11, 2018

move case functions and char predicates back to Base

f89e5ed

fixes #25394

JeffBezanson added a commit that referenced this issue Jan 11, 2018

move case functions and char predicates back to Base

771d999

fixes #25394

JeffBezanson closed this as completed in c5cd13e Jan 13, 2018

Keno pushed a commit that referenced this issue Jun 5, 2024

move case functions and char predicates back to Base (#25479)

1868628

fixes #25394
