Rename stdlib Unicode to Strings? #25394

Closed
malmaud opened this issue Jan 4, 2018 · 25 comments
Labels: stdlib (Julia's standard library), strings ("Strings!")

Comments

@malmaud
Contributor

malmaud commented Jan 4, 2018

I find it kind of confusing that the standard library for string processing is called Unicode instead of something like Strings. Isn't it just an implementation detail of Julia that strings happen to be Unicode? It seems weird to force end users, many of whom are probably scientific users who might not know anything about character encodings, to know that Julia strings are stored as Unicode in order to call common string processing functions.

For example, functions like Unicode.islower don't seem intrinsically tied to Unicode (if anything, that function is only meaningful for Latin encodings).

@malmaud malmaud added the triage This should be discussed on a triage call label Jan 4, 2018
@ararslan ararslan added stdlib Julia's standard library strings "Strings!" labels Jan 4, 2018
@ararslan
Member

ararslan commented Jan 4, 2018

I would argue that these functions shouldn't have left Base, though that ship has sailed.

@KristofferC
Member

KristofferC commented Jan 4, 2018

I very much agree that the name Unicode feels like an implementation detail that gets exposed. I realize that perhaps technically it isn't, but splitting functions up into different packages depending on whether they depend on the Unicode version is not very user-friendly. I would also want to put them back into Base purely from a user-friendliness perspective, speaking as someone who knows almost nothing of the implementation details...

@StefanKarpinski
Member

StefanKarpinski commented Jan 4, 2018

The definition of what it means to lowercase something or to be lowercase is specific to Unicode and it changes with different versions of the standard. It might make sense to also have an ASCII module and have ASCII-specific predicates and transformations since often one does not care about case transforming Elvish. If someone wants to propose a definition of these functions that is not dependent on the Unicode standard, I think the onus is on them to provide a precise definition of what the functions mean without referring to that standard.
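
A minimal sketch of what ASCII-specific predicates and transformations could look like. The names `ascii_isupper` and `ascii_lowercase` are hypothetical and not part of any Julia module; the comparison with `lowercase` assumes Julia 1.x, where that function is available from Base.

```julia
# Hypothetical ASCII-only helpers; these names are illustrative, not part of Julia.
ascii_isupper(c::Char) = 'A' <= c <= 'Z'

# Shift A-Z down to a-z; every other character passes through unchanged.
ascii_lowercase(c::Char) = ascii_isupper(c) ? c + 32 : c

q = ascii_lowercase('Q')        # ASCII letter: case-folded
e_ascii = ascii_lowercase('É')  # non-ASCII letter: left alone
e_unicode = lowercase('É')      # Unicode-aware version does transform it
```

The ASCII variant has a definition that never changes, whereas the result of the Unicode-aware function depends on the Unicode tables shipped with the Julia version in use.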

@JeffBezanson
Member

I think "return the lowercase character corresponding to the given character" is precise enough by our usual standards of generic function meanings. If I made up my own encoding where "A" is 9999 and "a" is 9998, defining lowercase(JeffChar(9999)) to return JeffChar(9998) seems legit to me.
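
The made-up encoding described above can be sketched as follows. `JeffChar` is the hypothetical type from the comment, not a real Julia type, and the sketch assumes Julia 1.x, where `lowercase` is a generic function in Base that can be extended with new methods.

```julia
# Hypothetical character type from the comment: "A" is 9999 and "a" is 9998.
struct JeffChar
    code::UInt32
end

# Add a method to the generic function for the made-up encoding.
# Only the single pair from the comment is mapped; everything else is unchanged.
Base.lowercase(c::JeffChar) = c.code == 9999 ? JeffChar(9998) : c

lower_a = lowercase(JeffChar(9999))
```

Nothing in this method refers to the Unicode standard; the meaning of "lowercase" comes entirely from the character type's own definition.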

@JeffBezanson JeffBezanson changed the title Rename stdlib Unicode to String? Rename stdlib Unicode to Strings? Jan 4, 2018
@StefanKarpinski
Member

StefanKarpinski commented Jan 4, 2018

That's a particularly unhelpful example since the question is not how the letters A and a are encoded – that does not matter at all. The question is how to transform other characters whose case transformation does not follow from the obvious ASCII rules. What are the uppercase versions of ÿ or ß, for example?
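
One way to see what a given Julia build answers for these characters is to print the mappings. This is a sketch assuming Julia 1.x, where `uppercase` is available from Base; no expected output is shown because the result for ß depends on the Unicode data bundled with the Julia version in use.

```julia
# Print each character, its uppercase mapping, and the mapping's code point.
for c in ('a', 'ÿ', 'ß')
    u = uppercase(c)
    println(c, " -> ", u, " (U+", string(UInt32(u), base = 16, pad = 4), ")")
end
```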

@ararslan
Member

ararslan commented Jan 4, 2018

Here's what I would suggest we do: Keep these functions in Base. For each Julia version x.y.z, document that the functions implement/correspond to a particular Unicode standard. People who want compliance with bleeding edge or old Unicode standards could use a separate package as needed.

@malmaud malmaud self-assigned this Jan 4, 2018
@JeffBezanson
Member

If we call the package Strings, it seems to give us license to move a bunch more functions there from Base, which might be nice. We could have both Strings and Unicode (for truly Unicode-specific things like graphemes) or Strings.Unicode.

@vtjnash vtjnash removed the triage This should be discussed on a triage call label Jan 4, 2018
@malmaud
Contributor Author

malmaud commented Jan 4, 2018

OK, I'll submit a PR soon to do the renaming.

@StefanKarpinski
Member

Here is the full list of exports, grouped by similar functionality:

  • Grapheme iteration: graphemes
  • Text width: textwidth
  • Unicode validity: isvalid
  • Character class predicates: islower, isupper, isalpha, isdigit, isxdigit, isnumeric, isalnum, iscntrl, ispunct, isspace, isprint, isgraph
  • Case transformations: lowercase, uppercase, titlecase, lcfirst, ucfirst

Grapheme iteration and Unicode validity are inherently Unicode-related. Other character sets could define notions of these. Character class predicates could be extended to other character sets, as could character transformations. Having Strings.isspace etc. not operate on strings may be a bit surprising.
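
A short sketch of two of the points above: grapheme iteration is a Unicode-specific notion, and predicates such as `isspace` act on characters rather than strings. It assumes a Julia 1.x session, where `graphemes` is provided by the `Unicode` stdlib and `isspace` is available from Base.

```julia
using Unicode  # stdlib module that provides `graphemes` in Julia 1.x

s = "e\u0301"                        # 'e' followed by a combining acute accent
n_codepoints = length(s)             # number of code points in the string
n_graphemes = length(graphemes(s))   # number of user-perceived characters

# `isspace` takes a single Char, so it is applied across a string explicitly.
all_space = all(isspace, " \t\n")
```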

@ararslan
Member

ararslan commented Jan 4, 2018

Strings and Strings.Unicode seems fine, though I really think that some of the basic functionality, e.g. lowercase, uppercase, etc., should really live in Base.

@vtjnash
Member

vtjnash commented Jan 4, 2018

I think that's why we were also going to move a bunch of other string functionality (search, chomp, isvalid, that sort of thing) in too.

@nalimilan
Member

At least for me, one of the main motivations for moving all these functions to stdlib was to clean the Base namespace of all the is* predicates and group them under a common module (see discussion at #14347). I don't really care what the module is called.

@jrklasen
Contributor

jrklasen commented Jan 5, 2018

What are the uppercase versions of ÿ or ß, for example?

@StefanKarpinski German added an uppercase ß in 2017. Unicode 5.1: U+1E9E

@StefanKarpinski
Member

That was a rhetorical question, @jrklasen. My point is that the answer to the question is not self-evident – that you referred to the Unicode version in which it was added makes the point quite well.

@jrklasen
Contributor

jrklasen commented Jan 5, 2018

Sorry, I was not quite clear. I just wanted to point out that the uppercase and lowercase of ß are not defined by Unicode alone, but by the "Rat für deutsche Rechtschreibung" (Council for German Orthography), which changed the rule last year from (ss|ß) <-> SS to ss <-> SS and ß <-> ẞ (U+1E9E). If the package had a name other than Unicode, this could be adopted.

@StefanKarpinski
Member

But the package specifically implements Unicode uppercasing, not Council for German Orthography uppercasing. That can be implemented by another package, but it's not what this one implements.

@StefanKarpinski
Member

StefanKarpinski commented Jan 5, 2018

There seem to be three common objections:

  1. Some people find it annoying to have to write using to use functions like lowercase.
  2. Some people don't mind writing using but find writing using Unicode specifically annoying.
  3. There are other character sets besides Unicode and it makes some sense to have a generic function that can operate on representations of character sets besides Unicode.

While I can understand 1, we moved these things out because it seemed unfortunate to have names like isvalid or isspace exported from Base – they're not terribly generic or general. We could move them back, but I don't think that's great. We could move some of them back and leave others in the Unicode package, but where do you draw the line?

I find 2 a bit whiny. Is the name "Unicode" really that scary? It's 2018 and Unicode is here to stay; being afraid of the name seems silly. If you're doing case transformations on arbitrary code points, you are doing something that is defined in terms of the Unicode standard, and it seems not so unreasonable that you be aware of it.

Point 3 seems to be the most compelling technical argument, but as I said, I also am having a hard time being convinced that it's important to support character sets that can't be mapped into Unicode.

@yurivish
Contributor

yurivish commented Jan 5, 2018

I wonder if there's a solution here similar to the Pkg3 "This package is not installed. Install? [Y/n]" prompt.

@ararslan
Member

ararslan commented Jan 5, 2018

Case changing is so basic that I think it's really silly to require using anything to get it. I think the answer to "where do you stop" would come from a survey of what other languages provide out of the box without requiring any imports. The lowest common denominator of those seems like it would be a good candidate for Base, while the rest can live in Unicode (or whatever people want to call it).

@nalimilan
Member

Other recent languages adopt a variety of approaches:

  • Go basically does what we do now.
  • Rust puts only a few fancy functions in its unicode stdlib module, and offers most of the functions we have in Unicode as methods of its char type. Of course that's a bit different from Julia since methods are attached to a type, which is almost like having them in a separate module (though the advantage is that you don't need to load it explicitly).
  • Python 3 basically does the same thing as Rust, with most functions as methods of the string type.
  • Swift offers uppercase/lowercase functions in its core library. The character category predicates are not provided under the form of is* functions, but as a series of character set objects.

@ararslan
Member

ararslan commented Jan 6, 2018

Ruby does the same thing as Python. Matlab and R have basic string functions such as case changing readily available without loading a library.

@StefanKarpinski
Member

I kind of like the character set object approach but that is somewhat orthogonal to whether those objects are in Base or not.

Matlab and R have basic string functions such as case changing readily available without loading a library.

I'm not sure that Matlab and R are the languages to follow when it comes to modularity or strings.

@cormullion
Contributor

Other recent languages adopt a variety of approaches

Some languages allow/encourage interactive use, where others prefer a more programmatic workflow. Better to look at those languages with similar usage patterns...

@StefanKarpinski
Member

Continuing from #25416 (comment) since we should have the conversation here, not on the PR.

@JeffBezanson, what is your proposal specifically? Case transforms in Base and everything else in Unicode? Or some of the character class predicates in Base (some of them do not depend on Unicode at all), and some in Unicode? One of the original motivations was getting isvalid and isnumeric and such into their own namespace.

@JeffBezanson
Member

isvalid is now a pretty important function so I think we'll have to keep it.

It's unfortunate that some of the predicate names like isspace and isdigit are pretty obvious while others like isnumeric are more generic. I don't know where to draw the line so I'd just keep all of them in Base (except isassigned).

A couple of the predicates are redundant and can maybe be deprecated:

  • isalnum is identical to isalpha(c) || isnumeric(c)
  • isgraph is identical to isprint(c) && !isspace(c)
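
The two identities above can be written out directly. This sketch assumes Julia 1.x, where `isalpha` is spelled `isletter` and the predicates are available from Base; `my_isalnum` and `my_isgraph` are hypothetical names used only for illustration.

```julia
# Hypothetical stand-ins for the two redundant predicates.
# `isletter` is the Julia 1.x spelling of `isalpha`.
my_isalnum(c::Char) = isletter(c) || isnumeric(c)
my_isgraph(c::Char) = isprint(c) && !isspace(c)

# A space is printable but is whitespace, so it is not "graphical".
space_is_graph = my_isgraph(' ')
```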
