Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add graphemes(s) function to iterate over string graphemes #9261

Merged
merged 1 commit into from
Dec 17, 2014

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Dec 7, 2014

utf8proc/libmojibake includes a feature to break a string up into graphemes (as defined by UAX#29). This PR exports a new function graphemes(s::AbstractString) to expose this functionality as an iterator over substrings of s.

e.g. collect(graphemes("b̀lahβlahb̂láh")) yields SubString{UTF8String}["b̀","l","a","h","β","l","a","h","b̂","l","á","h"]. Note that these are strings, not characters, because graphemes may consist of multiple codepoints (e.g. "b̀" is "\u0062\u0300" in any normalization).

(This may be useful e.g. in the REPL if we want the arrow keys etcetera to work with graphemes as recommended by UAX#29.)

cc: @jiahao, @Keno

@stevengj stevengj added unicode Related to unicode characters and encodings feature labels Dec 7, 2014
@stevengj
Copy link
Member Author

stevengj commented Dec 7, 2014

Note also that in order to implement this cleanly I had to correct an inconsistency in UTF8String slicing: previously s[i:j] allowed i and j to be invalid character indices (effectively by moving them right and left, respectively, to the next valid character indices) except when j was > endof(s) in which case it threw a BoundsError(). I changed this to allow any 0 < j ≤ length(s.data) and added a test case.

@StefanKarpinski
Copy link
Member

Clearly useful and should be exposed. @JeffBezanson, I thought we had changed this so that the indices in range indexing had to be valid character offsets. Did I misremember that or did we not pull the trigger?

@ivarne
Copy link
Member

ivarne commented Dec 7, 2014

We didn't pull the trigger last time I raised the IndexError issue in #7811, but that is a separate issue.

@@ -103,7 +103,7 @@ function getindex(s::UTF8String, r::UnitRange{Int})
if !is_utf8_start(d[i])
i = nextind(s,i)
end
if j > endof(s)
if j > length(d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't length the number of code points? I think we want sizeof.

Sorry, it's length(d) not length(s), so this is correct, but confusing.

@tkelman
Copy link
Contributor

tkelman commented Dec 8, 2014

I keep trying to click re-build PR on this in appveyor (status.julialang.org was down, it's back now) but it doesn't want to work. If you push to this again it should fire off another build.

edit: and the latest build timed out due to #7942, but at least appveyor is letting me click rebuild this time

@stevengj
Copy link
Member Author

stevengj commented Dec 8, 2014

Now that I look at this more closely, I think that utf8proc's approach of allocating a new string with 0xff beween graphemes is horribly inefficient. No normalization etc. is required for grapheme detection, so all you really need is an isgraphemepair(c1::Char, c2::Char) function that returns whether a pair of Chars are in the same grapheme, and then you can build nextgraphemeind(s, i) and a graphemes(s) iterator on top of that.

(This will require us to re-implement utf8proc's grapheme code, but that's actually not such a big deal: we have to do that anyway since utf8proc's grapheme rules are way out of date as described in JuliaStrings/utf8proc#19.)

@nalimilan
Copy link
Member

@stevengj Would that mean that string iterators could be made to work on graphemes and still be reasonably efficient? (Cf. #6165 (comment) and subsequent comments.)

@StefanKarpinski
Copy link
Member

The main issue there would be creating a new string object for every iteration, which currently would be very expensive. With the string work I'm doing it could be much cheaper – I presume we can be certain that no grapheme is ever longer than 7 bytes (or 15 bytes on 64-bit platforms)? Still, I'm not sure if this is a good idea. I'd be interested in comparing this to what Go does (they invented UTF-8 after all) – I believe that have something called a "rune" which is not quite a code point. Anyone familiar with this?

@stevengj
Copy link
Member Author

stevengj commented Dec 8, 2014

@StefanKarpinski, actually, the graphemes iterator returns substrings, because the caller might want to know the offsets in the original string; since SubString is an immutable struct of just a pointer and a couple integers, presumably it doesn't have to be heap-allocated?

With my suggested non-utf8proc strategy, where performance is critical you could also use nextgraphemeind(s, i) and prevgraphemeind(s, i) and just work with indices.

@nalimilan
Copy link
Member

No, but reading http://blog.golang.org/strings it seems that a rune "means exactly the same as "code point", with one interesting addition", which seems to be that rune is a type alias for int32. And Go iterates over runes.

See also http://blog.golang.org/normalization for an explanation of iteration and normalization. This page notes that the Norm package provides an iterator over normalized characters. That page also says that the Unicode standard defines a Stream Safe text format, with a limit of 30 codepoints that can be combined into a grapheme, and Go apparently uses that assumption to make some functions more efficient.

But the default iterator does not appear to do any normalization. For example, here's the result of iterating over a Unicode string, including the two forms of "noël":

package main

import "fmt"

func main() {
        for i, rune := range "Hello, 世界 noe\u0308l et noël" {
                fmt.Printf("%d: %c\n", i, rune)
        }
}

0: H
1: e
2: l
3: l
4: o
5: ,
6:  
7: 
10: 
13:  
14: n
15: o
16: e
17: ̈
19: l
20:  
21: e
22: t
23:  
24: n
25: o
26: ë
28: l

I find this a bit unfortunate, but there may be good reasons not to normalize. Though I wonder whether at least normalization couldn't be performed on the fly, even if iteration continues to go over codepoints (runes in Go), and not graphemes. Not sure we'd gain much, if some code points still cannot be considered as "user-perceived characters".

@stevengj
Copy link
Member Author

stevengj commented Dec 8, 2014

Yes, Go rune == Julia Char.

(@nalimilan, I think it would be possible to normalize on the fly in an iterator, albeit a bit hairy to code, but so far there hasn't been a lot of demand for this...so far only three packages are using normalize_string at all. I agree with Go's decision not to normalize iterators by default; in most applications this would be a lot of expense for little purpose, because most applications probably don't care whether a string contains combining characters etc...if you are looking inside the string at all, it is usually for ascii substrings like XML tags and commas. Nor is normalization a substitute for grapheme segmentation.)

@nalimilan
Copy link
Member

@stevengj Actually, my vote now goes to require explicit choice of the iteration method. I don't think there's a default which makes more sense than another one, it completely depends on what you want to do. But offering a default is dangerous since it means people don't get to think about what they need and the assumptions they can make.

@stevengj
Copy link
Member Author

I pushed a new version of this commit that uses the new is_grapheme_break function in libmojibake (which has been submitted upstream; still waiting to hear back) to do grapheme breaking without allocating an auxiliary array with 0xFF marker bytes.

I also included a Cbool type, since utf8proc uses bool (from stdbool.h or C++), and I noticed that this was missing. It seems to be a UInt8 on all supported platforms. Whoops, I forgot the discussion in #6529.

@stevengj
Copy link
Member Author

Anyone understand the AppVeyor failure? It looks like it just died in the middle of the build.

@jiahao
Copy link
Member

jiahao commented Dec 16, 2014

I believe AppVeyor times out after 40 minutes.

@tkelman
Copy link
Contributor

tkelman commented Dec 16, 2014

Yeah it's a freeze/timeout. The earlier failure was some breakage from 9266, now fixed in 9366 so I restarted this and other PR's that had been caught by the problem.

Unfortunately we're still getting freezes and timeouts. I'm not sure what the cause of those is, whether it's a Julia problem or the AppVeyor machine is losing its connection or something. Since it only ever happens for 64-bit Julia I suspect it's something in Julia. See http://help.appveyor.com/discussions/problems/1206-inexplicable-timeouts and appveyor/ci#86 - the build usually seems to get stuck either during the system image or the first couple tests, which only takes about 10 minutes to get to. If AppVeyor had a feature like Travis where the build fails if no output is received for some amount of time, we could at least move through our queue faster.

@tkelman
Copy link
Contributor

tkelman commented Dec 17, 2014

And it's timing out again. Feodor's looking into it, see if he can find anything.

@stevengj
Copy link
Member Author

Seems like the tests are passing again. Since everyone seems to be in favor of including this functionality and I've heard no objections to the syntax (modulo larger questions about overhauling s[i] codepoint indexing entirely), I'm going to go ahead and merge.

stevengj added a commit that referenced this pull request Dec 17, 2014
add graphemes(s) function to iterate over string graphemes
@stevengj stevengj merged commit 2364748 into JuliaLang:master Dec 17, 2014
@tonyhffong
Copy link

Would this be something backported to 0.3? or should I think about freezing TermWin.jl version for 0.3 if I choose to use graphemes?

@tkelman
Copy link
Contributor

tkelman commented Dec 21, 2014

This relies on functionality that's only in libmojibake, not upstream utf8proc. Progress is being made on upstreaming our changes, as I understand it. But we haven't backported the dependency change so this would be difficult to backport without also backporting all the other mojibake-related changes.

@tonyhffong
Copy link

Ok, thanks.

@matthieugomez
Copy link
Contributor

I'm using graphemes today : why does next return a string s[i:j] rather than SubString(s, i, j)?

@stevengj
Copy link
Member Author

stevengj commented Nov 6, 2015

@matthieugomez, I don't understand your question. graphemes iterates over substrings, e.g. collect(graphemes("Hello, 世界 noe\u0308l et noël")) returns an Array{SubString{UTF8String},1}.

@matthieugomez
Copy link
Contributor

I'm on my phone but first(graphemes("hello")) is ASCIIString not a Substring

On Friday, November 6, 2015, Steven G. Johnson [email protected]
wrote:

@matthieugomez https://github.com/matthieugomez, I don't understand
your question. graphemes iterates over substrings, e.g. collect(graphemes("Hello,
世界 noe\u0308l et noël")) returns an Array{SubString{UTF8String},1}.


Reply to this email directly or view it on GitHub
#9261 (comment).

nalimilan added a commit that referenced this pull request Nov 6, 2015
This is consistent with eltype() and therefore collect().
Fixes #9261.
@nalimilan
Copy link
Member

Good catch! See #13903.

@stevengj
Copy link
Member Author

stevengj commented Nov 6, 2015

Oh, right; I think I thought that s[i:j] returned a substring at the time.

@stevengj stevengj deleted the graphemes branch November 6, 2015 22:32
@matthieugomez
Copy link
Contributor

Yes so the enumerations generates strings. These strings are converted into
substrings when calling collect (due to your eltype definition). But they
are just substrings of themselves. It seems like the way to fix this is to
replace s[i:j] by SubString(s, i, j) in next

On Friday, November 6, 2015, Matthieu Gomez [email protected]
wrote:

I'm on my phone but first(graphemes("hello")) is ASCIIString not a
Substring

On Friday, November 6, 2015, Steven G. Johnson <[email protected]
javascript:_e(%7B%7D,'cvml','[email protected]');> wrote:

@matthieugomez https://github.com/matthieugomez, I don't understand
your question. graphemes iterates over substrings, e.g. collect(graphemes("Hello,
世界 noe\u0308l et noël")) returns an Array{SubString{UTF8String},1}.


Reply to this email directly or view it on GitHub
#9261 (comment).

@matthieugomez
Copy link
Contributor

Thanks. I have another question. With your type definition, some functions defined for AbstractStrings could directly work on GraphemeIterator, like chr2ind, nextind etc. but currently one needs to redefine all of them. Could GraphemeIterator be an AbstractString? If not could there be a common parent of ABstractStrings and Grapheme iterator?

nalimilan added a commit that referenced this pull request Nov 7, 2015
This is consistent with eltype() and therefore collect().
Fixes #9261.

(cherry picked from commit f379e08)
ref #13903
@nalimilan
Copy link
Member

@matthieugomez GraphemeIterator sounds very different from an AbstractString. It's more like a collection of strings, actually. What would chr2ind or nextind do on a GraphemeIterator?

@matthieugomez
Copy link
Contributor

@nalimilan Sorry i was not clear, I'll try to develop.

The first reason is that AbstractString and GraphemeIterator are both iterators on characters. The only difference is that in a GraphemeIterator, a character is a SubString while in a AbstractString, it is a Char (well I think it has to be a Char but I'm actually not sure).

Another reason is that they have the same implementation. AbstractStrings and GraphemeIterator are both indexable array where only some indices are valid (only the definition of "valid" differs). For these structures, functions such as chr2ind and nextind make sense : they allow to map the next element to it starting index in the array. Moreover, since both iterators use the starting index as the iterator state, chr2ind and nextind should already work on GraphemeIterator (for instance nextind returns the starting index of the next grapheme) — yet they're only defined on AbstractString.

I'm writing this because I'm working on a package that computes various string distances. Most of these distances work by comparing characters. I wrote the package with the idea that a character = a Char. But it'd be nice if the user could also compare strings with the notion that a character = a grapheme. Looking at my code, I just need to sign all methods with a Union{AbstractString, Grapheme}, define SubString(x::GraphemeIterator,...) = graphemes(SubString(x.s,...)), and extend chr2ind and nextind to work on Graphemes (by copying the code used to define these functions on AbstractStrings). We may gain in having a more formal notion of inheritance here.

@nalimilan
Copy link
Member

I think it would be more logical to consider iterating over characters and iterating over graphemes as two different ways of going over a string. You could define a CharacterIterator similar to GraphemeIterator, and use one or the other, or let the user pass one of these two types as an argument. Note that currently, GraphemeIterator isn't indexable: would you pass it a byte index or an index expressed in number of graphemes?

(Actually, one could even advocate that iterating over a string shouldn't be possible: you would be required to do for c in characters(s) or for c in graphemes(s), so that you make an explicit choice.)

@matthieugomez
Copy link
Contributor

I need more than the iterator interface: I also need functions such as nextind, chr2ind (which differ between chars vs graphemes), or endof (which should return the same value for both).

To sum up my point:

  • Defining nextind and chr2ind on GraphemeIterator would allow to write more generic code. A good example is the definition of an iterator on q-grams (substrings composed of q subsequent characters of a string)
  • It turns out that, while nextind and chr2indare defined with an AbstractString signature, they would directly work on GraphemeIterators, due to the way start, next are implemented in this pull request.

So it seems like there's some duplication going on, but I'm not sure what the best solution is.

@matthieugomez
Copy link
Contributor

I'm trying to follow @nalimilan idea by defining two different iterating wrappers for AbstractString. But it gets hard to rewrite stuff like split, join, etc on these iterators so that the iteration property of the string is conserved after passing the function. There must be a better solution (unless i'm doing it wrong).

@nalimilan
Copy link
Member

@matthieugomez What do you mean exactly? Are you trying to allow calling split on these iterators?

@matthieugomez
Copy link
Contributor

Yes.
I need to Substring or split these iterators (ie apply a function on the wrapped string and then wrap the output so that the iterator property of the original string is conserved). With enough of these methods these iterators start to look like strings, so it's not clear to me why they're not AbstractString.

Another thing I don't really like with the solution of implementing two custom iterator types is that this will make the code harder to read, cannot be copy pasted etc. It'd be nice to have the grapheme iteration "for free" rather than rewriting everything for new iterator objects rather than string objects.
I've written a short implementation of Grapheme as an AbstractString (comments mentions some issues) here

@nalimilan
Copy link
Member

I think you misunderstood my suggestion. My proposal was to always work with the standard string, and apply operations (split...) to it, but to also carry an iterator type that you call on that string when you need to iterate over its characters or its graphemes. But I'm not in position to evaluate whether this suits your needs.

@nalimilan nalimilan mentioned this pull request Mar 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants