Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve character category predicates #5939

Closed
stevengj opened this issue Feb 24, 2014 · 25 comments
Closed

improve character category predicates #5939

stevengj opened this issue Feb 24, 2014 · 25 comments
Labels
breaking This change will break code needs decision A decision on this change is needed unicode Related to unicode characters and encodings

Comments

@stevengj
Copy link
Member

As @jiahao suggested in #5576, it might be worthwhile to use utf8proc (which we are shipping with Julia anyway) to provide functions like isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, and possibly isblank in string.jl. The reason is that utf8proc seems to be more up-to-date on the Unicode standard than libc, and is unhampered by legacy issues (e.g. isblank returns false for a non-breaking space, apparently for legacy reasons).

utf8proc's results are also locale-independent. This may be a plus or a minus; I don't really understand how the locale affects the results of the abovementioned predicates in libc.

@JeffBezanson
Copy link
Member

We should definitely do this. All those predicates have canonical answers for unicode characters and shouldn't depend on locale.
One example is that the libc function isalpha(c) is defined to be equivalent to isupper(c) || islower(c), but in unicode there are letters that are neither lower- nor upper-case. The "C" locale selects the legacy behavior. So libc seems pretty bogus here.

@jiahao
Copy link
Member

jiahao commented Feb 24, 2014

I'd be up for putting in some effort to make this happen for 0.4. Is it fair to say that things aren't so badly broken Unicode-wide that all this needs to happen in 0.3?

@stevengj
Copy link
Member Author

I think that this particular issue can wait until 0.4.

@stevengj
Copy link
Member Author

One difficult question is whether to even keep some of these functions, or to replace them with more Unicode-centric names and concepts.

@stevengj
Copy link
Member Author

Note that these functions are all broken on Windows because the isw* functions take a single wchar_t argument, which is only 16 bits on Windows (and hence can only handle the BMP).

@JeffBezanson JeffBezanson changed the title use utf8proc's Unicode character categories for isspace etc. improve character category predicates Aug 15, 2014
@catawbasam
Copy link
Contributor

In follow-up to @JeffBezanson's comments in #8012, benchmarks comparing islower(Char) and islower(String) are posted at http://nbviewer.ipython.org/gist/catawbasam/28153e91774992d6482b

They compare 3 approaches:

  1. the current Base methods (which is inaccurate on Windows)
  2. methods based on utf8proc
  3. methods based on PCRE

Findings for test functions, as tested:

  • For islower(Char), Base was fastest, utf8proc was moderately slower and PCRE was unacceptable.
  • For islower(String), Base was slowest, utf8proc was moderately faster and PCRE was much faster.

For my purposes, performance of islower(ByteString) and islower(SubString{ByteString}) are most important, so I find the PCRE approach attractive.

Comments and suggestions for improving either the methods or the tests would be welcome.

@stevengj
Copy link
Member Author

But do utf8proc and PCRE always give the same answers? Is PCRE Unicode-7 aware?

Also, the string benchmark is probably misleading. If you want to optimize this function, don't use all (a higher-order function, where things probably don't get inlined). Use a straight for char in string loop.

@catawbasam
Copy link
Contributor

"Is PCRE Unicode-7 aware?"
Per http://www.pcre.org/changelog.txt PCRE was updated to Unicode 6.3.3 in December 2013, so it is presumably not yet Unicode-7 aware.

In briefly reviewing http://www.unicode.org/charts/PDF/Unicode-7.0/ and http://www.unicode.org/charts/case/ I did not find any Unicode-7 character codes that had lowercase/uppercase. If you have any test cases available, I'd be happy to try them out.

Regarding all vs for loop, the utf8proc version is already using a for loop. However, I did add a tweaked version of the Base.islower(String) call to ccall(:iswlower with a for loop, and as you suggested, it was indeed faster than the current Base.islower(). The new variant is called islower_loop in the notebook, and the gist is updated with those results.

@catawbasam
Copy link
Contributor

Seems like a utf8proc version of islower(String) might be faster if the string were passed via ccall and the conversion to character codes was done on that side. Unfortunately I don't know C very well, so I'm not in a position to do that test.

@stevengj
Copy link
Member Author

Try adding the following to libmojibake (formerly utf8proc):

#include "mojibake.h"

/* note: assumes str points to a valid NUL-terminated UTF8 string */
int mojibake_islower(const uint8_t *str)
{
     uint8_t c = *str;
     while (c) {
          int length = utf8proc_utf8class[c];
          int32_t uc;
          switch (length) {
              case 1:
                   uc = str[0];
                   break;
              case 2:
                   uc = ((str[0] & 0x1F) <<  6) + (str[1] & 0x3F);
                   break;
              case 3:
                   uc = ((str[0] & 0x0F) << 12) + ((str[1] & 0x3F) <<  6)
                        + (str[2] & 0x3F);
                   break;
              case 4:
                   uc = ((str[0] & 0x07) << 18) + ((str[1] & 0x3F) << 12)
                        + ((str[2] & 0x3F) <<  6) + (str[3] & 0x3F);
                   break;
          }
          if (utf8proc_get_property(uc)->category != UTF8PROC_CATEGORY_LL)
               return 0;
          str += length;
          c = *str;
     }
     return 1;
}

@stevengj
Copy link
Member Author

I don't think we would want islower to ever disagree for a Char or a String, so we need to use the same implementation for both. (Is islower for strings a function whose performance is critical?)

@JeffBezanson
Copy link
Member

The libmojibake approach is generally preferred, since it is already a deep dependency of the parser. Although PCRE is very important, I'd rather not dig in deeper with it. At a certain point, it becomes reasonable to want to use just the language with minimal library dependencies.

@catawbasam
Copy link
Contributor

Makes sense. I'll plan on trying out mojibake_islower in the next day or so.

re "Is islower for strings a function whose performance is critical?"
Not particularly. It's just that, in contrast to Julia's excellent performance on numerical tasks, text handling times aren't always especially good compared to Python's. It would be a shame to give up any further ground.

@catawbasam
Copy link
Contributor

Updated test results are up at http://nbviewer.ipython.org/gist/catawbasam/28153e91774992d6482b.
islower_utf8proc() is faster in this version, and mojibake_islower(String) is faster yet for ByteStrings.

Changes:

  • added mojibake_islower(ByteString) based on a C function provided by @stevengj
  • Revised islower_utf8proc() to use a hacked version of category_code(c) which omits the line cat == 0 ? 30 : cat. It isn't needed by islower() and adds quite a bit of time.
  • Updated tests of islower(Char) to use a randomly shuffled array of Char.
  • dropped PCRE as it does not work well with Char and libmojibake is the preferred path.

Updated Findings for test functions, as tested:

  • For Chars run against shuffled Char arrays, islower_base was just a hair faster than islower_utf8proc
  • For Strings, islower_utf8proc was 3-4x faster than islower_base
  • The discrepency between islower(Char) and islower(String) is probably due to the inefficiency of the all() function used in the Base version of islower(String).
  • mojibake_islower(ByteString) provided another 3x speed boost for ByteStrings.

@ivarne
Copy link
Member

ivarne commented Aug 19, 2014

This might be totally obvious to you, but it seems to me like mojibake_islower should take an explicit length, so that SubString can use the same implementation.

@catawbasam
Copy link
Contributor

Looks to me like we might be able to forgo specialized C methods like mojibake_islower altogether if we move c > 0x10FFFF && return 0x0000 and cat == 0 ? 30 : cat from category_code() in utf8proc.jl to utf8proc_get_property on the C side, and still get pretty good performance.

That would keep libmojibake nice and slim and minimize calls on @stevengj 's time.

@stevengj
Copy link
Member Author

One of those checks can be eliminated completely by JuliaStrings/utf8proc#15. As for the other check, moving it into utf8proc_get_property is probably not a good idea, because that is an inner-loop function in the normalization code.

Also, realize that I got the 3x speed boot in mojibake_islower in part by completely removing all validity checks. Honestly, I don't think that is worth it to get a factor of 3 in a function that is probably not performance-critical anyway. (In what real applications is the inner loop a check of whether some string is lower-case?)

@catawbasam
Copy link
Contributor

"Honestly, I don't think that is worth it to get a factor of 3 in a function that is probably not performance-critical anyway"
Agreed.

@catawbasam
Copy link
Contributor

A draft cut at character category predicates based on libmojibake (as-is) is posted at https://gist.github.com/catawbasam/3ab68615b4c78a5a49b1
[nbviewer link: http://nbviewer.ipython.org/gist/catawbasam/3ab68615b4c78a5a49b1]
The definitions draw from similar functions in Haskell, Go and perl.

The functions are intended to be non-breaking, with one minor exception -- isspace() returns true for U+0085 (newline), U+00A0 (non-breaking space). This follows the Go version.

If the approach looks promising, I could put together a pull request.

[pao: add nbviewer link]

@stevengj
Copy link
Member Author

@catawbasam, this seems reasonable to me.

@StefanKarpinski
Copy link
Member

Seems good although I feel like some investigation of the non-breaking space issue is warranted. Would the desired behavior of that character be precisely to look like a space but not act like one?

@catawbasam
Copy link
Contributor

Yes, it looks like a space (with one exception) but behaves differently.
The 2 contexts where I've seen non-breaking space used are:

  1. In word processors, where you sometimes want a visible space, but you don't want a line break at that space. For example you might use it to control orphaning.
  2. In HTML it can be used to avoid whitespace collapse.

Just reviewed Wikipedia at http://en.wikipedia.org/wiki/Non-breaking_space, and there are several variants of non-breaking space. All are in category ZS, so isspace can be slightly simplified to:

function isspace_moji(c::Char)
    return  c in (' ','\t','\n','\r','\f','\v', NEL) || 
             category_code_assigned(c)==UTF8PROC_CATEGORY_ZS
end

The one non-breakable space that looks marginal is word-joiner U+2060: "The word-joiner does not normally produce any space but prohibits a line break on either side of it." It is in category Zs, so it would be selected as true under the proposed definition. Worth adding a clause to exclude it?

@stevengj
Copy link
Member Author

What do other languages do with U+2060?

@catawbasam
Copy link
Contributor

Oops, made a mistake when checking the code categories -- U+2060 is in Cf: Other, Format,
so there is no issue with it.

julia> int(category_code_assigned(char(0x002060))) 
27

@stevengj
Copy link
Member Author

Closed by #8233.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code needs decision A decision on this change is needed unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

6 participants