implement uc and lc in C for UTF-8 strings #99

Closed
StefanKarpinski opened this issue Jul 8, 2011 · 6 comments
Labels: performance (Must go faster)

Comments

@StefanKarpinski
Member

A single function may be able to handle both uc and lc, depending on the details of the UTF-8 encoding.

@ghost ghost assigned StefanKarpinski Jul 8, 2011
@StefanKarpinski
Member Author

The best resource I've found for this is: http://developer.gnome.org/glib/2.29/glib-Unicode-Manipulation.html

@JeffBezanson
Member

Are these towupper and towlower?

@StefanKarpinski
Member Author

For characters, yes, and that's how they're implemented. For ASCII strings, this is pretty easy. For UTF-8 strings it's more complicated: in principle, you have to decode each character, call towupper/towlower on it, and append the result to a new string. That's a really slow way to do it, and I suspect UTF-8 may well have been designed so that it can be done much faster. The glib reference is disheartening, though, since it says:

The exact manner that this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)

For now, I think I'm just going to keep the TransformedString approach, which is inefficient, but works.
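
For reference, the slow path described above (decode each character, map it, append it to a new string) might look roughly like this in C. This is only a sketch under a few assumptions: `utf8_decode`, `utf8_encode`, `utf8_map` and `upper_cp` are made-up names, the input is assumed to be valid NUL-terminated UTF-8, and `towupper` results for non-ASCII input depend on the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wctype.h>

/* Decode one UTF-8 sequence at s (assumed valid); returns bytes consumed. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Encode cp as UTF-8 into out (room for 4 bytes); returns bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical per-character mapping: defer to the C library's towupper. */
uint32_t upper_cp(uint32_t cp)
{
    return (uint32_t)towupper((wint_t)cp);
}

/* Decode, map, and re-encode every character of src into dst.
   Returns bytes written (excluding the NUL), or (size_t)-1 if dst is too small. */
size_t utf8_map(const char *src, char *dst, size_t dst_cap,
                uint32_t (*map)(uint32_t))
{
    assert(src != NULL && dst != NULL && map != NULL);
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    while (*p != 0) {
        uint32_t cp;
        unsigned char buf[4];
        p += utf8_decode(p, &cp);
        size_t len = utf8_encode(map(cp), buf);
        if (n + len + 1 > dst_cap) return (size_t)-1; /* keep room for the NUL */
        memcpy(dst + n, buf, len);
        n += len;
    }
    if (n + 1 > dst_cap) return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

Note that a one-code-point-in, one-code-point-out mapping like this cannot express the ß to SS expansion that glib describes; that would need a mapping that can return more than one character.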

@StefanKarpinski
Member Author

The German ß is an excellent example, although not for the reasons glib gives. It shouldn't be capitalized as two letters, but rather as a different Unicode character, U+1E9E (ẞ), whose UTF-8 encoding is one byte longer:

julia> sizeof("\u00DF")  # size in bytes of the UTF-8 encoding
2

julia> sizeof("\u1E9E")
3

That makes it really hard to write a fast, single-pass, in-place uppercasing function. We can still get most of the benefit by having something fast for the ASCII string case though.
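
To make the byte-growth problem concrete, here is a small C sketch that computes how many bytes a code point takes in UTF-8 and checks whether a given case mapping could be written back over the original bytes. The function names `utf8_encoded_len` and `case_map_fits_in_place` are hypothetical, chosen just for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes needed to encode a Unicode code point in UTF-8. */
size_t utf8_encoded_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* A single-pass in-place rewrite only works when the mapped character is
   no longer than the original; a longer result would overwrite bytes of
   the following character before they have been read. */
bool case_map_fits_in_place(uint32_t from, uint32_t to)
{
    return utf8_encoded_len(to) <= utf8_encoded_len(from);
}
```

With these, `utf8_encoded_len(0x00DF)` is 2 and `utf8_encoded_len(0x1E9E)` is 3, so `case_map_fits_in_place(0x00DF, 0x1E9E)` is false, whereas any ASCII letter maps to another one-byte character and fits.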

@StefanKarpinski
Member Author

Commit 3f51323 implements fast, copying ucfirst, lcfirst, uc, and lc for ASCIIString objects. UTF8String objects still use the slow TransformedString approach, but for now, that's fine.
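
For anyone reading later, a fast copying ASCII version can be sketched in C along these lines. This is not the code from the commit, just an illustration with made-up names (`ascii_uc`, `ascii_lc`, `ascii_ucfirst`, `ascii_lcfirst`); it assumes a NUL-terminated input and returns a malloc'd copy that the caller frees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* ASCII-only case mappings; every other byte value is returned unchanged. */
static unsigned char ascii_up(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 32) : c;
}

static unsigned char ascii_down(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 32) : c;
}

/* Copy s and apply f to at most `limit` leading bytes of the copy.
   Returns NULL if allocation fails. */
static char *ascii_map_copy(const char *s, unsigned char (*f)(unsigned char),
                            size_t limit)
{
    assert(s != NULL);
    size_t n = strlen(s);
    char *out = malloc(n + 1);
    if (out == NULL) return NULL;
    memcpy(out, s, n + 1); /* includes the terminating NUL */
    for (size_t i = 0; i < n && i < limit; i++)
        out[i] = (char)f((unsigned char)out[i]);
    return out;
}

char *ascii_uc(const char *s)      { return ascii_map_copy(s, ascii_up, SIZE_MAX); }
char *ascii_lc(const char *s)      { return ascii_map_copy(s, ascii_down, SIZE_MAX); }
char *ascii_ucfirst(const char *s) { return ascii_map_copy(s, ascii_up, 1); }
char *ascii_lcfirst(const char *s) { return ascii_map_copy(s, ascii_down, 1); }
```

Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, a loop like this leaves non-ASCII characters untouched, so it is safe (though incomplete) on UTF-8 input as well.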

@StefanKarpinski
Member Author

Closed by 7010db8.

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
Add non-float tryparse compat
KristofferC added a commit that referenced this issue Feb 9, 2018
cmcaine pushed a commit to cmcaine/julia that referenced this issue Sep 24, 2020
fredrikekre added a commit that referenced this issue Feb 26, 2021
$ git log --pretty=oneline --abbrev=commit 2b4bed9..6bb8306
6bb83068bd796c4890baaeb39628ff79a4979374 Stop the grace timer iff adding first handle (fix #99) (#102)
af6864d8872247faf2a402d6b2baca5cb74ab96e fix ssh_key_pass bug (fix #91) (#100)
vchuravy pushed a commit to JuliaPackaging/LazyArtifacts.jl that referenced this issue Oct 2, 2023