Working with Unicode Characters Beyond A to Z (Basic Latin 7-Bit Characters) - Let's Go International! Let's Go World-Wide!

Trivia Quiz: How many letters are in the basic Latin 7-bit alphabet, better known as the English alphabet?

  • (A) 24
  • (B) 26
  • (C) 28
  • (D) Other - Please, tell.

And the answer is ... Let's ask the ruby oracle:

('A'..'Z').to_a.size
#=> 26
('A'..'Z').to_a.join
#=> "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

Let's go international! Let's go global! Let's go world wide!

Let's use unicode characters and start with internationalization (i18n). Let's use the 1500+ football club names from 100+ countries from the open football.db clubs dataset for a real-world sample:

require 'footballdb/clubs'

Club.all.size
#=> 1716

Football club names include: Bayern München · 1. FC Köln · Preußen Münster · FC København · FC Vestsjælland · JJK Jyväskylä · LB Châteauroux · Évian TG FC · Widzew Łódź · Zagłębie Lubin · İnter Bakı PİK · Qarabağ FK · São Paulo FC · Atlético MG · EC Vitória · and many more.
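To see at a glance which names go beyond 7-bit ASCII, Ruby's built-in `String#ascii_only?` helps. Here is a minimal sketch using a few of the names above; the `names` array is only an illustrative sample (not the full dataset) and `Example FC` is a made-up ASCII-only name added for contrast:

```ruby
# Illustrative sample: three names from the list above plus one made-up ASCII-only name.
names = ['Bayern München', '1. FC Köln', 'Widzew Łódź', 'Example FC']

names.each do |name|
  # length counts characters, bytesize counts UTF-8 bytes
  puts "#{name}: ascii_only=#{name.ascii_only?}, chars=#{name.length}, bytes=#{name.bytesize}"
end

# Keep only the names with at least one character beyond 7-bit ASCII.
non_ascii = names.reject { |name| name.ascii_only? }
puts non_ascii.inspect
```

Note that in UTF-8 every character beyond 7-bit ASCII takes two or more bytes, so `bytesize` is larger than `length` for those names.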

Trivia Quiz: Beyond the basic Latin 7-bit alphabet, how many extra characters (with accents or diacritic marks, or joined together into a ligature) appear in the 1500+ club names from 100+ countries?

  • (A) More than 10
  • (B) More than 30
  • (C) More than 60
  • (D) Other - Please, tell.

And the answer is ... Let's ask the ruby oracle and calculate the character frequency / usage:

freq = Hash.new(0)

Club.all.each do |club|
  names = [club.name]+club.alt_names
  names.each do |name|
    # calculate the frequency table of letters, digits, etc.
    name.each_char do |ch|
      next if ch.ord < 128     # skip 7-bit basic latin letters, digits, etc. (codepoints 0..127)
      next if ch =~ /[º]/      # skip some 8-bit special non-letter chars too
      freq[ch] += 1
    end
  end
end

freq.keys.size
#=> 63
freq.keys.sort.join
#=> "ÁÅÇÉÍÖÚßàáâãäåæçèéêëíîñóôõöøúüýþāăąćČčėęğīİıŁłńňőřŚŞşŠšţūźżŽžșț"

Wow. That's 63 more characters (!). Did you know? Yes, there's a capital I with a dot (İ), there's a small i without a dot (ı), and on and on.
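The frequency table can also tell you which of the extra characters show up most often. Here is a self-contained sketch that runs the same counting idea on a small sample; the `names` array is an illustrative subset of the club names mentioned earlier (so the counts say nothing about the full dataset) and `ranked` is just a name picked for this example:

```ruby
# Illustrative sample of club names (not the full dataset).
names = ['Bayern München', '1. FC Köln', 'Preußen Münster', 'FC København',
         'JJK Jyväskylä', 'Widzew Łódź']

freq = Hash.new(0)
names.each do |name|
  name.each_char do |ch|
    next if ch.ascii_only?   # skip 7-bit basic latin letters, digits, etc.
    freq[ch] += 1
  end
end

# Sort by count (highest first), then by character to get a stable order.
ranked = freq.sort_by { |ch, count| [-count, ch] }
ranked.each { |ch, count| puts "#{ch}  #{count}" }
```

Sorting on the `[-count, ch]` pair keeps the output deterministic when several characters share the same count.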

Let's start from the beginning...

Trivia Quiz: What's the official unicode name for Á?

  • (A) LATIN CAPITAL LETTER A WITH GRAVE
  • (B) LATIN CAPITAL LETTER A WITH ACUTE
  • (C) LATIN CAPITAL LETTER A WITH MACRON
  • (D) LATIN CAPITAL LETTER A WITH CIRCUMFLEX

And the answer is ... Let's ask the ruby oracle:

Unicode::Name.of( 'Á' )
#=> "LATIN CAPITAL LETTER A WITH ACUTE"

Wondering where this wonderful Unicode::Name.of helper comes from? Let's thank Jan Lelis, who dreams in Unicode (and is of Idiosyncratic Ruby fame), for the great unicode gem / library family. It includes the unicode-name gem / library, which returns the official Unicode name for a character / codepoint.

require 'unicode/name'

Unicode::Name.of( 'İ' )   # or "\u0130" or "\u{130}" 
#=> "LATIN CAPITAL LETTER I WITH DOT ABOVE"
Unicode::Name.of( 'ı' )   # or "\u0131" or "\u{131}" 
#=> "LATIN SMALL LETTER DOTLESS I"

See janlelis/unicode-name for more documentation on how to use the library / gem.
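Why do the dotted İ and the dotless ı matter in practice? Upcasing and downcasing is where they bite. The following sketch uses only core Ruby (no gems) and assumes Ruby 2.4 or later, where `String#upcase` and `String#downcase` accept the `:turkic` option; the variable names are made up for this example:

```ruby
dotted_capital = "\u0130"    # İ - LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small  = "\u{131}"   # ı - LATIN SMALL LETTER DOTLESS I

# The escape sequences and the literal characters are the same strings.
puts dotted_capital == 'İ'
puts dotless_small  == 'ı'

# Default case mapping pairs I with i.
default_down = 'I'.downcase
default_up   = 'i'.upcase
puts "#{default_down} #{default_up}"

# Turkic case mapping pairs I with ı and i with İ.
turkic_down = 'I'.downcase(:turkic)
turkic_up   = 'i'.upcase(:turkic)
puts "#{turkic_down} #{turkic_up}"
```

So if you normalize Turkish or Azerbaijani club names such as İnter Bakı PİK by changing case, the default mapping and the `:turkic` mapping give different results.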

To wrap up - let's print all Unicode names and codepoints for the 63 characters beyond the basic 7-bit Latin alphabet:

freq.keys.sort.each do |ch|
  name = Unicode::Name.of( ch )
  puts "#{ch} - U+#{'%04X' % ch.ord} (#{ch.ord}) - #{name}"
end

resulting in:

Á - U+00C1 (193) - LATIN CAPITAL LETTER A WITH ACUTE
Å - U+00C5 (197) - LATIN CAPITAL LETTER A WITH RING ABOVE
Ç - U+00C7 (199) - LATIN CAPITAL LETTER C WITH CEDILLA
É - U+00C9 (201) - LATIN CAPITAL LETTER E WITH ACUTE
Í - U+00CD (205) - LATIN CAPITAL LETTER I WITH ACUTE
Ö - U+00D6 (214) - LATIN CAPITAL LETTER O WITH DIAERESIS
Ú - U+00DA (218) - LATIN CAPITAL LETTER U WITH ACUTE
ß - U+00DF (223) - LATIN SMALL LETTER SHARP S
à - U+00E0 (224) - LATIN SMALL LETTER A WITH GRAVE
á - U+00E1 (225) - LATIN SMALL LETTER A WITH ACUTE
â - U+00E2 (226) - LATIN SMALL LETTER A WITH CIRCUMFLEX
ã - U+00E3 (227) - LATIN SMALL LETTER A WITH TILDE
ä - U+00E4 (228) - LATIN SMALL LETTER A WITH DIAERESIS
å - U+00E5 (229) - LATIN SMALL LETTER A WITH RING ABOVE
æ - U+00E6 (230) - LATIN SMALL LETTER AE
ç - U+00E7 (231) - LATIN SMALL LETTER C WITH CEDILLA
è - U+00E8 (232) - LATIN SMALL LETTER E WITH GRAVE
é - U+00E9 (233) - LATIN SMALL LETTER E WITH ACUTE
ê - U+00EA (234) - LATIN SMALL LETTER E WITH CIRCUMFLEX
ë - U+00EB (235) - LATIN SMALL LETTER E WITH DIAERESIS
í - U+00ED (237) - LATIN SMALL LETTER I WITH ACUTE
î - U+00EE (238) - LATIN SMALL LETTER I WITH CIRCUMFLEX
ñ - U+00F1 (241) - LATIN SMALL LETTER N WITH TILDE
ó - U+00F3 (243) - LATIN SMALL LETTER O WITH ACUTE
ô - U+00F4 (244) - LATIN SMALL LETTER O WITH CIRCUMFLEX
õ - U+00F5 (245) - LATIN SMALL LETTER O WITH TILDE
ö - U+00F6 (246) - LATIN SMALL LETTER O WITH DIAERESIS
ø - U+00F8 (248) - LATIN SMALL LETTER O WITH STROKE
ú - U+00FA (250) - LATIN SMALL LETTER U WITH ACUTE
ü - U+00FC (252) - LATIN SMALL LETTER U WITH DIAERESIS
ý - U+00FD (253) - LATIN SMALL LETTER Y WITH ACUTE
þ - U+00FE (254) - LATIN SMALL LETTER THORN
ā - U+0101 (257) - LATIN SMALL LETTER A WITH MACRON
ă - U+0103 (259) - LATIN SMALL LETTER A WITH BREVE
ą - U+0105 (261) - LATIN SMALL LETTER A WITH OGONEK
ć - U+0107 (263) - LATIN SMALL LETTER C WITH ACUTE
Č - U+010C (268) - LATIN CAPITAL LETTER C WITH CARON
č - U+010D (269) - LATIN SMALL LETTER C WITH CARON
ė - U+0117 (279) - LATIN SMALL LETTER E WITH DOT ABOVE
ę - U+0119 (281) - LATIN SMALL LETTER E WITH OGONEK
ğ - U+011F (287) - LATIN SMALL LETTER G WITH BREVE
ī - U+012B (299) - LATIN SMALL LETTER I WITH MACRON
İ - U+0130 (304) - LATIN CAPITAL LETTER I WITH DOT ABOVE
ı - U+0131 (305) - LATIN SMALL LETTER DOTLESS I
Ł - U+0141 (321) - LATIN CAPITAL LETTER L WITH STROKE
ł - U+0142 (322) - LATIN SMALL LETTER L WITH STROKE
ń - U+0144 (324) - LATIN SMALL LETTER N WITH ACUTE
ň - U+0148 (328) - LATIN SMALL LETTER N WITH CARON
ő - U+0151 (337) - LATIN SMALL LETTER O WITH DOUBLE ACUTE
ř - U+0159 (345) - LATIN SMALL LETTER R WITH CARON
Ś - U+015A (346) - LATIN CAPITAL LETTER S WITH ACUTE
Ş - U+015E (350) - LATIN CAPITAL LETTER S WITH CEDILLA
ş - U+015F (351) - LATIN SMALL LETTER S WITH CEDILLA
Š - U+0160 (352) - LATIN CAPITAL LETTER S WITH CARON
š - U+0161 (353) - LATIN SMALL LETTER S WITH CARON
ţ - U+0163 (355) - LATIN SMALL LETTER T WITH CEDILLA
ū - U+016B (363) - LATIN SMALL LETTER U WITH MACRON
ź - U+017A (378) - LATIN SMALL LETTER Z WITH ACUTE
ż - U+017C (380) - LATIN SMALL LETTER Z WITH DOT ABOVE
Ž - U+017D (381) - LATIN CAPITAL LETTER Z WITH CARON
ž - U+017E (382) - LATIN SMALL LETTER Z WITH CARON
ș - U+0219 (537) - LATIN SMALL LETTER S WITH COMMA BELOW
ț - U+021B (539) - LATIN SMALL LETTER T WITH COMMA BELOW
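If you look at the codepoints in the listing, the 63 characters fall into three ranges. Here is a small sketch that groups them by Unicode block; the block names and ranges are taken from the Unicode code charts, while the `blocks` hash and the `by_block` name are assumptions made up for this example:

```ruby
# The 63 characters from the frequency table above.
chars = 'ÁÅÇÉÍÖÚßàáâãäåæçèéêëíîñóôõöøúüýþāăąćČčėęğīİıŁłńňőřŚŞşŠšţūźżŽžșț'.chars

# Unicode blocks covering the listing (codepoint ranges).
blocks = {
  'Latin-1 Supplement' => 0x0080..0x00FF,
  'Latin Extended-A'   => 0x0100..0x017F,
  'Latin Extended-B'   => 0x0180..0x024F
}

# Group each character under the first block whose range covers its codepoint.
by_block = chars.group_by do |ch|
  found = blocks.find { |_name, range| range.cover?(ch.ord) }
  found ? found.first : 'Other'
end

by_block.each do |block, members|
  puts "#{block} (#{members.size}): #{members.join}"
end
```

Everything up to þ (U+00FE) fits into a single byte in the old Latin-1 (ISO 8859-1) encoding; the rest needs Unicode or a different legacy encoding.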

Welcome to the Unicode wonderland and to quote Jan Lelis: "Ruby ♡ Unicode".

Appendix:

For more about Ruby and Unicode, see the great Ruby ♡ Unicode series by Jan Lelis at the Idiosyncratic Ruby website:

  • Ruby has Character - Ruby comes with good support for Unicode-related features. Read on if you want to learn more about important Unicode fundamentals and how to use them in Ruby...
  • Proper Unicoding - Ruby's Regexp engine has a powerful feature built in: It can match for Unicode character properties. But what exactly are properties you can match for?
  • Regex with Class - Ruby's regex engine defines a lot of shortcut character classes. Besides the common meta characters (\w, etc.), there are also the POSIX-style expressions and the Unicode property syntax. This is an overview of all character classes...
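To get a taste of the Unicode property and POSIX-style syntax mentioned in the articles above, here is a minimal sketch that tries a few patterns on club names from this page. It uses only core Ruby (`match?` assumes Ruby 2.4 or later); the variable names are made up for this example:

```ruby
# Uppercase letters (general category Lu), including non-ASCII ones such as İ.
upper = 'İnter Bakı PİK'.scan(/\p{Lu}/)
puts upper.inspect

# Runs of letters (\p{L}) - digits, dots and spaces are left out.
words = '1. FC Köln'.scan(/\p{L}+/)
puts words.inspect

# POSIX-style bracket expression: every character that is not 7-bit ASCII.
extras = 'Zagłębie Lubin'.scan(/[^[:ascii:]]/)
puts extras.inspect

# Script property: does the name consist of Latin-script letters and spaces only?
latin_only = 'São Paulo FC'.match?(/\A[\p{Latin} ]+\z/)
puts latin_only
```

The `[^[:ascii:]]` pattern is a regex alternative to the `ch.ord` check used in the frequency count.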