No-nonsense simple transliteration between writing systems, mostly of Semitic origin

Gimeltra

Gimeltra is a Python 3.9+ tool for simple transliteration between 20+ writing systems, mostly of Semitic origin.

Gimeltra performs simplified abjad-only transliteration, and is primarily intended for transliterating simple texts from modern to ancient scripts. It uses a non-standard romanization scheme. Arabic, Greek or Hebrew letters outside the basic consonant set are not transliterated.

Installation

python3 -m pip install --upgrade git+https://github.com/twardoch/gimeltra

Usage

Command-line

$ gimeltrapy -h
usage: gimeltrapy [-h] [-t TEXT] [-i FILE] [-s SCRIPT] [-o SCRIPT] [--stats] [-v] [-V]

optional arguments:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT
  -i FILE, --input FILE
  -s SCRIPT, --script SCRIPT
                        Input script as ISO 15924 code
  -o SCRIPT, --to-script SCRIPT
                        Output script as ISO 15924 code
  --stats               List supported scripts
  -v, --verbose         -v show progress, -vv show debug
  -V, --version         show version and exit

Examples:

$ gimeltrapy -t "لرموز"
lrmwz
$ gimeltrapy -t "لرموز" -o Hebr
לרמוז
$ gimeltrapy -t "لرموز" -o Narb
𐪁𐪇𐪃𐪅𐪘
$ gimeltrapy -t "لرموز" -o Sogo
𐼌𐼘𐼍𐼇𐼈

Or from stdin / via piping:

$ echo لرموز | gimeltrapy -o Grek
λρμυζ

Python

from gimeltra.gimeltra import Transliterator
tr = Transliterator()
print(tr.tr("لرموز", sc='Arab', to_sc='Hebr'))

Less efficient:

from gimeltra import tr
print(tr("لرموز"))

Supported scripts / tech background

Gimeltra supports the following 25 scripts (including Latin), identified by their ISO 15924 codes:

  • Latn (Latin)
  • Arab (Arabic)
  • Ethi (Ethiopic)
  • Armi (Imperial Aramaic)
  • Brah (Brahmi)
  • Chrs (Chorasmian)
  • Egyp (Egyptian hieroglyphs)
  • Elym (Elymaic)
  • Grek (Greek)
  • Hatr (Hatran)
  • Hebr (Hebrew)
  • Mani (Manichaean)
  • Narb (Old North Arabian)
  • Nbat (Nabataean)
  • Palm (Palmyrene)
  • Phli (Inscriptional Pahlavi)
  • Phlp (Psalter Pahlavi)
  • Phnx (Phoenician)
  • Prti (Inscriptional Parthian)
  • Samr (Samaritan)
  • Sarb (Old South Arabian)
  • Sogd (Sogdian)
  • Sogo (Old Sogdian)
  • Syrc (Syriac)
  • Ugar (Ugaritic)

Conversion steps

The conversion uses the .json file derived from the .tsv table. The selection of the conversion rules is based on ISO 15924 script codes. The code mimics a simple OpenType glyph processing model, but with Unicode characters:

1. Preprocessing with a ccmp table

  1. Split ligatures into single letters.
  2. Decompose into Unicode NFD and drop the marks.

2. Character replacement in the csub table

  1. Try direct source-target script mapping.
  2. If that does not exist, convert into Latn.
  3. Try to convert from Latn to the target script.
  4. If that’s not available, fall back from Latn to <Latn and try to convert to the target script.

3. Postprocessing

  1. Replace letters by their contextual final forms using the fina table.
  2. Replace series of letters by Unicode ligatures using the liga table.

Characters that aren’t covered by the mappings are passed through unchanged. This may change in the future (or there will be an option to keep non-letters but drop letters and marks).
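The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the table names mirror the stage names (ccmp, csub, fina, liga), but the dictionary layout, the function names and the tiny data excerpts are hypothetical, final forms are assumed to apply at word ends, and the <Latn fallback of step 2.4 is left out for brevity.

```python
import unicodedata

# Hypothetical miniature tables; the real data lives in the project's .json file.
CCMP = {"Arab": {"\ufefb": "\u0644\u0627"}}  # lam-alef ligature -> lam + alef
CSUB = {
    # seen, lam, alef, mem -> Latin
    ("Arab", "Latn"): {"\u0633": "s", "\u0644": "l", "\u0627": "ʾ", "\u0645": "m"},
    # Latin -> samekh, lamed, alef, mem
    ("Latn", "Hebr"): {"s": "\u05e1", "l": "\u05dc", "ʾ": "\u05d0", "m": "\u05de"},
}
FINA = {"Hebr": {"\u05de": "\u05dd"}}  # mem -> final mem
LIGA = {"Sogo": {"𐼓𐼌": "𐼧"}}  # ligature row from the data table


def preprocess(text, sc):
    """Step 1 (ccmp): split ligatures, decompose to NFD, drop combining marks."""
    for ligature, letters in CCMP.get(sc, {}).items():
        text = text.replace(ligature, letters)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


def substitute(text, sc, to_sc):
    """Step 2 (csub): use a direct mapping if present, else pivot through Latn."""
    direct = CSUB.get((sc, to_sc))
    if direct is not None:
        return "".join(direct.get(ch, ch) for ch in text)
    to_latn = CSUB.get((sc, "Latn"), {})
    from_latn = CSUB.get(("Latn", to_sc), {})
    # Unmapped characters pass through unchanged at every stage.
    latn = "".join(to_latn.get(ch, ch) for ch in text)
    return "".join(from_latn.get(ch, ch) for ch in latn)


def postprocess(text, to_sc):
    """Step 3: final forms (fina) at word ends, then ligatures (liga)."""
    fina = FINA.get(to_sc, {})
    chars = list(text)
    for i, ch in enumerate(chars):
        word_ends_here = i + 1 == len(chars) or not chars[i + 1].isalpha()
        if word_ends_here and ch in fina:
            chars[i] = fina[ch]
    text = "".join(chars)
    for letters, ligature in LIGA.get(to_sc, {}).items():
        text = text.replace(letters, ligature)
    return text


def transliterate(text, sc, to_sc):
    return postprocess(substitute(preprocess(text, sc), sc, to_sc), to_sc)


# Vocalized Arabic "salām" followed by a digit: the vowel marks are dropped,
# the last mem becomes a final mem, and the digit passes through.
print(transliterate("\u0633\u064e\u0644\u064e\u0627\u0645 1", "Arab", "Hebr"))
```

Keeping each stage as a separate function makes it easy to inspect the intermediate Latn string when a mapping looks wrong.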

Data table

The table below exists in .numbers and .tsv formats. The .numbers file is the source, which I export to .tsv, and then I update the .json, which the transliterator uses.
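As a rough illustration of the .tsv to .json step, the sketch below reads a tab-separated excerpt and groups the raw cells by script code. The JSON layout (`{script: {Latn letter: raw cell}}`) and the function name `tsv_to_json` are assumptions made for this example; the project's real .json structure may differ. The two sample rows follow the l and m rows of the table, restricted to three script columns.

```python
import csv
import io
import json

# Excerpt in the assumed shape of the exported .tsv: a header of column names,
# then one row per letter. Arabic and Hebrew cells are written as escapes.
TSV = (
    "Latn\t<Latn\tName\tArab\tGrek\tHebr\n"
    "l\t\tLamd\t\u0644\tλ|<Λ\t\u05dc\n"
    "m\t\tMem\t\u0645\tμ|<Μ\t\u05de|~\u05dd\n"
)

META_COLUMNS = {"Latn", "<Latn", "Name"}


def tsv_to_json(tsv_text):
    """Group raw cells as {script_code: {latn_letter: raw_cell}} and serialize."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    data = {}
    for row in reader:
        for column, cell in row.items():
            # Skip metadata columns, stray extra fields and empty cells.
            if column is None or column in META_COLUMNS or not cell:
                continue
            data.setdefault(column, {})[row["Latn"]] = cell
    return json.dumps(data, ensure_ascii=False, indent=2)


print(tsv_to_json(TSV))
```

The cells are stored unparsed here, so the prefix conventions of the table would still need to be interpreted in a later step.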

There are some simple conventions in the table:

  • | separates alternate versions of a character
  • < prefix means that we should only convert from this character but not to it
  • > prefix means that we should only convert to this character but not from it
  • ~ prefix indicates that this is a final form
  • % separates the from and to strings of a character ligature

(Keep in mind that if the characters in the table are RTL, the browser renders the entire cell as RTL and changes > to < and vice versa 😀 )
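The conventions above are simple enough to parse with plain string operations. The following sketch shows one way to read a single cell. The function name `parse_cell` and the returned dictionary layout are hypothetical, and whether a final form should also count as a convert-from character is left open here.

```python
def parse_cell(cell):
    """Split one table cell into its roles according to the prefix conventions."""
    parsed = {"from": [], "to": [], "final": [], "liga": {}}
    for alt in cell.split("|"):  # | separates alternate versions
        if not alt:
            continue
        if "%" in alt:  # % separates the from and to strings of a ligature
            source, target = alt.split("%", 1)
            parsed["liga"][source] = target
        elif alt.startswith("<"):  # convert from this character only
            parsed["from"].append(alt[1:])
        elif alt.startswith(">"):  # convert to this character only
            parsed["to"].append(alt[1:])
        elif alt.startswith("~"):  # final form
            parsed["final"].append(alt[1:])
        else:  # no prefix: usable in both directions
            parsed["from"].append(alt)
            parsed["to"].append(alt)
    return parsed


# Greek cells from the Samekh and Bet rows, and the ligature row of the table.
print(parse_cell("σ|~ς|<Σ"))
print(parse_cell(">β|<Β"))
print(parse_cell("𐼓𐼌%𐼧"))
```

For the Samekh cell this yields σ in both directions, Σ as a source-only character, and ς as the final form.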

The Latn column serves as the intermediary (all conversions are done from the source script through Latn to the target script). The column contains some characters that have equivalents only in some scripts. This allows less lossy conversion between, say, Hebrew and Arabic or Ethiopic and Old South Arabian.

The <Latn column provides fallback Latin characters if the target script does not have an equivalent to the Latn character. This gives lossier but still plausible conversion.
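A minimal sketch of this fallback, assuming the data has already been reduced to two plain dictionaries: one from Latn to the target script and one from Latn to <Latn. The names `LATN_TO_PHNX`, `LATN_FALLBACK` and `latn_to_target` are hypothetical, and the entries reflect one reading of the v, j and f rows and of the Phoenician column of the table.

```python
# Latn -> Phoenician excerpt (BET, GAML, PE), written as escapes.
LATN_TO_PHNX = {
    "b": "\U00010901",
    "g": "\U00010902",
    "p": "\U00010910",
}

# <Latn excerpt: simpler Latin letters to try when the target has no equivalent.
LATN_FALLBACK = {"v": "b", "j": "g", "f": "p"}


def latn_to_target(text, table, fallback):
    """Map Latn letters to the target script, trying the <Latn letter second."""
    out = []
    for ch in text:
        if ch in table:
            out.append(table[ch])  # direct equivalent exists
        elif fallback.get(ch) in table:
            out.append(table[fallback[ch]])  # lossier equivalent via <Latn
        else:
            out.append(ch)  # not covered: pass through
    return "".join(out)


# "v" and "j" have no Phoenician letter here, so they fall back to "b" and "g".
print(latn_to_target("bvgj", LATN_TO_PHNX, LATN_FALLBACK))
```

With these excerpts, "v" and "b" both end up as the Phoenician BET, which is the lossy but plausible result described above.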

Latn <Latn Name Arab Ethi Armi Brah Chrs Egyp Elym Grek Hatr Hebr Mani Narb Nbat Palm Phli Phlp Phnx Prti Samr Sarb Sogd Sogo Syrc Ugar
ʾ Aleph ا 𐡀 𑀅 𐾰|<𐾱 𓃾 𐿠 α|<Α 𐣠 א 𐫀 𐪑 𐢁|~𐢀 𐡠 𐭠 𐮀 𐤀 𐭀 𐩱 𐼰 𐼀|~𐼁 ܐ 𐎀
b Bet ب 𐡁 𑀩 𐾲 𓉐 𐿡 >β|<Β 𐣡 𐫁 𐪈 𐢃|~𐢂 𐡡 𐭡 𐮁 𐤁 𐭁 𐩨 𐼱 𐼂|~𐼃 ܒ 𐎁
g Gimel غ 𐡂 𑀕 𐾳 𓌙 𐿢 γ|<Γ 𐣢 ג 𐫃 𐪔 𐢄 𐡢 𐭢 𐮂 𐤂 𐭂 𐩴 𐼲 𐼄 ܓ|<ܔ 𐎂
d Daleth د 𐡃 𑀥 𐾴 𓇯 𐿣 δ|<Δ 𐣣 ד 𐫅 𐪕 𐢅 𐡣 𐭣 𐮃 𐤃 𐭃 𐩵 𐼹 𐼌 ܕ|<ܕ݂ 𐎄
h He ه 𐡄 𑀳 𐾵 𓀠 𐿤 ε|<Ε 𐣤 ה 𐫆 𐪀 𐢇|~𐢆 𐡤 𐭤 𐮄 𐤄 𐭄 𐩠 𐼳 𐼆|~𐼅 ܗ 𐎅
w Waw و 𐡅 𑀯 𐾶|<𐾷 𓏲 𐿥 υ|<Υ 𐣥 ו 𐫇 𐪅 𐢈 𐡥 >𐭥 >𐮅 𐤅 𐭅 𐩥 𐼴 𐼇 ܘ 𐎆
z Zayin ز 𐡆 𑀚 𐾸 𓏭 𐿦 ζ|<Ζ 𐣦 ז 𐫉 𐪘 𐢉 𐡦 𐭦 𐮆 𐤆 𐭆 >𐩹 𐼵 𐼈 ܙ 𐎇
Het ح 𐡇 𑀖 𐾹 𓉗 𐿧 η|<Η 𐣧 ח 𐫍 𐪂 𐢊 𐡧 𐭧 𐮇 𐤇 𐭇 𐩢 𐼶 𐼉 ܚ|<ܚ݂ 𐎈
Tet ط 𐡈 𑀣 >𐿄 𓄤 𐿨 θ|<Θ 𐣨 ט 𐫎 𐪉 𐢋 𐡨 𐭨 >𐮑 𐤈 𐭈 𐩷 >𐽃 >𐼔 ܛ|<ܜ 𐎉
y Yod ي 𐡉 𑀬 𐾺 𓂝 𐿩 ι|<Ι 𐣩 י 𐫏 𐪚 𐢍|~𐢌 𐡩 𐭩 𐮈 𐤉 𐭉 𐩺 𐼷 𐼊 ܝ 𐎊
k Kaf ك 𐡊 𑀓 𐾻 𓂧 𐿪 κ|<Κ 𐣪 כ|~ך 𐫐 𐪋 𐢏|~𐢎 𐡪 𐭪 𐮉 𐤊 𐭊 𐩫 𐼸 𐼋 ܟ|<ܟ݂ 𐎋
l Lamd ل 𐡋 𑀮 𐾼 𓌅 𐿫 λ|<Λ 𐣫 ל 𐫓 𐪁 𐢑|~𐢐 𐡫 𐭫 𐮊 𐤋 𐭋 𐩡 𐽄 >𐼌 ܠ 𐎍
m Mem م 𐡌 𑀫 𐾽 𓈖 𐿬 μ|<Μ 𐣬 מ|~ם 𐫖 𐪃 𐢓|~𐢒 𐡬 𐭬 𐮋 𐤌 𐭌 𐩣 𐼺 𐼍 ܡ 𐎎
n Nun ن 𐡍 𑀦 𐾾 𓆓 𐿭 ν|<Ν 𐣭 נ|~ן 𐫗 𐪌 𐢕|~𐢔 𐡭|<𐡮 𐭭 𐮌 𐤍 𐭍 𐩬 𐼻 𐼎|~𐼏 ܢܢ|<ܢ 𐎐
s Samekh س 𐡎 𑀱 𐾿 𓊽 𐿮 σ|~ς|<Σ 𐣮 ס 𐫘 𐪊 𐢖 𐡯 𐭮 𐮍 𐤎 𐭎 𐩪 𐼼 𐼑 ܣ 𐎒
ʿ Ain ع 𐡏 𑀏 𐿀 𓁹 𐿯 ο|<ω|<Ο|<Ω 𐣯 ע 𐫙 𐪒 𐢗 𐡰 𐭥 𐮅 𐤏 𐭏 𐩲 𐼽 𐼓|<𐼒 ܥ 𐎓
p Pe پ 𐡐 𑀧 𐿁 𓂋 𐿰 π|<Π 𐣰 פ|~ף 𐫛 >𐪐 𐢘 𐡱 𐭯 𐮎 𐤐 𐭐 >ࠐ >𐩰 𐼾 𐼔 ܦ 𐎔
Sade ض 𐡑 𑀘 >𐾿 𓇑 𐿱 ϻ|<Ϻ 𐣱 צ|~ץ 𐫝 𐪎 𐢙 𐡲 𐭰 𐮏 𐤑 𐭑 𐩮 𐼿 𐼕|~𐼖 ܨ 𐎕
q Qof ق 𐡒 𑀔 >𐾻 𓃻 𐿲 ϙ|<Ϙ 𐣲 ק 𐫞 𐪄 𐢚 𐡳 𐭬 𐮋 𐤒 𐭒 𐩤 >𐼸 >𐼋 ܩ 𐎖
r Resh ر 𐡓 𑀭 𐿂 𓁶 𐿳 ρ|<Ρ 𐣣 ר 𐫡 𐪇 𐢛 𐡴 >𐭥 >𐮅 𐤓 𐭓 𐩧 𐽀 𐼘 ܪ 𐎗
š Shin ش 𐡔 𑀰 𐿃 𓌓 𐿴 ξ|<Ξ 𐣴 ש 𐫢 𐪏 𐢝|~𐢜 𐡵 𐭱 𐮐 𐤔 𐭔 𐩦 𐽁 𐼙 ܫ 𐎌|<𐎝
t Tau ت 𐡕 𑀢 𐿄 𓏴 𐿵 τ|<Τ 𐣵 ת 𐫤 𐪗 𐢞 𐡶 𐭲 𐮑 𐤕 𐭕 𐩩 𐽂 𐼚|~𐼛 ܬ 𐎚
d ض 𐪓
f p ف φ|<Φ פּ|~ףּ 𐪐 𐩰 𐽃 >𐼔
ġ h 𐪖 𐎙
d ذ 𐩹
k خ כּ|~ךּ
𐩭
j g ج ג׳
v b β ב 𐎜
č چ צ׳|~ץ׳
t ث 𐎘
z ظ 𐎑
ž z ז׳
p
𐼓𐼌%𐼧

License

Copyright © 2021 Adam Twardoch, MIT license

Other projects of interest

  • Wiktra: Python transliterator for 100+ scripts and 500+ languages, mostly into Latin but in some cases across other scripts. Uses the Wiktionary transliteration modules written in Lua. Requires a Lua runtime.
  • Aksharamukha: Python (plus JS and web) transliterator within the Indic cultural sphere, for 94 scripts and 8 romanization methods. Converts between scripts.