Frequently asked questions

How to treat word-initial or word-final graphemes in a special way

Although it may seem a bit hacky, treating word-initial or word-final graphemes differently is straightforward: mark the word boundaries with the common regular expression anchors, ^ for the start and $ for the end, and list the marked graphemes in the profile.

  1. Create the orthography profile:
>>> from segments.tokenizer import Profile
>>> prf = Profile(
...     {'Grapheme': 'th', 'IPA': 'tH'},
...     {'Grapheme': 'c', 'IPA': 'c'},
...     {'Grapheme': '^', 'IPA': None},
...     {'Grapheme': '$', 'IPA': None},
...     {'Grapheme': 'a', 'IPA': 'b'},
...     {'Grapheme': '^a', 'IPA': 'A'})

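Note: We treat word-initial a differently! The grapheme ^a maps to A, whereas a anywhere else maps to b. The bare markers ^ and $ map to None, so they leave no trace in the output.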
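To see why the profile behaves this way, it can help to picture the tokenizer as a greedy longest-match over the profile's graphemes. The sketch below is a stand-alone illustration under that assumption, not the library's implementation; `MAPPING` and `transliterate` are hypothetical names introduced only for this example.

```python
# Stand-alone illustration: MAPPING and transliterate are hypothetical
# names and are NOT part of the segments API.
MAPPING = {
    "th": "tH",
    "c": "c",
    "^": None,   # boundary markers produce no output of their own
    "$": None,
    "a": "b",
    "^a": "A",   # longer than "a", so it wins at the start of a word
}


def transliterate(word, mapping=MAPPING):
    """Greedy longest-match transliteration of a single word."""
    longest = max(len(grapheme) for grapheme in mapping)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in mapping:
                if mapping[chunk] is not None:
                    out.append(mapping[chunk])
                i += size
                break
        else:
            raise ValueError(f"no grapheme matches at position {i}: {word[i]!r}")
    return " ".join(out)


print(transliterate("^ath$"))  # A tH
print(transliterate("^tha$"))  # tH b
print(transliterate("ath"))    # b tH
```

Because `^a` is two characters long, it is tried before the bare `^` and the bare `a`, which is what makes the word-initial rule take precedence in this sketch.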

  2. Create the tokenizer:
>>> from segments.tokenizer import Tokenizer
>>> t = Tokenizer(prf)
>>> t('tha', 'IPA')
'tH b'
>>> t('ath', 'IPA')
'b tH'
>>> t('^ath', 'IPA')
'A tH'
  3. Make sure to pass properly marked up words to the tokenizer:
>>> t(' '.join('^' + s + '$' for s in 'tha ath'.split()), 'IPA')
'tH b # A tH'
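
If words are marked up in more than one place, the one-liner above could be wrapped in a small helper. This is a minimal sketch; `mark_boundaries` is a hypothetical name, not something the library provides, and it assumes words are separated by whitespace.

```python
def mark_boundaries(text, start="^", end="$"):
    """Wrap every whitespace-separated word in start and end markers.

    Hypothetical convenience helper, not part of the segments API.
    """
    return " ".join(f"{start}{word}{end}" for word in text.split())


print(mark_boundaries("tha ath"))  # ^tha$ ^ath$
```

The returned string can then be passed to the tokenizer in place of the inline expression, for example `t(mark_boundaries('tha ath'), 'IPA')`.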