Frequently asked questions

How to treat word-initial or word-final graphemes in a special way

Although it may seem a bit hacky, treating word-initial or word-final graphemes differently is straightforward. We mark word boundaries with the usual regular-expression anchors: ^ for the start of a word and $ for the end.

Create the orthography profile:

>>> from segments.tokenizer import Profile
>>> prf = Profile(
...     {'Grapheme': 'th', 'IPA': 'tH'},
...     {'Grapheme': 'c', 'IPA': 'c'},
...     {'Grapheme': '^', 'IPA': None},
...     {'Grapheme': '$', 'IPA': None},
...     {'Grapheme': 'a', 'IPA': 'b'},
...     {'Grapheme': '^a', 'IPA': 'A'})

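Note: We treat word-initial a differently! The grapheme ^a maps to A, while a in any other position maps to b. The bare ^ and $ entries map to None, so the markers themselves do not appear in the output.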

Create the tokenizer and apply it:

>>> from segments.tokenizer import Tokenizer
>>> t = Tokenizer(prf)
>>> t('tha', 'IPA')
'tH b'
>>> t('ath', 'IPA')
'b tH'
>>> t('^ath', 'IPA')
'A tH'
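
The outputs above are what a greedy longest-match lookup would produce: at the start of '^ath', the two-character grapheme ^a is preferred over the one-character ^. The following is a minimal sketch of that idea in plain Python. It is not the segments implementation; the names MAPPING and greedy_transliterate are hypothetical, and the '?' placeholder for unknown characters is an assumption made only for this illustration.

```python
# Illustrative mapping that mirrors the profile above.
# A value of None means "consume the grapheme but emit nothing".
MAPPING = {'th': 'tH', 'c': 'c', '^': None, '$': None, 'a': 'b', '^a': 'A'}


def greedy_transliterate(word, mapping=MAPPING):
    """Hypothetical helper: greedy longest-match segmentation of one word."""
    longest = max(len(grapheme) for grapheme in mapping)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in mapping:
                if mapping[chunk] is not None:
                    out.append(mapping[chunk])
                i += size
                break
        else:
            # Assumption for this sketch: flag unknown characters with '?'.
            out.append('?')
            i += 1
    return ' '.join(out)


print(greedy_transliterate('tha'))    # tH b
print(greedy_transliterate('^ath$'))  # A tH
```

Because ^a is tried before ^, the word-initial rule wins whenever both could apply; a that is not preceded by ^ falls through to the plain a entry.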

Make sure to pass properly marked up words to the tokenizer:

>>> t(' '.join('^' + s + '$' for s in 'tha ath'.split()), 'IPA')
'tH b # A tH'
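
If words are marked up in several places, the join expression above can be wrapped in a small function. This is only a sketch: mark_boundaries is a hypothetical helper name, not part of segments, and it assumes words are separated by whitespace.

```python
def mark_boundaries(text, start='^', end='$'):
    """Hypothetical helper: wrap each whitespace-separated word in boundary markers."""
    return ' '.join(start + word + end for word in text.split())


print(mark_boundaries('tha ath'))  # ^tha$ ^ath$
```

The result can then be passed to the tokenizer in place of the inline expression, for example t(mark_boundaries('tha ath'), 'IPA').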
