NB Uttale: Pronunciation lexicon with dialectal variation for Norwegian

This repo contains a script and data to generate pronunciation lexica with dialect variation for spoken Norwegian, with wordforms in the bokmål written standard.

The base lexicon is the NST Pronunciation Lexicon for Norwegian Bokmål from the Norwegian Language Bank's resource catalogue.

Requirements

Ensure you have Python >= 3.8 installed (download Python here)

Install lexupdater:

python -m pip install lexupdater-0.7.5-py3-none-any.whl

Download the SQLite database file nst_lexicon_bm.db:

wget -P data/input https://www.nb.no/sbfil/uttaleleksikon/nst_lexicon_bm.db

For more info on the database file, see here.

Generate pronunciation lexica

./generate.sh

If you get "Permission denied" when you run the script, give your user permission to execute it:

chmod 700 generate.sh

Output

The script writes 10 CSV files to data/output/{DIALECT}_pronunciation_lexicon.csv, where DIALECT is one of the following:

e_spoken
e_written
n_spoken
n_written
sw_spoken
sw_written
t_spoken
t_written
w_spoken
w_written
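
The ten names above are the five dialect areas combined with the two variants, so the expected output paths can be derived rather than typed out. The following is a minimal sketch for checking that a run produced every file; the function name `expected_output_paths` is made up for this example and is not part of lexupdater.

```python
from itertools import product
from pathlib import Path

# Dialect areas and variants as listed in this README
AREAS = ["e", "n", "sw", "t", "w"]
VARIANTS = ["spoken", "written"]


def expected_output_paths(output_dir="data/output"):
    """Return one expected CSV path per dialect (area + variant)."""
    return [
        Path(output_dir) / f"{area}_{variant}_pronunciation_lexicon.csv"
        for area, variant in product(AREAS, VARIANTS)
    ]


paths = expected_output_paths()
# Report which of the expected files are not present on disk
missing = [p for p in paths if not p.exists()]
print(f"{len(paths) - len(missing)} of {len(paths)} expected files found")
```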

The files contain these columns:

Column	Description
wordform	Wordform as it appears in bokmål text. Underscore (`_`) replaces whitespaces in multiword expressions
pos	Part-of-speech
feats	Morphological features
wordform_id	Wordform identifier. When the same wordform has several possible transcriptions, the wordform_id is repeated. When a wordform is repeated and represents different grammatical/lexical concepts, the wordform_id is different. E.g. the verb "jeg skriver" (I write) vs. the noun "en skriver" (a printer)
update_info	Reference to the data source.
nofabet_transcription	Transcription with the NoFAbet notation
ipa_transcription	Transcription with the IPA notation
sampa_transcription	Transcription with the X-SAMPA notation

The conversion from NoFAbet to the other transcription standards is done with code from Sprakbanken/convert_nofabet.
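
Because a repeated `wordform_id` marks alternative transcriptions of the same word, grouping rows on that column collects the variants. The sketch below assumes the output files are comma-separated with a header row that matches the column names above; the ids, tags and transcription strings in `SAMPLE` are placeholders, not real lexicon data.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows: every value except the wordform is invented for illustration
SAMPLE = """wordform,pos,feats,wordform_id,update_info,nofabet_transcription,ipa_transcription,sampa_transcription
skriver,POS_A,FEATS_A,ID_1,SOURCE,NOFABET_A,IPA_A,SAMPA_A
skriver,POS_A,FEATS_A,ID_1,SOURCE,NOFABET_B,IPA_B,SAMPA_B
skriver,POS_B,FEATS_B,ID_2,SOURCE,NOFABET_C,IPA_C,SAMPA_C
"""


def transcriptions_by_id(csv_file, column="nofabet_transcription"):
    """Map each wordform_id to the list of its transcriptions in `column`."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["wordform_id"]].append(row[column])
    return dict(grouped)


grouped = transcriptions_by_id(io.StringIO(SAMPLE))
for wordform_id, transcriptions in grouped.items():
    print(wordform_id, transcriptions)
```

To run this on a generated file, pass an open file handle instead of the `io.StringIO` sample, and adjust the delimiter if the files turn out to use a different one.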

Parameters

The script passes explicit arguments and flags to the lexupdater update command to configure how the lexicon is processed:

Parameter	Description
`-v, -vv`	Verbosity of the output log written to `stdout`. `-v` includes `INFO` messages, and `-vv` includes `DEBUG` messages. All logging messages are written to `data/output/log.txt` regardless of this flag.
`-db, --database`	File path to the SQLite database file with the NST lexicon.
`-n, --newwords-path`	File path to a csv file with new word records to add. Each new word (row) in the file gets a `wordform_id` with the prefix "NB" and a count number.
`-d, --dialects`	Categories of pronunciation variation that the transcriptions are updated for. The command writes 1 csv file for each argument given by this flag.
`-r, --rules-file`	Python file with ruleset `dict` objects. The dialectal variation is generated with regex patterns and replacement strings, as well as constraints.
`-e, --exemptions-file`	Python file with `dict`-objects indicating words that should be ignored by a given ruleset.

Files in `data/input`

These files have been developed by linguists in the Norwegian Language Bank 2021-2022.

Filename	Description
`newwords_2022.csv`	Each row contains the wordform (`token`), an East Norwegian `transcription`, up to 3 `alternative_transcription`s, the `pos`-tag and `morphology` features of the word. These words come from the Målfrid corpus or the Norwegian Newspaper Corpus Bokmål, which is indicated in the `update_info` field.
`rules_v1.py`	The Norwegian Language Bank has developed transformation rules for 5 Norwegian dialectal areas: East (`e`), South-West (`sw`), West (`w`), Trøndelag (`t`), North (`n`). There are 2 variants per dialect: `spoken` transcriptions are close to spontaneous speech in the given dialect, and `written` transcriptions are closer to the pronunciation of a bokmål manuscript being read out loud in the given dialect.
`exemptions_v1.py`	Specific words that should be ignored by rulesets defined in `rules_v1.py`. The ruleset `name` values map to the exemption `ruleset` values.
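
To illustrate how a ruleset and its exemptions interact, here is a minimal sketch. It is not lexupdater's implementation or file format: only the `name` and `ruleset` keys are mentioned in this README, while the `rules`, `pattern`, `replacement` and `words` keys, the function `apply_ruleset`, and the symbols in the transcription are hypothetical. Constraints are left out entirely.

```python
import re

# Hypothetical ruleset: only the `name` key is documented in this README
ruleset = {
    "name": "example_ruleset",
    "rules": [{"pattern": r"A B$", "replacement": "A C"}],
}

# Hypothetical exemption: only the `ruleset` key is documented in this README
exemptions = [{"ruleset": "example_ruleset", "words": ["exemptword"]}]


def apply_ruleset(wordform, transcription, ruleset, exemptions):
    """Apply each regex rule in turn unless the wordform is exempt from the ruleset."""
    exempt_words = {
        word
        for exemption in exemptions
        if exemption["ruleset"] == ruleset["name"]
        for word in exemption["words"]
    }
    if wordform in exempt_words:
        return transcription
    for rule in ruleset["rules"]:
        transcription = re.sub(rule["pattern"], rule["replacement"], transcription)
    return transcription


updated = apply_ruleset("someword", "X A B", ruleset, exemptions)
unchanged = apply_ruleset("exemptword", "X A B", ruleset, exemptions)
print(updated, "|", unchanged)
```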

Database file

The SQLite file has two tables, which can be joined with the unique_id field.

Table	Description
`words`	The index is `word_id`. Contains wordforms in bokmål (`wordform`), part-of-speech (`pos`), and morphological features (`feats`), as well as `unique_id`, and more.
`base`	The index is `pron_id`. Contains pronunciation transcriptions for East Norwegian (`nofabet`). A mapping between the X-SAMPA transcription standard and the NoFAbet notation can be found here. Values in `unique_id` map to the transcription's written wordform in `words`.
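
The join between the two tables can be sketched with Python's built-in `sqlite3` module. The example builds a toy in-memory database that has only the columns named above, with placeholder values; the real file contains more columns, and the column types used here are assumptions. The function name `fetch_transcriptions` is made up for this example.

```python
import sqlite3


def fetch_transcriptions(connection, wordform):
    """Return (wordform, pos, feats, nofabet) rows for one wordform."""
    query = """
        SELECT w.wordform, w.pos, w.feats, b.nofabet
        FROM words AS w
        JOIN base AS b ON b.unique_id = w.unique_id
        WHERE w.wordform = ?
        ORDER BY b.pron_id
    """
    return connection.execute(query, (wordform,)).fetchall()


# Toy database with placeholder values, standing in for nst_lexicon_bm.db
connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE words (
        word_id INTEGER PRIMARY KEY, wordform TEXT, pos TEXT,
        feats TEXT, unique_id TEXT
    );
    CREATE TABLE base (
        pron_id INTEGER PRIMARY KEY, nofabet TEXT, unique_id TEXT
    );
    INSERT INTO words VALUES (1, 'skriver', 'POS_A', 'FEATS_A', 'UID_1');
    INSERT INTO base VALUES (1, 'NOFABET_A', 'UID_1'), (2, 'NOFABET_B', 'UID_1');
    """
)

rows = fetch_transcriptions(connection, "skriver")
for row in rows:
    print(row)
```

To query the downloaded lexicon instead, open it with `sqlite3.connect("data/input/nst_lexicon_bm.db")` and pass that connection to the same function.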

Contact

If you have questions, suggestions or problems running the code, please create an issue.
