Diacritics restoration for Romanian

The full article can be found here.
Implementation was done using Keras with Tensorflow.
The core of the implementation is done in model_diacritice.py.
As a corpora we used parliamentary debates (see Corpora section).

Corpora

PAR		ROWIKI
Letters	Letters with diacritics	15M	6.39%	38M	4.09%
	Letters that accept/contain diacritics	84M	35.28%	296M	31.78%
	All letters	239M	100.00%	933M	100.00%
Words	Words with diacritics	13M	26.37%	33M	16.40%
	Words that accept/contain diacritics	35M	70.41%	118M	57.98%
	All words	50M	100.00%	204M	100.00%
Sentences		2.6M		22.0M
Unique words		0.21M		2.62M

Only PAR corpus was used in the end because of its higher quality.

Models

The architecutre of the models is briefly described in the picutre below. Several models were tried, using different branches of the architecture. See the Results section for more details, or you can read the full article

Results

Model	Char Embedding	Char LSTM	Hidden	Epochs	Dev char acc (%)	Test char acc (%)	Test word acc (%)
Chars	16	32	32	5	98.865	98.864	97.413
Chars	20	64	256	5	99.012	99.017	97.750
Chars (5 classes)	16	32	32	5	99.048	99.068	97.867
Chars	24	64	64	4	99.064	99.057	97.856
Chars + sentence	20	64	256	3	99.068	99.065	97.881
Chars + word	20	64	256	4	99.309	99.329	98.453
Chars + word + sentence	20	64	256	5	99.365	99.378	98.573
Chars + word + sentence	20	64	256, 128	5	99.380	99.366	98.553

Best models performance per letter:

Model	Letter	Precision (%)	Recall (%)	F-Score (%)
All-256-128	“a”	99.16	98.86	99.01
	“ă”	96.29	97.31	96.80
	“â”	99.17	98.80	98.99
	“i”	99.97	99.96	99.97
	“î”	99.65	99.72	99.69
	“s”	99.84	99.84	99.84
	“ș”	99.44	99.43	99.43
	“t”	99.84	99.77	99.80
	“ț”	98.97	99.29	99.13

Run the model for restoration

Example usage:

    python3 model_diacritice.py -no_fast -load saved_models_diacritice/chars24-64 -no_word -no_sent -classes 4 -no_dep -no_tag -restore input.txt

The model will read text from input.txt (utf-8) and restore the diacritics in tmp_res.txt file. Only the pre-trained models that use characters are available on this git.

Name		Name	Last commit message	Last commit date
Latest commit History 8,816 Commits
article		article
core		core
definitions		definitions
imgs		imgs
models/model3		models/model3
saved_models_diacritice		saved_models_diacritice
spacy-src		spacy-src
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
baseline_model.py		baseline_model.py
docx_restore.py		docx_restore.py
model_diacritice.py		model_diacritice.py
model_diacritice_count.py		model_diacritice_count.py
server.py		server.py
train-model.py		train-model.py
vectors_tensorboard.py		vectors_tensorboard.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Diacritics restoration for Romanian

Corpora

Models

Results

Run the model for restoration

About

Releases

Packages

Contributors 234

Languages

License

teodor-cotet/DiacriticsRestoration

Folders and files

Latest commit

History

Repository files navigation

Diacritics restoration for Romanian

Corpora

Models

Results

Run the model for restoration

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 234

Languages

Packages