Speller suggestion issue #3

Open
snomos opened this issue May 3, 2024 · 8 comments

Comments

@snomos
Member

snomos commented May 3, 2024

@Trondtr has reported:

echo а̄им | hfst-ospell -S tools/spellcheckers/mns.zhfst
"а̄им" is NOT in the lexicon:
Corrections for "а̄им":
вим    20.056900
аим    20.787788
йим    21.743298
мим    22.254124
сым    22.524096
оим    22.659590

Now compare this with the following:

grep а tools/spellcheckers/strings.default.txt
а:а̄    1
#а̄:я   4
а̄:а    1
ся:ща   2
Ся:Ща   2

Why is аим only suggested second, even though the а̄:а rule above has weight 1?

The core of the issue is that а̄ consists of a base character plus a combining macron: how well or badly does the error model handle combining diacritics when it comes to suggestions?
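To make the "base char + combining macron" point concrete, here is a minimal Python sketch (not part of the speller toolchain) that lists the code points of the test word and checks what NFC normalisation does with them. It assumes the input uses U+0430 followed by U+0304, as the hfst-lookup output further down suggests.

```python
import unicodedata

# Cyrillic а + U+0304 COMBINING MACRON, followed by и and м
word = "а\u0304им"

# List every code point with its Unicode name.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFC leaves а + macron as two code points: Unicode has no precomposed letter.
nfc_a = unicodedata.normalize("NFC", "а\u0304")
# и + macron, by contrast, composes to the single code point ӣ (U+04E3).
nfc_i = unicodedata.normalize("NFC", "и\u0304")

print(len(word), len(nfc_a), len(nfc_i))
```

So the word that looks like three letters is four code points, and normalisation alone cannot turn а̄ into a single symbol.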

@snomos
Member Author

snomos commented May 3, 2024

The interesting thing happens when you use divvunspell instead of hfst-ospell:

echo а̄им | divvunspell suggest -a tools/spellcheckers/mns.zhfst
Reading from stdin...
Input: а̄им		[INCORRECT]
аим		17.787788
а̄тим		24.199446
а̄гим		25.03525
аким		26.155512
ам		27.999014
агим		28.65959
атым		32.826374
аюм		34.254124
аи		43.7433
аис		43.966442

For comparison, hfst-ospell-office gives the same output as hfst-ospell:

echo '5	а̄им' | hfst-ospell-office -d tools/spellcheckers/mns.zhfst
@@ Loading tools/spellcheckers/mns.zhfst with args max-weight=-1.00, beam=-1.00, time-cutoff=6.00
@@ hfst-ospell-office is alive
&	вим (20.06;0)	аим (20.79;0)	йим (21.74;0)	мим (22.25;0)	сым (22.52;0)

@snomos
Member Author

snomos commented May 3, 2024

@flammie do you have comments or insights re combining diacritics and hfst-ospell?

In any case: I am not sure how much time we should spend on this, given that it works correctly with divvunspell, and divvunspell is used everywhere except in the grammar checker (and from the summer also in the grammar checker). That is, hfst-ospell is not the first priority.

@flammie
Contributor

flammie commented May 3, 2024

Yeah, it already seems quite fragile at the compilation of the separate error model part:

$ echo 'а̄' | hfst-lookup .generated/strings.all.default.hfst 
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> а̄	а̄̄	1,000000
а̄	а	2,000000
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep а
0	0	а	а	0.000000
1	21	а	а	0.000000
2	43	я	а	0.000000
41	44	я	а	0.000000
58	58	а	а	0.000000
$  hfst-fst2txt .generated/strings.all.default.hfst | fgrep а̄
$  hfst-fst2txt .generated/strings.all.default.hfst | fgrep $'\u0304'
0	0			0.000000
7	8			0.000000
13	14	@0@		0.000000
13	27		@0@	0.000000
15	16	@0@		0.000000
15	28		@0@	0.000000
17	18	@0@		0.000000
17	29		@0@	0.000000
19	20	@0@		0.000000
19	30		@0@	0.000000
21	22	@0@		0.000000
21	31		@0@	0.000000
23	24	@0@		0.000000
23	32		@0@	0.000000
25	26	@0@		0.000000
25	33		@0@	0.000000
58	58			0.000000
$  hfst-fst2txt .generated/strings.all.default.hfst
0	0	@_IDENTITY_SYMBOL_@	@_IDENTITY_SYMBOL_@	0.000000
0	0			0.000000
0	0	С	С	0.000000
0	0	Щ	Щ	0.000000
0	0	а	а	0.000000
0	0	г	г	0.000000
0	0	е	е	0.000000
0	0	и	и	0.000000
0	0	й	й	0.000000
0	0	к	к	0.000000
0	0	н	н	0.000000
0	0	о	о	0.000000
0	0	р	р	0.000000
0	0	с	с	0.000000
0	0	т	т	0.000000
0	0	у	у	0.000000
0	0	щ	щ	0.000000
0	0	ы	ы	0.000000
0	0	ь	ь	0.000000
0	0	э	э	0.000000
0	0	ю	ю	0.000000
0	0	я	я	0.000000
0	0	ӈ	ӈ	0.000000
0	0	ӣ	ӣ	0.000000
0	0	ӯ	ӯ	0.000000
0	1	@0@	@0@	0.000000
1	2	с	щ	0.000000
1	4	т	к	0.000000
1	9	т	т	0.000000
1	13	я	я	0.000000
1	15	э	э	0.000000
1	17	ы	ы	0.000000
1	19	ю	ю	0.000000
1	21	а	а	0.000000
1	23	е	е	0.000000
1	25	о	о	0.000000
1	34	н	ӈ	0.000000
1	36	ӈ	н	0.000000
1	38	г	ӈ	0.000000
1	41	С	Щ	0.000000
1	45	н	н	0.000000
1	48	ӯ	ӯ	0.000000
1	51	у	у	0.000000
2	3	ь	@0@	0.000000
2	40	ю	у	0.000000
2	43	я	а	0.000000
3	58	@0@	@0@	1.000000
4	5	и	и	0.000000
4	6	ӣ	ӣ	0.000000
4	7	е	е	0.000000
5	58	@0@	@0@	2.000000
6	58	@0@	@0@	2.000000
7	8			0.000000
7	58	@0@	@0@	2.000000
8	58	@0@	@0@	2.000000
9	10	т	к	0.000000
10	11	е	е	0.000000
10	12	и	и	0.000000
11	58	@0@	@0@	2.000000
12	58	@0@	@0@	2.000000
13	14	@0@		0.000000
13	27		@0@	0.000000
14	58	@0@	@0@	1.000000
15	16	@0@		0.000000
15	28		@0@	0.000000
16	58	@0@	@0@	1.000000
17	18	@0@		0.000000
17	29		@0@	0.000000
18	58	@0@	@0@	1.000000
19	20	@0@		0.000000
19	30		@0@	0.000000
20	58	@0@	@0@	1.000000
21	22	@0@		0.000000
21	31		@0@	0.000000
22	58	@0@	@0@	1.000000
23	24	@0@		0.000000
23	32		@0@	0.000000
24	58	@0@	@0@	1.000000
25	26	@0@		0.000000
25	33		@0@	0.000000
26	58	@0@	@0@	1.000000
27	58	@0@	@0@	2.000000
28	58	@0@	@0@	2.000000
29	58	@0@	@0@	2.000000
30	58	@0@	@0@	2.000000
31	58	@0@	@0@	2.000000
32	58	@0@	@0@	2.000000
33	58	@0@	@0@	2.000000
34	35	г	@0@	0.000000
35	58	@0@	@0@	2.000000
36	37	@0@	г	0.000000
37	58	@0@	@0@	2.000000
38	39	н	н	0.000000
39	58	@0@	@0@	3.000000
40	58	@0@	@0@	2.000000
41	42	ю	у	0.000000
41	44	я	а	0.000000
42	58	@0@	@0@	2.000000
43	58	@0@	@0@	2.000000
44	58	@0@	@0@	2.000000
45	46	т	р	0.000000
46	47	р	@0@	0.000000
47	58	@0@	@0@	4.000000
48	49	й	и	0.000000
48	54	й	ы	0.000000
49	50	и	@0@	0.000000
50	58	@0@	@0@	4.000000
51	52	й	и	0.000000
51	56	й	ы	0.000000
52	53	и	@0@	0.000000
53	58	@0@	@0@	4.000000
54	55	ы	@0@	0.000000
55	58	@0@	@0@	4.000000
56	57	ы	@0@	0.000000
57	58	@0@	@0@	4.000000
58	58	@_IDENTITY_SYMBOL_@	@_IDENTITY_SYMBOL_@	0.000000
58	58			0.000000
58	58	С	С	0.000000
58	58	Щ	Щ	0.000000
58	58	а	а	0.000000
58	58	г	г	0.000000
58	58	е	е	0.000000
58	58	и	и	0.000000
58	58	й	й	0.000000
58	58	к	к	0.000000
58	58	н	н	0.000000
58	58	о	о	0.000000
58	58	р	р	0.000000
58	58	с	с	0.000000
58	58	т	т	0.000000
58	58	у	у	0.000000
58	58	щ	щ	0.000000
58	58	ы	ы	0.000000
58	58	ь	ь	0.000000
58	58	э	э	0.000000
58	58	ю	ю	0.000000
58	58	я	я	0.000000
58	58	ӈ	ӈ	0.000000
58	58	ӣ	ӣ	0.000000
58	58	ӯ	ӯ	0.000000
58	0.000000

@snomos
Member Author

snomos commented May 3, 2024

Do I read the above correctly, @flammie, when I find no occurrences of а̄ in the ATT version of strings.all.default.hfst? So the base char + combining diacritic sequence is lost as a unit during compilation?

If so, how can we force such a sequence to be treated as one symbol, in all contexts? The lexical FST does treat them as one symbol (as opposed to the tokeniser FST, which does the opposite on the input side).

I assume the question relates to all .txt input files for the error model.

@flammie
Contributor

flammie commented May 3, 2024

Mm, strings compilation uses hfst-strings2fst without any alphabet or multichar symbol definitions, so it must treat each combining character as its own arc in the graph. I guess that leads to a situation where the suggestion from а̄им to вим is weighted as а:в, while the one to аим is weighted as combining macron:0 instead. Maybe applying NFC/NFD filters in the error models could work, if this is actually the main issue.
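A rough Python sketch of that effect, under stated assumptions: `symbols` and `edits` are hypothetical helpers, and the unit-cost Levenshtein alignment is only a stand-in for the weighted search in hfst-ospell. It shows how the edit from а̄им to аим changes depending on whether the macron is its own symbol or attached to its base letter.

```python
import unicodedata

def symbols(text, keep_clusters):
    """Split text into symbols: one per code point, or with combining
    marks attached to the preceding base letter if keep_clusters is True."""
    out = []
    for ch in text:
        if keep_clusters and out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def edits(src, dst):
    """Return the non-identity (input, output) pairs of one minimal
    unit-cost alignment; '' stands for the empty symbol."""
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (src[i - 1] != dst[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    # Walk back from the end to recover the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]):
            if src[i - 1] != dst[j - 1]:
                ops.append((src[i - 1], dst[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((src[i - 1], ""))
            i -= 1
        else:
            ops.append(("", dst[j - 1]))
            j -= 1
    return ops[::-1]

word = "а\u0304им"
per_codepoint = edits(symbols(word, False), list("аим"))
per_cluster = edits(symbols(word, True), list("аим"))
print(per_codepoint)  # a deletion of the bare combining macron
print(per_cluster)    # a single а̄:а substitution
```

With one symbol per code point, the correction is a deletion of the macron, so a rule written as а̄:а only matches if the compiled error model happens to encode it as that two-arc path. With the cluster kept together, the correction is exactly the а̄:а pair from strings.default.txt.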

@snomos
Member Author

snomos commented May 3, 2024

mm, that might be a good idea. I will have a look.

@snomos
Member Author

snomos commented May 3, 2024

giellalt/giella-core@c73d62a fixes a bug that prevented the spellrestrict.* files from being built. But that is not enough: the generated spellrestrict.regex file does not include the relevant letters.

@snomos
Member Author

snomos commented May 3, 2024

Ah - bummer on my part. The spellrestrict.* files will not solve this, because they assume the existence of an NFC form. In this case the problem is that there IS NO NFC form of the relevant letters, but we still want the FST to treat them as one symbol, i.e. no arc between the base letter and the combining diacritics.

So we need a new type of filter that finds all combining diacritics and the corresponding base letter(s), and then generates a filter of the following type:

{а̄} -> "а̄";

and then applies this filter to all error model files being read by hfst-strings2fst on both sides. That should hopefully fix the issue, and make it predictable to work with combining diacritics in the speller.
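A minimal sketch of how such rules could be generated, assuming Python and `unicodedata`. The helper names `clusters` and `filter_rules` are hypothetical, the rule syntax is copied from the single example above, and any extra escaping the regex compiler might need is not handled.

```python
import unicodedata

def clusters(text):
    """Yield each base character together with its trailing combining marks."""
    current = ""
    for ch in text:
        if current and unicodedata.combining(ch):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current

def filter_rules(text):
    """Build one rule per base+diacritic sequence that NFC cannot compose."""
    found = set()
    for cluster in clusters(text):
        composed = unicodedata.normalize("NFC", cluster)
        # Still longer than one code point: no precomposed form exists.
        if len(composed) > 1:
            found.add(composed)
    return [f'{{{c}}} -> "{c}";' for c in sorted(found)]

# Sample lines in the style of strings.default.txt, plus и + macron,
# which NFC composes to ӣ and which therefore needs no rule.
sample = "а:а\u0304\t1\n#а\u0304:я\t4\nа\u0304:а\t1\nся:ща\t2\nи\u0304:и\t1\n"
rules = filter_rules(sample)
for rule in rules:
    print(rule)
```

Run over the error model source files (or the lexicon alphabet), this would give the list of sequences that need to be declared as single symbols; how the resulting filter is then composed into the build is left open here.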

@flammie feel free to continue this work 🙂

@snomos snomos added the bug Something isn't working label May 3, 2024