Fix dictionary index case-sensitivity inconsistencies #121

welps · 2020-06-12T22:55:02Z

Hi,

I loaded the gcide dictionary into plato and noticed that case-insensitive search was not working, because the gcide index does not case-fold its headwords. Looking at some other dictionaries, this seems to be handled inconsistently.

So this PR provides the following fixes (and tests):

- Handling of dictionary index parsing from three possible states:
  - Full index parse
  - Lazy index parse (metadata only)
  - Resuming from lazy index parse
- Case folding (accounting for non-Latin characters) for the dictionary-side query and for the index as it is created within plato
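
As a rough sketch of the case-folding item (not the PR's actual code), the idea is to fold each headword once when the index is built and to fold the query the same way at lookup time. The names `Entry`, `fold`, `build_index` and `lookup` are hypothetical, and `str::to_lowercase` from the standard library stands in for a proper Unicode case-folding function to keep the example dependency-free.

```rust
use std::collections::BTreeMap;

/// Location of a definition in the data file (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Entry {
    pub offset: u64,
    pub size: u64,
}

/// Assumption: lowercasing approximates Unicode case folding well enough
/// for illustration. It also handles non-Latin scripts that have case.
pub fn fold(word: &str) -> String {
    word.to_lowercase()
}

/// Build a lookup table keyed by the folded headword. Headwords that differ
/// only by case end up under the same key.
pub fn build_index(entries: &[(&str, Entry)]) -> BTreeMap<String, Vec<Entry>> {
    let mut index: BTreeMap<String, Vec<Entry>> = BTreeMap::new();
    for (headword, entry) in entries {
        index.entry(fold(headword)).or_default().push(*entry);
    }
    index
}

/// Fold the query with the same function before searching.
pub fn lookup<'a>(index: &'a BTreeMap<String, Vec<Entry>>, query: &str) -> &'a [Entry] {
    index.get(&fold(query)).map(Vec::as_slice).unwrap_or(&[])
}
```

With this shape, a query such as `abaddon` finds an entry stored as `Abaddon`, regardless of whether the index on disk was lowercased.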

Tested via the emulator and on my Forma.

baskerville · 2020-06-13T08:21:29Z

I would rather not add workarounds for invalid dictionaries. I have already mentioned this problem. Where did you download the dictionary from?

welps · 2020-06-13T10:56:12Z

I downloaded mine from the Arch User Repository, but I confirmed that Debian's official package repository compiles the index with the same inconsistent case.

I see that your post cites the same dictionary. I used it because it seems to be the largest English dictionary available; there really don't seem to be many options.

dict handles the "invalid" gcide properly. Looking at the source code (I'm not a C programmer, so I may be wrong), its solution seems to be to maintain a separate lowercase index, which is similar to what I do.

If that's really the case, no pun intended, I would argue that it shouldn't be considered a workaround, just standard data munging.

welps · 2020-06-22T14:34:16Z

@baskerville Is there anything I can do to address any concerns you may have?

If the primary dictd implementation is doing the same thing with case-insensitive dictionaries (see last comment if you missed it), why shouldn't we?

If there are concerns with the code itself, let me know and I can work to address them. Thanks.

baskerville · 2020-06-22T17:32:30Z

I don't know why dictd is doing this: the dictionary you mentioned cannot be produced by dictfmt.

When dictfmt generates the index, it applies the necessary character transformations to the headwords.

You could prevent the headwords from being lowercased with --case-sensitive (in which case, the corresponding special entry will appear within the index), but why would you want that?

welps · 2020-06-22T19:05:17Z

This is conjecture, but I believe gcide predates dictfmt and it was simply never updated to go through dictfmt. It seems dictfmt was introduced in 2002. gcide is at least 20 years old.

Case-sensitive handling was introduced in 2007, and the default behavior was to lowercase the index, which is probably why it was never addressed within gcide.

I understand your reluctance, but gcide seems to be the most popular dictionary alongside WordNet. If we think about how English users actually go about acquiring a dictionary for plato, which I think is already a bit difficult, they're going to end up seeing some variation of https://packages.debian.org/buster/dict-wn and https://packages.debian.org/buster/dict-gcide. Only one of those explicitly says it's a dictionary.

baskerville · 2020-06-23T07:40:18Z

Being aware of the dictionary situation, I created my own version of WordNet 3.1 so that there would be at least one good English dictionary that works with Plato.

welps · 2020-06-26T01:34:35Z

@baskerville

There would be another good English dictionary that works with Plato if this code were merged, though. It works with the official dictd implementation, as we've discussed, so why shouldn't it work here?

Why is the solution to produce and maintain another bespoke dictionary instead of leveraging what has existed for twenty+ years?

baskerville · 2020-06-26T08:08:53Z

Don't get me wrong: I'm acknowledging the weird backward compatibility problem.

I'm just looking for the most straightforward approach to solving this problem.

Fortunately, the dictionaries generated with dictfmt will have a 00-database-dictfmt-VERSION entry (unless --without-ver is passed!).

Have you found other dictd dictionaries, besides GCIDE, that aren't generated by dictfmt?
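
To illustrate the marker-based idea, here is a minimal sketch that scans the index headwords for the `00-database-dictfmt-VERSION` and `00-database-case-sensitive` entries named in this thread. The type `IndexTraits`, the functions `inspect_headwords` and `needs_folding`, and the policy in `needs_folding` are hypothetical, not plato's implementation. The sketch also assumes that some indexes store these special headwords with punctuation stripped (for example `00databasedictfmt1121`), so it compares alphanumeric characters only.

```rust
/// What the special `00-database-*` headwords say about an index
/// (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct IndexTraits {
    pub generated_by_dictfmt: bool,
    pub case_sensitive: bool,
}

/// Keep only alphanumeric characters so that `00-database-dictfmt-1.12.1`
/// and `00databasedictfmt1121` compare the same way (an assumption about
/// how the special entries may be spelled in the index).
fn strip_punctuation(word: &str) -> String {
    word.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Scan the headwords for the two marker entries.
pub fn inspect_headwords<'a, I>(headwords: I) -> IndexTraits
where
    I: IntoIterator<Item = &'a str>,
{
    let mut traits = IndexTraits {
        generated_by_dictfmt: false,
        case_sensitive: false,
    };
    for word in headwords {
        let key = strip_punctuation(word);
        if key.starts_with("00databasedictfmt") {
            traits.generated_by_dictfmt = true;
        } else if key == "00databasecasesensitive" {
            traits.case_sensitive = true;
        }
    }
    traits
}

/// Hypothetical policy: fold headwords ourselves only when the index was
/// not produced by dictfmt and does not declare itself case-sensitive.
pub fn needs_folding(traits: &IndexTraits) -> bool {
    !traits.generated_by_dictfmt && !traits.case_sensitive
}
```

One caveat of such a check, already noted above, is that `--without-ver` removes the version entry, so a dictfmt-generated index could still be misclassified.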

welps · 2020-06-26T14:20:08Z

Yes, the dict-moby-thesaurus package does not conform either.

It maintains case for proper nouns, but does not declare 00-database-case-sensitive. See below for a snippet.

I did not realize you are also involved in https://github.com/freedict/libdict, so I understand that your concerns about this may extend further than I was aware of. I really believe the most straightforward solution is to case fold both the query and the index headword.

Related to this, I discovered that the default case folding function in rust-caseless does not perform any normalization. The canonical caseless matching strategy recommended in Section 3.13 of the Unicode Standard requires NFD normalization before and after case folding of a given word, while noting that NFD normalization after case folding is sufficient to handle most cases. However, Rust's unicode-normalization crate was recently shown to be 2-25x slower than ICU, so maybe we don't want to deal with this at all for now.
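
A small demonstration of the gap described here, using only the standard library: `str::to_lowercase` stands in for a case-folding function (an assumption, it is not the rust-caseless API), and the helper names are hypothetical. Folding alone matches strings that differ by case, but not strings that are canonically equivalent yet encoded differently, which is what NFD normalization would address.

```rust
/// Stand-in for case folding without any normalization (assumption:
/// lowercasing is close enough for this demonstration).
pub fn fold(word: &str) -> String {
    word.to_lowercase()
}

/// Caseless comparison that skips normalization entirely.
pub fn matches_caseless(a: &str, b: &str) -> bool {
    fold(a) == fold(b)
}

/// "café" with a precomposed U+00E9.
pub const PRECOMPOSED: &str = "caf\u{e9}";

/// "café" as "e" followed by the combining acute accent U+0301.
pub const DECOMPOSED: &str = "cafe\u{301}";
```

Here `matches_caseless("CAF\u{c9}", PRECOMPOSED)` is true, while `matches_caseless(PRECOMPOSED, DECOMPOSED)` is false even though both render as the same word. Whether that matters in practice depends on how the dictionary and the reader's text encode accented characters.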

$ head -100 /usr/share/dictd/moby-thesaurus.index

00-database-info        CV      Jj
00-database-short       BZ      8
00-database-url A       BZ
a cappella      L4      Ho
a la mode       Tg      Hs
a priori        bM      GC
ab ovo  mp      JC
Abaddon vr      G3
abandon 2i      BB2
abandoned       B4Y     1B
abase   CtZ     Fg
abasement       Cy5     HM
abash   C6F     Hh
abashed DBm     LS
abate   DM4     ih
abatement       DvZ     Vq
abatis  EFD     Ki
abbe    EPl     GA
abbess  EVl     Fm
abbot   EbL     FE
abbreviate      EgP     OF
abbreviated     EuU     Js
abbreviation    E4A     Sm
abdicate        FKm     I4
abdication      FTe     Fp
abdomen FZH     N1
abdominal       Fm8     Eg
abduction       Frc     Fw
abecedarian     FxM     Wq
abecedary       GH2     GB
aberrant        GN3     YP
aberration      GmG     ql
abet    HQr     NS
abettor Hd9     NI
abeyance        HrF     Mo
abhor   H3t     EC
abhorrent       H7v     LQ
abide   IPa     eS
abide by        IG/     Ib
abiding Its     Py
ability I9e     YO
abject  JVs     Vr
abjection       JrX     Fm
abjuration      Jw9     UW
abjure  KFT     WX
ablate  Kbq     RM
ablation        Ks2     Rv
ablaze  K+l     RB
able    LPm     Ob
ablution        LeB     GC
ably    LkD     HT
abnegation      LrW     TF
abnormal        L+b     ca
abnormality     Ma1     vy
aboard  NKn     Gc
abode   NRD     J0
abolish Na3     Jk
abolition       Nkb     L+
A-bomb  hO      Fb
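
For readers unfamiliar with the index layout shown in this listing: each line holds a headword, an offset, and a length, with the two numbers written in dictd's base64 digit notation. The following sketch decodes such a line. It assumes the fields are tab-separated in the actual file (the listing above shows them as spaces), and the function names `decode_number` and `parse_line` are hypothetical.

```rust
/// Base64 digit alphabet, most significant digit first.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Decode a base64-digit number such as `L4` into an integer.
/// Returns `None` for empty input, unknown digits, or overflow.
pub fn decode_number(text: &str) -> Option<u64> {
    if text.is_empty() {
        return None;
    }
    let mut value: u64 = 0;
    for byte in text.bytes() {
        let digit = ALPHABET.iter().position(|&c| c == byte)? as u64;
        value = value.checked_mul(64)?.checked_add(digit)?;
    }
    Some(value)
}

/// Split one index line into (headword, offset, size).
/// Headwords may contain spaces, so only tabs are treated as separators.
pub fn parse_line(line: &str) -> Option<(&str, u64, u64)> {
    let mut fields = line.split('\t');
    let headword = fields.next()?;
    let offset = decode_number(fields.next()?)?;
    let size = decode_number(fields.next()?)?;
    Some((headword, offset, size))
}
```

For example, the `a cappella` line decodes to offset 760 and size 488, and 760 + 488 = 1248 is the offset `Tg` of the next entry, `a la mode`. The headword itself is returned unchanged, so `Abaddon` and `A-bomb` keep their capitals, which is the inconsistency under discussion.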

Commit 22d362d: Fix dictionary index case-sensitivity inconsistencies

welps force-pushed the master branch from 4cd7136 to 22d362d · June 18, 2020 16:04

baskerville closed this in 27570e6 Jul 6, 2020
