Change how dictionaries are defined and used #5

Closed
jzohrab opened this issue Jun 19, 2023 · 8 comments
Labels
enhancement New feature or request fixed Fixed in develop or master, to be launched.

Comments

@jzohrab
Collaborator

jzohrab commented Jun 19, 2023

Copying over notes from jzohrab/lute#21.

Currently, Lute stores "dictionary 1" and "dictionary 2" URLs in the Language table, with placeholders for term substitution. This creates a few limitations:

  • an odd "*" character is used to designate a pop-up dictionary
  • only HTML dictionaries are supported, with no easy way to handle JSON, plug-ins, or other types of dictionaries
  • each language is limited to 2 dictionaries

It may be worth changing dictionaries into first-class entities, e.g. with a brand new user form like this:

Field: notes
  • dictionary URL: textbox. The URL with "###" placeholders; better yet, change the placeholder to "[LUTETERM]" or similar, since "#" is a valid URL character (e.g., looking up "https://en.m.wiktionary.org/wiki/essere#Italian" would use a URL like "https://en.m.wiktionary.org/wiki/[LUTETERM]#Italian").
  • opens in pop-up?: checkbox
  • encoding: dropdown or textbox
  • returns: dropdown (html by default, or json; the reason for the "json" option is that some languages seem to only have dictionaries available via a JSON API)
  • active: checkbox. Some dictionaries will be more useful than others at times: e.g., when offline, any online dicts are useless, so I could deactivate the online dicts and only use an offline Kobo dict or whatever.
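
To illustrate the placeholder idea, here is a minimal Python sketch of the substitution step. The function name `build_lookup_url` and its `encoding` parameter are assumptions made for this example, not Lute's actual code; only the "[LUTETERM]" token and the Wiktionary URL come from the notes above.

```python
from urllib.parse import quote

PLACEHOLDER = "[LUTETERM]"  # token name as proposed above; not a confirmed Lute constant


def build_lookup_url(template: str, term: str, encoding: str = "utf-8") -> str:
    """Replace the placeholder with the percent-encoded term (hypothetical helper)."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template has no {PLACEHOLDER} placeholder")
    # safe="" also encodes "/" so a term can never alter the URL's path structure.
    return template.replace(PLACEHOLDER, quote(term, safe="", encoding=encoding))


print(build_lookup_url("https://en.m.wiktionary.org/wiki/[LUTETERM]#Italian", "essere"))
# prints https://en.m.wiktionary.org/wiki/essere#Italian
```

Because the token is not a single "#" character, the "#Italian" fragment survives untouched, which is the case the "###" placeholder cannot express cleanly.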

These would be stored in a new dictionaries table and linked to the Languages. A first-draft implementation could be a dedicated UI screen for defining dictionaries, which would be easiest. (It's possible to create child subforms, but I haven't done that yet in Symfony :-) )

One dict would have to be marked as primary. A language could define one or multiple dicts.
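
As a rough sketch of what "first-class" dictionaries might look like as data, assuming the form fields listed above map one-to-one onto attributes. All class, field, and method names here are hypothetical, and the example.org URL is a stand-in for some JSON API.

```python
from dataclasses import dataclass, field


@dataclass
class Dictionary:
    url: str                      # template containing the term placeholder
    opens_in_popup: bool = False
    encoding: str = "utf-8"
    returns: str = "html"         # "html" or "json"
    active: bool = True
    primary: bool = False


@dataclass
class Language:
    name: str
    dictionaries: list[Dictionary] = field(default_factory=list)

    def validate(self) -> None:
        """Enforce the rule that exactly one dictionary is marked primary."""
        primaries = [d for d in self.dictionaries if d.primary]
        if len(primaries) != 1:
            raise ValueError(f"{self.name}: expected 1 primary dictionary, found {len(primaries)}")

    def active_dictionaries(self) -> list[Dictionary]:
        """Return active dictionaries, with the primary one (if active) first."""
        active = [d for d in self.dictionaries if d.active]
        return sorted(active, key=lambda d: not d.primary)


italian = Language("Italian", [
    Dictionary(url="https://example.org/api?q=[LUTETERM]", returns="json"),
    Dictionary(url="https://en.m.wiktionary.org/wiki/[LUTETERM]#Italian", primary=True),
])
italian.validate()
print([d.url for d in italian.active_dictionaries()])
```

Keeping `active` separate from `primary` means an online dictionary can be switched off while offline without losing its definition.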

@jzohrab
Collaborator Author

jzohrab commented Jun 21, 2023

With this change, Lute could support things like Kobo dictionaries with a small amount of effort. From hinufo on Discord:

To get Kobo dicts: https://www.epubor.com/kobo-dictionary-download-and-install.html . The structure for the words within the dictionary is like this: <w><a name="WORD_NAME"> ...description...</w> ... And for variations it's something like this: <w><a name="WORD_NAME"> ...description...<variant="WORD_NAME1"><variant="WORD_NAME2"><variant="WORD_NAME3"></w>

A basic PHP parser:
[image: PHP parser code, not preserved in this text]

That code is a good start. For integration, Lute dict lookups need to have some kind of "type" field, so this would be a "kobo" type, vs an "online" type, and then Lute would exec the proper code. ... OR the "dict url" would need to handle a different "protocol" - http/s for online, "kobo" for lookup, etc. Anyway, it's a code change.
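
A minimal Python sketch of the "protocol" variant of that idea, assuming the entry structure is exactly as quoted above (real Kobo files may differ). The function names, the `kobo:italian` URL shape, and the tuple return value are all hypothetical choices for this example.

```python
import re
from urllib.parse import quote

# Patterns follow the entry structure quoted above; real Kobo files may differ.
ENTRY_RE = re.compile(r'<w><a name="([^"]*)">(.*?)</w>', re.DOTALL)
VARIANT_RE = re.compile(r'<variant="([^"]*)">')


def parse_kobo_entries(markup: str) -> dict[str, str]:
    """Map each headword and each of its variants to the entry description."""
    index: dict[str, str] = {}
    for headword, body in ENTRY_RE.findall(markup):
        description = VARIANT_RE.sub("", body).strip()
        index[headword] = description
        for variant in VARIANT_RE.findall(body):
            index.setdefault(variant, description)
    return index


def lookup(dict_url: str, term: str, kobo_index: dict[str, str]) -> tuple[str, str]:
    """Dispatch on the protocol part of the dictionary URL.

    Returns ("url", address_to_open) for online dictionaries and
    ("text", description) for Kobo lookups.
    """
    scheme = dict_url.split(":", 1)[0].lower()
    if scheme in ("http", "https"):
        return ("url", dict_url.replace("[LUTETERM]", quote(term, safe="")))
    if scheme == "kobo":
        return ("text", kobo_index.get(term, ""))
    raise ValueError(f"unsupported dictionary protocol: {scheme!r}")


sample = '<w><a name="essere"> to be<variant="sono"><variant="sei"></w><w><a name="cane"> dog</w>'
index = parse_kobo_entries(sample)
print(lookup("kobo:italian", "sono", index))
print(lookup("https://en.m.wiktionary.org/wiki/[LUTETERM]#Italian", "essere", index))
```

An explicit "type" field on the dictionary record would work just as well; the dispatch would then key on that field instead of the URL prefix.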

@jzohrab
Collaborator Author

jzohrab commented Jul 4, 2023

Another dictionary note, from MyCheze on Discord:

I'm currently learning Czech, which is a language with a TON of inflections. Every noun and verb has many, many forms. And adjectives follow gender and yadda yadda. I've also learned Spanish which is similar, but has lots of "smushing together" of words (dámelo). Because of this, looking up words can be sort of annoying. Most Czech-English dictionaries are difficult to use because they're not "super extensive" or the lemmatization sucks. For example, negation is a ne- prefix, but the lemmatizers don't always know that. I'm not technical enough to know why that is since it seems like the most obvious thing to check.

I want to build my own dictionary by going through content and saving the words and their inflections so that future Czech learners (and myself) can actually do lookups. But, obviously, that would be a lot of work. I'm happy to do some, but I'd really prefer to start with something. Vocabsieve has pretty good lemmatization (for a ton of languages) and uses Wiktionary for lookups. An integration would be really nice for filling out the Lute database. Maybe it can analyze the text and update some backend information and then let the user go through things. And if it could somehow use the "known words" database for its text analysis (which is already pretty cool).

jz note: "all points make sense. The lookups are the hardest part, and Lute is at the mercy of what dictionaries can offer. Perhaps what might make sense would be to have a separate process build a dictionary using spaCy (https://www.tutorialspoint.com/spacy/spacy_models_and_languages.htm) to generate inflected forms of words, and then have a "custom dictionary" that has a map of inflected forms to their root/lemma form. That's technically feasible as a separate process, outside of Lute or whatever tool. For it to be available within Lute, Lute would then need to support custom dictionaries, which is a good idea that's come up before (#6)."
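
To make the "map of inflected forms to their root/lemma form" concrete, here is a small Python sketch. It assumes the (form, lemma) pairs were produced beforehand by some separate process (for example a lemmatizer run over a corpus); the pairs below are hand-written Czech examples, and both function names are hypothetical.

```python
def build_lemma_map(pairs):
    """Build a lookup from lowercased inflected form to the set of possible lemmas."""
    lemma_map = {}
    for form, lemma in pairs:
        lemma_map.setdefault(form.lower(), set()).add(lemma)
    return lemma_map


def lemmas_for(term, lemma_map):
    """Return candidate lemmas for a term, falling back to the term itself."""
    return sorted(lemma_map.get(term.lower(), {term}))


# Hand-written sample data; a real map would be generated outside Lute.
pairs = [
    ("pes", "pes"), ("psa", "pes"), ("psem", "pes"),
    ("dělám", "dělat"), ("nedělám", "dělat"),  # the negated form shares the lemma
]
lemma_map = build_lemma_map(pairs)
print(lemmas_for("Nedělám", lemma_map))
print(lemmas_for("hrad", lemma_map))
```

Each form maps to a set because one surface form can belong to more than one lemma; the dictionary lookup would then be tried for each candidate.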

@jahnke

jahnke commented Aug 10, 2023

Another change I would suggest is an option to control whether the lookup uses the lowercased word or the same capitalization as in the original text. This makes a difference for a language like German. For example, the page https://de.m.wiktionary.org/wiki/Mimik has content and the page https://de.m.wiktionary.org/wiki/mimik is empty.
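
One way such a control could work, sketched in Python under the assumption that it is a per-dictionary boolean setting. The `preserve_case` flag and the function name are hypothetical, and the sketch presumes the original-case text is still available at lookup time.

```python
from urllib.parse import quote


def build_case_aware_url(template: str, term: str, preserve_case: bool) -> str:
    """Substitute the term into the URL, lowercasing it unless preserve_case is set."""
    lookup_term = term if preserve_case else term.lower()
    return template.replace("[LUTETERM]", quote(lookup_term, safe=""))


template = "https://de.m.wiktionary.org/wiki/[LUTETERM]"
print(build_case_aware_url(template, "Mimik", preserve_case=True))
# prints https://de.m.wiktionary.org/wiki/Mimik
print(build_case_aware_url(template, "Mimik", preserve_case=False))
# prints https://de.m.wiktionary.org/wiki/mimik
```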

@jzohrab
Collaborator Author

jzohrab commented Aug 10, 2023

Thank you @jahnke, great point. I haven't studied German with Lute yet but really want to. I believe that some other changes may be necessary -- I think Lute downcases things automatically when saving terms, which perhaps isn't a valid design choice. Will check.


EDIT:

  • yes, Lute downcases all terms, so "Tag" gets saved as "tag". Will need to investigate further on how saving the case as part of the term would affect things.
  • It appears that different dictionaries handle things differently. e.g., https://de.thefreedictionary.com/ looks up "mimik" and returns "Mimik". fyi only, I think the case still needs to be handled better.

@jzohrab jzohrab transferred this issue from jzohrab/lute Nov 11, 2023
@jzohrab
Collaborator Author

jzohrab commented Nov 18, 2023

Update: in late v2, and v3, the user can change the term case (but only the case).

@jzohrab jzohrab added the enhancement New feature or request label Nov 28, 2023
@jzohrab
Collaborator Author

jzohrab commented Feb 7, 2024

Picking this back up because it should be completed. There is WIP code, and webofpies is working on it.

@jzohrab jzohrab added the fixed Fixed in develop or master, to be launched. label Feb 15, 2024
@jzohrab
Collaborator Author

jzohrab commented Feb 15, 2024

Merged into develop. Note that this code change only addresses the original issue outline.

@jzohrab
Collaborator Author

jzohrab commented Feb 19, 2024

Launched in 3.2.1.

@jzohrab jzohrab closed this as completed Feb 19, 2024

2 participants