LTeX doesn't recognize macron #269

Closed
intractabilis opened this issue Mar 13, 2021 · 10 comments
Assignees
Labels
1-bug 🐛 Issue type: Bug report (something isn't working as expected) 2-confirmed Issue status: Confirmed, reproducible bug in LTeX 3-fixed Issue resolution: Issue has been fixed on the develop branch
Milestone

Comments

@intractabilis

Usually, if I have a word unfamiliar to LTeX, it underlines it, and in the tooltip I can choose to add the word to the dictionary. If the word has letters with a grave or acute accent, or with a circumflex, LTeX properly adds all diacritics to the dictionary. However, it skips letters with macrons completely.

Take a look at the following MWE:

\documentclass[11pt, paper = B5]{scrbook}
\usepackage{fontspec}
\setmainfont{Libertinus Serif}
\begin{document}
Nih\'o\ng\=go
\end{document}

It's a single word "Nihóŋḡo". Here is what the tooltip looks like:

image

You can see it shows "Nihóŋo" instead of "Nihóŋḡo". "Nihóŋo" is also what gets added to the dictionary.

  • Operating system: Linux (linux), x64, 5.10.19-1-MANJARO
  • VS Code: 1.53.2
  • vscode-ltex: 9.0.0
  • ltex-ls: 10.0.0
  • Java: 15.0.2
@intractabilis intractabilis added 1-bug 🐛 Issue type: Bug report (something isn't working as expected) 2-unconfirmed Issue status: Bug that needs to be reproduced (all new bugs have this label) labels Mar 13, 2021
@valentjn valentjn added 2-confirmed Issue status: Confirmed, reproducible bug in LTeX and removed 2-unconfirmed Issue status: Bug that needs to be reproduced (all new bugs have this label) labels Mar 21, 2021
@valentjn valentjn self-assigned this Mar 21, 2021
@valentjn
Owner

valentjn commented Mar 21, 2021

Thanks for the report. LTeX needs to manually map each accented letter to the correct Unicode code point. As g isn't in that mapping, it is skipped. Falling back to the unaccented letter instead of skipping it altogether would probably already be a first step, even though the result would still be incorrect.

Do you just need support for g? Or do you also need other letters? If yes, which ones?

@valentjn valentjn added the 2-needs-info Issue status: We need more information (usually) from the submitter before continuing label Mar 21, 2021
@intractabilis
Author

intractabilis commented Mar 21, 2021

I am typesetting a text with a romanization of Japanese pronunciation. Acute, grave, and circumflex accents are used in that romanization to indicate pitch variation and can appear above any letter, even a consonant. For example:

image

Ideally, I could add itte and kara without diacritics to the LTeX dictionary, and LTeX would then accept them in the text regardless of the diacritics.

The macron is used only in n̄ and ḡ, to indicate nasal pronunciation. But n̄ can also carry an acute above the macron:

image

In this case it would be great to have ben̄kyoo in the dictionary.

@valentjn
Owner

Ah, okay. Then we'll have to use Unicode's combining diacritical marks: there is a precomposed Unicode character for g with macron, but of course not for every letter, and certainly not for two or more diacritics at the same time. I can add support for this, but LanguageTool has to support it as well, which it probably does. As a workaround, if your LaTeX engine supports Unicode (e.g., LuaLaTeX), you should already be able to use Unicode instead of \', \=, etc. in your documents.

I think we can't just ignore the diacritics, otherwise we'd have many false negatives for all the languages which use them (languages in which omitting them would be a spelling error). So in your last example, it would add ben̄́kyoo or ben̄́kyoo-site to the dictionary (not just ben̄kyoo), depending on what LanguageTool underlines.
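The difference between precomposed characters and combining diacritical marks can be checked directly. The following is only an illustrative sketch in Python using the standard `unicodedata` module; it is not LTeX code, and the names `compose`, `MACRON`, and `ACUTE` are chosen here for the example:

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON
ACUTE = "\u0301"   # COMBINING ACUTE ACCENT

def compose(text):
    """Apply Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)

# g + macron has a precomposed code point (U+1E21), so NFC yields one character.
g_macron = compose("g" + MACRON)

# n + macron has no precomposed code point, so the combining mark remains.
n_macron = compose("n" + MACRON)

# Two stacked diacritics on n: nothing can be composed, three code points remain.
n_macron_acute = compose("n" + MACRON + ACUTE)

for s in (g_macron, n_macron, n_macron_acute):
    print(s, [f"U+{ord(c):04X}" for c in s])
```

A dictionary entry such as ben̄́kyoo would therefore have to be stored as a sequence that contains combining marks, since no single code point exists for n̄́.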

@valentjn valentjn removed the 2-needs-info Issue status: We need more information (usually) from the submitter before continuing label Mar 26, 2021
@intractabilis
Author

As a workaround, if your LaTeX engine supports Unicode (e.g., LuaLaTeX), you should already be able to use Unicode instead of \', \=, etc. in your documents.

That doesn't make any difference; ḡ is still ignored:

image

I think we can't just ignore the diacritics, otherwise we'd have many false negatives for all the languages which use them (languages in which omitting them would be a spelling error). So in your last example, it would add ben̄́kyoo or ben̄́kyoo-site to the dictionary (not just ben̄kyoo), depending on what LanguageTool underlines.

That's absolutely fine. As long as it doesn't skip letters with diacritics (like ḡ above), it's all good. Thank you!

@valentjn
Owner

valentjn commented Mar 27, 2021

Edit: I noticed that Unicode places the combining diacritical marks after the letter, not before like in this comment. The point is still valid, though.

That doesn't make any sense: there's no logic in LTeX that would remove this, in contrast to \=. It's working for me for both single-character Unicode (U+1E21: ḡ) and Unicode with combining diacritical marks on ASCII characters (U+0304 U+0067: ḡ):

image

Hex dump of the file:

00000000  4e 69 68 c3 b3 c5 8b e1  b8 a1 6f 0a 0a 4e 69 68  |Nih.......o..Nih|
00000010  6f cc 81 c5 8b 67 cc 84  6f 0a                    |o....g..o.|
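To make the dump easier to read, the bytes can be decoded into code points. This is a small sketch in Python (used here only for illustration); the byte strings are copied from the dump above with the newline bytes (0a) left out, and the variable and function names are chosen for this example:

```python
import unicodedata

# First line of the file: precomposed characters.
line1 = bytes.fromhex("4e 69 68 c3 b3 c5 8b e1 b8 a1 6f").decode("utf-8")
# Second line of the file: ASCII letters followed by combining marks.
line2 = bytes.fromhex("4e 69 68 6f cc 81 c5 8b 67 cc 84 6f").decode("utf-8")

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

print(code_points(line1))
print(code_points(line2))

# Both spellings are canonically equivalent: composing the second gives the first.
same_after_nfc = unicodedata.normalize("NFC", line2) == line1
print(same_after_nfc)
```

In the second line the combining marks (U+0301, U+0304) follow their base letters, which matches the note in the edit at the top of this comment.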

@intractabilis
Author

Sorry, mea culpa. You are right. My LTeX is set to check on save, and I forgot to save when I experimented. Thanks for the workaround!

@ed359

ed359 commented Mar 31, 2021

See also valentjn/ltex-ls#56 (comment). Having \H{o} and \H{u} would be useful for Hungarian. I made a PR: valentjn/ltex-ls#57

@valentjn valentjn added this to the 10.0.0 milestone Apr 5, 2021
@valentjn
Owner

valentjn commented Apr 5, 2021

I replaced the whole accent table with Java's Unicode normalization algorithm. This means that combined characters will be used if they exist, otherwise combining diacritical marks will be used (Normalization Form C in Unicode Standard Annex #15).

Combining multiple accents on a single letter (e.g., \'{\=n}) is still not supported, since LaTeX itself seems to have difficulties with this without special packages. Users who rely on multiple accents on a single letter need to use Unicode instead.
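The idea of replacing a per-letter accent table with normalization can be sketched as follows. This is not the ltex-ls implementation (which relies on Java's normalizer); it is a rough Python illustration in which `unicodedata.normalize` stands in for the Java API, and the `COMBINING` table, the `ACCENT` pattern, and the function name `latex_accents_to_unicode` are all hypothetical:

```python
import re
import unicodedata

# Hypothetical subset of LaTeX accent commands and their combining marks.
COMBINING = {
    "`": "\u0300",  # grave
    "'": "\u0301",  # acute
    "^": "\u0302",  # circumflex
    "=": "\u0304",  # macron
    "H": "\u030b",  # double acute
}

# Symbol accents accept \'o or \'{o}; \H is only handled in the braced form \H{o}.
ACCENT = re.compile(
    r"\\(?:([`'^=])(?:\{([A-Za-z])\}|([A-Za-z]))|(H)\{([A-Za-z])\})"
)

def latex_accents_to_unicode(text):
    """Replace single accent commands by letter + combining mark, then apply NFC."""
    def replace(match):
        accent = match.group(1) or match.group(4)
        letter = match.group(2) or match.group(3) or match.group(5)
        # NFC picks the precomposed character if one exists,
        # otherwise the combining mark is kept after the letter.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return ACCENT.sub(replace, text)

print(latex_accents_to_unicode(r"\=g"))     # precomposed g with macron
print(latex_accents_to_unicode(r"\H{o}"))   # precomposed o with double acute
print(latex_accents_to_unicode(r"\=n"))     # n followed by a combining macron
```

With this approach no letter needs its own table entry: only the accent commands are listed, and normalization decides between a precomposed character and a combining sequence. Like the fix described above, the sketch makes no attempt to resolve nested accents such as \'{\=n}.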

@valentjn valentjn added the 3-fixed Issue resolution: Issue has been fixed on the develop branch label Apr 5, 2021
@valentjn
Owner

valentjn commented Apr 5, 2021

Fix released in 10.0.0.

@intractabilis
Author

Thanks!

me-johnomar added a commit to me-johnomar/ltex-ls that referenced this issue Jan 31, 2024