LTeX doesn't recognize macron #269

intractabilis · 2021-03-13T08:17:35Z

Usually, when LTeX encounters a word it doesn't know, it underlines it, and the tooltip lets me add the word to the dictionary. If the word has letters with a grave or acute accent, or with a circumflex, LTeX properly adds all diacritics to the dictionary. However, it skips letters with macrons completely.

Take a look at the following MWE:

\documentclass[11pt, paper = B5]{scrbook}
\usepackage{fontspec}
\setmainfont{Libertinus Serif}
\begin{document}
Nih\'oŋ\=go
\end{document}

It's a single word, "Nihóŋḡo". Here is what the tooltip looks like:

You can see it is "Nihóŋo" instead of "Nihóŋḡo". "Nihóŋo" is also what gets added to the dictionary.

Operating system: Linux (linux), x64, 5.10.19-1-MANJARO
VS Code: 1.53.2
vscode-ltex: 9.0.0
ltex-ls: 10.0.0
Java: 15.0.2

valentjn · 2021-03-21T17:39:59Z

Thanks for the report. LTeX needs to manually map each accented letter to the correct Unicode code point. As g with a macron isn't in that mapping, it is skipped. Falling back to the unaccented letter instead of skipping it altogether would probably already be a first step, even though the result would still be incorrect.
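
As an editorial illustration (not LTeX's actual code), the table-lookup approach described above and the proposed fallback could look roughly like this in Python. `ACCENT_TABLE` and `convert_accent` are hypothetical names, and the table is deliberately incomplete:

```python
# Hypothetical, deliberately incomplete table:
# (LaTeX accent command, base letter) -> precomposed Unicode character.
ACCENT_TABLE = {
    ("'", "o"): "\u00f3",  # o with acute
    ("`", "o"): "\u00f2",  # o with grave
    ("^", "o"): "\u00f4",  # o with circumflex
    ("=", "o"): "\u014d",  # o with macron
}


def convert_accent(command: str, letter: str) -> str:
    """Return the mapped character, or the bare letter if the pair is unknown."""
    return ACCENT_TABLE.get((command, letter), letter)


print(convert_accent("'", "o"))  # pair is in the table
print(convert_accent("=", "g"))  # pair is missing: falls back to plain "g"
```

With the fallback, \=g would at least yield "g" rather than disappearing, which is still wrong but keeps the rest of the word intact.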

Do you just need support for g? Or do you also need other letters? If yes, which ones?

intractabilis · 2021-03-21T20:07:09Z

I am typesetting a text with romanization of Japanese pronunciation. Acute, grave and circumflex accents are used in that romanization to indicate pitch variation and can appear above any letter, even a consonant. For example:

Ideally, I could add itte and kara without diacritics to the LTeX dictionary, and LTeX would accept them as correct in the text regardless of the diacritics.

The macron is used only above n and g (n̄ and ḡ) to indicate nasal pronunciation. But n̄ can also carry an acute above the macron:

In this case it would be great to have ben̄kyoo in the dictionary.

valentjn · 2021-03-26T18:25:03Z

Ah, okay. Then we'll have to use Unicode's combining diacritical marks: there is a precomposed Unicode character ḡ for g with macron, but of course not for every letter, and even less so for two or more diacritics at the same time. I can add support for this, but of course LanguageTool has to support it as well, which it probably does. As a workaround, if your LaTeX engine supports Unicode (e.g., LuaLaTeX), you should already be able to use Unicode instead of \', \=, etc. in your documents.

I think we can't just ignore the diacritics, otherwise we'd have many false negatives for all the languages which use them (languages in which omitting them would be a spelling error). So in your last example, it would add ben̄́kyoo or ben̄́kyoo-site to the dictionary (not just ben̄kyoo), depending on what LanguageTool underlines.
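
To illustrate the false-negative concern, here is a small Python sketch of diacritic-insensitive matching. The function `strip_diacritics` and the one-word dictionary are assumptions made up for this example, not part of LTeX or LanguageTool:

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Decompose to NFD and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Hypothetical user dictionary with one German word ("Maedchen" with a-umlaut).
dictionary = {"M\u00e4dchen"}
stripped_dictionary = {strip_diacritics(w) for w in dictionary}

typo = "Madchen"  # misspelling: the umlaut is missing
print(typo in dictionary)                             # exact matching
print(strip_diacritics(typo) in stripped_dictionary)  # diacritic-insensitive matching

# The romanization from this thread loses both marks on the n as well.
print(strip_diacritics("ben\u0304\u0301kyoo"))
```

Exact matching rejects the misspelling, while the diacritic-insensitive variant accepts it, which is the kind of missed error described above.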

intractabilis · 2021-03-26T21:03:44Z

> As a workaround, if your LaTeX engine supports Unicode (e.g., LuaLaTeX), you should already be able to use Unicode instead of \', \=, etc. in your documents.

That doesn't make any difference, ḡ is still ignored:

> I think we can't just ignore the diacritics, otherwise we'd have many false negatives for all the languages which use them (languages in which omitting them would be a spelling error). So in your last example, it would add ben̄́kyoo or ben̄́kyoo-site to the dictionary (not just ben̄kyoo), depending on what LanguageTool underlines.

It's absolutely fine. If it just doesn't ignore letters with diacritics at all (like ḡ above), it's all good. Thank you!

valentjn · 2021-03-27T11:46:01Z

Edit: I noticed that Unicode places the combining diacritical marks after the letter, not before like in this comment. The point is still valid, though.

That doesn't make any sense; there's no logic in LTeX that would remove this, in contrast to \=. It's working for me for both single-character Unicode (U+1E21: ḡ) and Unicode with combining diacritical marks on ASCII characters (U+0304 U+0067: ḡ):

Hex dump of the file:

00000000  4e 69 68 c3 b3 c5 8b e1  b8 a1 6f 0a 0a 4e 69 68  |Nih.......o..Nih|
00000010  6f cc 81 c5 8b 67 cc 84  6f 0a                    |o....g..o.|
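
For readers who want to reproduce the bytes in the dump, the following Python sketch encodes both spellings of the word as UTF-8. The variable names are arbitrary, and the strings are written with escapes so that the code points are unambiguous:

```python
import unicodedata

# Precomposed characters: o-acute (U+00F3), eng (U+014B), g-macron (U+1E21).
precomposed = "Nih\u00f3\u014b\u1e21o"
# Base letters followed by combining acute (U+0301) and combining macron (U+0304).
combining = "Niho\u0301\u014bg\u0304o"

print(precomposed.encode("utf-8").hex(" "))
print(combining.encode("utf-8").hex(" "))

# Both spellings are canonically equivalent: NFC maps the second onto the first.
print(unicodedata.normalize("NFC", combining) == precomposed)
```

Note that in the encoded bytes the combining mark (cc 84) follows the base letter g (67), in line with the edit note at the top of this comment.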

intractabilis · 2021-03-27T22:36:32Z

Sorry, mea culpa. You are right. My LTeX is set to check on save, and I forgot to save when I experimented. Thanks for the workaround!

ed359 · 2021-03-31T23:32:05Z

See also valentjn/ltex-ls#56 (comment). Having \H{o} and \H{u} would be useful for Hungarian. I made a PR: valentjn/ltex-ls#57.

valentjn · 2021-04-05T08:43:13Z

I replaced the whole accent table with Java's Unicode normalization algorithm. This means that combined characters will be used if they exist, otherwise combining diacritical marks will be used (Normalization Form C in Unicode Standard Annex #15).
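
The idea can be sketched in Python with the standard `unicodedata` module. This is only an illustration of the technique, not the ltex-ls code (which, per the comment above, uses Java's normalization); `COMBINING_MARKS` and `accent_to_unicode` are hypothetical names, and the set of accent commands is an assumed subset:

```python
import unicodedata

# Hypothetical subset: LaTeX accent command -> Unicode combining mark.
COMBINING_MARKS = {
    "'": "\u0301",  # combining acute accent
    "`": "\u0300",  # combining grave accent
    "^": "\u0302",  # combining circumflex accent
    "=": "\u0304",  # combining macron
    "H": "\u030b",  # combining double acute accent
}


def accent_to_unicode(command: str, letter: str) -> str:
    """Append the combining mark to the letter, then compose with NFC."""
    return unicodedata.normalize("NFC", letter + COMBINING_MARKS[command])


def code_points(text: str) -> list:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]


print(code_points(accent_to_unicode("=", "g")))  # a precomposed character exists
print(code_points(accent_to_unicode("=", "n")))  # none exists: letter + combining mark
print(code_points(accent_to_unicode("H", "o")))  # Hungarian double acute
```

Because NFC only composes when a precomposed character exists, no per-letter table is needed: one entry per accent command covers every base letter.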

Combining multiple accents on a single letter (e.g., \'{\=n}) is still not supported, since LaTeX itself seems to have difficulties with this without special packages. Users who rely on multiple accents on a single letter need to use Unicode instead.

valentjn · 2021-04-05T17:10:17Z

Fix released in 10.0.0.

intractabilis · 2021-04-09T04:58:38Z

Thanks!

Fixes valentjn/vscode-ltex#269.

intractabilis added 1-bug 🐛 Issue type: Bug report (something isn't working as expected) 2-unconfirmed Issue status: Bug that needs to be reproduced (all new bugs have this label) labels Mar 13, 2021

valentjn added 2-confirmed Issue status: Confirmed, reproducible bug in LTeX and removed 2-unconfirmed Issue status: Bug that needs to be reproduced (all new bugs have this label) labels Mar 21, 2021

valentjn self-assigned this Mar 21, 2021

valentjn added the 2-needs-info Issue status: We need more information (usually) from the submitter before continuing label Mar 21, 2021

valentjn removed the 2-needs-info Issue status: We need more information (usually) from the submitter before continuing label Mar 26, 2021

valentjn added this to the 10.0.0 milestone Apr 5, 2021

valentjn closed this as completed in valentjn/ltex-ls@d63168e Apr 5, 2021

valentjn added the 3-fixed Issue resolution: Issue has been fixed on the develop branch label Apr 5, 2021

me-johnomar added a commit to me-johnomar/ltex-ls that referenced this issue Jan 31, 2024

Use Unicode normalization for LaTeX accents

c237089

Fixes valentjn/vscode-ltex#269.
