
Add basic support for Indic scripts in addition to CJK. #1060

Merged: kevinbarabash merged 1 commit into master from brahmic-scripts on Jan 13, 2018

Conversation

davidflanagan:

This patch just makes KaTeX recognize Unicode codepoints in the
range \u0900-\u109f so that those South and Southeast Asian scripts
do not get automatically rejected.

The patch also generalizes the way that Unicode blocks are handled
to make it easier to add support for new scripts in the future.
src/unicodeRegexes.js is replaced with the new file src/unicodeScripts.js

@davidflanagan (Author):

What do you think, Kevin?

.gitignore Outdated
@@ -13,3 +13,4 @@ diff.png
/test/symgroups.pdf
/test/screenshotter/unicode-fonts
coverage/
*~
Author:

This is for emacs backup files.

Member:

Should probably add this to your global gitignore

@@ -198,10 +198,15 @@ const getCharacterMetrics = function(
     let ch = character.charCodeAt(0);
     if (character[0] in extraCharacterMap) {
         ch = extraCharacterMap[character[0]].charCodeAt(0);
-    } else if (cjkRegex.test(character[0])) {
+    } else if (supportedCodepoint(ch) && !metricMap[font][ch]) {
Author:

Note the && !metricMap bit I added on this line... I was thinking that this allows the addition of other font metrics in the future. Not sure whether that makes sense, though.

Author:

Actually, I think I need to think some more about this. In Devanagari (and probably most of these other Brahmic scripts) there are a lot of combining characters which generally turn into ligatures and have no intrinsic width of their own. Using a full em width is wrong given that somewhere around half of the characters are zero-width vowel signs.

At a minimum I should choose a narrower reference character for Brahmic scripts. Really, though, I don't think we can measure characters from those scripts in isolation: to be accurate we would have to measure the whole string. I haven't looked yet at how this code is used, but I worry that a larger architectural change would be required here for complete accuracy.

For our needs at KA we only need to be able to display these characters in plain \text{} environments. We're not trying to center them above a big capital sigma or anything, so I'd guess that super-accurate measurements are not all that necessary.
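For illustration (a quick sketch, not part of the patch, assuming any modern JS runtime): splitting a Devanagari word into code points shows that a third of them are combining marks with no width of their own.

// "नमस्ते" renders as three or four glyphs but is six code units; U+094D (virama)
// and U+0947 (vowel sign E) are combining marks that contribute no width by themselves.
const word = "नमस्ते";
for (const ch of word) {
    console.log("U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}
// Logs U+0928, U+092E, U+0938, U+094D, U+0924, U+0947 (one per line)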

Member:

The only place we actually use the width of any glyph is when we're positioning accent glyphs. We still need some sort of height/depth for these so what you've done here I think is okay for now.


export function supportedCodepoint(codepoint: number): boolean {
return scriptFromCodepoint(codepoint) !== null;
}
Author:

I could make this faster by statically concatenating all of the ranges for all scripts into a single array; then there wouldn't be a nested loop. I'm assuming that this has similar performance to using a regex. But we could also compile the ranges into a regexp at startup time and use it here on a character instead of a codepoint if performance is a concern.
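For illustration, a rough sketch of the compile-a-regex-at-startup alternative mentioned above (the scripts table mirrors the ranges quoted elsewhere in this thread; the names allScriptsRegex and supportedCharacter are invented for the sketch):

// Build one character-class regex from the per-script ranges at module load time,
// then test single characters against it instead of scanning ranges on every call.
const scripts = {
    cjk: [[0x3000, 0x30FF], [0x4E00, 0x9FAF], [0xFF00, 0xFF60]],
    hangul: [[0xAC00, 0xD7AF]],
    brahmic: [[0x0900, 0x109F]],
};

const allScriptsRegex = new RegExp("[" +
    Object.values(scripts).flat().map(([min, max]) =>
        "\\u" + min.toString(16).padStart(4, "0") +
        "-\\u" + max.toString(16).padStart(4, "0")
    ).join("") +
"]");

function supportedCharacter(char) {
    return allScriptsRegex.test(char);
}

// supportedCharacter("न") === true, supportedCharacter("x") === false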

Member:

I think it's fine. Although perf is important, we aren't measuring it yet. Once we start measuring it, and if we notice a drop in perf, then we can start optimizing things. If you have time maybe have a look at #1054.

@kevinbarabash (Member):

Not sure why the Unicode screenshot test is failing; it renders fine using the test page. I'm going to run make screenshots locally and try to figure out what changed.

@kevinbarabash (Member):

The tests are failing because the browser in the Docker container isn't loading a font with the glyphs for the Hangul and CJK parts of the Unicode screenshot test.

[Screenshot: unicode-firefox]

This is because the spans containing the Hangul and CJK text no longer have hangul_fallback and cjk_fallback as part of their class. These classes are used in test.html to load the appropriate fonts.

@kevinbarabash (Member) left a comment:

A few nits that need to be fixed.

src/domTree.js Outdated
// script names
const script = scriptFromCodepoint(this.value.charCodeAt[0]);
if (script) {
this.classes.push(script + "_fallback");
Member:

For some reason this isn't getting applied as expected.

src/domTree.js Outdated
// We use CSS class names like cjk_fallback, hangul_fallback and
// indic_fallback. See ./unicodeScripts.js for the set of possible
// script names
const script = scriptFromCodepoint(this.value.charCodeAt[0]);
Member:

I can't believe flow didn't pick this up.
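For reference, the likely fix (a sketch based on the lines quoted above) is simply to call the method instead of indexing it:

// charCodeAt is a method; this.value.charCodeAt[0] evaluates to undefined, so
// scriptFromCodepoint(undefined) never matches and the fallback class is never pushed.
const script = scriptFromCodepoint(this.value.charCodeAt(0));
if (script) {
    this.classes.push(script + "_fallback");
}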

Member:

I was worried that our collating didn't respect script boundaries, but it does. I tested \text{私はバナナです여보세요} and it does the right thing by rendering two spans, one for CJK and one for Hangul.


// Korean
hangul: [0xAC00, 0xD7AF],

// The Brahmic scripts of South and Southeast Asia
Member:

So many different scripts... cool!

0x3000, 0x30FF, // CJK symbols and punctuation, Hiragana, Katakana
0x4E00, 0x9FAF, // CJK ideograms
0xFF00, 0xFF60, // Fullwidth punctuation
// TODO: add halfwidth Katakana and Romanji glyphs
Member:

Maybe change these to:

cjk: [
    [0x3000, 0x30FF],
    [0x4E00, 0x9FAF],
    [0xFF00, 0xFF60],
],
brahmic: [ 
    [0x0900, 0x109F],
],

Member:

It simplifies the logic in scriptFromCodepoint and makes it more obvious what's going on when there are multiple ranges.
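For reference, a sketch of what scriptFromCodepoint might look like over that nested-array shape (the table name scriptData is assumed, not quoted from the patch):

// Each script maps to an array of [min, max] codepoint ranges, so the lookup
// is a plain scan over the scripts and their ranges.
const scriptData = {
    cjk: [[0x3000, 0x30FF], [0x4E00, 0x9FAF], [0xFF00, 0xFF60]],
    hangul: [[0xAC00, 0xD7AF]],
    brahmic: [[0x0900, 0x109F]],
};

function scriptFromCodepoint(codepoint) {
    for (const script in scriptData) {
        for (const [min, max] of scriptData[script]) {
            if (codepoint >= min && codepoint <= max) {
                return script;
            }
        }
    }
    return null;
}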

Author:

I ended up changing this all around to make Flow happy and to speed up perf. (I was making those changes before you started the review... sorry about that.) But now that I've made codepointSupported() fast, I can probably make this change to make the data structure clearer...



@kevinbarabash (Member) left a comment:

Sorry, hit the wrong button.


it("should parse Devangari inside \\text{}", function() {
expect('\\text{नमस्ते}').toParse();
});
Member:

Thanks for adding tests.

@davidflanagan (Author):

@kevinbarabash: sorry for asking for a review and then continuing to work on this... I believe I've addressed all of your requests. The new unicodeScripts.js file has changed substantially since you reviewed it, so you might want to take another look.

// not its width.
if (supportedCodepoint(ch)) {
metrics = metricMap[font][77]; // 77 is the charcode for 'M'
}
Member:

I see, we fall back to the metrics for 'M' for any code point. Good enough for now.

Collaborator:

This is really cool, and yet I have a question. Many, perhaps most, of these characters are not in Times New Roman, so we don't even know which font will be used to display them. And yet at least some of these characters are taller than an "M".

[Screenshot: height1]

Is there a widely-used system font that contains all these characters? Then we could specify it in KaTeX CSS and get some font metrics, at least height and depth.

for (let codepoint = 0; codepoint <= 0xffff; codepoint++) {
    expect(supportedCodepoint(codepoint)).toBe(
        allRE.test(String.fromCharCode(codepoint))
    );
Member:

wow

});

it("scriptFromCodepoint() should return correct values", () => {
for (let codepoint = 0; codepoint <= 0xffff; codepoint++) {
Member:

Hopefully these tests don't take that long to run.

Member:

I had a look: unicode-spec.js takes 7s vs 9s for katex-spec.js, so I think we're okay.

kevinbarabash merged commit 7fe6af2 into master on Jan 13, 2018
davidflanagan deleted the brahmic-scripts branch on January 13, 2018 at 00:19
davidflanagan pushed a commit that referenced this pull request Jan 19, 2018
This diff is a follow-up to PR #1060 which added support for Indic scripts.
In order to support Czech, Turkish and Hungarian text (at least) inside
\text{} environments, we need to recognize the Latin Extended A and B
Unicode blocks. The patch also adds support for Georgian, and enhances
support for Cyrillic by defining the entire Cyrillic unicode block instead
of defining symbols for a subset of Cyrillic letters as we did previously.
kevinbarabash pushed a commit that referenced this pull request Jan 22, 2018
* Support more scripts in \text{} environments.

This diff is a follow-up to PR #1060 which added support for Indic scripts.
In order to support Czech, Turkish and Hungarian text (at least) inside
\text{} environments, we need to recognize the Latin Extended A and B
Unicode blocks. The patch also adds support for Georgian, and enhances
support for Cyrillic by defining the entire Cyrillic unicode block instead
of defining symbols for a subset of Cyrillic letters as we did previously.

* Only return fontMetrics for supported Unicode scripts in text mode

The Unicode scripts listed in unicodeScripts.js are supported in text mode, but
getCharacterMetrics() was returning fake metrics for them even in math mode.
This caused bad handling of \boldsymbol\imath.

* use Mode from types.js
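A rough sketch of the text-mode guard described in the second bullet above (the helper name fallbackMetrics and its signature are invented for illustration; they are not quoted from the commit):

// Only return the fake fallback metrics (those of 'M') for supported scripts in
// text mode, so math-mode input such as \boldsymbol\imath keeps its real handling.
function fallbackMetrics(ch, font, mode, metricMap) {
    if (mode === "text" && supportedCodepoint(ch)) {
        return metricMap[font][77]; // 77 is the charcode for 'M'
    }
    return null;
}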