Add basic support for Indic scripts in addition to CJK. #1060
Conversation
What do you think, Kevin?
.gitignore
Outdated
@@ -13,3 +13,4 @@ diff.png
/test/symgroups.pdf
/test/screenshotter/unicode-fonts
coverage/
*~
This is for emacs backup files.
Should probably add this to your global gitignore
src/fontMetrics.js
Outdated
@@ -198,10 +198,15 @@ const getCharacterMetrics = function(
let ch = character.charCodeAt(0);
if (character[0] in extraCharacterMap) {
    ch = extraCharacterMap[character[0]].charCodeAt(0);
} else if (cjkRegex.test(character[0])) {
} else if (supportedCodepoint(ch) && !metricMap[font][ch]) {
Note the && !metricMap bit I added on this line... I was thinking that this allows the addition of other font metrics in the future. Not sure whether that makes sense, though.
Actually, I think I need to think some more about this. In Devanagari (and probably most of the other Brahmic scripts) there are a lot of combining characters which generally turn into ligatures and have no intrinsic width of their own. Using a full em width is wrong given that somewhere around half of the characters are zero-width vowels.
At a minimum I should choose a narrower character for Brahmic scripts. Really, though, I don't think we can measure characters from those scripts in isolation: to be accurate we have to measure the whole string of them. I haven't looked yet at how this code is used, but I worry that a larger architectural change is required here for complete accuracy.
For our needs at KA we only need to be able to display these characters in plain \text{} environments. Not trying to center them above a big capital sigma character or anything, so I'd guess that super-accurate measurements are not all that necessary.
The only place we actually use the width of any glyph is when we're positioning accent glyphs. We still need some sort of height/depth for these, so what you've done here I think is okay for now.
src/unicodeScripts.js
Outdated
export function supportedCodepoint(codepoint: number): boolean {
    return scriptFromCodepoint(codepoint) !== null;
}
I could make this faster by statically concatenating all of the ranges for all scripts into a single array; then there wouldn't be a nested loop. I'm assuming that this has similar performance to using a regex. If performance is a concern, we could also compile the ranges into a regexp at startup time and use that here on a character instead of a codepoint.
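For what it's worth, the regexp option mentioned here might be sketched roughly like this. The range data, names, and structure below are illustrative assumptions for the sketch, not the actual contents of unicodeScripts.js:

```javascript
// Sketch: compile all script ranges into one character-class regex at
// startup, so a single regex test replaces the nested loop over scripts.
// The ranges listed here are examples only.
const allRanges = [
    [0x0900, 0x109F], // Brahmic scripts
    [0x3000, 0x30FF], // CJK symbols, Hiragana, Katakana
    [0x4E00, 0x9FAF], // CJK ideograms
    [0xAC00, 0xD7AF], // Hangul
];

// Build a character class like /[\u0900-\u109F\u3000-\u30FF...]/ once.
const supportedRE = new RegExp("[" + allRanges.map(
    ([lo, hi]) => String.fromCharCode(lo) + "-" + String.fromCharCode(hi)
).join("") + "]");

function supportedCodepoint(codepoint) {
    // Convert the codepoint back to a character and test it in one shot.
    return supportedRE.test(String.fromCharCode(codepoint));
}
```

The trade-off is that the ranges must all be BMP codepoints for String.fromCharCode to work as shown.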
I think it's fine. Although perf is important, we aren't measuring it yet. Once we start measuring it, and if we notice a drop in perf, then we can start optimizing things. If you have time maybe have a look at #1054.
Not sure why the Unicode screenshot test is failing. It renders using the test page. I'm going to run …
The reason the tests are failing is that the browser in the Docker container isn't loading a font with the glyphs for the Hangul and CJK parts of the Unicode screenshot test. This is because the spans containing the Hangul and CJK text no longer have …
A few nits that need to be fixed.
src/domTree.js
Outdated
// script names
const script = scriptFromCodepoint(this.value.charCodeAt[0]);
if (script) {
    this.classes.push(script + "_fallback");
For some reason this isn't getting applied as expected.
src/domTree.js
Outdated
// We use CSS class names like cjk_fallback, hangul_fallback and
// indic_fallback. See ./unicodeScripts.js for the set of possible
// script names
const script = scriptFromCodepoint(this.value.charCodeAt[0]);
I can't believe flow didn't pick this up.
I was worried that our collating didn't respect script boundaries, but it does. I tested \text{私はバナナです여보세요} and it does the right thing by rendering two spans, one for CJK and one for Hangul.
src/unicodeScripts.js
Outdated
// Korean
hangul: [0xAC00, 0xD7AF],

// The Brahmic scripts of South and Southeast Asia
So many different scripts... cool!
src/unicodeScripts.js
Outdated
0x3000, 0x30FF, // CJK symbols and punctuation, Hiragana, Katakana
0x4E00, 0x9FAF, // CJK ideograms
0xFF00, 0xFF60, // Fullwidth punctuation
// TODO: add halfwidth Katakana and Romanji glyphs
Maybe change these to:
cjk: [
[0x3000, 0x30FF],
[0x4E00, 0x9FAF],
[0xFF00, 0xFF60],
],
brahmic: [
[0x0900, 0x109F],
],
It simplifies the logic in scriptFromCodepoint and makes it more obvious what's going on when there are multiple ranges.
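For illustration, a scriptFromCodepoint over that nested-array shape might look like this. The data and names below follow the suggestion in this thread and are a sketch, not the actual patch:

```javascript
// Sketch only: each script maps to an array of [start, end] codepoint
// ranges, mirroring the suggested data shape above.
const scriptData = {
    cjk: [
        [0x3000, 0x30FF],
        [0x4E00, 0x9FAF],
        [0xFF00, 0xFF60],
    ],
    brahmic: [
        [0x0900, 0x109F],
    ],
};

function scriptFromCodepoint(codepoint) {
    // With every script holding an array of ranges, one uniform nested
    // loop handles single- and multi-range scripts alike.
    for (const name of Object.keys(scriptData)) {
        for (const [start, end] of scriptData[name]) {
            if (codepoint >= start && codepoint <= end) {
                return name;
            }
        }
    }
    return null;
}
```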
I ended up changing this all around to make Flow happy and to speed up perf. (I was making those changes before you started the review... sorry about that.) But now that I've made codepointSupported() fast, I can probably make this change to make the data structure clearer...
Sorry, hit the wrong button.
it("should parse Devangari inside \\text{}", function() {
    expect('\\text{नमस्ते}').toParse();
});
Thanks for adding tests.
This patch just makes KaTeX recognize Unicode codepoints in the range \u0900-\u109f so that those South and Southeast Asian scripts do not get automatically rejected. The patch also generalizes the way that Unicode blocks are handled to make it easier to add support for new scripts in the future. src/unicodeRegexes.js is replaced with the new file src/unicodeScripts.js.
Force-pushed from fafaefe to 1c65263.
@kevinbarabash: sorry for asking for a review and then continuing to work on this... I believe I've addressed all of your requests. The new unicodeScripts.js file has changed substantially since you reviewed it, so you might want to take another look.
// not its width.
if (supportedCodepoint(ch)) {
    metrics = metricMap[font][77]; // 77 is the charcode for 'M'
}
I see, we fall back to the metrics for 'M' for any code point. Good enough for now.
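A minimal sketch of the fallback being discussed, where metricMap's shape and the single supported range are assumptions for illustration rather than KaTeX's exact code:

```javascript
// Hypothetical single supported range (Devanagari..Myanmar, per the PR).
const SUPPORTED_START = 0x0900;
const SUPPORTED_END = 0x109F;

function supportedCodepoint(ch) {
    return ch >= SUPPORTED_START && ch <= SUPPORTED_END;
}

// Assumed shape: metricMap[font][charcode] -> {height, depth, width}.
function getCharacterMetrics(character, font, metricMap) {
    const ch = character.charCodeAt(0);
    let metrics = metricMap[font][ch];
    if (!metrics && supportedCodepoint(ch)) {
        // Fall back to the metrics for 'M' (charcode 77): the width will
        // be wrong for many scripts, but only height/depth matter here.
        metrics = metricMap[font][77];
    }
    return metrics;
}
```

Unsupported characters with no metrics still come back undefined, so callers can distinguish "no glyph" from "fallback metrics".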
This is really cool, and yet I have a question. Many, perhaps most, of these characters are not in Times New Roman, so we don't even know which font will display. And yet at least some of these characters are taller than an "M".
Is there a widely used system font that contains all these characters? Then we could specify it in the KaTeX CSS and get some font metrics, at least height and depth.
for (let codepoint = 0; codepoint <= 0xffff; codepoint++) {
    expect(supportedCodepoint(codepoint)).toBe(
        allRE.test(String.fromCharCode(codepoint))
    );
wow
});

it("scriptFromCodepoint() should return correct values", () => {
    for (let codepoint = 0; codepoint <= 0xffff; codepoint++) {
Hopefully these tests don't take that long to run.
I had a look and unicode-spec.js takes 7s vs. 9s for katex-spec.js, so I think we're okay.
This diff is a follow-up to PR #1060, which added support for Indic scripts. In order to support Czech, Turkish and Hungarian text (at least) inside \text{} environments, we need to recognize the Latin Extended A and B Unicode blocks. The patch also adds support for Georgian, and enhances support for Cyrillic by defining the entire Cyrillic Unicode block instead of defining symbols for only a subset of Cyrillic letters as we did previously.
* Support more scripts in \text{} environments (the follow-up to PR #1060 described above).
* Only return fontMetrics for supported Unicode scripts in text mode. The Unicode scripts listed in unicodeScripts.js are supported in text mode, but getCharacterMetrics() was returning fake metrics for them even in math mode. This caused bad handling of \boldsymbol\imath.
* Use Mode from types.js.