
Add basic support for Indic scripts in addition to CJK. #1060

Merged: kevinbarabash merged 1 commit into master from brahmic-scripts on Jan 13, 2018

Conversation

davidflanagan:

This patch just makes KaTeX recognize Unicode codepoints in the
range \u0900-\u109f so that those South and Southeast Asian scripts
do not get automatically rejected.

The patch also generalizes the way that Unicode blocks are handled
to make it easier to add support for new scripts in the future.
src/unicodeRegexes.js is replaced with the new file src/unicodeScripts.js

@davidflanagan (Author):

What do you think, Kevin?

.gitignore Outdated
@@ -13,3 +13,4 @@ diff.png
/test/symgroups.pdf
/test/screenshotter/unicode-fonts
coverage/
*~
Author:

This is for emacs backup files.

Member:

Should probably add this to your global gitignore

@@ -198,10 +198,15 @@ const getCharacterMetrics = function(
     let ch = character.charCodeAt(0);
     if (character[0] in extraCharacterMap) {
         ch = extraCharacterMap[character[0]].charCodeAt(0);
-    } else if (cjkRegex.test(character[0])) {
+    } else if (supportedCodepoint(ch) && !metricMap[font][ch]) {
Author:

Note the && !metricMap bit I added on this line... I was thinking that this allows the addition of other font metrics in the future. Not sure whether that makes sense, though.

Author:

Actually, I think I need to think some more about this. In Devanagari (and probably most of these other Brahmic scripts) there are a lot of combining characters which generally turn into ligatures and have no intrinsic width of their own. Using a full em width is wrong given that somewhere around half of the characters are zero-width vowel signs.

At a minimum I should choose a narrower reference character for Brahmic scripts. Really, though, I don't think we can measure characters from those scripts in isolation: to be accurate we would have to measure the whole string. I haven't looked yet at how this code is used, but I worry that a larger architectural change would be required here for complete accuracy.

For our needs at KA we only need to be able to display these characters in plain \text{} environments. We're not trying to center them above a big capital sigma or anything, so I'd guess that super-accurate measurements are not all that necessary.
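For illustration (a quick sketch, not part of the patch, assuming any modern JS runtime): splitting a Devanagari word into code points shows that a third of them are combining marks with no width of their own.

// "नमस्ते" renders as three or four glyphs but is six code units; U+094D (virama)
// and U+0947 (vowel sign E) are combining marks that contribute no width by themselves.
const word = "नमस्ते";
for (const ch of word) {
    console.log("U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}
// Logs U+0928, U+092E, U+0938, U+094D, U+0924, U+0947 (one per line)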

Member:

The only place we actually use the width of any glyph is when we're positioning accent glyphs. We still need some sort of height/depth for these so what you've done here I think is okay for now.


export function supportedCodepoint(codepoint: number): boolean {
return scriptFromCodepoint(codepoint) !== null;
}
Author:

I could make this faster by statically concatenating all of the ranges for all scripts into a single array; then there wouldn't be a nested loop. I'm assuming that this has similar performance to using a regex. But we could also compile the ranges into a regexp at startup time and use it here on a character instead of a codepoint if performance is a concern.
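For illustration, a rough sketch of the compile-a-regex-at-startup alternative mentioned above (the scripts table mirrors the ranges quoted elsewhere in this thread; the names allScriptsRegex and supportedCharacter are invented for the sketch):

// Build one character-class regex from the per-script ranges at module load time,
// then test single characters against it instead of scanning ranges on every call.
const scripts = {
    cjk: [[0x3000, 0x30FF], [0x4E00, 0x9FAF], [0xFF00, 0xFF60]],
    hangul: [[0xAC00, 0xD7AF]],
    brahmic: [[0x0900, 0x109F]],
};

const allScriptsRegex = new RegExp("[" +
    Object.values(scripts).flat().map(([min, max]) =>
        "\\u" + min.toString(16).padStart(4, "0") +
        "-\\u" + max.toString(16).padStart(4, "0")
    ).join("") +
"]");

function supportedCharacter(char) {
    return allScriptsRegex.test(char);
}

// supportedCharacter("न") === true, supportedCharacter("x") === false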

Member:

I think it's fine. Although perf is important, we aren't measuring it yet. Once we start measuring it, and if we notice a drop in perf, then we can start optimizing things. If you have time maybe have a look at #1054.

@kevinbarabash (Member):

Not sure why the Unicode screenshot test is failing; it renders fine using the test page. I'm going to run make screenshots locally and try to figure out what changed.

@kevinbarabash (Member):

The tests are failing because the browser in the Docker container isn't loading a font with the glyphs for the Hangul and CJK parts of the Unicode screenshot test.

[Screenshot: unicode-firefox]

This is because the spans containing the Hangul and CJK text no longer have hangul_fallback and cjk_fallback as part of their class. These classes are used in test.html to load the appropriate fonts.

@kevinbarabash (Member) left a comment:

A few nits that need to be fixed.

src/domTree.js Outdated
// script names
const script = scriptFromCodepoint(this.value.charCodeAt[0]);
if (script) {
this.classes.push(script + "_fallback");
Member:

For some reason this isn't getting applied as expected.

src/domTree.js Outdated
// We use CSS class names like cjk_fallback, hangul_fallback and
// indic_fallback. See ./unicodeScripts.js for the set of possible
// script names
const script = scriptFromCodepoint(this.value.charCodeAt[0]);
Member:

I can't believe flow didn't pick this up.
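For reference, the likely fix (a sketch based on the lines quoted above) is simply to call the method instead of indexing it:

// charCodeAt is a method; this.value.charCodeAt[0] evaluates to undefined, so
// scriptFromCodepoint(undefined) never matches and the fallback class is never pushed.
const script = scriptFromCodepoint(this.value.charCodeAt(0));
if (script) {
    this.classes.push(script + "_fallback");
}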

Member:

I was worried that our collating didn't respect script boundaries, but it does. I tested \text{私はバナナです여보세요} and it does the right thing by rendering two spans, one for CJK and one for Hangul.


// Korean
hangul: [0xAC00, 0xD7AF],

// The Brahmic scripts of South and Southeast Asia
Member:

So many different scripts... cool!

0x3000, 0x30FF, // CJK symbols and punctuation, Hiragana, Katakana
0x4E00, 0x9FAF, // CJK ideograms
0xFF00, 0xFF60, // Fullwidth punctuation
// TODO: add halfwidth Katakana and Romanji glyphs
Member:

Maybe change these to:

cjk: [
    [0x3000, 0x30FF],
    [0x4E00, 0x9FAF],
    [0xFF00, 0xFF60],
],
brahmic: [ 
    [0x0900, 0x109F],
],

Member:

It simplifies the logic in scriptFromCodepoint and makes it more obvious what's going on when there are multiple ranges.
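For reference, a sketch of what scriptFromCodepoint might look like over that nested-array shape (the table name scriptData is assumed, not quoted from the patch):

// Each script maps to an array of [min, max] codepoint ranges, so the lookup
// is a plain scan over the scripts and their ranges.
const scriptData = {
    cjk: [[0x3000, 0x30FF], [0x4E00, 0x9FAF], [0xFF00, 0xFF60]],
    hangul: [[0xAC00, 0xD7AF]],
    brahmic: [[0x0900, 0x109F]],
};

function scriptFromCodepoint(codepoint) {
    for (const script in scriptData) {
        for (const [min, max] of scriptData[script]) {
            if (codepoint >= min && codepoint <= max) {
                return script;
            }
        }
    }
    return null;
}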

Author:

I ended up changing this all around to make Flow happy and to speed up perf. (I was making those changes before you started the review... sorry about that.) But now that I've made codepointSupported() fast, I can probably make this change to make the data structure clearer...



@kevinbarabash (Member) left a comment:

Sorry, hit the wrong button.


it("should parse Devangari inside \\text{}", function() {
expect('\\text{नमस्ते}').toParse();
});
Member:

Thanks for adding tests.

@davidflanagan (Author):

@kevinbarabash: sorry for asking for a review and then continuing to work on this... I believe I've addressed all of your requests. The new unicodeScripts.js file has changed substantially since you reviewed it, so you might want to take another look.

// not its width.
if (supportedCodepoint(ch)) {
metrics = metricMap[font][77]; // 77 is the charcode for 'M'
}
Member:

I see, we fall back to the metrics for 'M' for any code point. Good enough for now.

Collaborator:

This is really cool, and yet I have a question. Many, perhaps most, of these characters are not in Times New Roman, so we don't even know which font will be used to display them. And yet at least some of these characters are taller than an "M".

[Screenshot: height1]

Is there a widely-used system font that contains all these characters? Then we could specify it in KaTeX CSS and get some font metrics, at least height and depth.

for (let codepoint = 0; codepoint <= 0xffff; codepoint++) {
    expect(supportedCodepoint(codepoint)).toBe(
        allRE.test(String.fromCharCode(codepoint))
    );
Member:

wow

});

it("scriptFromCodepoint() should return correct values", () => {
for (let codepoint = 0; codepoint <= 0xffff; codepoint++) {
Member:

Hopefully these tests don't take that long to run.

Member:

I had a look: unicode-spec.js takes 7s vs 9s for katex-spec.js, so I think we're okay.

kevinbarabash merged commit 7fe6af2 into master on Jan 13, 2018
davidflanagan deleted the brahmic-scripts branch on January 13, 2018 at 00:19
davidflanagan pushed a commit that referenced this pull request Jan 19, 2018
This diff is a follow-up to PR #1060 which added support for Indic scripts.
In order to support Czech, Turkish and Hungarian text (at least) inside
\text{} environments, we need to recognize the Latin Extended A and B
Unicode blocks. The patch also adds support for Georgian, and enhances
support for Cyrillic by defining the entire Cyrillic unicode block instead
of defining symbols for a subset of Cyrillic letters as we did previously.
kevinbarabash pushed a commit that referenced this pull request Jan 22, 2018
* Support more scripts in \text{} environments.

This diff is a follow-up to PR #1060 which added support for Indic scripts.
In order to support Czech, Turkish and Hungarian text (at least) inside
\text{} environments, we need to recognize the Latin Extended A and B
Unicode blocks. The patch also adds support for Georgian, and enhances
support for Cyrillic by defining the entire Cyrillic unicode block instead
of defining symbols for a subset of Cyrillic letters as we did previously.

* Only return fontMetrics for supported Unicode scripts in text mode

The Unicode scripts listed in unicodeScripts.js are supported in text mode, but
getCharacterMetrics() was returning fake metrics for them even in math mode.
This caused bad handling of \boldsymbol\imath.

* use Mode from types.js
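A rough sketch of the text-mode guard described in the second bullet above (the helper name fallbackMetrics and its signature are invented for illustration; they are not quoted from the commit):

// Only return the fake fallback metrics (those of 'M') for supported scripts in
// text mode, so math-mode input such as \boldsymbol\imath keeps its real handling.
function fallbackMetrics(ch, font, mode, metricMap) {
    if (mode === "text" && supportedCodepoint(ch)) {
        return metricMap[font][77]; // 77 is the charcode for 'M'
    }
    return null;
}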