Improve mixed CJK/Latin linebreaking. #1986

bigfarts · 2022-08-30T02:52:26Z

This stops prioritizing kana above spaces: we now break at the first possible break location, rather than following an implicit priority order among break types.

Before:

After:

emilk · 2022-09-05T10:02:16Z

I have no knowledge of kana, but in English text I would say it is preferable to break at a space rather than at punctuation or dashes. Take for instance `Temperature: 3.2 Kelvin`. We do not want to break this as:

Temperature: 3.
2 Kelvin

So the current ordering is very deliberate when it comes to spaces, dashes and punctuation, and I don't want to break that.
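To make the ordering concrete, here is a minimal sketch. It assumes a hypothetical `Candidates` struct (not egui's actual type) that remembers the last position of each kind of break candidate within the text that fits on the row, and then prefers spaces over dashes over punctuation. The character classes in `scan` are simplified assumptions for illustration.

```rust
/// Hypothetical bookkeeping for break candidates; the field names follow
/// the snippets in this thread, everything else is an assumption.
#[derive(Default)]
struct Candidates {
    space: Option<usize>,
    dash: Option<usize>,
    punctuation: Option<usize>,
    any: Option<usize>,
}

impl Candidates {
    /// Record, per kind, the char index of the last candidate in `text`.
    fn scan(text: &str) -> Self {
        let mut c = Candidates::default();
        for (i, ch) in text.chars().enumerate() {
            match ch {
                ' ' => c.space = Some(i),
                '-' => c.dash = Some(i),
                '.' | ',' | ':' | ';' => c.punctuation = Some(i),
                _ => c.any = Some(i),
            }
        }
        c
    }

    /// Deliberate priority: space, then dash, then punctuation, then anything.
    fn best(&self) -> Option<usize> {
        self.space.or(self.dash).or(self.punctuation).or(self.any)
    }
}
```

If only `Temperature: 3.` fits on the row, `Candidates::scan("Temperature: 3.").best()` picks the space at index 12 rather than the period at index 14, so `3.2` stays together on the next row.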

If the problem is that spaces are prioritized over kana, let's just focus on that.

Perhaps something like:

let best = self.space.or(self.logogram).or(self.dash).or(self.punctuation);
let pos = match (best, self.kana) {
    (None, None) => None,
    (None, Some(pos)) => Some(pos),
    (Some(pos), None) => Some(pos),
    (Some(best), Some(kana)) => Some(best.max(kana)),
};
pos.or(self.any)

Or we special-case it based on whether or not there is kana:

if let Some(kana) = self.kana {
    // Whatever logic makes sense for kana
} else {
    self.space.or(self.logogram).or(self.dash).or(self.punctuation).or(self.any)
}
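For reference, the first suggestion above can be turned into a self-contained sketch. The `BreakCandidates` struct here is hypothetical; only the field names come from the snippets in this thread. The idea is that the usual priority order picks a candidate, and a kana candidate wins only if it lies later in the row.

```rust
/// Hypothetical holder for the latest break candidate of each kind.
#[derive(Default)]
struct BreakCandidates {
    space: Option<usize>,
    logogram: Option<usize>,
    kana: Option<usize>,
    dash: Option<usize>,
    punctuation: Option<usize>,
    any: Option<usize>,
}

impl BreakCandidates {
    fn get(&self) -> Option<usize> {
        // Existing priority order, with kana left out for now.
        let best = self.space.or(self.logogram).or(self.dash).or(self.punctuation);
        let pos = match (best, self.kana) {
            // Both exist: take whichever is further along the row.
            (Some(best), Some(kana)) => Some(best.max(kana)),
            // At most one exists: take it, if any.
            (best, kana) => best.or(kana),
        };
        // Fall back to breaking anywhere.
        pos.or(self.any)
    }
}
```

With `space: Some(3)` and `kana: Some(7)` this returns `Some(7)`; with `space: Some(9)` and `kana: Some(7)` it returns `Some(9)`.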

It would also be great if you added a test for this so we don't break it in the future!

bigfarts · 2022-09-06T17:58:59Z

This should be a better solution: breaking on a CJK character (kana or logogram; Hangul is not supported because I don't know much about it) is now prioritized at the same level as spaces, and breaking before a CJK character is prioritized at that level too. This handles cases like:

CJK break:

日本語と
Englishの混在
した文章

Pre-CJK break:

日本語とEnglish
の混在した文章

(Actually, the K part is a lie because I don't know much about Korean typesetting, but it should be easy to implement.)

This changes the break-on-space rule to break on space, CJK, or pre-CJK, e.g.:

aaaあああ
   ^ break is inserted here
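The rule described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `is_cjk` and `break_opportunities` are hypothetical names, and the Unicode ranges cover only the Hiragana, Katakana, and CJK Unified Ideographs blocks (no Hangul, no extension blocks, no rules about characters that may not start a line).

```rust
/// Assumed, simplified CJK test: Hiragana and Katakana (U+3040..=U+30FF)
/// plus CJK Unified Ideographs (U+4E00..=U+9FFF).
fn is_cjk(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{30FF}' | '\u{4E00}'..='\u{9FFF}')
}

/// Returns every char index `i` at which a new row may start, i.e. a break
/// may be inserted between `chars[i - 1]` and `chars[i]`.
fn break_opportunities(text: &str) -> Vec<usize> {
    let chars: Vec<char> = text.chars().collect();
    (1..chars.len())
        .filter(|&i| {
            let prev = chars[i - 1];
            let next = chars[i];
            prev == ' '          // break on space
                || is_cjk(prev)  // break on CJK
                || is_cjk(next)  // break before CJK (pre-CJK)
        })
        .collect()
}
```

For `aaaあああ` this yields `[3, 4, 5]`: index 3 is the pre-CJK break between `aaa` and `あああ`, and indices 4 and 5 are breaks after a CJK character.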

emilk · 2022-09-06T20:19:35Z

Great!

bigfarts changed the title ~~Treat all types of row breaking the same (except any).~~ Improve mixed CJK/Latin linebreaking. Sep 6, 2022

Improve mixed CJK/Latin linebreaking.

faf52e7


emilk merged commit 0e62c0e into emilk:master Sep 6, 2022
