Unicode whitespaces #306

Closed
truchi opened this issue Mar 22, 2021 · 8 comments

@truchi

truchi commented Mar 22, 2021

Hello!

I just want to share my findings with you. You may already know all of this... Google turned up this, which might interest you.

Unicode has a few whitespace characters that simply render as a space in my (GNOME) terminal (and the ideographic space renders as a double-width space). When break_words is in effect, these break the "no space at beginning of line" behavior that you have.

Interestingly, there is also the "zero width space" to allow breaks (easy to support, you have open issues about this), and the "zero width no-break space" to disallow breaks (less easy, I guess), in addition to the famous "no-break space" and the less famous "narrow no-break space", which do have width.

char::is_whitespace links to a database listing other whitespace characters and claims to report them as whitespace, but not really...
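To make the point above concrete, here is a small sketch that asks `char::is_whitespace` about the characters mentioned in this comment. It only uses the standard library; the `CANDIDATES` table and its labels are my own naming for illustration, not anything from textwrap.

```rust
// Characters mentioned in this comment, paired with descriptive labels.
// The table itself is illustrative and not part of any library.
const CANDIDATES: [(&str, char); 6] = [
    ("space", ' '),
    ("no-break space", '\u{00A0}'),
    ("narrow no-break space", '\u{202F}'),
    ("ideographic space", '\u{3000}'),
    ("zero width space", '\u{200B}'),
    ("zero width no-break space", '\u{FEFF}'),
];

fn main() {
    for &(name, ch) in CANDIDATES.iter() {
        // Print the code point, the label, and what the standard library reports.
        println!(
            "U+{:04X} {:<26} is_whitespace = {}",
            ch as u32,
            name,
            ch.is_whitespace()
        );
    }
}
```

As far as I can tell, the first four report `true` and the two zero width characters report `false`, because the latter do not carry the Unicode `White_Space` property that `char::is_whitespace` is based on.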

Good luck with that! :)

@mgeisler
Owner

Hi @truchi, thanks for the links!

I've been playing with integrating the unicode-linebreak crate, which should take care of all this. It even does fun things such as preventing line breaks in a text like "Bonjour !" since the ! suppresses a break in the preceding whitespace.

It's not completely done yet since there is some weirdness around soft hyphens: the crate finds break points at them, but if a break point is not used, the soft hyphen should be removed. We don't currently support this with the Word fragment.
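The soft hyphen behavior described above can be sketched roughly as follows. This is a standalone illustration using only the standard library; `resolve_soft_hyphens` is a hypothetical helper, not a textwrap or unicode-linebreak function, and it assumes the caller already knows which soft hyphen (if any) was chosen as the break.

```rust
/// Hypothetical helper, not part of textwrap: split `word` at its
/// `used`-th soft hyphen (0-based), or keep it whole if `used` is `None`.
fn resolve_soft_hyphens(word: &str, used: Option<usize>) -> Vec<String> {
    const SHY: char = '\u{00AD}'; // SOFT HYPHEN
    let mut lines = vec![String::new()];
    let mut seen = 0;
    for ch in word.chars() {
        if ch == SHY {
            if Some(seen) == used {
                // The break is taken: show a visible hyphen and start a new line.
                lines.last_mut().unwrap().push('-');
                lines.push(String::new());
            }
            // Soft hyphens whose break is not taken are dropped entirely.
            seen += 1;
        } else {
            lines.last_mut().unwrap().push(ch);
        }
    }
    lines
}

fn main() {
    let word = "hy\u{00AD}phen\u{00AD}ation";
    println!("{:?}", resolve_soft_hyphens(word, None));
    println!("{:?}", resolve_soft_hyphens(word, Some(1)));
}
```

The point of the sketch is only that an unused soft hyphen has to vanish from the output, which is what a fragment type would need to express.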

@truchi
Author

truchi commented Mar 27, 2021

Good!

I just want to point out that it will not solve the issue of NBSPs at the start of a line when breaking words.
Then again, when breaking words you don't really need to find break opportunities anyway.

(Are you French Swiss?)
Keep up the good work!

@mgeisler
Owner

> I just want to point out that it will not solve the issue of NBSPs at start of line when breaking words.

What kind of issue do you mean? Currently, an NBSP character is not treated in a special way, so it's effectively treated like a _ or any other letter. This should ensure that there are no extra line break opportunities at an NBSP.

However, you're right that a line might be broken at an NBSP if you enable the break_words option and the word is too wide to fit on a line. Put differently: when break_words is in effect, words are simply chopped into pieces in a simple and brutal fashion, as a last-ditch attempt at making them fit.

> Are you French Swiss?

No, I'm actually from Denmark, but I moved to Switzerland about 10 years ago 😄

@truchi
Author

truchi commented Mar 28, 2021

The following:

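use textwrap::{wrap, Options};
fn main() {
    // U+00A0 is a no-break space, so "Hello\u{00A0}world" is one over-wide word.
    let lines = wrap("Hello\u{00A0}world", Options::new(5).break_words(true));
    for line in lines {
        println!("{}", line);
    }
}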

outputs:

Hello
 worl
d
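One way to see where the leading space on the second line comes from is a toy model of the chopping step. The sketch below is not textwrap's implementation; `chop_word` is a hypothetical function that assumes every character is one column wide and that `width` is greater than zero.

```rust
/// Hypothetical stand-in for the break_words fallback; not textwrap's code.
/// Cuts `word` into pieces of at most `width` characters, treating every
/// character (including U+00A0) the same. Panics if `width` is 0.
fn chop_word(word: &str, width: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    chars
        .chunks(width)
        .map(|piece| piece.iter().collect())
        .collect()
}

fn main() {
    // The no-break space is not a break opportunity, so the whole
    // string is a single 11-character word that gets chopped.
    for piece in chop_word("Hello\u{00A0}world", 5) {
        println!("{:?}", piece);
    }
}
```

Under these assumptions the pieces are "Hello", then the no-break space followed by "worl", then "d", which matches the output above.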

@mgeisler
Owner

mgeisler commented Apr 5, 2021

Hey @truchi, ah yeah, that's a good example!

I guess you would expect the breaks to become

Hello
world

so that the no-break space disappears because it happens to fall at the end (or beginning) of the line?
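For illustration, the behavior asked about above could look something like this. It is a sketch only: `chop_and_trim` is a hypothetical function, not something textwrap provides, and it assumes one column per character and relies on `char::is_whitespace` treating U+00A0 as whitespace.

```rust
/// Hypothetical variant, not textwrap's behavior: chop `word` into pieces of
/// at most `width` characters, dropping whitespace (including U+00A0) that
/// would land at the start or end of a piece.
fn chop_and_trim(word: &str, width: usize) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut current = String::new();
    let mut len = 0;
    for ch in word.chars() {
        if len == 0 && ch.is_whitespace() {
            continue; // skip whitespace that would start a piece
        }
        current.push(ch);
        len += 1;
        if len == width {
            // The piece is full: drop trailing whitespace and start a new one.
            pieces.push(current.trim_end().to_string());
            current.clear();
            len = 0;
        }
    }
    if !current.is_empty() {
        pieces.push(current.trim_end().to_string());
    }
    pieces
}

fn main() {
    for piece in chop_and_trim("Hello\u{00A0}world", 5) {
        println!("{}", piece);
    }
}
```

Whether dropping a no-break space like this is actually desirable is the open question in this thread; the sketch only shows that the expected output is reachable.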

@mgeisler
Owner

Hi again! :)

> I've been playing with integrating the unicode-linebreak crate,

> I just want to point out that it will not solve the issue of NBSPs at start of line when breaking words.

I believe you're pointing out that a non-breaking space remains non-breaking when using the Unicode line breaking algorithm?

Indeed, you're completely correct. I implemented support for the Unicode line breaking algorithm in #313, and testing on https://mgeisler.github.io/textwrap/ shows that it makes no difference which word separator I select.

However, is this not working as intended?

@truchi
Author

truchi commented Feb 28, 2022

Hello !

The project I was working on with textwrap is totally dead now, so I'm unsure how to test that...

Glad to see your lib being worked on, you have good motivation!

@mgeisler
Owner

> Glad to see your lib being worked on, you have good motivation!

Thanks!

Let me close this issue now since I hope the Unicode line breaking algorithm from #313 fixes this.
