Reformulate wrapping in terms of words with whitespace and penalties #221

mgeisler · 2020-11-08T18:25:15Z

This is a complete rewrite of the core word wrapping functionality. The user-visible change is that `wrap` now returns a `Vec<Cow<'_, str>>` instead of `impl Iterator<Item = Cow<'_, str>>`. In other words, you now get all lines returned to you at once instead of getting an iterator back. Code that simply iterated over the old return value only needs to add `.iter()`; code that already collected the lines into a vector can now drop that step. I looked around GitHub for code that uses textwrap, and most code simply calls `fill` or `wrap`. An example is the clap crate.
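To make the migration concrete, here is a minimal sketch of how calling code changes. The `wrap` function below is a hypothetical stand-in that only mimics the new return type (it counts bytes and collapses whitespace); it is not textwrap's implementation. The helper names `line_lengths` and `collected_lines` are likewise made up for illustration.

```rust
use std::borrow::Cow;

// Hypothetical stand-in with the new return type. It is NOT textwrap's
// implementation: it measures width in bytes and joins words with one space.
fn wrap(text: &str, width: usize) -> Vec<Cow<'_, str>> {
    let mut lines: Vec<Cow<'_, str>> = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        // Start a new line if adding " word" would exceed the width.
        if !current.is_empty() && current.len() + 1 + word.len() > width {
            lines.push(Cow::Owned(std::mem::take(&mut current)));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        lines.push(Cow::Owned(current));
    }
    lines
}

// Code that chained iterator adaptors onto the old return value now
// adds `.iter()` (or `.into_iter()`) first.
fn line_lengths(text: &str, width: usize) -> Vec<usize> {
    wrap(text, width).iter().map(|line| line.len()).collect()
}

// Code that used to call `.collect::<Vec<_>>()` can use the result as is.
fn collected_lines(text: &str, width: usize) -> Vec<Cow<'_, str>> {
    wrap(text, width)
}
```

A plain `for line in wrap(text, width)` loop keeps compiling unchanged, since `Vec` implements `IntoIterator`.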

New algorithm: Before, we would step through the input string and (attempt to) keep track of all aspects of the state. This didn't always work (see at least #122, #158, and #193) and it was inflexible.

This commit replaces the old algorithm with a new one which works on a more abstract level. We now

1. Split the input string into "words". A word is a substring of the original string, including any trailing whitespace.
2. Split each word according to the `WordSplitter`.
3. Optionally, if `break_words` is true, further split each word so that it is no longer than the line width.
4. Put the words into lines based on the display width.
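The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: it skips the `WordSplitter` step, treats only `' '` as whitespace, and approximates display width by counting `char`s. The names `Word`, `find_words`, `break_words`, and `wrap_greedy` are chosen for this sketch.

```rust
/// A word plus its trailing whitespace, both borrowed from the input.
#[derive(Debug, PartialEq)]
struct Word<'a> {
    text: &'a str,
    whitespace: &'a str,
}

fn make_word(chunk: &str) -> Word<'_> {
    let text = chunk.trim_end_matches(' ');
    Word { text, whitespace: &chunk[text.len()..] }
}

/// Split at every transition from whitespace back to non-whitespace,
/// so each word keeps its trailing spaces.
fn find_words(line: &str) -> Vec<Word<'_>> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut in_whitespace = false;
    for (idx, ch) in line.char_indices() {
        if in_whitespace && ch != ' ' {
            words.push(make_word(&line[start..idx]));
            start = idx;
        }
        in_whitespace = ch == ' ';
    }
    if start < line.len() {
        words.push(make_word(&line[start..]));
    }
    words
}

/// Optional step: break words longer than `width` into smaller pieces.
fn break_words<'a>(words: Vec<Word<'a>>, width: usize) -> Vec<Word<'a>> {
    let width = width.max(1); // a zero width would never make progress
    let mut out = Vec::new();
    for word in words {
        let mut rest = word.text;
        while rest.chars().count() > width {
            // Byte index of the char at position `width`.
            let split = rest.char_indices().nth(width).map(|(i, _)| i).unwrap();
            out.push(Word { text: &rest[..split], whitespace: "" });
            rest = &rest[split..];
        }
        out.push(Word { text: rest, whitespace: word.whitespace });
    }
    out
}

/// Greedy first-fit: put words on the current line until the next one
/// would not fit. Whitespace at the end of a line is dropped.
fn wrap_greedy(words: &[Word<'_>], width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut line = String::new();
    let mut pending = ""; // trailing whitespace of the previous word
    for word in words {
        let needed = pending.chars().count() + word.text.chars().count();
        if !line.is_empty() && line.chars().count() + needed > width {
            lines.push(std::mem::take(&mut line));
            pending = "";
        }
        line.push_str(pending);
        line.push_str(word.text);
        pending = word.whitespace;
    }
    if !line.is_empty() {
        lines.push(line);
    }
    lines
}
```

Keeping the trailing whitespace attached to each word is what lets the last step preserve the original spacing inside a line while discarding it at line ends.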

This is slower than the previous algorithm. The `fill/1600` benchmark shows that it now takes ~19 microseconds to wrap a 1600-character string (about 20 lines of terminal text). That is ~8 microseconds longer than before. I think this is still plenty fast, and the new structure makes it easier to reason about the logic.

This is a step towards #126: the wrap_fragments function could now in principle be used to wrap any kind of opaque "box", and this box could carry formatting information as needed. We can work on abstracting more functionality going forward, probably by making the Fragment trait more powerful, e.g., by moving the break_apart method from Word to Fragment.
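As a rough sketch of what wrapping opaque boxes could look like: the trait and function below borrow the names `Fragment` and `wrap_fragments` from this PR, but the method names, signatures, and the `StyledBox` type are assumptions made for illustration and may not match the actual definitions.

```rust
/// Assumed shape of a wrappable fragment (illustrative only).
trait Fragment {
    /// Width of the fragment's own content.
    fn width(&self) -> usize;
    /// Width of the whitespace that follows it when it is not last on a line.
    fn whitespace_width(&self) -> usize;
    /// Width of extra content (such as a hyphen) added if the line ends here.
    fn penalty_width(&self) -> usize;
}

/// Greedy wrapping of arbitrary fragments into lines of at most `line_width`.
fn wrap_fragments<T: Fragment>(fragments: Vec<T>, line_width: usize) -> Vec<Vec<T>> {
    let mut lines = Vec::new();
    let mut line = Vec::new();
    let mut width = 0; // content plus whitespace of fragments already on the line
    for fragment in fragments {
        // The fragment must fit even if the line ends right after it,
        // so its penalty is counted when checking the limit.
        if !line.is_empty() && width + fragment.width() + fragment.penalty_width() > line_width {
            lines.push(std::mem::take(&mut line));
            width = 0;
        }
        width += fragment.width() + fragment.whitespace_width();
        line.push(fragment);
    }
    if !line.is_empty() {
        lines.push(line);
    }
    lines
}

/// A hypothetical box that carries formatting information instead of text.
#[derive(Debug, PartialEq)]
struct StyledBox {
    width: usize,
    bold: bool,
}

impl Fragment for StyledBox {
    fn width(&self) -> usize {
        self.width
    }
    fn whitespace_width(&self) -> usize {
        1
    }
    fn penalty_width(&self) -> usize {
        0
    }
}
```

Because the wrapping function only asks each fragment for widths, the formatting data rides along untouched and is still available when the lines are rendered.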

This is a complete rewrite of the core word wrapping functionality. Before, we would step through the input string and (attempt to) keep track of all aspects of the state. This didn't always work (see at least #122, #158, and #193) and it was inflexible.

This commit replaces the old algorithm with a new one which works on a more abstract level. We now

1. Split the input string into "words". A word is a substring of the original string, including any trailing whitespace.
2. Split each word according to the `WordSplitter`.
3. Put the words into lines based on the display width.

This is slower than the previous algorithm. The `fill/1600` benchmark shows that it now takes ~18 microseconds to wrap a 1600-character string. That is around 8 microseconds longer than before.

mgeisler · 2020-11-08T22:15:01Z

Ah, I should probably explain the PR title a little... the model here is vaguely inspired by the concepts of boxes, glue, and penalties in TeX. The terms were introduced in the very readable article Breaking Paragraphs into Lines from 1981 by Donald E. Knuth and Michael F. Plass. In short, a box is an opaque rectangle on the page, glue is the stretchable whitespace between boxes, and penalties are the extra content inserted at line breaks (such as hyphens). The article describes a line breaking algorithm which justifies text while minimizing the stretching of individual lines.

I first wanted to reuse the terminology from the article, but the word box is a reserved keyword in Rust and already has a meaning of "heap allocation". I could have used the word glue to refer to the whitespace between words, but since we don't (yet) support justified text, our glue would be rather inflexible. Lastly, the greedy algorithm implemented in textwrap does not try to minimize anything except the total number of lines, so reusing the terminology from the article would have been misleading.

mgeisler force-pushed the fragments-and-words branch from 5e6ca7a to 1fc0ba3 Compare November 8, 2020 21:57

mgeisler merged commit 52c39c3 into master Nov 8, 2020

mgeisler mentioned this pull request Nov 30, 2020

Separate soft line break finding and wrapping #230

Closed

mgeisler changed the title ~~Reformulate wrapping in terms of boxes, glue, and penalties~~ Reformulate wrapping in terms of words with whitespace and penalties Dec 5, 2020

mgeisler added a commit that referenced this pull request Dec 9, 2020

Correctly compute width while skipping over ANSI escape sequences

abea327

This was broken by the rewrite in #221 and we only had coverage for a single case of wrapping colored text. Fixes #248.

This was referenced Dec 9, 2020

Correctly compute width while skipping over ANSI escape sequences #249

Merged

Modularize library #244

Closed

mgeisler mentioned this pull request Dec 20, 2020

Implement wrapping functions as iterators #257

Closed

mgeisler deleted the fragments-and-words branch January 30, 2021 16:48
