Draft: Do line breaking in WeasyPrint using Harfbuzz data #1840
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a very preliminary attempt at doing line breaking in WeasyPrint rather than using Pango to do line breaking. It can do very rudimentary HTML conversion, but is far from complete.
In particular, it uses Pango to itemize content (handle bidi levels, identify the relevant fonts, etc.) and identify some character properties, and then uses Harfbuzz to handle shaping the characters into glyphs. Once we've got both, we can iterate over each of them to identify where to break lines. I've tried to be moderately efficient about only itemizing and shaping once per block, so we save a ton of time previously spent in Pango when doing split_first_line calls. This necessitated a bunch of FFI additions, but because Pango is already a dependency and it requires Harfbuzz, it shouldn't actually add any dependencies to WeasyPrint in practice. (I did briefly consider getting rid of Pango entirely, but that would require significantly more effort: Pango provides a lot of value even with just its itemization and font-finding functionality.)
Many, many TODOs have been added throughout the code, and they barely even touch some aspects that will still need to be adjusted (in particular, an efficient way to find the natural size or minimum size of an inline box isn't something I've approached), or could be added (like proper bidi support). Resolving most of the added TODOs would probably be necessary to bring this into a state where it doesn't cause regressions if merged.
In particular, this code really only works with assumptions of "default" white-space, hyphenation, and wrapping settings, and doesn't support text decoration. (Not to mention, for example, hardcoding an inline_block_width for now, and the current apply_inline_box_sizes being extremely problematic, likely because I didn't understand the existing box model enough.) There's also a fairly massive loop in split_inline_box that could hopefully be cleaned up, possibly into multiple functions (at least for some parts?).
On the plus side, some very quick benchmarks with the WeasyPerf
jsonraw
benchmark show that it runs in about 43% of the time that 56.1 does with approximately the same memory usage, so there do seem to be some benefits to the code. I haven't done any profiling on it, and there are probably some relatively easy optimizations lurking around as well. An important caveat is thatjsonraw
is probably the best-case scenario for this change, and that fixing all of the open TODOs will probably slow it down a bit.I wrote the vast majority of this code in 2022, and kept meaning to get around to fixing some of the more glaring issues but struggled to get the necessary motivation. I've cleaned it up enough to get it running satisfactorily with a fairly small test file I had been using just to make sure it can break lines and handle inline boxes, but it will need a ton of work to get it up to the point where it passes tests and can be merged properly. I'm filing this PR mostly as a way to mention that the code is here, but not with an expectation that the WeasyPrint team will necessarily take on that work. (If you're interested in working on it, I may also find some motivation to work on various parts, but if you're not, then it was a fun tech demo. I also won't be offended by a massive refactor - it's not in a great state right now, I was just trying to make things work with a small test file.)
You will also likely notice that it has been branched from 56.1, and I have not attempted a rebase. From occasional glances at the changelog it should be relatively feasible to pull in most updates since then without a lot of trouble, though things touching fonts in particular may need some tweaking.