Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Do line breaking in WeasyPrint using Harfbuzz data #1840

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

aschmitz
Copy link
Contributor

This is a very preliminary attempt at doing line breaking in WeasyPrint rather than using Pango to do line breaking. It can do very rudimentary HTML conversion, but is far from complete.

In particular, it uses Pango to itemize content (handle bidi levels, identify the relevant fonts, etc.) and identify some character properties, and then uses Harfbuzz to handle shaping the characters into glyphs. Once we've got both, we can iterate over each of them to identify where to break lines. I've tried to be moderately efficient about only itemizing and shaping once per block, so we save a ton of time previously spent in Pango when doing split_first_line calls. This necessitated a bunch of FFI additions, but because Pango is already a dependency and it requires Harfbuzz, it shouldn't actually add any dependencies to WeasyPrint in practice. (I did briefly consider getting rid of Pango entirely, but that would require significantly more effort: Pango provides a lot of value even with just its itemization and font-finding functionality.)

Many, many TODOs have been added throughout the code, and they barely even touch some aspects that will still need to be adjusted (in particular, an efficient way to find the natural size or minimum size of an inline box isn't something I've approached), or could be added (like proper bidi support). Resolving most of the added TODOs would probably be necessary to bring this into a state where it doesn't cause regressions if merged.

In particular, this code really only works with assumptions of "default" white-space, hyphenation, and wrapping settings, and doesn't support text decoration. (Not to mention, for example, hardcoding an inline_block_width for now, and the current apply_inline_box_sizes being extremely problematic, likely because I didn't understand the existing box model enough.) There's also a fairly massive loop in split_inline_box that could hopefully be cleaned up, possibly into multiple functions (at least for some parts?).

On the plus side, some very quick benchmarks with the WeasyPerf jsonraw benchmark show that it runs in about 43% of the time that 56.1 does with approximately the same memory usage, so there do seem to be some benefits to the code. I haven't done any profiling on it, and there are probably some relatively easy optimizations lurking around as well. An important caveat is that jsonraw is probably the best-case scenario for this change, and that fixing all of the open TODOs will probably slow it down a bit.

I wrote the vast majority of this code in 2022, and kept meaning to get around to fixing some of the more glaring issues but struggled to get the necessary motivation. I've cleaned it up enough to get it running satisfactorily with a fairly small test file I had been using just to make sure it can break lines and handle inline boxes, but it will need a ton of work to get it up to the point where it passes tests and can be merged properly. I'm filing this PR mostly as a way to mention that the code is here, but not with an expectation that the WeasyPrint team will necessarily take on that work. (If you're interested in working on it, I may also find some motivation to work on various parts, but if you're not, then it was a fun tech demo. I also won't be offended by a massive refactor - it's not in a great state right now, I was just trying to make things work with a small test file.)

You will also likely notice that it has been branched from 56.1, and I have not attempted a rebase. From occasional glances at the changelog it should be relatively feasible to pull in most updates since then without a lot of trouble, though things touching fonts in particular may need some tweaking.

@liZe liZe marked this pull request as draft October 17, 2023 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant