[Feature Request] Differentiate actual paragraphs and LUTE's page splitting #475

Open
lef-est opened this issue Aug 30, 2024 · 4 comments

@lef-est

lef-est commented Aug 30, 2024

Is your feature request related to a problem? Please describe.

Where to break a paragraph is a stylistic choice by the author and may aid in storytelling. This information is lost because LUTE splits paragraphs to meet the token count per page.

Describe the solution you'd like

Indent original paragraphs, while leaving the "new paragraphs" that result from LUTE's splitting formatted as they are now.
If possible, I'd like LUTE to not remove empty lines (or all whitespace) for the same reason. Many authors use 2 or 3 empty lines for sub-chapter breaks. Some even distinguish between 2 lines and 3 lines.
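
One way to keep the empty-line information is to record how many blank lines precede each paragraph at import time. This is only a sketch: the function name `paragraphs_with_gaps` is hypothetical, it is not Lute's actual parsing code, and it assumes each non-empty line of the source text is one paragraph.

```python
def paragraphs_with_gaps(text):
    """Return (blank_lines_before, paragraph) pairs.

    Assumption: every non-empty line is one paragraph. The count of
    blank lines before it is kept so a 2-line break can later be
    rendered differently from a 3-line break.
    """
    result = []
    gap = 0  # blank lines seen since the last paragraph
    for line in text.splitlines():
        if line.strip() == "":
            gap += 1
        else:
            result.append((gap, line.strip()))
            gap = 0
    return result
```

For example, a paragraph that follows two empty lines comes back with a gap of 2, which a renderer could treat as a sub-chapter break.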

Describe alternatives you've considered

During book creation, give users the option of "don't split paragraphs" (unchecked by default). Not the best solution performance-wise and probably takes more work.

Additional context

With LUTE we already lose a lot of information such as bold, italic, underline, images, tables. I understand the difficulty of incorporating them, which entails fundamental changes. But paragraph breaks and empty lines are probably the easiest ones to handle and deserve a chance.

@lef-est lef-est changed the title Differentiate actual paragraphs and LUTE's page splitting [Feature Request] Differentiate actual paragraphs and LUTE's page splitting Aug 30, 2024
@jzohrab
Collaborator

jzohrab commented Aug 30, 2024

Lute splits by sentence to keep the token count reasonable: some paragraphs can get really long, or the formatting can get weird when people copy/paste text from different places. It's hard to say what the best solution is here. I still feel that splitting paragraphs up is necessary, and I don't have a great answer for this ... unless I do something like first try to group by paragraphs, then check the page size, and only split the paragraph if the page is, say, 50% longer than the maximum size. It's a bit convoluted and tricky, but maybe that would suffice. Does that seem reasonable?

> If possible, I'd like LUTE to not remove empty lines (or all whitespaces) for the same reason.

Yes this sounds reasonable and I'm not sure why I did this in the first place. :-)

> bold, italic, underline, images, tables.

Yeah that's so tough! Would be nice though, I agree. There's an issue to allow for markdown on import of text files (though tables still wouldn't be possible, b/c it's bananas)

Let me know re the paragraph grouping and page size threshold idea ... it could be tricky-ish.
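
A minimal sketch of the two-step idea described in this comment: group whole paragraphs first, and only fall back to sentence splitting when keeping a paragraph whole would push the page past 50% over the maximum. Everything here is an assumption for illustration: the names `count_tokens`, `split_sentences`, and `paginate_two_step` are hypothetical, whitespace-separated words stand in for Lute's tokens, and the regex sentence splitter is deliberately naive. It is not Lute's implementation.

```python
import re


def count_tokens(text):
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def split_sentences(paragraph):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def paginate_two_step(paragraphs, max_tokens, overflow=1.5):
    """Group paragraphs into pages; split a paragraph by sentence only
    when adding it whole would exceed max_tokens * overflow."""
    hard_limit = max_tokens * overflow
    pages, current, size = [], [], 0

    def flush():
        nonlocal current, size
        if current:
            pages.append(current)
        current, size = [], 0

    for para in paragraphs:
        n = count_tokens(para)
        if size + n <= hard_limit:
            # Step 1: the whole paragraph fits within the tolerated overflow.
            current.append(para)
            size += n
            if size >= max_tokens:
                flush()
            continue
        # Step 2: too large, so distribute its sentences across pages.
        chunk = []  # sentences of this paragraph destined for the current page
        for sentence in split_sentences(para):
            s = count_tokens(sentence)
            if size + s > max_tokens and (current or chunk):
                if chunk:
                    current.append(" ".join(chunk))
                    chunk = []
                flush()
            chunk.append(sentence)
            size += s
        if chunk:
            current.append(" ".join(chunk))
        if size >= max_tokens:
            flush()
    flush()
    return pages
```

With `max_tokens=4`, two three-word paragraphs end up together on one six-token page (within the 50% allowance), while an eight-word paragraph of four short sentences is split into two pages of two sentences each.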

@LangNerd23

Although the question was not addressed to me: I personally like this idea of a two-step process (first grouping paragraphs, and only splitting further if a certain threshold is exceeded).

At the moment I still have quite a lot of work pre-processing books, often setting manual separators (---) to avoid parts of the problems mentioned above.

@jzohrab
Collaborator

jzohrab commented Oct 23, 2024

I've had annoying paragraph splits as well, in places that didn't make sense and complicated the reading of a tough book.

@jzohrab
Collaborator

jzohrab commented Oct 29, 2024

I've been thinking about this one on and off for a few days. I don't have a perfect solution for it yet, but this is probably good enough: I'm considering grouping based on paragraphs, and just not bothering to split paragraphs at all, even if the paragraph is huge. For the most part, that should be fine ... if someone is reading James Joyce or Faulkner with Lute, they're probably outside of the target user base anyway. And such users could just add a new page and split a long paragraph manually if they wanted to.

With this change, the "max words" thing would become a "threshold", and paragraphs would be added to the current page until the threshold is crossed, at which point a new page would be started. The threshold could be exceeded by any number of tokens -- but it's probably good enough.
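
The threshold behaviour described here could look roughly like the sketch below. The function name `paginate_by_threshold` is hypothetical, whitespace-separated words stand in for Lute's tokens, and "crossed" is assumed to mean "reached or exceeded"; none of this is taken from Lute's code.

```python
def paginate_by_threshold(paragraphs, threshold):
    """Add whole paragraphs to the current page until the token count
    reaches the threshold, then start a new page. Paragraphs are never
    split, so a page may exceed the threshold by any amount."""
    pages, current, size = [], [], 0
    for para in paragraphs:
        current.append(para)
        size += len(para.split())  # assumption: words approximate tokens
        if size >= threshold:
            pages.append(current)
            current, size = [], 0
    if current:
        # Remaining paragraphs that never reached the threshold.
        pages.append(current)
    return pages
```

With a threshold of 4, the paragraphs `"a b"`, `"c d e"`, `"f"`, and `"g h i j k l"` produce two pages: the first holds 5 tokens and the second 7, which shows how a long final paragraph overshoots the threshold without being split.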
