Convert to markdown function to include reading order as an option (so it can also be disabled) #133

Pmmoks · 2024-09-11T19:41:42Z

Hi, this is an amazing package, and it's really good for processing PDF files. Massive thanks to the team! One BIG wish from me, though, is to have the natural reading order as an option. I have many PDFs where the natural reading order actually makes the output worse (compared with using get_text() from pymupdf); it seems the automatic column identification isn't working for my documents. Would it be possible to add this as an option to the to_markdown() function?

JorjMcKie · 2024-09-12T21:10:33Z

Maintainer

Sorry, this is a no-can-do. Detecting and establishing the natural reading order is an integral part of the package, so it cannot be disabled.

If it doesn't work in your case, it would instead be worthwhile to find and fix the problem.

5 replies

Pmmoks Sep 13, 2024
Author

Thanks for the quick reply! Is there any way to supply some kind of column information e.g. if we know it has 2 columns? Or to manually split the pages ahead of calling this function?

I saw an old post you had made where you recommend this approach for the PyMuPDF TextPage.extractBLOCKS function.

Stackoverflow link to your old answer
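
For readers following along, here is a minimal sketch of what manual column handling on top of block extraction could look like. It is not the code from the linked answer, and it is not part of pymupdf4llm. It assumes blocks shaped like the tuples returned by PyMuPDF's page.get_text("blocks"), i.e. (x0, y0, x1, y1, text, ...), and a plain two-column page. The function name two_column_order and the split_x parameter are made up for illustration.

```python
def two_column_order(blocks, page_width, split_x=None):
    """Return blocks ordered left column first, then right column.

    blocks: iterable of tuples whose first four items are x0, y0, x1, y1
            (the shape used by PyMuPDF's page.get_text("blocks")).
    page_width: page width in points, used for the default split position.
    split_x: optional x-coordinate of the column boundary (hypothetical
             parameter; defaults to the middle of the page).
    """
    if split_x is None:
        split_x = page_width / 2

    left, right = [], []
    for block in blocks:
        x0, _, x1, _ = block[:4]
        # Assign each block to a column by its horizontal center.
        center = (x0 + x1) / 2
        (left if center < split_x else right).append(block)

    # Inside a column, read top to bottom, then left to right.
    def reading_key(block):
        return (block[1], block[0])

    return sorted(left, key=reading_key) + sorted(right, key=reading_key)
```

Full-width blocks such as page headers are simply assigned to whichever side their center falls on, so a real document would probably need extra handling for them.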

JorjMcKie Sep 13, 2024
Maintainer

Really old 😎!

If you know things about your PDF, then there are ways to pour in that information. For example, you can add a middle line to the pages (in a temporary way) to help the column recognition process.
Maybe you want to share a PDF with a handful of pages, so I can try a few things and demo an approach.
Always welcome!
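
A rough sketch of how the middle-line idea might be tried, under some assumptions: the pages are plain two-column layouts, the column gap sits at the horizontal center, and the document is never saved, which keeps the line temporary. The helper names middle_line and add_middle_lines are hypothetical. page.rect and page.draw_line are PyMuPDF calls; whether the drawn line actually changes column detection depends on the layout code in the installed pymupdf4llm version.

```python
def middle_line(rect):
    """Return the two end points of a vertical line through the rect's center.

    rect: (x0, y0, x1, y1) in points.
    """
    x0, y0, x1, y1 = rect
    mid_x = (x0 + x1) / 2
    return (mid_x, y0), (mid_x, y1)


def add_middle_lines(doc, width=0.5):
    """Draw a thin vertical line down the middle of every page of doc.

    doc is expected to behave like an open PyMuPDF Document. The change only
    lives in memory unless the document is saved afterwards.
    Returns the number of pages that were modified.
    """
    count = 0
    for page in doc:
        r = page.rect
        p1, p2 = middle_line((r.x0, r.y0, r.x1, r.y1))
        page.draw_line(p1, p2, color=(0, 0, 0), width=width)
        count += 1
    return count


# Possible usage (requires pymupdf and pymupdf4llm; "input.pdf" is a placeholder):
#   import pymupdf, pymupdf4llm
#   doc = pymupdf.open("input.pdf")
#   add_middle_lines(doc)                  # not saved, so only temporary
#   md_text = pymupdf4llm.to_markdown(doc)
```

If the gap between the columns is not centered, the x-position would have to be supplied per page instead of being computed from the page rectangle.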

Pmmoks Sep 14, 2024
Author

Thanks for being so helpful! Here are a few pages from a pdf with 2 columns. I can already fairly accurately determine whether it’s a single page or double page spread, but I’m struggling more with getting the .to_markdown function to recognise that.

pillar-3-disclosure-2022-part-2.pdf

JorjMcKie Sep 22, 2024
Maintainer

BTW, with the newest version, 0.0.17, at least most of the pages are processed correctly.

Pmmoks Oct 4, 2024
Author

Thanks @JorjMcKie! I’ve tested it and it does indeed work better in terms of reading order, although some form of manual control to split pages would be awesome for the remaining pages.

If you have any more suggestions I’d love to try them!
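
Until something like that exists in the package, one way to split pages by hand could be to cut every double-page spread into a left and a right half before conversion. The sketch below assumes unrotated pages and a split exactly at the horizontal center. The names halves and split_spreads are hypothetical; new_page and show_pdf_page are PyMuPDF methods, and the import name pymupdf assumes a recent PyMuPDF release (older ones use fitz).

```python
def halves(rect):
    """Split (x0, y0, x1, y1) into its left and right half rectangles."""
    x0, y0, x1, y1 = rect
    mid_x = (x0 + x1) / 2
    return [(x0, y0, mid_x, y1), (mid_x, y0, x1, y1)]


def split_spreads(src_path):
    """Return a new in-memory PDF in which each source page becomes two pages.

    Requires PyMuPDF; the import is local so that halves() stays usable
    without it.
    """
    import pymupdf

    src = pymupdf.open(src_path)
    out = pymupdf.open()  # new, empty PDF
    for page in src:
        r = page.rect
        for clip in halves((r.x0, r.y0, r.x1, r.y1)):
            clip_rect = pymupdf.Rect(clip)
            new_page = out.new_page(width=clip_rect.width, height=clip_rect.height)
            # Copy only the clipped half of the source page onto the new page.
            new_page.show_pdf_page(new_page.rect, src, page.number, clip=clip_rect)
    return out
```

The returned document could then be handed to to_markdown() in place of the original file. Applying it only to the pages already identified as spreads would need a page filter on top of this.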
