Is there a way to reduce the size of output pdf files? #42

jvmachados · 2022-07-10T00:09:10Z

I tested this OCR tool on some PDFs I downloaded from Academia.edu and the results were great. However, there's a problem: it increased the file size by A LOT (e.g. an 11.8 MB file turned into a 107 MB PDF).

I was hoping to use this tool to create searchable and conveniently highlightable PDFs using scans from physical books I have, but scanned files are normally huge on their own. When I ran zotero-ocr on one of my scans (257 MB) I ended up with a file that's over 2GB in size (it won't even open). :(

Is there something I can do to decrease the file sizes?

(I use Zotero 6.0.9 on Windows and have installed the latest version of zotero-ocr)

mattiaTagliente · 2022-11-02T16:02:04Z

I second this request.

gazzar · 2023-06-03T11:03:04Z

pdftoppm is producing PNGs. If it could be switched over to JPEG, perhaps as an option, that might shrink the PDF.

haverholm · 2023-09-27T15:27:42Z

Having just tested the plugin for the first time, I really feel this needs to be prioritised. Here are my file sizes for comparison:

Original file size: 59.1 MiB
OCR'ed file size: 438.7 MiB
-- cp. combined size of the generated image files: 321.3 MiB

It seems that not only the file format but also the resolution and possibly the colour space of the image files could use some tweaking? I doubt that most scanned or otherwise rasterised PDFs come as high as 300 dpi, so exporting PNGs at that resolution will definitely increase file size. Assuming that these files are only used to generate OCR information -- i.e., colour elements from the original PDF will remain intact in the OCR'ed file -- a compromise could be to export the page images as greyscale, which could shrink the file size by about half and might also reduce image noise.
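To put rough numbers on the resolution and colour-space point, here is a minimal sketch that computes the uncompressed size of one rasterised page. The US Letter page size and the helper name `raster_size` are assumptions chosen for illustration; PNG and JPEG compression will change the absolute figures, so treat this only as a guide to how the size scales.

```python
def raster_size(width_in, height_in, dpi, channels):
    """Return (width_px, height_px, raw_bytes) for an 8-bit-per-channel page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * channels


# Assumed example page: US Letter, 8.5 x 11 inches.
for dpi in (300, 150):
    for name, channels in (("RGB", 3), ("greyscale", 1)):
        w, h, n = raster_size(8.5, 11, dpi, channels)
        print(f"{dpi} dpi {name}: {w}x{h} px, {n / 1e6:.1f} MB uncompressed")
```

Before compression, the pixel count grows with the square of the resolution, so halving the dpi quarters the raw data, and dropping from three colour channels to one divides it by three.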

Exporting pages as JPGs can also contribute to smaller file sizes. If I save the same greyscale image as PNG and JPG (90% quality), the latter is only a third of the PNG file size. But lowering JPG quality might also impact the readability of the text. Issue #23 suggests making image resolution configurable by the user, which could be really helpful in reducing interim image file sizes, but at the same time makes the process more fiddly, as I know I would end up trying different resolutions to balance OCR quality vs file size...

It may be necessary instead to post-process the generated PDF; I can't tell from the Poppler documentation whether it offers any help with file compression. I resorted to an online PDF service which reduced the 438.7 MiB PDF to 46.9 MiB, with the OCR intact, but it would be nice to save the bandwidth and process the file locally, especially since I have close to a hundred PDFs in my Zotero library that need the OCR treatment...

sojusnik · 2023-09-27T15:57:56Z

I'm using https://github.com/ocrmypdf/OCRmyPDF as a manual workaround. Maybe it will be sufficient for your case.

gazzar · 2023-09-28T07:34:23Z

Back in June when I last tried this, I also resorted to using OCRmyPDF after trying zotero-ocr.
I used the Docker version and OCRed my PDF with the following command:
docker run -i --rm jbarlow83/ocrmypdf - - <jp.pdf >jp_ocr.pdf -O3 --jobs 2
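
For anyone with OCRmyPDF installed natively rather than through Docker, the same options can be assembled from a script. This is only a sketch: the helper name `ocrmypdf_command` is made up for illustration, and the file names are the ones from the command above. It builds the argument list without running anything.

```python
def ocrmypdf_command(src, dst, optimize=3, jobs=2):
    """Build an argument list for a native `ocrmypdf` call (nothing is executed)."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    # -O<n> selects the optimization level, --jobs the number of parallel workers.
    return ["ocrmypdf", f"-O{optimize}", "--jobs", str(jobs), src, dst]


cmd = ocrmypdf_command("jp.pdf", "jp_ocr.pdf")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the tool is known to be on the PATH.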

zuphilip · 2023-12-10T18:46:14Z

Yeah, the size can become quite large. Tesseract itself creates the PDF from the input we give it. Tesseract would also run on JPEG images, but the quality of the OCR output also depends on the input images and their colours.

The -O3 option from OCRmyPDF looks good, and this tool also uses Tesseract under the hood. Maybe one could consider switching the workflow to it...

stweil · 2024-03-25T10:28:30Z

Reducing the resolution like in pull request #41 would reduce the size a lot. Using JPEG 2000 files with lossy compression would allow really small PDF files. Ideally that should be implemented in Tesseract.

aborel · 2025-02-08T09:36:48Z

I think we can reasonably use JPEG as a pdftoppm output and tesseract input, preferably as a user-selectable option. A quick and dirty test looks promising.

stweil · 2025-02-08T10:02:37Z

We produce nearly all of our OCR from JPEG. I'd use JPEG instead of PNG if this works and is easy to implement. Too many options confuse the users. Therefore I'd not add a new one for switching between JPEG and PNG.

A future solution could eliminate pdftoppm and extract the image which is part of the original PDF. This would eliminate any format conversion and automatically get the image in the best quality.
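
One possible way to explore this, named here as an assumption rather than the planned approach, is Poppler's `pdfimages` tool, which can list the images embedded in a PDF and write them out in their stored format. The sketch below only builds the command lines; the helper names and file names are hypothetical.

```python
def pdfimages_list_command(pdf_path):
    """Command that prints a table of the images embedded in the PDF."""
    return ["pdfimages", "-list", pdf_path]


def pdfimages_extract_command(pdf_path, out_prefix):
    """Command that writes each embedded image in its native format (-all)."""
    return ["pdfimages", "-all", pdf_path, out_prefix]


print(" ".join(pdfimages_list_command("scan.pdf")))
print(" ".join(pdfimages_extract_command("scan.pdf", "page")))
```

This would only help for PDFs that store each page as a single image; pages built from several images, masks, or vector content would still need rasterising.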

aborel · 2025-02-08T11:02:36Z

I would definitely prefer getting rid of pdftoppm (as per the ongoing discussion in #80 ), but I still don't have a clue how to implement that so I'm looking for a pragmatic solution - which could of course be replaced by something better eventually. It's a simple change, as far as I can see.
Too many options confusing users: understood. I think I will keep some hidden options for testing purposes, but indeed not everybody needs to figure out the fine-tuning of pdftoppm's output.

aborel · 2025-02-09T08:30:40Z

The tests on my local code yesterday have been quite successful. Playing around with the JPEG options of pdftoppm, I have found pretty good default values that reduce the image size by a factor of 2 to 4 compared to PNG without visible loss of quality at 300 dpi. In my samples (2 pdfs: 268 pages, black & white book scan + 4 pages, color commercial pamphlet), the size of the output PDFs is now roughly the same as the original, sometimes a bit smaller.
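
For readers who want to try something similar by hand, here is a sketch of a pdftoppm invocation with JPEG output. The quality value of 75 is an assumed placeholder, not the default chosen in the plugin, and the helper name `pdftoppm_command` is made up for illustration; `-r`, `-jpeg`, `-jpegopt` and `-gray` are pdftoppm options, though `-jpegopt` needs a reasonably recent Poppler.

```python
def pdftoppm_command(pdf_path, out_prefix, dpi=300, jpeg_quality=75, gray=False):
    """Build a pdftoppm argument list that rasterises pages to JPEG (nothing is executed)."""
    if not 0 <= jpeg_quality <= 100:
        raise ValueError("jpeg_quality must be between 0 and 100")
    cmd = ["pdftoppm", "-r", str(dpi), "-jpeg", "-jpegopt", f"quality={jpeg_quality}"]
    if gray:
        cmd.append("-gray")  # greyscale output instead of colour
    return cmd + [pdf_path, out_prefix]


print(" ".join(pdftoppm_command("book.pdf", "page")))
print(" ".join(pdftoppm_command("book.pdf", "page", dpi=200, gray=True)))
```

Comparing OCR output at a few quality and dpi settings on a sample document is probably the only reliable way to find values that suit a given collection.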

I'm counting this as a success :-) I still need to double-check a few things before I push the commit. @stweil would you like to review the code before we create a new release?

stweil · 2025-02-09T08:42:55Z

> would you like to review the code before we create a new release

Thank you for your work on this issue. If you create a pull request, I can review the changes there, and it will automatically be added to the release notes. I'd make the new release after the new code was merged.

zuphilip added the "enhancement" and "help wanted" labels · Dec 10, 2023

stweil mentioned this issue · Jun 8, 2024: Build a front end for OCRmyPDF (development suggestion) #75 (Closed)

aborel self-assigned this · Feb 8, 2025

aborel mentioned this issue · Feb 14, 2025: Use JPEG instead of PNG as intermediate format #89 (Open)
