Attempt to cache repeated images at the document, rather than the page, level (issue 11878) #11912
Conversation
Force-pushed from 6ee31ed to 56a48d4 (compare)
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/80049a1debfa01c/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/1b23935c86b9296/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 26.07 mins.
Full output at: http://54.67.70.0:8877/80049a1debfa01c/output.txt
Image differences available at: http://54.67.70.0:8877/80049a1debfa01c/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Total script time: 28.83 mins.
Full output at: http://54.215.176.217:8877/1b23935c86b9296/output.txt
Image differences available at: http://54.215.176.217:8877/1b23935c86b9296/reftest-analyzer.html#web=eq.log
Force-pushed from fe4bf41 to 132d2f0 (compare)
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/66866672d17e836/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/f6837a88f078c65/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 25.95 mins.
Full output at: http://54.67.70.0:8877/66866672d17e836/output.txt
Image differences available at: http://54.67.70.0:8877/66866672d17e836/reftest-analyzer.html#web=eq.log
Force-pushed from 132d2f0 to e3dbc7d (compare)
From: Bot.io (Windows)
Failed. Total script time: 29.66 mins.
Full output at: http://54.215.176.217:8877/f6837a88f078c65/output.txt
Image differences available at: http://54.215.176.217:8877/f6837a88f078c65/reftest-analyzer.html#web=eq.log
Force-pushed from 382061f to c4d68a2 (compare)
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/a0c4d9d8c39cb5c/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/530eda41e546e27/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 26.03 mins.
Full output at: http://54.67.70.0:8877/a0c4d9d8c39cb5c/output.txt
Image differences available at: http://54.67.70.0:8877/a0c4d9d8c39cb5c/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Total script time: 28.88 mins.
Full output at: http://54.215.176.217:8877/530eda41e546e27/output.txt
Image differences available at: http://54.215.176.217:8877/530eda41e546e27/reftest-analyzer.html#web=eq.log
Force-pushed from 99ac370 to cc0f1a1 (compare)
Force-pushed from 3f97bd4 to 0279571 (compare)
Force-pushed from 0279571 to dda6626 (compare)
/botio test
From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/a674514fe42f192/output.txt

From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/cc5ab32f05c0c45/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 25.80 mins.
Full output at: http://54.67.70.0:8877/cc5ab32f05c0c45/output.txt
Image differences available at: http://54.67.70.0:8877/cc5ab32f05c0c45/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Total script time: 28.02 mins.
Full output at: http://54.215.176.217:8877/a674514fe42f192/output.txt
Image differences available at: http://54.215.176.217:8877/a674514fe42f192/reftest-analyzer.html#web=eq.log
/botio-linux preview
From: Bot.io (Linux m4)
Received. Command cmd_preview from @timvandermeij received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/db30669819f4d05/output.txt

From: Bot.io (Linux m4)
Success. Total script time: 3.35 mins. Published.
Full output at: http://54.67.70.0:8877/db30669819f4d05/output.txt
Really nice work! The unit test looks good. /botio makeref |
From: Bot.io (Linux m4)
Received. Command cmd_makeref from @timvandermeij received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/97a7c5f703e40a3/output.txt

From: Bot.io (Windows)
Received. Command cmd_makeref from @timvandermeij received. Current queue size: 1.
Live output at: http://54.215.176.217:8877/8578ccda3f86826/output.txt

From: Bot.io (Linux m4)
Success. Total script time: 23.96 mins.
Full output at: http://54.67.70.0:8877/97a7c5f703e40a3/output.txt

From: Bot.io (Windows)
Success. Total script time: 26.14 mins.
Full output at: http://54.215.176.217:8877/8578ccda3f86826/output.txt
Thanks for landing this; I'm really hoping that the changes will result in user-perceived improvements in these kinds of PDF documents. After this landed (obviously), I've also found a possible improvement related to cleanup; please see PR #11926 for additional details. Finally, just yesterday, I've also realized that the old pre-existing page-level
Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other; however, there are PDF documents where many (or even all) pages actually reference exactly the *same* image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page, which is obviously slow and wastes both CPU and memory resources better used elsewhere.[1]

Obviously we cannot simply treat all image resources as if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2]

However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages.

In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there are now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data).
While this obviously benefits both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus, to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fall back to the default behaviour.

Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3]

*Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator.

Fixes #11878 (and probably a few more issues/bugs).
Also slightly improves cases such as e.g. issue #11518, issue #11612, and bug 1536420.
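The promotion-and-cap behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, *not* the actual PDF.js `GlobalImageCache` implementation: the method names and the two threshold constants are assumptions made for the example.

```javascript
// Sketch of a document-level image cache with a page-count promotion
// threshold and a size cap; names and numbers are illustrative.
const NUM_PAGES_THRESHOLD = 2; // pages an image must appear on before promotion
const MAX_IMAGES_TO_CACHE = 10; // document-level cap, to bound memory usage

class GlobalImageCacheSketch {
  constructor() {
    // Maps an image's XRef reference to the set of pages it was seen on,
    // plus its (eventually) cached decoded data.
    this._refCache = new Map();
  }

  // Record that `ref` was seen on `pageIndex`; returns true once the image
  // qualifies for document-level (rather than page-level) caching.
  shouldCache(ref, pageIndex) {
    let entry = this._refCache.get(ref);
    if (!entry) {
      entry = { pageIndexSet: new Set(), data: null };
      this._refCache.set(ref, entry);
    }
    entry.pageIndexSet.add(pageIndex);

    if (entry.pageIndexSet.size < NUM_PAGES_THRESHOLD) {
      return false; // Still page-specific.
    }
    if (entry.data !== null) {
      return true; // Already cached at the document level.
    }
    // Enforce the cap: fall back to page-level handling when full.
    let cachedCount = 0;
    for (const e of this._refCache.values()) {
      if (e.data !== null) {
        cachedCount++;
      }
    }
    return cachedCount < MAX_IMAGES_TO_CACHE;
  }

  setData(ref, data) {
    const entry = this._refCache.get(ref);
    if (entry) {
      entry.data = data;
    }
  }

  getData(ref) {
    const entry = this._refCache.get(ref);
    return entry ? entry.data : null;
  }
}
```

The real implementation additionally has to cope with ranged/streamed loading, where an image reference may not yet be resolvable when first encountered; that is the complication the description alludes to.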
[1] There are e.g. PDF documents that use the same image as background on all pages.

[2] Given that data stored in the `commonObjs`, on the main-thread, is only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned up, e.g. by being evicted from the cache in the default viewer.

[3] If the latter were the case, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
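The two object lifetimes footnote [2] describes can be sketched as follows; again this is an illustrative sketch under assumed names, not the real `commonObjs`/`objs` machinery in PDF.js.

```javascript
// Illustrative sketch: document-level objects (cf. `commonObjs`) survive
// page cleanup and are only released by an explicit, manual call (cf.
// `PDFDocumentProxy.cleanup`), whereas page-level objects (cf. `objs`)
// are released automatically when their page is evicted.
class DocumentObjectsSketch {
  constructor() {
    this.commonObjs = new Map(); // document-level, e.g. fonts or promoted images
    this.pageObjs = new Map(); // pageIndex -> Map of page-level objects
  }

  putPageObj(pageIndex, id, data) {
    if (!this.pageObjs.has(pageIndex)) {
      this.pageObjs.set(pageIndex, new Map());
    }
    this.pageObjs.get(pageIndex).set(id, data);
  }

  // Runs automatically when a page is evicted from the viewer's cache:
  // only that page's objects are released.
  cleanupPage(pageIndex) {
    this.pageObjs.delete(pageIndex);
  }

  // Must be invoked manually; also releases the document-level objects.
  cleanupDocument() {
    this.commonObjs.clear();
    this.pageObjs.clear();
  }
}
```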