Attempt to cache repeated images at the document, rather than the page, level (issue 11878) #11912
Conversation
Force-pushed from 6ee31ed to 56a48d4 (compare)
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/80049a1debfa01c/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/1b23935c86b9296/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 26.07 mins.
Full output at: http://54.67.70.0:8877/80049a1debfa01c/output.txt
Image differences available at: http://54.67.70.0:8877/80049a1debfa01c/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Total script time: 28.83 mins.
Full output at: http://54.215.176.217:8877/1b23935c86b9296/output.txt
Image differences available at: http://54.215.176.217:8877/1b23935c86b9296/reftest-analyzer.html#web=eq.log
Force-pushed from fe4bf41 to 132d2f0 (compare)
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/66866672d17e836/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/f6837a88f078c65/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 25.95 mins.
Full output at: http://54.67.70.0:8877/66866672d17e836/output.txt
Image differences available at: http://54.67.70.0:8877/66866672d17e836/reftest-analyzer.html#web=eq.log
Force-pushed from 132d2f0 to e3dbc7d (compare)
From: Bot.io (Windows)
Failed. Total script time: 29.66 mins.
Full output at: http://54.215.176.217:8877/f6837a88f078c65/output.txt
Image differences available at: http://54.215.176.217:8877/f6837a88f078c65/reftest-analyzer.html#web=eq.log
Force-pushed from 382061f to c4d68a2 (compare)
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/a0c4d9d8c39cb5c/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/530eda41e546e27/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 26.03 mins.
Full output at: http://54.67.70.0:8877/a0c4d9d8c39cb5c/output.txt
Image differences available at: http://54.67.70.0:8877/a0c4d9d8c39cb5c/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Total script time: 28.88 mins.
Full output at: http://54.215.176.217:8877/530eda41e546e27/output.txt
Image differences available at: http://54.215.176.217:8877/530eda41e546e27/reftest-analyzer.html#web=eq.log
Force-pushed from 99ac370 to cc0f1a1 (compare)
Force-pushed from 3f97bd4 to 0279571 (compare)
Force-pushed from 0279571 to dda6626 (compare)
/botio test
From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.215.176.217:8877/a674514fe42f192/output.txt

From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/cc5ab32f05c0c45/output.txt

From: Bot.io (Linux m4)
Failed. Total script time: 25.80 mins.
Full output at: http://54.67.70.0:8877/cc5ab32f05c0c45/output.txt
Image differences available at: http://54.67.70.0:8877/cc5ab32f05c0c45/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Total script time: 28.02 mins.
Full output at: http://54.215.176.217:8877/a674514fe42f192/output.txt
Image differences available at: http://54.215.176.217:8877/a674514fe42f192/reftest-analyzer.html#web=eq.log
/botio-linux preview
From: Bot.io (Linux m4)
Received. Command cmd_preview from @timvandermeij received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/db30669819f4d05/output.txt

From: Bot.io (Linux m4)
Success. Total script time: 3.35 mins. Published.
Full output at: http://54.67.70.0:8877/db30669819f4d05/output.txt
Really nice work! The unit test looks good. /botio makeref |
From: Bot.io (Linux m4)
Received. Command cmd_makeref from @timvandermeij received. Current queue size: 0.
Live output at: http://54.67.70.0:8877/97a7c5f703e40a3/output.txt

From: Bot.io (Windows)
Received. Command cmd_makeref from @timvandermeij received. Current queue size: 1.
Live output at: http://54.215.176.217:8877/8578ccda3f86826/output.txt

From: Bot.io (Linux m4)
Success. Total script time: 23.96 mins.
Full output at: http://54.67.70.0:8877/97a7c5f703e40a3/output.txt

From: Bot.io (Windows)
Success. Total script time: 26.14 mins.
Full output at: http://54.215.176.217:8877/8578ccda3f86826/output.txt
Thanks for landing this; I'm really hoping that the changes will result in user-perceived improvements in these kinds of PDF documents. After this landed (obviously), I've also found a possible improvement related to cleanup; please see PR #11926 for additional details. Finally, just yesterday, I've also realized that the old pre-existing page-level
Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other; however, there are PDF documents where many (or even all) pages actually reference exactly the *same* image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page, which is obviously slow and wastes both CPU and memory resources better used elsewhere.[1]

Obviously we cannot simply treat all image resources as if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2]

However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages.

In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there are now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data).
While this obviously benefits both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus, to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fall back to the default behaviour.

Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3]

*Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator.

Fixes #11878 (and probably a few more issues/bugs).
Also slightly improves cases such as e.g. issue #11518, issue #11612, and bug 1536420.
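The promotion-and-cap behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, *not* the actual PDF.js `GlobalImageCache` implementation: the method names and the two threshold constants are assumptions made for the example.

```javascript
// Sketch of a document-level image cache with a page-count promotion
// threshold and a size cap; names and numbers are illustrative.
const NUM_PAGES_THRESHOLD = 2; // pages an image must appear on before promotion
const MAX_IMAGES_TO_CACHE = 10; // document-level cap, to bound memory usage

class GlobalImageCacheSketch {
  constructor() {
    // Maps an image's XRef reference to the set of pages it was seen on,
    // plus its (eventually) cached decoded data.
    this._refCache = new Map();
  }

  // Record that `ref` was seen on `pageIndex`; returns true once the image
  // qualifies for document-level (rather than page-level) caching.
  shouldCache(ref, pageIndex) {
    let entry = this._refCache.get(ref);
    if (!entry) {
      entry = { pageIndexSet: new Set(), data: null };
      this._refCache.set(ref, entry);
    }
    entry.pageIndexSet.add(pageIndex);

    if (entry.pageIndexSet.size < NUM_PAGES_THRESHOLD) {
      return false; // Still page-specific.
    }
    if (entry.data !== null) {
      return true; // Already cached at the document level.
    }
    // Enforce the cap: fall back to page-level handling when full.
    let cachedCount = 0;
    for (const e of this._refCache.values()) {
      if (e.data !== null) {
        cachedCount++;
      }
    }
    return cachedCount < MAX_IMAGES_TO_CACHE;
  }

  setData(ref, data) {
    const entry = this._refCache.get(ref);
    if (entry) {
      entry.data = data;
    }
  }

  getData(ref) {
    const entry = this._refCache.get(ref);
    return entry ? entry.data : null;
  }
}
```

The real implementation additionally has to cope with ranged/streamed loading, where an image reference may not yet be resolvable when first encountered; that is the complication the description alludes to.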
[1] There are e.g. PDF documents that use the same image as background on all pages.

[2] Given that data stored in the `commonObjs`, on the main-thread, is only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned up, e.g. by being evicted from the cache in the default viewer.

[3] If the latter were the case, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
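The two object lifetimes footnote [2] describes can be sketched as follows; again this is an illustrative sketch under assumed names, not the real `commonObjs`/`objs` machinery in PDF.js.

```javascript
// Illustrative sketch: document-level objects (cf. `commonObjs`) survive
// page cleanup and are only released by an explicit, manual call (cf.
// `PDFDocumentProxy.cleanup`), whereas page-level objects (cf. `objs`)
// are released automatically when their page is evicted.
class DocumentObjectsSketch {
  constructor() {
    this.commonObjs = new Map(); // document-level, e.g. fonts or promoted images
    this.pageObjs = new Map(); // pageIndex -> Map of page-level objects
  }

  putPageObj(pageIndex, id, data) {
    if (!this.pageObjs.has(pageIndex)) {
      this.pageObjs.set(pageIndex, new Map());
    }
    this.pageObjs.get(pageIndex).set(id, data);
  }

  // Runs automatically when a page is evicted from the viewer's cache:
  // only that page's objects are released.
  cleanupPage(pageIndex) {
    this.pageObjs.delete(pageIndex);
  }

  // Must be invoked manually; also releases the document-level objects.
  cleanupDocument() {
    this.commonObjs.clear();
    this.pageObjs.clear();
  }
}
```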