Skip to content

Memory fixes for large playwright screenshots#3092

Merged
dgtlmoon merged 3 commits into
dgtlmoon:masterfrom
xconverge:alternateMemoryFixes
Apr 9, 2025
Merged

Memory fixes for large playwright screenshots#3092
dgtlmoon merged 3 commits into
dgtlmoon:masterfrom
xconverge:alternateMemoryFixes

Conversation

@xconverge
Copy link
Copy Markdown
Contributor

@xconverge xconverge commented Apr 7, 2025

This is a cleaner implementation than #3089 for #3035 I think

Key changes:

  • Don't create a list of all chunks and process at the end, process as we create the stitched image so we dont have a large amount of memory allocated to a list of chunks
  • Use context managers for bytes and buffers
  • Use constant viewport size (and thus chunk size) for stitching to eliminate need to adjust viewport size
  • Lower clipped screenshot vertical limit from 32000px to 16000px
  • Added an env var SCREENSHOT_MAX_HEIGHT to address Is there a way to limit/crop the size of the screenshot with Playwright? #1760 and further provide a mitigation for limiting memory consumption (or increasing the limit if a user needs it)
  • Standardize on a single function that handles the playwright screenshot without (intentionally) changing existing functionality at all

@xconverge xconverge changed the title Alternate solution to memory fixes for large playwright screenshots Memory fixes for large playwright screenshots Apr 7, 2025

MAX_TOTAL_HEIGHT = SCREENSHOT_SIZE_STITCH_THRESHOLD*4 # Maximum total height for the final image (When in stitch mode)
MAX_CHUNK_HEIGHT = 4000 # Height per screenshot chunk
MAX_TOTAL_HEIGHT = 16000 # Maximum total height for the final image (When in stitch mode), not worth going over this for now
Copy link
Copy Markdown
Owner

@dgtlmoon dgtlmoon Apr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but these kinds of comments never get re-visited and just confuse or worry the next person who works on this.

Suggested change
MAX_TOTAL_HEIGHT = 16000 # Maximum total height for the final image (When in stitch mode), not worth going over this for now
# Maximum total height for the final image (When in stitch mode).
# We limit this to 16000px due to the huge amount of RAM that was being used
# Example: 16000 × 1400 × 3 = 67,200,000 bytes ≈ 64.1 MB (not including buffers in PIL etc)
MAX_TOTAL_HEIGHT = int(os.getenv("SCREENSHOT_MAX_HEIGHT", 16000))

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add SCREENSHOT_MAX_HEIGHT=16000 (commented out with a note about memory) in docker-compose.yml

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should be a separate PR because it is a larger change than you are suggesting

If I set SCREENSHOT_MAX_HEIGHT in my env to 4000, I would expect the screenshot in all cases (not just this function) to be 4000. Not just stitched images which is an implementation detail.

If I made the env var SCREENSHOT_STITCHED_MAX_HEIGHT that would be ok but then is kind of unnecessarily specific.

If you are ok with me just doing your suggested change in this one file of

MAX_TOTAL_HEIGHT = int(os.getenv("SCREENSHOT_MAX_HEIGHT", 16000))

I can also just do it

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should be a separate PR because it is a larger change than you are suggesting

yeah but it just creates work for someone else, please merge in that suggestions

i cant merge this is with mysterious open comments, it has to be something understandable to someone else

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no open comments I don't think. This is a bug fix for a regression.


# Capture only the visible area using clip
with io.BytesIO(
page.screenshot(
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you are 100% sure that this solved the memory problem?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR in it's entirety has solved the issue on my system/installation, yes.

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 7, 2025

Roughly it looks OK to me, so using the context stuff solved the memory problems?

@xconverge
Copy link
Copy Markdown
Contributor Author

Roughly it looks OK to me, so using the context stuff solved the memory problems?

Yes and not storing a list of all chunks. All 4 bullet points in the PR description help.

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 7, 2025

Yes and not storing a list of all chunks.

ok so that was the main problem?

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 7, 2025

Yes and not storing a list of all chunks.

but if this was the main issue, surely when the function returns it would clean off that memory because its not in scope anymore

thats the part i dont get

or maybe the python PIL kept the reference? or cache in there?

@xconverge
Copy link
Copy Markdown
Contributor Author

xconverge commented Apr 7, 2025

I dont know, it is just how it is with pillow I believe

https://pillow.readthedocs.io/en/stable/reference/open_files.html

"Users of the library should use a context manager or call Image.Image.close() on any image opened with a filename or Path object to ensure that the underlying file is closed."

Again, the solution is all changes of the PR, not just one line or concept. The list of images instead of processing as we go would certainly work, as long as the image gets closed in the end (it wasn't). The max memory used for a screenshot would be significantly higher in that scenario though (and as long as we iterated over them and closed them, would be freed).

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 8, 2025

OK amazing, thanks for your work!

@xconverge
Copy link
Copy Markdown
Contributor Author

xconverge commented Apr 8, 2025

OK amazing, thanks for your work!

Are you ok with this bullet point?

"Use stitching screenshot method at all times now (instead of just when > 8000px) for consistency, repeatability, and to easily use the above environment var for max height. This is less efficient than self.page.screenshot(type='jpeg', full_page=True, quality=40) which can be done natively in playwright, but ends up being similar if we have to do the full_page screenshot and THEN crop it using SCREENSHOT_MAX_HEIGHT after anyways"

I think it is the right call but don't feel strongly and could easily be convinced to revert that change.

Either way, I think having a single function "get screenshot" is going to help, if you wanted to do an optimization within it where we check to see if we can just do a self.page.screenshot(type='jpeg', full_page=True, quality=int(os.getenv("SCREENSHOT_QUALITY", 30))) due to page height being less than the SCREENSHOT_MAX_HEIGHT and less than 8000 or something, I think that would make sense

Edit: I went ahead and added the optimization. It is free, closer to what was there before, and still supports us allowing a user to limit the max height of a screenshot without needing to spread logic around in different places.

…od when we don't need to clip the image and it is below 8000px
@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

I think it is the right call but don't feel strongly and could easily be convinced to revert that change.

fine to me, i dont think its worth discussing more, lets make the change and move on :)

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

Edit: I went ahead and added the optimization. It is free, closer to what was there before, and still supports us allowing a user to limit the max height of a screenshot without needing to spread logic around in different places.
perfect

so what now, its ready to merge?

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

here is a really long page that could be a good test https://adguard.com/en/versions/windows/nightly.html

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

According to memray, the usage is still pretty high, i'm still not convinced

PYTHONMALLOC=malloc PLAYWRIGHT_DRIVER_URL=ws://127.0.0.1:3000 memray run -o memray.bin  ./changedetection.py -d /tmp/datastore/

image

it's almost like its playwright holding out with the memory problems

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

It's still not clear which type of memory you are reporting yet.

heap size vs resident size in memray
🟦 Heap Size
Definition: The total amount of memory allocated on the Python heap by your program (via malloc, new, etc.).

Covers: Memory requested from the OS by the program.

Includes:

Active allocations (in use)

Inactive/freed memory that hasn’t yet been returned to the OS

✅ Useful for understanding how much memory your Python objects are using overall.

🟨 Resident Size
Definition: The amount of memory that is currently resident in RAM, i.e., physically in use in memory (not swapped out).

OS-Level View: This is what the operating system sees as "used" memory.

Includes:

All loaded code, stack, and heap memory actually in RAM

May include memory from other sources like native libraries, threads, etc.

✅ Useful when you're debugging actual system memory pressure or trying to understand real-world memory impact.

ANALOGY -

Heap Size | Size of your bookshelf
Resident Size | Number of books you're actively reading and lying open on the desk

If playwright "expanded its heap size" then ofcourse it will stay at that size

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

So memory usage still grows even when the screenshot stuff is commented out

image


--- a/changedetectionio/content_fetchers/playwright.py
+++ b/changedetectionio/content_fetchers/playwright.py
@@ -163,11 +163,11 @@ class fetcher(Fetcher):
                 browser.close()
                 raise PageUnloadable(url=url, status_code=None, message=str(e))
 
-            if self.status_code != 200 and not ignore_status_codes:
-                screenshot = self.page.screenshot(type='jpeg', full_page=True,
-                                                  quality=int(os.getenv("SCREENSHOT_QUALITY", 72)))
+#            if self.status_code != 200 and not ignore_status_codes:
+#                screenshot = self.page.screenshot(type='jpeg', full_page=True,
+#                                                  quality=int(os.getenv("SCREENSHOT_QUALITY", 72)))
 
-                raise Non200ErrorCodeReceived(url=url, status_code=self.status_code, screenshot=screenshot)
+#                raise Non200ErrorCodeReceived(url=url, status_code=self.status_code, screenshot=screenshot)
 
             if not empty_pages_are_a_change and len(self.page.content().strip()) == 0:
                 logger.debug("Content Fetcher > Content was empty, empty_pages_are_a_change = False")
@@ -204,7 +204,8 @@ class fetcher(Fetcher):
             # acceptable screenshot quality here
             try:
                 # The actual screenshot - this always base64 and needs decoding! horrible! huge CPU usage
-                self.screenshot = capture_full_page(self.page)
+                #self.screenshot = capture_full_page(self.page)
+                x=1

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

Ok, comment out these lines, HUGE improvement in memory handling

self.xpath_data = self.page.evaluate(
"async () => {" + self.xpath_element_js.replace('%ELEMENTS%', visualselector_xpath_selectors) + "}")
self.instock_data = self.page.evaluate("async () => {" + self.instock_data_js + "}")

@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

I can 100% prove this is something todo playwright, add the FAST_PUPPETEER_CHROME_FETCHER=True env var... this will skip playwright and go direct with another library

@xconverge
Copy link
Copy Markdown
Contributor Author

If you want to run your own tests, compare 0.49.4 and 0.49.3.

This is a good change and not worse. I have been running it for days. I never claimed to fix every memory problem the application has for big pages, just the bullet points above.

Close the PR if you want, but I am done working on it

@dgtlmoon dgtlmoon merged commit 9f32678 into dgtlmoon:master Apr 9, 2025
@xconverge xconverge deleted the alternateMemoryFixes branch April 9, 2025 15:28
@dgtlmoon
Copy link
Copy Markdown
Owner

dgtlmoon commented Apr 9, 2025

@xconverge claud also recommends this..

I'll try it

image

@xconverge
Copy link
Copy Markdown
Contributor Author

I think those could help and not hurt

I also think pyvips would be better than PIL but I didn't want to complicate your project and add a new dependency

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants