Improved performance and security for ContentStream_readInlineImage #331

sekrause · 2017-02-17T12:21:53Z

This change has been tested with Python 2.6, 2.7 and 3.5.

It fixes #329 by raising an exception when the stream ends and we haven't found the end token for the inline image.

It also fixes #330 by using a more efficient parsing algorithm. For large inline images this change speeds up this method by many orders of magnitude:

Instead of reading single bytes from the stream, it reads larger chunks of 8 kB and uses Python's really fast find() method to look for the E of the EI token. Only when that byte is found does it fall back to the normal algorithm that detects the end of the inline image.
Instead of an immutable data object, it uses BytesIO to collect the output, which supports much faster appends.
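
The two points above can be sketched roughly as follows. This is a minimal illustration and not the code from this pull request: the function name `read_inline_image_data`, the constants, and the choice of `ValueError` are assumptions made for the example, and it accepts end-of-stream directly after EI as a terminator for simplicity.

```python
from io import BytesIO

WHITESPACE = b" \n\r\t\x00\x0c"  # assumed set of PDF whitespace bytes
CHUNK_SIZE = 8192                # 8 kB chunks, as described above


def read_inline_image_data(stream):
    """Return the bytes before the first 'EI' that is followed by whitespace.

    Hypothetical helper: raises ValueError if the stream ends first.
    """
    out = BytesIO()  # mutable buffer, cheap to append to
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            # End of stream without an end token: fail instead of looping.
            raise ValueError("stream ended before the EI token was found")
        pos = chunk.find(b"E")
        if pos == -1:
            out.write(chunk)  # no candidate in this chunk, keep all of it
            continue
        # Keep the bytes before the candidate, rewind to just after the "E".
        out.write(chunk[:pos])
        stream.seek(pos + 1 - len(chunk), 1)
        nxt = stream.read(1)
        if nxt == b"I":
            after = stream.read(1)
            if after == b"" or after in WHITESPACE:
                return out.getvalue()  # real end token
            out.write(b"EI")           # "EI" was part of the image data
            stream.seek(-1, 1)
        else:
            out.write(b"E")            # lone "E" was part of the image data
            if nxt:
                stream.seek(-1, 1)


data = read_inline_image_data(BytesIO(b"abcEdefEIxyz EI Q"))
```

Here `data` holds everything up to the terminating EI, including the earlier "E" and "EI" bytes that turned out to be image data.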

vstoykov · 2017-07-20T14:07:22Z

This can be optimized further by searching directly for b'EI' instead of searching for b'E', checking whether the next byte is b'I', and seeking back one byte if it is not.
Also, b'EI' can be assigned to a variable outside of the loop, which saves a few nanoseconds on every iteration.
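
A rough sketch of that idea, assuming the data is already in memory as a bytes object (the names `END_TOKEN` and `find_inline_image_end` are made up for this example). A chunked reader would additionally have to handle an EI that is split across two chunks, which this sketch sidesteps.

```python
END_TOKEN = b"EI"                # bound once, outside any loop
WHITESPACE = b" \n\r\t\x00\x0c"  # assumed set of PDF whitespace bytes


def find_inline_image_end(buf, start=0):
    """Return the index of the first 'EI' followed by whitespace or the
    end of the buffer, or -1 if there is none (hypothetical helper)."""
    pos = buf.find(END_TOKEN, start)
    while pos != -1:
        follower = buf[pos + 2 : pos + 3]
        if follower == b"" or follower in WHITESPACE:
            return pos
        # This "EI" was image data; continue searching after it.
        pos = buf.find(END_TOKEN, pos + 1)
    return -1


end = find_inline_image_end(b"abcEdefEIxyz EI Q")
```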

sekrause · 2022-04-08T06:13:43Z

@MartinThoma Since this pull request fixes issue #329, a possible denial-of-service security issue, it might be worth looking at sooner rather than later.

MartinThoma · 2022-04-08T06:24:40Z

Thank you for pointing that out 👍

The issue has existed for several years now. For the moment I prefer preventing regressions over fixing existing issues. To ensure that, I'm increasing the test coverage. I will check whether the code you've introduced is covered and, if not, how to cover it.

MartinThoma · 2022-04-09T06:36:53Z

@sekrause Do you happen to have a PDF file with an inline image that we could use for testing?

I've tried to cover this part with #677, but apparently those images are not inlined.

sekrause · 2022-04-09T07:46:42Z

I think you can create a test file with any image of your choice using ReportLab:

from reportlab.pdfgen import canvas
c = canvas.Canvas("test.pdf")
c.drawInlineImage("test.png", 100, 100, 100, 100)
c.drawString(200, 100, "Test")
c.showPage()
c.save()

I think that's what I did to create the intentionally broken PDF in issue #329: Create one with ReportLab and then edit it manually so that it triggers the bug. But since it's been 5 years since I analyzed the problem I've forgotten all details about it (and had stopped using PyPDF2 because it was unmaintained).

sekrause · 2022-04-12T06:26:17Z

@MartinThoma Was this pull request closed automatically because the target branch was deleted?

MartinThoma · 2022-04-12T08:54:17Z

Huh. Weird. I can only say that I didn't close it on purpose. Also, according to GitHub the renaming should automatically change the target of all PRs. And I still see many open PRs 🤔

MartinThoma · 2022-04-12T08:54:33Z

I also cannot click on re-open

MartinThoma · 2022-04-12T08:57:13Z

There were 72 PRs before, now there are only 67 PRs. Seems like GitHub accidentally closed 5.

Is it possible for you to execute this locally:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

and re-create the PR?

I'm very sorry about the inconvenience :-/

sekrause · 2022-04-12T17:20:08Z

I think the problem is that I deleted my fork of this repository months ago because I lost hope that the pull request would ever be applied. GitHub probably closed all pull requests for the master branch where the other clone doesn't exist anymore.

At least the actual changes didn't get lost, so I could reapply the patch to a new fork and re-create it as PR #740.

MartinThoma · 2022-04-12T20:16:00Z

Thank you so much for doing it 🙏 I would have done it myself once I found the time. Your PR will for sure get merged this year; I just cannot commit to a specific time at the moment. Too many open topics (both in PyPDF2 and in my private life / work).

Credits to Sebastian Krause for creating the PDF: #331 (comment) Co-authored-by: Sebastian Krause <[email protected]>

Improved performance and security for ContentStream_readInlineImage.

96f5ddf

This was referenced Jul 20, 2017

BUG: Fix Parsing of Inline Images #332

Closed

Endless Loop When Processing Certain Large PDF with PdfFileWriter #358

Closed

sekrause mentioned this pull request Mar 3, 2018

Rebooting PyPDF2 Maintenance #385

Closed

MartinThoma added the Tiny Pull requests that make a tiny change - and thus should be easy to merge label Apr 6, 2022

MartinThoma added the nf-security Non-functional change: Security label Apr 9, 2022

MartinThoma added this to the Last PyPDF2 version 1.X release milestone Apr 9, 2022

MartinThoma added the needs-test A test should be added before this PR is merged. label Apr 9, 2022

MartinThoma deleted the branch py-pdf:master April 12, 2022 05:17

MartinThoma closed this Apr 12, 2022

MartinThoma mentioned this pull request Apr 12, 2022

Rename primary branch from master to main #736

Closed

sekrause mentioned this pull request Apr 12, 2022

Improved performance and security for ContentStream_readInlineImage. #740

Merged

MartinThoma added a commit that referenced this pull request Apr 15, 2022

TST: Add test for inline images

18ced3d

Credits to Sebastian Krause for creating the PDF: #331 (comment) Co-authored-by: Sebastian Krause <[email protected]>

MartinThoma mentioned this pull request Apr 15, 2022

TST: Add test for inline images #758

Merged

MartinThoma added a commit that referenced this pull request Apr 15, 2022

TST: Add test for inline images (#758)

0890b06

Credits to Sebastian Krause for creating the PDF: #331 (comment) Co-authored-by: Sebastian Krause <[email protected]>
