Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved performance and security for ContentStream_readInlineImage #331

Closed
wants to merge 1 commit into from
Closed

Improved performance and security for ContentStream_readInlineImage #331

wants to merge 1 commit into from

Conversation

sekrause
Copy link
Contributor

This change has been tested with Python 2.6, 2.7 and 3.5.

It fixes #329 by raising an exception when the stream ends and we haven't the end token for the inline image.

It also fixes #330 by using a more efficient parsing algorithm. For large inline images this change speeds up this method by many orders of magnitude:

  • Instead of only reading single bytes from the stream it reads larger chunks of 8 kB and uses Python's really fast find() method to check for the E the token. Only when the token is found it falls back to the normal algorithm that detects the end of the inline image.
  • Instead of using an immutable data it uses BytesIO to collect the output which support much faster appends.

@vstoykov
Copy link
Contributor

vstoykov commented Jul 20, 2017

This can be optimized further by searching directly for b'EI' instead of searching for b'E' and then checking if next byte is b'I' and if not found to seek backwards with one byte.
Also the value b'EI' can be assigned to value outside of the loop which will also save some nanoseconds on every loop.

@MartinThoma MartinThoma added the Tiny Pull requests that make a tiny change - and thus should be easy to merge label Apr 6, 2022
@sekrause
Copy link
Contributor Author

sekrause commented Apr 8, 2022

@MartinThoma Since his pull request fixes issue #329, a possible denial-of-service security issue, it might be worth looking at rather sooner than later.

@MartinThoma
Copy link
Member

Thank you for pointing that out 👍

The issue exists for several years now. I prefer preventing regressions instead of fixing existing issues for the moment. To ensure that, I'm increasing the test coverage. I will check if the code you've introduced is covered / how to cover it.

@MartinThoma MartinThoma added the nf-security Non-functional change: Security label Apr 9, 2022
@MartinThoma MartinThoma added the needs-test A test should be added before this PR is merged. label Apr 9, 2022
@MartinThoma
Copy link
Member

@sekrause Do you happen to have a PDF file with an inline image that we could use for testing?

I've tried to cover this part with #677 , but aparently those images are not inlined.

@sekrause
Copy link
Contributor Author

sekrause commented Apr 9, 2022

I think you can create a test file with any image of your choice using ReportLab:

from reportlab.pdfgen import canvas
c = canvas.Canvas("test.pdf")
c.drawInlineImage("test.png", 100, 100, 100, 100)
c.drawString(200, 100, "Test")
c.showPage()
c.save()

I think that's what I did to create the intentionally broken PDF in issue #329: Create one with ReportLab and then edit it manually so that it triggers the bug. But since it's been 5 years since I analyzed the problem I've forgotten all details about it (and had stopped using PyPDF2 because it was unmaintained).

@MartinThoma MartinThoma deleted the branch py-pdf:master April 12, 2022 05:17
@sekrause
Copy link
Contributor Author

@MartinThoma Was this pull request closed automatically because the target branch was deleted?

@MartinThoma
Copy link
Member

Huh. Weird. I can only say that I didn't close it on purpose. Also, according to github the renaming should automatically change the target of all PRs. And I still see many open PRs 🤔

@MartinThoma
Copy link
Member

I also cannot click on re-open

@MartinThoma
Copy link
Member

MartinThoma commented Apr 12, 2022

There were 72 PRs before, now there are only 67 PRs. Seems like github accidentially closed 5.

Is it possible for you to execute this locally:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

and re-create the PR?

I'm very sorry about the inconvenience :-/

@sekrause
Copy link
Contributor Author

I think the problem is that I deleted my fork of this repository months ago because I lost hope that the pull request would ever be applied. GitHub probably closed all pull requests for the master branch where the other clone doesn't exist anymore.

At last the actual changes didn't get lost, so I could reapply the patch to a new fork and re-create it as PR #740.

@MartinThoma
Copy link
Member

Thank you so much for doing it 🙏 I would have done it myself once I found the time. Your PR will for sure get merged this year; I just cannot commit to a specific time at the moment. Too many open topics (both in PyPDF2, but also in my private live / work)

MartinThoma added a commit that referenced this pull request Apr 15, 2022
Credits to Sebastian Krause for creating the PDF:
#331 (comment)

Co-authored-by: Sebastian Krause <[email protected]>
MartinThoma added a commit that referenced this pull request Apr 15, 2022
Credits to Sebastian Krause for creating the PDF:
#331 (comment)

Co-authored-by: Sebastian Krause <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-test A test should be added before this PR is merged. nf-security Non-functional change: Security Tiny Pull requests that make a tiny change - and thus should be easy to merge
Projects
None yet
3 participants