PDF files are not always detected #94

Open
peterekepeter opened this issue Jul 17, 2024 · 1 comment

Comments

@peterekepeter
Contributor

From my testing, the %PDF- magic sequence does not necessarily have to be at offset 0.
It can appear later in the file. For example, I can type some junk at the beginning of a file and a PDF reader still opens it.

I received multiple files like this from people, so there is something or someone out in the wild that adds extra characters in front of the magic sequence.

A detector could search for the magic sequence as a substring within a search window at the start of the file, something like this:

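def is_pdf(file_path):
    # open() and read() may raise OSError (IOError is an alias of it in Python 3)
    with open(file_path, "rb") as file:
        # only the first 1024 bytes are scanned for the magic sequence
        header = file.read(1024)
        return b"%PDF-" in header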

From what I can see, the library is not currently built to handle this kind of situation.
So I'm leaving this ticket here with the code snippet in case more advanced detection is implemented.

@cdgriffith
Owner

Just to make sure, I did check out the PDF specifications themselves:

The PDF file begins with the 5 characters “%PDF-” and byte offsets shall be calculated from the PERCENT SIGN (25h).
NOTE 1 This provision allows for arbitrary bytes preceding the %PDF- without impacting the viability of the PDF file and its byte offsets.

So a PDF is still valid if it does not start with %PDF- at offset 0, as long as the sequence appears in its header. I will work on a better way to detect this.
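
As a rough illustration of the point about offsets, here is a minimal sketch that works on bytes already in memory and returns the position of the PERCENT SIGN rather than just a boolean, so later byte offsets could be calculated relative to it. The names `find_pdf_header`, `PDF_MAGIC`, and `window` are hypothetical and not part of the library, and the 1024-byte default is only carried over from the snippet above, not taken from the quoted specification text.

```python
from typing import Optional

# Magic sequence named in the quoted specification text.
PDF_MAGIC = b"%PDF-"


def find_pdf_header(data: bytes, window: int = 1024) -> Optional[int]:
    """Return the offset of the first %PDF- inside the first `window` bytes, or None."""
    # Limiting the search avoids matching a stray "%PDF-" deep inside unrelated data.
    # The whole sequence must fit inside the window to count as a match.
    offset = data[:window].find(PDF_MAGIC)
    return offset if offset != -1 else None


def is_pdf(data: bytes, window: int = 1024) -> bool:
    """True if a %PDF- header is found within the search window."""
    return find_pdf_header(data, window) is not None
```

A caller that needs to resolve offsets inside the file could add the returned value to each offset it reads, since the quoted text says offsets are calculated from the PERCENT SIGN. Whether a fixed window is the right limit is a design choice left open here.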
