PDF files are not always detected #94

peterekepeter · 2024-07-17T12:25:35Z

From my testing, the %PDF- marker does not have to be at offset 0.
It can appear later in the file. For example, I can type some junk at the beginning of a file and a PDF viewer still opens it.

I received multiple files like this from people, so there is something or someone out in the wild that adds extra characters in front of the magic sequence.

A detector would look something like this, searching for the marker as a substring inside a search window:

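def is_pdf(file_path):
    # open() and read() may raise OSError (IOError is an alias of it in Python 3)
    with open(file_path, "rb") as file:
        # the marker may be preceded by junk, so search a window instead of checking offset 0
        header = file.read(1024)
        return b"%PDF-" in header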

From what I see currently the library is not built to handle this kind of situation.
So I'm leaving this ticket here with this code snippet in case more advanced detection is implemented.

cdgriffith · 2024-09-28T22:47:15Z

Just to make sure, I did check out the PDF specifications themselves:

The PDF file begins with the 5 characters “%PDF-” and byte offsets shall be calculated from the
PERCENT SIGN (25h).
NOTE 1 This provision allows for arbitrary bytes preceding the %PDF- without impacting the viability of
the PDF file and its byte offsets.

So it is valid for a PDF not to start strictly with %PDF-, but it must still contain the marker in its header. Will work on a better way to detect this.
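
Since the spec counts byte offsets from the percent sign, a detector may also want to know where the marker sits, not just whether it exists. A minimal sketch of that idea follows. The names find_pdf_header, PDF_MAGIC and SEARCH_WINDOW are hypothetical and not part of this library, and the 1024-byte window is an assumption carried over from the snippet earlier in this thread, not a value taken from the quoted spec text.

```python
from typing import Optional

PDF_MAGIC = b"%PDF-"
# Assumption: reuse the 1024-byte window from the snippet earlier in this thread.
SEARCH_WINDOW = 1024


def find_pdf_header(data: bytes, window: int = SEARCH_WINDOW) -> Optional[int]:
    """Return the offset of the first %PDF- marker that lies entirely
    within the first `window` bytes of `data`, or None if there is none."""
    offset = data[:window].find(PDF_MAGIC)
    return None if offset == -1 else offset
```

Returning None instead of -1 keeps a marker found at offset 0 from being confused with "not found" in a truthiness check, as long as callers compare with "is not None".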

peterekepeter mentioned this issue Jul 17, 2024

Improve PDF file detection, fix description #93

Merged

cdgriffith mentioned this issue Sep 28, 2024

Version 2.0 Goals #70

Open
