Wrong mime types and extension for .xlsx type files #102

Open
Aashutosh05 opened this issue Sep 27, 2024 · 1 comment

Comments

@Aashutosh05

I want to detect the extension and MIME type of files such as .xlsx and .xls. However, when I use the from_string() function, it returns .docx as the extension and 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' as the MIME type. from_file() returns the correct result, but my use case requires from_string(). I also tried writing the .xlsx data to a temp file, but from_file() still returns .docx unless I explicitly set the suffix of the temp file to .xlsx.

In [24]: xl_file = "/Users/aashutosh.chaubey/Desktop/static_data/font.xlsx"

In [26]: da = open(xl_file, "rb").read()

In [27]: from_string(da)
Out[27]: '.docx'

In [28]: from_string(da, mime=True)
Out[28]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

In [35]: tmp_fl = tempfile.NamedTemporaryFile(delete=False, suffix='')

In [36]: tmp_fl.write(da)
Out[36]: 6109

In [37]: tmp_fl.name
Out[37]: '/var/folders/fd/_pnhhl3n4d9bnxngcxg8y0xw0000gr/T/tmps9uejuxr'

In [39]: from_file(tmp_fl.name)
Out[39]: '.docx'

In [40]: tmp_fl = tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx')

In [41]: tmp_fl.write(da)
Out[41]: 6109

In [42]: tmp_fl.name
Out[42]: '/var/folders/fd/_pnhhl3n4d9bnxngcxg8y0xw0000gr/T/tmpqa54ado8.xlsx'

In [43]: from_file(tmp_fl.name)
Out[43]: '.xlsx'

In [44]: from_file(tmp_fl.name, mime=True)
Out[44]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

In [45]: xl_file = "/Users/aashutosh.chaubey/Desktop/static_data/font.xlsx"

In [46]: da = open(xl_file, "rb").read()

In [47]: from_string(da)
Out[47]: '.docx'

In [48]: from_file(xl_file)
Out[48]: '.xlsx'

In [49]: from_file(xl_file, mime=True)
Out[49]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

In [50]: from_string(da, mime=True)
Out[50]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

I ran magic_stream() on the file and received the following output:

In [6]: da = open(xl_file, "rb")

In [7]: magic_stream(da)
Out[7]:
[PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.docx', mime_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document', name='MS Office Open XML Format Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.pptx', mime_type='application/vnd.openxmlformats-officedocument.presentationml.presentation', name='MS Office Open XML Format Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlsx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', name='MS Office Open XML Format Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlsb', mime_type='application/vnd.ms-excel.sheet.binary.macroenabled.12', name='Microsoft Excel - Binary Workbook', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xltm', mime_type='application/vnd.ms-excel.template.macroenabled.12', name='Microsoft Excel - Macro-Enabled Template File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xltx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.template', name='Microsoft Office - OOXML - Spreadsheet Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlam', mime_type='application/vnd.ms-excel.addin.macroenabled.12', name='Microsoft Excel - Add-In File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.docm', mime_type='application/vnd.ms-word.document.macroEnabled.12', name='Microsoft Word - Macro-Enabled Document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.dotx', mime_type='application/vnd.openxmlformats-officedocument.wordprocessingml.template', name='Microsoft Office - OOXML - Word Document Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.dotm', mime_type='application/vnd.ms-word.template.macroenabled.12', name='Microsoft Word - Macro-Enabled Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.pptm', mime_type='application/vnd.ms-powerpoint.presentation.macroEnabled.12', name='Microsoft PowerPoint - Macro-Enabled Presentation File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.potx', mime_type='application/vnd.openxmlformats-officedocument.presentationml.template', name='Microsoft Office - OOXML - Presentation Template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.potm', mime_type='application/vnd.ms-powerpoint.template.macroenabled.12', name='Microsoft PowerPoint - Macro-Enabled Template File', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xlsm', mime_type='application/vnd.ms-excel.sheet.macroEnabled.12', name='Microsoft Excel - Macro-Enabled Workbook', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.zip', mime_type='application/zip', name='PKZIP Archive file', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xpi', mime_type='', name='Mozilla Browser Archive', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.wmz', mime_type='', name='Windows Media compressed skin file', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xpt', mime_type='', name='eXact Packager Models', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.kwd', mime_type='', name='KWord document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.xps', mime_type='', name='XML paper specification file', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.jar', mime_type='application/java-archive', name='Java archive', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.odt', mime_type='application/vnd.oasis.opendocument.text', name='OpenDocument template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.odp', mime_type='application/vnd.oasis.opendocument.presentation', name='OpenDocument template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.ott', mime_type='application/vnd.oasis.opendocument.text-template', name='OpenDocument template', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.sxd', mime_type='', name='OpenOffice documents', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.sxi', mime_type='', name='OpenOffice documents', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.sxw', mime_type='', name='OpenOffice documents', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.apk', mime_type='', name='Android Application Package', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.cbz', mime_type='application/vnd.comicbook+zip', name='Comic Book Archive (ZIP compression)', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.fb2.zip', mime_type='application/fictionbook2+zip', name='FictionBook 2 eBook file (Zip compressed)', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.fbz', mime_type='application/fictionbook2+zip', name='FictionBook 2 eBook file (Zip compressed)', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'PK\x03\x04', offset=0, extension='.fb3', mime_type='application/fictionbook3+zip', name='FictionBook 3 eBook file', confidence=0.4)]
@cdgriffith
Owner

Puremagic detects files by only two methods: the magic number, which as you can see is the same PK\x03\x04 for all of those file types, and the file extension.

Unfortunately, Microsoft decided that XLSX and other Office files should just be ZIP archives (you can take any XLSX file and unzip it to see), so an XLSX file matches the same magic number as any other ZIP-based format.
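
To illustrate the point, here is a minimal sketch using only the Python standard library. It builds two tiny in-memory ZIP archives that stand in for an .xlsx and a .docx file (they are placeholders, not valid Office documents; `make_zip` and the member contents are made up for this example) and shows that both begin with the same four bytes.

```python
import io
import zipfile

# Local file header signature that starts any non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def make_zip(member_name: str) -> bytes:
    """Build a tiny in-memory ZIP holding a single placeholder member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, "<placeholder/>")
    return buffer.getvalue()


# Stand-ins for real Office files: only the member path differs.
fake_xlsx = make_zip("xl/workbook.xml")
fake_docx = make_zip("word/document.xml")

# Both archives start with the same magic number, so a check that only
# looks at the leading bytes cannot tell them apart.
print(fake_xlsx[:4] == fake_docx[:4] == ZIP_MAGIC)
```

With no file name to supply an extension hint, the leading bytes are all a magic-number check has to go on, which is why every ZIP-based candidate ends up with the same confidence.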

Determining the actual file type would need more advanced file scanning techniques; see #3.
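
As a possible workaround outside of puremagic, the bytes can be opened as a ZIP archive and checked for the main part of each Office format. This is only a sketch under stated assumptions: the helper name `guess_ooxml_extension` and the `OOXML_MAIN_PARTS` mapping are hypothetical, it relies on the conventional part names (`xl/workbook.xml`, `word/document.xml`, `ppt/presentation.xml`), and it does not separate macro-enabled or template variants such as .xlsm or .xltx, which would require reading the content types declared in `[Content_Types].xml`.

```python
import io
import zipfile
from typing import Optional

# Hypothetical mapping for this sketch: conventional main-part path -> extension.
OOXML_MAIN_PARTS = {
    "xl/workbook.xml": ".xlsx",
    "word/document.xml": ".docx",
    "ppt/presentation.xml": ".pptx",
}


def guess_ooxml_extension(data: bytes) -> Optional[str]:
    """Return a likely OOXML extension, or None if no known main part is found."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None  # not a ZIP archive at all

    # OOXML packages carry a [Content_Types].xml part at the archive root.
    if "[Content_Types].xml" not in names:
        return None

    for part, extension in OOXML_MAIN_PARTS.items():
        if part in names:
            return extension
    return None
```

A caller could try this helper first and fall back to from_string() when it returns None, so that non-Office input is still handled by puremagic.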
