fix: check ole storage content to differentiate filetypes (#3581)
### Summary

Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT, XLS,
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed an incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.

As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.
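
The storage-content check described in this summary can be sketched roughly as follows. This is an illustration of the idea, not the code from this commit: it assumes the `olefile` package is installed, the helper names `classify_ole_streams` and `guess_ole_type` are hypothetical, and the stream names are the conventional top-level streams of each format.

```python
from typing import Iterable, Optional

# Conventional top-level stream names (lower-cased) for each OLE-based format.
# NOTE: this mapping is an assumption for illustration, not taken from the commit.
_STREAM_TO_TYPE = {
    "worddocument": "DOC",
    "powerpoint document": "PPT",
    "workbook": "XLS",
    "book": "XLS",  # older Excel files
    "__properties_version1.0": "MSG",
}


def classify_ole_streams(stream_names: Iterable[str]) -> Optional[str]:
    """Return a file-type label for the first recognized stream name, else None."""
    for name in stream_names:
        file_type = _STREAM_TO_TYPE.get(name.lower())
        if file_type is not None:
            return file_type
    return None


def guess_ole_type(path: str) -> Optional[str]:
    """Hypothetical helper: inspect an OLE file's top-level streams with olefile."""
    import olefile  # third-party; imported lazily so the classifier works without it

    if not olefile.isOleFile(path):
        return None
    with olefile.OleFileIO(path) as ole:
        # listdir() returns each entry as a list of path components;
        # single-component entries are the top-level streams.
        top_level = [
            entry[0]
            for entry in ole.listdir(streams=True, storages=False)
            if len(entry) == 1
        ]
    return classify_ole_streams(top_level)
```

Under these assumptions, `guess_ole_type("test-file.msg")` should classify the file from its contents even when a MIME guess based on magic bytes is wrong, because all four formats share the same OLE container signature.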

### Testing

Use a test `.msg` file for which `filetype.guess_mime` returns
`'application/vnd.ms-excel'`:

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
MthwRobinson authored Aug 30, 2024
1 parent ddb6cb6 commit 6ba8135
Showing 53 changed files with 171 additions and 149 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -215,7 +215,7 @@ jobs:
strategy:
matrix:
python-version: ["3.10"]
-extra: ["csv", "docx", "odt", "markdown", "pypandoc", "msg", "pdf-image", "pptx", "xlsx"]
+extra: ["csv", "docx", "odt", "markdown", "pypandoc", "pdf-image", "pptx", "xlsx"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.15.9-dev1
+## 0.15.9

### Enhancements

@@ -8,6 +8,7 @@

### Fixes

+* **Check storage contents for OLE file type detection** Updates `detect_filetype` to check the content of OLE files to more reliably differentiate DOC, PPT, XLS, and MSG files. As part of this, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency.
* **Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile** Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of `file.name` of NamedTemporaryFiles have been replaced with TemporaryFileDirectory to avoid a known issue: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile

## 0.15.8
12 changes: 1 addition & 11 deletions Makefile
@@ -83,10 +83,6 @@ install-pypandoc:
install-markdown:
python3 -m pip install -r requirements/extra-markdown.txt

-.PHONY: install-msg
-install-msg:
-	python3 -m pip install -r requirements/extra-msg.txt
-
.PHONY: install-pdf-image
install-pdf-image:
python3 -m pip install -r requirements/extra-pdf-image.txt
@@ -100,7 +96,7 @@ install-xlsx:
python3 -m pip install -r requirements/extra-xlsx.txt

.PHONY: install-all-docs
-install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-msg install-pdf-image install-pptx install-xlsx
+install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-pdf-image install-pptx install-xlsx

.PHONY: install-all-ingest
install-all-ingest:
@@ -343,12 +339,6 @@ test-extra-epub:
test-extra-markdown:
PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_md.py

-.PHONY: test-extra-msg
-test-extra-msg:
-	# NOTE(scanny): exclude attachment test because partitioning attachments requires other extras
-	PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_msg.py \
-	  -k "not test_partition_msg_can_process_attachments"
-
.PHONY: test-extra-odt
test-extra-odt:
PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_odt.py
1 change: 1 addition & 0 deletions requirements/base.in
@@ -21,3 +21,4 @@ unstructured-client
wrapt
tqdm
psutil
+python-oxmsg
13 changes: 10 additions & 3 deletions requirements/base.txt
@@ -10,7 +10,7 @@ backoff==2.2.1
# via -r ./base.in
beautifulsoup4==4.12.3
# via -r ./base.in
-certifi==2024.7.4
+certifi==2024.8.30
# via
# httpcore
# httpx
@@ -23,7 +23,9 @@ charset-normalizer==3.3.2
# requests
# unstructured-client
click==8.1.7
-# via nltk
+# via
+# nltk
+# python-oxmsg
dataclasses-json==0.6.7
# via
# -r ./base.in
@@ -70,6 +72,8 @@ nltk==3.9.1
# via -r ./base.in
numpy==1.26.4
# via -r ./base.in
+olefile==0.47
+# via python-oxmsg
orderly-set==5.2.2
# via deepdiff
packaging==24.1
@@ -86,6 +90,8 @@ python-iso639==2024.4.27
# via -r ./base.in
python-magic==0.4.27
# via -r ./base.in
+python-oxmsg==0.0.1
+# via -r ./base.in
rapidfuzz==3.9.6
# via -r ./base.in
regex==2024.7.24
@@ -120,6 +126,7 @@ typing-extensions==4.12.2
# anyio
# emoji
# pypdf
+# python-oxmsg
# typing-inspect
# unstructured-client
typing-inspect==0.9.0
@@ -128,7 +135,7 @@ typing-inspect==0.9.0
# unstructured-client
unstructured-client==0.25.5
# via -r ./base.in
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ././deps/constraints.txt
# requests
6 changes: 3 additions & 3 deletions requirements/dev.txt
@@ -34,7 +34,7 @@ bleach==6.1.0
# via nbconvert
build==1.2.1
# via pip-tools
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./base.txt
# -c ./test.txt
@@ -130,7 +130,7 @@ jsonschema[format-nongpl]==3.2.0
# jupyter-events
# jupyterlab-server
# nbformat
-jupyter==1.1.0
+jupyter==1.1.1
# via -r ./dev.in
jupyter-client==7.4.9
# via
@@ -370,7 +370,7 @@ typing-extensions==4.12.2
# -c ./test.txt
# anyio
# ipython
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ././deps/constraints.txt
# -c ./base.txt
4 changes: 0 additions & 4 deletions requirements/extra-msg.in

This file was deleted.

18 changes: 0 additions & 18 deletions requirements/extra-msg.txt

This file was deleted.

4 changes: 2 additions & 2 deletions requirements/extra-paddleocr.txt
@@ -10,7 +10,7 @@ anyio==4.4.0
# httpx
astor==0.8.1
# via paddlepaddle
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./base.txt
# httpcore
@@ -170,7 +170,7 @@ typing-extensions==4.12.2
# paddlepaddle
unstructured-paddleocr==2.8.1.0
# via -r ./extra-paddleocr.in
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ././deps/constraints.txt
# -c ./base.txt
4 changes: 2 additions & 2 deletions requirements/extra-pdf-image.txt
@@ -8,7 +8,7 @@ antlr4-python3-runtime==4.9.3
# via omegaconf
cachetools==5.5.0
# via google-auth
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./base.txt
# requests
@@ -279,7 +279,7 @@ unstructured-inference==0.7.36
# via -r ./extra-pdf-image.in
unstructured-pytesseract==0.3.13
# via -r ./extra-pdf-image.in
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ././deps/constraints.txt
# -c ./base.txt
4 changes: 2 additions & 2 deletions requirements/huggingface.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./huggingface.in
#
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./base.txt
# requests
@@ -103,7 +103,7 @@ typing-extensions==4.12.2
# -c ./base.txt
# huggingface-hub
# torch
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ././deps/constraints.txt
# -c ./base.txt
4 changes: 2 additions & 2 deletions requirements/ingest/airtable.txt
@@ -6,7 +6,7 @@
#
annotated-types==0.7.0
# via pydantic
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# requests
@@ -36,7 +36,7 @@ typing-extensions==4.12.2
# pyairtable
# pydantic
# pydantic-core
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
4 changes: 2 additions & 2 deletions requirements/ingest/astradb.txt
@@ -14,7 +14,7 @@ cassandra-driver==3.29.1
# via cassio
cassio==0.1.8
# via astrapy
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# httpcore
@@ -91,7 +91,7 @@ typing-extensions==4.12.2
# via
# -c ./ingest/../base.txt
# anyio
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
4 changes: 2 additions & 2 deletions requirements/ingest/azure-cognitive-search.txt
@@ -10,7 +10,7 @@ azure-core==1.30.2
# via azure-search-documents
azure-search-documents==11.5.1
# via -r ./ingest/azure-cognitive-search.in
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# requests
@@ -38,7 +38,7 @@ typing-extensions==4.12.2
# -c ./ingest/../base.txt
# azure-core
# azure-search-documents
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
4 changes: 2 additions & 2 deletions requirements/ingest/azure.txt
@@ -27,7 +27,7 @@ azure-identity==1.17.1
# via adlfs
azure-storage-blob==12.22.0
# via adlfs
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# requests
@@ -94,7 +94,7 @@ typing-extensions==4.12.2
# azure-core
# azure-identity
# azure-storage-blob
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
4 changes: 2 additions & 2 deletions requirements/ingest/box.txt
@@ -10,7 +10,7 @@ boxfs==0.3.0
# via -r ./ingest/box.in
boxsdk[jwt]==3.13.0
# via boxfs
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# requests
@@ -51,7 +51,7 @@ six==1.16.0
# via
# -c ./ingest/../base.txt
# python-dateutil
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
4 changes: 2 additions & 2 deletions requirements/ingest/chroma.txt
@@ -24,7 +24,7 @@ build==1.2.1
# via chromadb
cachetools==5.5.0
# via google-auth
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# httpcore
@@ -268,7 +268,7 @@ typing-extensions==4.12.2
# starlette
# typer
# uvicorn
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
4 changes: 2 additions & 2 deletions requirements/ingest/clarifai.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./ingest/clarifai.in
#
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# requests
@@ -74,7 +74,7 @@ tqdm==4.66.5
# clarifai
tritonclient==2.41.1
# via clarifai
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
6 changes: 3 additions & 3 deletions requirements/ingest/confluence.txt
@@ -4,13 +4,13 @@
#
# pip-compile ./ingest/confluence.in
#
-atlassian-python-api==3.41.14
+atlassian-python-api==3.41.15
# via -r ./ingest/confluence.in
beautifulsoup4==4.12.3
# via
# -c ./ingest/../base.txt
# atlassian-python-api
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# requests
@@ -45,7 +45,7 @@ soupsieve==2.6
# via
# -c ./ingest/../base.txt
# beautifulsoup4
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
4 changes: 2 additions & 2 deletions requirements/ingest/databricks-volumes.txt
@@ -6,7 +6,7 @@
#
cachetools==5.5.0
# via google-auth
-certifi==2024.7.4
+certifi==2024.8.30
# via
# -c ./ingest/../base.txt
# requests
@@ -34,7 +34,7 @@ requests==2.32.3
# databricks-sdk
rsa==4.9
# via google-auth
-urllib3==1.26.19
+urllib3==1.26.20
# via
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt