fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)

### Summary

Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.
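
To confirm that an environment actually picked up the bump, a check along the following lines could be used. This is a minimal sketch, not part of the commit: the helper names are hypothetical, and the `3.9.1` floor is taken from this commit's pin rather than from the advisory's exact fixed version.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn a release string such as '3.9.1' into (3, 9, 1).

    Only leading numeric components are read; suffixes such as 'rc1' or
    '.post1' are ignored, so this is not a full PEP 440 parser.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "3.9.1") -> bool:
    """Return True if `installed` is at least `minimum` (assumed floor)."""
    return parse_release(installed) >= parse_release(minimum)


def installed_nltk_is_patched() -> Optional[bool]:
    """Check the installed nltk distribution; None means nltk is absent."""
    try:
        return meets_minimum(metadata.version("nltk"))
    except metadata.PackageNotFoundError:
        return None
```

For example, `meets_minimum("3.8.1")` is false while `meets_minimum("3.9.1")` is true, so the old pin in `requirements/base.txt` would be flagged and the new one accepted.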

### Testing

After the bump, CI should pass. Additionally, we verified locally that
processing a long `.docx` file takes about as long as we would expect.

```python
In [1]: from unstructured.partition.auto import partition

In [2]: filename = "test-doc.docx"

In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
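
The same kind of measurement can be taken outside IPython with the standard-library `timeit` module. The sketch below is an illustration only: `time_call` is a hypothetical helper, and a trivial stand-in workload replaces `partition` so the snippet runs without `unstructured` installed.

```python
import statistics
import timeit
from typing import Callable, Tuple


def time_call(
    func: Callable[[], object], repeat: int = 7, number: int = 1
) -> Tuple[float, float]:
    """Return (mean, standard deviation) of seconds per call over `repeat` runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    # Population standard deviation is a choice made for this sketch.
    return statistics.mean(per_call), statistics.pstdev(per_call)


if __name__ == "__main__":
    # Stand-in workload; to time the real call, pass something like
    # lambda: partition(filename="test-doc.docx") instead.
    mean, std = time_call(lambda: sum(range(100_000)))
    print(f"{mean:.4f} s ± {std:.4f} s per loop")
```

Running the helper before and after a dependency bump gives two comparable means, which is enough to spot a regression of the size described in the summary.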
MthwRobinson authored Aug 19, 2024
1 parent a861ed8 commit 1f8030d
Showing 35 changed files with 112 additions and 101 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,11 +1,12 @@
## 0.15.6-dev1
## 0.15.6

### Enhancements

### Features

### Fixes

* **Bump to NLTK 3.9.x** Bumps to the latest `nltk` version to resolve CVE.
* **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
* **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.

6 changes: 3 additions & 3 deletions requirements/base.txt
@@ -69,7 +69,7 @@ mypy-extensions==1.0.0
# unstructured-client
nest-asyncio==1.6.0
# via unstructured-client
nltk==3.8.1
nltk==3.9.1
# via -r ./base.in
numpy==1.26.4
# via -r ./base.in
@@ -110,7 +110,7 @@ sniffio==1.3.1
# via
# anyio
# httpx
soupsieve==2.5
soupsieve==2.6
# via beautifulsoup4
tabulate==0.9.0
# via -r ./base.in
@@ -129,7 +129,7 @@ typing-inspect==0.9.0
# via
# dataclasses-json
# unstructured-client
unstructured-client==0.25.4
unstructured-client==0.25.5
# via
# -c ././deps/constraints.txt
# -r ./base.in
3 changes: 3 additions & 0 deletions requirements/deps/constraints.txt
@@ -56,3 +56,6 @@ fsspec==2024.5.0
wrapt>=1.14.0

langchain-community>=0.2.5

grpcio==1.64.3
label-studio-sdk==0.0.34
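
The new constraint pins are what pull `grpcio` back to `1.64.3` in the compiled files further down. A rough way to check that a compiled requirements file honors such constraints is sketched below; the helper names are hypothetical, and the parser only understands plain `name==version` lines (no environment markers, no full PEP 503 name normalization).

```python
from typing import Dict, List


def parse_pins(text: str) -> Dict[str, str]:
    """Collect `name==version` pins, skipping comments and blank lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Drop extras such as `uvicorn[standard]` and lowercase the name.
        name = name.split("[", 1)[0].strip().lower()
        pins[name] = version.strip()
    return pins


def constraint_violations(constraints: str, compiled: str) -> List[str]:
    """List packages whose compiled pin differs from the constraint pin."""
    wanted = parse_pins(constraints)
    actual = parse_pins(compiled)
    return [
        f"{name}: expected {version}, found {actual[name]}"
        for name, version in wanted.items()
        if name in actual and actual[name] != version
    ]


CONSTRAINTS = "grpcio==1.64.3\nlabel-studio-sdk==0.0.34\n"
OLD_COMPILED = "grpcio==1.65.4\n    # via clarifai-grpc\n"
NEW_COMPILED = "grpcio==1.64.3\n    # via\n    #   clarifai-grpc\n"

print(constraint_violations(CONSTRAINTS, OLD_COMPILED))
print(constraint_violations(CONSTRAINTS, NEW_COMPILED))
```

Packages that appear in the constraints but not in a given compiled file are skipped on purpose, since a constraint only applies when the package is actually required.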
4 changes: 2 additions & 2 deletions requirements/dev.txt
@@ -310,7 +310,7 @@ pyyaml==6.0.2
# -c ./test.txt
# jupyter-events
# pre-commit
pyzmq==26.1.0
pyzmq==26.1.1
# via
# ipykernel
# jupyter-client
@@ -360,7 +360,7 @@ sniffio==1.3.1
# -c ./base.txt
# anyio
# httpx
soupsieve==2.5
soupsieve==2.6
# via
# -c ./base.txt
# beautifulsoup4
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -6,7 +6,7 @@
#
importlib-metadata==8.2.0
# via markdown
markdown==3.6
markdown==3.7
# via -r ./extra-markdown.in
zipp==3.20.0
# via importlib-metadata
8 changes: 4 additions & 4 deletions requirements/extra-paddleocr.txt
@@ -13,7 +13,7 @@ astor==0.8.1
# via paddlepaddle
attrdict==2.0.1
# via unstructured-paddleocr
cachetools==5.4.0
cachetools==5.5.0
# via premailer
certifi==2024.7.4
# via
@@ -64,13 +64,13 @@ idna==3.7
# anyio
# httpx
# requests
imageio==2.34.2
imageio==2.35.1
# via
# imgaug
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
importlib-resources==6.4.0
importlib-resources==6.4.3
# via matplotlib
kiwisolver==1.4.5
# via matplotlib
@@ -83,7 +83,7 @@ lxml==5.3.0
# -c ./base.txt
# premailer
# unstructured-paddleocr
matplotlib==3.9.1.post1
matplotlib==3.9.2
# via imgaug
more-itertools==10.4.0
# via cssutils
17 changes: 9 additions & 8 deletions requirements/extra-pdf-image.txt
@@ -6,7 +6,7 @@
#
antlr4-python3-runtime==4.9.3
# via omegaconf
cachetools==5.4.0
cachetools==5.5.0
# via google-auth
certifi==2024.7.4
# via
@@ -48,7 +48,7 @@ fsspec==2024.5.0
# torch
google-api-core[grpc]==2.19.1
# via google-cloud-vision
google-auth==2.33.0
google-auth==2.34.0
# via
# google-api-core
# google-cloud-vision
@@ -58,13 +58,14 @@ googleapis-common-protos==1.63.2
# via
# google-api-core
# grpcio-status
grpcio==1.65.4
grpcio==1.64.3
# via
# -c ././deps/constraints.txt
# google-api-core
# grpcio-status
grpcio-status==1.62.3
# via google-api-core
huggingface-hub==0.24.5
huggingface-hub==0.24.6
# via
# timm
# tokenizers
@@ -76,7 +77,7 @@ idna==3.7
# via
# -c ./base.txt
# requests
importlib-resources==6.4.0
importlib-resources==6.4.3
# via matplotlib
iopath==0.1.10
# via layoutparser
@@ -92,7 +93,7 @@ lxml==5.3.0
# pikepdf
markupsafe==2.1.5
# via jinja2
matplotlib==3.9.1.post1
matplotlib==3.9.2
# via
# pycocotools
# unstructured-inference
@@ -120,7 +121,7 @@ onnx==1.16.2
# via
# -r ./extra-pdf-image.in
# unstructured-inference
onnxruntime==1.18.1
onnxruntime==1.19.0
# via unstructured-inference
opencv-python==4.8.0.76
# via
@@ -147,7 +148,7 @@ pdfminer-six==20231228
# via
# -r ./extra-pdf-image.in
# pdfplumber
pdfplumber==0.11.3
pdfplumber==0.11.4
# via layoutparser
pikepdf==9.1.1
# via -r ./extra-pdf-image.in
2 changes: 1 addition & 1 deletion requirements/huggingface.txt
@@ -27,7 +27,7 @@ fsspec==2024.5.0
# -c ././deps/constraints.txt
# huggingface-hub
# torch
huggingface-hub==0.24.5
huggingface-hub==0.24.6
# via
# tokenizers
# transformers
4 changes: 2 additions & 2 deletions requirements/ingest/azure.txt
@@ -6,9 +6,9 @@
#
adlfs==2024.7.0
# via -r ./ingest/azure.in
aiohappyeyeballs==2.3.5
aiohappyeyeballs==2.3.7
# via aiohttp
aiohttp==3.10.3
aiohttp==3.10.4
# via adlfs
aiosignal==1.3.1
# via aiohttp
2 changes: 1 addition & 1 deletion requirements/ingest/biomed.txt
@@ -10,7 +10,7 @@ beautifulsoup4==4.12.3
# bs4
bs4==0.0.2
# via -r ./ingest/biomed.in
soupsieve==2.5
soupsieve==2.6
# via
# -c ./ingest/../base.txt
# beautifulsoup4
21 changes: 11 additions & 10 deletions requirements/ingest/chroma.txt
@@ -20,7 +20,7 @@ backoff==2.2.1
# posthog
bcrypt==4.2.0
# via chromadb
cachetools==5.4.0
cachetools==5.5.0
# via google-auth
certifi==2024.7.4
# via
@@ -51,7 +51,7 @@ exceptiongroup==1.2.2
# via
# -c ./ingest/../base.txt
# anyio
fastapi==0.112.0
fastapi==0.112.1
# via chromadb
filelock==3.15.4
# via huggingface-hub
@@ -61,12 +61,13 @@ fsspec==2024.5.0
# via
# -c ./ingest/../deps/constraints.txt
# huggingface-hub
google-auth==2.33.0
google-auth==2.34.0
# via kubernetes
googleapis-common-protos==1.63.2
# via opentelemetry-exporter-otlp-proto-grpc
grpcio==1.65.4
grpcio==1.64.3
# via
# -c ./ingest/../deps/constraints.txt
# chromadb
# opentelemetry-exporter-otlp-proto-grpc
h11==0.14.0
@@ -76,7 +77,7 @@ h11==0.14.0
# uvicorn
httptools==0.6.1
# via uvicorn
huggingface-hub==0.24.5
huggingface-hub==0.24.6
# via tokenizers
humanfriendly==10.0
# via coloredlogs
@@ -88,7 +89,7 @@ idna==3.7
# requests
importlib-metadata==8.2.0
# via -r ./ingest/chroma.in
importlib-resources==6.4.0
importlib-resources==6.4.3
# via chromadb
kubernetes==30.1.0
# via chromadb
@@ -106,7 +107,7 @@ oauthlib==3.2.2
# via
# kubernetes
# requests-oauthlib
onnxruntime==1.18.1
onnxruntime==1.19.0
# via chromadb
opentelemetry-api==1.16.0
# via
@@ -192,7 +193,7 @@ sniffio==1.3.1
# -c ./ingest/../base.txt
# anyio
# httpx
starlette==0.37.2
starlette==0.38.2
# via fastapi
sympy==1.13.2
# via onnxruntime
@@ -231,9 +232,9 @@ urllib3==1.26.19
# -c ./ingest/../deps/constraints.txt
# kubernetes
# requests
uvicorn[standard]==0.30.5
uvicorn[standard]==0.30.6
# via chromadb
uvloop==0.19.0
uvloop==0.20.0
# via uvicorn
watchfiles==0.23.0
# via uvicorn
8 changes: 5 additions & 3 deletions requirements/ingest/clarifai.txt
@@ -15,14 +15,16 @@ charset-normalizer==3.3.2
# requests
clarifai==10.7.0
# via -r ./ingest/clarifai.in
clarifai-grpc==10.7.1
clarifai-grpc==10.7.2
# via clarifai
contextlib2==21.6.0
# via schema
googleapis-common-protos==1.63.2
# via clarifai-grpc
grpcio==1.65.4
# via clarifai-grpc
grpcio==1.64.3
# via
# -c ./ingest/../deps/constraints.txt
# clarifai-grpc
idna==3.7
# via
# -c ./ingest/../base.txt
2 changes: 1 addition & 1 deletion requirements/ingest/confluence.txt
@@ -42,7 +42,7 @@ six==1.16.0
# via
# -c ./ingest/../base.txt
# atlassian-python-api
soupsieve==2.5
soupsieve==2.6
# via
# -c ./ingest/../base.txt
# beautifulsoup4
6 changes: 3 additions & 3 deletions requirements/ingest/databricks-volumes.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./ingest/databricks-volumes.in
#
cachetools==5.4.0
cachetools==5.5.0
# via google-auth
certifi==2024.7.4
# via
@@ -15,9 +15,9 @@ charset-normalizer==3.3.2
# via
# -c ./ingest/../base.txt
# requests
databricks-sdk==0.29.0
databricks-sdk==0.30.0
# via -r ./ingest/databricks-volumes.in
google-auth==2.33.0
google-auth==2.34.0
# via databricks-sdk
idna==3.7
# via
4 changes: 1 addition & 3 deletions requirements/ingest/delta-table.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./ingest/delta-table.in
#
deltalake==0.18.2
deltalake==0.19.0
# via -r ./ingest/delta-table.in
fsspec==2024.5.0
# via
@@ -16,5 +16,3 @@ numpy==1.26.4
# pyarrow
pyarrow==17.0.0
# via deltalake
pyarrow-hotfix==0.6
# via deltalake
4 changes: 2 additions & 2 deletions requirements/ingest/discord.txt
@@ -4,9 +4,9 @@
#
# pip-compile ./ingest/discord.in
#
aiohappyeyeballs==2.3.5
aiohappyeyeballs==2.3.7
# via aiohttp
aiohttp==3.10.3
aiohttp==3.10.4
# via discord-py
aiosignal==1.3.1
# via aiohttp
6 changes: 3 additions & 3 deletions requirements/ingest/elasticsearch.txt
@@ -4,9 +4,9 @@
#
# pip-compile ./ingest/elasticsearch.in
#
aiohappyeyeballs==2.3.5
aiohappyeyeballs==2.3.7
# via aiohttp
aiohttp==3.10.3
aiohttp==3.10.4
# via elasticsearch
aiosignal==1.3.1
# via aiohttp
@@ -21,7 +21,7 @@ certifi==2024.7.4
# elastic-transport
elastic-transport==8.15.0
# via elasticsearch
elasticsearch[async]==8.14.0
elasticsearch[async]==8.15.0
# via -r ./ingest/elasticsearch.in
frozenlist==1.4.1
# via