fix(CVE-2024-39705): update to latest nltk version (#3512)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.
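
A minimal sketch of how an environment could be checked against the version floor named above. It assumes, as the summary states, that `nltk==3.8.2` is the first release carrying the fix; the helper names (`parse_release`, `is_patched`, `installed_nltk_is_patched`) are hypothetical and are not part of `unstructured` or `nltk`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Assumption taken from the commit summary: 3.8.2 is the first patched release.
MIN_PATCHED: Tuple[int, ...] = (3, 8, 2)


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '3.8.2' -> (3, 8, 2).

    Pre-release suffixes such as 'rc1' are ignored by this sketch.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_patched(version: str, minimum: Tuple[int, ...] = MIN_PATCHED) -> bool:
    """True when `version` is at or above `minimum`."""
    release = parse_release(version)
    # Pad with zeros so that '3.8' compares as (3, 8, 0).
    width = max(len(release), len(minimum))
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum


def installed_nltk_is_patched() -> Optional[bool]:
    """Check the installed nltk distribution; None if nltk is not installed."""
    try:
        return is_patched(metadata.version("nltk"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_nltk_is_patched())
```

The comparison is done on plain integer tuples to keep the sketch dependency-free; a project that already depends on `packaging` might prefer its `Version` class instead.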

---------

Co-authored-by: Christine Straub <[email protected]>
MthwRobinson and christinestraub authored Aug 13, 2024
1 parent 1158d8f commit 7437f0a
Showing 32 changed files with 57 additions and 75 deletions.
16 changes: 5 additions & 11 deletions .github/workflows/ci.yml
@@ -120,8 +120,6 @@ jobs:
matrix:
python-version: ["3.9","3.10","3.11", "3.12"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup, lint]
steps:
- uses: actions/checkout@v4
@@ -161,7 +159,6 @@ jobs:
python-version: ["3.10"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
UNSTRUCTURED_HF_TOKEN: ${{ secrets.HF_TOKEN }}
needs: [setup, lint]
steps:
@@ -179,6 +176,7 @@ jobs:
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
run: |
source .venv/bin/activate
make install-nltk-models
sudo apt-get update
sudo apt-get install -y poppler-utils
make install-pandoc install-test
@@ -193,8 +191,6 @@ jobs:
matrix:
python-version: ["3.10"]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup, lint]
steps:
- uses: actions/checkout@v4
@@ -211,6 +207,7 @@ jobs:
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
run: |
source .venv/bin/activate
make install-nltk-models
make test-no-extras CI=true
test_unit_dependency_extras:
@@ -276,8 +273,6 @@ jobs:
matrix:
python-version: [ "3.9","3.10" ]
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [ setup_ingest, lint ]
steps:
# actions/checkout MUST come before auth
@@ -296,6 +291,7 @@ jobs:
- name: Test Ingest (unit)
run: |
source .venv/bin/activate
make install-nltk-models
PYTHONPATH=. pytest test_unstructured_ingest/unit
@@ -304,8 +300,6 @@ jobs:
matrix:
python-version: ["3.9","3.10"]
runs-on: ubuntu-latest-m
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup_ingest, lint]
steps:
# actions/checkout MUST come before auth
@@ -373,6 +367,7 @@ jobs:
CI: "true"
run: |
source .venv/bin/activate
make install-nltk-models
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
@@ -391,8 +386,6 @@ jobs:
matrix:
python-version: ["3.9","3.10"]
runs-on: ubuntu-latest-m
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup_ingest, lint]
steps:
# actions/checkout MUST come before auth
@@ -445,6 +438,7 @@ jobs:
CI: "true"
run: |
source .venv/bin/activate
make install-nltk-models
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.15.2-dev8
## 0.15.2

### Enhancements

@@ -10,6 +10,7 @@

### Fixes

* **Updates NLTK data file for compatibility with `nltk>=3.8.2`**. The NLTK data file now contains `punkt_tab`, making it possible to upgrade to `nltk>=3.8.2`. The `nltk==3.8.2` release patches CVE-2024-39705.
* **Renames Astra to Astra DB** Conforms with DataStax internal naming conventions.
* **Accommodate single-column CSV files.** Resolves a limitation of `partition_csv()` where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
* **Accommodate `image/jpg` in PPTX as alias for `image/jpeg`.** Resolves problem partitioning PPTX files having an invalid `image/jpg` (should be `image/jpeg`) MIME-type in the `[Content_Types].xml` member of the PPTX Zip archive.
3 changes: 1 addition & 2 deletions Makefile
@@ -38,8 +38,7 @@ install-huggingface:

.PHONY: install-nltk-models
install-nltk-models:
python3 -c "import nltk; nltk.download('punkt')"
python3 -c "import nltk; nltk.download('averaged_perceptron_tagger')"
python3 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()"

.PHONY: install-test
install-test:
4 changes: 2 additions & 2 deletions requirements/base.txt
@@ -57,7 +57,7 @@ jsonpath-python==1.0.6
# via unstructured-client
langdetect==1.0.9
# via -r ./base.in
lxml==5.2.2
lxml==5.3.0
# via -r ./base.in
marshmallow==3.21.3
# via
@@ -69,7 +69,7 @@ mypy-extensions==1.0.0
# unstructured-client
nest-asyncio==1.6.0
# via unstructured-client
nltk==3.8.1
nltk==3.8.2
# via -r ./base.in
numpy==1.26.4
# via -r ./base.in
4 changes: 2 additions & 2 deletions requirements/dev.txt
@@ -423,7 +423,7 @@ virtualenv==20.26.3
# via pre-commit
wcwidth==0.2.13
# via prompt-toolkit
webcolors==24.6.0
webcolors==24.8.0
# via jsonschema
webencodings==0.5.1
# via
@@ -437,7 +437,7 @@ wheel==0.44.0
# pip-tools
widgetsnbextension==4.0.11
# via ipywidgets
zipp==3.19.2
zipp==3.20.0
# via importlib-metadata

# The following packages are considered to be unsafe in a requirements file:
2 changes: 1 addition & 1 deletion requirements/extra-docx.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-docx.in
#
lxml==5.2.2
lxml==5.3.0
# via
# -c ./base.txt
# python-docx
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -8,5 +8,5 @@ importlib-metadata==8.2.0
# via markdown
markdown==3.6
# via -r ./extra-markdown.in
zipp==3.19.2
zipp==3.20.0
# via importlib-metadata
2 changes: 1 addition & 1 deletion requirements/extra-odt.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-odt.in
#
lxml==5.2.2
lxml==5.3.0
# via
# -c ./base.txt
# python-docx
6 changes: 3 additions & 3 deletions requirements/extra-paddleocr.txt
@@ -78,7 +78,7 @@ lanms-neo==1.0.2
# via unstructured-paddleocr
lazy-loader==0.4
# via scikit-image
lxml==5.2.2
lxml==5.3.0
# via
# -c ./base.txt
# premailer
@@ -191,7 +191,7 @@ sniffio==1.3.1
# -c ./base.txt
# anyio
# httpx
tifffile==2024.7.24
tifffile==2024.8.10
# via scikit-image
tqdm==4.66.5
# via
@@ -208,5 +208,5 @@ urllib3==1.26.19
# -c ././deps/constraints.txt
# -c ./base.txt
# requests
zipp==3.19.2
zipp==3.20.0
# via importlib-resources
6 changes: 3 additions & 3 deletions requirements/extra-pdf-image.txt
@@ -86,7 +86,7 @@ kiwisolver==1.4.5
# via matplotlib
layoutparser==0.3.4
# via unstructured-inference
lxml==5.2.2
lxml==5.3.0
# via
# -c ./base.txt
# pikepdf
@@ -249,7 +249,7 @@ six==1.16.0
# via
# -c ./base.txt
# python-dateutil
sympy==1.13.1
sympy==1.13.2
# via
# onnxruntime
# torch
@@ -301,5 +301,5 @@ wrapt==1.16.0
# -c ././deps/constraints.txt
# -c ./base.txt
# deprecated
zipp==3.19.2
zipp==3.20.0
# via importlib-resources
2 changes: 1 addition & 1 deletion requirements/extra-pptx.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-pptx.in
#
lxml==5.2.2
lxml==5.3.0
# via python-pptx
pillow==10.4.0
# via python-pptx
2 changes: 1 addition & 1 deletion requirements/huggingface.txt
@@ -85,7 +85,7 @@ six==1.16.0
# via
# -c ./base.txt
# langdetect
sympy==1.13.1
sympy==1.13.2
# via torch
tokenizers==0.19.1
# via
2 changes: 1 addition & 1 deletion requirements/ingest/azure.txt
@@ -8,7 +8,7 @@ adlfs==2024.7.0
# via -r ./ingest/azure.in
aiohappyeyeballs==2.3.5
# via aiohttp
aiohttp==3.10.2
aiohttp==3.10.3
# via adlfs
aiosignal==1.3.1
# via aiohttp
4 changes: 2 additions & 2 deletions requirements/ingest/chroma.txt
@@ -194,7 +194,7 @@ sniffio==1.3.1
# httpx
starlette==0.37.2
# via fastapi
sympy==1.13.1
sympy==1.13.2
# via onnxruntime
tenacity==8.5.0
# via
@@ -247,7 +247,7 @@ wrapt==1.16.0
# -c ./ingest/../deps/constraints.txt
# deprecated
# opentelemetry-instrumentation
zipp==3.19.2
zipp==3.20.0
# via
# importlib-metadata
# importlib-resources
2 changes: 1 addition & 1 deletion requirements/ingest/clarifai.txt
@@ -15,7 +15,7 @@ charset-normalizer==3.3.2
# requests
clarifai==10.7.0
# via -r ./ingest/clarifai.in
clarifai-grpc==10.7.0
clarifai-grpc==10.7.1
# via clarifai
contextlib2==21.6.0
# via schema
2 changes: 1 addition & 1 deletion requirements/ingest/discord.txt
@@ -6,7 +6,7 @@
#
aiohappyeyeballs==2.3.5
# via aiohttp
aiohttp==3.10.2
aiohttp==3.10.3
# via discord-py
aiosignal==1.3.1
# via aiohttp
4 changes: 2 additions & 2 deletions requirements/ingest/elasticsearch.txt
@@ -6,7 +6,7 @@
#
aiohappyeyeballs==2.3.5
# via aiohttp
aiohttp==3.10.2
aiohttp==3.10.3
# via elasticsearch
aiosignal==1.3.1
# via aiohttp
@@ -19,7 +19,7 @@ certifi==2024.7.4
# -c ./ingest/../base.txt
# -c ./ingest/../deps/constraints.txt
# elastic-transport
elastic-transport==8.13.1
elastic-transport==8.15.0
# via elasticsearch
elasticsearch[async]==8.14.0
# via -r ./ingest/elasticsearch.in
4 changes: 2 additions & 2 deletions requirements/ingest/embed-aws-bedrock.txt
@@ -6,7 +6,7 @@
#
aiohappyeyeballs==2.3.5
# via aiohttp
aiohttp==3.10.2
aiohttp==3.10.3
# via
# langchain
# langchain-community
@@ -70,7 +70,7 @@ langchain-core==0.2.29
# langchain-text-splitters
langchain-text-splitters==0.2.2
# via langchain
langsmith==0.1.98
langsmith==0.1.99
# via
# langchain
# langchain-community
4 changes: 2 additions & 2 deletions requirements/ingest/embed-huggingface.txt
@@ -49,7 +49,7 @@ langchain-core==0.2.29
# via langchain-huggingface
langchain-huggingface==0.0.3
# via -r ./ingest/embed-huggingface.in
langsmith==0.1.98
langsmith==0.1.99
# via langchain-core
markupsafe==2.1.5
# via jinja2
@@ -107,7 +107,7 @@ scipy==1.11.3
# sentence-transformers
sentence-transformers==3.0.1
# via langchain-huggingface
sympy==1.13.1
sympy==1.13.2
# via torch
tenacity==8.5.0
# via langchain-core
2 changes: 1 addition & 1 deletion requirements/ingest/embed-octoai.txt
@@ -49,7 +49,7 @@ idna==3.7
# requests
jiter==0.5.0
# via openai
openai==1.40.2
openai==1.40.3
# via -r ./ingest/embed-octoai.in
pydantic==2.8.2
# via openai
6 changes: 3 additions & 3 deletions requirements/ingest/embed-openai.txt
@@ -55,11 +55,11 @@ jsonpointer==3.0.0
# via jsonpatch
langchain-core==0.2.29
# via langchain-openai
langchain-openai==0.1.20
langchain-openai==0.1.21
# via -r ./ingest/embed-openai.in
langsmith==0.1.98
langsmith==0.1.99
# via langchain-core
openai==1.40.2
openai==1.40.3
# via langchain-openai
orjson==3.10.7
# via langsmith
4 changes: 2 additions & 2 deletions requirements/ingest/embed-vertexai.txt
@@ -6,7 +6,7 @@
#
aiohappyeyeballs==2.3.5
# via aiohttp
aiohttp==3.10.2
aiohttp==3.10.3
# via
# langchain
# langchain-community
@@ -120,7 +120,7 @@ langchain-google-vertexai==1.0.8
# via -r ./ingest/embed-vertexai.in
langchain-text-splitters==0.2.2
# via langchain
langsmith==0.1.98
langsmith==0.1.99
# via
# langchain
# langchain-community