fix: nltk data download path to prevent redundant nested directories (#3546)

Closes #3543.

### Summary
This PR addresses an issue with the NLTK data download process.
Previously, when downloading NLTK data, a nested "nltk_data" directory
was created within the parent "nltk_data" directory if the parent
directory already existed. This redundant directory structure led to two
significant problems:
- errors in checking if data had already been downloaded, potentially
causing redundant downloads in subsequent calls.
- failures in loading models from the downloaded NLTK data due to
incorrect path resolution.
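
The first problem can be reproduced with a small sketch in a temporary directory. The helper `is_downloaded` and the `tokenizers/punkt` resource path are illustrative assumptions, not code from the repository; the sketch only shows why a lookup under the parent "nltk_data" directory misses data that landed one level deeper.

```python
import os
import tempfile


def is_downloaded(data_dir: str, resource: str) -> bool:
    """Hypothetical check: does the resource exist directly under data_dir?"""
    return os.path.isdir(os.path.join(data_dir, resource))


with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "nltk_data")
    resource = os.path.join("tokenizers", "punkt")  # assumed resource layout

    # Simulate the buggy layout: data lands in nltk_data/nltk_data/...
    os.makedirs(os.path.join(data_dir, "nltk_data", resource))
    nested_found = is_downloaded(data_dir, resource)

    # Simulate the intended layout: data lands in nltk_data/...
    os.makedirs(os.path.join(data_dir, resource))
    flat_found = is_downloaded(data_dir, resource)

print(nested_found, flat_found)
```

With the nested layout the check fails even though the data is on disk, which is how a later call can end up downloading again.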

This fix modifies the NLTK data download logic to prevent creation of
unnecessary nested directories. If the download path ends with
"nltk_data" and that directory already exists, we now use the existing
directory instead of creating a new nested one.
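
A minimal sketch of this path handling, mirroring the `endswith` / `os.path.dirname` logic in the `unstructured/nlp/tokenize.py` change. The helper name `download_root` is hypothetical, and the sketch assumes the download itself creates an "nltk_data" folder under whatever root it is given.

```python
import os


def download_root(nltk_data_dir: str) -> str:
    """Return the directory to download into (hypothetical helper).

    If the path ends with "nltk_data", return its parent, on the assumption
    that the download creates its own "nltk_data" folder under that root.
    """
    if nltk_data_dir.endswith("nltk_data"):
        return os.path.dirname(nltk_data_dir)
    return nltk_data_dir


# A path ending in "nltk_data" is trimmed to its parent.
trimmed = download_root("/home/user/nltk_data")
# Any other path is returned unchanged.
unchanged = download_root("/opt/data")
# A trailing separator defeats the plain string check.
with_slash = download_root("/home/user/nltk_data/")
```

Because this is a plain string suffix check, a path with a trailing separator is left untouched, and a directory such as `my_nltk_data` would also match. Comparing `os.path.basename(os.path.normpath(path))` against `"nltk_data"` is one possible stricter variant, though that is not what this commit does.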

### Testing
CI should pass.
christinestraub authored Aug 20, 2024
1 parent 1f8030d commit 01dbc7b
Showing 3 changed files with 15 additions and 2 deletions.
11 changes: 10 additions & 1 deletion CHANGELOG.md
@@ -1,3 +1,13 @@
+## 0.15.7
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+* **Fix NLTK data download path to prevent nested directories**. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
+
## 0.15.6

### Enhancements
@@ -10,7 +20,6 @@
* **Update CI for `ingest-test-fixture-update-pr` to resolve NLTK model download errors.**
* **Synchronized text and html on `TableChunk` splits.** When a `Table` element is divided during chunking to fit the chunking window, `TableChunk.text` corresponds exactly with the table text in `TableChunk.metadata.text_as_html`, `.text_as_html` is always parseable HTML, and the table is split on even row boundaries whenever possible.


## 0.15.5

### Enhancements
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.15.6"  # pragma: no cover
+__version__ = "0.15.7"  # pragma: no cover
4 changes: 4 additions & 0 deletions unstructured/nlp/tokenize.py
@@ -70,6 +70,10 @@ def download_nltk_packages():
     if nltk_data_dir is None:
         raise OSError("NLTK data directory does not exist or is not writable.")

+    # Check if the path ends with "nltk_data" and remove it if it does
+    if nltk_data_dir.endswith("nltk_data"):
+        nltk_data_dir = os.path.dirname(nltk_data_dir)
+
     def sha256_checksum(filename: str, block_size: int = 65536):
         sha256 = hashlib.sha256()
         with open(filename, "rb") as f: