0.15.2

@MthwRobinson MthwRobinson released this 13 Aug 13:40
· 132 commits to main since this release
7437f0a

Enhancements

  • Improve directory handling when extracting image blocks. The figures directory is no longer created when the extract_image_block_to_payload parameter is set to True.

Features

  • Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
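
For readers unfamiliar with how precision, recall, and f1-score relate per class, here is a minimal sketch. It assumes true-positive, false-positive, and false-negative counts have already been tallied for each class; the `ClassCounts` type and `per_class_scores` function are hypothetical names used for illustration, not part of the library's evaluation code, and average precision is left out because it requires ranked predictions.

```python
from dataclasses import dataclass


@dataclass
class ClassCounts:
    """Hypothetical container for one class's detection counts."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def per_class_scores(counts):
    """Return precision, recall, and f1-score for each class name."""
    scores = {}
    for name, c in counts.items():
        predicted = c.tp + c.fp  # everything the detector labeled as this class
        actual = c.tp + c.fn     # everything that truly belongs to this class
        precision = c.tp / predicted if predicted else 0.0
        recall = c.tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[name] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Classes with no predictions and no ground-truth instances score 0.0 here by convention; other conventions (such as skipping the class) are equally defensible.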

Fixes

  • Updates NLTK data file for compatibility with nltk>=3.8.2. The NLTK data file now contains punkt_tab, making it possible to upgrade to nltk>=3.8.2. Version 3.8.2 of nltk patches CVE-2024-39705.
  • Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
  • Accommodate single-column CSV files. Resolves a limitation of partition_csv() where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
  • Accommodate image/jpg in PPTX as alias for image/jpeg. Resolves a problem partitioning PPTX files that have an invalid image/jpg (should be image/jpeg) MIME-type in the [Content_Types].xml member of the PPTX Zip archive.
  • Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
  • Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
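
To illustrate the single-column CSV case from the fixes above, here is a minimal sketch of delimiter detection with a fallback, using only the standard-library `csv.Sniffer`. This is not the implementation inside partition_csv(); the `detect_delimiter` helper, its candidate set, and the comma default are assumptions made for the example.

```python
import csv


def detect_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of CSV text, tolerating single-column input.

    Hypothetical helper: csv.Sniffer raises csv.Error when none of the
    candidate delimiters appears consistently, which is what happens for
    a file that has only one column.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # No delimiter found: treat the input as one column. The value
        # returned here is arbitrary because no line will ever be split.
        return ","
```

With this fallback, a file such as `"name\nalice\nbob\n"` parses as one column per row instead of raising an error.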