Skip to content

0.15.8

Compare
Choose a tag to compare
@MthwRobinson MthwRobinson released this 27 Aug 15:55
· 113 commits to main since this release
4194a07

0.15.8

Enhancements

  • Bump unstructured.paddleocr to 2.8.1.0.

Features

  • Add MixedbreadAI embedder Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.

Fixes

  • Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
  • Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
  • Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.