0.15.0

christinestraub released this 19 Jul 19:21
· 161 commits to main since this release
ec59abf

Enhancements

  • Improve text clearing process in email partitioning. The email partitioner now removes both =\n and =\r\n sequences during the clearing process. Previously, only =\n sequences were removed.
  • Bump unstructured.paddleocr to 2.8.0.1.
  • Refine HTML parser to accommodate a block element nested in a phrasing element. The HTML parser no longer raises on a block element (e.g. <p>, <div>) nested inside a phrasing element (e.g. <strong> or <cite>). Instead, it breaks the phrasing run (and therefore the element) at the start of the block item and begins a new phrasing run after the block item. This is consistent with how the browser determines element boundaries in this situation.
  • Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
  • Add CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
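
The email clearing change in the first item above can be pictured with a short sketch. This is an illustration only, not the partitioner's actual code: the helper name strip_soft_line_breaks is hypothetical, and the sketch assumes the clearing step amounts to deleting every "=" that is immediately followed by a line ending.

```python
import re


def strip_soft_line_breaks(text: str) -> str:
    """Remove '=' followed by a line ending, for both LF and CRLF endings.

    Hypothetical helper for illustration; not part of the unstructured API.
    """
    # "\r?" makes the carriage return optional, so "=\n" and "=\r\n" both match.
    return re.sub(r"=\r?\n", "", text)


# A message body wrapped with both kinds of line ending.
wrapped = "This line was wrap=\nped with LF and this one wi=\r\nth CRLF."
unwrapped = strip_soft_line_breaks(wrapped)
```

A pattern that matches only "=\n" would leave the "=\r\n" sequence untouched, which matches the previous behavior described in that item.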

Features

  • Add support for specifying OCR language to partition_pdf(). Extend language specification capability to PaddleOCR in addition to TesseractOCR. Users can now specify OCR languages for both OCR engines when using partition_pdf().
  • Add AstraDB source connector. Adds support for ingesting documents from AstraDB.
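
As a rough sketch of the OCR language feature above, the call below passes a list of language codes to partition_pdf(). The wrapper name partition_pdf_with_languages and the choice of the "ocr_only" strategy are assumptions made for illustration, and how each OCR engine interprets the codes is not covered here.

```python
from typing import List


def partition_pdf_with_languages(path: str, languages: List[str]):
    """Partition a PDF using OCR with the given language codes.

    Hypothetical wrapper for illustration; only partition_pdf() itself
    comes from the unstructured library.
    """
    # Imported lazily so this module still loads where unstructured is absent.
    from unstructured.partition.pdf import partition_pdf

    # "ocr_only" forces the OCR path; languages are forwarded unchanged.
    return partition_pdf(filename=path, strategy="ocr_only", languages=languages)
```

A call might look like partition_pdf_with_languages("example.pdf", ["eng", "deu"]), where example.pdf is a placeholder path and the codes are Tesseract-style identifiers.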

Fixes

  • Remedy error on Windows when nltk binaries are downloaded. Work around a quirk in the Windows implementation of tempfile.NamedTemporaryFile where accessing the temporary file by name raises PermissionError.
  • Move Astra embedded_dimension to write config.
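
The Windows quirk mentioned in the first fix can be sidestepped in several ways. The sketch below shows one common pattern, not necessarily the one used in this release: create the file with delete=False, close the handle before reopening the file by name, and remove it manually. The function name write_then_reopen is hypothetical.

```python
import os
import tempfile


def write_then_reopen(data: bytes) -> bytes:
    """Write data to a temporary file, then read it back by path.

    Hypothetical helper for illustration of a portable pattern.
    """
    # delete=False keeps the file on disk after the handle is closed.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(data)
        # Closing first releases the handle, so reopening by name also
        # works on Windows, where the open handle would block access.
        tmp.close()
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)  # manual cleanup, since delete=False


round_tripped = write_then_reopen(b"nltk data")
```

The trade-off is that cleanup becomes the caller's responsibility, which is why the removal sits in a finally block.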