0.16.1

cragwolfe released this 23 Oct 19:26

· 53 commits to main since this release

0b4c72a

0.16.1

Enhancements

Bump unstructured-inference to 0.7.39 and upgrade other dependencies
Round coordinates Round coordinates when computing bounding box overlaps in pdfminer_processing.py to nearest machine precision. This can help reduce underterministic behavior from machine precision that affects which bounding boxes to combine.
Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.

Features

Parsing HTML to Unstructured Elements and back

Fixes

Remove unsupported chipper model
Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
Minify text_as_html from XLSX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.

Assets 2