0.16.1
Enhancements
Bump unstructured-inference to 0.7.39 and upgrade other dependencies
Round coordinates. Round coordinates to the nearest machine precision when computing bounding box overlaps in pdfminer_processing.py. This helps reduce nondeterministic behavior caused by machine-precision differences, which can change which bounding boxes are combined.
Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.
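To illustrate the coordinate-rounding enhancement above, here is a minimal sketch of rounding box coordinates before computing an overlap ratio. The helper names (rounded_bbox, overlap_ratio) and the ndigits value are hypothetical and are not taken from pdfminer_processing.py; the library's actual implementation may differ.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def rounded_bbox(bbox: BBox, ndigits: int = 6) -> BBox:
    """Round every coordinate so tiny floating-point noise is discarded.

    `ndigits` is an assumed tolerance chosen for illustration only.
    """
    x1, y1, x2, y2 = bbox
    return (round(x1, ndigits), round(y1, ndigits), round(x2, ndigits), round(y2, ndigits))


def overlap_ratio(a: BBox, b: BBox, ndigits: int = 6) -> float:
    """Fraction of box `a` covered by box `b`, computed on rounded coordinates."""
    a = rounded_bbox(a, ndigits)
    b = rounded_bbox(b, ndigits)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    if area_a <= 0:
        return 0.0
    # Width and height of the intersection rectangle, clamped at zero.
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return round(width * height / area_a, ndigits)


# Two boxes that differ only by floating-point noise yield the same ratio,
# so a threshold comparison on the ratio gives the same answer on every run.
box = (0.0, 0.0, 10.0, 10.0)
exact = overlap_ratio(box, (5.0, 0.0, 15.0, 10.0))
noisy = overlap_ratio(box, (5.0000000000001, 0.0, 15.0, 10.0))
```

Without rounding, the two ratios differ in the last few bits, which is enough to flip a comparison that sits exactly on a merge threshold.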
Features
Parsing HTML to Unstructured Elements and back
Fixes
Remove unsupported chipper model
Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
Minify text_as_html from XLSX. Previously .metadata.text_as_html for XLSX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.
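As a rough sketch of what "minified" table HTML means in the four text_as_html fixes above, the following builds a table with no inter-tag whitespace and no attributes. The function name minified_table_html is hypothetical, and collapsing runs of whitespace inside a cell is an assumption made for this example; it is not the library's code.

```python
from html import escape
from typing import Iterable, Sequence


def minified_table_html(rows: Iterable[Sequence[object]]) -> str:
    """Render rows of cells as table HTML with the fewest characters possible.

    No indentation, newlines, or attributes are emitted. Whitespace runs
    inside a cell are collapsed to a single space (an assumption here).
    """
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            text = " ".join(str(cell).split())  # normalize cell whitespace
            parts.append(f"<td>{escape(text)}</td>")  # escape &, <, >
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


html_out = minified_table_html([["a", "b"], ["1", " 2  x "]])
```

Output of this shape carries the same cell text as a pretty-printed table in far fewer characters, which is what keeps a chunker from splitting a small table across several chunks.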