Skip to content

0.15.1

Compare
Choose a tag to compare
@christinestraub christinestraub released this 05 Aug 17:36
· 147 commits to main since this release
7e88744

0.15.1

Enhancements

  • Improve pdfminer embedded image extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in pdf partitioning.

Features

  • Update partition_eml and partition_msg to capture cc, bcc, and message_id fields Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning and Recipient elements are generated for cc and bcc when include_headers=True for email partitioning.
  • Mark ingest as deprecated Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.
  • Add pdf_hi_res_max_pages argument for partitioning, which allows rejecting PDF files that exceed this page number limit, when the high_res strategy is chosen. By default, it will allow parsing PDF files with an unlimited number of pages.

Fixes

  • Update HuggingFaceEmbeddingEncoder to use HuggingFaceEmbeddings from langchain_huggingface package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
  • Update OpenAIEmbeddingEncoder to use OpenAIEmbeddings from langchain-openai package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
  • Update import of Pinecone exception Adds compatibility for pinecone-client>=5.0.0
  • File-type detection catches non-existent file-path. detect_filetype() no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead FileNotFoundError is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
  • EML files specified as a file-path are detected correctly. Resolved a bug where an EML file submitted to partition() as a file-path was identified as TXT and partitioned using partition_text(). EML files specified by path are now identified and processed correctly, including processing any attachments.
  • A DOCX, PPTX, or XLSX file specified by path and ambiguously identified as MIME-type "application/octet-stream" is identified correctly. Resolves a shortcoming where a file specified by path immediately fell back to filename-extension based identification when misidentified as "application/octet-stream", either by asserted content type or a mis-guess by libmagic. An MS Office file misidentified in this way is now correctly identified regardless of its filename and whether it is specified by path or file-like object.
  • Textual content retrieved from a URL with gzip transport compression now partitions correctly. Resolves a bug where a textual file-type (such as Markdown) retrieved by passing a URL to partition() would raise when gzip compression was used for transport by the server.
  • A DOCX, PPTX, or XLSX content-type asserted on partition is confirmed or fixed. Resolves a bug where calling partition() with a swapped MS-Office content_type would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by partition() is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
  • DOC, PPT, XLS, and MSG files are now auto-detected correctly. Resolves a bug where DOC, PPT, and XLS files were auto-detected as MSG files under certain circumstances.