fix: parsing pdf error - new_cells as str has no "copy" (#3130)
Closes #3119.

### Testing
Parsing the attached PDF with the snippet below should succeed without the error reported in #3119.


[testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf)
```
from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"
with open(filename, "rb") as pdf_content:
    # Table structure inference is the code path that hit the error in #3119.
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )
print("\n\n".join([str(el) for el in elements]))
```
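One way to check the result beyond "no exception" is to list table elements that ended up without HTML. The sketch below is an illustration only and not part of this commit: `tables_without_html` is a hypothetical helper, and it assumes each element exposes `category` and `metadata.text_as_html`. It runs on stand-in objects so it does not need the PDF.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def tables_without_html(elements: Iterable[Any]) -> List[Any]:
    """Return table-like elements whose `metadata.text_as_html` is empty or missing."""
    missing = []
    for el in elements:
        # Assumption: table elements report the category "Table".
        if getattr(el, "category", None) != "Table":
            continue
        html = getattr(getattr(el, "metadata", None), "text_as_html", None)
        if not html:
            missing.append(el)
    return missing


# Stand-in objects (not real unstructured elements) to show the behaviour.
recognized = SimpleNamespace(
    category="Table", metadata=SimpleNamespace(text_as_html="<table></table>")
)
unrecognized = SimpleNamespace(
    category="Table", metadata=SimpleNamespace(text_as_html="")
)
paragraph = SimpleNamespace(
    category="NarrativeText", metadata=SimpleNamespace(text_as_html=None)
)

flagged = tables_without_html([recognized, unrecognized, paragraph])
```

With the real output, `tables_without_html(elements)` would point at tables the model failed to recognize, which after this fix carry an empty `text_as_html` rather than causing a crash.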
christinestraub authored Jun 3, 2024
1 parent 1b43102 commit 1dede50
Showing 3 changed files with 5 additions and 3 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.14.4-dev6
+## 0.14.4

### Enhancements

@@ -12,6 +12,7 @@

### Fixes

+* **Address the issue of unrecognized tables in `UnstructuredTableTransformerModel`** When a table is not recognized, the `element.metadata.text_as_html` attribute is set to an empty string.
* **Remove root handlers in ingest logger**. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.
* **Fix V2 S3 Destination Connector authentication** Fixes bugs with S3 Destination Connector where the connection config was neither registered nor properly deserialized.
* **Clarified dependence on particular version of `python-docx`** Pinned `python-docx` version to ensure a particular method `unstructured` uses is included.
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.14.4-dev6" # pragma: no cover
+__version__ = "0.14.4" # pragma: no cover
3 changes: 2 additions & 1 deletion unstructured/partition/pdf_image/ocr.py
@@ -280,7 +280,8 @@ def supplement_element_with_table_extraction(
cropped_image, ocr_tokens=table_tokens, result_format="cells"
)

-text_as_html = cells_to_html(tatr_cells)
+# NOTE(christine): `tatr_cells == ""` means that the table was not recognized
+text_as_html = "" if tatr_cells == "" else cells_to_html(tatr_cells)
element.text_as_html = text_as_html

if env_config.EXTRACT_TABLE_AS_CELLS:
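The guard in the hunk above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the library's code: `cells_to_html_sketch` is a toy stand-in for `cells_to_html`, the cell dictionaries use made-up `row`, `column`, and `text` keys, and `table_html_or_empty` is a hypothetical name. The only part taken from the commit is the rule that `""` means the table was not recognized.

```python
from typing import Any, Dict, List, Union

# Assumed shape of the model output: a list of cell dicts, or "" when no
# table was recognized (the sentinel described in the NOTE in the diff).
Cells = Union[List[Dict[str, Any]], str]


def cells_to_html_sketch(cells: List[Dict[str, Any]]) -> str:
    """Toy renderer (NOT the library's cells_to_html): one <tr> per row index."""
    rows: Dict[int, List[str]] = {}
    for cell in sorted(cells, key=lambda c: (c["row"], c["column"])):
        rows.setdefault(cell["row"], []).append(f"<td>{cell['text']}</td>")
    body = "".join(f"<tr>{''.join(tds)}</tr>" for _, tds in sorted(rows.items()))
    return f"<table>{body}</table>"


def table_html_or_empty(cells: Cells) -> str:
    # Same guard as the commit: skip HTML conversion for the "" sentinel,
    # so list-only operations are never attempted on a string.
    return "" if cells == "" else cells_to_html_sketch(cells)


example_cells = [
    {"row": 0, "column": 1, "text": "b"},
    {"row": 0, "column": 0, "text": "a"},
    {"row": 1, "column": 0, "text": "c"},
]
html = table_html_or_empty(example_cells)
empty = table_html_or_empty("")
```

Comparing against `""` explicitly, rather than testing truthiness, keeps an empty list of cells on the normal conversion path and reserves the shortcut for the unrecognized-table sentinel.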
