enhancement: improve text clearing process in email partitioning (#3422)

### Summary
Currently, the email partitioner strips only `=\n` sequences (quoted-printable
soft line breaks) when clearing text. However, email content sometimes contains
`=\r\n` sequences instead, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to strip both `=\n` and `=\r\n` during
the clearing process.
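
The difference can be sketched as follows. This is an illustration only: the helper names `clear_lf_only` and `clear_lf_and_crlf` are hypothetical and do not exist in the library, and the sample strings are made up.

```python
def clear_lf_only(content: str) -> str:
    # Behaviour before this PR: only "=\n" soft line breaks are removed.
    return "".join(content.split("=\n"))


def clear_lf_and_crlf(content: str) -> str:
    # Behaviour after this PR: "=\n" and "=\r\n" are both removed.
    return content.replace("=\n", "").replace("=\r\n", "")


lf_text = "Make sure to =\nRSVP!"
crlf_text = "Make sure to =\r\nRSVP!"

print(repr(clear_lf_only(lf_text)))        # 'Make sure to RSVP!'
print(repr(clear_lf_only(crlf_text)))      # 'Make sure to =\r\nRSVP!' (unchanged)
print(repr(clear_lf_and_crlf(crlf_text)))  # 'Make sure to RSVP!'
```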

### Testing

```
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"

elements = partition_email(
    filename=filename,
)
print(f"From filename: {elements[3].text}")

with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
    print(f"From spooled_temp_file: {elements[3].text}")
```

**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
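
One plausible explanation for the difference between the two code paths, which the PR does not spell out, is that a file opened in text mode has `\r\n` translated to `\n` on read, while bytes read from a file-like object keep their `\r\n` endings. The sketch below only demonstrates that standard Python behaviour on a made-up fragment; whether `partition_email` opens the `filename` path in text mode is an assumption here.

```python
import os
import tempfile

# Made-up quoted-printable fragment with CRLF line endings.
raw = b"Make sure to =\r\nRSVP!\r\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".eml") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode (default newline=None) translates "\r\n" to "\n" on read.
    with open(path, encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode returns the bytes untouched, so "\r\n" survives decoding.
    with open(path, "rb") as f:
        as_bytes = f.read().decode("utf-8")
finally:
    os.remove(path)

print("=\r\n" in as_text)   # False
print("=\r\n" in as_bytes)  # True
```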
christinestraub authored Jul 19, 2024
1 parent 1df7908 commit ec59abf
Showing 5 changed files with 56 additions and 4 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,8 @@
-## 0.15.0-dev16
+## 0.15.0

 ### Enhancements

+* **Improve text clearing process in email partitioning.** Updated the email partitioner to remove both `=\n` and `=\r\n` characters during the clearing process. Previously, only `=\n` characters were removed.
 * **Bump unstructured.paddleocr to 2.8.0.1.**
 * **Refine HTML parser to accommodate block element nested in phrasing.** HTML parser no longer raises on a block element (e.g. `<p>`, `<div>`) nested inside a phrasing element (e.g. `<strong>` or `<cite>`). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
 * **Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth.** A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
39 changes: 39 additions & 0 deletions example-docs/eml/family-day.eml
@@ -0,0 +1,39 @@
MIME-Version: 1.0
Date: Wed, 21 Dec 2022 10:28:53 -0600
Message-ID: <CAPgNNXQKR=[email protected]>
Subject: Family Day
From: Mallori Harrell <[email protected]>
To: Mallori Harrell <[email protected]>
Content-Type: multipart/alternative; boundary="0000000000005c115405f0590ce4"

--0000000000005c115405f0590ce4
Content-Type: text/plain; charset="UTF-8"
Hi All,
Get excited for our first annual family day!
There will be face painting, a petting zoo, funnel cake and more.
Make sure to RSVP!
Best.
--
Mallori Harrell
Unstructured Technologies
Data Scientist

--0000000000005c115405f0590ce4
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hi All,<div><br></div><div>Get excited for our first annua=
l family day!=C2=A0</div><div><br></div><div>There will be face painting, =
a petting zoo, funnel cake and more.</div><div><br></div><div>Make sure to =
RSVP!</div><div><br></div><div>Best.<br clear=3D"all"><div><br></div>-- <br=
><div dir=3D"ltr" class=3D"gmail_signature" data-smartmail=3D"gmail_signatu=
re"><div dir=3D"ltr">Mallori Harrell<div>Unstructured Technologies<br><div>=
Data Scientist</div><div><br></div></div></div></div></div></div>

--0000000000005c115405f0590ce4--
12 changes: 12 additions & 0 deletions test_unstructured/partition/test_email.py
@@ -2,6 +2,7 @@
 import email
 import os
 import pathlib
+import tempfile

 import pytest

@@ -230,6 +231,17 @@ def test_partition_email_from_file_rb_default_encoding(filename, expected_output
assert element.metadata.filename is None


+def test_partition_email_from_spooled_temp_file():
+    filename = example_doc_path("eml/family-day.eml")
+    with open(filename, "rb") as test_file:
+        spooled_temp_file = tempfile.SpooledTemporaryFile()
+        spooled_temp_file.write(test_file.read())
+        spooled_temp_file.seek(0)
+        elements = partition_email(file=spooled_temp_file)
+        assert len(elements) == 9
+        assert elements[3].text == "Make sure to RSVP!"
+
+
 def test_partition_email_from_text_file():
     filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-email.txt")
     with open(filename) as f:
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.15.0-dev16" # pragma: no cover
+__version__ = "0.15.0" # pragma: no cover
4 changes: 2 additions & 2 deletions unstructured/partition/email.py
@@ -416,8 +416,8 @@ def partition_email(
 # <li>Item 1</li>=
 # <li>Item 2<li>=
 # </ul>
-list_content = content.split("=\n")
-content = "".join(list_content)
+
+content = content.replace("=\n", "").replace("=\r\n", "")
 elements = partition_html(
     text=content,
     include_metadata=False,
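
For comparison only, and not what this commit does: Python's standard `quopri` module decodes quoted-printable data and treats `=` followed by either line ending as a soft line break. A minimal sketch on a made-up byte string:

```python
import quopri

# Made-up fragment with a CRLF soft line break and an escaped "=" (=3D).
encoded = b'<div dir=3D"ltr">Make sure to =\r\nRSVP!</div>'

# decodestring works on bytes; decode to str afterwards.
decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)  # <div dir="ltr">Make sure to RSVP!</div>
```

Unlike the chained `replace` calls above, a full decoder also translates escapes such as `=3D`, so it is only a drop-in alternative when the content is still fully quoted-printable encoded at this point.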