Commit
enhancement: improve text clearing process in email partitioning (#…3422)

### Summary
Currently, the email partitioner removes only `=\n` characters during the clearing process. However, email content sometimes contains `=\r\n` characters, especially when read from file-like objects such as `SpooledTemporaryFile` (the file type used in our API). This PR updates the email partitioner to remove both `=\n` and `=\r\n` characters during the clearing process.

### Testing
```
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"

# Partition directly from the file path.
elements = partition_email(
    filename=filename,
)
print(f"From filename: {elements[3].text}")

# Partition the same bytes through a SpooledTemporaryFile.
with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
    print(f"From spooled_temp_file: {elements[3].text}")
```

**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
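The change described in the summary can be sketched in a few lines. The helper name `strip_soft_line_breaks` and the regular expression are hypothetical, chosen for illustration; they are not taken from the library's code. The sketch also assumes, without confirmation from the commit, that the two code paths differ because Python's text-mode `open` translates `\r\n` to `\n` while a binary read keeps `\r\n` intact.

```python
import os
import re
import tempfile

# Hypothetical pattern and helper for illustration; not the library's actual code.
# Matches "=" followed by either LF or CRLF.
SOFT_LINE_BREAK = re.compile(r"=\r?\n")


def strip_soft_line_breaks(text: str) -> str:
    """Remove '=' + LF and '=' + CRLF sequences from the text."""
    return SOFT_LINE_BREAK.sub("", text)


raw = b"Make sure to =\r\nRSVP!"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.eml")
    with open(path, "wb") as f:
        f.write(raw)
    # Text mode applies universal-newline translation: CRLF becomes LF.
    with open(path, "r", encoding="utf-8") as f:
        text_mode = f.read()
    # Binary mode returns the bytes unchanged, so CRLF survives.
    with open(path, "rb") as f:
        binary_mode = f.read().decode("utf-8")

# Removing only "=\n" works for the text-mode string but not the binary one.
lf_only_text = text_mode.replace("=\n", "")
lf_only_binary = binary_mode.replace("=\n", "")

print(repr(lf_only_text))
print(repr(lf_only_binary))
print(repr(strip_soft_line_breaks(binary_mode)))
```

A single pattern with an optional `\r` handles both line-ending styles in one pass, which avoids depending on the order of two separate `replace` calls.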
1 parent 1df7908, commit ec59abf
Showing 5 changed files with 56 additions and 4 deletions.
@@ -0,0 +1,39 @@
+MIME-Version: 1.0
+Date: Wed, 21 Dec 2022 10:28:53 -0600
+Message-ID: <CAPgNNXQKR=[email protected]>
+Subject: Family Day
+From: Mallori Harrell <[email protected]>
+To: Mallori Harrell <[email protected]>
+Content-Type: multipart/alternative; boundary="0000000000005c115405f0590ce4"
+
+--0000000000005c115405f0590ce4
+Content-Type: text/plain; charset="UTF-8"
+Hi All,
+Get excited for our first annual family day!
+There will be face painting, a petting zoo, funnel cake and more.
+Make sure to RSVP!
+Best.
+--
+Mallori Harrell
+Unstructured Technologies
+Data Scientist
+
+--0000000000005c115405f0590ce4
+Content-Type: text/html; charset="UTF-8"
+Content-Transfer-Encoding: quoted-printable
+
+<div dir=3D"ltr">Hi All,<div><br></div><div>Get excited for our first annua=
+l family day!=C2=A0</div><div><br></div><div>There will be face painting, =
+a petting zoo, funnel cake and more.</div><div><br></div><div>Make sure to =
+RSVP!</div><div><br></div><div>Best.<br clear=3D"all"><div><br></div>-- <br=
+><div dir=3D"ltr" class=3D"gmail_signature" data-smartmail=3D"gmail_signatu=
+re"><div dir=3D"ltr">Mallori Harrell<div>Unstructured Technologies<br><div>=
+Data Scientist</div><div><br></div></div></div></div></div></div>
+
+--0000000000005c115405f0590ce4--
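In the quoted-printable HTML part of this fixture, `=3D` encodes a literal `=` and a trailing `=` at the end of a line marks a soft line break that a decoder should drop. As a point of comparison only, the sketch below decodes one shortened fragment with the standard library's `quopri` module, once with LF and once with CRLF line endings. It does not show how the email partitioner itself cleans text, and the fragment is abbreviated from the fixture for illustration.

```python
import quopri

# Abbreviated fragment of the HTML part, with LF and CRLF soft line breaks.
encoded_lf = b'<div dir=3D"ltr">Make sure to =\nRSVP!</div>'
encoded_crlf = b'<div dir=3D"ltr">Make sure to =\r\nRSVP!</div>'

# quopri.decodestring takes bytes and returns bytes.
decoded_lf = quopri.decodestring(encoded_lf).decode("utf-8")
decoded_crlf = quopri.decodestring(encoded_crlf).decode("utf-8")

print(decoded_lf)
print(decoded_crlf)
```

Both inputs should decode to the same string, since the soft line break carries no content regardless of the line-ending style.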
@@ -1 +1 @@
-__version__ = "0.15.0-dev16" # pragma: no cover
+__version__ = "0.15.0" # pragma: no cover