Bug Report: Text Truncation in EPUB Files Larger Than 500KB #39

Open
hochenggang opened this issue Dec 7, 2024 · 5 comments

Comments

@hochenggang

Bug Report: Text Truncation in EPUB Files Larger Than 500KB


Project: Extractous
Version: extractous==0.2.0
Environment: x86, Linux, Python 3.10


Description:

When an EPUB document yields more than about 500KB of extracted text, the output is consistently truncated around the 500KB mark, so the complete content is never returned.

Expected Behavior:

The extracted text should contain the complete content of the EPUB file, regardless of its text length.

Actual Behavior:

The extracted text is truncated around the 500KB mark, leaving out the remaining content of the EPUB file.

Example Code:

import os
import uuid
from flask import Flask, request, Response
from extractous import Extractor

# Create a new extractor
extractor = Extractor()

# Create flask instance
app = Flask(__name__)

@app.route("/")
def get_index():
    with open("./extractor.html", 'rt') as f:
        return Response(f.read(), mimetype="text/html")

@app.route("/extractor", methods=["POST"])
def post_index():
    if 'file' not in request.files:
        return "No file part", 400
    file = request.files['file']
    if file.filename == '':
        return "No selected file", 400
    file_content = file.read()
    file_extension = os.path.splitext(file.filename)[1]
    random_str = uuid.uuid1().hex
    save_path = f"{random_str}{file_extension}"
    with open(save_path, 'wb') as f:
        f.write(file_content)
    result, metadata = extractor.extract_file_to_string(save_path)
    if os.path.isfile(save_path):
        os.remove(save_path)
    return Response(result, mimetype="text/plain")

if __name__ == "__main__":
    from waitress import serve
    serve(app, listen='*:18030')

Additional Information:

  • The issue occurs consistently with EPUB files larger than 500KB.
  • The sample EPUB file used for testing can be found here.
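
A minimal sketch of the same behavior without the Flask wrapper ("sample.epub" stands in for the sample file linked above):

from extractous import Extractor

# Create a new extractor and pull the whole EPUB into a single string
extractor = Extractor()
result, metadata = extractor.extract_file_to_string("sample.epub")

# The returned text stops around the 500KB mark, even though the document contains more
print(len(result))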

I would greatly appreciate it if you could look into this issue at your earliest convenience.
Thank you very much for your assistance!

@hochenggang
Author

here is the Example Code Online

@hochenggang
Author

I have changed extract_string_max_length to 100000000, but it does not work.
Text is still truncated when processing EPUB files larger than 500KB, even after explicitly setting set_extract_string_max_length to a much higher value (e.g., 100,000,000). The output is consistently cut off around the 500KB mark, preventing extraction of the complete content.

# Create a new extractor
extractor = Extractor()
# extractor.set_extract_string_max_length(1000)
extractor.set_extract_string_max_length(100000000)
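
For comparison, here is a sketch that streams the output through extract_file instead of extract_file_to_string, to check whether the same cap applies there. This assumes the reader-based API shown in the Extractous README; the exact return signature may differ between versions.

from extractous import Extractor

extractor = Extractor()

# Newer releases return (reader, metadata); older ones may return only the reader
reader, metadata = extractor.extract_file("sample.epub")

# Read the extracted text in chunks rather than as one bounded string
# (errors="replace" avoids exceptions if a chunk boundary splits a multi-byte character)
result = ""
buffer = reader.read(4096)
while len(buffer) > 0:
    result += buffer.decode("utf-8", errors="replace")
    buffer = reader.read(4096)

print(len(result))  # compare against the ~500KB seen with extract_file_to_string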

@nmammeri
Contributor

nmammeri commented Dec 8, 2024

Thanks for reporting the issue. I was going to suggest setting set_extract_string_max_length to a higher value, but it seems that you already tried that. I'll have a look further into this.

@jabberjabberjabber

I am getting the same problem; the output is capped at 500K no matter what the extract length is set to.

@jabberjabberjabber

> I have changed extract_string_max_length to 100000000, but it does not work. Text is still truncated when processing EPUB files larger than 500KB, even after explicitly setting set_extract_string_max_length to a much higher value (e.g., 100,000,000). The output is consistently cut off around the 500KB mark, preventing extraction of the complete content.
>
> # Create a new extractor
> extractor = Extractor()
> # extractor.set_extract_string_max_length(1000)
> extractor.set_extract_string_max_length(100000000)

#41 (comment)
