Bug Report: Text Truncation in EPUB Files Larger Than 500KB #39

Open
hochenggang opened this issue Dec 7, 2024 · 5 comments

Comments

@hochenggang

Bug Report: Text Truncation in EPUB Files Larger Than 500KB


Project: Extractous
Version: extractous==0.2.0
Environment: x86, Linux, Python 3.10


Description:

When an EPUB document yields more than about 500KB of extracted text, the output is consistently truncated around the 500KB mark, so the complete content is never returned.

Expected Behavior:

The extracted text should contain the complete content of the EPUB file, regardless of its text length.

Actual Behavior:

The extracted text is truncated around the 500KB mark, leaving out the remaining content of the EPUB file.

Example Code:

import os
import uuid
from flask import Flask, request, Response
from extractous import Extractor

# Create a new extractor
extractor = Extractor()

# Create flask instance
app = Flask(__name__)

@app.route("/")
def get_index():
    with open("./extractor.html", 'rt') as f:
        return Response(f.read(), mimetype="text/html")

@app.route("/extractor", methods=["POST"])
def post_index():
    if 'file' not in request.files:
        return "No file part", 400
    file = request.files['file']
    if file.filename == '':
        return "No selected file", 400
    file_content = file.read()
    file_extension = os.path.splitext(file.filename)[1]
    random_str = uuid.uuid1().hex
    save_path = f"{random_str}{file_extension}"
    with open(save_path, 'wb') as f:
        f.write(file_content)
    result, metadata = extractor.extract_file_to_string(save_path)
    if os.path.isfile(save_path):
        os.remove(save_path)
    return Response(result, mimetype="text/plain")

if __name__ == "__main__":
    from waitress import serve
    serve(app, listen='*:18030')

Additional Information:

  • The issue occurs consistently with EPUB files larger than 500KB.
  • The sample EPUB file used for testing can be found here.
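
A minimal sketch of the same behavior without the Flask wrapper ("sample.epub" stands in for the sample file linked above):

from extractous import Extractor

# Create a new extractor and pull the whole EPUB into a single string
extractor = Extractor()
result, metadata = extractor.extract_file_to_string("sample.epub")

# The returned text stops around the 500KB mark, even though the document contains more
print(len(result))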

I would greatly appreciate it if you could look into this issue at your earliest convenience.
Thank you very much for your assistance!

@hochenggang
Author

here is the Example Code Online

@hochenggang
Author

I have changed extract_string_max_length to 100000000, but it does not work.
Text is still truncated when processing EPUB files larger than 500KB, even after explicitly setting set_extract_string_max_length to a much higher value (e.g., 100,000,000). The output is consistently cut off around the 500KB mark, preventing extraction of the complete content.

# Create a new extractor
extractor = Extractor()
# extractor.set_extract_string_max_length(1000)
extractor.set_extract_string_max_length(100000000)
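
For comparison, here is a sketch that streams the output through extract_file instead of extract_file_to_string, to check whether the same cap applies there. This assumes the reader-based API shown in the Extractous README; the exact return signature may differ between versions.

from extractous import Extractor

extractor = Extractor()

# Newer releases return (reader, metadata); older ones may return only the reader
reader, metadata = extractor.extract_file("sample.epub")

# Read the extracted text in chunks rather than as one bounded string
# (errors="replace" avoids exceptions if a chunk boundary splits a multi-byte character)
result = ""
buffer = reader.read(4096)
while len(buffer) > 0:
    result += buffer.decode("utf-8", errors="replace")
    buffer = reader.read(4096)

print(len(result))  # compare against the ~500KB seen with extract_file_to_string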

@nmammeri
Contributor

nmammeri commented Dec 8, 2024

Thanks for reporting the issue. I was going to suggest setting set_extract_string_max_length to a higher value, but it seems that you already tried that. I'll have a look further into this.

@jabberjabberjabber

I am getting the same problem; the output is capped at 500K no matter what the extract length is set to.

@jabberjabberjabber

> I have changed extract_string_max_length to 100000000, but it does not work. Text is still truncated when processing EPUB files larger than 500KB, even after explicitly setting set_extract_string_max_length to a much higher value (e.g., 100,000,000). The output is consistently cut off around the 500KB mark, preventing extraction of the complete content.
>
> # Create a new extractor
> extractor = Extractor()
> # extractor.set_extract_string_max_length(1000)
> extractor.set_extract_string_max_length(100000000)

#41 (comment)
