[Bug]: report "File is not a zip file" where parse ".doc" file #474

ben-qiao · 2024-04-22T02:26:39Z

Is there an existing issue for the same bug?

I have checked the existing issues.

Branch name

main

Commit ID

fsjfj23ir23rpwfkwfke

Other environment information

ragflow 0.3.0

Actual behavior

After ragflow starts normally, uploading a .doc file fails with the error "File is not a zip file".

Expected behavior

No response

Steps to reproduce

Just upload a .doc file.

Additional information

No response

Jiafan · 2024-04-22T07:30:47Z

I have the same problem, and I got logs like this:

So I read the source and found this code:


    sections = []
    if re.search(r"\.docx?$", filename, re.IGNORECASE):
        callback(0.1, "Start to parse.")
        for txt in Docx()(filename, binary):
            sections.append(txt)
        callback(0.8, "Finish parsing.")

    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
        pdf_parser = Pdf() if kwargs.get(
            "parser_config", {}).get(
            "layout_recognize", True) else PlainParser()
        for txt, poss in pdf_parser(filename if not binary else binary,
                                    from_page=from_page, to_page=to_page, callback=callback)[0]:
            sections.append(txt + poss)

    elif re.search(r"\.txt$", filename, re.IGNORECASE):
        callback(0.1, "Start to parse.")
        txt = ""
        if binary:
            txt = binary.decode("utf-8")
        else:
            with open(filename, "r") as f:
                while True:
                    l = f.readline()
                    if not l:
                        break
                    txt += l
        sections = txt.split("\n")
        sections = [l for l in sections if l]
        callback(0.8, "Finish parsing.")
    else:
        raise NotImplementedError(
            "file type not supported yet(docx, pdf, txt supported)")

In fact, .doc is not supported, but the check lets .doc files through: the pattern \.docx?$ makes the "x" optional, so a .doc file is handed to the Docx parser.
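The mismatch can be reproduced outside ragflow. The following is a minimal sketch, not ragflow code: the helper names `takes_docx_branch` and `zip_open_error` are made up for illustration, and the fake .doc payload is just the OLE2 signature followed by padding. It assumes the Docx parser ends up opening the file as a zip archive, which is what the reported error message suggests.

```python
import io
import re
import zipfile

# Pattern copied from the snippet above; the "x" is optional.
PATTERN = r"\.docx?$"

# Legacy .doc files are OLE2 compound documents, not zip archives.
# These 8 bytes are the OLE2 signature; the padding is placeholder data.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def takes_docx_branch(filename: str) -> bool:
    """Hypothetical helper: would this filename enter the Docx branch?"""
    return re.search(PATTERN, filename, re.IGNORECASE) is not None


def zip_open_error(data: bytes):
    """Hypothetical helper: return the zipfile error message, or None."""
    try:
        zipfile.ZipFile(io.BytesIO(data)).close()
    except zipfile.BadZipFile as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    for name in ("report.docx", "report.doc", "REPORT.DOC"):
        print(name, takes_docx_branch(name))

    fake_doc = OLE2_MAGIC + b"\x00" * 512
    print(zipfile.is_zipfile(io.BytesIO(fake_doc)))
    print(zip_open_error(fake_doc))
```

Both `report.doc` and `report.docx` pass the filename check, while the OLE2 payload is rejected as soon as something tries to read it as a zip archive.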


KevinHuSh · 2024-04-22T07:46:23Z

fixed

ben-qiao added the bug Something isn't working label Apr 22, 2024

KevinHuSh mentioned this issue Apr 22, 2024

remove doc from supported processing types #488

Merged

1 task

KevinHuSh added a commit that referenced this issue Apr 22, 2024

remove doc from supported processing types (#488)

a38e163

### What problem does this PR solve?

#474

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
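For readers who want to see what "remove doc from supported processing types" could look like in practice, here is a minimal sketch. It is not the actual diff of #488: `pick_parser` is a hypothetical function that only returns a label, and the stricter `\.docx$` pattern is one assumed way to keep .doc files out of the Docx branch.

```python
import re


def pick_parser(filename: str) -> str:
    """Hypothetical dispatcher: return a label for the branch a file takes."""
    # Require the trailing "x" so legacy .doc files no longer match.
    if re.search(r"\.docx$", filename, re.IGNORECASE):
        return "docx"
    if re.search(r"\.pdf$", filename, re.IGNORECASE):
        return "pdf"
    if re.search(r"\.txt$", filename, re.IGNORECASE):
        return "txt"
    # Same message as in the quoted source; .doc now ends up here.
    raise NotImplementedError(
        "file type not supported yet(docx, pdf, txt supported)")


if __name__ == "__main__":
    print(pick_parser("report.docx"))
    try:
        pick_parser("report.doc")
    except NotImplementedError as exc:
        print("rejected:", exc)
```

With a check like this, a .doc upload fails early with the "not supported" message instead of reaching the zip reader.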

KevinHuSh closed this as completed Apr 22, 2024
