[Bug]: report "File is not a zip file" where parse ".doc" file #474

Closed
1 task done
ben-qiao opened this issue Apr 22, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@ben-qiao

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch name

main

Commit ID

fsjfj23ir23rpwfkwfke

Other environment information

ragflow 0.3.0

Actual behavior

After RAGFlow starts normally, uploading a .doc file fails with the error "File is not a zip file".


Expected behavior

No response

Steps to reproduce

Upload a .doc file.

Additional information

No response

@ben-qiao ben-qiao added the bug Something isn't working label Apr 22, 2024
@Jiafan
Contributor

Jiafan commented Apr 22, 2024

I have the same problem, and I got logs like this:
[screenshot: WeChatWorkScreenshot_e1256233-f516-462a-abad-1041d1254d8e]

So I read the source and found this code:


    sections = []
    if re.search(r"\.docx?$", filename, re.IGNORECASE):
        callback(0.1, "Start to parse.")
        for txt in Docx()(filename, binary):
            sections.append(txt)
        callback(0.8, "Finish parsing.")

    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
        pdf_parser = Pdf() if kwargs.get(
            "parser_config", {}).get(
            "layout_recognize", True) else PlainParser()
        for txt, poss in pdf_parser(filename if not binary else binary,
                                    from_page=from_page, to_page=to_page, callback=callback)[0]:
            sections.append(txt + poss)

    elif re.search(r"\.txt$", filename, re.IGNORECASE):
        callback(0.1, "Start to parse.")
        txt = ""
        if binary:
            txt = binary.decode("utf-8")
        else:
            with open(filename, "r") as f:
                while True:
                    l = f.readline()
                    if not l:
                        break
                    txt += l
        sections = txt.split("\n")
        sections = [l for l in sections if l]
        callback(0.8, "Finish parsing.")
    else:
        raise NotImplementedError(
            "file type not supported yet(docx, pdf, txt supported)")

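In fact, .doc is not supported, but the check lets .doc files through: the pattern `\.docx?$` makes the trailing `x` optional, so a .doc file is handed to the Docx parser, which expects a zip-based .docx file and fails with "File is not a zip file".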
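
To illustrate the point about the extension check, here is a minimal sketch comparing the pattern from the quoted code with a stricter one. The stricter pattern and the helper name `is_docx` are assumptions made for illustration; they are not necessarily what the project's fix looks like.

```python
import re

# Pattern from the quoted code: "x?" makes the trailing "x" optional,
# so both ".doc" and ".docx" match.
LOOSE = re.compile(r"\.docx?$", re.IGNORECASE)

# Stricter pattern (an assumption, not the project's confirmed fix):
# only names ending in ".docx" match.
STRICT = re.compile(r"\.docx$", re.IGNORECASE)


def is_docx(filename: str) -> bool:
    """Hypothetical helper: True only for file names ending in .docx."""
    return STRICT.search(filename) is not None


if __name__ == "__main__":
    for name in ("report.doc", "report.docx", "REPORT.DOCX", "notes.txt"):
        # Print whether the loose and the strict checks accept each name.
        print(name, bool(LOOSE.search(name)), is_docx(name))
```

With a check like this, a .doc upload would fall through to the final `else` branch and raise the `NotImplementedError` about unsupported file types, instead of reaching the Docx parser.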

KevinHuSh added a commit that referenced this issue Apr 22, 2024
### What problem does this PR solve?
#474 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
@KevinHuSh
Collaborator

fixed
