[Question]: Knowledge Graph has an error when parsing [expected string or buffer] #1859

Closed
JaminYan opened this issue Aug 7, 2024 · 1 comment
Labels
question Further information is requested

Comments


JaminYan commented Aug 7, 2024

Describe your problem

Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 149, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
  File "/ragflow/rag/app/knowledge_graph.py", line 18, in chunk
    chunks = build_knowlege_graph_chunks(tenant_id, sections, callback,
  File "/ragflow/graphrag/index.py", line 87, in build_knowlege_graph_chunks
    tkn_cnt = num_tokens_from_string(chunks[i])
  File "/ragflow/rag/utils/__init__.py", line 79, in num_tokens_from_string
    num_tokens = len(encoder.encode(string))
  File "/usr/local/lib/python3.10/dist-packages/tiktoken/core.py", line 116, in encode
    if match := _special_token_regex(disallowed_special).search(text):
TypeError: expected string or buffer
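
The last frame of the traceback is a regex `.search(text)` call, which only accepts string-like input. As a rough illustration of that failure mode, the sketch below uses Python's standard `re` module as a stand-in for the tokenizer's regex. The pattern and the helper names are hypothetical, and the exact wording of the error message differs from the one above.

```python
import re

# Hypothetical stand-in for the special-token pattern seen in the traceback.
special_token_pattern = re.compile(r"<\|endoftext\|>")


def search_special_tokens(text):
    """Mimic the failing call: a regex search over the text to be encoded."""
    return special_token_pattern.search(text)


def try_search(value):
    """Return the exception type name if the search fails, else None."""
    try:
        search_special_tokens(value)
    except TypeError as exc:
        return type(exc).__name__
    return None


print(try_search("plain text chunk"))       # a str is accepted by the search
print(try_search({"text": "plain text"}))   # a dict is rejected with a TypeError
```

So the error most likely means that something other than a string reached the token counter.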


JaminYan added the question label Aug 7, 2024
Contributor

cyhasuka commented Aug 8, 2024

Same issue. It occurs when uploading any Word document that includes pictures.
Env: Linux, running in a Docker container

KevinHuSh pushed a commit that referenced this issue Aug 8, 2024
…ted using Knowledge Graph.#1859 (#1865)

### What problem does this PR solve?

Fix a "TypeError: expected string or buffer" bug in docx files extracted
using Knowledge Graph. #1859
```
Traceback (most recent call last):
  File "//Users/XXX/ragflow/rag/svr/task_executor.py", line 149, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/ragflow/rag/app/knowledge_graph.py", line 18, in chunk
    chunks = build_knowlege_graph_chunks(tenant_id, sections, callback,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/ragflow/graphrag/index.py", line 87, in build_knowlege_graph_chunks
    tkn_cnt = num_tokens_from_string(chunks[i])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/github/ragflow/rag/utils/__init__.py", line 79, in num_tokens_from_string
    num_tokens = len(encoder.encode(string))
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/XXX/tiktoken/core.py", line 116, in encode
    if match := _special_token_regex(disallowed_special).search(text):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or buffer
```
In the failing case, the type is `dict`:
<img width="1689" alt="Pasted Graphic 3"
src="https://github.com/user-attachments/assets/e5ba5c45-df1d-4697-98c9-14365c839f20">
The correct type should be `str`:
<img width="1725" alt="Pasted Graphic 2"
src="https://github.com/user-attachments/assets/e54d5e60-4ce4-4180-b394-24e485013534">
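
One way to guard against this kind of mismatch is to normalize every chunk to a string before counting tokens. The sketch below is only an illustration and is not the change made in the referenced commit. `chunk_to_text` and `count_tokens` are hypothetical helpers, the `"text"` key is an assumed dict layout, and the whitespace-based counter stands in for a real tokenizer.

```python
from typing import Any


def chunk_to_text(chunk: Any) -> str:
    """Hypothetical helper: return a plain string for a chunk of unknown shape."""
    if isinstance(chunk, str):
        return chunk
    if isinstance(chunk, bytes):
        return chunk.decode("utf-8", errors="replace")
    if isinstance(chunk, dict):
        # Assumption: the text lives under a "text" key; otherwise join string values.
        text = chunk.get("text")
        if isinstance(text, str):
            return text
        return " ".join(v for v in chunk.values() if isinstance(v, str))
    if isinstance(chunk, (list, tuple)):
        return " ".join(chunk_to_text(item) for item in chunk)
    return str(chunk)


def count_tokens(text: str) -> int:
    """Stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


chunks = [
    "first chunk of text",
    {"text": "chunk from a docx with images"},
    ("tuple", "chunk"),
]
token_counts = [count_tokens(chunk_to_text(c)) for c in chunks]
print(token_counts)
```

Normalizing at the boundary keeps the token counter itself simple, at the cost of having to decide up front how non-string chunks map to text.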

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Halfknow pushed a commit to Halfknow/ragflow that referenced this issue Nov 11, 2024
…ted using Knowledge Graph.infiniflow#1859 (infiniflow#1865)
