Tool makes assumptions about receiving some properties back from LLM (e.g. tags) #83

Closed
itsthejb opened this issue Jan 9, 2025 · 12 comments

Comments

@itsthejb

itsthejb commented Jan 9, 2025

Describe the bug

Customising the prompt description allows the user a lot of flexibility to control what is received back from the LLM. However, it seems that this tool makes some assumptions on its receiving end. For example, that there are >0 tags.

To Reproduce
Steps to reproduce the behavior:

  1. Settings > Prompt Description
  2. Edit the prompt to remove reference to working out tags, e.g.
``````You are a personalized document analyzer. Your task is to analyze documents and extract relevant information.

Analyze the document content and extract the following information into a structured JSON object:

1. title: Create a concise, meaningful title for the document
2. correspondent: Identify the sender/institution but do not include addresses
4. document_date: Extract the document date (format: YYYY-MM-DD)
5. language: Determine the document language (e.g. "de" or "en")

Important rules for the analysis:

For the title:
- Short and concise, NO ADDRESSES
- Contains the most important identification features
- For invoices/orders, mention invoice/order number if available
- The output language is the one used in the document! IMPORTANT!

For the correspondent:
- Identify the sender or institution

For the document date:
- Extract the date of the document
- Use the format YYYY-MM-DD
- If multiple dates are present, use the most relevant one

For the language:
- Determine the document language
- Use language codes like "de" for German or "en" for English
- If the language is not clear, use "und" as a placeholder
  3. Process a document
  4. Setting document properties from the AI response fails, e.g.
Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31

I assume that the receiving end wrongly expects at least one tag. This response has zero tags, so it fails. Since the prompt makes the processing fully customisable, the tool should not assume that any particular property is returned.

Expected behavior
Results are received back correctly for the document properties that are requested in the prompt
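
One way to get this behaviour would be to treat every property as optional when reading the model's reply. The following is a minimal sketch under that assumption, not the project's actual code: `normalizeAnalysis` is a hypothetical helper, and the field names are simply the ones listed in the prompt above.

```javascript
// Hypothetical helper: accept whatever subset of fields the prompt asked for.
// Field names (title, correspondent, tags, document_date, language) come from
// the prompt in this issue; nothing here is taken from the tool's source.
function normalizeAnalysis(rawJson) {
  const parsed = JSON.parse(rawJson);
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new Error("Invalid response structure: expected a JSON object");
  }

  const result = {};
  // Copy string fields only when they are present and non-empty.
  for (const key of ["title", "correspondent", "document_date", "language"]) {
    if (typeof parsed[key] === "string" && parsed[key].trim() !== "") {
      result[key] = parsed[key].trim();
    }
  }
  // A missing tags property is treated as "no tags", not as an error.
  result.tags = Array.isArray(parsed.tags)
    ? parsed.tags.filter((t) => typeof t === "string" && t.trim() !== "")
    : [];
  return result;
}

// Example reply that follows the edited prompt (no tags, no correspondent).
const reply = '{"title": "Invoice 4711", "document_date": "2025-01-09", "language": "en"}';
console.log(normalizeAnalysis(reply));
```

With this shape, a caller can apply only the keys that are present and leave everything else on the document untouched.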

Screenshots

  • No doc properties are populated in the Manual tab, for example

Desktop (please complete the following information):

  • OS: macOS
  • Browser: Brave
  • Version: 1.73

Additional context

  • Docker container in Linux host
@itsthejb itsthejb changed the title Tool makes assumptions about existence of some data (e.g. tags) Tool makes assumptions about receiving some properties back from LLM (e.g. tags) Jan 9, 2025
@clusterzx
Owner

Could you show me the complete .env in your container under /app/data/.env?
Of course you should redact all API keys.

@itsthejb
Author

itsthejb commented Jan 9, 2025

PAPERLESS_API_URL=<>
PAPERLESS_API_TOKEN=<>
AI_PROVIDER=openai
SCAN_INTERVAL=*/30 * * * *
SYSTEM_PROMPT=```````You are a personalized document analyzer. Your task is to analyze documents and extract relevant information.\n\nAnalyze the document content and extract the following information into a structured JSON object:\n\n1. title: Create a concise, meaningful title for the document\n2. correspondent: Identify the sender/institution but do not include addresses\n4. document_date: Extract the document date (format: YYYY-MM-DD)\n5. language: Determine the document language (e.g. "de" or "en")\n\nImportant rules for the analysis:\n\nFor the title:\n- Short and concise, NO ADDRESSES\n- Contains the most important identification features\n- For invoices/orders, mention invoice/order number if available\n- The output language is the one used in the document! IMPORTANT!\n\nFor the correspondent:\n- Identify the sender or institution\n\nFor the document date:\n- Extract the date of the document\n- Use the format YYYY-MM-DD\n- If multiple dates are present, use the most relevant one\n\nFor the language:\n- Determine the document language\n- Use language codes like "de" for German or "en" for English\n- If the language is not clear, use "und" as a placeholder

        Return the result EXCLUSIVELY as a JSON object. The Tags and Title MUST be in the language that is used in the document.:
        
        {
          "title": "xxxxx",
          "correspondent": "xxxxxxxx",
          "tags": ["Tag1", "Tag2", "Tag3", "Tag4"],
          "document_date": "YYYY-MM-DD",
          "language": "en/de/es/..."
        }`
PROCESS_PREDEFINED_DOCUMENTS=yes
TAGS=AI-Tag
ADD_AI_PROCESSED_TAG=yes
AI_PROCESSED_TAG_NAME=AI-Tagged
USE_PROMPT_TAGS=no
PROMPT_TAGS=
OPENAI_API_KEY=<>
OPENAI_MODEL=gpt-4o-mini

Looking at that, I thought ah, probably it's just that there's still a tags property for the JSON response (this isn't visible in the GUI). However I also just tried removing it and rescanning but it still fails with:

Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31

So it still seems that the tool assumes certain properties exist.
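
The error message suggests a check along the following lines. This is only a guess reconstructed from the message text, not the code in `openaiService.js`, and `strictValidate` is a hypothetical name. It illustrates why a reply without `tags` would be rejected no matter what the prompt asks for.

```javascript
// Guess at a strict check, inferred only from the error message in the log.
function strictValidate(parsed) {
  if (!parsed || !Array.isArray(parsed.tags) || typeof parsed.correspondent !== "string") {
    throw new Error("Invalid response structure: missing tags array or correspondent string");
  }
  return parsed;
}

// A reply that follows the edited prompt (no tags requested) fails the check.
const withoutTags = { title: "Some title", correspondent: "Some sender", language: "en" };
try {
  strictValidate(withoutTags);
} catch (err) {
  console.log(err.message);
}
```

Under this assumption, removing `tags` from the JSON template in `SYSTEM_PROMPT` cannot help, because the check runs on the reply rather than on the prompt.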

@clusterzx
Owner

That's weird, the config looks totally fine.
Could you do a run and post the container log here, from the start to the first appearance of the error?

@itsthejb
Author

itsthejb commented Jan 9, 2025

Loading .env from: /app/data/.env
Loaded environment variables: {
  PAPERLESS_API_URL: 'http://paperless:8000/api',
  PAPERLESS_API_TOKEN: '...'
}
(node:18) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Server running on port 3000
[DEBUG] [09.01.25, 12:34] OpenAI request sent
Configured scan interval: */30 * * * *
Starting initial scan at 2025-01-09T12:34:10.079Z
Fetched page 1, got 70 tags. Total so far: 70
Refreshing tag cache...
Tag cache refreshed. Found 25 tags.
Found tag "AI-Tag" in cache with ID 164
Filtering documents for tag IDs: [ 164 ]
Fetched page 1, got 4 documents. Total so far: 4
Finished fetching. Found 4 documents.
70
101
Found tag "AI-Tag" in cache with ID 164
Filtering documents for tag IDs: [ 164 ]
Fetched page 1, got 4 documents. Total so far: 4
Fetched page 1, got 70 tags. Total so far: 70
Finished fetching. Found 4 documents.
Fetching content for document: 1805
Document Data: {
  id: 1805,
  correspondent: null,
  document_type: null,
  storage_path: null,
  title: '<>>',
  content: '<content>'... 76007 more characters,
  tags: [ 'AI-Tag', 'Reference' ],
  created: '2023-07-18T00:00:00+01:00',
  created_date: '2023-07-18',
  modified: '2025-01-08T14:04:01.470693Z',
  added: '2024-12-05T17:30:09.046275Z',
  deleted_at: null,
  archive_serial_number: null,
  original_file_name: 'SSRN-id4227132.pdf',
  archived_file_name: '2023-07-18 SSRN-id4227132.pdf',
  owner: null,
  user_can_change: true,
  is_shared_by_requester: false,
  notes: [],
  custom_fields: []
}
Thumbnail not cached, fetching from Paperless
Error status: 500
Error headers: Object [AxiosHeaders] {
  date: 'Thu, 09 Jan 2025 12:34:23 GMT',
  server: 'uvicorn',
  'content-type': 'text/html; charset=utf-8',
  'x-frame-options': 'SAMEORIGIN',
  'x-api-version': '5',
  'x-version': '2.11.1',
  'content-length': '145',
  vary: 'Accept-Language, origin, Cookie',
  'content-language': 'en-us',
  'x-content-type-options': 'nosniff',
  'referrer-policy': 'same-origin',
  'cross-origin-opener-policy': 'same-origin'
}
Error fetching thumbnail for document undefined: Request failed with status code 500
Thumbnail nicht gefunden
Failed to get thumbnail TypeError [ERR_INVALID_ARG_TYPE]: The "data" argument must be of type string or an instance of Buffer, TypedArray, or DataView. Received null
    at Object.writeFile (node:internal/fs/promises:1203:5)
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:76:20)
    at async /app/routes/setup.js:401:31 {
  code: 'ERR_INVALID_ARG_TYPE'
}
[DEBUG] [09.01.25, 12:34] OpenAI request sent
[DEBUG] [09.01.25, 12:34] Used tokens: 328, Total tokens: 22205
Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31
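
The `ERR_INVALID_ARG_TYPE` in this log is separate from the tags error: `writeFile` was handed `null`, apparently because the thumbnail request had just failed with a 500. A minimal sketch of guarding that write is shown here; `isWritable` and `cacheThumbnail` are hypothetical helpers and the file path is illustrative, none of it is the tool's own code.

```javascript
const fs = require("node:fs/promises");
const os = require("node:os");
const path = require("node:path");

// True for the data types named in the error message:
// string, Buffer, TypedArray, or DataView.
function isWritable(data) {
  return typeof data === "string" || ArrayBuffer.isView(data);
}

// Hypothetical helper: skip caching when the fetch produced nothing usable.
async function cacheThumbnail(filePath, data) {
  if (!isWritable(data)) {
    return false; // nothing to cache; the caller continues without a thumbnail
  }
  await fs.writeFile(filePath, data);
  return true;
}

// A failed fetch that yields null is skipped instead of throwing.
const target = path.join(os.tmpdir(), "thumbnail-example.png");
cacheThumbnail(target, null).then((written) => console.log("written:", written));
```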

@clusterzx
Owner

OK, there's definitely something going on with your file or the Paperless-ngx instance itself.

This is a long shot, but maybe spin up a fresh Paperless instance with that document and another, smaller document.

Try the smaller one first and then this one, and watch the debug log.

I've never seen that error here before, so I can only guess.

@Sblop

Sblop commented Jan 9, 2025

This is caused by document ownership. All of my documents did not have any user/owner assigned, and I had a helluva lot of problems... Try assigning a user to your tags/docs.

@clusterzx, you might need to look into this....

@clusterzx
Owner

@Sblop thanks for the tip ❤️
But then I don't have any countermeasures against it, because the API calls run with the permissions of the user. If this is a problem for special edge cases, I can't do anything about it.

@itsthejb
Author

itsthejb commented Jan 9, 2025

Hmmm... Interesting. I tried setting the owner of that document to my (logged in) user and re-running. It doesn't appear to make any difference:

Thumbnail not cached, fetching from Paperless
Error fetching thumbnail for document undefined: Request failed with status code 500
Error status: 500
Thumbnail nicht gefunden
Error headers: Object [AxiosHeaders] {
  date: 'Thu, 09 Jan 2025 15:39:43 GMT',
  server: 'uvicorn',
  'content-type': 'text/html; charset=utf-8',
  'x-frame-options': 'SAMEORIGIN',
  'x-api-version': '5',
  'x-version': '2.11.1',
  'content-length': '145',
  vary: 'Accept-Language, origin, Cookie',
  'content-language': 'en-us',
  'x-content-type-options': 'nosniff',
  'referrer-policy': 'same-origin',
  'cross-origin-opener-policy': 'same-origin'
}
Failed to get thumbnail TypeError [ERR_INVALID_ARG_TYPE]: The "data" argument must be of type string or an instance of Buffer, TypedArray, or DataView. Received null
    at Object.writeFile (node:internal/fs/promises:1203:5)
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:76:20)
    at async /app/routes/setup.js:401:31 {
  code: 'ERR_INVALID_ARG_TYPE'
}
[DEBUG] [09.01.25, 15:39] OpenAI request sent
[DEBUG] [09.01.25, 15:39] Used tokens: 250, Total tokens: 22201
Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31

@itsthejb
Author

itsthejb commented Jan 9, 2025

However that wouldn't make any sense anyway, because if I just restore the prompt to the default:

``````You are a personalized document analyzer. Your task is to analyze documents and extract relevant information.

Analyze the document content and extract the following information into a structured JSON object:

1. title: Create a concise, meaningful title for the document
2. correspondent: Identify the sender/institution but do not include addresses
3. tags: Select up to 4 relevant thematic tags
4. document_date: Extract the document date (format: YYYY-MM-DD)
5. language: Determine the document language (e.g. "de" or "en")

Important rules for the analysis:

For tags:
- FIRST check the existing tags before suggesting new ones
- Don't create a tag that's the same as the correspondent
- Use only relevant categories
- Maximum 4 tags per document, less if sufficient (at least 1)
- Avoid generic or too specific tags
- Use only the most important information for tag creation
- The output language is the one used in the document! IMPORTANT!
- Format the tags and correspondent in title case! IMPORTANT!

For the title:
- Short and concise, NO ADDRESSES
- Contains the most important identification features
- For invoices/orders, mention invoice/order number if available
- The output language is the one used in the document! IMPORTANT!

For the correspondent:
- Identify the sender or institution

For the document date:
- Extract the date of the document
- Use the format YYYY-MM-DD
- If multiple dates are present, use the most relevant one

For the language:
- Determine the document language
- Use language codes like "de" for German or "en" for English
- If the language is not clear, use "und" as a placeholder

... then everything works just fine:

Screenshot 2025-01-09 at 15 43 05

With no errors in the log:

[DEBUG] [09.01.25, 15:42] OpenAI request sent
[DEBUG] [09.01.25, 15:42] Used tokens: 336, Total tokens: 22324
2025-01-09T15:43:02: PM2 log: [PM2][WORKER] Reset the restart delay, as app paperless-ai has been up for more than 30000ms

So the issue is absolutely caused by not asking for tags in the prompt.

To clarify: the reason I'm trying this is simply that the title generation works well, but I'm less sure about tagging right now. It therefore seems to make sense that you ought to be able to apply whatever info you want by specifying it (or not) in the prompt. Hence reporting the issue.

@gima84

gima84 commented Jan 9, 2025

Hi, I'm in the same situation with tags (I don't want them). I am using just 1 tag as a workaround...

.
.
3. tags: Use tag "aiproc".
.
.
.
For tags:

  • Use tag "aiproc"
    .
    .

@itsthejb
Author

itsthejb commented Jan 9, 2025

@gima84 Nice. That appears to be a workaround for the time being

@clusterzx
Owner

Not an issue in the first place.
