Tool makes assumptions about receiving some properties back from LLM (e.g. tags) #83

itsthejb · 2025-01-09T08:48:56Z

Describe the bug

Customising the prompt description gives the user a lot of flexibility to control what is received back from the LLM. However, it seems that this tool makes some assumptions on its receiving end, for example that there is at least one tag.

To Reproduce
Steps to reproduce the behavior:

Settings > Prompt Description
Edit the prompt to remove any reference to generating tags, e.g.

You are a personalized document analyzer. Your task is to analyze documents and extract relevant information.

Analyze the document content and extract the following information into a structured JSON object:

1. title: Create a concise, meaningful title for the document
2. correspondent: Identify the sender/institution but do not include addresses
4. document_date: Extract the document date (format: YYYY-MM-DD)
5. language: Determine the document language (e.g. "de" or "en")

Important rules for the analysis:

For the title:
- Short and concise, NO ADDRESSES
- Contains the most important identification features
- For invoices/orders, mention invoice/order number if available
- The output language is the one used in the document! IMPORTANT!

For the correspondent:
- Identify the sender or institution

For the document date:
- Extract the date of the document
- Use the format YYYY-MM-DD
- If multiple dates are present, use the most relevant one

For the language:
- Determine the document language
- Use language codes like "de" for German or "en" for English
- If the language is not clear, use "und" as a placeholder

Process a document
Setting document properties from the AI response fails, e.g.

Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31

I assume that, on the receiving end, the tool wrongly expects to get back at least one tag. This response has zero tags, so it fails. Since the prompt makes the processing fully customisable, the tool should not assume that any particular property is received back.
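A minimal sketch of what a more tolerant receiving end could look like, assuming the LLM answer arrives as a JSON string. `normalizeAnalysis` is a hypothetical helper written for illustration and is not taken from the tool's code; the property names are the ones listed in the prompt.

```javascript
// Hypothetical helper (not part of the tool): accepts whatever subset of
// properties the LLM returned and fills in safe defaults instead of throwing
// when one of them is absent.
function normalizeAnalysis(rawText) {
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch (err) {
    throw new Error(`Invalid JSON in LLM response: ${err.message}`);
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    throw new Error('LLM response must be a JSON object');
  }

  // Non-string or empty values are treated as "not provided".
  const asString = (v) => (typeof v === 'string' && v.trim() !== '' ? v.trim() : null);

  return {
    title: asString(parsed.title),
    correspondent: asString(parsed.correspondent),
    // A missing or malformed tags property becomes an empty list.
    tags: Array.isArray(parsed.tags)
      ? parsed.tags.filter((t) => typeof t === 'string' && t.trim() !== '')
      : [],
    document_date: asString(parsed.document_date),
    language: asString(parsed.language),
  };
}

// A response without a tags property is accepted; tags defaults to [].
const result = normalizeAnalysis(
  '{"title": "Invoice 1234", "correspondent": "ACME", "document_date": "2023-07-18", "language": "en"}'
);
console.log(result.tags);
```

With defaults like these, only a response that is not a JSON object at all would be rejected, and the caller can decide per property whether there is anything to apply.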

Expected behavior
Results are received back correctly for the document properties that are requested in the prompt

Screenshots

No doc properties are populated in the Manual tag, for example

Desktop (please complete the following information):

OS: macOS
Browser: Brave
Version: 1.73

Additional context

Docker container in Linux host


clusterzx · 2025-01-09T09:07:53Z

Could you show me the complete .env in your container under /app/data/.env?
Of course, you should redact all API keys.

itsthejb · 2025-01-09T09:19:05Z

PAPERLESS_API_URL=<>
PAPERLESS_API_TOKEN=<>
AI_PROVIDER=openai
SCAN_INTERVAL=*/30 * * * *
SYSTEM_PROMPT=`You are a personalized document analyzer. Your task is to analyze documents and extract relevant information.\n\nAnalyze the document content and extract the following information into a structured JSON object:\n\n1. title: Create a concise, meaningful title for the document\n2. correspondent: Identify the sender/institution but do not include addresses\n4. document_date: Extract the document date (format: YYYY-MM-DD)\n5. language: Determine the document language (e.g. "de" or "en")\n\nImportant rules for the analysis:\n\nFor the title:\n- Short and concise, NO ADDRESSES\n- Contains the most important identification features\n- For invoices/orders, mention invoice/order number if available\n- The output language is the one used in the document! IMPORTANT!\n\nFor the correspondent:\n- Identify the sender or institution\n\nFor the document date:\n- Extract the date of the document\n- Use the format YYYY-MM-DD\n- If multiple dates are present, use the most relevant one\n\nFor the language:\n- Determine the document language\n- Use language codes like "de" for German or "en" for English\n- If the language is not clear, use "und" as a placeholder

        Return the result EXCLUSIVELY as a JSON object. The Tags and Title MUST be in the language that is used in the document.:
        
        {
          "title": "xxxxx",
          "correspondent": "xxxxxxxx",
          "tags": ["Tag1", "Tag2", "Tag3", "Tag4"],
          "document_date": "YYYY-MM-DD",
          "language": "en/de/es/..."
        }`
PROCESS_PREDEFINED_DOCUMENTS=yes
TAGS=AI-Tag
ADD_AI_PROCESSED_TAG=yes
AI_PROCESSED_TAG_NAME=AI-Tagged
USE_PROMPT_TAGS=no
PROMPT_TAGS=
OPENAI_API_KEY=<>
OPENAI_MODEL=gpt-4o-mini

Looking at that, I thought: ah, it's probably just that there's still a tags property in the JSON response template (this isn't visible in the GUI). However, I also tried removing it and rescanning, but it still fails with:

Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31

So it still seems like the tool is assuming that certain properties exist.

clusterzx · 2025-01-09T11:54:25Z

That's weird, the config looks totally fine.
Could you do a run and post the container log here, from the start to the first appearance of the error?

itsthejb · 2025-01-09T12:36:04Z

Loading .env from: /app/data/.env
Loaded environment variables: {
  PAPERLESS_API_URL: 'http://paperless:8000/api',
  PAPERLESS_API_TOKEN: '...'
}
(node:18) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Server running on port 3000
[DEBUG] [09.01.25, 12:34] OpenAI request sent
Configured scan interval: */30 * * * *
Starting initial scan at 2025-01-09T12:34:10.079Z
Fetched page 1, got 70 tags. Total so far: 70
Refreshing tag cache...
Tag cache refreshed. Found 25 tags.
Found tag "AI-Tag" in cache with ID 164
Filtering documents for tag IDs: [ 164 ]
Fetched page 1, got 4 documents. Total so far: 4
Finished fetching. Found 4 documents.
70
101
Found tag "AI-Tag" in cache with ID 164
Filtering documents for tag IDs: [ 164 ]
Fetched page 1, got 4 documents. Total so far: 4
Fetched page 1, got 70 tags. Total so far: 70
Finished fetching. Found 4 documents.
Fetching content for document: 1805
Document Data: {
  id: 1805,
  correspondent: null,
  document_type: null,
  storage_path: null,
  title: '<>>',
  content: '<content>'... 76007 more characters,
  tags: [ 'AI-Tag', 'Reference' ],
  created: '2023-07-18T00:00:00+01:00',
  created_date: '2023-07-18',
  modified: '2025-01-08T14:04:01.470693Z',
  added: '2024-12-05T17:30:09.046275Z',
  deleted_at: null,
  archive_serial_number: null,
  original_file_name: 'SSRN-id4227132.pdf',
  archived_file_name: '2023-07-18 SSRN-id4227132.pdf',
  owner: null,
  user_can_change: true,
  is_shared_by_requester: false,
  notes: [],
  custom_fields: []
}
Thumbnail not cached, fetching from Paperless
Error status: 500
Error headers: Object [AxiosHeaders] {
  date: 'Thu, 09 Jan 2025 12:34:23 GMT',
  server: 'uvicorn',
  'content-type': 'text/html; charset=utf-8',
  'x-frame-options': 'SAMEORIGIN',
  'x-api-version': '5',
  'x-version': '2.11.1',
  'content-length': '145',
  vary: 'Accept-Language, origin, Cookie',
  'content-language': 'en-us',
  'x-content-type-options': 'nosniff',
  'referrer-policy': 'same-origin',
  'cross-origin-opener-policy': 'same-origin'
}
Error fetching thumbnail for document undefined: Request failed with status code 500
Thumbnail nicht gefunden
Failed to get thumbnail TypeError [ERR_INVALID_ARG_TYPE]: The "data" argument must be of type string or an instance of Buffer, TypedArray, or DataView. Received null
    at Object.writeFile (node:internal/fs/promises:1203:5)
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:76:20)
    at async /app/routes/setup.js:401:31 {
  code: 'ERR_INVALID_ARG_TYPE'
}
[DEBUG] [09.01.25, 12:34] OpenAI request sent
[DEBUG] [09.01.25, 12:34] Used tokens: 328, Total tokens: 22205
Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31
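The `ERR_INVALID_ARG_TYPE` in this log appears to come from passing `null` to `writeFile` after the thumbnail request returned status 500. A minimal sketch of a guard, assuming the thumbnail arrives as a string or Buffer and is `null` on failure; `cacheThumbnail` and `demo` are hypothetical names, not the tool's actual code:

```javascript
const fs = require('fs/promises');
const os = require('os');
const path = require('path');

// Hypothetical helper (not the tool's code): writes a thumbnail to the cache
// only when there is data to write, and reports whether it did so.
async function cacheThumbnail(cachePath, data) {
  // Strings, Buffers, TypedArrays and DataViews are the types named in the
  // error message; anything else (e.g. null) is skipped.
  const writable = typeof data === 'string' || ArrayBuffer.isView(data);
  if (!writable) {
    return false;
  }
  await fs.mkdir(path.dirname(cachePath), { recursive: true });
  await fs.writeFile(cachePath, data);
  return true;
}

// Usage: a failed fetch (null) is skipped, a Buffer is written.
async function demo() {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'thumb-'));
  const skipped = await cacheThumbnail(path.join(dir, 'a.png'), null);
  const written = await cacheThumbnail(path.join(dir, 'b.png'), Buffer.from([1, 2, 3]));
  return { skipped, written };
}

demo().then((outcome) => console.log(outcome));
```

This only avoids the secondary `TypeError`. It does not explain why Paperless answers the thumbnail request with a 500, and the analysis error further down is logged separately.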

clusterzx · 2025-01-09T14:11:04Z

OK, there's definitely something going on with your file or the Paperless-ngx instance itself.

This is a long shot, but maybe spin up a fresh Paperless instance with that document and another, smaller document.

Try the smaller one first, then this one, and watch the debug log.

I've never seen that error here before, so I can only guess.

Sblop · 2025-01-09T14:34:09Z

This is caused by document ownership. All of my documents did not have any user/owner assigned, and I had a helluva lot of problems... Try assigning a user to your tags/docs.

@clusterzx, you might need to look into this....

clusterzx · 2025-01-09T15:27:30Z

@Sblop thanks for the tip ❤️
But then I don't have any countermeasures against it, because the API calls run within the user's permission scope. If this is a problem in special edge cases, I can't do anything about it.

itsthejb · 2025-01-09T15:41:32Z

Hmmm... Interesting. I tried setting the owner of that document to my (logged in) user and re-running. It doesn't appear to make any difference:

Thumbnail not cached, fetching from Paperless
Error fetching thumbnail for document undefined: Request failed with status code 500
Error status: 500
Thumbnail nicht gefunden
Error headers: Object [AxiosHeaders] {
  date: 'Thu, 09 Jan 2025 15:39:43 GMT',
  server: 'uvicorn',
  'content-type': 'text/html; charset=utf-8',
  'x-frame-options': 'SAMEORIGIN',
  'x-api-version': '5',
  'x-version': '2.11.1',
  'content-length': '145',
  vary: 'Accept-Language, origin, Cookie',
  'content-language': 'en-us',
  'x-content-type-options': 'nosniff',
  'referrer-policy': 'same-origin',
  'cross-origin-opener-policy': 'same-origin'
}
Failed to get thumbnail TypeError [ERR_INVALID_ARG_TYPE]: The "data" argument must be of type string or an instance of Buffer, TypedArray, or DataView. Received null
    at Object.writeFile (node:internal/fs/promises:1203:5)
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:76:20)
    at async /app/routes/setup.js:401:31 {
  code: 'ERR_INVALID_ARG_TYPE'
}
[DEBUG] [09.01.25, 15:39] OpenAI request sent
[DEBUG] [09.01.25, 15:39] Used tokens: 250, Total tokens: 22201
Failed to analyze document: Error: Invalid response structure: missing tags array or correspondent string
    at OpenAIService.analyzeDocument (/app/services/openaiService.js:154:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /app/routes/setup.js:401:31

itsthejb · 2025-01-09T15:45:00Z

However that wouldn't make any sense anyway, because if I just restore the prompt to the default:

You are a personalized document analyzer. Your task is to analyze documents and extract relevant information.

Analyze the document content and extract the following information into a structured JSON object:

1. title: Create a concise, meaningful title for the document
2. correspondent: Identify the sender/institution but do not include addresses
3. tags: Select up to 4 relevant thematic tags
4. document_date: Extract the document date (format: YYYY-MM-DD)
5. language: Determine the document language (e.g. "de" or "en")

Important rules for the analysis:

For tags:
- FIRST check the existing tags before suggesting new ones
- Don't create a tag that's the same as the correspondent
- Use only relevant categories
- Maximum 4 tags per document, less if sufficient (at least 1)
- Avoid generic or too specific tags
- Use only the most important information for tag creation
- The output language is the one used in the document! IMPORTANT!
- Format the tags and correspondent in title case! IMPORTANT!

For the title:
- Short and concise, NO ADDRESSES
- Contains the most important identification features
- For invoices/orders, mention invoice/order number if available
- The output language is the one used in the document! IMPORTANT!

For the correspondent:
- Identify the sender or institution

For the document date:
- Extract the date of the document
- Use the format YYYY-MM-DD
- If multiple dates are present, use the most relevant one

For the language:
- Determine the document language
- Use language codes like "de" for German or "en" for English
- If the language is not clear, use "und" as a placeholder

... then everything works just fine, with no errors in the log:

[DEBUG] [09.01.25, 15:42] OpenAI request sent
[DEBUG] [09.01.25, 15:42] Used tokens: 336, Total tokens: 22324
2025-01-09T15:43:02: PM2 log: [PM2][WORKER] Reset the restart delay, as app paperless-ai has been up for more than 30000ms

So the issue is definitely caused by not asking for tags in the prompt.

To clarify: the reason I'm trying this is simply that title generation works well, but I'm less sure about tagging right now. It therefore seems reasonable that you should be able to apply whatever information you want by specifying it (or not) in the prompt, hence this report.
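As an illustration of that idea, here is a rough sketch that applies only the properties present in the response and leaves everything else on the document untouched. `buildDocumentUpdate` and `STRING_FIELDS` are hypothetical names; the field names come from the prompt, and how they map onto the Paperless API is left out on purpose.

```javascript
// Hypothetical sketch (not the tool's code): copy only the properties the
// LLM actually returned, so anything the prompt did not ask for is left
// untouched on the document.
const STRING_FIELDS = ['title', 'correspondent', 'document_date', 'language'];

function buildDocumentUpdate(analysis) {
  const update = {};
  for (const field of STRING_FIELDS) {
    const value = analysis[field];
    if (typeof value === 'string' && value.trim() !== '') {
      update[field] = value.trim();
    }
  }
  // Tags are optional too: include them only when at least one was returned.
  if (Array.isArray(analysis.tags) && analysis.tags.length > 0) {
    update.tags = [...analysis.tags];
  }
  return update;
}

// A title-only prompt would yield an update that contains just the title.
console.log(buildDocumentUpdate({ title: 'Invoice 1234', language: 'en' }));
```

An empty update object would then simply mean there is nothing to change, rather than being treated as an error.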

gima84 · 2025-01-09T16:06:23Z

Hi, I have the same reason for tags (do not want them). I am using just 1 tag as a workaround...

.
.
3. tags: Use tag "aiproc".
.
.
.
For tags:

Use tag "aiproc"
.
.

itsthejb · 2025-01-09T16:09:35Z

@gima84 Nice. That appears to be a workaround for the time being
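For anyone wondering why the single-tag workaround passes: judging only from the wording of the error message ("missing tags array or correspondent string"), the check is probably something like the sketch below. This is a guess, since the real code in `openaiService.js` is not shown in this thread; `passesStrictCheck` is a hypothetical name.

```javascript
// Guess at the kind of check behind the error message; the real code in
// openaiService.js is not shown in this thread and may differ.
function passesStrictCheck(response) {
  return Array.isArray(response.tags) && typeof response.correspondent === 'string';
}

const withoutTags = { title: 'Invoice 1234', correspondent: 'ACME', language: 'en' };
const withPlaceholderTag = { ...withoutTags, tags: ['aiproc'] };

console.log(passesStrictCheck(withoutTags));        // false
console.log(passesStrictCheck(withPlaceholderTag)); // true
```

If the check does look like this, any prompt that still yields a `tags` array, even with one fixed entry, would satisfy it.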

clusterzx · 2025-01-13T16:00:35Z

Not an issue in the first place.

itsthejb changed the title ~~Tool makes assumptions about existence of some data (e.g. tags)~~ Tool makes assumptions about receiving some properties back from LLM (e.g. tags) Jan 9, 2025

clusterzx closed this as completed Jan 13, 2025
