fix: resolve OCR search issues for non-Latin scripts#23590

Closed
nyakang wants to merge 3 commits into immich-app:main from nyakang:main

Conversation

@nyakang (Contributor) commented Nov 4, 2025

Description

Replace the trigram strict word similarity operator (%>>) with two ILIKE conditions combined with OR:

  • Direct match: matches the query exactly as typed, accents included
  • unaccent match: accent-insensitive, so "cafe" also finds "café"

This fixes search failures for Chinese, Japanese, Thai, and other scripts written without whitespace word separators, where strict word similarity cannot locate word boundaries.

Fixes #23507
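The matching behavior the two conditions produce can be sketched outside SQL. The sketch below is illustrative TypeScript only (the actual filtering happens in PostgreSQL); the `unaccent` helper approximates PostgreSQL's unaccent() via Unicode NFD decomposition, and `ilikeContains` stands in for `ILIKE '%…%'`. All function names here are invented for the sketch:

```typescript
// Approximate PostgreSQL's unaccent(): decompose to NFD, then strip
// combining marks (e.g. 'café' -> 'cafe').
function unaccent(s: string): string {
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

// Case-insensitive substring test, standing in for `ILIKE '%needle%'`.
function ilikeContains(haystack: string, needle: string): boolean {
  return haystack.toLowerCase().includes(needle.toLowerCase());
}

// Dual condition from the PR: direct match OR accent-insensitive match.
// Plain substring matching needs no word boundaries, which is why it
// also works for CJK and Thai text.
function ocrMatches(text: string, query: string): boolean {
  return ilikeContains(text, query) || ilikeContains(unaccent(text), unaccent(query));
}
```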

How Has This Been Tested?

  • Latin text search (English, Spanish with accents)
  • Non-Latin text search (Chinese, Japanese)
  • Accent-insensitive search ("cafe" matches "café")

immich-push-o-matic bot commented Nov 4, 2025

Label error. Requires exactly 1 of: changelog:.*. Found: 🗄️server. A maintainer will add the required label.

Comment on lines +394 to +397
.where(({ eb, val }) =>
  eb.or([
    eb('ocr_search.text', 'ilike', val(`%${options.ocr}%`)),
    eb(sql`unaccent(ocr_search.text)`, 'ilike', sql`'%' || unaccent(${options.ocr}) || '%'`),
  ]),
))
Member


This is an unindexed search that's less efficient. Can you instead try just changing %>> to %>?
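For context, a sketch of the two pg_trgm operators being compared (the table and column come from the diff above; exact index behavior should be checked against the pg_trgm documentation). Both operator families can use the existing trigram index, unlike an ILIKE over `unaccent(text)`, which would need a separate expression index:

```sql
-- %>> uses strict_word_similarity(): the query must align with whole
-- words bounded by non-word characters, which CJK text lacks.
SELECT * FROM ocr_search WHERE text %>> 'query';

-- %> uses word_similarity(): the query may align with any contiguous
-- extent of trigrams, so it does not depend on word boundaries.
SELECT * FROM ocr_search WHERE text %> 'query';
```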

loveyu commented Nov 5, 2025

Would it be possible to consider introducing extensions such as pg_bigm, zhparser, or pg_jieba to address the tokenization issue, or allow custom processing of the text field in ocr_search?

I would recommend enabling users to customize the processing of the text field in ocr_search. For instance, adding an option for users to choose a tokenization method and store the tokenized results in the text field. This way, the search logic would not need to change, nor would the index.

There are numerous tokenization tools available for Chinese, Japanese, and Korean, which can be integrated directly after OCR. Different users can also choose different tools to achieve the best adaptability.

@mertalev (Member)

Superseded by #24285

@mertalev mertalev closed this Nov 30, 2025


Development

Successfully merging this pull request may close these issues.

OCR does not support fuzzy search in Chinese, but it can achieve fuzzy search in English.

3 participants