fix: resolve OCR search issues for non-Latin scripts #23590
nyakang wants to merge 3 commits into immich-app:main
Conversation
Label error. Requires exactly 1 of: changelog:.*. Found: 🗄️server. A maintainer will add the required label.
```ts
.where(({ eb, val }) => eb.or([
  eb('ocr_search.text', 'ilike', val(`%${options.ocr}%`)),
  eb(sql`unaccent(ocr_search.text)`, 'ilike', sql`'%' || unaccent(${options.ocr}) || '%'`),
])))
```
This is an unindexed search, which is less efficient. Can you instead try just changing `%>>` to `%>`?
Would it be possible to introduce extensions such as pg_bigm, zhparser, or pg_jieba to address the tokenization issue, or to allow custom processing of the text field in ocr_search? I would recommend letting users customize how the text field in ocr_search is processed: for instance, an option to choose a tokenization method and store the tokenized result in the text field. That way, neither the search logic nor the index would need to change. There are numerous tokenization tools available for Chinese, Japanese, and Korean that can be integrated directly after OCR, and different users could choose different tools for the best adaptability.
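As a sketch of the pre-tokenization idea above, the OCR output could be segmented before it is written to `ocr_search.text`, so the existing trigram index and search logic stay unchanged. The segmenter below is a hypothetical stand-in (a real deployment would call the chosen tool, e.g. jieba, at this point):

```python
import re

def segment_cjk(text: str) -> list[str]:
    # Hypothetical stand-in for a real segmenter such as jieba:
    # split each CJK ideograph into its own token, keep other runs whole.
    return re.findall(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+", text)

def prepare_ocr_text(raw: str) -> str:
    """Join tokens with spaces so whitespace-based word similarity
    can match individual words, as it already does for Latin scripts."""
    return " ".join(segment_cjk(raw))

print(prepare_ocr_text("收据2023年"))  # prints "收 据 2023 年"
```

Because the tokenizer runs once at ingest time, swapping in a different tool per user or per language only changes what gets stored, not how it is queried.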
Superseded by #24285
Description
Replace trigram strict word similarity (`%>>`) with dual ILIKE conditions. This fixes search failures for Chinese, Japanese, Thai, and other languages written without whitespace separators.
Fixes #23507
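The failure mode the description refers to can be illustrated with a toy model (these Python functions are illustrative assumptions, not PostgreSQL's actual implementation): whitespace-based word extraction, which `%>>` relies on, sees an unsegmented CJK string as a single opaque "word", while plain substring matching (what `ILIKE '%…%'` does) still finds the query inside it.

```python
def word_similarity_candidates(text: str) -> set[str]:
    """Toy model of whitespace word extraction, the step where
    word-similarity matching breaks down for unsegmented scripts."""
    return set(text.split())

def ilike_contains(text: str, query: str) -> bool:
    """Toy model of ILIKE '%query%': case-insensitive substring test."""
    return query.lower() in text.lower()

ocr_text = "会議資料2024"  # OCR output with no whitespace separators

# Whitespace tokenization yields one opaque "word" covering the whole run,
# so a query for an inner word like "資料" has nothing word-level to match.
assert word_similarity_candidates(ocr_text) == {"会議資料2024"}

# Substring matching still finds the query inside the unsegmented run.
assert ilike_contains(ocr_text, "資料")
```

The trade-off raised in the review stands: the substring form cannot use the trigram index the way the similarity operators can, which is why the reviewer suggested trying `%>` first.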
How Has This Been Tested?