fix: resolve OCR search issues for non-Latin scripts #23590
nyakang wants to merge 3 commits into immich-app:main
Conversation
Label error. Requires exactly 1 of: changelog:.*. Found: 🗄️server. A maintainer will add the required label.
```ts
.where(({ eb, val }) => eb.or([
  eb('ocr_search.text', 'ilike', val(`%${options.ocr}%`)),
  eb(sql`unaccent(ocr_search.text)`, 'ilike', sql`'%' || unaccent(${options.ocr}) || '%'`),
])))
```
This is an unindexed search, which is less efficient. Can you instead try just changing `%>>` to `%>`?
Would it be possible to introduce extensions such as pg_bigm, zhparser, or pg_jieba to address the tokenization issue, or to allow custom processing of the text field in ocr_search? I would recommend letting users customize how the text field in ocr_search is processed: for instance, an option to choose a tokenization method and store the tokenized result in the text field. That way, neither the search logic nor the index would need to change. There are numerous tokenization tools available for Chinese, Japanese, and Korean that can be integrated directly after OCR, and different users could choose different tools for the best adaptability.
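As a sketch of the pre-tokenization idea above, the OCR output could be segmented before it is written to `ocr_search.text`, so the existing trigram index and search logic stay unchanged. The segmenter below is a hypothetical stand-in (a real deployment would call the chosen tool, e.g. jieba, at this point):

```python
import re

def segment_cjk(text: str) -> list[str]:
    # Hypothetical stand-in for a real segmenter such as jieba:
    # split each CJK ideograph into its own token, keep other runs whole.
    return re.findall(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+", text)

def prepare_ocr_text(raw: str) -> str:
    """Join tokens with spaces so whitespace-based word similarity
    can match individual words, as it already does for Latin scripts."""
    return " ".join(segment_cjk(raw))

print(prepare_ocr_text("收据2023年"))  # prints "收 据 2023 年"
```

Because the tokenizer runs once at ingest time, swapping in a different tool per user or per language only changes what gets stored, not how it is queried.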
Superseded by #24285
Description
Replace trigram strict word similarity (`%>>`) with dual ILIKE conditions. This fixes search failures for Chinese, Japanese, Thai, and other languages written without whitespace separators.
Fixes #23507
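The failure mode the description refers to can be illustrated with a toy model (these Python functions are illustrative assumptions, not PostgreSQL's actual implementation): whitespace-based word extraction, which `%>>` relies on, sees an unsegmented CJK string as a single opaque "word", while plain substring matching (what `ILIKE '%…%'` does) still finds the query inside it.

```python
def word_similarity_candidates(text: str) -> set[str]:
    """Toy model of whitespace word extraction, the step where
    word-similarity matching breaks down for unsegmented scripts."""
    return set(text.split())

def ilike_contains(text: str, query: str) -> bool:
    """Toy model of ILIKE '%query%': case-insensitive substring test."""
    return query.lower() in text.lower()

ocr_text = "会議資料2024"  # OCR output with no whitespace separators

# Whitespace tokenization yields one opaque "word" covering the whole run,
# so a query for an inner word like "資料" has nothing word-level to match.
assert word_similarity_candidates(ocr_text) == {"会議資料2024"}

# Substring matching still finds the query inside the unsegmented run.
assert ilike_contains(ocr_text, "資料")
```

The trade-off raised in the review stands: the substring form cannot use the trigram index the way the similarity operators can, which is why the reviewer suggested trying `%>` first.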
How Has This Been Tested?