[Feature Request] Wildcard field use only 3-gram to index #17099

@HUSTERGS

Description

Is your feature request related to a problem? Please describe

Currently, the wildcard field indexes 1- to 3-grams of the input text, with the aim of improving query performance when the search contains character sequences shorter than 3. However, after changing the wildcard field to index only 3-grams of the input text, search performance appears largely unchanged (sometimes even better, possibly because of the long posting lists of common 1- and 2-character strings). Meanwhile, the current 1-3 gram indexing costs write throughput (about 5% to 25% lower than with 3-grams only) and nearly doubles the size of the doc file. So maybe we should change the default behavior to index only 3-grams.
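To make the indexing cost concrete, here is a minimal sketch of plain character n-gram generation. It is not the actual wildcard field tokenizer, which may add anchors or deduplicate terms; the `ngrams` helper and the sample log line are hypothetical and only illustrate how many more tokens a 1-3 gram scheme emits per value.

```python
def ngrams(text: str, min_n: int, max_n: int) -> list[str]:
    """Return every contiguous substring of length min_n..max_n (illustrative only)."""
    return [
        text[i : i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


# Hypothetical log line, not taken from the benchmark data.
sample = "GET /images/bg.jpg HTTP/1.0"

one_to_three = ngrams(sample, 1, 3)  # 1-grams, 2-grams and 3-grams
three_only = ngrams(sample, 3, 3)    # 3-grams only

print(len(one_to_three), len(three_only))
```

For a value of length n, this naive scheme emits 3n - 3 tokens for 1-3 grams versus n - 2 for 3-grams only, so the extra 1- and 2-grams dominate the token stream even though they add few distinct terms.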

search performance on http_logs

| query | 1-3 gram (tp / p99) | 3-gram (tp / p99) | hits |
| --- | --- | --- | --- |
| `GET /images/*bg.jpg*HTTP/1.0` | 0.21 / 4789.55 | 0.22 / 4640.17 | 911381 |
| `*T /images/*bg.jpg HTTP/1.0` | 0.21 / 5099.35 | 0.22 / 4637.81 | 911378 |
| `*T*HTTP/1.*` | 5.33 / 370.185 | 5.17 / 218.329 | 81179 |

search performance on treccovid_semantic_search

| query | 1-3 gram (tp / p99) | 3-gram (tp / p99) | hits |
| --- | --- | --- | --- |
| `BACKGROUND*T?cells*` | 3.66 / 329.054 | 3.91 / 375.651 | 146 |
| `*structure of the*protein*` | 1.88 / 698.822 | 1.99 / 668.696 | 277 |
| `BACKGROUND*.` | 1.08 / 989.615 | 1.19 / 1018.81 | 16426 |
| `*p7*HCV*EGCG*` | 122.31 / 7.99482 | 116.16 / 12.0312 | 1 |
| `*v*k*x*j*q*z*` | 1.05 / 1078.75 | 0.54 / 2096.31 | 1538 |
| `*e*t*a*o*i*n*` | 2.68 / 401.6 | 3.33 / 523.562 | 128438 |
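The queries above differ in how much a 3-gram-only index can help. As a rough sketch (an assumption about how such a pre-filter could work, not the actual query rewriting in the wildcard field), the hypothetical helper below splits a pattern on `*` and `?` and collects the 3-grams that every match must contain. Escaped wildcards and anchoring are ignored.

```python
import re


def required_trigrams(pattern: str) -> set[str]:
    """Collect 3-grams from the literal runs between * and ? (illustrative only)."""
    grams: set[str] = set()
    for literal in re.split(r"[*?]", pattern):
        # Literal runs shorter than 3 characters contribute nothing.
        for i in range(len(literal) - 2):
            grams.add(literal[i : i + 3])
    return grams


for query in ("*T*HTTP/1.*", "BACKGROUND*T?cells*", "*v*k*x*j*q*z*"):
    print(query, sorted(required_trigrams(query)))
```

Under this simplification, `*v*k*x*j*q*z*` yields no 3-grams at all, so a 3-gram-only index has nothing to pre-filter with. That may be why it is the one query in the table where the 3-gram variant is clearly slower, though the issue does not confirm this.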

write throughput

| data set | 1-3 gram | 3-gram |
| --- | --- | --- |
| treccovid_semantic_search | 1130.82 | 1469.22 |
| http_logs | 194174 | 206403 |

storage use

Storage use of the largest index in the http_logs dataset, logs-241998 (5 shards), after force merging to one segment:

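| type | total | kdd | doc | pos | tim | fdt | dvd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1-3 gram | 22.1GB | 2.47GB | 9GB | 0.208GB | 1.344GB | 7.04GB | 1.87GB |
| 3-gram | 17.7GB | 2.40GB | 4.69GB | 0.208GB | 1.39GB | 6.95GB | 1.85GB |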
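To put the write throughput and storage numbers above on a common scale, this small sketch computes the relative change when moving from 1-3 gram to 3-gram-only indexing. The values are copied from the tables; the `pct_change` helper and the dictionary keys are illustrative names only.

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (1-3 gram, 3-gram) pairs copied from the tables above.
measurements = {
    "write throughput, treccovid_semantic_search": (1130.82, 1469.22),
    "write throughput, http_logs": (194174, 206403),
    "total size in GB, logs-241998": (22.1, 17.7),
    "doc file size in GB, logs-241998": (9.0, 4.69),
}

changes = {name: pct_change(a, b) for name, (a, b) in measurements.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Relative to the 1-3 gram baseline, this gives roughly +30% and +6% write throughput, about -20% total index size, and about -48% for the doc file.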

Describe the solution you'd like

Change the default implementation to index only 3-grams.

Related component

No response

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Labels

Search, enhancement

Status

✅ Done