Description
Is your feature request related to a problem? Please describe
Currently the wildcard field indexes 1- to 3-grams of the input text, in an attempt to improve query performance when the search pattern contains character sequences shorter than 3 characters. However, after changing the wildcard field to index only 3-grams of the input text, search performance seems to be unchanged (sometimes even better, possibly because of the long posting lists of common 1- and 2-character strings). Compared with 3-gram-only indexing, the current 1-3 gram indexing lowers write throughput noticeably (by about 5% to 25%) and roughly doubles the storage used by the doc file. So maybe we should change the default behavior to index only 3-grams.
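To make the difference in indexed terms concrete, here is a minimal sketch of plain character n-gram generation. It is not the wildcard field's actual tokenizer (which may add anchors or deduplicate terms); the function name `char_ngrams` and the sample value are hypothetical and chosen only for illustration.

```python
def char_ngrams(text: str, min_n: int, max_n: int) -> list[str]:
    """Return every contiguous character n-gram of text for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # A string of length L has L - n + 1 n-grams (none if L < n).
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


sample = "bg.jpg"  # hypothetical field value, for illustration only
one_to_three = char_ngrams(sample, 1, 3)  # 1-, 2- and 3-grams
three_only = char_ngrams(sample, 3, 3)    # 3-grams only

print(len(one_to_three), len(three_only))  # prints: 15 4
```

Under this simplified model a value of length L yields 3L - 3 terms with 1-3 gram indexing but only L - 2 terms with 3-gram indexing, which is in line with the extra indexing work and larger postings reported in this issue.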
Search performance on http_logs
| query | 1-3 gram (tp / p99) | 3-gram (tp / p99) | hits |
|---|---|---|---|
| `GET /images/*bg.jpg*HTTP/1.0` | 0.21 / 4789.55 | 0.22 / 4640.17 | 911381 |
| `*T /images/*bg.jpg HTTP/1.0` | 0.21 / 5099.35 | 0.22 / 4637.81 | 911378 |
| `*T*HTTP/1.*` | 5.33 / 370.185 | 5.17 / 218.329 | 81179 |
Search performance on treccovid_semantic_search
| query | 1-3 gram (tp / p99) | 3-gram (tp / p99) | hits |
|---|---|---|---|
| `BACKGROUND*T?cells*` | 3.66 / 329.054 | 3.91 / 375.651 | 146 |
| `*structure of the*protein*` | 1.88 / 698.822 | 1.99 / 668.696 | 277 |
| `BACKGROUND*.` | 1.08 / 989.615 | 1.19 / 1018.81 | 16426 |
| `*p7*HCV*EGCG*` | 122.31 / 7.99482 | 116.16 / 12.0312 | 1 |
| `*v*k*x*j*q*z*` | 1.05 / 1078.75 | 0.54 / 2096.31 | 1538 |
| `*e*t*a*o*i*n*` | 2.68 / 401.6 | 3.33 / 523.562 | 128438 |
Write throughput
| dataset | 1-3 gram | 3-gram |
|---|---|---|
| treccovid_semantic_search | 1130.82 | 1469.22 |
| http_logs | 194174 | 206403 |
Storage use
Storage use of the largest index in the http_logs dataset, logs-241998 (5 shards), after force merging to one segment:
| type | total | kdd | doc | pos | tim | fdt | dvd |
|---|---|---|---|---|---|---|---|
| 1-3 gram | 22.1GB | 2.47GB | 9GB | 0.208GB | 1.344GB | 7.04GB | 1.87GB |
| 3-gram | 17.7GB | 2.40GB | 4.69GB | 0.208GB | 1.39GB | 6.95GB | 1.85GB |
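The relative differences behind the write-throughput and storage tables above can be recomputed with a short script. This is only a sketch of the arithmetic: the numbers are copied from the tables, and the helper name `pct_diff` is hypothetical.

```python
def pct_diff(value: float, baseline: float) -> float:
    """Percentage by which value differs from baseline."""
    return (value - baseline) / baseline * 100.0


# (1-3 gram, 3-gram) pairs copied from the tables in this issue
write_throughput = {
    "treccovid_semantic_search": (1130.82, 1469.22),
    "http_logs": (194174, 206403),
}
storage_gb = {
    "total": (22.1, 17.7),
    "doc": (9.0, 4.69),
}

# Throughput gained by switching from 1-3 gram to 3-gram only
throughput_gain = {
    name: pct_diff(g3, g13) for name, (g13, g3) in write_throughput.items()
}
# Extra storage used by 1-3 gram relative to 3-gram only
storage_overhead = {
    name: pct_diff(g13, g3) for name, (g13, g3) in storage_gb.items()
}

for name, pct in throughput_gain.items():
    print(f"write throughput, {name}: {pct:+.1f}% with 3-gram only")
for name, pct in storage_overhead.items():
    print(f"storage, {name}: {pct:+.1f}% with 1-3 gram")
```

With these inputs, 3-gram-only indexing writes about 6% to 30% faster, and the 1-3 gram doc file is about 92% larger (about 25% larger for the whole index).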
Describe the solution you'd like
Change the default implementation to index only 3-grams.
Related component
No response
Describe alternatives you've considered
No response
Additional context
No response
