-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tsearch: add stemming and stopword elimination for several languages #97677
Conversation
6794c44
to
d287bac
Compare
d287bac
to
abdc96e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Reviewed 9 of 9 files at r1, 2 of 2 files at r2, 25 of 25 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @rafiss)
-- commits
line 24 at r3:
Did you check whether they were updated since then?
pkg/util/tsearch/tsquery.go
line 505 at r3 (raw file):
foundStopwords = true lexeme = "" }
lexeme
should already be "", right?
pkg/util/tsearch/tsquery.go
line 531 at r3 (raw file):
} func cleanupStopword(node *tsNode) (ret *tsNode, lAdd int, rAdd int) {
Can you add a comment to explain what this function is doing?
pkg/util/tsearch/tsquery.go
line 548 at r3 (raw file):
} var llAdd, lrAdd, rlAdd, rrAdd int
These rights and lefts are confusing .. any opportunity for some ascii art to make this clearer?
pkg/util/tsearch/tsvector.go
line 327 at r3 (raw file):
// routines like to_tsvector and to_tsquery. // It can return false in the second parameter to indicate a stopword was found. func TSLexize(config string, token string) (lexeme string, ok bool, err error) {
How about changing ok
to something more self-documenting? e.g. hasStopwords
? (or hasNoStopwords
)
pkg/util/tsearch/tsvector.go
line 362 at r3 (raw file):
term := tsTerm{lexeme: lexeme} term.lexeme = lexeme
I don't think you need this line anymore
pkg/sql/sem/eval/testdata/eval/tsearch
line 346 at r3 (raw file):
to_tsquery('english', 'foo <-> a <-> the <-> bar') ---- 'foo' <3> 'bar'
how about adding a few tests for other languages besides English?
... and simplify tsquery parse loop. This allows ! to follow each other which was mistakenly not allowed before. Release note: None
This commit adds stopword elimination for text search. The languages supported are the same ones that Postgres does. The stopword lists were copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72. Also, add snowball stemming provided by the blevesearch snowball stemming library. Release note (sql change): add stemming and stopword eliminating text search configurations for English, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and Turkish.
abdc96e
to
93e4c78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @rytaft)
Previously, rytaft (Rebecca Taft) wrote…
Did you check whether they were updated since then?
Yes, these are up to date. That commit is a very recent one off of their main branch that I used for this purpose.
pkg/util/tsearch/tsquery.go
line 505 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
lexeme
should already be "", right?
Good point, Done.
pkg/util/tsearch/tsquery.go
line 531 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Can you add a comment to explain what this function is doing?
Done.
pkg/util/tsearch/tsquery.go
line 548 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
These rights and lefts are confusing .. any opportunity for some ascii art to make this clearer?
I augmented the function with the corresponding comments from the Postgres source code, which I think is as good as we're going to get. I'll be honest, I don't completely understand how this logic works. The purpose of this code is to propagate changes to the followedby number (the n
in <n>
, indicating # of tokens separating the left from the right) after removing a stopword. So the llAdd and lrAdd are the left and right adjustments propagated from the left child, likewise for the rlAdd rrAdd. I think the new comments should help.
pkg/util/tsearch/tsvector.go
line 327 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
How about changing
ok
to something more self-documenting? e.g.hasStopwords
? (orhasNoStopwords
)
Done.
pkg/util/tsearch/tsvector.go
line 362 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I don't think you need this line anymore
Done.
pkg/sql/sem/eval/testdata/eval/tsearch
line 346 at r3 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
how about adding a few tests for other languages besides English?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 31 of 31 files at r4, 31 of 31 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)
Woohoo, TFTR! bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Build failed (retrying...): |
Build failed: |
bors r+ |
Build succeeded: |
97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966 and #97677. This commit adds the default_text_search_config variable for the tsearch package, which allows the user to set a default configuration for the text search builtin functions that take configurations, such as to_tsvector and to_tsquery. The default for this configuration variable is 'english', as it is in Postgres. Release note (sql change): add the default_text_search_config variable for compatibility with the single-argument variants of the text search functions to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery, which use the value of default_text_search_config instead of expecting one to be included as in the two-argument variants. The default value of this setting is 'english'. 99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens This commit sets a new 30m timeout for all disk stall roachtests. Previously, the FUSE filesystem variants had no timeout and inherited the default 10h timeout. The other variants had a 20m timeout, which has been observed to be too short due to upreplication latency. Informs #98904. Informs #98886. Epic: None Release note: None 99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373 Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`. This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns. Co-authored-by: [email protected] Fixes: #99000 Epic: None Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view. Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: craig[bot] <[email protected]>
97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966, #97677, and #97685 This commit adds ts_rank, the family of builtins that allow ranking of text search results. The function takes a tsquery and a tsvector and returns a float that indicates how good the match is. The function can be modified by passing in a custom array of weights that matches the text search weights A, B, C, and D, and a bitmask that controls the ranking behavior in various detailed ways. See the excellent Postgres documentation here for details: https://www.postgresql.org/docs/current/textsearch-controls.html Release note (sql change): add the ts_rank function for ranking text search query results 98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two. Epic: None Release note: None 99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds. It also fixes the race possible in `TestCheckConsistencyInconsistent`: - Node 2 is corrupted. - The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint. - Node 1 haven't created the checkpoint, but has started doing so. - Node 2 "fatals" (mocked out in the test) shortly after the check is complete. - Node 1 is still creating its checkpoint, but has probably created the directory by now. - Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range. The test now waits for the moment when all the checkpoint are known to be fully populated. Fixes #81819 Epic: none Release note: none Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
Updates: #41288
Epic: CRDB-22357
First commit is #92966.