Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsearch: add stemming and stopword elimination for several languages #97677

Merged
merged 2 commits into from
Mar 21, 2023

Conversation

jordanlewis
Copy link
Member

@jordanlewis jordanlewis commented Feb 25, 2023

Updates: #41288
Epic: CRDB-22357
First commit is #92966.

This commit adds stopword elimination for text search. The languages
supported are the same ones that Postgres does. The stopword lists were
copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72.

Also, add snowball stemming provided by the blevesearch snowball
stemming library.

Release note (sql change): add stemming and stopword eliminating text
search configurations for English, Danish, Dutch, Finnish, French,
German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish,
Swedish, and Turkish.

@jordanlewis jordanlewis requested review from rafiss, rytaft and a team February 25, 2023 17:45
@jordanlewis jordanlewis requested a review from a team as a code owner February 25, 2023 17:45
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Reviewed 9 of 9 files at r1, 2 of 2 files at r2, 25 of 25 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @rafiss)


-- commits line 24 at r3:
Did you check whether they were updated since then?


pkg/util/tsearch/tsquery.go line 505 at r3 (raw file):

				foundStopwords = true
				lexeme = ""
			}

lexeme should already be "", right?


pkg/util/tsearch/tsquery.go line 531 at r3 (raw file):

}

func cleanupStopword(node *tsNode) (ret *tsNode, lAdd int, rAdd int) {

Can you add a comment to explain what this function is doing?


pkg/util/tsearch/tsquery.go line 548 at r3 (raw file):

	}

	var llAdd, lrAdd, rlAdd, rrAdd int

These rights and lefts are confusing .. any opportunity for some ascii art to make this clearer?


pkg/util/tsearch/tsvector.go line 327 at r3 (raw file):

// routines like to_tsvector and to_tsquery.
// It can return false in the second parameter to indicate a stopword was found.
func TSLexize(config string, token string) (lexeme string, ok bool, err error) {

How about changing ok to something more self-documenting? e.g. hasStopwords? (or hasNoStopwords)


pkg/util/tsearch/tsvector.go line 362 at r3 (raw file):

		term := tsTerm{lexeme: lexeme}
		term.lexeme = lexeme

I don't think you need this line anymore


pkg/sql/sem/eval/testdata/eval/tsearch line 346 at r3 (raw file):

to_tsquery('english', 'foo <-> a <-> the <-> bar')
----
'foo' <3> 'bar'

how about adding a few tests for other languages besides English?

... and simplify tsquery parse loop.

This allows ! to follow each other which was mistakenly not allowed
before.

Release note: None
This commit adds stopword elimination for text search. The languages
supported are the same ones that Postgres does. The stopword lists were
copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72.

Also, add snowball stemming provided by the blevesearch snowball
stemming library.

Release note (sql change): add stemming and stopword eliminating text
search configurations for English, Danish, Dutch, Finnish, French,
German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish,
Swedish, and Turkish.
Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @rytaft)


-- commits line 24 at r3:

Previously, rytaft (Rebecca Taft) wrote…

Did you check whether they were updated since then?

Yes, these are up to date. That commit is a very recent one off of their main branch that I used for this purpose.


pkg/util/tsearch/tsquery.go line 505 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

lexeme should already be "", right?

Good point, Done.


pkg/util/tsearch/tsquery.go line 531 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Can you add a comment to explain what this function is doing?

Done.


pkg/util/tsearch/tsquery.go line 548 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

These rights and lefts are confusing .. any opportunity for some ascii art to make this clearer?

I augmented the function with the corresponding comments from the Postgres source code, which I think is as good as we're going to get. I'll be honest, I don't completely understand how this logic works. The purpose of this code is to propagate changes to the followedby number (the n in <n>, indicating # of tokens separating the left from the right) after removing a stopword. So the llAdd and lrAdd are the left and right adjustments propagated from the left child, likewise for the rlAdd rrAdd. I think the new comments should help.


pkg/util/tsearch/tsvector.go line 327 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

How about changing ok to something more self-documenting? e.g. hasStopwords? (or hasNoStopwords)

Done.


pkg/util/tsearch/tsvector.go line 362 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't think you need this line anymore

Done.


pkg/sql/sem/eval/testdata/eval/tsearch line 346 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

how about adding a few tests for other languages besides English?

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks!

Reviewed 31 of 31 files at r4, 31 of 31 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)

@jordanlewis
Copy link
Member Author

Woohoo, TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 20, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 21, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 21, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 21, 2023

Build failed:

@jordanlewis
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 21, 2023

Build succeeded:

@craig craig bot merged commit 5f5cf26 into cockroachdb:master Mar 21, 2023
craig bot pushed a commit that referenced this pull request Mar 21, 2023
97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357

All but the last commit are from #92966 and #97677.


    This commit adds the default_text_search_config variable for the tsearch
    package, which allows the user to set a default configuration for the
    text search builtin functions that take configurations, such as
    to_tsvector and to_tsquery. The default for this configuration variable
    is 'english', as it is in Postgres.

    Release note (sql change): add the default_text_search_config variable
    for compatibility with the single-argument variants of the text search
    functions to_tsvector, to_tsquery, phraseto_tsquery, and
    plainto_tsquery, which use the value of default_text_search_config
    instead of expecting one to be included as in the two-argument variants.
    The default value of this setting is 'english'.

99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens

This commit sets a new 30m timeout for all disk stall roachtests. Previously,
the FUSE filesystem variants had no timeout and inherited the default 10h
timeout. The other variants had a 20m timeout, which has been observed to be
too short due to upreplication latency.

Informs #98904.
Informs #98886.
Epic: None
Release note: None


99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373

Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`.

This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns.

Co-authored-by: [email protected]

Fixes: #99000
Epic: None

Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view.

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: craig[bot] <[email protected]>
craig bot pushed a commit that referenced this pull request Mar 22, 2023
97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357
All but the last commit are from #92966, #97677, and #97685

    This commit adds ts_rank, the family of builtins that allow ranking of
    text search results. The function takes a tsquery and a tsvector and
    returns a float that indicates how good the match is. The function can
    be modified by passing in a custom array of weights that matches the
    text search weights A, B, C, and D, and a bitmask that controls the
    ranking behavior in various detailed ways.

    See the excellent Postgres documentation here for details:
    https://www.postgresql.org/docs/current/textsearch-controls.html

    Release note (sql change): add the ts_rank function for ranking text
    search query results

98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens

The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two.

Epic: None
Release note: None

99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov

This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds.

It also fixes the race possible in `TestCheckConsistencyInconsistent`:
- Node 2 is corrupted.
- The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint.
- Node 1 haven't created the checkpoint, but has started doing so.
- Node 2 "fatals" (mocked out in the test) shortly after the check is complete.
- Node 1 is still creating its checkpoint, but has probably created the directory by now.
- Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range.

The test now waits for the moment when all the checkpoint are known to be fully populated.

Fixes #81819
Epic: none
Release note: none

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
@jordanlewis jordanlewis deleted the stopword-elim branch March 22, 2023 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants