-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: support tsvector, tsquery based full text search #41288
Comments
We have marked this issue as stale because it has been inactive for |
85185: util: add tsearch package r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds a tsearch package which contains text search algorithms for the tsvector/tsquery full text search capability. See https://www.postgresql.org/docs/current/datatype-textsearch.html for details. The package can: 1. parse the tsvector language, which consists of a list of terms surrounded by single quotes, optionally suffixed with a list of positions corresponding to the term's original ordinal position within a document. Each position may optionally come with a "weight", which is a letter A-D (defaulting to D, the lowest weight) that can be used to customize how important a word is in a document. 2. parse the tsquery language, which consists of a simple expression language with terms separating by operators `!`, `&`, `|`, and `<n>` where `n` is a number or `-` which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The `<n>` operator returns true if the left argument is `n` tokens to the left of the argument on the right in the document expressed by the vector. 3. evaluate a tsquery against a tsvector, which returns a boolean indicating whether the vector satisfied the query. So far, this package is standalone and not hooked up to SQL at all. Release note: None Epic: None 89650: allocatorimpl: Prioritize non-voters in voter additions r=kvoli a=KaiSun314 Fix #63810 Previously, when adding a voter via the allocator, we would not prioritize stores with non-voters when selecting candidate stores. This was inadequate because selecting a store without a non-voter would require sending a snapshot over the WAN in order to add a voter onto that store. On the other hand, selecting a store with a non-voter would only require promoting that non-voter to a voter, no snapshot needs to be sent over the WAN. To address this, this patch determines if candidate stores have a non-voter replica and prioritize this status when sorting candidate stores (priority lower than balance score, but higher than range count). By prioritizing selecting a store with a non-voter, we can reduce the number of snapshots that need to be sent over the WAN. Release note (ops change): We prioritized non-voters in voter additions, meaning that when selecting a store to add a voter on (in the allocator), we would prioritize candidate stores that contain a non-voter replica higher. This allows us to reduce the number of snapshots that need to be sent over the WAN. Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Kai Sun <[email protected]>
90842: sql: add tsvector and tsquery types to SQL r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of `@@` (the tsquery match operator) by delegating to the tsearch.eval package. Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details. Epic: None 92611: upgrades: make a couple of permanent upgrades idempotent r=andreimatei a=andreimatei Some of the permanent upgrades (opting into telemetry and initializing the cluster secret) were not idempotent; this patch makes them be. In particular, this is important since these upgrades will run on 23.1 cluster that are being upgraded from 22.2 as they've been turned from startupmigrations to upgrades in #91627. For the telemetry one, there's no "natural" way to make it idempotent, so I've added a check for the old startupmigration key. Release note: None Epic: None 92620: upgrademanager: clarify job running code r=andreimatei a=andreimatei This patch makes the code clearer around running upgrade in jobs. Before this patch, the upgrade manager was looking for existing jobs corresponding to upgrades and, if it found them, it called jobsRegistry.Run() on them - just like it did for newly-created jobs. The behavior of Run() for jobs that are not already claimed by the registry is a bit under-specified. I believe it ended up printing an error [1] and then correctly waiting for the job to finish. This patch no longer calls Run() for existing jobs; instead, it calls WaitForJobs(), which is more suggestive. [1] https://github.com/cockroachdb/cockroach/blob/b2a6b80920324bd6b31cba9a6f622961979de600/pkg/jobs/adopt.go#L255 Release note: None Epic: None Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Andrei Matei <[email protected]>
90657: builtins: fix pg_function_is_visible to work with UDFs r=e-mbrown a=rafiss fixes #89546 ### builtins: fix pg_function_is_visible to work with UDFs Release note (bug fix): The pg_function_is_visible function now correctly reports visibility based on the functions that are visible on the current search_path. ### builtins: use pg_type to implement pg_type_is_visible This saves us from having to maintain the old code, which I personally found to be a bit hard to work with. Using the internal executor like this should be faster than it used to be, since there is an index on pg_type(oid) that will avoid any lookups for builtin types. ### sessiondata: consolidate logic for searching the search_path 92957: sql: enable storage for tsvector/tsquery r=jordanlewis a=jordanlewis This commit adds the ability to store tsvector and tsquery data in ordinary, unindexed columns. Updates #41288 This functionality is gated behind the 23.1 version. Epic: CRDB-22357 Release note (sql change): permit non-indexed storage of tsvector and tsquery datatypes Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Jordan Lewis <[email protected]>
92966: builtins: add to_tsvector {phrase,plain,}to_tsquery r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively. Co-authored-by: Jordan Lewis <[email protected]>
97677: tsearch: add stemming and stopword elimination for several languages r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 First commit is #92966. This commit adds stopword elimination for text search. The languages supported are the same ones that Postgres does. The stopword lists were copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72. Also, add snowball stemming provided by the blevesearch snowball stemming library. Release note (sql change): add stemming and stopword eliminating text search configurations for English, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and Turkish. 98778: cli: unskip test tenant zip test r=dhartunian a=aadityasondhi This patch unskips and re-records the datadriven `TestTenantZip` as it was fixed in #96553, but was not unskipped or recorded. The test was run locally using `--stress` and did not flake: ``` 101 runs so far, 0 failures, over 5m0s ``` Fixes #87141 Release note: None 98830: sqlinstance: add `binary_version` column to instances table r=knz,JeffSwenson a=healthy-pod This code change adds a `binary_version` column to the instances table. This is done by adding the column to the bootstrap schema for system.sql_instances, and piggy-backing on the existing code for the V23_1_SystemRbrReadNew migration that overwrites the live schema for this table by the bootstrap copy. This redefinition of the meaning of the V23_1_SystemRbrReadNew is backward-incompatible and is only possible because this commit is merged before the v23.1 branch is cut. Release note: None Epic: [CRDB-20860](https://cockroachlabs.atlassian.net/browse/CRDB-20860) 99100: kvserver: skip `TestReplicateQueueExpirationLeasesOnly` under deadlock r=erikgrinaker a=erikgrinaker I give up. Resolves #99015. Epic: none Release note: None 99106: server: avoid some log spam r=erikgrinaker a=knz This change removes the following log spam: ``` could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER ``` Epic: CRDB-23559 Release note: None Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: healthy-pod <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>
97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966 and #97677. This commit adds the default_text_search_config variable for the tsearch package, which allows the user to set a default configuration for the text search builtin functions that take configurations, such as to_tsvector and to_tsquery. The default for this configuration variable is 'english', as it is in Postgres. Release note (sql change): add the default_text_search_config variable for compatibility with the single-argument variants of the text search functions to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery, which use the value of default_text_search_config instead of expecting one to be included as in the two-argument variants. The default value of this setting is 'english'. 99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens This commit sets a new 30m timeout for all disk stall roachtests. Previously, the FUSE filesystem variants had no timeout and inherited the default 10h timeout. The other variants had a 20m timeout, which has been observed to be too short due to upreplication latency. Informs #98904. Informs #98886. Epic: None Release note: None 99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373 Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`. This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns. Co-authored-by: [email protected] Fixes: #99000 Epic: None Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view. Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: craig[bot] <[email protected]>
97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966, #97677, and #97685 This commit adds ts_rank, the family of builtins that allow ranking of text search results. The function takes a tsquery and a tsvector and returns a float that indicates how good the match is. The function can be modified by passing in a custom array of weights that matches the text search weights A, B, C, and D, and a bitmask that controls the ranking behavior in various detailed ways. See the excellent Postgres documentation here for details: https://www.postgresql.org/docs/current/textsearch-controls.html Release note (sql change): add the ts_rank function for ranking text search query results 98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two. Epic: None Release note: None 99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds. It also fixes the race possible in `TestCheckConsistencyInconsistent`: - Node 2 is corrupted. - The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint. - Node 1 haven't created the checkpoint, but has started doing so. - Node 2 "fatals" (mocked out in the test) shortly after the check is complete. - Node 1 is still creating its checkpoint, but has probably created the directory by now. - Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range. The test now waits for the moment when all the checkpoint are known to be fully populated. Fixes #81819 Epic: none Release note: none Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
@fqazi unsupported function |
@giangpham712 Can you confirm if we need anything else here, this might be one of the easier ones to address |
@fqazi Related to full text search, we have multiple issues Unsupported features:
Other issues:
returns error |
@giangpham712 Let's skip tests with tsvector for now, let me follow up with the queries team for the issue you found |
Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288
Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288
Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288
Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288
Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288
Postgres has some advanced text search types and features that Django uses that we we don't support --
https://www.postgresql.org/docs/10/functions-textsearch.html
https://www.postgresql.org/docs/10/datatype-textsearch.html
This includes the
tsvector
andregconfig
types and operations on them.tsvector
datastructuretsquery
datastructureto_tsvector
to_tsquery
phraseto_tsquery
plainto_tsquery
websearch_to_tsquery
tsquery_phrase
@@
ts_rank
ts_rank_cd
strip(tsvector)
setweight(tsvector)
array_to_tsvector(text[])
numnode(tsquery)
querytree(tsquery)
ts_delete(tsvector, text)
ts_headline
ts_rewrite
tsvector_to_array(tsvector)
jsonb_to_tsvector
/json_to_tsvector
(sql: support jsonb_to_tsvector #109955)unnest(tsvector)
And SQL operators for tsvector:
@@
(matching a tsvector and tsquery)Jira issue: CRDB-5464
Epic: CRDB-22357
The text was updated successfully, but these errors were encountered: