sql: support tsvector, tsquery based full text search #41288

rohany · 2019-10-03T17:28:35Z

github-actions · 2021-06-04T19:50:19Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

85185: util: add tsearch package r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds a tsearch package which contains text search algorithms for the tsvector/tsquery full text search capability. See https://www.postgresql.org/docs/current/datatype-textsearch.html for details. The package can: 1. parse the tsvector language, which consists of a list of terms surrounded by single quotes, optionally suffixed with a list of positions corresponding to the term's original ordinal position within a document. Each position may optionally come with a "weight", which is a letter A-D (defaulting to D, the lowest weight) that can be used to customize how important a word is in a document. 2. parse the tsquery language, which consists of a simple expression language with terms separating by operators `!`, `&`, `|`, and `<n>` where `n` is a number or `-` which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The `<n>` operator returns true if the left argument is `n` tokens to the left of the argument on the right in the document expressed by the vector. 3. evaluate a tsquery against a tsvector, which returns a boolean indicating whether the vector satisfied the query. So far, this package is standalone and not hooked up to SQL at all. Release note: None Epic: None 89650: allocatorimpl: Prioritize non-voters in voter additions r=kvoli a=KaiSun314 Fix #63810 Previously, when adding a voter via the allocator, we would not prioritize stores with non-voters when selecting candidate stores. This was inadequate because selecting a store without a non-voter would require sending a snapshot over the WAN in order to add a voter onto that store. On the other hand, selecting a store with a non-voter would only require promoting that non-voter to a voter, no snapshot needs to be sent over the WAN. To address this, this patch determines if candidate stores have a non-voter replica and prioritize this status when sorting candidate stores (priority lower than balance score, but higher than range count). By prioritizing selecting a store with a non-voter, we can reduce the number of snapshots that need to be sent over the WAN. Release note (ops change): We prioritized non-voters in voter additions, meaning that when selecting a store to add a voter on (in the allocator), we would prioritize candidate stores that contain a non-voter replica higher. This allows us to reduce the number of snapshots that need to be sent over the WAN. Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Kai Sun <[email protected]>

90842: sql: add tsvector and tsquery types to SQL r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of `@@` (the tsquery match operator) by delegating to the tsearch.eval package. Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details. Epic: None 92611: upgrades: make a couple of permanent upgrades idempotent r=andreimatei a=andreimatei Some of the permanent upgrades (opting into telemetry and initializing the cluster secret) were not idempotent; this patch makes them be. In particular, this is important since these upgrades will run on 23.1 cluster that are being upgraded from 22.2 as they've been turned from startupmigrations to upgrades in #91627. For the telemetry one, there's no "natural" way to make it idempotent, so I've added a check for the old startupmigration key. Release note: None Epic: None 92620: upgrademanager: clarify job running code r=andreimatei a=andreimatei This patch makes the code clearer around running upgrade in jobs. Before this patch, the upgrade manager was looking for existing jobs corresponding to upgrades and, if it found them, it called jobsRegistry.Run() on them - just like it did for newly-created jobs. The behavior of Run() for jobs that are not already claimed by the registry is a bit under-specified. I believe it ended up printing an error [1] and then correctly waiting for the job to finish. This patch no longer calls Run() for existing jobs; instead, it calls WaitForJobs(), which is more suggestive. [1] https://github.com/cockroachdb/cockroach/blob/b2a6b80920324bd6b31cba9a6f622961979de600/pkg/jobs/adopt.go#L255 Release note: None Epic: None Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Andrei Matei <[email protected]>

90657: builtins: fix pg_function_is_visible to work with UDFs r=e-mbrown a=rafiss fixes #89546 ### builtins: fix pg_function_is_visible to work with UDFs Release note (bug fix): The pg_function_is_visible function now correctly reports visibility based on the functions that are visible on the current search_path. ### builtins: use pg_type to implement pg_type_is_visible This saves us from having to maintain the old code, which I personally found to be a bit hard to work with. Using the internal executor like this should be faster than it used to be, since there is an index on pg_type(oid) that will avoid any lookups for builtin types. ### sessiondata: consolidate logic for searching the search_path 92957: sql: enable storage for tsvector/tsquery r=jordanlewis a=jordanlewis This commit adds the ability to store tsvector and tsquery data in ordinary, unindexed columns. Updates #41288 This functionality is gated behind the 23.1 version. Epic: CRDB-22357 Release note (sql change): permit non-indexed storage of tsvector and tsquery datatypes Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Jordan Lewis <[email protected]>

92966: builtins: add to_tsvector {phrase,plain,}to_tsquery r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively. Co-authored-by: Jordan Lewis <[email protected]>

97677: tsearch: add stemming and stopword elimination for several languages r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 First commit is #92966. This commit adds stopword elimination for text search. The languages supported are the same ones that Postgres does. The stopword lists were copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72. Also, add snowball stemming provided by the blevesearch snowball stemming library. Release note (sql change): add stemming and stopword eliminating text search configurations for English, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and Turkish. 98778: cli: unskip test tenant zip test r=dhartunian a=aadityasondhi This patch unskips and re-records the datadriven `TestTenantZip` as it was fixed in #96553, but was not unskipped or recorded. The test was run locally using `--stress` and did not flake: ``` 101 runs so far, 0 failures, over 5m0s ``` Fixes #87141 Release note: None 98830: sqlinstance: add `binary_version` column to instances table r=knz,JeffSwenson a=healthy-pod This code change adds a `binary_version` column to the instances table. This is done by adding the column to the bootstrap schema for system.sql_instances, and piggy-backing on the existing code for the V23_1_SystemRbrReadNew migration that overwrites the live schema for this table by the bootstrap copy. This redefinition of the meaning of the V23_1_SystemRbrReadNew is backward-incompatible and is only possible because this commit is merged before the v23.1 branch is cut. Release note: None Epic: [CRDB-20860](https://cockroachlabs.atlassian.net/browse/CRDB-20860) 99100: kvserver: skip `TestReplicateQueueExpirationLeasesOnly` under deadlock r=erikgrinaker a=erikgrinaker I give up. Resolves #99015. Epic: none Release note: None 99106: server: avoid some log spam r=erikgrinaker a=knz This change removes the following log spam: ``` could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER ``` Epic: CRDB-23559 Release note: None Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: healthy-pod <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>

97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966 and #97677. This commit adds the default_text_search_config variable for the tsearch package, which allows the user to set a default configuration for the text search builtin functions that take configurations, such as to_tsvector and to_tsquery. The default for this configuration variable is 'english', as it is in Postgres. Release note (sql change): add the default_text_search_config variable for compatibility with the single-argument variants of the text search functions to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery, which use the value of default_text_search_config instead of expecting one to be included as in the two-argument variants. The default value of this setting is 'english'. 99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens This commit sets a new 30m timeout for all disk stall roachtests. Previously, the FUSE filesystem variants had no timeout and inherited the default 10h timeout. The other variants had a 20m timeout, which has been observed to be too short due to upreplication latency. Informs #98904. Informs #98886. Epic: None Release note: None 99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373 Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`. This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns. Co-authored-by: [email protected] Fixes: #99000 Epic: None Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view. Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: craig[bot] <[email protected]>

97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966, #97677, and #97685 This commit adds ts_rank, the family of builtins that allow ranking of text search results. The function takes a tsquery and a tsvector and returns a float that indicates how good the match is. The function can be modified by passing in a custom array of weights that matches the text search weights A, B, C, and D, and a bitmask that controls the ranking behavior in various detailed ways. See the excellent Postgres documentation here for details: https://www.postgresql.org/docs/current/textsearch-controls.html Release note (sql change): add the ts_rank function for ranking text search query results 98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two. Epic: None Release note: None 99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds. It also fixes the race possible in `TestCheckConsistencyInconsistent`: - Node 2 is corrupted. - The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint. - Node 1 haven't created the checkpoint, but has started doing so. - Node 2 "fatals" (mocked out in the test) shortly after the check is complete. - Node 1 is still creating its checkpoint, but has probably created the directory by now. - Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range. The test now waits for the moment when all the checkpoint are known to be fully populated. Fixes #81819 Epic: none Release note: none Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>

giangpham712 · 2023-08-10T11:58:23Z

@fqazi unsupported function array_to_tsvector affects efcore.pg tests

fqazi · 2023-08-10T14:11:12Z

@giangpham712 Can you confirm if we need anything else here, this might be one of the easier ones to address

giangpham712 · 2023-08-13T13:19:45Z

@fqazi Related to full text search, we have multiple issues

Unsupported features:

array_to_tsvector
setweight
ts_delete
ts_headline
regconfig
regdictionary
length(tsvector)
numnode
querytree
ts_rank_cd
ts_rewrite
tsquery_phrase
strip
websearch_to_tsquery
unaccent(string, varchar)

Other issues:

unsupported comparison operator: to_tsvector("ContactTitle") @@ to_tsquery('owner') || to_tsquery('foo'): unsupported 
binary operator: <tsquery> || <tsquery>
unsupported comparison operator: <tsquery> && <tsquery>
unsupported comparison operator: <tsquery> @> <tsquery>
unsupported comparison operator: <tsquery> <@ <tsquery>
unsupported binary operator: <tsquery> || <tsquery>
unsupported binary operator: <tsvector> || <tsvector>

SELECT !!to_tsquery('a & b')
FROM "Customers" AS c
LIMIT 1

returns error plainto_tsquery(): text-search query doesn't contain lexemes: a

fqazi · 2023-08-14T15:04:34Z

@giangpham712 Let's skip tests with tsvector for now, let me follow up with the queries team for the issue you found

Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288

rohany added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-pgcompat Semantic compatibility with PostgreSQL labels Oct 3, 2019

github-actions bot added the no-issue-activity label Jun 4, 2021

knz removed the no-issue-activity label Jun 5, 2021

jlinder added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 16, 2021

jordanlewis changed the title ~~sql: Support advanced text search like postgres~~ sql: support tsvector, tsquery based full text search Jul 1, 2022

jordanlewis mentioned this issue Jul 7, 2022

sql: Full Text Search #7821

Open

2 tasks

jordanlewis mentioned this issue Jul 25, 2022

util,sql: add tsvector and tsquery package and sql handling #83736

Closed

jordanlewis mentioned this issue Oct 27, 2022

util: add tsearch package #85185

Merged

jordanlewis mentioned this issue Oct 28, 2022

sql: add tsvector and tsquery types to SQL #90842

Merged

jordanlewis mentioned this issue Nov 18, 2022

sql: support ordering by TSVector, TSQuery, and VECTOR #92165

Open

jordanlewis mentioned this issue Dec 2, 2022

builtins: add to_tsvector {phrase,plain,}to_tsquery #92966

Merged

jordanlewis mentioned this issue Dec 13, 2022

sql: enable storage for tsvector/tsquery #92957

Merged

exalate-issue-sync bot assigned ghost and jordanlewis and unassigned ghost Dec 14, 2022

jordanlewis mentioned this issue Mar 23, 2023

release-23.1: sql,tsearch: add default_text_search_config and ts_rank #99323

Merged

rafiss added the A-tools-efcore label Jul 21, 2023

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to New Backlog in SQL Queries Jul 24, 2023

jordanlewis removed their assignment Aug 10, 2023

rafiss mentioned this issue Sep 5, 2023

sql: support jsonb_to_tsvector #109955

Open

exalate-issue-sync bot closed this as completed May 21, 2024

github-project-automation bot moved this from Backlog to Done in SQL Queries May 21, 2024

ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024

Update miniflux to use cnpg

9eeb6cf

Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288

ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024

Update miniflux to use cnpg

58dbcde

Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288

ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024

Update miniflux to use cnpg

c8f79f4

Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288

ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024

Update miniflux to use cnpg

4adcd87

Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288

ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024

Update miniflux to use cnpg

b5eac12

Can't use cockroachdb because of lack of `setweight`. See: cockroachdb/cockroach#41288

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: support tsvector, tsquery based full text search #41288

sql: support tsvector, tsquery based full text search #41288

rohany commented Oct 3, 2019 •

edited by rafiss

Loading

github-actions bot commented Jun 4, 2021

giangpham712 commented Aug 10, 2023

fqazi commented Aug 10, 2023

giangpham712 commented Aug 13, 2023

fqazi commented Aug 14, 2023

sql: support tsvector, tsquery based full text search #41288

sql: support tsvector, tsquery based full text search #41288

Comments

rohany commented Oct 3, 2019 • edited by rafiss Loading

github-actions bot commented Jun 4, 2021

giangpham712 commented Aug 10, 2023

fqazi commented Aug 10, 2023

giangpham712 commented Aug 13, 2023

fqazi commented Aug 14, 2023

rohany commented Oct 3, 2019 •

edited by rafiss

Loading