Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: support tsvector, tsquery based full text search #41288

Closed
13 of 41 tasks
rohany opened this issue Oct 3, 2019 · 5 comments
Closed
13 of 41 tasks

sql: support tsvector, tsquery based full text search #41288

rohany opened this issue Oct 3, 2019 · 5 comments
Labels
A-sql-pgcompat Semantic compatibility with PostgreSQL A-tools-efcore C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@rohany
Copy link
Contributor

rohany commented Oct 3, 2019

Postgres has some advanced text search types and features that Django uses that we we don't support --
https://www.postgresql.org/docs/10/functions-textsearch.html
https://www.postgresql.org/docs/10/datatype-textsearch.html

This includes the tsvector and regconfig types and operations on them.

  • parsers (tokenizing)
  • the rest of the token types supported by Postgres's default parser: tsearch: support all token types in text search parser #97669
  • dictionaries (stemming and stopwords - made by parameterizing a template)
  • templates (templates for dictionaries)
  • configurations (pair of parser and dictionary)
  • regconfig datatype
  • regdictionary datatype
  • tsvector datastructure
  • tsquery datastructure
  • to_tsvector
  • to_tsquery
  • phraseto_tsquery
  • plainto_tsquery
  • websearch_to_tsquery
  • tsquery_phrase
  • accelerated searches with @@
  • GIN index on tsvectors
  • weights
  • ts_rank
  • ts_rank_cd
  • strip(tsvector)
  • length(tsvector)
  • setweight(tsvector)
  • array_to_tsvector(text[])
  • numnode(tsquery)
  • querytree(tsquery)
  • ts_delete(tsvector, text)
  • ts_headline
  • ts_rewrite
  • tsvector_to_array(tsvector)
  • jsonb_to_tsvector / json_to_tsvector (sql: support jsonb_to_tsvector #109955)
  • unnest(tsvector)
  • inverted join on tsvectors

And SQL operators for tsvector:

  • @@ (matching a tsvector and tsquery)
  • tsvector || tsvector (concatenation of tsquery)
  • tsquery || tsquery (OR 2 tsqueries)
  • tsquery && tsquery (AND 2 tsqueries)
  • tsquery <-> tsquery (followedby 2 tsqueries()
  • !! tsquery (negate a tsquery)
  • tsquery @> tsquery (contains)
  • tsquery <@ tsquery (contained-by)

Jira issue: CRDB-5464
Epic: CRDB-22357

@rohany rohany added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-pgcompat Semantic compatibility with PostgreSQL labels Oct 3, 2019
@github-actions
Copy link

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@jlinder jlinder added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 16, 2021
@jordanlewis jordanlewis changed the title sql: Support advanced text search like postgres sql: support tsvector, tsquery based full text search Jul 1, 2022
craig bot pushed a commit that referenced this issue Oct 28, 2022
85185: util: add tsearch package r=jordanlewis a=jordanlewis

Updates #41288.

Broken off from #83736

This PR adds a tsearch package which contains text search algorithms
for the tsvector/tsquery full text search capability.

See https://www.postgresql.org/docs/current/datatype-textsearch.html for details.

The package can:

1. parse the tsvector language, which consists of a list of terms surrounded by single quotes, optionally suffixed with a list of positions corresponding to the term's original ordinal position within a document. Each position may optionally come with a "weight", which is a letter A-D (defaulting to D, the lowest weight) that can be used to customize how important a word is in a document.
2. parse the tsquery language, which consists of a simple expression language with terms separating by operators `!`, `&`, `|`, and `<n>` where `n` is a number or `-` which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The `<n>` operator returns true if the left argument is `n` tokens to the left of the argument on the right in the document expressed by the vector.
3. evaluate a tsquery against a tsvector, which returns a boolean indicating whether the vector satisfied the query.

So far, this package is standalone and not hooked up to SQL at all.

Release note: None
Epic: None

89650: allocatorimpl: Prioritize non-voters in voter additions r=kvoli a=KaiSun314

Fix #63810

Previously, when adding a voter via the allocator, we would not prioritize stores with non-voters when selecting candidate stores.

This was inadequate because selecting a store without a non-voter would require sending a snapshot over the WAN in order to add a voter onto that store. On the other hand, selecting a store with a non-voter would only require promoting that non-voter to a voter, no snapshot needs to be sent over the WAN.

To address this, this patch determines if candidate stores have a non-voter replica and prioritize this status when sorting candidate stores (priority lower than balance score, but higher than range count). By prioritizing selecting a store with a non-voter, we can reduce the number of snapshots that need to be sent over the WAN.

Release note (ops change): We prioritized non-voters in voter additions, meaning that when selecting a store to add a voter on (in the allocator), we would prioritize candidate stores that contain a non-voter replica higher. This allows us to reduce the number of snapshots that need to be sent over the WAN.

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Kai Sun <[email protected]>
craig bot pushed a commit that referenced this issue Dec 2, 2022
90842: sql: add tsvector and tsquery types to SQL r=jordanlewis a=jordanlewis

Updates #41288.

Broken off from #83736

This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of `@@` (the tsquery match operator) by delegating to the tsearch.eval package.

Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details.

Epic: None

92611: upgrades: make a couple of permanent upgrades idempotent r=andreimatei a=andreimatei

Some of the permanent upgrades (opting into telemetry and initializing the cluster secret) were not idempotent; this patch makes them be. In particular, this is important since these upgrades will run on 23.1 cluster that are being upgraded from 22.2 as they've been turned from startupmigrations to upgrades in #91627. For the telemetry one, there's no "natural" way to make it idempotent, so I've added a check for the old startupmigration key.

Release note: None
Epic: None

92620: upgrademanager: clarify job running code r=andreimatei a=andreimatei

This patch makes the code clearer around running upgrade in jobs. Before this patch, the upgrade manager was looking for existing jobs corresponding to upgrades and, if it found them, it called jobsRegistry.Run() on them - just like it did for newly-created jobs. The behavior of Run() for jobs that are not already claimed by the registry is a bit under-specified. I believe it ended up printing an error [1] and then correctly waiting for the job to finish.

This patch no longer calls Run() for existing jobs; instead, it calls WaitForJobs(), which is more suggestive.

[1] https://github.com/cockroachdb/cockroach/blob/b2a6b80920324bd6b31cba9a6f622961979de600/pkg/jobs/adopt.go#L255

Release note: None
Epic: None

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
craig bot pushed a commit that referenced this issue Dec 13, 2022
90657: builtins: fix pg_function_is_visible to work with UDFs r=e-mbrown a=rafiss

fixes #89546

### builtins: fix pg_function_is_visible to work with UDFs

Release note (bug fix): The pg_function_is_visible function now
correctly reports visibility based on the functions that are visible on
the current search_path.

### builtins: use pg_type to implement pg_type_is_visible

This saves us from having to maintain the old code, which I personally
found to be a bit hard to work with. Using the internal executor like
this should be faster than it used to be, since there is an index on
pg_type(oid) that will avoid any lookups for builtin types.

### sessiondata: consolidate logic for searching the search_path 

92957: sql: enable storage for tsvector/tsquery r=jordanlewis a=jordanlewis

This commit adds the ability to store tsvector and tsquery data in ordinary, unindexed columns.

Updates #41288
This functionality is gated behind the 23.1 version.

Epic: CRDB-22357

Release note (sql change): permit non-indexed storage of tsvector and tsquery datatypes

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
@exalate-issue-sync exalate-issue-sync bot assigned ghost and jordanlewis and unassigned ghost Dec 14, 2022
craig bot pushed a commit that referenced this issue Feb 28, 2023
92966: builtins: add to_tsvector {phrase,plain,}to_tsquery r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357

Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively.

Co-authored-by: Jordan Lewis <[email protected]>
craig bot pushed a commit that referenced this issue Mar 21, 2023
97677: tsearch: add stemming and stopword elimination for several languages r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357
First commit is #92966.

    This commit adds stopword elimination for text search. The languages
    supported are the same ones that Postgres does. The stopword lists were
    copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72.

    Also, add snowball stemming provided by the blevesearch snowball
    stemming library.

    Release note (sql change): add stemming and stopword eliminating text
    search configurations for English, Danish, Dutch, Finnish, French,
    German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish,
    Swedish, and Turkish.

98778: cli: unskip test tenant zip test r=dhartunian a=aadityasondhi

This patch unskips and re-records the datadriven `TestTenantZip` as it was fixed in #96553, but was not unskipped or recorded. The test was run locally using `--stress` and did not flake:
```
101 runs so far, 0 failures, over 5m0s
```

Fixes #87141

Release note: None

98830: sqlinstance: add `binary_version` column to instances table r=knz,JeffSwenson a=healthy-pod

This code change adds a `binary_version` column to the instances table.

This is done by adding the column to the bootstrap schema for system.sql_instances,
and piggy-backing on the existing code for the V23_1_SystemRbrReadNew migration
that overwrites the live schema for this table by the bootstrap copy.

This redefinition of the meaning of the V23_1_SystemRbrReadNew is backward-incompatible
and is only possible because this commit is merged before the v23.1 branch is cut.

Release note: None
Epic: [CRDB-20860](https://cockroachlabs.atlassian.net/browse/CRDB-20860)

99100: kvserver: skip `TestReplicateQueueExpirationLeasesOnly` under deadlock r=erikgrinaker a=erikgrinaker

I give up.

Resolves #99015.

Epic: none
Release note: None

99106: server: avoid some log spam r=erikgrinaker a=knz

This change removes the following log spam:
```
could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
```

Epic: CRDB-23559
Release note: None

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
craig bot pushed a commit that referenced this issue Mar 21, 2023
97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357

All but the last commit are from #92966 and #97677.


    This commit adds the default_text_search_config variable for the tsearch
    package, which allows the user to set a default configuration for the
    text search builtin functions that take configurations, such as
    to_tsvector and to_tsquery. The default for this configuration variable
    is 'english', as it is in Postgres.

    Release note (sql change): add the default_text_search_config variable
    for compatibility with the single-argument variants of the text search
    functions to_tsvector, to_tsquery, phraseto_tsquery, and
    plainto_tsquery, which use the value of default_text_search_config
    instead of expecting one to be included as in the two-argument variants.
    The default value of this setting is 'english'.

99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens

This commit sets a new 30m timeout for all disk stall roachtests. Previously,
the FUSE filesystem variants had no timeout and inherited the default 10h
timeout. The other variants had a 20m timeout, which has been observed to be
too short due to upreplication latency.

Informs #98904.
Informs #98886.
Epic: None
Release note: None


99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373

Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`.

This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns.

Co-authored-by: [email protected]

Fixes: #99000
Epic: None

Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view.

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: craig[bot] <[email protected]>
craig bot pushed a commit that referenced this issue Mar 22, 2023
97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357
All but the last commit are from #92966, #97677, and #97685

    This commit adds ts_rank, the family of builtins that allow ranking of
    text search results. The function takes a tsquery and a tsvector and
    returns a float that indicates how good the match is. The function can
    be modified by passing in a custom array of weights that matches the
    text search weights A, B, C, and D, and a bitmask that controls the
    ranking behavior in various detailed ways.

    See the excellent Postgres documentation here for details:
    https://www.postgresql.org/docs/current/textsearch-controls.html

    Release note (sql change): add the ts_rank function for ranking text
    search query results

98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens

The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two.

Epic: None
Release note: None

99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov

This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds.

It also fixes the race possible in `TestCheckConsistencyInconsistent`:
- Node 2 is corrupted.
- The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint.
- Node 1 haven't created the checkpoint, but has started doing so.
- Node 2 "fatals" (mocked out in the test) shortly after the check is complete.
- Node 1 is still creating its checkpoint, but has probably created the directory by now.
- Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range.

The test now waits for the moment when all the checkpoint are known to be fully populated.

Fixes #81819
Epic: none
Release note: none

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
@mgartner mgartner moved this to New Backlog in SQL Queries Jul 24, 2023
@giangpham712
Copy link

@fqazi unsupported function array_to_tsvector affects efcore.pg tests

@fqazi
Copy link
Collaborator

fqazi commented Aug 10, 2023

@giangpham712 Can you confirm if we need anything else here, this might be one of the easier ones to address

@jordanlewis jordanlewis removed their assignment Aug 10, 2023
@giangpham712
Copy link

@fqazi Related to full text search, we have multiple issues

Unsupported features:

array_to_tsvector
setweight
ts_delete
ts_headline
regconfig
regdictionary
length(tsvector)
numnode
querytree
ts_rank_cd
ts_rewrite
tsquery_phrase
strip
websearch_to_tsquery
unaccent(string, varchar)

Other issues:

unsupported comparison operator: to_tsvector("ContactTitle") @@ to_tsquery('owner') || to_tsquery('foo'): unsupported 
binary operator: <tsquery> || <tsquery>
unsupported comparison operator: <tsquery> && <tsquery>
unsupported comparison operator: <tsquery> @> <tsquery>
unsupported comparison operator: <tsquery> <@ <tsquery>
unsupported binary operator: <tsquery> || <tsquery>
unsupported binary operator: <tsvector> || <tsvector>
SELECT !!to_tsquery('a & b')
FROM "Customers" AS c
LIMIT 1

returns error plainto_tsquery(): text-search query doesn't contain lexemes: a

@fqazi
Copy link
Collaborator

fqazi commented Aug 14, 2023

@giangpham712 Let's skip tests with tsvector for now, let me follow up with the queries team for the issue you found

@github-project-automation github-project-automation bot moved this from Backlog to Done in SQL Queries May 21, 2024
ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024
Can't use cockroachdb because of lack of `setweight`.
See: cockroachdb/cockroach#41288
ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024
Can't use cockroachdb because of lack of `setweight`.
See: cockroachdb/cockroach#41288
ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024
Can't use cockroachdb because of lack of `setweight`.
See: cockroachdb/cockroach#41288
ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024
Can't use cockroachdb because of lack of `setweight`.
See: cockroachdb/cockroach#41288
ansg191 added a commit to ansg191/anshulg-cluster that referenced this issue Nov 19, 2024
Can't use cockroachdb because of lack of `setweight`.
See: cockroachdb/cockroach#41288
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-pgcompat Semantic compatibility with PostgreSQL A-tools-efcore C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Archived in project
Development

No branches or pull requests

7 participants