|
| 1 | +# Full-text data |
| 2 | + |
| 3 | +CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. |
| 4 | + |
| 5 | +## 1. Data Types & Indexing Strategy |
| 6 | + |
| 7 | +* By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries |
| 8 | +* To enable full‑text search, you must define a **FULLTEXT index** with an optional language **analyzer**, e.g.: |
| 9 | + |
| 10 | +```sql |
| 11 | +CREATE TABLE documents ( |
| 12 | + title TEXT, |
| 13 | + body TEXT, |
| 14 | + INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english') |
| 15 | +); |
| 16 | +``` |
| 17 | + |
| 18 | +* You may also define **composite full-text indices**, indexing multiple columns at once: |
| 19 | + |
| 20 | +```sql |
| 21 | +INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english'); |
| 22 | +``` |
| 23 | + |
| 24 | +## 2. Index Design & Custom Analyzers |
| 25 | + |
| 26 | +| Component | Purpose | |
| 27 | +| ----------------- | ---------------------------------------------------------------------------- | |
| 28 | +| **Analyzer** | Tokenizer + token filters + char filters; splits text into searchable terms. | |
| 29 | +| **Tokenizer** | Splits on whitespace/characters. | |
| 30 | +| **Token Filters** | e.g. lowercase, stemming, stop‑word removal. | |
| 31 | +| **Char Filters** | Pre-processing (e.g. stripping HTML). | |
| 32 | + |
| 33 | +CrateDB offers **built-in analyzers** for many languages (e.g. English, German, French). You can also **create custom analyzers**: |
| 34 | + |
| 35 | +```sql |
| 36 | +CREATE ANALYZER myanalyzer ( |
| 37 | + TOKENIZER whitespace, |
| 38 | + TOKEN_FILTERS (lowercase, kstem), |
| 39 | + CHAR_FILTERS (html_strip) |
| 40 | +); |
| 41 | +``` |
| 42 | + |
| 43 | +Or **extend** a built-in analyzer: |
| 44 | + |
| 45 | +```sql |
| 46 | +CREATE ANALYZER german_snowball |
| 47 | + EXTENDS snowball |
| 48 | + WITH (language = 'german'); |
| 49 | +``` |
| 50 | + |
| 51 | +## 3. Querying: MATCH Predicate & Scoring |
| 52 | + |
| 53 | +CrateDB uses the SQL `MATCH` predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score `_score`, ranked via BM25. |
| 54 | + |
| 55 | +**Basic usage:** |
| 56 | + |
| 57 | +```sql |
| 58 | +SELECT title, _score |
| 59 | +FROM documents |
| 60 | +WHERE MATCH(ft_body, 'search term') |
| 61 | +ORDER BY _score DESC; |
| 62 | +``` |
| 63 | + |
| 64 | +**Searching multiple indices with weighted ranking:** |
| 65 | + |
| 66 | +```sql |
| 67 | +MATCH((ft_title boost 2.0, ft_body), 'keyword') |
| 68 | +``` |
| 69 | + |
| 70 | +**You can configure match options like:** |
| 71 | + |
| 72 | +* `using best_fields` (default) |
| 73 | +* `fuzziness = 1` (tolerate minor typos) |
| 74 | +* `operator = 'AND'` or `OR` |
| 75 | +* `slop = N` for phrase proximity |
| 76 | + |
| 77 | +**Example: Fuzzy Search** |
| 78 | + |
| 79 | +```sql |
| 80 | +SELECT firstname, lastname, _score |
| 81 | +FROM person |
| 82 | +WHERE MATCH(lastname_ft, 'bronw') USING best_fields WITH (fuzziness = 2) |
| 83 | +ORDER BY _score DESC; |
| 84 | +``` |
| 85 | + |
| 86 | +This matches similar names like ‘brown’ or ‘browne’. |
| 87 | + |
| 88 | +**Example: Multi‑language Composite Search** |
| 89 | + |
| 90 | +```sql |
| 91 | +CREATE TABLE documents ( |
| 92 | + name STRING PRIMARY KEY, |
| 93 | + description TEXT, |
| 94 | + INDEX ft_en USING FULLTEXT(description) WITH (analyzer = 'english'), |
| 95 | + INDEX ft_de USING FULLTEXT(description) WITH (analyzer = 'german') |
| 96 | +); |
| 97 | +SELECT name, _score |
| 98 | +FROM documents |
| 99 | +WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzziness = 1) |
| 100 | +ORDER BY _score DESC; |
| 101 | +``` |
| 102 | + |
| 103 | +## 4. Use Cases & Integration |
| 104 | + |
| 105 | +CrateDB is ideal for searching **semi-structured large text data**—product catalogs, article archives, user-generated content, descriptions and logs. |
| 106 | + |
| 107 | +Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure. |
| 108 | + |
| 109 | +You can **combine full-text search with other data domains**, for example: |
| 110 | + |
| 111 | +```sql |
| 112 | +SELECT * |
| 113 | +FROM listings |
| 114 | +WHERE |
| 115 | + MATCH(ft_desc, 'garden deck') AND |
| 116 | + price < 500000 AND |
| 117 | + within(location, :polygon); |
| 118 | +``` |
| 119 | + |
| 120 | +This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one. |
| 121 | + |
| 122 | +## 5. Architectural Strengths |
| 123 | + |
| 124 | +* **Built on Lucene inverted index + BM25**, offering relevance ranking comparable to search engines. |
| 125 | +* **Scale horizontally across clusters**, while maintaining fast indexing and search even on high volume datasets. |
| 126 | +* **Integrated SQL interface**: eliminates need for separate search services like Elasticsearch or Solr. |
| 127 | + |
| 128 | +## 6. Best Practices Checklist |
| 129 | + |
| 130 | +| Topic | Recommendation | |
| 131 | +| ------------------- | ---------------------------------------------------------------------------------- | |
| 132 | +| Schema & Indexing | Define full-text indices at table creation; plain indices are insufficient. | |
| 133 | +| Language Support | Pick built-in analyzer matching your content language. | |
| 134 | +| Composite Search | Use multi-column indices to search across title/body/fields. | |
| 135 | +| Query Tuning | Configure fuzziness, operator, boost, and slop options. | |
| 136 | +| Scoring & Ranking | Use `_score` and ordering to sort by relevance. | |
| 137 | +| Real-time Updates | Full-text indices update automatically on INSERT/UPDATE. | |
| 138 | +| Multi-model Queries | Combine full-text search with geo, JSON, numerical filters. | |
| 139 | +| Analyze Limitations | Understand phrase\_prefix caveats at scale; tune analyzer/tokenizer appropriately. | |
| 140 | + |
| 141 | +## 7. Further Learning & Resources |
| 142 | + |
| 143 | +* **CrateDB Full‑Text Search Guide**: details index creation, analyzers, MATCH usage. |
| 144 | +* **FTS Options & Advanced Features**: fuzziness, synonyms, multi-language idioms. |
| 145 | +* **Hands‑On Academy Course**: explore FTS on real datasets (e.g. Chicago neighborhoods). |
| 146 | +* **CrateDB Community Insights**: real‑world advice and experiences from users. |
| 147 | + |
| 148 | +## **8. Summary** |
| 149 | + |
| 150 | +CrateDB combines powerful Lucene‑based full‑text search capabilities with SQL, making it easy to model and query textual data at scale. It supports fuzzy matching, multi-language analysis, composite indexing, and integrates fully with other data types for rich, multi-model queries. Whether you're building document search, catalog lookup, or content analytics—CrateDB offers a flexible and scalable foundation.\ |
0 commit comments