Skip to content

Commit e2b7754

Browse files
committed
Data modelling: Add new section
1 parent 5238b4b commit e2b7754

File tree

9 files changed

+1140
-0
lines changed

9 files changed

+1140
-0
lines changed

docs/modelling/fulltext.md

Lines changed: 150 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,150 @@
1+
# Full-text data
2+
3+
CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform.
4+
5+
## 1. Data Types & Indexing Strategy
6+
7+
* By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries
8+
* To enable full‑text search, you must define a **FULLTEXT index** with an optional language **analyzer**, e.g.:
9+
10+
```sql
11+
CREATE TABLE documents (
12+
title TEXT,
13+
body TEXT,
14+
INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english')
15+
);
16+
```
17+
18+
* You may also define **composite full-text indices**, indexing multiple columns at once:
19+
20+
```sql
21+
INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english');
22+
```
23+
24+
## 2. Index Design & Custom Analyzers
25+
26+
| Component | Purpose |
27+
| ----------------- | ---------------------------------------------------------------------------- |
28+
| **Analyzer** | Tokenizer + token filters + char filters; splits text into searchable terms. |
29+
| **Tokenizer** | Splits on whitespace/characters. |
30+
| **Token Filters** | e.g. lowercase, stemming, stop‑word removal. |
31+
| **Char Filters** | Pre-processing (e.g. stripping HTML). |
32+
33+
CrateDB offers **built-in analyzers** for many languages (e.g. English, German, French). You can also **create custom analyzers**:
34+
35+
```sql
36+
CREATE ANALYZER myanalyzer (
37+
TOKENIZER whitespace,
38+
TOKEN_FILTERS (lowercase, kstem),
39+
CHAR_FILTERS (html_strip)
40+
);
41+
```
42+
43+
Or **extend** a built-in analyzer:
44+
45+
```sql
46+
CREATE ANALYZER german_snowball
47+
EXTENDS snowball
48+
WITH (language = 'german');
49+
```
50+
51+
## 3. Querying: MATCH Predicate & Scoring
52+
53+
CrateDB uses the SQL `MATCH` predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score `_score`, ranked via BM25.
54+
55+
**Basic usage:**
56+
57+
```sql
58+
SELECT title, _score
59+
FROM documents
60+
WHERE MATCH(ft_body, 'search term')
61+
ORDER BY _score DESC;
62+
```
63+
64+
**Searching multiple indices with weighted ranking:**
65+
66+
```sql
67+
MATCH((ft_title boost 2.0, ft_body), 'keyword')
68+
```
69+
70+
**You can configure match options like:**
71+
72+
* `using best_fields` (default)
73+
* `fuzziness = 1` (tolerate minor typos)
74+
* `operator = 'AND'` or `OR`
75+
* `slop = N` for phrase proximity
76+
77+
**Example: Fuzzy Search**
78+
79+
```sql
80+
SELECT firstname, lastname, _score
81+
FROM person
82+
WHERE MATCH(lastname_ft, 'bronw') USING best_fields WITH (fuzziness = 2)
83+
ORDER BY _score DESC;
84+
```
85+
86+
This matches similar names like ‘brown’ or ‘browne’.
87+
88+
**Example: Multi‑language Composite Search**
89+
90+
```sql
91+
CREATE TABLE documents (
92+
name STRING PRIMARY KEY,
93+
description TEXT,
94+
INDEX ft_en USING FULLTEXT(description) WITH (analyzer = 'english'),
95+
INDEX ft_de USING FULLTEXT(description) WITH (analyzer = 'german')
96+
);
97+
SELECT name, _score
98+
FROM documents
99+
WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzziness = 1)
100+
ORDER BY _score DESC;
101+
```
102+
103+
## 4. Use Cases & Integration
104+
105+
CrateDB is ideal for searching **semi-structured large text data**—product catalogs, article archives, user-generated content, descriptions and logs.
106+
107+
Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure.
108+
109+
You can **combine full-text search with other data domains**, for example:
110+
111+
```sql
112+
SELECT *
113+
FROM listings
114+
WHERE
115+
MATCH(ft_desc, 'garden deck') AND
116+
price < 500000 AND
117+
within(location, :polygon);
118+
```
119+
120+
This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one.
121+
122+
## 5. Architectural Strengths
123+
124+
* **Built on Lucene inverted index + BM25**, offering relevance ranking comparable to search engines.
125+
* **Scale horizontally across clusters**, while maintaining fast indexing and search even on high volume datasets.
126+
* **Integrated SQL interface**: eliminates need for separate search services like Elasticsearch or Solr.
127+
128+
## 6. Best Practices Checklist
129+
130+
| Topic | Recommendation |
131+
| ------------------- | ---------------------------------------------------------------------------------- |
132+
| Schema & Indexing | Define full-text indices at table creation; plain indices are insufficient. |
133+
| Language Support | Pick built-in analyzer matching your content language. |
134+
| Composite Search | Use multi-column indices to search across title/body/fields. |
135+
| Query Tuning | Configure fuzziness, operator, boost, and slop options. |
136+
| Scoring & Ranking | Use `_score` and ordering to sort by relevance. |
137+
| Real-time Updates | Full-text indices update automatically on INSERT/UPDATE. |
138+
| Multi-model Queries | Combine full-text search with geo, JSON, numerical filters. |
139+
| Analyze Limitations | Understand phrase\_prefix caveats at scale; tune analyzer/tokenizer appropriately. |
140+
141+
## 7. Further Learning & Resources
142+
143+
* **CrateDB Full‑Text Search Guide**: details index creation, analyzers, MATCH usage.
144+
* **FTS Options & Advanced Features**: fuzziness, synonyms, multi-language idioms.
145+
* **Hands‑On Academy Course**: explore FTS on real datasets (e.g. Chicago neighborhoods).
146+
* **CrateDB Community Insights**: real‑world advice and experiences from users.
147+
148+
## **8. Summary**
149+
150+
CrateDB combines powerful Lucene‑based full‑text search capabilities with SQL, making it easy to model and query textual data at scale. It supports fuzzy matching, multi-language analysis, composite indexing, and integrates fully with other data types for rich, multi-model queries. Whether you're building document search, catalog lookup, or content analytics—CrateDB offers a flexible and scalable foundation.\

docs/modelling/geospatial.md

Lines changed: 101 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,101 @@
1+
# Geospatial data
2+
3+
CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries.
4+
5+
## 1. Geospatial Data Types
6+
7+
### **GEO\_POINT**
8+
9+
* Stores a single location via latitude/longitude.
10+
* Insert using either a coordinate array `[lon, lat]` or WKT string `'POINT (lon lat)'`.
11+
* Must be declared explicitly; dynamic schema inference will not detect geo\_point type.
12+
13+
### **GEO\_SHAPE**
14+
15+
* Supports complex geometries (Point, LineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT.
16+
* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold
17+
18+
## 2. Table Schema Example
19+
20+
<pre class="language-sql"><code class="lang-sql"><strong>CREATE TABLE parcel_zones (
21+
</strong> zone_id INTEGER PRIMARY KEY,
22+
<strong> name VARCHAR,
23+
</strong> area GEO_SHAPE,
24+
centroid GEO_POINT
25+
)
26+
WITH (column_policy = 'dynamic');
27+
</code></pre>
28+
29+
* Use `GEO_SHAPE` to define zones or service areas.
30+
* `GEO_POINT` allows for simple referencing (e.g. store approximate center of zone).
31+
32+
## 3. Core Geospatial Functions
33+
34+
CrateDB provides key scalar functions for spatial operations:
35+
36+
* **`distance(geo_point1, geo_point2)`** – returns meters using the Haversine formula (e.g. compute distance between two points)
37+
* **`within(shape1, shape2)`** – true if one geo object is fully contained within another
38+
* **`intersects(shape1, shape2)`** – true if shapes overlap or touch anywhere
39+
* **`latitude(geo_point)` / `longitude(geo_point)`** – extract individual coordinates
40+
* **`geohash(geo_point)`** – compute a 12‑character geohash for the point
41+
* **`area(geo_shape)`** – returns approximate area in square degrees; uses geodetic awareness
42+
43+
Note: More precise relational operations on shapes may bypass indexes and can be slower.
44+
45+
## 4. Spatial Queries & Indexing
46+
47+
CrateDB supports Lucene-based spatial indexing (Prefix Tree and BKD-tree structures) for efficient geospatial search. Use the `MATCH` predicate to leverage indices when filtering spatial data by bounding boxes, circles, polygons, etc.
48+
49+
**Example: Find nearby assets**
50+
51+
```sql
52+
SELECT asset_id, DISTANCE(center_point, asset_location) AS dist
53+
FROM assets
54+
WHERE center_point = 'POINT(-1.234 51.050)'::GEO_POINT
55+
ORDER BY dist
56+
LIMIT 10;
57+
```
58+
59+
**Example: Count incidents within service area**
60+
61+
```sql
62+
SELECT area_id, count(*) AS incident_count
63+
FROM incidents
64+
WHERE within(incidents.location, service_areas.area)
65+
GROUP BY area_id;
66+
```
67+
68+
**Example: Which zones intersect a flight path**
69+
70+
```sql
71+
SELECT zone_id, name
72+
FROM flight_paths f
73+
JOIN service_zones z
74+
ON intersects(f.path_geom, z.area);
75+
```
76+
77+
## 5. Real-World Examples: Chicago Use Cases
78+
79+
* **311 calls**: Each record includes `location` as `GEO_POINT`. Queries use `within()` to find calls near a polygon around O’Hare airport.
80+
* **Community areas**: Polygon boundaries stored in `GEO_SHAPE`. Queries for intersections with arbitrary lines or polygons using `intersects()` return overlapping zones.
81+
* **Taxi rides**: Pickup/drop off locations stored as geo points. Use `distance()` filter to compute trip distances and aggregate.
82+
83+
## 6. Architectural Strengths & Suitability
84+
85+
* Designed for **real-time geospatial tracking and analytics** (e.g. fleet tracking, mapping, location-layered apps).
86+
* **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query.
87+
* **High ingest and query throughput**, suitable for large-scale location-based workloads
88+
89+
## 7. Best Practices Checklist
90+
91+
<table><thead><tr><th>Topic</th><th width="254">Recommendation</th></tr></thead><tbody><tr><td>Data types</td><td>Declare <code>GEO_POINT</code>/<code>GEO_SHAPE</code> explicitly</td></tr><tr><td>Geometric formats</td><td>Use WKT or GeoJSON for insertions</td></tr><tr><td>Index tuning</td><td>Choose geohash/quadtree/BKD tree &#x26; adjust precision</td></tr><tr><td>Queries</td><td>Prefer <code>MATCH</code> for indexed filtering; use functions for precise checks</td></tr><tr><td>Joins &#x26; spatial filters</td><td>Use within/intersects to correlate spatial entities</td></tr><tr><td>Scale &#x26; performance</td><td>Index shapes, use distance/wwithin filters early</td></tr><tr><td>Mixed-model integration</td><td>Combine spatial with JSON, full-text, vector, time-series</td></tr></tbody></table>
92+
93+
## 8. Further Learning & Resources
94+
95+
* Official **Geospatial Search Guide** in CrateDB docs, detailing geospatial types, indexing, and MATCH predicate usage.
96+
* CrateDB Academy **Hands-on: Geospatial Data** modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries.
97+
* CrateDB Blog: **Geospatial Queries with CrateDB** – outlines capabilities, limitations, and practical use cases (available since version 0.40
98+
99+
## 9. Summary
100+
101+
CrateDB provides robust support for geospatial modeling through clearly defined data types (`GEO_POINT`, `GEO_SHAPE`), powerful scalar functions (`distance`, `within`, `intersects`, `area`), and Lucene‑based indexing for fast queries. It excels in high‑volume, real‑time spatial analytics and integrates smoothly with multi-model use cases. Whether storing vehicle positions, mapping regions, or enabling spatial joins—CrateDB’s geospatial layer makes it easy, scalable, and extensible.

docs/modelling/index.md

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
# Data modelling
2+
3+
CrateDB provides a unified storage engine that supports different data types.
4+
```{toctree}
5+
:maxdepth: 1
6+
7+
relational
8+
json
9+
timeseries
10+
geospatial
11+
fulltext
12+
vector
13+
```
14+
15+
Because CrateDB is a distributed OLAP database designed store large volumes
16+
of data, it needs a few special considerations on certain details.
17+
```{toctree}
18+
:maxdepth: 1
19+
20+
primary-key
21+
```

0 commit comments

Comments
 (0)