From dd62b9b730ad144699a6a3a2462d7c45c05a7b33 Mon Sep 17 00:00:00 2001 From: Stephane Castellani Date: Fri, 8 Aug 2025 01:42:43 +0200 Subject: [PATCH 01/58] Data modelling: Add new section --- docs/start/index.md | 1 + docs/start/modelling/fulltext.md | 150 ++++++++++++++++++ docs/start/modelling/geospatial.md | 101 +++++++++++++ docs/start/modelling/index.md | 21 +++ docs/start/modelling/json.md | 227 ++++++++++++++++++++++++++++ docs/start/modelling/primary-key.md | 174 +++++++++++++++++++++ docs/start/modelling/relational.md | 178 ++++++++++++++++++++++ docs/start/modelling/timeseries.md | 137 +++++++++++++++++ docs/start/modelling/vector.md | 151 ++++++++++++++++++ 9 files changed, 1140 insertions(+) create mode 100644 docs/start/modelling/fulltext.md create mode 100644 docs/start/modelling/geospatial.md create mode 100644 docs/start/modelling/index.md create mode 100644 docs/start/modelling/json.md create mode 100644 docs/start/modelling/primary-key.md create mode 100644 docs/start/modelling/relational.md create mode 100644 docs/start/modelling/timeseries.md create mode 100644 docs/start/modelling/vector.md diff --git a/docs/start/index.md b/docs/start/index.md index 896d9fbf..520f45e0 100644 --- a/docs/start/index.md +++ b/docs/start/index.md @@ -110,6 +110,7 @@ and explore key features. first-steps connect query/index +modelling/index Ingesting data <../ingest/index> application/index going-further diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md new file mode 100644 index 00000000..41342afc --- /dev/null +++ b/docs/start/modelling/fulltext.md @@ -0,0 +1,150 @@ +# Full-text data + +CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. + +## 1. Data Types & Indexing Strategy + +* By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries +* To enable full‑text search, you must define a **FULLTEXT index** with an optional language **analyzer**, e.g.: + +```sql +CREATE TABLE documents ( + title TEXT, + body TEXT, + INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english') +); +``` + +* You may also define **composite full-text indices**, indexing multiple columns at once: + +```sql +INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english'); +``` + +## 2. Index Design & Custom Analyzers + +| Component | Purpose | +| ----------------- | ---------------------------------------------------------------------------- | +| **Analyzer** | Tokenizer + token filters + char filters; splits text into searchable terms. | +| **Tokenizer** | Splits on whitespace/characters. | +| **Token Filters** | e.g. lowercase, stemming, stop‑word removal. | +| **Char Filters** | Pre-processing (e.g. stripping HTML). | + +CrateDB offers **built-in analyzers** for many languages (e.g. English, German, French). You can also **create custom analyzers**: + +```sql +CREATE ANALYZER myanalyzer ( + TOKENIZER whitespace, + TOKEN_FILTERS (lowercase, kstem), + CHAR_FILTERS (html_strip) +); +``` + +Or **extend** a built-in analyzer: + +```sql +CREATE ANALYZER german_snowball + EXTENDS snowball + WITH (language = 'german'); +``` + +## 3. Querying: MATCH Predicate & Scoring + +CrateDB uses the SQL `MATCH` predicate to run full‑text queries against full‑text indices. 
It optionally returns a relevance score `_score`, ranked via BM25. + +**Basic usage:** + +```sql +SELECT title, _score +FROM documents +WHERE MATCH(ft_body, 'search term') +ORDER BY _score DESC; +``` + +**Searching multiple indices with weighted ranking:** + +```sql +MATCH((ft_title boost 2.0, ft_body), 'keyword') +``` + +**You can configure match options like:** + +* `using best_fields` (default) +* `fuzziness = 1` (tolerate minor typos) +* `operator = 'AND'` or `OR` +* `slop = N` for phrase proximity + +**Example: Fuzzy Search** + +```sql +SELECT firstname, lastname, _score +FROM person +WHERE MATCH(lastname_ft, 'bronw') USING best_fields WITH (fuzziness = 2) +ORDER BY _score DESC; +``` + +This matches similar names like ‘brown’ or ‘browne’. + +**Example: Multi‑language Composite Search** + +```sql +CREATE TABLE documents ( + name STRING PRIMARY KEY, + description TEXT, + INDEX ft_en USING FULLTEXT(description) WITH (analyzer = 'english'), + INDEX ft_de USING FULLTEXT(description) WITH (analyzer = 'german') +); +SELECT name, _score +FROM documents +WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzziness = 1) +ORDER BY _score DESC; +``` + +## 4. Use Cases & Integration + +CrateDB is ideal for searching **semi-structured large text data**—product catalogs, article archives, user-generated content, descriptions and logs. + +Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure. + +You can **combine full-text search with other data domains**, for example: + +```sql +SELECT * +FROM listings +WHERE + MATCH(ft_desc, 'garden deck') AND + price < 500000 AND + within(location, :polygon); +``` + +This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one. + +## 5. Architectural Strengths + +* **Built on Lucene inverted index + BM25**, offering relevance ranking comparable to search engines. +* **Scale horizontally across clusters**, while maintaining fast indexing and search even on high volume datasets. +* **Integrated SQL interface**: eliminates need for separate search services like Elasticsearch or Solr. + +## 6. Best Practices Checklist + +| Topic | Recommendation | +| ------------------- | ---------------------------------------------------------------------------------- | +| Schema & Indexing | Define full-text indices at table creation; plain indices are insufficient. | +| Language Support | Pick built-in analyzer matching your content language. | +| Composite Search | Use multi-column indices to search across title/body/fields. | +| Query Tuning | Configure fuzziness, operator, boost, and slop options. | +| Scoring & Ranking | Use `_score` and ordering to sort by relevance. | +| Real-time Updates | Full-text indices update automatically on INSERT/UPDATE. | +| Multi-model Queries | Combine full-text search with geo, JSON, numerical filters. | +| Analyze Limitations | Understand phrase\_prefix caveats at scale; tune analyzer/tokenizer appropriately. | + +## 7. Further Learning & Resources + +* **CrateDB Full‑Text Search Guide**: details index creation, analyzers, MATCH usage. +* **FTS Options & Advanced Features**: fuzziness, synonyms, multi-language idioms. +* **Hands‑On Academy Course**: explore FTS on real datasets (e.g. Chicago neighborhoods). +* **CrateDB Community Insights**: real‑world advice and experiences from users. + +## **8. 
Summary** + +CrateDB combines powerful Lucene‑based full‑text search capabilities with SQL, making it easy to model and query textual data at scale. It supports fuzzy matching, multi-language analysis, composite indexing, and integrates fully with other data types for rich, multi-model queries. Whether you're building document search, catalog lookup, or content analytics—CrateDB offers a flexible and scalable foundation.\ diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md new file mode 100644 index 00000000..7ddbc982 --- /dev/null +++ b/docs/start/modelling/geospatial.md @@ -0,0 +1,101 @@ +# Geospatial data + +CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. + +## 1. Geospatial Data Types + +### **GEO\_POINT** + +* Stores a single location via latitude/longitude. +* Insert using either a coordinate array `[lon, lat]` or WKT string `'POINT (lon lat)'`. +* Must be declared explicitly; dynamic schema inference will not detect geo\_point type. + +### **GEO\_SHAPE** + +* Supports complex geometries (Point, LineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT. +* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold + +## 2. Table Schema Example + +
+```sql
+CREATE TABLE parcel_zones (
+    zone_id INTEGER PRIMARY KEY,
+    name VARCHAR,
+    area GEO_SHAPE,
+    centroid GEO_POINT
+)
+WITH (column_policy = 'dynamic');
+```
+
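+The schema above can be populated using WKT strings or GeoJSON for `GEO_SHAPE` values and `[lon, lat]` arrays for `GEO_POINT` values. A minimal sketch, with made-up zone data:
+
+```sql
+INSERT INTO parcel_zones (zone_id, name, area, centroid)
+VALUES (
+    1,
+    'Riverside',
+    -- WKT polygon describing the zone boundary
+    'POLYGON ((13.43 52.52, 13.44 52.52, 13.44 52.53, 13.43 52.53, 13.43 52.52))',
+    -- [lon, lat] array for the approximate center
+    [13.435, 52.525]
+);
+```
+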
+ +* Use `GEO_SHAPE` to define zones or service areas. +* `GEO_POINT` allows for simple referencing (e.g. store approximate center of zone). + +## 3. Core Geospatial Functions + +CrateDB provides key scalar functions for spatial operations: + +* **`distance(geo_point1, geo_point2)`** – returns meters using the Haversine formula (e.g. compute distance between two points) +* **`within(shape1, shape2)`** – true if one geo object is fully contained within another +* **`intersects(shape1, shape2)`** – true if shapes overlap or touch anywhere +* **`latitude(geo_point)` / `longitude(geo_point)`** – extract individual coordinates +* **`geohash(geo_point)`** – compute a 12‑character geohash for the point +* **`area(geo_shape)`** – returns approximate area in square degrees; uses geodetic awareness + +Note: More precise relational operations on shapes may bypass indexes and can be slower. + +## 4. Spatial Queries & Indexing + +CrateDB supports Lucene-based spatial indexing (Prefix Tree and BKD-tree structures) for efficient geospatial search. Use the `MATCH` predicate to leverage indices when filtering spatial data by bounding boxes, circles, polygons, etc. + +**Example: Find nearby assets** + +```sql +SELECT asset_id, DISTANCE(center_point, asset_location) AS dist +FROM assets +WHERE center_point = 'POINT(-1.234 51.050)'::GEO_POINT +ORDER BY dist +LIMIT 10; +``` + +**Example: Count incidents within service area** + +```sql +SELECT area_id, count(*) AS incident_count +FROM incidents +WHERE within(incidents.location, service_areas.area) +GROUP BY area_id; +``` + +**Example: Which zones intersect a flight path** + +```sql +SELECT zone_id, name +FROM flight_paths f +JOIN service_zones z +ON intersects(f.path_geom, z.area); +``` + +## 5. Real-World Examples: Chicago Use Cases + +* **311 calls**: Each record includes `location` as `GEO_POINT`. Queries use `within()` to find calls near a polygon around O’Hare airport. +* **Community areas**: Polygon boundaries stored in `GEO_SHAPE`. Queries for intersections with arbitrary lines or polygons using `intersects()` return overlapping zones. +* **Taxi rides**: Pickup/drop off locations stored as geo points. Use `distance()` filter to compute trip distances and aggregate. + +## 6. Architectural Strengths & Suitability + +* Designed for **real-time geospatial tracking and analytics** (e.g. fleet tracking, mapping, location-layered apps). +* **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query. +* **High ingest and query throughput**, suitable for large-scale location-based workloads + +## 7. Best Practices Checklist + +
+| Topic | Recommendation |
+| ----------------------- | --------------------------------------------------------------------- |
+| Data types | Declare GEO_POINT/GEO_SHAPE explicitly |
+| Geometric formats | Use WKT or GeoJSON for insertions |
+| Index tuning | Choose geohash/quadtree/BKD tree & adjust precision |
+| Queries | Prefer MATCH for indexed filtering; use functions for precise checks |
+| Joins & spatial filters | Use within/intersects to correlate spatial entities |
+| Scale & performance | Index shapes, use distance/within filters early |
+| Mixed-model integration | Combine spatial with JSON, full-text, vector, time-series |
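+
+As recommended in the checklist, indexed spatial filtering can go through the `MATCH` predicate on a `GEO_SHAPE` column. A minimal sketch against the `parcel_zones` table defined earlier; the polygon literal and the `intersects` match type are illustrative assumptions:
+
+```sql
+SELECT zone_id, name
+FROM parcel_zones
+WHERE MATCH (area, 'POLYGON ((13.40 52.50, 13.50 52.50, 13.50 52.60, 13.40 52.60, 13.40 52.50))')
+      USING intersects;
+```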
+ +## 8. Further Learning & Resources + +* Official **Geospatial Search Guide** in CrateDB docs, detailing geospatial types, indexing, and MATCH predicate usage. +* CrateDB Academy **Hands-on: Geospatial Data** modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries. +* CrateDB Blog: **Geospatial Queries with CrateDB** – outlines capabilities, limitations, and practical use cases (available since version 0.40 + +## 9. Summary + +CrateDB provides robust support for geospatial modeling through clearly defined data types (`GEO_POINT`, `GEO_SHAPE`), powerful scalar functions (`distance`, `within`, `intersects`, `area`), and Lucene‑based indexing for fast queries. It excels in high‑volume, real‑time spatial analytics and integrates smoothly with multi-model use cases. Whether storing vehicle positions, mapping regions, or enabling spatial joins—CrateDB’s geospatial layer makes it easy, scalable, and extensible. diff --git a/docs/start/modelling/index.md b/docs/start/modelling/index.md new file mode 100644 index 00000000..a198efe1 --- /dev/null +++ b/docs/start/modelling/index.md @@ -0,0 +1,21 @@ +# Data modelling + +CrateDB provides a unified storage engine that supports different data types. +```{toctree} +:maxdepth: 1 + +relational +json +timeseries +geospatial +fulltext +vector +``` + +Because CrateDB is a distributed OLAP database designed store large volumes +of data, it needs a few special considerations on certain details. +```{toctree} +:maxdepth: 1 + +primary-key +``` diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md new file mode 100644 index 00000000..78583efa --- /dev/null +++ b/docs/start/modelling/json.md @@ -0,0 +1,227 @@ +# JSON data + +CrateDB combines the flexibility of NoSQL document stores with the power of SQL. It enables you to store, query, and index **semi-structured JSON data** using **standard SQL**, making it an excellent choice for applications that handle diverse or evolving schemas. + +CrateDB’s support for dynamic objects, nested structures, and dot-notation querying brings the best of both relational and document-based data modeling—without leaving the SQL world. + +## 1. Object (JSON) Columns + +CrateDB allows you to define **object columns** that can store JSON-style data structures. + +```sql +CREATE TABLE events ( + id UUID PRIMARY KEY, + timestamp TIMESTAMP, + payload OBJECT(DYNAMIC) +); +``` + +This allows inserting flexible, nested JSON data into `payload`: + +```json +{ + "user": { + "id": 42, + "name": "Alice" + }, + "action": "login", + "device": { + "type": "mobile", + "os": "iOS" + } +} +``` + +## 2. Column Policy: Strict vs Dynamic + +You can control how CrateDB handles unexpected fields in an object column: + +| Column Policy | Behavior | +| ------------- | ----------------------------------------------------------- | +| `DYNAMIC` | New fields are automatically added to the schema at runtime | +| `STRICT` | Only explicitly defined fields are allowed | +| `IGNORED` | Extra fields are stored but not indexed or queryable | + +Example with explicitly defined fields: + +```sql +CREATE TABLE sensor_data ( + id UUID PRIMARY KEY, + attributes OBJECT(STRICT) AS ( + temperature DOUBLE, + humidity DOUBLE + ) +); +``` + +## 3. 
Querying JSON Fields + +Use **dot notation** to access nested fields: + +```sql +SELECT payload['user']['name'], payload['device']['os'] +FROM events +WHERE payload['action'] = 'login'; +``` + +CrateDB also supports **filtering, sorting, and aggregations** on nested values: + +```sql +SELECT COUNT(*) +FROM events +WHERE payload['device']['os'] = 'Android'; +``` + +:::{note} +Dot-notation works for both explicitly and dynamically added fields. +::: + +## 4. Querying DYNAMIC OBJECTs + +To support querying DYNAMIC OBJECTs using SQL, where keys may not exist within an OBJECT, CrateDB provides the [error\_on\_unknown\_object\_key](https://cratedb.com/docs/crate/reference/en/latest/config/session.html#conf-session-error-on-unknown-object-key) session setting. It controls the behaviour when querying unknown object keys to dynamic objects. + +By default, CrateDB will raise an error if any of the queried object keys are unknown. When adjusting this setting to `false`, it will return `NULL` as the value of the corresponding key. + +```sql +cr> CREATE TABLE testdrive (item OBJECT(DYNAMIC)); +CREATE OK, 1 row affected (0.563 sec) + +cr> SELECT item['unknown'] FROM testdrive; +ColumnUnknownException[Column item['unknown'] unknown] + +cr> SET error_on_unknown_object_key = false; +SET OK, 0 rows affected (0.001 sec) + +cr> SELECT item['unknown'] FROM testdrive; ++-----------------+ +| item['unknown'] | ++-----------------+ ++-----------------+ +SELECT 0 rows in set (0.051 sec) +``` + +## 5. Arrays of Objects + +Store arrays of objects for multi-valued nested data: + +```sql +CREATE TABLE products ( + id UUID PRIMARY KEY, + name TEXT, + tags ARRAY(TEXT), + specs ARRAY(OBJECT AS ( + name TEXT, + value TEXT + )) +); +``` + +Query nested arrays with filters: + +```sql +SELECT * +FROM products +WHERE 'outdoor' = ANY(tags); +``` + +You can also filter by object array fields: + +```sql +SELECT * +FROM products +WHERE specs['name'] = 'battery' AND specs['value'] = 'AA'; +``` + +## 6. Combining Structured & Semi-Structured Data + +CrateDB supports **hybrid schemas**, mixing standard columns with JSON fields: + +```sql +CREATE TABLE logs ( + id UUID PRIMARY KEY, + service TEXT, + log_level TEXT, + metadata OBJECT(DYNAMIC), + created_at TIMESTAMP +); +``` + +This allows you to: + +* Query by fixed attributes (`log_level`) +* Flexibly store structured or unstructured metadata +* Add new fields on the fly without migrations + +## 7. Indexing Behavior + +CrateDB **automatically indexes** object fields if: + +* Column policy is `DYNAMIC` +* Field type can be inferred at insert time + +You can also explicitly define and index object fields: + +```sql +CREATE TABLE metrics ( + id UUID PRIMARY KEY, + data OBJECT(DYNAMIC) AS ( + cpu DOUBLE INDEX USING FULLTEXT, + memory DOUBLE + ) +); +``` + +To exclude fields from indexing, set: + +```sql +data['some_field'] INDEX OFF +``` + +:::{note} +Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. +::: + +## 8. Aggregating JSON Fields + +CrateDB allows full SQL-style aggregations on nested fields: + +```sql +SELECT AVG(payload['temperature']) AS avg_temp +FROM sensor_readings +WHERE payload['location'] = 'room1'; +``` + +CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. + +## 9. 
Use Cases for JSON Modeling + +| Use Case | Description | +| ------------------ | -------------------------------------------- | +| Logs & Traces | Unstructured payloads with flexible metadata | +| Sensor & IoT Data | Variable field schemas, nested measurements | +| Product Catalogs | Specs, tags, reviews in varying formats | +| User Profiles | Custom settings, device info, preferences | +| Telemetry / Events | Event streams with evolving structure | + +## 10. Best Practices + +| Area | Recommendation | +| ---------------- | -------------------------------------------------------------------- | +| Schema Evolution | Use `DYNAMIC` for flexibility, `STRICT` for control | +| Index Management | Avoid over-indexing rarely used fields | +| Nested Depth | Prefer flat structures or shallow nesting for performance | +| Column Mixing | Combine structured columns with JSON for hybrid models | +| Observability | Monitor number of dynamic columns using `information_schema.columns` | + +## 11. Further Learning & Resources + +* CrateDB Docs – Object Columns +* Working with JSON in CrateDB +* CrateDB Academy – Modeling with JSON +* Understanding Column Policies + +## 12. Summary + +CrateDB makes it easy to model **semi-structured JSON data** with full SQL support. Whether you're building a telemetry pipeline, an event store, or a product catalog, CrateDB offers the flexibility of a document store—while preserving the structure, indexing, and power of a relational engine. + +You don’t need to choose between JSON and SQL—**CrateDB gives you both.** diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md new file mode 100644 index 00000000..a181e930 --- /dev/null +++ b/docs/start/modelling/primary-key.md @@ -0,0 +1,174 @@ +# Primary key strategies + +CrateDB is built for horizontal scalability and high ingestion throughput. To achieve this, operations must complete independently on each node—without central coordination. This design choice means CrateDB does **not** support traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL or MySQL by default. + +This page explains why that is and walks you through **five common alternatives** to generate unique primary key values in CrateDB, including a recipe to implement your own auto-incrementing sequence mechanism when needed. + +## Why Auto-Increment Doesn't Exist in CrateDB + +In traditional RDBMS systems, auto-increment fields rely on a central counter. In a distributed system like CrateDB, this would create a **global coordination bottleneck**, limiting insert throughput and reducing scalability. + +Instead, CrateDB provides **flexibility**: you can choose a primary key strategy tailored to your use case, whether for strict uniqueness, time ordering, or external system integration. + +## Primary Key Strategies in CrateDB + +### 1. Use a Timestamp as a Primary Key + +```sql +BIGINT DEFAULT now() PRIMARY KEY +``` + +**Pros** + +* Auto-generated, always-increasing value +* Useful when records are timestamped anyway + +**Cons** + +* Can result in gaps +* Collisions possible if multiple records are created in the same millisecond + +### 2. 
Use UUIDs (v4) + +```sql +TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY +``` + +**Pros** + +* Universally unique +* No conflicts when merging from multiple environments or sources + +**Cons** + +* Not ordered +* Harder to read/debug +* No efficient range queries + +### Use UUIDv7 for Time-Ordered IDs + +UUIDv7 is a new format that preserves **temporal ordering**, making them better suited for distributed inserts and range queries. + +You can use UUIDv7 in CrateDB via a **User-Defined Function (UDF)**, based on your preferred language. + +**Pros** + +* Globally unique and **almost sequential** +* Range queries possible + +**Cons** + +* Not human-friendly +* Slight overhead due to UDF use + +### 4. Use External System IDs + +If you're ingesting data from a source system that **already generates unique IDs**, you can reuse those: + +* No need for CrateDB to generate anything +* Ensures consistency across systems + +> See Replicating data from other databases to CrateDB with Debezium and Kafka for an example. + +### 5. Implement a Custom Sequence Table + +If you **must** have an auto-incrementing numeric ID (e.g., for compatibility or legacy reasons), you can implement a simple sequence generator using a dedicated table and client-side logic. + +**Step 1: Create a sequence tracking table** + +```sql +CREATE TABLE sequences ( + name TEXT PRIMARY KEY, + last_value BIGINT +) CLUSTERED INTO 1 SHARDS; +``` + +**Step 2: Initialize your sequence** + +```sql +INSERT INTO sequences (name, last_value) +VALUES ('mysequence', 0); +``` + +**Step 3: Create a target table** + +```sql +CREATE TABLE mytable ( + id BIGINT PRIMARY KEY, + field1 TEXT +); +``` + +**Step 4: Generate and use sequence values in Python** + +Use optimistic concurrency control to generate unique, incrementing values even in parallel ingestion scenarios: + +```python +# Requires: records, sqlalchemy-cratedb +import time +import records + +db = records.Database("crate://") +sequence_name = "mysequence" + +max_retries = 5 +base_delay = 0.1 # 100 milliseconds + +for attempt in range(max_retries): + select_query = """ + SELECT last_value, _seq_no, _primary_term + FROM sequences + WHERE name = :sequence_name; + """ + row = db.query(select_query, sequence_name=sequence_name).first() + new_value = row.last_value + 1 + + update_query = """ + UPDATE sequences + SET last_value = :new_value + WHERE name = :sequence_name + AND _seq_no = :seq_no + AND _primary_term = :primary_term + RETURNING last_value; + """ + result = db.query( + update_query, + new_value=new_value, + sequence_name=sequence_name, + seq_no=row._seq_no, + primary_term=row._primary_term + ).all() + + if result: + break + + delay = base_delay * (2**attempt) + print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...") + time.sleep(delay) +else: + raise Exception("Failed to acquire sequence after multiple retries.") + +insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)" +db.query(insert_query, id=new_value, field1="abc") +db.close() +``` + +**Pros** + +* Fully customizable (you can add prefixes, adjust increment size, etc.) 
+* Sequential IDs possible + +**Cons** + +* More complex client logic required +* The sequence table may become a bottleneck at very high ingestion rates + +## Summary + +| Strategy | Ordered | Unique | Scalable | Human-Friendly | Range Queries | Notes | +| ------------------- | ------- | ------ | -------- | -------------- | ------------- | -------------------- | +| Timestamp | ✅ | ⚠️ | ✅ | ✅ | ✅ | Potential collisions | +| UUID (v4) | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs | +| UUIDv7 | ✅ | ✅ | ✅ | ❌ | ✅ | Requires UDF | +| External System IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source | +| Sequence Table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic | diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md new file mode 100644 index 00000000..bdaa4245 --- /dev/null +++ b/docs/start/modelling/relational.md @@ -0,0 +1,178 @@ +# Relational data + +CrateDB is a **distributed SQL database** that offers full **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary and foreign keys**, **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector, and semi-structured data. + +Use CrateDB when you need to scale relational workloads horizontally while keeping the simplicity of **ANSI SQL**. + +## 1. Table Definitions + +CrateDB supports strongly typed relational schemas using familiar SQL syntax: + +```sql +CREATE TABLE customers ( + id UUID PRIMARY KEY, + name TEXT, + email TEXT, + created_at TIMESTAMP +); + +CREATE TABLE orders ( + order_id UUID PRIMARY KEY, + customer_id UUID, + total_amount DOUBLE, + created_at TIMESTAMP +); +``` + +**Key Features:** + +* Supports scalar types (`TEXT`, `INTEGER`, `DOUBLE`, `BOOLEAN`, `TIMESTAMP`, etc.) +* `UUID` recommended for primary keys in distributed environments +* Default **replication**, **sharding**, and **partitioning** options are built-in for scale + +:::{note} +CrateDB supports `column_policy = 'dynamic'` if you want to mix relational and semi-structured models (like JSON) in the same table. +::: + +## 2. Joins & Relationships + +CrateDB supports **inner joins**, **left/right joins**, **cross joins**, and even **self joins**. + +**Example: Join Customers and Orders** + +```sql +SELECT c.name, o.order_id, o.total_amount +FROM customers c +JOIN orders o ON c.id = o.customer_id +WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'; +``` + +Joins are executed efficiently across shards in a **distributed query planner** that parallelizes execution. + +## 3. Normalization vs. Embedding + +CrateDB supports both **normalized** (relational) and **denormalized** (embedded JSON) approaches. + +* For strict referential integrity and modularity: use normalized tables with joins. +* For performance in high-ingest or read-optimized workloads: embed reference data as nested JSON. + +Example: Embedded products inside an `orders` table: + +```sql +CREATE TABLE orders ( + order_id UUID PRIMARY KEY, + customer_id UUID, + items ARRAY(OBJECT ( + name TEXT, + quantity INTEGER, + price DOUBLE + )), + created_at TIMESTAMP +); +``` + +:::{note} +CrateDB lets you **query nested fields** directly using dot notation: `items['name']`, `items['price']`, etc. +::: + +## 4. Aggregations & Grouping + +Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS`, and even `FILTER`. 
+ +```sql +SELECT customer_id, COUNT(*) AS num_orders, SUM(total_amount) AS revenue +FROM orders +GROUP BY customer_id +HAVING revenue > 1000; +``` + +:::{note} +CrateDB's **columnar storage** optimizes performance for aggregations—even on large datasets. +::: + +## 5. Constraints & Indexing + +CrateDB supports: + +* **Primary Keys** – enforced for uniqueness and data distribution +* **Unique Constraints** – optional, enforced locally +* **Check Constraints** – for value validation +* **Indexes** – automatic for primary keys and full-text fields; manual for others + +```sql +CREATE TABLE products ( + id UUID PRIMARY KEY, + name TEXT, + price DOUBLE CHECK (price >= 0) +); +``` + +:::{note} +Foreign key constraints are not strictly enforced at write time but can be modeled at the application or query layer. +::: + +## 6. Views & Subqueries + +CrateDB supports **views**, **CTEs**, and **nested subqueries**. + +**Example: Reusable View** + +```sql +CREATE VIEW recent_orders AS +SELECT * FROM orders +WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'; +``` + +**Example: Correlated Subquery** + +```sql +SELECT name, + (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count +FROM customers c; +``` + +## 7. Use Cases for Relational Modeling + +| Use Case | Description | +| -------------------- | ------------------------------------------------ | +| Customer & Orders | Classic normalized setup with joins and filters | +| Inventory Management | Products, stock levels, locations | +| Financial Systems | Transactions, balances, audit logs | +| User Profiles | Users, preferences, activity logs | +| Multi-tenant Systems | Use schemas or partitioning for tenant isolation | + +## 8. Scalability & Distribution + +CrateDB automatically shards tables across nodes, distributing both **data and query processing**. + +* Tables can be **sharded and replicated** for fault tolerance +* Use **partitioning** for time-series or tenant-based scaling +* SQL queries are transparently **parallelized across the cluster** + +:::{note} +Use `CLUSTERED BY` and `PARTITIONED BY` in `CREATE TABLE` to control distribution patterns. +::: + +## 9. Best Practices + +| Area | Recommendation | +| ------------- | ------------------------------------------------------------ | +| Keys & IDs | Use UUIDs or consistent IDs for primary keys | +| Sharding | Let CrateDB auto-shard unless you have advanced requirements | +| Join Strategy | Minimize joins over large, high-cardinality columns | +| Nested Fields | Use `column_policy = 'dynamic'` if schema needs flexibility | +| Aggregations | Favor columnar tables for analytical workloads | +| Co-location | Consider denormalization for write-heavy workloads | + +## 10. Further Learning & Resources + +* CrateDB Docs – Data Modeling +* CrateDB Academy – Relational Modeling +* Working with Joins in CrateDB +* Schema Design Guide + +## 11. Summary + +CrateDB offers a familiar, powerful **relational model with full SQL** and built-in support for scale, performance, and hybrid data. You can model clean, normalized data structures and join them across millions of records—without sacrificing the flexibility to embed, index, and evolve schema dynamically. + +CrateDB is the modern SQL engine for building relational, real-time, and hybrid apps in a distributed world. 
diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md new file mode 100644 index 00000000..a8993b7c --- /dev/null +++ b/docs/start/modelling/timeseries.md @@ -0,0 +1,137 @@ +# Time series data + +CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. + +## 1. Why CrateDB for Time Series? + +* **Distributed architecture and columnar storage** enable very high ingest throughput with fast aggregations and near‑real‑time analytical queries. +* Handles **high cardin­ality** and **mixed data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. +* **PostgreSQL wire‑protocol compatible**, so it integrates easily with existing tools and drivers. + +## 2. Data Model Template + +A typical time‑series schema looks like this: + +
+```sql
+CREATE TABLE IF NOT EXISTS weather_data (
+    ts TIMESTAMP,
+    location VARCHAR,
+    temperature DOUBLE,
+    humidity DOUBLE CHECK (humidity >= 0),
+    PRIMARY KEY (ts, location)
+)
+WITH (column_policy = 'dynamic');
+```
+
+ +Key points: + +* `ts`: append‑only timestamp column +* Composite primary key `(ts, location)` ensures uniqueness and efficient sort/group by time +* `column_policy = 'dynamic'` allows schema evolution: inserting a new field auto‑creates the column. + +## 3. Ingesting and Querying + +### **Data Ingestion** + +* Use SQL `INSERT` or bulk import techniques like `COPY FROM` with JSON or CSV files. +* Schema inference can often happen automatically during import. + +### **Aggregation and Transformations** + +CrateDB offers built‑in SQL functions tailor‑made for time‑series analyses: + +* **`DATE_BIN(interval, timestamp, origin)`** for bucketed aggregations (down‑sampling). +* **Window functions** like `LAG()` and `LEAD()` to detect trends or gaps. +* **`MAX_BY()`** returns the value from one column matching the min/max value of another column in a group. + +**Example**: compute hourly average battery levels and join with metadata: + +```postgresql +WITH avg_metrics AS ( + SELECT device_id, + DATE_BIN('1 hour', time, 0) AS period, + AVG(battery_level) AS avg_battery + FROM devices.readings + GROUP BY device_id, period +) +SELECT period, t.device_id, i.manufacturer, avg_battery +FROM avg_metrics t +JOIN devices.info i USING (device_id) +WHERE i.model = 'mustang'; +``` + +**Example**: gap detection interpolation: + +```text +WITH all_hours AS ( + SELECT generate_series(ts_start, ts_end, ‘30 second’) AS expected_time +), +raw AS ( + SELECT time, battery_level FROM devices.readings +) +SELECT expected_time, r.battery_level +FROM all_hours +LEFT JOIN raw r ON expected_time = r.time +ORDER BY expected_time; +``` + +## 4. Downsampling & Interpolation + +To reduce volume while preserving trends, use `DATE_BIN`.\ +Missing data can be handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. + +## 5. Schema Evolution & Contextual Data + +With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. + +You can also store: + +* **Geospatial** (`GEO_POINT`, `GEO_SHAPE`) +* **Vectors** (up to 2048 dims via HNSW indexing) +* **BLOBs** for binary data (e.g. images, logs) + +All types are supported within the same table or joined together. + +## 6. Storage Optimization + +* **Partitioning and sharding**: data can be partitioned by time (e.g. daily/monthly) and sharded across a cluster. +* Supports long‑term retention with performant historic storage. +* Columnar layout reduces storage footprint and accelerates aggregation queries. + +## 7. Advanced Use Cases + +* **Exploratory data analysis** (EDA), decomposition, and forecasting via CrateDB’s SQL or by exporting to Pandas/Plotly. +* **Machine learning workflows**: time‑series features and anomaly detection pipelines can be built using CrateDB + external tools + +## 8. Sample Workflow (Chicago Weather Dataset) + +CrateDB’s sample data set captures hourly temperature, humidity, pressure, wind at three Chicago stations (150,000+ records). + +Typical operations: + +* Table creation and ingestion +* Average per station +* Using `MAX_BY()` to find highest temperature timestamps +* Downsampling using `DATE_BIN` into 4‑week buckets + +This workflow illustrates how CrateDB scales and simplifies time series modeling. + +## 9. 
Best Practices Checklist + +| Topic | Recommendation | +| ----------------------------- | ------------------------------------------------------------------- | +| Schema design | Use composite primary key (timestamp + series key), dynamic columns | +| Ingestion | Use bulk import (COPY) and JSON ingestion | +| Aggregations | Use DATE\_BIN, window functions, GROUP BY | +| Interpolation / gap analysis | Employ LAG(), LEAD(), generate\_series, joins | +| Schema evolution | Dynamic columns allow adding fields on the fly | +| Mixed data types | Combine time series, JSON, geo, full‑text in one dataset | +| Partitioning & shard strategy | Partition by time, shard across nodes for scale | +| Downsampling | Use DATE\_BIN for aggregating resolution | +| Integration with analytics/ML | Export to pandas/Plotly or train ML models inside CrateDB pipeline | + +## 10. Further Learning + +* Video: **Time Series Data Modeling** – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. +* Official CrateDB Guide: **Time Series Fundamentals**, **Advanced Time Series Analysis**, **Sharding & Partitioning**. +* CrateDB Academy: free courses including an **Advanced Time Series Modeling** module. + diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md new file mode 100644 index 00000000..6c537428 --- /dev/null +++ b/docs/start/modelling/vector.md @@ -0,0 +1,151 @@ +# Vector data + +CrateDB natively supports **vector embeddings** for efficient **similarity search** using **approximate nearest neighbor (ANN)** algorithms. This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics—all in the simplicity of SQL. + +Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables **real-time vector search at scale**, in combination with other data types like full-text, geospatial, and time-series.\ + + +## 1. Data Type: VECTOR + +CrateDB introduces a native `VECTOR` type with the following key characteristics: + +* Fixed-length float arrays (e.g. 768, 1024, 2048 dimensions) +* Supports **HNSW (Hierarchical Navigable Small World)** indexing for fast approximate search +* Optimized for cosine, Euclidean, and dot-product similarity + +**Example: Define a Table with Vector Embeddings** + +```sql +CREATE TABLE documents ( + id UUID PRIMARY KEY, + title TEXT, + content TEXT, + embedding VECTOR(FLOAT[768]) +); +``` + +* `VECTOR(FLOAT[768])` declares a fixed-size vector column. +* You can ingest vectors directly or compute them externally and store them via SQL + +## 2. Indexing: Enabling Vector Search + +To use fast similarity search, define an **HNSW index** on the vector column: + +```sql +CREATE INDEX embedding_hnsw +ON documents (embedding) +USING HNSW +WITH ( + m = 16, + ef_construction = 128, + ef_search = 64, + similarity = 'cosine' +); +``` + +**Parameters:** + +* `m`: controls the number of bi-directional links per node (default: 16) +* `ef_construction`: affects index build accuracy/speed (default: 128) +* `ef_search`: controls recall/latency trade-off at query time +* `similarity`: choose from `'cosine'`, `'l2'` (Euclidean), `'dot_product'` + +> CrateDB automatically builds the ANN index in the background, allowing for real-time updates. + +## 3. 
Querying Vectors with SQL + +Use the `nearest_neighbors` predicate to perform similarity search: + +```sql +SELECT id, title, content +FROM documents +ORDER BY embedding <-> [0.12, 0.73, ..., 0.01] +LIMIT 5; +``` + +This ranks results by **vector similarity** using the index. + +Or, filter and rank by proximity: + +```sql +SELECT id, title, content, embedding <-> [0.12, ..., 0.01] AS score +FROM documents +WHERE MATCH(content_ft, 'machine learning') AND author = 'Alice' +ORDER BY score +LIMIT 10; +``` + +:::{note} +Combine vector similarity with full-text, metadata, or geospatial filters! +::: + +## 4. Ingestion: Working with Embeddings + +You can ingest vectors in several ways: + +* **Precomputed embeddings** from models like OpenAI, HuggingFace, or SentenceTransformers: + + ```sql + INSERT INTO documents (id, title, embedding) + VALUES ('uuid-123', 'AI and Databases', [0.12, 0.34, ..., 0.01]); + ``` +* **Batched imports** via `COPY FROM` using JSON or CSV +* CrateDB doesn't currently compute embeddings internally—you bring your own model or use pipelines that call CrateDB. + +## 5. Use Cases + +| Use Case | Description | +| ----------------------- | ------------------------------------------------------------------ | +| Semantic Search | Rank documents by meaning instead of keywords | +| Recommendation Systems | Find similar products, users, or behaviors | +| Image / Audio Retrieval | Store and compare embeddings of images/audio | +| Fraud Detection | Match behavioral patterns via vectors | +| Hybrid Search | Combine vector similarity with full-text, geo, or temporal filters | + +Example: Hybrid semantic product search + +```sql +SELECT id, title, price, description +FROM products +WHERE MATCH(description_ft, 'running shoes') AND brand = 'Nike' +ORDER BY features <-> [vector] ASC +LIMIT 10; +``` + +## 6. Performance & Scaling + +* Vector search uses **HNSW**: state-of-the-art ANN algorithm with logarithmic search complexity. +* CrateDB parallelizes ANN search across shards/nodes. +* Ideal for 100K to tens of millions of vectors; supports real-time ingestion and queries. + +:::{note} +Note: vector dimensionality must be consistent for each column. +::: + +## 7. Best Practices + +| Area | Recommendation | +| -------------- | ----------------------------------------------------------------------- | +| Vector length | Use standard embedding sizes (e.g. 384, 512, 768, 1024) | +| Similarity | Cosine for semantic/textual data; dot-product for ranking models | +| Index tuning | Tune `ef_search` for latency/recall trade-offs | +| Hybrid queries | Combine vector similarity with metadata filters (e.g. category, region) | +| Updates | Re-inserting or updating vectors is fully supported | +| Data pipelines | Use external tools for vector generation; push to CrateDB via REST/SQL | + +## 8. Integrations + +* **Python / pandas / LangChain**: CrateDB has native drivers and REST interface +* **Embedding models**: Use OpenAI, HuggingFace, Cohere, or in-house models +* **RAG architecture**: CrateDB stores vector + metadata + raw text in a unified store + +## 9. Further Learning & Resources + +* CrateDB Docs – Vector Search +* Blog: Using CrateDB for Hybrid Search (Vector + Full-Text) +* CrateDB Academy – Vector Data +* [Sample notebooks on GitHub](https://github.com/crate/cratedb-examples) + +## 10. Summary + +CrateDB gives you the power of **vector similarity search** with the **flexibility of SQL** and the **scalability of a distributed database**. 
It lets you unify structured, unstructured, and semantic data—enabling modern applications in AI, search, and recommendation without additional vector databases or pipelines. From 25490061320ac0a20c7aef1f568665d61ce7d8b2 Mon Sep 17 00:00:00 2001 From: surister Date: Sat, 23 Aug 2025 12:32:05 +0200 Subject: [PATCH 02/58] Data modelling: Fix page about "relational data" --- docs/start/modelling/relational.md | 100 ++++++++++++++++------------- 1 file changed, 56 insertions(+), 44 deletions(-) diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index bdaa4245..bd982ae8 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -1,42 +1,35 @@ # Relational data -CrateDB is a **distributed SQL database** that offers full **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary and foreign keys**, **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector, and semi-structured data. +CrateDB is a **distributed SQL database** that offers rich **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary keys,** **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector search, and semi-structured data. -Use CrateDB when you need to scale relational workloads horizontally while keeping the simplicity of **ANSI SQL**. +Use CrateDB when you need to scale relational workloads horizontally while keeping the simplicity of **SQL**. -## 1. Table Definitions +## Table Definitions CrateDB supports strongly typed relational schemas using familiar SQL syntax: ```sql CREATE TABLE customers ( - id UUID PRIMARY KEY, + id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY, name TEXT, email TEXT, - created_at TIMESTAMP -); - -CREATE TABLE orders ( - order_id UUID PRIMARY KEY, - customer_id UUID, - total_amount DOUBLE, - created_at TIMESTAMP + created_at TIMESTAMP DEFAULT now() ); ``` **Key Features:** * Supports scalar types (`TEXT`, `INTEGER`, `DOUBLE`, `BOOLEAN`, `TIMESTAMP`, etc.) -* `UUID` recommended for primary keys in distributed environments +* `gen_random_text_uuid()`, `now()` or `current_timestamp()` recommended for primary keys in distributed environments * Default **replication**, **sharding**, and **partitioning** options are built-in for scale :::{note} CrateDB supports `column_policy = 'dynamic'` if you want to mix relational and semi-structured models (like JSON) in the same table. ::: -## 2. Joins & Relationships +## Joins & Relationships -CrateDB supports **inner joins**, **left/right joins**, **cross joins**, and even **self joins**. +CrateDB supports **inner joins**, **left/right joins**, **cross joins**, **outer joins**, and even **self joins**. **Example: Join Customers and Orders** @@ -49,7 +42,7 @@ WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'; Joins are executed efficiently across shards in a **distributed query planner** that parallelizes execution. -## 3. Normalization vs. Embedding +## Normalization vs. Embedding CrateDB supports both **normalized** (relational) and **denormalized** (embedded JSON) approaches. 
@@ -60,24 +53,25 @@ Example: Embedded products inside an `orders` table: ```sql CREATE TABLE orders ( - order_id UUID PRIMARY KEY, - customer_id UUID, - items ARRAY(OBJECT ( - name TEXT, - quantity INTEGER, - price DOUBLE - )), + order_id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY, + items ARRAY( + OBJECT(DYNAMIC) AS ( + name TEXT, + quantity INTEGER, + price DOUBLE + ) + ), created_at TIMESTAMP ); ``` :::{note} -CrateDB lets you **query nested fields** directly using dot notation: `items['name']`, `items['price']`, etc. +CrateDB lets you **query nested fields** directly using bracket notation: `items['name']`, `items['price']`, etc. ::: -## 4. Aggregations & Grouping +## Aggregations & Grouping -Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS`, and even `FILTER`. +Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS` ... etc. ```sql SELECT customer_id, COUNT(*) AS num_orders, SUM(total_amount) AS revenue @@ -90,28 +84,28 @@ HAVING revenue > 1000; CrateDB's **columnar storage** optimizes performance for aggregations—even on large datasets. ::: -## 5. Constraints & Indexing +## Constraints & Indexing CrateDB supports: * **Primary Keys** – enforced for uniqueness and data distribution -* **Unique Constraints** – optional, enforced locally -* **Check Constraints** – for value validation -* **Indexes** – automatic for primary keys and full-text fields; manual for others +* **Check -** enforces custom value validation +* **Indexes** – automatic index for all columns +* **Full-text indexes -** manually defined, supports many tokenizers, analyzers and filters + +In CrateDB every column is indexed by default, depending on the datatype a different index is used, indexing is controlled and maintained by the database, there is no need to `vacuum` or `re-index` like in other systems. Indexing can be manually turned off. ```sql CREATE TABLE products ( - id UUID PRIMARY KEY, + id TEXT PRIMARY KEY, name TEXT, - price DOUBLE CHECK (price >= 0) + price DOUBLE CHECK (price >= 0), + tag TEXT INDEX OFF, + description TEXT INDEX using fulltext ); ``` -:::{note} -Foreign key constraints are not strictly enforced at write time but can be modeled at the application or query layer. -::: - -## 6. Views & Subqueries +## Views & Subqueries CrateDB supports **views**, **CTEs**, and **nested subqueries**. @@ -120,7 +114,7 @@ CrateDB supports **views**, **CTEs**, and **nested subqueries**. ```sql CREATE VIEW recent_orders AS SELECT * FROM orders -WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'; +WHERE created_at >= CURRENT_DATE::TIMESTAMP - INTERVAL '7 days'; ``` **Example: Correlated Subquery** @@ -131,7 +125,25 @@ SELECT name, FROM customers c; ``` -## 7. Use Cases for Relational Modeling +**Example: Common table expression** + +```sql +WITH order_counts AS ( + SELECT + o.customer_id, + COUNT(*) AS order_count + FROM orders o + GROUP BY o.customer_id +) +SELECT + c.name, + COALESCE(oc.order_count, 0) AS order_count +FROM customers c +LEFT JOIN order_counts oc + ON c.id = oc.customer_id; +``` + +## Use Cases for Relational Modeling | Use Case | Description | | -------------------- | ------------------------------------------------ | @@ -141,7 +153,7 @@ FROM customers c; | User Profiles | Users, preferences, activity logs | | Multi-tenant Systems | Use schemas or partitioning for tenant isolation | -## 8. 
Scalability & Distribution +## Scalability & Distribution CrateDB automatically shards tables across nodes, distributing both **data and query processing**. @@ -153,7 +165,7 @@ CrateDB automatically shards tables across nodes, distributing both **data and q Use `CLUSTERED BY` and `PARTITIONED BY` in `CREATE TABLE` to control distribution patterns. ::: -## 9. Best Practices +## Best Practices | Area | Recommendation | | ------------- | ------------------------------------------------------------ | @@ -164,15 +176,15 @@ Use `CLUSTERED BY` and `PARTITIONED BY` in `CREATE TABLE` to control distributio | Aggregations | Favor columnar tables for analytical workloads | | Co-location | Consider denormalization for write-heavy workloads | -## 10. Further Learning & Resources +## Further Learning & Resources * CrateDB Docs – Data Modeling * CrateDB Academy – Relational Modeling * Working with Joins in CrateDB * Schema Design Guide -## 11. Summary +## Summary -CrateDB offers a familiar, powerful **relational model with full SQL** and built-in support for scale, performance, and hybrid data. You can model clean, normalized data structures and join them across millions of records—without sacrificing the flexibility to embed, index, and evolve schema dynamically. +CrateDB offers a familiar, powerful **relational model with full SQL** and built-in support for scale, performance, and hybrid data. You can model clean, normalized data structures and join them across millions of records, without sacrificing the flexibility to embed, index, and evolve schema dynamically. CrateDB is the modern SQL engine for building relational, real-time, and hybrid apps in a distributed world. From 73ce05705220d011de19aea3754dd8694ee767ad Mon Sep 17 00:00:00 2001 From: Daryl Dudey Date: Sat, 23 Aug 2025 12:36:31 +0200 Subject: [PATCH 03/58] Data modelling: Fix page about "json data" --- docs/start/modelling/json.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 78583efa..de71eb27 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -4,7 +4,7 @@ CrateDB combines the flexibility of NoSQL document stores with the power of SQL. CrateDB’s support for dynamic objects, nested structures, and dot-notation querying brings the best of both relational and document-based data modeling—without leaving the SQL world. -## 1. Object (JSON) Columns +## Object (JSON) Columns CrateDB allows you to define **object columns** that can store JSON-style data structures. @@ -32,7 +32,7 @@ This allows inserting flexible, nested JSON data into `payload`: } ``` -## 2. Column Policy: Strict vs Dynamic +## Column Policy: Strict vs Dynamic You can control how CrateDB handles unexpected fields in an object column: @@ -54,9 +54,9 @@ CREATE TABLE sensor_data ( ); ``` -## 3. Querying JSON Fields +## Querying JSON Fields -Use **dot notation** to access nested fields: +Use **bracket notation** to access nested fields: ```sql SELECT payload['user']['name'], payload['device']['os'] @@ -76,7 +76,7 @@ WHERE payload['device']['os'] = 'Android'; Dot-notation works for both explicitly and dynamically added fields. ::: -## 4. 
Querying DYNAMIC OBJECTs +## Querying DYNAMIC OBJECTs To support querying DYNAMIC OBJECTs using SQL, where keys may not exist within an OBJECT, CrateDB provides the [error\_on\_unknown\_object\_key](https://cratedb.com/docs/crate/reference/en/latest/config/session.html#conf-session-error-on-unknown-object-key) session setting. It controls the behaviour when querying unknown object keys to dynamic objects. @@ -100,7 +100,7 @@ cr> SELECT item['unknown'] FROM testdrive; SELECT 0 rows in set (0.051 sec) ``` -## 5. Arrays of Objects +## Arrays of OBJECTs Store arrays of objects for multi-valued nested data: @@ -132,7 +132,7 @@ FROM products WHERE specs['name'] = 'battery' AND specs['value'] = 'AA'; ``` -## 6. Combining Structured & Semi-Structured Data +## Combining Structured & Semi-Structured Data CrateDB supports **hybrid schemas**, mixing standard columns with JSON fields: @@ -152,7 +152,7 @@ This allows you to: * Flexibly store structured or unstructured metadata * Add new fields on the fly without migrations -## 7. Indexing Behavior +## Indexing Behavior CrateDB **automatically indexes** object fields if: @@ -181,7 +181,7 @@ data['some_field'] INDEX OFF Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. ::: -## 8. Aggregating JSON Fields +## Aggregating JSON Fields CrateDB allows full SQL-style aggregations on nested fields: @@ -193,7 +193,7 @@ WHERE payload['location'] = 'room1'; CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. -## 9. Use Cases for JSON Modeling +## Use Cases for JSON Modeling | Use Case | Description | | ------------------ | -------------------------------------------- | @@ -203,7 +203,7 @@ CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on | User Profiles | Custom settings, device info, preferences | | Telemetry / Events | Event streams with evolving structure | -## 10. Best Practices +## Best Practices | Area | Recommendation | | ---------------- | -------------------------------------------------------------------- | @@ -213,14 +213,14 @@ CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on | Column Mixing | Combine structured columns with JSON for hybrid models | | Observability | Monitor number of dynamic columns using `information_schema.columns` | -## 11. Further Learning & Resources +## Further Learning & Resources * CrateDB Docs – Object Columns * Working with JSON in CrateDB * CrateDB Academy – Modeling with JSON * Understanding Column Policies -## 12. Summary +## Summary CrateDB makes it easy to model **semi-structured JSON data** with full SQL support. Whether you're building a telemetry pipeline, an event store, or a product catalog, CrateDB offers the flexibility of a document store—while preserving the structure, indexing, and power of a relational engine. 
From 75899f6a86d11f43e0d2258f063d905be4874862 Mon Sep 17 00:00:00 2001 From: karynzv Date: Sat, 23 Aug 2025 12:38:32 +0200 Subject: [PATCH 04/58] Data modelling: Fix page about "timeseries data" --- docs/start/modelling/timeseries.md | 130 +++++++++++++++++------------ 1 file changed, 76 insertions(+), 54 deletions(-) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index a8993b7c..f9703ca6 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -2,33 +2,45 @@ CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. -## 1. Why CrateDB for Time Series? +## Why CrateDB for Time Series? -* **Distributed architecture and columnar storage** enable very high ingest throughput with fast aggregations and near‑real‑time analytical queries. -* Handles **high cardin­ality** and **mixed data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. +* While maintaining a high ingest rate, its **columnar storage** and **automatic indexing** let you access and analyze the data immediately with **fast aggregations** and **near-real-time queries**. +* Handles **high cardin­ality** and **a variety of data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. * **PostgreSQL wire‑protocol compatible**, so it integrates easily with existing tools and drivers. -## 2. Data Model Template +## Data Model Template A typical time‑series schema looks like this: -
CREATE TABLE IF NOT EXISTS weather_data (
-    ts TIMESTAMP,
-    location VARCHAR,
-    temperature DOUBLE,
-    humidity DOUBLE CHECK (humidity >= 0),
-    PRIMARY KEY (ts, location)
-)
-WITH (column_policy = 'dynamic');
-
+```sql +CREATE TABLE IF NOT EXISTS devices_readings ( + ts TIMESTAMP WITH TIME ZONE, + device_id TEXT, + battery OBJECT(DYNAMIC) AS ( + level BIGINT, + status TEXT, + temperature DOUBLE PRECISION + ), + cpu OBJECT(DYNAMIC) AS ( + avg_1min DOUBLE PRECISION, + avg_5min DOUBLE PRECISION, + avg_15min DOUBLE PRECISION + ), + memory OBJECT(DYNAMIC) AS ( + free BIGINT, + used BIGINT + ), + month timestamp with time zone GENERATED ALWAYS AS date_trunc('month', ts) +) PARTITIONED BY (month); +``` Key points: -* `ts`: append‑only timestamp column -* Composite primary key `(ts, location)` ensures uniqueness and efficient sort/group by time -* `column_policy = 'dynamic'` allows schema evolution: inserting a new field auto‑creates the column. +* `month` is the partitioning key, optimizing data storage and retrieval. +* Every column is stored in the column store by default for fast aggregations. +* Using **OBJECT columns** in the `devices_readings` table provides a structured and efficient way to organize complex nested data in CrateDB, enhancing both data integrity and flexibility. -## 3. Ingesting and Querying +## Ingesting and Querying ### **Data Ingestion** @@ -45,43 +57,56 @@ CrateDB offers built‑in SQL functions tailor‑made for time‑series analyses **Example**: compute hourly average battery levels and join with metadata: -```postgresql +```sql WITH avg_metrics AS ( SELECT device_id, - DATE_BIN('1 hour', time, 0) AS period, - AVG(battery_level) AS avg_battery - FROM devices.readings + DATE_BIN('1 hour'::interval, ts, 0) AS period, + AVG(battery['level']) AS avg_battery + FROM devices_readings GROUP BY device_id, period ) SELECT period, t.device_id, i.manufacturer, avg_battery FROM avg_metrics t -JOIN devices.info i USING (device_id) +JOIN devices_info i USING (device_id) WHERE i.model = 'mustang'; ``` **Example**: gap detection interpolation: -```text +```sql WITH all_hours AS ( - SELECT generate_series(ts_start, ts_end, ‘30 second’) AS expected_time + SELECT + generate_series( + '2025-01-01', + '2025-01-02', + '30 second' :: interval + ) AS expected_time ), raw AS ( - SELECT time, battery_level FROM devices.readings + SELECT + ts, + battery ['level'] + FROM + devices_readings ) -SELECT expected_time, r.battery_level -FROM all_hours -LEFT JOIN raw r ON expected_time = r.time -ORDER BY expected_time; +SELECT + expected_time, + r.battery ['level'] +FROM + all_hours + LEFT JOIN raw r ON expected_time = r.ts +ORDER BY + expected_time; ``` -## 4. Downsampling & Interpolation +## Down-sampling & Interpolation To reduce volume while preserving trends, use `DATE_BIN`.\ Missing data can be handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. -## 5. Schema Evolution & Contextual Data +## Schema Evolution & Contextual Data -With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. +With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. For column-level control, use `OBJECT(DYNAMIC)` to auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)`to accept unknown keys without creating or indexing subcolumns. You can also store: @@ -91,47 +116,44 @@ You can also store: All types are supported within the same table or joined together. -## 6. 
Storage Optimization +## Storage Optimization * **Partitioning and sharding**: data can be partitioned by time (e.g. daily/monthly) and sharded across a cluster. * Supports long‑term retention with performant historic storage. * Columnar layout reduces storage footprint and accelerates aggregation queries. -## 7. Advanced Use Cases +## Advanced Use Cases * **Exploratory data analysis** (EDA), decomposition, and forecasting via CrateDB’s SQL or by exporting to Pandas/Plotly. * **Machine learning workflows**: time‑series features and anomaly detection pipelines can be built using CrateDB + external tools -## 8. Sample Workflow (Chicago Weather Dataset) +## Sample Workflow (Chicago Weather Dataset) -CrateDB’s sample data set captures hourly temperature, humidity, pressure, wind at three Chicago stations (150,000+ records). +In [this lesson of the CrateDB Academy](https://cratedb.com/academy/fundamentals/data-modelling-with-cratedb/hands-on-time-series-data) introducing Time Series data, we provide a sample data set that captures hourly temperature, humidity, pressure, wind at three Chicago stations (150,000+ records). Typical operations: * Table creation and ingestion * Average per station * Using `MAX_BY()` to find highest temperature timestamps -* Downsampling using `DATE_BIN` into 4‑week buckets +* Down-sampling using `DATE_BIN` into 4‑week buckets This workflow illustrates how CrateDB scales and simplifies time series modeling. -## 9. Best Practices Checklist - -| Topic | Recommendation | -| ----------------------------- | ------------------------------------------------------------------- | -| Schema design | Use composite primary key (timestamp + series key), dynamic columns | -| Ingestion | Use bulk import (COPY) and JSON ingestion | -| Aggregations | Use DATE\_BIN, window functions, GROUP BY | -| Interpolation / gap analysis | Employ LAG(), LEAD(), generate\_series, joins | -| Schema evolution | Dynamic columns allow adding fields on the fly | -| Mixed data types | Combine time series, JSON, geo, full‑text in one dataset | -| Partitioning & shard strategy | Partition by time, shard across nodes for scale | -| Downsampling | Use DATE\_BIN for aggregating resolution | -| Integration with analytics/ML | Export to pandas/Plotly or train ML models inside CrateDB pipeline | +## Best Practices Checklist -## 10. Further Learning +| Topic | Recommendation | +| ----------------------------- | ---------------------------------------------------------------------------------- | +| Schema design and evolution | Dynamic columns add fields as needed; diverse data types ensure proper typing | +| Ingestion | Use bulk import (COPY) and JSON ingestion | +| Aggregations | Use DATE\_BIN, window functions, GROUP BY | +| Interpolation / gap analysis | Employ LAG(), LEAD(), generate\_series, joins | +| Mixed data types | Combine time series, JSON, geo, full‑text in one dataset | +| Partitioning & shard strategy | Partition by time, shard across nodes for scale | +| Down-sampling | Use DATE\_BIN for aggregating resolution or implement your own strategy using UDFs | +| Integration with analytics/ML | Export to pandas/Plotly to train your ML models | -* Video: **Time Series Data Modeling** – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. -* Official CrateDB Guide: **Time Series Fundamentals**, **Advanced Time Series Analysis**, **Sharding & Partitioning**. -* CrateDB Academy: free courses including an **Advanced Time Series Modeling** module. 
+## Further Learning +* **Video:** [Time Series Data Modeling](https://cratedb.com/resources/videos/time-series-data-modeling) – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. +* **CrateDB Academy:** [Advanced Time Series Modeling course](https://cratedb.com/academy/time-series/getting-started/introduction-to-time-series-data). From f6d3bbac6a61cbb50320b5a0539223d237ef2a0c Mon Sep 17 00:00:00 2001 From: Kenneth Geisshirt Date: Sat, 23 Aug 2025 12:41:12 +0200 Subject: [PATCH 05/58] Data modelling: Fix page about "geospatial data" --- docs/start/modelling/geospatial.md | 118 +++++++++++++++++------------ 1 file changed, 68 insertions(+), 50 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 7ddbc982..576b3da0 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -1,35 +1,74 @@ # Geospatial data -CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. +CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze 2D location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. -## 1. Geospatial Data Types +## Geospatial Data Types ### **GEO\_POINT** * Stores a single location via latitude/longitude. -* Insert using either a coordinate array `[lon, lat]` or WKT string `'POINT (lon lat)'`. -* Must be declared explicitly; dynamic schema inference will not detect geo\_point type. +* Insert using either a coordinate array `[lon, lat]` or Well-Known Text (WKT) string `'POINT (lon lat)'`. +* Must be declared explicitly; dynamic schema inference will not detect `geo_point` type. ### **GEO\_SHAPE** -* Supports complex geometries (Point, LineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT. -* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold +* Supports complex geometries (Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT. +* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold. The indexes are described in the [reference manual](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#type-geo-shape-index). -## 2. Table Schema Example +## Table Schema Example -
CREATE TABLE parcel_zones (
-    zone_id INTEGER PRIMARY KEY,
-    name VARCHAR,
-    area GEO_SHAPE,
-    centroid GEO_POINT
-)
-WITH (column_policy = 'dynamic');
-
+Let's define a table with country boarders and capital: -* Use `GEO_SHAPE` to define zones or service areas. -* `GEO_POINT` allows for simple referencing (e.g. store approximate center of zone). +```sql +CREATE TABLE country ( + name text, + country_code text primary key, + shape geo_shape INDEX USING "geohash" WITH (precision='100m'), + capital text, + capital_location geo_point +) +``` + +* Use `GEO_SHAPE` to define the border. +* `GEO_POINT` to define the location of the capital. -## 3. Core Geospatial Functions +## Insert rows + +We can populate the table with Austria: + +```sql +INSERT INTO country (name, country_code, shape, capital, capital_location) +VALUES ( + 'Austria', + 'at', + + ##{type='Polygon', coordinates=[ + [[16.979667, 48.123497], [16.903754, 47.714866], + [16.340584, 47.712902], [16.534268, 47.496171], + [16.202298, 46.852386], [16.011664, 46.683611], + [15.137092, 46.658703], [14.632472, 46.431817], + [13.806475, 46.509306], [12.376485, 46.767559], + [12.153088, 47.115393], [11.164828, 46.941579], + [11.048556, 46.751359], [10.442701, 46.893546], + [9.932448, 46.920728], [9.47997, 47.10281], + [9.632932, 47.347601], [9.594226, 47.525058], + [9.896068, 47.580197], [10.402084, 47.302488], + [10.544504, 47.566399], [11.426414, 47.523766], + [12.141357, 47.703083], [12.62076, 47.672388], + [12.932627, 47.467646], [13.025851, 47.637584], + [12.884103, 48.289146], [13.243357, 48.416115], + [13.595946, 48.877172], [14.338898, 48.555305], + [14.901447, 48.964402], [15.253416, 49.039074], + [16.029647, 48.733899], [16.499283, 48.785808], + [16.960288, 48.596982], [16.879983, 48.470013], + [16.979667, 48.123497]] + ]}, + 'Vienna', + [16.372778, 48.209206] +); +``` + +## Core Geospatial Functions CrateDB provides key scalar functions for spatial operations: @@ -40,62 +79,41 @@ CrateDB provides key scalar functions for spatial operations: * **`geohash(geo_point)`** – compute a 12‑character geohash for the point * **`area(geo_shape)`** – returns approximate area in square degrees; uses geodetic awareness -Note: More precise relational operations on shapes may bypass indexes and can be slower. - -## 4. Spatial Queries & Indexing - -CrateDB supports Lucene-based spatial indexing (Prefix Tree and BKD-tree structures) for efficient geospatial search. Use the `MATCH` predicate to leverage indices when filtering spatial data by bounding boxes, circles, polygons, etc. - -**Example: Find nearby assets** +Furthermore, it is possible to use the **match** predicate with geospatial data in queries. -```sql -SELECT asset_id, DISTANCE(center_point, asset_location) AS dist -FROM assets -WHERE center_point = 'POINT(-1.234 51.050)'::GEO_POINT -ORDER BY dist -LIMIT 10; -``` - -**Example: Count incidents within service area** +Note: More precise relational operations on shapes may bypass indexes and can be slower. -```sql -SELECT area_id, count(*) AS incident_count -FROM incidents -WHERE within(incidents.location, service_areas.area) -GROUP BY area_id; -``` +## An example query -**Example: Which zones intersect a flight path** +It is possible to find the distance to the capital of each country in the table: ```sql -SELECT zone_id, name -FROM flight_paths f -JOIN service_zones z -ON intersects(f.path_geom, z.area); +SELECT distance(capital_location, [9.74, 47.41])/1000 +FROM country; ``` -## 5. Real-World Examples: Chicago Use Cases +## Real-World Examples: Chicago Use Cases * **311 calls**: Each record includes `location` as `GEO_POINT`. 
Queries use `within()` to find calls near a polygon around O’Hare airport. * **Community areas**: Polygon boundaries stored in `GEO_SHAPE`. Queries for intersections with arbitrary lines or polygons using `intersects()` return overlapping zones. * **Taxi rides**: Pickup/drop off locations stored as geo points. Use `distance()` filter to compute trip distances and aggregate. -## 6. Architectural Strengths & Suitability +## Architectural Strengths & Suitability * Designed for **real-time geospatial tracking and analytics** (e.g. fleet tracking, mapping, location-layered apps). * **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query. * **High ingest and query throughput**, suitable for large-scale location-based workloads -## 7. Best Practices Checklist +## Best Practices Checklist
| Topic                   | Recommendation                                                        |
| ----------------------- | --------------------------------------------------------------------- |
| Data types              | Declare GEO_POINT/GEO_SHAPE explicitly                                 |
| Geometric formats       | Use WKT or GeoJSON for insertions                                      |
| Index tuning            | Choose geohash/quadtree/BKD tree & adjust precision                    |
| Queries                 | Prefer MATCH for indexed filtering; use functions for precise checks   |
| Joins & spatial filters | Use within/intersects to correlate spatial entities                    |
| Scale & performance     | Index shapes, use distance/within filters early                        |
| Mixed-model integration | Combine spatial with JSON, full-text, vector, time-series              |
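To illustrate the "prefer MATCH for indexed filtering" recommendation above, the sketch below reuses the `country` table defined earlier; the polygon is a hypothetical bounding box, and the query relies on CrateDB's geo `MATCH` predicate with the `intersects` match type:

```sql
-- Find countries whose border intersects a (hypothetical) bounding box
SELECT name, capital
FROM country
WHERE MATCH(shape, 'POLYGON ((5 45, 20 45, 20 50, 5 50, 5 45))') USING intersects;
```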
-## 8. Further Learning & Resources +## Further Learning & Resources * Official **Geospatial Search Guide** in CrateDB docs, detailing geospatial types, indexing, and MATCH predicate usage. * CrateDB Academy **Hands-on: Geospatial Data** modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries. * CrateDB Blog: **Geospatial Queries with CrateDB** – outlines capabilities, limitations, and practical use cases (available since version 0.40 -## 9. Summary +## Summary CrateDB provides robust support for geospatial modeling through clearly defined data types (`GEO_POINT`, `GEO_SHAPE`), powerful scalar functions (`distance`, `within`, `intersects`, `area`), and Lucene‑based indexing for fast queries. It excels in high‑volume, real‑time spatial analytics and integrates smoothly with multi-model use cases. Whether storing vehicle positions, mapping regions, or enabling spatial joins—CrateDB’s geospatial layer makes it easy, scalable, and extensible. From d60feef0030ce01fc611de74e6abec199d9d4013 Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sat, 23 Aug 2025 12:46:29 +0200 Subject: [PATCH 06/58] Data modelling: Fix SQL in page about "geospatial data" --- docs/start/modelling/geospatial.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 576b3da0..7cc8c012 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -26,7 +26,7 @@ CREATE TABLE country ( shape geo_shape INDEX USING "geohash" WITH (precision='100m'), capital text, capital_location geo_point -) +); ``` * Use `GEO_SHAPE` to define the border. @@ -34,15 +34,14 @@ CREATE TABLE country ( ## Insert rows -We can populate the table with Austria: +We can populate the table with the coordinate shape of Vienna/Austria: -```sql +```psql INSERT INTO country (name, country_code, shape, capital, capital_location) VALUES ( 'Austria', 'at', - - ##{type='Polygon', coordinates=[ + {type='Polygon', coordinates=[ [[16.979667, 48.123497], [16.903754, 47.714866], [16.340584, 47.712902], [16.534268, 47.496171], [16.202298, 46.852386], [16.011664, 46.683611], From 52c3008f37459e93d42b5b4ead767dca62e7312e Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Sat, 23 Aug 2025 12:48:51 +0200 Subject: [PATCH 07/58] Data modelling: Fix page about "full-text data" --- docs/start/modelling/fulltext.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index 41342afc..43f754b9 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -2,7 +2,7 @@ CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. -## 1. Data Types & Indexing Strategy +## Data Types & Indexing Strategy * By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries * To enable full‑text search, you must define a **FULLTEXT index** with an optional language **analyzer**, e.g.: @@ -21,7 +21,7 @@ CREATE TABLE documents ( INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english'); ``` -## 2. 
Index Design & Custom Analyzers +## Index Design & Custom Analyzers | Component | Purpose | | ----------------- | ---------------------------------------------------------------------------- | @@ -48,7 +48,7 @@ CREATE ANALYZER german_snowball WITH (language = 'german'); ``` -## 3. Querying: MATCH Predicate & Scoring +## Querying: MATCH Predicate & Scoring CrateDB uses the SQL `MATCH` predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score `_score`, ranked via BM25. @@ -100,7 +100,7 @@ WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzzine ORDER BY _score DESC; ``` -## 4. Use Cases & Integration +## Use Cases & Integration CrateDB is ideal for searching **semi-structured large text data**—product catalogs, article archives, user-generated content, descriptions and logs. @@ -119,13 +119,13 @@ WHERE This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one. -## 5. Architectural Strengths +## Architectural Strengths * **Built on Lucene inverted index + BM25**, offering relevance ranking comparable to search engines. * **Scale horizontally across clusters**, while maintaining fast indexing and search even on high volume datasets. * **Integrated SQL interface**: eliminates need for separate search services like Elasticsearch or Solr. -## 6. Best Practices Checklist +## Best Practices Checklist | Topic | Recommendation | | ------------------- | ---------------------------------------------------------------------------------- | @@ -138,13 +138,13 @@ This blend lets you query by text relevance, numeric filters, and spatial constr | Multi-model Queries | Combine full-text search with geo, JSON, numerical filters. | | Analyze Limitations | Understand phrase\_prefix caveats at scale; tune analyzer/tokenizer appropriately. | -## 7. Further Learning & Resources +## Further Learning & Resources * **CrateDB Full‑Text Search Guide**: details index creation, analyzers, MATCH usage. * **FTS Options & Advanced Features**: fuzziness, synonyms, multi-language idioms. * **Hands‑On Academy Course**: explore FTS on real datasets (e.g. Chicago neighborhoods). * **CrateDB Community Insights**: real‑world advice and experiences from users. -## **8. Summary** +## **Summary** CrateDB combines powerful Lucene‑based full‑text search capabilities with SQL, making it easy to model and query textual data at scale. It supports fuzzy matching, multi-language analysis, composite indexing, and integrates fully with other data types for rich, multi-model queries. 
Whether you're building document search, catalog lookup, or content analytics—CrateDB offers a flexible and scalable foundation.\ From 736652c1e52c78f3bc0e7a8c18c9cf0503b881f9 Mon Sep 17 00:00:00 2001 From: Juan Pardo Date: Sat, 23 Aug 2025 13:01:21 +0200 Subject: [PATCH 08/58] Data modelling: Fix page about "vector data" --- docs/start/modelling/vector.md | 45 ++++++++-------------------------- 1 file changed, 10 insertions(+), 35 deletions(-) diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md index 6c537428..5ac191a9 100644 --- a/docs/start/modelling/vector.md +++ b/docs/start/modelling/vector.md @@ -5,7 +5,7 @@ CrateDB natively supports **vector embeddings** for efficient **similarity searc Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables **real-time vector search at scale**, in combination with other data types like full-text, geospatial, and time-series.\ -## 1. Data Type: VECTOR +## Data Type: VECTOR CrateDB introduces a native `VECTOR` type with the following key characteristics: @@ -27,32 +27,7 @@ CREATE TABLE documents ( * `VECTOR(FLOAT[768])` declares a fixed-size vector column. * You can ingest vectors directly or compute them externally and store them via SQL -## 2. Indexing: Enabling Vector Search - -To use fast similarity search, define an **HNSW index** on the vector column: - -```sql -CREATE INDEX embedding_hnsw -ON documents (embedding) -USING HNSW -WITH ( - m = 16, - ef_construction = 128, - ef_search = 64, - similarity = 'cosine' -); -``` - -**Parameters:** - -* `m`: controls the number of bi-directional links per node (default: 16) -* `ef_construction`: affects index build accuracy/speed (default: 128) -* `ef_search`: controls recall/latency trade-off at query time -* `similarity`: choose from `'cosine'`, `'l2'` (Euclidean), `'dot_product'` - -> CrateDB automatically builds the ANN index in the background, allowing for real-time updates. - -## 3. Querying Vectors with SQL +## Querying Vectors with SQL Use the `nearest_neighbors` predicate to perform similarity search: @@ -79,7 +54,7 @@ LIMIT 10; Combine vector similarity with full-text, metadata, or geospatial filters! ::: -## 4. Ingestion: Working with Embeddings +## Ingestion: Working with Embeddings You can ingest vectors in several ways: @@ -92,7 +67,7 @@ You can ingest vectors in several ways: * **Batched imports** via `COPY FROM` using JSON or CSV * CrateDB doesn't currently compute embeddings internally—you bring your own model or use pipelines that call CrateDB. -## 5. Use Cases +## Use Cases | Use Case | Description | | ----------------------- | ------------------------------------------------------------------ | @@ -112,17 +87,17 @@ ORDER BY features <-> [vector] ASC LIMIT 10; ``` -## 6. Performance & Scaling +## Performance & Scaling * Vector search uses **HNSW**: state-of-the-art ANN algorithm with logarithmic search complexity. * CrateDB parallelizes ANN search across shards/nodes. * Ideal for 100K to tens of millions of vectors; supports real-time ingestion and queries. :::{note} -Note: vector dimensionality must be consistent for each column. +vector dimensionality must be consistent for each column. ::: -## 7. Best Practices +## Best Practices | Area | Recommendation | | -------------- | ----------------------------------------------------------------------- | @@ -133,19 +108,19 @@ Note: vector dimensionality must be consistent for each column. 
| Updates | Re-inserting or updating vectors is fully supported | | Data pipelines | Use external tools for vector generation; push to CrateDB via REST/SQL | -## 8. Integrations +## Integrations * **Python / pandas / LangChain**: CrateDB has native drivers and REST interface * **Embedding models**: Use OpenAI, HuggingFace, Cohere, or in-house models * **RAG architecture**: CrateDB stores vector + metadata + raw text in a unified store -## 9. Further Learning & Resources +## Further Learning & Resources * CrateDB Docs – Vector Search * Blog: Using CrateDB for Hybrid Search (Vector + Full-Text) * CrateDB Academy – Vector Data * [Sample notebooks on GitHub](https://github.com/crate/cratedb-examples) -## 10. Summary +## Summary CrateDB gives you the power of **vector similarity search** with the **flexibility of SQL** and the **scalability of a distributed database**. It lets you unify structured, unstructured, and semantic data—enabling modern applications in AI, search, and recommendation without additional vector databases or pipelines. From 33ecc9424fbc19e94b90535663c805be4605c39f Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sat, 23 Aug 2025 23:29:24 +0200 Subject: [PATCH 09/58] Layout: Improve responsiveness on pages using cards heavily --- docs/index.md | 10 +++++----- docs/start/index.md | 6 ++++++ 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/docs/index.md b/docs/index.md index 596306ce..8aa577a0 100644 --- a/docs/index.md +++ b/docs/index.md @@ -9,7 +9,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. -::::{grid} 1 2 2 2 +::::{grid} 4 :padding: 0 @@ -17,7 +17,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: getting-started :link-type: ref :link-alt: Getting started with CrateDB -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 @@ -31,7 +31,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: install :link-type: ref :link-alt: Installing CrateDB -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 @@ -45,7 +45,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: administration :link-type: ref :link-alt: CrateDB Administration -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 @@ -59,7 +59,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: performance :link-type: ref :link-alt: CrateDB Performance Guides -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 diff --git a/docs/start/index.md b/docs/start/index.md index 520f45e0..628bfe64 100644 --- a/docs/start/index.md +++ b/docs/start/index.md @@ -18,6 +18,7 @@ and explore key features. :link: first-steps :link-type: ref :link-alt: First steps with CrateDB +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -31,6 +32,7 @@ and explore key features. :link: connect :link-type: ref :link-alt: Connect to CrateDB +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -44,6 +46,7 @@ and explore key features. :link: query-capabilities :link-type: ref :link-alt: Query Capabilities +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -57,6 +60,7 @@ and explore key features. 
:link: ingest :link-type: ref :link-alt: Ingesting Data +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -78,6 +82,7 @@ and explore key features. :link: example-applications :link-type: ref :link-alt: Sample Applications +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -91,6 +96,7 @@ and explore key features. :link: start-going-further :link-type: ref :link-alt: Going Further +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 From b5ad8b4f8ae99503af27d321415094c3f29625e8 Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sat, 23 Aug 2025 23:29:49 +0200 Subject: [PATCH 10/58] Data modelling: Populate index page --- docs/start/modelling/fulltext.md | 1 + docs/start/modelling/geospatial.md | 1 + docs/start/modelling/index.md | 113 +++++++++++++++++++++++++++- docs/start/modelling/json.md | 1 + docs/start/modelling/primary-key.md | 1 + docs/start/modelling/relational.md | 1 + docs/start/modelling/timeseries.md | 1 + docs/start/modelling/vector.md | 1 + 8 files changed, 118 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index 43f754b9..d94c3921 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -1,3 +1,4 @@ +(model-fulltext)= # Full-text data CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 7cc8c012..e40ef3e4 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -1,3 +1,4 @@ +(model-geospatial)= # Geospatial data CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze 2D location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. diff --git a/docs/start/modelling/index.md b/docs/start/modelling/index.md index a198efe1..58bce3b8 100644 --- a/docs/start/modelling/index.md +++ b/docs/start/modelling/index.md @@ -1,8 +1,99 @@ +(modelling)= +(data-modelling)= # Data modelling +:::{div} sd-text-muted CrateDB provides a unified storage engine that supports different data types. 
+::: + +:::::{grid} 2 3 3 3 +:padding: 0 +:class-container: installation-grid + +::::{grid-item-card} Relational data +:link: model-relational +:link-type: ref +:link-alt: Relational data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`table-list` +:::: + +::::{grid-item-card} JSON data +:link: model-json +:link-type: ref +:link-alt: JSON data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`file-lines` +:::: + +::::{grid-item-card} Timeseries data +:link: model-timeseries +:link-type: ref +:link-alt: Timeseries data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`timeline` +:::: + +::::{grid-item-card} Geospatial data +:link: model-geospatial +:link-type: ref +:link-alt: Geospatial data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`globe` +:::: + +::::{grid-item-card} Fulltext data +:link: model-fulltext +:link-type: ref +:link-alt: Fulltext data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`font` +:::: + +::::{grid-item-card} Vector data +:link: model-vector +:link-type: ref +:link-alt: Vector data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`lightbulb` +:::: + +::::: + + ```{toctree} :maxdepth: 1 +:hidden: relational json @@ -12,10 +103,28 @@ fulltext vector ``` -Because CrateDB is a distributed OLAP database designed store large volumes -of data, it needs a few special considerations on certain details. +:::{rubric} Implementation notes +::: + +Because CrateDB is a distributed analytical database (OLAP) designed to store +large volumes of data, users need to consider certain details compared to +traditional RDBMS. + + +:::{card} Primary key strategies +:link: model-primary-key +:link-type: ref +CrateDB is built for horizontal scalability and high ingestion throughput. ++++ +To achieve this, operations must complete independently on each node—without +central coordination. This design choice means CrateDB does not support +traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL +or MySQL by default. +::: + ```{toctree} :maxdepth: 1 +:hidden: primary-key ``` diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index de71eb27..fc0fda8b 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -1,3 +1,4 @@ +(model-json)= # JSON data CrateDB combines the flexibility of NoSQL document stores with the power of SQL. It enables you to store, query, and index **semi-structured JSON data** using **standard SQL**, making it an excellent choice for applications that handle diverse or evolving schemas. diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md index a181e930..fbc8e756 100644 --- a/docs/start/modelling/primary-key.md +++ b/docs/start/modelling/primary-key.md @@ -1,3 +1,4 @@ +(model-primary-key)= # Primary key strategies CrateDB is built for horizontal scalability and high ingestion throughput. To achieve this, operations must complete independently on each node—without central coordination. This design choice means CrateDB does **not** support traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL or MySQL by default. 
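For illustration, a minimal sketch of the most direct `SERIAL` replacement, a server-generated UUID default; the `orders` table is hypothetical, and the strategies covered below go into the alternatives in more detail:

```sql
-- Server-side key generation without central coordination
CREATE TABLE orders (
    id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
    total DOUBLE PRECISION
);
```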
diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index bd982ae8..8f9e90eb 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -1,3 +1,4 @@ +(model-relational)= # Relational data CrateDB is a **distributed SQL database** that offers rich **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary keys,** **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector search, and semi-structured data. diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index f9703ca6..71be6067 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -1,3 +1,4 @@ +(model-timeseries)= # Time series data CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md index 5ac191a9..083e1284 100644 --- a/docs/start/modelling/vector.md +++ b/docs/start/modelling/vector.md @@ -1,3 +1,4 @@ +(model-vector)= # Vector data CrateDB natively supports **vector embeddings** for efficient **similarity search** using **approximate nearest neighbor (ANN)** algorithms. This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics—all in the simplicity of SQL. From 036ebcf3057a2edc2697ba57834506458ec3cd5a Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sun, 24 Aug 2025 01:59:38 +0200 Subject: [PATCH 11/58] Data modelling: Relocate original page about primary keys and sequences --- docs/performance/inserts/index.rst | 1 - docs/performance/inserts/sequences.rst | 205 -------------------- docs/start/modelling/primary-key.md | 249 +++++++++++++++---------- 3 files changed, 154 insertions(+), 301 deletions(-) delete mode 100644 docs/performance/inserts/sequences.rst diff --git a/docs/performance/inserts/index.rst b/docs/performance/inserts/index.rst index 6934363e..e11462ac 100644 --- a/docs/performance/inserts/index.rst +++ b/docs/performance/inserts/index.rst @@ -30,6 +30,5 @@ This section of the guide will show you how. parallel tuning testing - sequences .. _Abstract Syntax Tree: https://en.wikipedia.org/wiki/Abstract_syntax_tree diff --git a/docs/performance/inserts/sequences.rst b/docs/performance/inserts/sequences.rst deleted file mode 100644 index d381d931..00000000 --- a/docs/performance/inserts/sequences.rst +++ /dev/null @@ -1,205 +0,0 @@ -.. _autogenerated_sequences_performance: - -########################################################### - Autogenerated sequences and PRIMARY KEY values in CrateDB -########################################################### - -As you begin working with CrateDB, you might be puzzled why CrateDB does not -have a built-in, auto-incrementing "serial" data type as PostgreSQL or MySQL. - -As a distributed database, designed to scale horizontally, CrateDB needs as many -operations as possible to complete independently on each node without any -coordination between nodes. - -Maintaining a global auto-increment value requires that a node checks with other -nodes before allocating a new value. This bottleneck would be hindering our -ability to achieve `extremely fast ingestion speeds`_. 
- -That said, there are many alternatives available and we can also implement true -consistent/synchronized sequences if we want to. - -************************************ - Using a timestamp as a primary key -************************************ - -This option involves declaring a column as follows: - -.. code:: psql - - BIGINT DEFAULT now() PRIMARY KEY - -:Pros: - Always increasing number - ideal if we need to timestamp records creation - anyway - -:Cons: - gaps between the numbers, not suitable if we may have more than one record on - the same millisecond - -************* - Using UUIDs -************* - -This option involves declaring a column as follows: - -.. code:: psql - - TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY - -:Pros: - Globally unique, no risk of conflicts if merging things from different - tables/environments - -:Cons: - No order guarantee. Not as human-friendly as numbers. String format may not - be applicable to cover all scenarios. Range queries are not possible. - -************************ - Use UUIDv7 identifiers -************************ - -`Version 7 UUIDs`_ are a relatively new kind of UUIDs which feature a -time-ordered value. We can use these in CrateDB with an UDF_ with the code from -`UUIDv7 in N languages`_. - -:Pros: - Same as `gen_random_text_uuid` above but almost sequential, which enables - range queries. - -:Cons: - not as human-friendly as numbers and slight performance impact from UDF use - -********************************* - Use IDs from an external system -********************************* - -In cases where data is imported into CrateDB from external systems that employ -identifier governance, CrateDB does not need to generate any identifier values -and primary key values can be inserted as-is from the source system. - -See `Replicating data from other databases to CrateDB with Debezium and Kafka`_ -for an example. - -********************* - Implement sequences -********************* - -This approach involves a table to keep the latest values that have been consumed -and client side code to keep it up-to-date in a way that guarantees unique -values even when many ingestion processes run in parallel. - -:Pros: - Can have any arbitrary type of sequences, (we may for instance want to - increment values by 10 instead of 1 - prefix values with a year number - - combine numbers and letters - etc) - -:Cons: - Need logic for the optimistic update implemented client-side, the sequences - table becomes a bottleneck so not suitable for high-velocity ingestion - scenarios - -We will first create a table to keep the latest values for our sequences: - -.. code:: psql - - CREATE TABLE sequences ( - name TEXT PRIMARY KEY, - last_value BIGINT - ) CLUSTERED INTO 1 SHARDS; - -We will then initialize it with one new sequence at 0: - -.. code:: psql - - INSERT INTO sequences (name,last_value) - VALUES ('mysequence',0); - -And we are going to do an example with a new table defined as follows: - -.. code:: psql - - CREATE TABLE mytable ( - id BIGINT PRIMARY KEY, - field1 TEXT - ); - -The Python code below reads the last value used from the sequences table, and -then attempts an `optimistic UPDATE`_ with a ``RETURNING`` clause, if a -contending process already consumed the identity nothing will be returned so our -process will retry until a value is returned, then it uses that value as the new -ID for the record we are inserting into the ``mytable`` table. - -.. 
code:: python - - # /// script - # requires-python = ">=3.8" - # dependencies = [ - # "records", - # "sqlalchemy-cratedb", - # ] - # /// - - import time - - import records - - db = records.Database("crate://") - sequence_name = "mysequence" - - max_retries = 5 - base_delay = 0.1 # 100 milliseconds - - for attempt in range(max_retries): - select_query = """ - SELECT last_value, - _seq_no, - _primary_term - FROM sequences - WHERE name = :sequence_name; - """ - row = db.query(select_query, sequence_name=sequence_name).first() - new_value = row.last_value + 1 - - update_query = """ - UPDATE sequences - SET last_value = :new_value - WHERE name = :sequence_name - AND _seq_no = :seq_no - AND _primary_term = :primary_term - RETURNING last_value; - """ - if ( - str( - db.query( - update_query, - new_value=new_value, - sequence_name=sequence_name, - seq_no=row._seq_no, - primary_term=row._primary_term, - ).all() - ) - != "[]" - ): - break - - delay = base_delay * (2**attempt) - print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...") - time.sleep(delay) - else: - raise Exception(f"Failed after {max_retries} retries with exponential backoff") - - insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)" - db.query(insert_query, id=new_value, field1="abc") - db.close() - -.. _extremely fast ingestion speeds: https://cratedb.com/blog/how-we-scaled-ingestion-to-one-million-rows-per-second - -.. _optimistic update: https://cratedb.com/docs/crate/reference/en/latest/general/occ.html#optimistic-update - -.. _replicating data from other databases to cratedb with debezium and kafka: https://cratedb.com/blog/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka - -.. _udf: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html - -.. _uuidv7 in n languages: https://github.com/nalgeon/uuidv7/blob/main/src/uuidv7.cratedb - -.. _version 7 uuids: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-7 diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md index fbc8e756..744224a7 100644 --- a/docs/start/modelling/primary-key.md +++ b/docs/start/modelling/primary-key.md @@ -1,112 +1,168 @@ (model-primary-key)= -# Primary key strategies - -CrateDB is built for horizontal scalability and high ingestion throughput. To achieve this, operations must complete independently on each node—without central coordination. This design choice means CrateDB does **not** support traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL or MySQL by default. - -This page explains why that is and walks you through **five common alternatives** to generate unique primary key values in CrateDB, including a recipe to implement your own auto-incrementing sequence mechanism when needed. - -## Why Auto-Increment Doesn't Exist in CrateDB - -In traditional RDBMS systems, auto-increment fields rely on a central counter. In a distributed system like CrateDB, this would create a **global coordination bottleneck**, limiting insert throughput and reducing scalability. - -Instead, CrateDB provides **flexibility**: you can choose a primary key strategy tailored to your use case, whether for strict uniqueness, time ordering, or external system integration. - -## Primary Key Strategies in CrateDB - -### 1. 
Use a Timestamp as a Primary Key - -```sql +(autogenerated-sequences)= +# Primary key strategies and autogenerated sequences + +:::{rubric} Introduction +::: + +As you begin working with CrateDB, you might be puzzled why CrateDB does not +have a built-in, auto-incrementing "serial" data type, like PostgreSQL or MySQL. + +This page explains why that is and walks you through **five common alternatives** +to generate unique primary key values in CrateDB, including a recipe to implement +your own auto-incrementing sequence mechanism when needed. + +:::{rubric} Why auto-increment sequences don't exist in CrateDB +::: +In traditional RDBMS systems, auto-increment fields rely on a central counter. +In a distributed system like CrateDB, maintaining a global auto-increment value +would require that a node checks with other nodes before allocating a new value. +This would create a **global coordination bottleneck**, limit insert throughput, +and reduce scalability. + +CrateDB is designed for horizontal scalability and [high ingestion throughput]. +To achieve this, operations must complete independently on each node—without +central coordination. This design choice means CrateDB does **not** support +traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL +or MySQL by default. + +:::{rubric} Solutions +::: +CrateDB provides flexibility: You can choose a primary key strategy +tailored to your use case, whether for strict uniqueness, time ordering, or +external system integration. You can also implement true consistent/synchronized +sequences if you want to. + +## Using a timestamp as a primary key + +This option involves declaring a column using `DEFAULT now()`. +```psql BIGINT DEFAULT now() PRIMARY KEY ``` -**Pros** - -* Auto-generated, always-increasing value -* Useful when records are timestamped anyway +:Pros: + - Auto-generated, always-increasing value + - Useful when records are timestamped anyway -**Cons** +:Cons: + - Can result in gaps + - Collisions possible if multiple records are created in the same millisecond -* Can result in gaps -* Collisions possible if multiple records are created in the same millisecond +## Using UUIDv4 identifiers -### 2. Use UUIDs (v4) - -```sql +This option involves declaring a column using `DEFAULT gen_random_text_uuid()`. +```psql TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY ``` -**Pros** - -* Universally unique -* No conflicts when merging from multiple environments or sources +:Pros: + - Universally unique + - No conflicts when merging from multiple environments or sources -**Cons** +:Cons: + - Not ordered + - Harder to read/debug + - No efficient range queries -* Not ordered -* Harder to read/debug -* No efficient range queries +## Using UUIDv7 identifiers -### Use UUIDv7 for Time-Ordered IDs +[UUIDv7] is a new format that preserves **temporal ordering**, making UUIDs +better suited for inserts and range queries in distributed databases. -UUIDv7 is a new format that preserves **temporal ordering**, making them better suited for distributed inserts and range queries. +We can use these in CrateDB with an UDF with the code from [UUIDv7 in N languages]. -You can use UUIDv7 in CrateDB via a **User-Defined Function (UDF)**, based on your preferred language. +You can use [UUIDv7 for CrateDB] via a {ref}`User-Defined Function (UDF) ` +in JavaScript, or in your preferred programming language by using one of the +available UUIDv7 libraries. 
-**Pros** +:Pros: + - Globally unique and **almost sequential** + - Efficient range queries possible -* Globally unique and **almost sequential** -* Range queries possible +:Cons: + - Not as human-friendly as integer numbers + - Slight overhead due to UDF use -**Cons** +## Using IDs from external systems -* Not human-friendly -* Slight overhead due to UDF use +If you are importing data from a source system that **already generates unique +IDs**, you can reuse those by inserting primary key values as-is from the +source system. -### 4. Use External System IDs +In this case, CrateDB does not need to generate any identifier values, +and consistency is ensured across systems. -If you're ingesting data from a source system that **already generates unique IDs**, you can reuse those: +:::{seealso} +An example for that is [Replicating data from other databases to CrateDB with Debezium and Kafka]. +::: -* No need for CrateDB to generate anything -* Ensures consistency across systems +## Implementing a custom sequence table -> See Replicating data from other databases to CrateDB with Debezium and Kafka for an example. +If you **must** have an auto-incrementing numeric ID (e.g., for compatibility +or legacy reasons), you can implement a simple sequence generator using a +dedicated table and client-side logic. -### 5. Implement a Custom Sequence Table +This approach involves a table to keep the latest values that have been consumed +and client side code to keep it up-to-date in a way that guarantees unique +values even when many ingestion processes run in parallel. -If you **must** have an auto-incrementing numeric ID (e.g., for compatibility or legacy reasons), you can implement a simple sequence generator using a dedicated table and client-side logic. +:Pros: + - Fully customizable (you can add prefixes, adjust increment size, etc.) + - Sequential IDs possible -**Step 1: Create a sequence tracking table** +:Cons: + - Additional client logic about optimistic updates is required for writing + - The sequence table may become a bottleneck at very high ingestion rates -```sql +### Step 1: Create a sequence tracking table +Create a table to keep the latest values for the sequences. +```psql CREATE TABLE sequences ( - name TEXT PRIMARY KEY, - last_value BIGINT + name TEXT PRIMARY KEY, + last_value BIGINT ) CLUSTERED INTO 1 SHARDS; ``` -**Step 2: Initialize your sequence** - -```sql -INSERT INTO sequences (name, last_value) -VALUES ('mysequence', 0); +### Step 2: Initialize your sequence +Initialize the table with one new sequence at 0. +```psql +INSERT INTO sequences (name,last_value) +VALUES ('mysequence',0); ``` -**Step 3: Create a target table** - -```sql +### Step 3: Create a target table +Start an example with a newly defined table. +```psql CREATE TABLE mytable ( - id BIGINT PRIMARY KEY, - field1 TEXT + id BIGINT PRIMARY KEY, + field1 TEXT ); ``` -**Step 4: Generate and use sequence values in Python** +### Step 4: Generate and use sequence values in Python + +Use optimistic concurrency control to generate unique, incrementing values +even in parallel ingestion scenarios. 
-Use optimistic concurrency control to generate unique, incrementing values even in parallel ingestion scenarios: +The Python code below reads the last value used from the sequences table, and +then attempts an [optimistic UPDATE] with a `RETURNING` clause, if a +contending process already consumed the identity nothing will be returned so our +process will retry until a value is returned, then it uses that value as the new +ID for the record we are inserting into the `mytable` table. ```python # Requires: records, sqlalchemy-cratedb +# +# /// script +# requires-python = ">=3.8" +# dependencies = [ +# "records", +# "sqlalchemy-cratedb", +# ] +# /// + import time + import records db = records.Database("crate://") @@ -117,7 +173,9 @@ base_delay = 0.1 # 100 milliseconds for attempt in range(max_retries): select_query = """ - SELECT last_value, _seq_no, _primary_term + SELECT last_value, + _seq_no, + _primary_term FROM sequences WHERE name = :sequence_name; """ @@ -128,48 +186,49 @@ for attempt in range(max_retries): UPDATE sequences SET last_value = :new_value WHERE name = :sequence_name - AND _seq_no = :seq_no - AND _primary_term = :primary_term + AND _seq_no = :seq_no + AND _primary_term = :primary_term RETURNING last_value; """ - result = db.query( - update_query, - new_value=new_value, - sequence_name=sequence_name, - seq_no=row._seq_no, - primary_term=row._primary_term - ).all() - - if result: + if ( + str( + db.query( + update_query, + new_value=new_value, + sequence_name=sequence_name, + seq_no=row._seq_no, + primary_term=row._primary_term, + ).all() + ) + != "[]" + ): break delay = base_delay * (2**attempt) print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...") time.sleep(delay) else: - raise Exception("Failed to acquire sequence after multiple retries.") + raise Exception(f"Failed after {max_retries} retries with exponential backoff") insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)" db.query(insert_query, id=new_value, field1="abc") db.close() ``` -**Pros** - -* Fully customizable (you can add prefixes, adjust increment size, etc.) 
-* Sequential IDs possible - -**Cons** - -* More complex client logic required -* The sequence table may become a bottleneck at very high ingestion rates - ## Summary -| Strategy | Ordered | Unique | Scalable | Human-Friendly | Range Queries | Notes | -| ------------------- | ------- | ------ | -------- | -------------- | ------------- | -------------------- | +| Strategy | Ordered | Unique | Scalable | Human-friendly | Range queries | Notes | +|---------------------|----------| ------ | -------- |----------------|---------------| -------------------- | | Timestamp | ✅ | ⚠️ | ✅ | ✅ | ✅ | Potential collisions | -| UUID (v4) | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs | +| UUIDv4 | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs | | UUIDv7 | ✅ | ✅ | ✅ | ❌ | ✅ | Requires UDF | -| External System IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source | -| Sequence Table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic | +| External system IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source | +| Sequence table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic | + + +[high ingestion throughput]: https://cratedb.com/blog/how-we-scaled-ingestion-to-one-million-rows-per-second +[optimistic update]: https://cratedb.com/docs/crate/reference/en/latest/general/occ.html#optimistic-update +[replicating data from other databases to cratedb with debezium and kafka]: https://cratedb.com/blog/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka +[udf]: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html +[UUIDv7]: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-7 +[UUIDv7 for CrateDB]: https://github.com/nalgeon/uuidv7/blob/main/src/uuidv7.cratedb From 488e43cbe2adb0cdb6404dd56b4e2619002158fc Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Thu, 28 Aug 2025 21:03:31 +0200 Subject: [PATCH 12/58] Move connect page to install --- docs/{start => connect}/connect.md | 0 docs/connect/index.md | 3 ++- docs/start/index.md | 3 +-- 3 files changed, 3 insertions(+), 3 deletions(-) rename docs/{start => connect}/connect.md (100%) diff --git a/docs/start/connect.md b/docs/connect/connect.md similarity index 100% rename from docs/start/connect.md rename to docs/connect/connect.md diff --git a/docs/connect/index.md b/docs/connect/index.md index 58e4e24b..130004d0 100644 --- a/docs/connect/index.md +++ b/docs/connect/index.md @@ -109,6 +109,7 @@ Database driver connection examples. :hidden: configure +connect CLI programs ide Drivers @@ -137,7 +138,7 @@ ruby [CrateDB PostgreSQL interface]: inv:crate-reference:*:label#interface-postgresql [HTTP interface]: inv:crate-reference:*:label#interface-http [HTTP protocol]: https://en.wikipedia.org/wiki/HTTP -[JDBC]: https://en.wikipedia.org/wiki/Java_Database_Connectivity +[JDBC]: https://en.wikipedia.org/wiki/Java_Database_Connectivity [ODBC]: https://en.wikipedia.org/wiki/Open_Database_Connectivity [PostgreSQL interface]: inv:crate-reference:*:label#interface-postgresql [PostgreSQL wire protocol]: https://www.postgresql.org/docs/current/protocol.html diff --git a/docs/start/index.md b/docs/start/index.md index 628bfe64..b888cf5b 100644 --- a/docs/start/index.md +++ b/docs/start/index.md @@ -114,9 +114,8 @@ and explore key features. 
:hidden: first-steps -connect -query/index modelling/index +query/index Ingesting data <../ingest/index> application/index going-further From 8e8be1c8719ce26bdb69861e3f8618bbacb3ab57 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Thu, 28 Aug 2025 21:37:10 +0200 Subject: [PATCH 13/58] Updated FTS in modelling --- docs/start/modelling/fulltext.md | 126 ++++++++++++++++--------------- 1 file changed, 66 insertions(+), 60 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index d94c3921..aad69b5c 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -1,12 +1,21 @@ (model-fulltext)= # Full-text data -CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. +CrateDB offers **native full-text search** powered by **Apache Lucene** and Okapi +BM25 ranking, accessible via SQL for easy modeling and querying of large-scale +textual data. It supports fuzzy matching, multi-language analysis, and composite +indexing, while fully integrating with data types such as JSON, time-series, +geospatial, vectors, and more for comprehensive multi-model queries. Whether you +need document search, catalog lookup, or content analytics, CrateDB is an ideal +solution. -## Data Types & Indexing Strategy +## Data Types & Indexing -* By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries -* To enable full‑text search, you must define a **FULLTEXT index** with an optional language **analyzer**, e.g.: +By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient +for equality search but not suitable for full-text queries. + +To use full-text search, add a FULLTEXT index with an optional analyzer to the +text columns you want to search: ```sql CREATE TABLE documents ( @@ -16,22 +25,33 @@ CREATE TABLE documents ( ); ``` -* You may also define **composite full-text indices**, indexing multiple columns at once: +You can also index multiple columns with **composite full-text indices**: ```sql INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english'); ``` -## Index Design & Custom Analyzers +For detailed options, check out the [Reference Manual](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/fulltext-indices.html). + +## Analyzers + +An analyzer splits text into searchable terms and consists of the following components: + +* **Tokenizer -** splits on whitespace/characters +* **Token Filters -** e.g. lowercase, stemming, stop‑word removal +* **Char Filters -** pre-processing (e.g. stripping HTML). -| Component | Purpose | -| ----------------- | ---------------------------------------------------------------------------- | -| **Analyzer** | Tokenizer + token filters + char filters; splits text into searchable terms. | -| **Tokenizer** | Splits on whitespace/characters. | -| **Token Filters** | e.g. lowercase, stemming, stop‑word removal. | -| **Char Filters** | Pre-processing (e.g. stripping HTML). | +CrateDB offers about 50 [**built-in analyzers**](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-analyzers) supporting more than 30 [languages](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#language). 
+ +You can **extend** a built-in analyzer: + +```sql +CREATE ANALYZER german_snowball + EXTENDS snowball + WITH (language = 'german'); +``` -CrateDB offers **built-in analyzers** for many languages (e.g. English, German, French). You can also **create custom analyzers**: +or create your own **custom** analyzer : ```sql CREATE ANALYZER myanalyzer ( @@ -41,17 +61,14 @@ CREATE ANALYZER myanalyzer ( ); ``` -Or **extend** a built-in analyzer: +Learn more about the [builtin analyzers](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-analyzers) and how to [define your own](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/fulltext-indices.html#creating-a-custom-analyzer) with custom [tokenizers](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-tokenizers) and [token filters.](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-token-filters) -```sql -CREATE ANALYZER german_snowball - EXTENDS snowball - WITH (language = 'german'); -``` ## Querying: MATCH Predicate & Scoring -CrateDB uses the SQL `MATCH` predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score `_score`, ranked via BM25. +CrateDB uses the SQL `MATCH` predicate to run full‑text queries against +full‑text indices. It optionally returns a relevance score `_score`, ranked via +BM25. **Basic usage:** @@ -78,24 +95,25 @@ MATCH((ft_title boost 2.0, ft_body), 'keyword') **Example: Fuzzy Search** ```sql -SELECT firstname, lastname, _score -FROM person -WHERE MATCH(lastname_ft, 'bronw') USING best_fields WITH (fuzziness = 2) +SELECT title, _score +FROM documents +WHERE MATCH(ft_body, 'Jamse') USING best_fields WITH (fuzziness = 2) ORDER BY _score DESC; ``` -This matches similar names like ‘brown’ or ‘browne’. +This matches similar words like ‘James’. **Example: Multi‑language Composite Search** ```sql CREATE TABLE documents ( - name STRING PRIMARY KEY, - description TEXT, - INDEX ft_en USING FULLTEXT(description) WITH (analyzer = 'english'), - INDEX ft_de USING FULLTEXT(description) WITH (analyzer = 'german') + title TEXT, + body TEXT, + INDEX ft_en USING FULLTEXT(body) WITH (analyzer = 'english'), + INDEX ft_de USING FULLTEXT(body) WITH (analyzer = 'german') ); -SELECT name, _score + +SELECT title, _score FROM documents WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzziness = 1) ORDER BY _score DESC; @@ -103,49 +121,37 @@ ORDER BY _score DESC; ## Use Cases & Integration -CrateDB is ideal for searching **semi-structured large text data**—product catalogs, article archives, user-generated content, descriptions and logs. +CrateDB is ideal for searching **semi-structured large text data**—product +catalogs, article archives, user-generated content, descriptions and logs. -Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure. +Because full-text indices are updated in real-time, search results reflect newly +ingested data almost instantly. This tight integration avoids the complexity of +maintaining separate search infrastructure. 
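For instance, a freshly inserted row becomes searchable almost immediately. A
minimal sketch, assuming the `documents` table and `ft_body` index defined
above; the explicit `REFRESH TABLE` only forces visibility right away, otherwise
the table's periodic refresh (one second by default) applies:

```sql
INSERT INTO documents (title, body)
VALUES ('Release notes', 'The quick brown fox jumps over the lazy dog');

-- Make the new row visible to the following query without waiting
-- for the automatic refresh.
REFRESH TABLE documents;

SELECT title, _score
FROM documents
WHERE MATCH(ft_body, 'fox')
ORDER BY _score DESC;
```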
You can **combine full-text search with other data domains**, for example: ```sql SELECT * FROM listings -WHERE +WHERE MATCH(ft_desc, 'garden deck') AND price < 500000 AND within(location, :polygon); ``` -This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one. - -## Architectural Strengths - -* **Built on Lucene inverted index + BM25**, offering relevance ranking comparable to search engines. -* **Scale horizontally across clusters**, while maintaining fast indexing and search even on high volume datasets. -* **Integrated SQL interface**: eliminates need for separate search services like Elasticsearch or Solr. - -## Best Practices Checklist - -| Topic | Recommendation | -| ------------------- | ---------------------------------------------------------------------------------- | -| Schema & Indexing | Define full-text indices at table creation; plain indices are insufficient. | -| Language Support | Pick built-in analyzer matching your content language. | -| Composite Search | Use multi-column indices to search across title/body/fields. | -| Query Tuning | Configure fuzziness, operator, boost, and slop options. | -| Scoring & Ranking | Use `_score` and ordering to sort by relevance. | -| Real-time Updates | Full-text indices update automatically on INSERT/UPDATE. | -| Multi-model Queries | Combine full-text search with geo, JSON, numerical filters. | -| Analyze Limitations | Understand phrase\_prefix caveats at scale; tune analyzer/tokenizer appropriately. | +This blend lets you query by text relevance, numeric filters, and spatial +constraints, all in one. ## Further Learning & Resources -* **CrateDB Full‑Text Search Guide**: details index creation, analyzers, MATCH usage. -* **FTS Options & Advanced Features**: fuzziness, synonyms, multi-language idioms. -* **Hands‑On Academy Course**: explore FTS on real datasets (e.g. Chicago neighborhoods). -* **CrateDB Community Insights**: real‑world advice and experiences from users. - -## **Summary** - -CrateDB combines powerful Lucene‑based full‑text search capabilities with SQL, making it easy to model and query textual data at scale. It supports fuzzy matching, multi-language analysis, composite indexing, and integrates fully with other data types for rich, multi-model queries. Whether you're building document search, catalog lookup, or content analytics—CrateDB offers a flexible and scalable foundation.\ +* [**Full-text Search**](../../feature/search/fts/index.md): In-depth walkthrough of full-text search capabilities. +* Reference Manual: + * [Full-text indices]: Defining indices, extending builtin analyzers, custom analyzers. + * [Full-text analyzers]: Builtin analyzers, tokenizers, token and char filters. + * [SQL MATCH predicate]: Details about MATCH predicate arguments and options. +* [**Hands‑On Academy Course**](https://learn.cratedb.com/cratedb-fundamentals?lesson=fulltext-search): explore FTS on real datasets (e.g. Chicago neighborhoods). + +[Full-text search]: project:#fulltext-search +[Full-text indices]: inv:crate-reference:*:label#fulltext-indices +[Full-text analyzers]: inv:crate-reference:*:label#sql-analyzer +[SQL MATCH predicate]: inv:crate-reference:*:label#sql_dql_fulltext_search From 68591e9abf2e18186bf10c2060a7b026c2487e0c Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 09:49:48 +0200 Subject: [PATCH 14/58] Updated datamodel json from Gitbook. 
--- docs/start/modelling/json.md | 42 +++++++----------------------------- 1 file changed, 8 insertions(+), 34 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index fc0fda8b..fe378585 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -73,9 +73,9 @@ FROM events WHERE payload['device']['os'] = 'Android'; ``` -:::{note} +```{note} Dot-notation works for both explicitly and dynamically added fields. -::: +``` ## Querying DYNAMIC OBJECTs @@ -178,9 +178,9 @@ To exclude fields from indexing, set: data['some_field'] INDEX OFF ``` -:::{note} +```{note} Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. -::: +``` ## Aggregating JSON Fields @@ -194,35 +194,9 @@ WHERE payload['location'] = 'room1'; CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. -## Use Cases for JSON Modeling - -| Use Case | Description | -| ------------------ | -------------------------------------------- | -| Logs & Traces | Unstructured payloads with flexible metadata | -| Sensor & IoT Data | Variable field schemas, nested measurements | -| Product Catalogs | Specs, tags, reviews in varying formats | -| User Profiles | Custom settings, device info, preferences | -| Telemetry / Events | Event streams with evolving structure | - -## Best Practices - -| Area | Recommendation | -| ---------------- | -------------------------------------------------------------------- | -| Schema Evolution | Use `DYNAMIC` for flexibility, `STRICT` for control | -| Index Management | Avoid over-indexing rarely used fields | -| Nested Depth | Prefer flat structures or shallow nesting for performance | -| Column Mixing | Combine structured columns with JSON for hybrid models | -| Observability | Monitor number of dynamic columns using `information_schema.columns` | - ## Further Learning & Resources -* CrateDB Docs – Object Columns -* Working with JSON in CrateDB -* CrateDB Academy – Modeling with JSON -* Understanding Column Policies - -## Summary - -CrateDB makes it easy to model **semi-structured JSON data** with full SQL support. Whether you're building a telemetry pipeline, an event store, or a product catalog, CrateDB offers the flexibility of a document store—while preserving the structure, indexing, and power of a relational engine. - -You don’t need to choose between JSON and SQL—**CrateDB gives you both.** +* Reference Manual: + * [Objects](inv:crate-reference:*:label#data-types-objects) and [Object Column policy](inv:crate-reference:*:label#data-types-objects) + * [Inserting objects as JSON](inv:crate-reference:*:label#data-types-object-json) + * [json type](inv:crate-reference:*:label#column_policy) From bebb18392d0ab5a3c7198afd16a3875eef48383d Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 09:49:48 +0200 Subject: [PATCH 15/58] Updated datamodel json from Gitbook. --- docs/start/modelling/json.md | 42 +++++++----------------------------- 1 file changed, 8 insertions(+), 34 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index fc0fda8b..881d03ae 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -73,9 +73,9 @@ FROM events WHERE payload['device']['os'] = 'Android'; ``` -:::{note} +```{note} Dot-notation works for both explicitly and dynamically added fields. 
-::: +``` ## Querying DYNAMIC OBJECTs @@ -178,9 +178,9 @@ To exclude fields from indexing, set: data['some_field'] INDEX OFF ``` -:::{note} +```{note} Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. -::: +``` ## Aggregating JSON Fields @@ -194,35 +194,9 @@ WHERE payload['location'] = 'room1'; CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. -## Use Cases for JSON Modeling - -| Use Case | Description | -| ------------------ | -------------------------------------------- | -| Logs & Traces | Unstructured payloads with flexible metadata | -| Sensor & IoT Data | Variable field schemas, nested measurements | -| Product Catalogs | Specs, tags, reviews in varying formats | -| User Profiles | Custom settings, device info, preferences | -| Telemetry / Events | Event streams with evolving structure | - -## Best Practices - -| Area | Recommendation | -| ---------------- | -------------------------------------------------------------------- | -| Schema Evolution | Use `DYNAMIC` for flexibility, `STRICT` for control | -| Index Management | Avoid over-indexing rarely used fields | -| Nested Depth | Prefer flat structures or shallow nesting for performance | -| Column Mixing | Combine structured columns with JSON for hybrid models | -| Observability | Monitor number of dynamic columns using `information_schema.columns` | - ## Further Learning & Resources -* CrateDB Docs – Object Columns -* Working with JSON in CrateDB -* CrateDB Academy – Modeling with JSON -* Understanding Column Policies - -## Summary - -CrateDB makes it easy to model **semi-structured JSON data** with full SQL support. Whether you're building a telemetry pipeline, an event store, or a product catalog, CrateDB offers the flexibility of a document store—while preserving the structure, indexing, and power of a relational engine. - -You don’t need to choose between JSON and SQL—**CrateDB gives you both.** +* Reference Manual: + * [Objects](inv:crate-reference:*:label#data-types-objects) and [Object Column policy](inv:crate-reference:*:label#type-object-column-policy) + * [Inserting objects as JSON](inv:crate-reference:*:label#data-types-object-json) + * [json type](inv:crate-reference:*:label#column_policy) From 855950c609fbc96d3062354b5f089685126a1761 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 10:24:08 +0200 Subject: [PATCH 16/58] reformat json to 80 chars, convert link to reference --- docs/start/modelling/json.md | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 881d03ae..06def605 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -1,13 +1,19 @@ (model-json)= # JSON data -CrateDB combines the flexibility of NoSQL document stores with the power of SQL. It enables you to store, query, and index **semi-structured JSON data** using **standard SQL**, making it an excellent choice for applications that handle diverse or evolving schemas. +CrateDB combines the flexibility of NoSQL document stores with the power of SQL. +It enables you to store, query, and index **semi-structured JSON data** using +**standard SQL**, making it an excellent choice for applications that handle +diverse or evolving schemas. -CrateDB’s support for dynamic objects, nested structures, and dot-notation querying brings the best of both relational and document-based data modeling—without leaving the SQL world. 
+CrateDB’s support for dynamic objects, nested structures, and dot-notation +querying brings the best of both relational and document-based data +modeling—without leaving the SQL world. ## Object (JSON) Columns -CrateDB allows you to define **object columns** that can store JSON-style data structures. +CrateDB allows you to define **object columns** that can store JSON-style data +structures. ```sql CREATE TABLE events ( @@ -79,9 +85,15 @@ Dot-notation works for both explicitly and dynamically added fields. ## Querying DYNAMIC OBJECTs -To support querying DYNAMIC OBJECTs using SQL, where keys may not exist within an OBJECT, CrateDB provides the [error\_on\_unknown\_object\_key](https://cratedb.com/docs/crate/reference/en/latest/config/session.html#conf-session-error-on-unknown-object-key) session setting. It controls the behaviour when querying unknown object keys to dynamic objects. +To support querying DYNAMIC OBJECTs using SQL, where keys may not exist within +an OBJECT, CrateDB provides the +[error_on_unknown_object_key](inv:crate-reference:*:label#conf-session-error_on_unknown_object_key) +session setting. It controls the behaviour when querying unknown object keys to +dynamic objects. -By default, CrateDB will raise an error if any of the queried object keys are unknown. When adjusting this setting to `false`, it will return `NULL` as the value of the corresponding key. +By default, CrateDB will raise an error if any of the queried object keys are +unknown. When adjusting this setting to `false`, it will return `NULL` as the +value of the corresponding key. ```sql cr> CREATE TABLE testdrive (item OBJECT(DYNAMIC)); @@ -179,7 +191,8 @@ data['some_field'] INDEX OFF ``` ```{note} -Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. +Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` +if needed. ``` ## Aggregating JSON Fields @@ -192,11 +205,14 @@ FROM sensor_readings WHERE payload['location'] = 'room1'; ``` -CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. +CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on +object fields. ## Further Learning & Resources * Reference Manual: - * [Objects](inv:crate-reference:*:label#data-types-objects) and [Object Column policy](inv:crate-reference:*:label#type-object-column-policy) - * [Inserting objects as JSON](inv:crate-reference:*:label#data-types-object-json) + * [Objects](inv:crate-reference:*:label#data-types-objects) and [Object Column + policy](inv:crate-reference:*:label#type-object-column-policy) + * [Inserting objects as + JSON](inv:crate-reference:*:label#data-types-object-json) * [json type](inv:crate-reference:*:label#column_policy) From a7afeda6f32b24fa5c8c17aa8a86603fdb33398f Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 10:58:29 +0200 Subject: [PATCH 17/58] Update Primary key strategies --- docs/start/modelling/index.md | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-) diff --git a/docs/start/modelling/index.md b/docs/start/modelling/index.md index 58bce3b8..ae468744 100644 --- a/docs/start/modelling/index.md +++ b/docs/start/modelling/index.md @@ -103,28 +103,18 @@ fulltext vector ``` -:::{rubric} Implementation notes -::: - -Because CrateDB is a distributed analytical database (OLAP) designed to store -large volumes of data, users need to consider certain details compared to -traditional RDBMS. 
- :::{card} Primary key strategies :link: model-primary-key :link-type: ref CrateDB is built for horizontal scalability and high ingestion throughput. -+++ -To achieve this, operations must complete independently on each node—without -central coordination. This design choice means CrateDB does not support -traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL -or MySQL by default. +To achieve this, auto-incrementing primary keys are not supported, and other +solutions are required instead. ::: ```{toctree} :maxdepth: 1 :hidden: -primary-key +Primary key strategies ``` From aee1c58dbd0415a5fc277054d29b8a52ef85351e Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 10:59:00 +0200 Subject: [PATCH 18/58] Add link to data modelling --- docs/start/going-further.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/start/going-further.md b/docs/start/going-further.md index 87436921..3c0e3b0d 100644 --- a/docs/start/going-further.md +++ b/docs/start/going-further.md @@ -18,8 +18,9 @@ of the documentation portal. ::: :::{sd-row} -```{sd-item} Data modelling +```{sd-item} :class: sd-font-weight-bolder +{ref}`Data modelling ` ``` ```{sd-item} Learn the different types of structured, semi-structured, and unstructured data. From ef495b9bf0a5bd5a723eb160b8421ffe177a45fe Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 10:59:24 +0200 Subject: [PATCH 19/58] Move Going Further in index to match GitBook --- docs/start/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/index.md b/docs/start/index.md index b888cf5b..c71a511f 100644 --- a/docs/start/index.md +++ b/docs/start/index.md @@ -114,11 +114,11 @@ and explore key features. :hidden: first-steps +going-further modelling/index query/index Ingesting data <../ingest/index> application/index -going-further ``` From 908e867bb36084411201cb96dc7bb2b8fda66bce Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 11:29:29 +0200 Subject: [PATCH 20/58] Update timeseries with content from Gitbook --- docs/start/modelling/timeseries.md | 60 +++++++++++++++--------------- 1 file changed, 29 insertions(+), 31 deletions(-) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index 71be6067..4b7c4944 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -1,13 +1,14 @@ (model-timeseries)= # Time series data -CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. - ## Why CrateDB for Time Series? +CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. + * While maintaining a high ingest rate, its **columnar storage** and **automatic indexing** let you access and analyze the data immediately with **fast aggregations** and **near-real-time queries**. * Handles **high cardin­ality** and **a variety of data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. -* **PostgreSQL wire‑protocol compatible**, so it integrates easily with existing tools and drivers. + +*** ## Data Model Template @@ -39,7 +40,9 @@ Key points: * `month` is the partitioning key, optimizing data storage and retrieval. * Every column is stored in the column store by default for fast aggregations. 
-* Using **OBJECT columns** in the `devices_readings` table provides a structured and efficient way to organize complex nested data in CrateDB, enhancing both data integrity and flexibility. +* Using **OBJECT columns** provides a structured and efficient way to organize complex nested data in CrateDB, enhancing both data integrity and flexibility. + +*** ## Ingesting and Querying @@ -100,11 +103,24 @@ ORDER BY expected_time; ``` -## Down-sampling & Interpolation +### Typical time-series functions + +* **Time extraction:** date\_trunc, extract, date\_part, now(), current\_timestamp +* **Time bucketing:** date\_bin, interval, age +* **Window functions:** avg(...) OVER (...), stddev(...) OVER (...), lag, lead, first\_value, last\_value, row\_number, rank, WINDOW ... AS (...) +* **Null handling:** coalesce, nullif +* **Statistical aggregates:** percentile, correlation, stddev, variance, min, max, sum +* **Advanced filtering & logic:** greatest, least, case when ... then ... end + +*** + +## Downsampling & Interpolation To reduce volume while preserving trends, use `DATE_BIN`.\ Missing data can be handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. +*** + ## Schema Evolution & Contextual Data With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. For column-level control, use `OBJECT(DYNAMIC)` to auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)`to accept unknown keys without creating or indexing subcolumns. @@ -117,44 +133,26 @@ You can also store: All types are supported within the same table or joined together. +*** + ## Storage Optimization * **Partitioning and sharding**: data can be partitioned by time (e.g. daily/monthly) and sharded across a cluster. * Supports long‑term retention with performant historic storage. * Columnar layout reduces storage footprint and accelerates aggregation queries. +*** + ## Advanced Use Cases * **Exploratory data analysis** (EDA), decomposition, and forecasting via CrateDB’s SQL or by exporting to Pandas/Plotly. * **Machine learning workflows**: time‑series features and anomaly detection pipelines can be built using CrateDB + external tools -## Sample Workflow (Chicago Weather Dataset) - -In [this lesson of the CrateDB Academy](https://cratedb.com/academy/fundamentals/data-modelling-with-cratedb/hands-on-time-series-data) introducing Time Series data, we provide a sample data set that captures hourly temperature, humidity, pressure, wind at three Chicago stations (150,000+ records). - -Typical operations: - -* Table creation and ingestion -* Average per station -* Using `MAX_BY()` to find highest temperature timestamps -* Down-sampling using `DATE_BIN` into 4‑week buckets - -This workflow illustrates how CrateDB scales and simplifies time series modeling. 
- -## Best Practices Checklist - -| Topic | Recommendation | -| ----------------------------- | ---------------------------------------------------------------------------------- | -| Schema design and evolution | Dynamic columns add fields as needed; diverse data types ensure proper typing | -| Ingestion | Use bulk import (COPY) and JSON ingestion | -| Aggregations | Use DATE\_BIN, window functions, GROUP BY | -| Interpolation / gap analysis | Employ LAG(), LEAD(), generate\_series, joins | -| Mixed data types | Combine time series, JSON, geo, full‑text in one dataset | -| Partitioning & shard strategy | Partition by time, shard across nodes for scale | -| Down-sampling | Use DATE\_BIN for aggregating resolution or implement your own strategy using UDFs | -| Integration with analytics/ML | Export to pandas/Plotly to train your ML models | +*** -## Further Learning +## Further Learning & Resources +* **Documentation:** [Advanced Time Series Analysis](project:#timeseries-analysis), [Time Series Long Term Storage](project:#timeseries-longterm) * **Video:** [Time Series Data Modeling](https://cratedb.com/resources/videos/time-series-data-modeling) – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. * **CrateDB Academy:** [Advanced Time Series Modeling course](https://cratedb.com/academy/time-series/getting-started/introduction-to-time-series-data). +* **Tutorial:** [Downsampling with LTTB algorithm](https://community.cratedb.com/t/advanced-downsampling-with-the-lttb-algorithm/1287) From 198fa7f360329b2e0d77d03e68429121136bcadd Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 29 Aug 2025 12:06:04 +0200 Subject: [PATCH 21/58] Added vector modeling content from Gitbook --- docs/start/modelling/vector.md | 65 +++++++++++----------------------- 1 file changed, 20 insertions(+), 45 deletions(-) diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md index 083e1284..a7fbb723 100644 --- a/docs/start/modelling/vector.md +++ b/docs/start/modelling/vector.md @@ -1,14 +1,15 @@ (model-vector)= # Vector data -CrateDB natively supports **vector embeddings** for efficient **similarity search** using **approximate nearest neighbor (ANN)** algorithms. This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics—all in the simplicity of SQL. +CrateDB natively supports **vector embeddings** for efficient **similarity search** using **k-nearest neighbour (kNN)** algorithms. This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics, all in the simplicity of SQL. -Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables **real-time vector search at scale**, in combination with other data types like full-text, geospatial, and time-series.\ +Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables **real-time vector search at scale**, in combination with other data types like full-text, geospatial, and time-series. +*** ## Data Type: VECTOR -CrateDB introduces a native `VECTOR` type with the following key characteristics: +CrateDB has a native `VECTOR` type with the following key characteristics: * Fixed-length float arrays (e.g. 
768, 1024, 2048 dimensions) * Supports **HNSW (Hierarchical Navigable Small World)** indexing for fast approximate search @@ -28,6 +29,8 @@ CREATE TABLE documents ( * `VECTOR(FLOAT[768])` declares a fixed-size vector column. * You can ingest vectors directly or compute them externally and store them via SQL +*** + ## Querying Vectors with SQL Use the `nearest_neighbors` predicate to perform similarity search: @@ -51,9 +54,11 @@ ORDER BY score LIMIT 10; ``` -:::{note} +```{note} Combine vector similarity with full-text, metadata, or geospatial filters! -::: +``` + +*** ## Ingestion: Working with Embeddings @@ -68,25 +73,7 @@ You can ingest vectors in several ways: * **Batched imports** via `COPY FROM` using JSON or CSV * CrateDB doesn't currently compute embeddings internally—you bring your own model or use pipelines that call CrateDB. -## Use Cases - -| Use Case | Description | -| ----------------------- | ------------------------------------------------------------------ | -| Semantic Search | Rank documents by meaning instead of keywords | -| Recommendation Systems | Find similar products, users, or behaviors | -| Image / Audio Retrieval | Store and compare embeddings of images/audio | -| Fraud Detection | Match behavioral patterns via vectors | -| Hybrid Search | Combine vector similarity with full-text, geo, or temporal filters | - -Example: Hybrid semantic product search - -```sql -SELECT id, title, price, description -FROM products -WHERE MATCH(description_ft, 'running shoes') AND brand = 'Nike' -ORDER BY features <-> [vector] ASC -LIMIT 10; -``` +*** ## Performance & Scaling @@ -94,20 +81,11 @@ LIMIT 10; * CrateDB parallelizes ANN search across shards/nodes. * Ideal for 100K to tens of millions of vectors; supports real-time ingestion and queries. -:::{note} -vector dimensionality must be consistent for each column. -::: - -## Best Practices +```{note} +Vector dimensionality must be consistent for each column. +``` -| Area | Recommendation | -| -------------- | ----------------------------------------------------------------------- | -| Vector length | Use standard embedding sizes (e.g. 384, 512, 768, 1024) | -| Similarity | Cosine for semantic/textual data; dot-product for ranking models | -| Index tuning | Tune `ef_search` for latency/recall trade-offs | -| Hybrid queries | Combine vector similarity with metadata filters (e.g. category, region) | -| Updates | Re-inserting or updating vectors is fully supported | -| Data pipelines | Use external tools for vector generation; push to CrateDB via REST/SQL | +*** ## Integrations @@ -115,13 +93,10 @@ vector dimensionality must be consistent for each column. * **Embedding models**: Use OpenAI, HuggingFace, Cohere, or in-house models * **RAG architecture**: CrateDB stores vector + metadata + raw text in a unified store -## Further Learning & Resources - -* CrateDB Docs – Vector Search -* Blog: Using CrateDB for Hybrid Search (Vector + Full-Text) -* CrateDB Academy – Vector Data -* [Sample notebooks on GitHub](https://github.com/crate/cratedb-examples) +*** -## Summary +## Further Learning & Resources -CrateDB gives you the power of **vector similarity search** with the **flexibility of SQL** and the **scalability of a distributed database**. It lets you unify structured, unstructured, and semantic data—enabling modern applications in AI, search, and recommendation without additional vector databases or pipelines. 
+* [Vector Search](project:#vector-search): More details about searching with vectors +* Blog: [Using CrateDB for Hybrid Search (Vector + Full-Text)](https://cratedb.com/blog/hybrid-search-explained) +* CrateDB Academy: [Vector similarity search](https://learn.cratedb.com/cratedb-fundamentals?lesson=vector-similarity-search) From 8fc8021c3683a33cbb74799d0b87832b4068a689 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Tue, 2 Sep 2025 11:42:22 +0200 Subject: [PATCH 22/58] relational.md updated with content from GitBook --- docs/start/modelling/relational.md | 55 ++++++------------------------ 1 file changed, 10 insertions(+), 45 deletions(-) diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index 8f9e90eb..fc0de93d 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -130,13 +130,13 @@ FROM customers c; ```sql WITH order_counts AS ( - SELECT + SELECT o.customer_id, COUNT(*) AS order_count FROM orders o GROUP BY o.customer_id ) -SELECT +SELECT c.name, COALESCE(oc.order_count, 0) AS order_count FROM customers c @@ -144,48 +144,13 @@ LEFT JOIN order_counts oc ON c.id = oc.customer_id; ``` -## Use Cases for Relational Modeling - -| Use Case | Description | -| -------------------- | ------------------------------------------------ | -| Customer & Orders | Classic normalized setup with joins and filters | -| Inventory Management | Products, stock levels, locations | -| Financial Systems | Transactions, balances, audit logs | -| User Profiles | Users, preferences, activity logs | -| Multi-tenant Systems | Use schemas or partitioning for tenant isolation | - -## Scalability & Distribution - -CrateDB automatically shards tables across nodes, distributing both **data and query processing**. - -* Tables can be **sharded and replicated** for fault tolerance -* Use **partitioning** for time-series or tenant-based scaling -* SQL queries are transparently **parallelized across the cluster** - -:::{note} -Use `CLUSTERED BY` and `PARTITIONED BY` in `CREATE TABLE` to control distribution patterns. -::: - -## Best Practices - -| Area | Recommendation | -| ------------- | ------------------------------------------------------------ | -| Keys & IDs | Use UUIDs or consistent IDs for primary keys | -| Sharding | Let CrateDB auto-shard unless you have advanced requirements | -| Join Strategy | Minimize joins over large, high-cardinality columns | -| Nested Fields | Use `column_policy = 'dynamic'` if schema needs flexibility | -| Aggregations | Favor columnar tables for analytical workloads | -| Co-location | Consider denormalization for write-heavy workloads | - ## Further Learning & Resources -* CrateDB Docs – Data Modeling -* CrateDB Academy – Relational Modeling -* Working with Joins in CrateDB -* Schema Design Guide - -## Summary - -CrateDB offers a familiar, powerful **relational model with full SQL** and built-in support for scale, performance, and hybrid data. You can model clean, normalized data structures and join them across millions of records, without sacrificing the flexibility to embed, index, and evolve schema dynamically. - -CrateDB is the modern SQL engine for building relational, real-time, and hybrid apps in a distributed world. 
+* Reference Manual: + * How to [query with joins](inv:crate-reference:*:label#sql_joins) + * [SQL join statements](inv:crate-reference:*:label#sql-select-joined-relation) + * [Join types and their implementation](inv:crate-reference:*:label#concept-joins) +* Blog posts: + * [How to fine-tune the query optimizer](https://cratedb.com/blog/join-performance-to-the-rescue) + * [Adding support for joins on virtual tables and multi-row subselects](https://cratedb.com/blog/joins-multi-row-subselects) + * How we made Joins twenty three thousand times faster - part [#1](https://cratedb.com/blog/joins-faster-part-one), [#2](https://cratedb.com/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-two), [#3](https://cratedb.com/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three), [Video](https://cratedb.com/resources/videos/distributed-join-algorithms) From 0d6a691be17800249456c63124d00763bcea0870 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Wed, 3 Sep 2025 00:01:06 +0200 Subject: [PATCH 23/58] Update Geospatial with Gitbook content --- docs/start/modelling/geospatial.md | 131 ++++++++++------------------- 1 file changed, 44 insertions(+), 87 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index e40ef3e4..9a30a7c3 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -3,117 +3,74 @@ CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze 2D location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. +The strength of CrateDB's support for geospatial data includes: + +* Designed for **real-time geospatial tracking and analytics** (e.g., fleet tracking, mapping, location-layered apps) +* **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query +* **High ingest and query throughput**, suitable for large-scale location-based workloads + ## Geospatial Data Types -### **GEO\_POINT** +CrateDB has two geospatial data types: + +### geo_point * Stores a single location via latitude/longitude. -* Insert using either a coordinate array `[lon, lat]` or Well-Known Text (WKT) string `'POINT (lon lat)'`. +* Insert using + * coordinate array `[lon, lat]` + * [Well-Known Text](https://libgeos.org/specifications/wkt/) (WKT) string `'POINT (lon lat)'`. * Must be declared explicitly; dynamic schema inference will not detect `geo_point` type. -### **GEO\_SHAPE** +### **geo_shape** -* Supports complex geometries (Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT. -* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold. The indexes are described in the [reference manual](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#type-geo-shape-index). +* Represents more complex 2D shapes defined via GeoJSON or WKT formats. +* Supported geometry types: + * `Point`, `MultiPoint` + * `LineString`, `MultiLineString` + * `Polygon`, `MultiPolygon` + * `GeometryCollection` +* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold. 
The indexes are described in the [reference manual](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#type-geo-shape-index). You can choose and configure the indexing method when defining your table schema. -## Table Schema Example +## Defining a Geospatial Column -Let's define a table with country boarders and capital: +Here’s an example of how to define a `GEO_SHAPE` column with a specific index: ```sql -CREATE TABLE country ( - name text, - country_code text primary key, - shape geo_shape INDEX USING "geohash" WITH (precision='100m'), - capital text, - capital_location geo_point -); -``` - -* Use `GEO_SHAPE` to define the border. -* `GEO_POINT` to define the location of the capital. - -## Insert rows - -We can populate the table with the coordinate shape of Vienna/Austria: - -```psql -INSERT INTO country (name, country_code, shape, capital, capital_location) -VALUES ( - 'Austria', - 'at', - {type='Polygon', coordinates=[ - [[16.979667, 48.123497], [16.903754, 47.714866], - [16.340584, 47.712902], [16.534268, 47.496171], - [16.202298, 46.852386], [16.011664, 46.683611], - [15.137092, 46.658703], [14.632472, 46.431817], - [13.806475, 46.509306], [12.376485, 46.767559], - [12.153088, 47.115393], [11.164828, 46.941579], - [11.048556, 46.751359], [10.442701, 46.893546], - [9.932448, 46.920728], [9.47997, 47.10281], - [9.632932, 47.347601], [9.594226, 47.525058], - [9.896068, 47.580197], [10.402084, 47.302488], - [10.544504, 47.566399], [11.426414, 47.523766], - [12.141357, 47.703083], [12.62076, 47.672388], - [12.932627, 47.467646], [13.025851, 47.637584], - [12.884103, 48.289146], [13.243357, 48.416115], - [13.595946, 48.877172], [14.338898, 48.555305], - [14.901447, 48.964402], [15.253416, 49.039074], - [16.029647, 48.733899], [16.499283, 48.785808], - [16.960288, 48.596982], [16.879983, 48.470013], - [16.979667, 48.123497]] - ]}, - 'Vienna', - [16.372778, 48.209206] +CREATE TABLE parks ( + name TEXT, + area GEO_SHAPE INDEX USING quadtree ); ``` -## Core Geospatial Functions - -CrateDB provides key scalar functions for spatial operations: +## Inserting Geospatial Data -* **`distance(geo_point1, geo_point2)`** – returns meters using the Haversine formula (e.g. compute distance between two points) -* **`within(shape1, shape2)`** – true if one geo object is fully contained within another -* **`intersects(shape1, shape2)`** – true if shapes overlap or touch anywhere -* **`latitude(geo_point)` / `longitude(geo_point)`** – extract individual coordinates -* **`geohash(geo_point)`** – compute a 12‑character geohash for the point -* **`area(geo_shape)`** – returns approximate area in square degrees; uses geodetic awareness - -Furthermore, it is possible to use the **match** predicate with geospatial data in queries. +You can insert geospatial values using either **GeoJSON** or **WKT** formats. -Note: More precise relational operations on shapes may bypass indexes and can be slower. +```sql +-- Insert a shape (WKT format) +INSERT INTO parks (name, area) +VALUES ('My Park', 'POLYGON ((5 5, 30 5, 30 30, 5 30, 5 5))'); +``` -## An example query +## Querying with spacial operations -It is possible to find the distance to the capital of each country in the table: +It is e.g. 
possible to check if a point is within a park in the table: ```sql -SELECT distance(capital_location, [9.74, 47.41])/1000 -FROM country; +SELECT name FROM parks +WHERE within('POINT(10 10)'::geo_shape, area); ``` -## Real-World Examples: Chicago Use Cases - -* **311 calls**: Each record includes `location` as `GEO_POINT`. Queries use `within()` to find calls near a polygon around O’Hare airport. -* **Community areas**: Polygon boundaries stored in `GEO_SHAPE`. Queries for intersections with arbitrary lines or polygons using `intersects()` return overlapping zones. -* **Taxi rides**: Pickup/drop off locations stored as geo points. Use `distance()` filter to compute trip distances and aggregate. +CrateDB provides key scalar functions for spatial operations like, distance(), within(), intersects(), area(), geohash() and lattitude()/longitude(). -## Architectural Strengths & Suitability - -* Designed for **real-time geospatial tracking and analytics** (e.g. fleet tracking, mapping, location-layered apps). -* **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query. -* **High ingest and query throughput**, suitable for large-scale location-based workloads - -## Best Practices Checklist +Furthermore, it is possible to use the **match** predicate with geospatial data in queries. -
| Topic | Recommendation |
| ----- | -------------- |
| Data types | Declare GEO_POINT/GEO_SHAPE explicitly |
| Geometric formats | Use WKT or GeoJSON for insertions |
| Index tuning | Choose geohash/quadtree/BKD tree & adjust precision |
| Queries | Prefer MATCH for indexed filtering; use functions for precise checks |
| Joins & spatial filters | Use within/intersects to correlate spatial entities |
| Scale & performance | Index shapes, use distance/within filters early |
| Mixed-model integration | Combine spatial with JSON, full-text, vector, time-series |
+See the section about searching geospatial data (!!! add link) for details on this. ## Further Learning & Resources -* Official **Geospatial Search Guide** in CrateDB docs, detailing geospatial types, indexing, and MATCH predicate usage. -* CrateDB Academy **Hands-on: Geospatial Data** modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries. -* CrateDB Blog: **Geospatial Queries with CrateDB** – outlines capabilities, limitations, and practical use cases (available since version 0.40 - -## Summary - -CrateDB provides robust support for geospatial modeling through clearly defined data types (`GEO_POINT`, `GEO_SHAPE`), powerful scalar functions (`distance`, `within`, `intersects`, `area`), and Lucene‑based indexing for fast queries. It excels in high‑volume, real‑time spatial analytics and integrates smoothly with multi-model use cases. Whether storing vehicle positions, mapping regions, or enabling spatial joins—CrateDB’s geospatial layer makes it easy, scalable, and extensible. +* Reference manual: + * {ref}`Geo Search ` + * {ref}`Geo functions `: distance, within, intersects, latitude, longitude, geohash, area +* CrateDB Academy [**Hands-on: Geospatial Data**](https://cratedb.com/academy/fundamentals/data-modelling-with-cratedb/hands-on-geospatial-data) modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries. +* CrateDB Blog: [**Geospatial Queries with CrateDB**](https://cratedb.com/blog/geospatial-queries-with-crate-data) – outlines capabilities, limitations, and practical use cases. From 12c45c73759dc9b4788cc2ec12c432e1d38d8f45 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Wed, 3 Sep 2025 00:02:09 +0200 Subject: [PATCH 24/58] Updated global references and removed seperators --- docs/start/modelling/json.md | 9 ++++----- docs/start/modelling/relational.md | 6 +++--- docs/start/modelling/timeseries.md | 16 +--------------- docs/start/modelling/vector.md | 2 +- 4 files changed, 9 insertions(+), 24 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 06def605..34073aae 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -211,8 +211,7 @@ object fields. 
## Further Learning & Resources * Reference Manual: - * [Objects](inv:crate-reference:*:label#data-types-objects) and [Object Column - policy](inv:crate-reference:*:label#type-object-column-policy) - * [Inserting objects as - JSON](inv:crate-reference:*:label#data-types-object-json) - * [json type](inv:crate-reference:*:label#column_policy) + * {ref}`Objects ` + * {ref}`Object Column policy ` + * {ref}`Inserting objects as JSON ` + * {ref}`json type ` diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index fc0de93d..9acdc42f 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -147,9 +147,9 @@ LEFT JOIN order_counts oc ## Further Learning & Resources * Reference Manual: - * How to [query with joins](inv:crate-reference:*:label#sql_joins) - * [SQL join statements](inv:crate-reference:*:label#sql-select-joined-relation) - * [Join types and their implementation](inv:crate-reference:*:label#concept-joins) + * How to {ref}`query with joins ` + * {ref}`SQL join statements ` + * {ref}`Join types and their implementation ` * Blog posts: * [How to fine-tune the query optimizer](https://cratedb.com/blog/join-performance-to-the-rescue) * [Adding support for joins on virtual tables and multi-row subselects](https://cratedb.com/blog/joins-multi-row-subselects) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index 4b7c4944..f4e7faf3 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -8,8 +8,6 @@ CrateDB employs a relational representation for time‑series, enabling you to w * While maintaining a high ingest rate, its **columnar storage** and **automatic indexing** let you access and analyze the data immediately with **fast aggregations** and **near-real-time queries**. * Handles **high cardin­ality** and **a variety of data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. -*** - ## Data Model Template A typical time‑series schema looks like this: @@ -42,8 +40,6 @@ Key points: * Every column is stored in the column store by default for fast aggregations. * Using **OBJECT columns** provides a structured and efficient way to organize complex nested data in CrateDB, enhancing both data integrity and flexibility. -*** - ## Ingesting and Querying ### **Data Ingestion** @@ -112,15 +108,11 @@ ORDER BY * **Statistical aggregates:** percentile, correlation, stddev, variance, min, max, sum * **Advanced filtering & logic:** greatest, least, case when ... then ... end -*** - ## Downsampling & Interpolation To reduce volume while preserving trends, use `DATE_BIN`.\ Missing data can be handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. -*** - ## Schema Evolution & Contextual Data With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. For column-level control, use `OBJECT(DYNAMIC)` to auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)`to accept unknown keys without creating or indexing subcolumns. @@ -133,26 +125,20 @@ You can also store: All types are supported within the same table or joined together. -*** - ## Storage Optimization * **Partitioning and sharding**: data can be partitioned by time (e.g. daily/monthly) and sharded across a cluster. * Supports long‑term retention with performant historic storage. 
* Columnar layout reduces storage footprint and accelerates aggregation queries. -*** - ## Advanced Use Cases * **Exploratory data analysis** (EDA), decomposition, and forecasting via CrateDB’s SQL or by exporting to Pandas/Plotly. * **Machine learning workflows**: time‑series features and anomaly detection pipelines can be built using CrateDB + external tools -*** - ## Further Learning & Resources -* **Documentation:** [Advanced Time Series Analysis](project:#timeseries-analysis), [Time Series Long Term Storage](project:#timeseries-longterm) +* **Documentation:** {ref}`Advanced Time Series Analysis `, {ref}`Time Series Long Term Storage ` * **Video:** [Time Series Data Modeling](https://cratedb.com/resources/videos/time-series-data-modeling) – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. * **CrateDB Academy:** [Advanced Time Series Modeling course](https://cratedb.com/academy/time-series/getting-started/introduction-to-time-series-data). * **Tutorial:** [Downsampling with LTTB algorithm](https://community.cratedb.com/t/advanced-downsampling-with-the-lttb-algorithm/1287) diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md index a7fbb723..8ca1856c 100644 --- a/docs/start/modelling/vector.md +++ b/docs/start/modelling/vector.md @@ -97,6 +97,6 @@ Vector dimensionality must be consistent for each column. ## Further Learning & Resources -* [Vector Search](project:#vector-search): More details about searching with vectors +* {ref}`Vector Search `: More details about searching with vectors * Blog: [Using CrateDB for Hybrid Search (Vector + Full-Text)](https://cratedb.com/blog/hybrid-search-explained) * CrateDB Academy: [Vector similarity search](https://learn.cratedb.com/cratedb-fundamentals?lesson=vector-similarity-search) From 16167b672e481e36e93ca1dd50941fc1d78270e3 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Wed, 3 Sep 2025 01:07:06 +0200 Subject: [PATCH 25/58] Fix link in geospatial --- docs/start/modelling/geospatial.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 9a30a7c3..ae31b0b3 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -42,7 +42,7 @@ CREATE TABLE parks ( ); ``` -## Inserting Geospatial Data +## Inserting Geospatial Data You can insert geospatial values using either **GeoJSON** or **WKT** formats. @@ -65,7 +65,7 @@ CrateDB provides key scalar functions for spatial operations like, distance(), w Furthermore, it is possible to use the **match** predicate with geospatial data in queries. -See the section about searching geospatial data (!!! add link) for details on this. +See the section about {ref}`searching geospatial data ` for details on this. ## Further Learning & Resources From 856eab9a9796f82446f57dd59a7c9e19f424d791 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 5 Sep 2025 13:24:05 +0200 Subject: [PATCH 26/58] Updated modelling/Vector from AI gen to working. --- docs/start/modelling/vector.md | 115 +++++++++++++-------------------- 1 file changed, 46 insertions(+), 69 deletions(-) diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md index 8ca1856c..cdea297a 100644 --- a/docs/start/modelling/vector.md +++ b/docs/start/modelling/vector.md @@ -1,102 +1,79 @@ (model-vector)= # Vector data -CrateDB natively supports **vector embeddings** for efficient **similarity search** using **k-nearest neighbour (kNN)** algorithms. 
This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics, all in the simplicity of SQL. +CrateDB natively supports **vector embeddings** for efficient **similarity +search** using **k-nearest neighbour (kNN)** algorithms. This makes it a +powerful engine for building AI-powered applications involving semantic search, +recommendations, anomaly detection, and multimodal analytics, all in the +simplicity of SQL. -Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables **real-time vector search at scale**, in combination with other data types like full-text, geospatial, and time-series. +Whether you’re working with text, images, sensor data, or any domain represented +as high-dimensional embeddings, CrateDB enables **real-time vector search at +scale**, in combination with other data types like full-text, geospatial, and +time-series. -*** +## Data Type: FLOAT_VECTOR -## Data Type: VECTOR +CrateDB has a native {ref}`FLOAT_VECTOR type ` +type with the following key characteristics: -CrateDB has a native `VECTOR` type with the following key characteristics: - -* Fixed-length float arrays (e.g. 768, 1024, 2048 dimensions) -* Supports **HNSW (Hierarchical Navigable Small World)** indexing for fast approximate search -* Optimized for cosine, Euclidean, and dot-product similarity +* Fixed-length float arrays (1-2048 dimensions) +* Backed by Lucene’s HNSW approximate nearest neighbor (ANN) search +* Similarity and scoring exposed via {ref}`KNN_MATCH ` +and {ref}`VECTOR_SIMILARITY `. **Example: Define a Table with Vector Embeddings** ```sql CREATE TABLE documents ( - id UUID PRIMARY KEY, title TEXT, content TEXT, - embedding VECTOR(FLOAT[768]) + embedding FLOAT_VECTOR(3) ); ``` -* `VECTOR(FLOAT[768])` declares a fixed-size vector column. -* You can ingest vectors directly or compute them externally and store them via SQL - -*** - -## Querying Vectors with SQL - -Use the `nearest_neighbors` predicate to perform similarity search: - -```sql -SELECT id, title, content -FROM documents -ORDER BY embedding <-> [0.12, 0.73, ..., 0.01] -LIMIT 5; -``` - -This ranks results by **vector similarity** using the index. - -Or, filter and rank by proximity: - -```sql -SELECT id, title, content, embedding <-> [0.12, ..., 0.01] AS score -FROM documents -WHERE MATCH(content_ft, 'machine learning') AND author = 'Alice' -ORDER BY score -LIMIT 10; -``` - -```{note} -Combine vector similarity with full-text, metadata, or geospatial filters! -``` - -*** +* `FLOAT_VECTOR(3)` declares a vector column with 3 floats. ## Ingestion: Working with Embeddings You can ingest vectors in several ways: -* **Precomputed embeddings** from models like OpenAI, HuggingFace, or SentenceTransformers: - +* **Precomputed embeddings** from models: ```sql - INSERT INTO documents (id, title, embedding) - VALUES ('uuid-123', 'AI and Databases', [0.12, 0.34, ..., 0.01]); + INSERT INTO documents (title, embedding) + VALUES ('AI and Databases', [0.12, 0.34, 0.01]); ``` -* **Batched imports** via `COPY FROM` using JSON or CSV -* CrateDB doesn't currently compute embeddings internally—you bring your own model or use pipelines that call CrateDB. + You must insert the exact number of floats defined in the table or an error + will be thrown. -*** +* **Batched imports** via {ref}`COPY FROM ` +using JSON or CSV. 
+* CrateDB doesn't currently compute embeddings internally — you bring your own +model or use pipelines that call CrateDB. -## Performance & Scaling +## Querying Vectors with SQL -* Vector search uses **HNSW**: state-of-the-art ANN algorithm with logarithmic search complexity. -* CrateDB parallelizes ANN search across shards/nodes. -* Ideal for 100K to tens of millions of vectors; supports real-time ingestion and queries. +Use {ref}`KNN_MATCH ` to perform similarity +search: -```{note} -Vector dimensionality must be consistent for each column. +```sql +SELECT title, content, _score +FROM documents +WHERE knn_match(embedding, [3.14, 5.1, 8.2], 2) +ORDER BY _score DESC; ``` -*** - -## Integrations - -* **Python / pandas / LangChain**: CrateDB has native drivers and REST interface -* **Embedding models**: Use OpenAI, HuggingFace, Cohere, or in-house models -* **RAG architecture**: CrateDB stores vector + metadata + raw text in a unified store - -*** +This ranks results by **vector similarity** to the vector supplied by searching +max 2 nearest neighbours. ## Further Learning & Resources -* {ref}`Vector Search `: More details about searching with vectors -* Blog: [Using CrateDB for Hybrid Search (Vector + Full-Text)](https://cratedb.com/blog/hybrid-search-explained) -* CrateDB Academy: [Vector similarity search](https://learn.cratedb.com/cratedb-fundamentals?lesson=vector-similarity-search) +* {ref}`Vector Search `: More details about searching with + vectors +Reference manual: + * {ref}`FLOAT_VECTOR type ` + * {ref}`KNN_MATCH ` + * {ref}`VECTOR_SIMILARITY ` +* Blog: [Vector support and KNN search](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb) +* CrateDB Academy: [Vector similarity + search](https://learn.cratedb.com/cratedb-fundamentals?lesson=vector-similarity-search) From 289c309e43cbc3f2cdb9d4bddc73863ae53c061e Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 5 Sep 2025 14:29:04 +0200 Subject: [PATCH 27/58] Fix errors in modelling/relational --- docs/start/modelling/relational.md | 102 ++++++++++++++++++----------- 1 file changed, 62 insertions(+), 40 deletions(-) diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index 9acdc42f..ea0353dc 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -1,9 +1,15 @@ (model-relational)= # Relational data -CrateDB is a **distributed SQL database** that offers rich **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary keys,** **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector search, and semi-structured data. +CrateDB is a **distributed SQL database** that offers rich **relational data +modeling** with the flexibility of dynamic schemas and the scalability of NoSQL +systems. It supports **primary keys,** **joins**, **aggregations**, and +**subqueries**, just like traditional RDBMS systems—while also enabling hybrid +use cases with time-series, geospatial, full-text, vector search, and +semi-structured data. -Use CrateDB when you need to scale relational workloads horizontally while keeping the simplicity of **SQL**. +Use CrateDB when you need to scale relational workloads horizontally while +keeping the simplicity of **SQL**. 
## Table Definitions @@ -20,41 +26,31 @@ CREATE TABLE customers ( **Key Features:** -* Supports scalar types (`TEXT`, `INTEGER`, `DOUBLE`, `BOOLEAN`, `TIMESTAMP`, etc.) -* `gen_random_text_uuid()`, `now()` or `current_timestamp()` recommended for primary keys in distributed environments -* Default **replication**, **sharding**, and **partitioning** options are built-in for scale +* Supports scalar types (`TEXT`, `INTEGER`, `DOUBLE`, `BOOLEAN`, `TIMESTAMP`, +etc.) +* `gen_random_text_uuid()`, `now()` or `current_timestamp()` recommended for +primary keys in distributed environments +* Default **replication**, **sharding**, and **partitioning** options are +built-in for scale -:::{note} -CrateDB supports `column_policy = 'dynamic'` if you want to mix relational and semi-structured models (like JSON) in the same table. -::: - -## Joins & Relationships - -CrateDB supports **inner joins**, **left/right joins**, **cross joins**, **outer joins**, and even **self joins**. - -**Example: Join Customers and Orders** - -```sql -SELECT c.name, o.order_id, o.total_amount -FROM customers c -JOIN orders o ON c.id = o.customer_id -WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'; -``` - -Joins are executed efficiently across shards in a **distributed query planner** that parallelizes execution. ## Normalization vs. Embedding -CrateDB supports both **normalized** (relational) and **denormalized** (embedded JSON) approaches. +CrateDB supports both **normalized** (relational) and **denormalized** (embedded +JSON) approaches with {ref}`column_policy = 'dynamic' `. -* For strict referential integrity and modularity: use normalized tables with joins. -* For performance in high-ingest or read-optimized workloads: embed reference data as nested JSON. +* For strict referential integrity and modularity: use normalized tables with + joins. +* For performance in high-ingest or read-optimized workloads: embed reference + data as nested JSON. Example: Embedded products inside an `orders` table: ```sql CREATE TABLE orders ( order_id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY, + customer_id TEXT, + total_amount DOUBLE, items ARRAY( OBJECT(DYNAMIC) AS ( name TEXT, @@ -66,24 +62,40 @@ CREATE TABLE orders ( ); ``` -:::{note} -CrateDB lets you **query nested fields** directly using bracket notation: `items['name']`, `items['price']`, etc. -::: +:::{note} CrateDB lets you **query nested fields** directly using bracket +notation: `items['name']`, `items['price']`, etc. ::: + +## Joins & Relationships + +CrateDB supports **inner joins**, **left/right joins**, **cross joins**, **outer +joins**, and even **self joins**. + +**Example: Join Customers and Orders** + +```sql +SELECT c.name, o.order_id, o.total_amount +FROM customers c +JOIN orders o ON c.id = o.customer_id +WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'; +``` + +Joins are executed efficiently across shards in a **distributed query planner** +that parallelizes execution. ## Aggregations & Grouping -Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS` ... etc. +Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) +with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS` ... etc. ```sql SELECT customer_id, COUNT(*) AS num_orders, SUM(total_amount) AS revenue FROM orders GROUP BY customer_id -HAVING revenue > 1000; +HAVING SUM(total_amount) > 1000; ``` -:::{note} -CrateDB's **columnar storage** optimizes performance for aggregations—even on large datasets. 
-::: +:::{note} CrateDB's **columnar storage** optimizes performance for +aggregations — even on large datasets. ::: ## Constraints & Indexing @@ -92,9 +104,13 @@ CrateDB supports: * **Primary Keys** – enforced for uniqueness and data distribution * **Check -** enforces custom value validation * **Indexes** – automatic index for all columns -* **Full-text indexes -** manually defined, supports many tokenizers, analyzers and filters +* **Full-text indexes -** manually defined, supports many tokenizers, analyzers + and filters -In CrateDB every column is indexed by default, depending on the datatype a different index is used, indexing is controlled and maintained by the database, there is no need to `vacuum` or `re-index` like in other systems. Indexing can be manually turned off. +In CrateDB every column is indexed by default, depending on the datatype a +different index is used, indexing is controlled and maintained by the database, +there is no need to `vacuum` or `re-index` like in other systems. Indexing can +be manually turned off. ```sql CREATE TABLE products ( @@ -115,7 +131,7 @@ CrateDB supports **views**, **CTEs**, and **nested subqueries**. ```sql CREATE VIEW recent_orders AS SELECT * FROM orders -WHERE created_at >= CURRENT_DATE::TIMESTAMP - INTERVAL '7 days'; +WHERE created_at >= CAST(CURRENT_DATE AS TIMESTAMP) - INTERVAL '7 days'; ``` **Example: Correlated Subquery** @@ -151,6 +167,12 @@ LEFT JOIN order_counts oc * {ref}`SQL join statements ` * {ref}`Join types and their implementation ` * Blog posts: - * [How to fine-tune the query optimizer](https://cratedb.com/blog/join-performance-to-the-rescue) - * [Adding support for joins on virtual tables and multi-row subselects](https://cratedb.com/blog/joins-multi-row-subselects) - * How we made Joins twenty three thousand times faster - part [#1](https://cratedb.com/blog/joins-faster-part-one), [#2](https://cratedb.com/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-two), [#3](https://cratedb.com/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three), [Video](https://cratedb.com/resources/videos/distributed-join-algorithms) + * [How to fine-tune the query + optimizer](https://cratedb.com/blog/join-performance-to-the-rescue) + * [Adding support for joins on virtual tables and multi-row + subselects](https://cratedb.com/blog/joins-multi-row-subselects) + * How we made Joins twenty three thousand times faster - part + [#1](https://cratedb.com/blog/joins-faster-part-one), + [#2](https://cratedb.com/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-two), + [#3](https://cratedb.com/blog/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three), + [Video](https://cratedb.com/resources/videos/distributed-join-algorithms) From 620096969a2012886be842dcd19a95766493b82a Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 5 Sep 2025 16:06:57 +0200 Subject: [PATCH 28/58] Updaterd modelling/json to at least be correct. --- docs/start/modelling/json.md | 138 +++++++++++++---------------------- 1 file changed, 49 insertions(+), 89 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 34073aae..d17626d3 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -10,14 +10,14 @@ CrateDB’s support for dynamic objects, nested structures, and dot-notation querying brings the best of both relational and document-based data modeling—without leaving the SQL world. 
-## Object (JSON) Columns +## A Simple Table with JSON CrateDB allows you to define **object columns** that can store JSON-style data structures. ```sql CREATE TABLE events ( - id UUID PRIMARY KEY, + id TEXT PRIMARY KEY, timestamp TIMESTAMP, payload OBJECT(DYNAMIC) ); @@ -39,7 +39,7 @@ This allows inserting flexible, nested JSON data into `payload`: } ``` -## Column Policy: Strict vs Dynamic +## Column Policy — Strict vs Dynamic You can control how CrateDB handles unexpected fields in an object column: @@ -49,120 +49,92 @@ You can control how CrateDB handles unexpected fields in an object column: | `STRICT` | Only explicitly defined fields are allowed | | `IGNORED` | Extra fields are stored but not indexed or queryable | -Example with explicitly defined fields: +Let’s evolve our table to restrict the structure of `payload`: ```sql -CREATE TABLE sensor_data ( - id UUID PRIMARY KEY, - attributes OBJECT(STRICT) AS ( +CREATE TABLE events2 ( + id TEXT PRIMARY KEY, + timestamp TIMESTAMP, + payload OBJECT(STRICT) AS ( temperature DOUBLE, humidity DOUBLE ) ); ``` +You can no longer use fields other than temperature and humidity in the payload +object. + ## Querying JSON Fields Use **bracket notation** to access nested fields: ```sql -SELECT payload['user']['name'], payload['device']['os'] -FROM events -WHERE payload['action'] = 'login'; +SELECT payload['temperature'], payload['humidity'] +FROM events2 +WHERE payload['temperature'] >= 20.0; ``` CrateDB also supports **filtering, sorting, and aggregations** on nested values: ```sql -SELECT COUNT(*) -FROM events -WHERE payload['device']['os'] = 'Android'; +-- count events with high humidity +SELECT COUNT(*) AS high_humidity_events +FROM events2 +WHERE payload['humidity'] > 70 ``` ```{note} Dot-notation works for both explicitly and dynamically added fields. ``` -## Querying DYNAMIC OBJECTs +## Querying DYNAMIC OBJECTs Safely -To support querying DYNAMIC OBJECTs using SQL, where keys may not exist within -an OBJECT, CrateDB provides the +When working with dynamic objects, some keys may not exist. CrateDB provides the [error_on_unknown_object_key](inv:crate-reference:*:label#conf-session-error_on_unknown_object_key) -session setting. It controls the behaviour when querying unknown object keys to -dynamic objects. +session setting to control behavior in such cases. By default, CrateDB will raise an error if any of the queried object keys are unknown. When adjusting this setting to `false`, it will return `NULL` as the value of the corresponding key. 
```sql -cr> CREATE TABLE testdrive (item OBJECT(DYNAMIC)); +cr> CREATE TABLE events (payload OBJECT(DYNAMIC)); CREATE OK, 1 row affected (0.563 sec) -cr> SELECT item['unknown'] FROM testdrive; -ColumnUnknownException[Column item['unknown'] unknown] +cr> SELECT payload['unknown'] FROM events; +ColumnUnknownException[Column payload['unknown'] unknown] cr> SET error_on_unknown_object_key = false; SET OK, 0 rows affected (0.001 sec) -cr> SELECT item['unknown'] FROM testdrive; -+-----------------+ -| item['unknown'] | -+-----------------+ -+-----------------+ +cr> SELECT payload['unknown'] FROM events; ++-------------------+ +| payload['unknown']| ++-------------------+ ++-------------------+ SELECT 0 rows in set (0.051 sec) ``` -## Arrays of OBJECTs - -Store arrays of objects for multi-valued nested data: - -```sql -CREATE TABLE products ( - id UUID PRIMARY KEY, - name TEXT, - tags ARRAY(TEXT), - specs ARRAY(OBJECT AS ( - name TEXT, - value TEXT - )) -); -``` - -Query nested arrays with filters: - -```sql -SELECT * -FROM products -WHERE 'outdoor' = ANY(tags); -``` +## Aggregating JSON Fields -You can also filter by object array fields: +CrateDB allows full SQL-style aggregations on nested fields: ```sql -SELECT * -FROM products -WHERE specs['name'] = 'battery' AND specs['value'] = 'AA'; +SELECT AVG(payload['temperature']) AS avg_temp +FROM events3 +WHERE payload['humidity'] > 20.0'; ``` ## Combining Structured & Semi-Structured Data -CrateDB supports **hybrid schemas**, mixing standard columns with JSON fields: - -```sql -CREATE TABLE logs ( - id UUID PRIMARY KEY, - service TEXT, - log_level TEXT, - metadata OBJECT(DYNAMIC), - created_at TIMESTAMP -); -``` +As you can see in the events table, CrateDB supports **hybrid schemas**, mixing +standard columns with JSON fields. This allows you to: -* Query by fixed attributes (`log_level`) -* Flexibly store structured or unstructured metadata +* Query by fixed attributes (`temerature`) +* Flexibly store structured or unstructured metadata in `payload` * Add new fields on the fly without migrations ## Indexing Behavior @@ -172,39 +144,27 @@ CrateDB **automatically indexes** object fields if: * Column policy is `DYNAMIC` * Field type can be inferred at insert time -You can also explicitly define and index object fields: +You can also explicitly define and index object fields. Let’s extend the payload +with a message field with full-text index, and also disable index for `humidity`: ```sql -CREATE TABLE metrics ( - id UUID PRIMARY KEY, - data OBJECT(DYNAMIC) AS ( - cpu DOUBLE INDEX USING FULLTEXT, - memory DOUBLE +CREATE TABLE events3 ( + id TEXT PRIMARY KEY, + timestamp TIMESTAMP, + tags ARRAY(TEXT), + payload OBJECT(DYNAMIC) AS ( + temperature DOUBLE, + humidity DOUBLE INDEX OFF, + message TEXT INDEX USING FULLTEXT ) ); ``` -To exclude fields from indexing, set: - -```sql -data['some_field'] INDEX OFF -``` - ```{note} Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. ``` -## Aggregating JSON Fields - -CrateDB allows full SQL-style aggregations on nested fields: - -```sql -SELECT AVG(payload['temperature']) AS avg_temp -FROM sensor_readings -WHERE payload['location'] = 'room1'; -``` - CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. @@ -213,5 +173,5 @@ object fields. 
* Reference Manual: * {ref}`Objects ` * {ref}`Object Column policy ` + * {ref}`json data type ` * {ref}`Inserting objects as JSON ` - * {ref}`json type ` From 61184130517fce0541088dafa21fa138270c8811 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 5 Sep 2025 16:12:40 +0200 Subject: [PATCH 29/58] Minor fixed in modeling/geospatial --- docs/start/modelling/geospatial.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index ae31b0b3..8216b1e2 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -52,7 +52,7 @@ INSERT INTO parks (name, area) VALUES ('My Park', 'POLYGON ((5 5, 30 5, 30 30, 5 30, 5 5))'); ``` -## Querying with spacial operations +## Querying with spatial operations It is e.g. possible to check if a point is within a park in the table: @@ -61,7 +61,7 @@ SELECT name FROM parks WHERE within('POINT(10 10)'::geo_shape, area); ``` -CrateDB provides key scalar functions for spatial operations like, distance(), within(), intersects(), area(), geohash() and lattitude()/longitude(). +CrateDB provides key scalar functions for spatial operations such as distance(), within(), intersects(), area(), geohash() and latitude()/longitude(). Furthermore, it is possible to use the **match** predicate with geospatial data in queries. From c3bcc0c6c5d9e3e6d6f29a0ec69457ed19f0bc29 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Fri, 5 Sep 2025 17:13:12 +0200 Subject: [PATCH 30/58] formatting & fixes. --- docs/start/modelling/geospatial.md | 48 ++++++++++++++++++++++-------- docs/start/modelling/relational.md | 14 +++++---- 2 files changed, 44 insertions(+), 18 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 8216b1e2..6e3c561d 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -1,13 +1,20 @@ (model-geospatial)= # Geospatial data -CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze 2D location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. +CrateDB supports **real-time geospatial analytics at scale**, enabling you to +store, query, and analyze 2D location-based data using standard SQL over two +dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine +spatial data with full-text, vector, JSON, or time-series in the same SQL +queries. The strength of CrateDB's support for geospatial data includes: -* Designed for **real-time geospatial tracking and analytics** (e.g., fleet tracking, mapping, location-layered apps) -* **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query -* **High ingest and query throughput**, suitable for large-scale location-based workloads +* Designed for **real-time geospatial tracking and analytics** (e.g., fleet + tracking, mapping, location-layered apps) +* **Unified SQL platform**: spatial data can be combined with full-text search, + JSON, vectors, time-series — in the same table or query +* **High ingest and query throughput**, suitable for large-scale location-based + workloads ## Geospatial Data Types @@ -18,8 +25,10 @@ CrateDB has two geospatial data types: * Stores a single location via latitude/longitude. 
* Insert using * coordinate array `[lon, lat]` - * [Well-Known Text](https://libgeos.org/specifications/wkt/) (WKT) string `'POINT (lon lat)'`. -* Must be declared explicitly; dynamic schema inference will not detect `geo_point` type. + * [Well-Known Text](https://libgeos.org/specifications/wkt/) (WKT) string + `'POINT (lon lat)'`. +* Must be declared explicitly; dynamic schema inference will not detect + `geo_point` type. ### **geo_shape** @@ -29,7 +38,11 @@ CrateDB has two geospatial data types: * `LineString`, `MultiLineString` * `Polygon`, `MultiPolygon` * `GeometryCollection` -* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold. The indexes are described in the [reference manual](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#type-geo-shape-index). You can choose and configure the indexing method when defining your table schema. +* Indexed using geohash, quadtree, or BKD-tree, with configurable precision + (e.g. `50m`) and error threshold. The indexes are described in the [reference + manual](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#type-geo-shape-index). + You can choose and configure the indexing method when defining your table + schema. ## Defining a Geospatial Column @@ -61,16 +74,25 @@ SELECT name FROM parks WHERE within('POINT(10 10)'::geo_shape, area); ``` -CrateDB provides key scalar functions for spatial operations such as distance(), within(), intersects(), area(), geohash() and latitude()/longitude(). +CrateDB provides key scalar functions for spatial operations such as distance(), +within(), intersects(), area(), geohash() and latitude()/longitude(). -Furthermore, it is possible to use the **match** predicate with geospatial data in queries. +Furthermore, it is possible to use the **match** predicate with geospatial data +in queries. -See the section about {ref}`searching geospatial data ` for details on this. +See the section about {ref}`searching geospatial data ` for details +on this. ## Further Learning & Resources * Reference manual: * {ref}`Geo Search ` - * {ref}`Geo functions `: distance, within, intersects, latitude, longitude, geohash, area -* CrateDB Academy [**Hands-on: Geospatial Data**](https://cratedb.com/academy/fundamentals/data-modelling-with-cratedb/hands-on-geospatial-data) modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries. -* CrateDB Blog: [**Geospatial Queries with CrateDB**](https://cratedb.com/blog/geospatial-queries-with-crate-data) – outlines capabilities, limitations, and practical use cases. + * {ref}`Geo functions `: distance, within, + intersects, latitude, longitude, geohash, area +* CrateDB Academy [**Hands-on: Geospatial + Data**](https://cratedb.com/academy/fundamentals/data-modelling-with-cratedb/hands-on-geospatial-data) + modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) + and example queries. +* CrateDB Blog: [**Geospatial Queries with + CrateDB**](https://cratedb.com/blog/geospatial-queries-with-crate-data) – + outlines capabilities, limitations, and practical use cases. diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index ea0353dc..58e9197c 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -37,7 +37,7 @@ built-in for scale ## Normalization vs. 
Embedding CrateDB supports both **normalized** (relational) and **denormalized** (embedded -JSON) approaches with {ref}`column_policy = 'dynamic' `. +JSON) approaches with {ref}`column_policy = 'dynamic' `. * For strict referential integrity and modularity: use normalized tables with joins. @@ -62,8 +62,10 @@ CREATE TABLE orders ( ); ``` -:::{note} CrateDB lets you **query nested fields** directly using bracket -notation: `items['name']`, `items['price']`, etc. ::: +:::{note} +CrateDB lets you **query nested fields** directly using bracket +notation: `items['name']`, `items['price']`, etc. +::: ## Joins & Relationships @@ -94,8 +96,10 @@ GROUP BY customer_id HAVING SUM(total_amount) > 1000; ``` -:::{note} CrateDB's **columnar storage** optimizes performance for -aggregations — even on large datasets. ::: +:::{note} +CrateDB's **columnar storage** optimizes performance for +aggregations — even on large datasets. +::: ## Constraints & Indexing From e73ff3d9480f8e327f18eefaaccdc816ea718966 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Mon, 8 Sep 2025 09:24:59 +0200 Subject: [PATCH 31/58] Fixed modelling/fulltext --- docs/start/modelling/fulltext.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index aad69b5c..b58d4060 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -73,8 +73,7 @@ BM25. **Basic usage:** ```sql -SELECT title, _score -FROM documents +SELECT title, _score FROM documents WHERE MATCH(ft_body, 'search term') ORDER BY _score DESC; ``` @@ -82,8 +81,11 @@ ORDER BY _score DESC; **Searching multiple indices with weighted ranking:** ```sql -MATCH((ft_title boost 2.0, ft_body), 'keyword') +SELECT title, _score FROM documents +WHERE MATCH((ft_body, title 2.0), 'search term'); +ORDER BY _score DESC; ``` +Here `title` is weighted twice as much as `ft_body`. **You can configure match options like:** From 2f87497c16f5b651670428906bed4f1c8b98a8d5 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Mon, 8 Sep 2025 14:13:51 +0200 Subject: [PATCH 32/58] Minor fixes --- docs/start/modelling/fulltext.md | 7 ++-- docs/start/modelling/timeseries.md | 65 ++++++++++++++++++++---------- 2 files changed, 47 insertions(+), 25 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index b58d4060..2776c95f 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -21,7 +21,8 @@ text columns you want to search: CREATE TABLE documents ( title TEXT, body TEXT, - INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english') + INDEX ft_title USING FULLTEXT(title) WITH (analyzer = 'english'), + INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english') ); ``` @@ -82,10 +83,10 @@ ORDER BY _score DESC; ```sql SELECT title, _score FROM documents -WHERE MATCH((ft_body, title 2.0), 'search term'); +WHERE MATCH((ft_body, ft_title 2.0), 'search term'); ORDER BY _score DESC; ``` -Here `title` is weighted twice as much as `ft_body`. +Here `ft_title` is weighted twice as much as `ft_body`. **You can configure match options like:** diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index f4e7faf3..3aa54401 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -3,17 +3,23 @@ ## Why CrateDB for Time Series? 
-CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. +CrateDB employs a relational representation for time‑series, enabling you to +work with timestamped data using standard SQL, while also seamlessly combining +with document and context data. -* While maintaining a high ingest rate, its **columnar storage** and **automatic indexing** let you access and analyze the data immediately with **fast aggregations** and **near-real-time queries**. -* Handles **high cardin­ality** and **a variety of data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. +* While maintaining a high ingest rate, the **columnar storage** and **automatic + indexing** let you access and analyze the data immediately with **fast + aggregations** and **near-real-time queries**. +* Handles **high cardin­ality** and **a variety of data types**, including + nested JSON, geospatial and vector data — all queryable via the same SQL + statements. ## Data Model Template A typical time‑series schema looks like this: ```sql -CREATE TABLE IF NOT EXISTS devices_readings ( +CREATE TABLE devices_readings ( ts TIMESTAMP WITH TIME ZONE, device_id TEXT, battery OBJECT(DYNAMIC) AS ( @@ -30,7 +36,7 @@ CREATE TABLE IF NOT EXISTS devices_readings ( free BIGINT, used BIGINT ), - month timestamp with time zone GENERATED ALWAYS AS date_trunc('month', ts) + month TIMESTAMP GENERATED ALWAYS AS date_trunc('month', ts) ) PARTITIONED BY (month); ``` @@ -79,19 +85,19 @@ WITH all_hours AS ( generate_series( '2025-01-01', '2025-01-02', - '30 second' :: interval + INTERVAL '30 second' ) AS expected_time ), raw AS ( SELECT ts, - battery ['level'] + battery['level'] FROM devices_readings ) SELECT expected_time, - r.battery ['level'] + r.battery['level'] FROM all_hours LEFT JOIN raw r ON expected_time = r.ts @@ -101,21 +107,27 @@ ORDER BY ### Typical time-series functions -* **Time extraction:** date\_trunc, extract, date\_part, now(), current\_timestamp -* **Time bucketing:** date\_bin, interval, age -* **Window functions:** avg(...) OVER (...), stddev(...) OVER (...), lag, lead, first\_value, last\_value, row\_number, rank, WINDOW ... AS (...) +* **Time extraction:** date_trunc, extract, date_part, now(), current_timestamp +* **Time bucketing:** date_bin, interval, age +* **Window functions:** avg(...) OVER (...), stddev(...) OVER (...), lag, lead, + first_value, last_value, row_number, rank, WINDOW ... AS (...) * **Null handling:** coalesce, nullif -* **Statistical aggregates:** percentile, correlation, stddev, variance, min, max, sum +* **Statistical aggregates:** percentile, correlation, stddev, variance, min, + max, sum * **Advanced filtering & logic:** greatest, least, case when ... then ... end ## Downsampling & Interpolation -To reduce volume while preserving trends, use `DATE_BIN`.\ -Missing data can be handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. +To reduce volume while preserving trends, use `DATE_BIN`. Missing data can be +handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. ## Schema Evolution & Contextual Data -With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. 
For column-level control, use `OBJECT(DYNAMIC)` to auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)`to accept unknown keys without creating or indexing subcolumns. +With `column_policy = 'dynamic'`, ingest JSON payloads containing extra +attributes—new columns are auto‑created and indexed. Perfect for capturing +evolving sensor metadata. For column-level control, use `OBJECT(DYNAMIC)` to +auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)`to accept +unknown keys without creating or indexing subcolumns. You can also store: @@ -127,18 +139,27 @@ All types are supported within the same table or joined together. ## Storage Optimization -* **Partitioning and sharding**: data can be partitioned by time (e.g. daily/monthly) and sharded across a cluster. +* **Partitioning and sharding**: data can be partitioned by time (e.g. + daily/monthly) and sharded across a cluster. * Supports long‑term retention with performant historic storage. * Columnar layout reduces storage footprint and accelerates aggregation queries. ## Advanced Use Cases -* **Exploratory data analysis** (EDA), decomposition, and forecasting via CrateDB’s SQL or by exporting to Pandas/Plotly. -* **Machine learning workflows**: time‑series features and anomaly detection pipelines can be built using CrateDB + external tools +* **Exploratory data analysis** (EDA), decomposition, and forecasting via + CrateDB’s SQL or by exporting to Pandas/Plotly. +* **Machine learning workflows**: time‑series features and anomaly detection + pipelines can be built using CrateDB + external tools ## Further Learning & Resources -* **Documentation:** {ref}`Advanced Time Series Analysis `, {ref}`Time Series Long Term Storage ` -* **Video:** [Time Series Data Modeling](https://cratedb.com/resources/videos/time-series-data-modeling) – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. -* **CrateDB Academy:** [Advanced Time Series Modeling course](https://cratedb.com/academy/time-series/getting-started/introduction-to-time-series-data). -* **Tutorial:** [Downsampling with LTTB algorithm](https://community.cratedb.com/t/advanced-downsampling-with-the-lttb-algorithm/1287) +* **Documentation:** {ref}`Advanced Time Series Analysis `, + {ref}`Time Series Long Term Storage ` +* **Video:** [Time Series Data + Modeling](https://cratedb.com/resources/videos/time-series-data-modeling) – + covers relational & time series, document, geospatial, vector, and full-text + in one tutorial. +* **CrateDB Academy:** [Advanced Time Series Modeling + course](https://cratedb.com/academy/time-series/getting-started/introduction-to-time-series-data). +* **Tutorial:** [Downsampling with LTTB + algorithm](https://community.cratedb.com/t/advanced-downsampling-with-the-lttb-algorithm/1287) From 095dbe561b12257c9a57fe48012d2a0ddf020557 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Mon, 8 Sep 2025 14:14:44 +0200 Subject: [PATCH 33/58] line wrap fix --- docs/start/modelling/timeseries.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index 3aa54401..f75285c5 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -44,22 +44,26 @@ Key points: * `month` is the partitioning key, optimizing data storage and retrieval. * Every column is stored in the column store by default for fast aggregations. 
-* Using **OBJECT columns** provides a structured and efficient way to organize complex nested data in CrateDB, enhancing both data integrity and flexibility. +* Using **OBJECT columns** provides a structured and efficient way to organize + complex nested data in CrateDB, enhancing both data integrity and flexibility. ## Ingesting and Querying ### **Data Ingestion** -* Use SQL `INSERT` or bulk import techniques like `COPY FROM` with JSON or CSV files. +* Use SQL `INSERT` or bulk import techniques like `COPY FROM` with JSON or CSV + files. * Schema inference can often happen automatically during import. ### **Aggregation and Transformations** CrateDB offers built‑in SQL functions tailor‑made for time‑series analyses: -* **`DATE_BIN(interval, timestamp, origin)`** for bucketed aggregations (down‑sampling). +* **`DATE_BIN(interval, timestamp, origin)`** for bucketed aggregations + (down‑sampling). * **Window functions** like `LAG()` and `LEAD()` to detect trends or gaps. -* **`MAX_BY()`** returns the value from one column matching the min/max value of another column in a group. +* **`MAX_BY()`** returns the value from one column matching the min/max value of + another column in a group. **Example**: compute hourly average battery levels and join with metadata: From 7222aadbdc5329399a5cbba8c322c087fd67af4f Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Mon, 8 Sep 2025 15:56:21 +0200 Subject: [PATCH 34/58] wording and reference updates --- docs/start/modelling/fulltext.md | 28 ++++++++++++++-------------- docs/start/modelling/geospatial.md | 11 +++++------ docs/start/modelling/relational.md | 2 +- 3 files changed, 20 insertions(+), 21 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index 2776c95f..c0b6438e 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -2,7 +2,7 @@ # Full-text data CrateDB offers **native full-text search** powered by **Apache Lucene** and Okapi -BM25 ranking, accessible via SQL for easy modeling and querying of large-scale +BM25 ranking, accessible via SQL for easy modelling and querying of large-scale textual data. It supports fuzzy matching, multi-language analysis, and composite indexing, while fully integrating with data types such as JSON, time-series, geospatial, vectors, and more for comprehensive multi-model queries. Whether you @@ -42,7 +42,7 @@ An analyzer splits text into searchable terms and consists of the following comp * **Token Filters -** e.g. lowercase, stemming, stop‑word removal * **Char Filters -** pre-processing (e.g. stripping HTML). -CrateDB offers about 50 [**built-in analyzers**](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-analyzers) supporting more than 30 [languages](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#language). +CrateDB offers about 50 [**built-in analyzers**](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-analyzers) supporting more than 30 [languages](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#language). 
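+
+As a hedged sketch (the table and column names are made up for illustration),
+picking one of these built-in language analyzers is just a matter of naming it
+when the full-text index is defined:
+
+```sql
+CREATE TABLE articles_de (
+    body TEXT,
+    INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'german')
+);
+```
+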
You can **extend** a built-in analyzer: @@ -52,7 +52,7 @@ CREATE ANALYZER german_snowball WITH (language = 'german'); ``` -or create your own **custom** analyzer : +or create your own **custom** analyzer: ```sql CREATE ANALYZER myanalyzer ( @@ -62,7 +62,7 @@ CREATE ANALYZER myanalyzer ( ); ``` -Learn more about the [builtin analyzers](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-analyzers) and how to [define your own](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/fulltext-indices.html#creating-a-custom-analyzer) with custom [tokenizers](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-tokenizers) and [token filters.](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-token-filters) +Learn more about the [built-in analyzers](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-analyzers) and how to [define your own](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/fulltext-indices.html#creating-a-custom-analyzer) with custom [tokenizers](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-tokenizers) and [token filters.](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/analyzers.html#built-in-token-filters) ## Querying: MATCH Predicate & Scoring @@ -147,14 +147,14 @@ constraints, all in one. ## Further Learning & Resources -* [**Full-text Search**](../../feature/search/fts/index.md): In-depth walkthrough of full-text search capabilities. +* [**Full-text Search**](../../feature/search/fts/index.md): In-depth + walkthrough of full-text search capabilities. * Reference Manual: - * [Full-text indices]: Defining indices, extending builtin analyzers, custom analyzers. - * [Full-text analyzers]: Builtin analyzers, tokenizers, token and char filters. - * [SQL MATCH predicate]: Details about MATCH predicate arguments and options. -* [**Hands‑On Academy Course**](https://learn.cratedb.com/cratedb-fundamentals?lesson=fulltext-search): explore FTS on real datasets (e.g. Chicago neighborhoods). - -[Full-text search]: project:#fulltext-search -[Full-text indices]: inv:crate-reference:*:label#fulltext-indices -[Full-text analyzers]: inv:crate-reference:*:label#sql-analyzer -[SQL MATCH predicate]: inv:crate-reference:*:label#sql_dql_fulltext_search + * {ref}`Full-text indices `: Defining + indices, extending builtin analyzers, custom analyzers. + * sql-analyzer>`: Builtin analyzers, tokenizers, token and char filters. + * {ref}`SQL MATCH predicate `: + Details about MATCH predicate arguments and options. +* [**Hands‑On Academy + Course**](https://learn.cratedb.com/cratedb-fundamentals?lesson=fulltext-search): + explore FTS on real datasets (e.g. Chicago neighborhoods). diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 6e3c561d..6acc6285 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -24,13 +24,13 @@ CrateDB has two geospatial data types: * Stores a single location via latitude/longitude. * Insert using - * coordinate array `[lon, lat]` + * coordinate array `[lon, lat]` * [Well-Known Text](https://libgeos.org/specifications/wkt/) (WKT) string `'POINT (lon lat)'`. * Must be declared explicitly; dynamic schema inference will not detect `geo_point` type. -### **geo_shape** +### geo_shape * Represents more complex 2D shapes defined via GeoJSON or WKT formats. 
* Supported geometry types: @@ -51,7 +51,7 @@ Here’s an example of how to define a `GEO_SHAPE` column with a specific index: ```sql CREATE TABLE parks ( name TEXT, - area GEO_SHAPE INDEX USING quadtree + area GEO_SHAPE INDEX USING quadtree WITH (precision = '50m') ); ``` @@ -67,7 +67,7 @@ VALUES ('My Park', 'POLYGON ((5 5, 30 5, 30 30, 5 30, 5 5))'); ## Querying with spatial operations -It is e.g. possible to check if a point is within a park in the table: +For example, check whether a point lies within a park: ```sql SELECT name FROM parks @@ -80,8 +80,7 @@ within(), intersects(), area(), geohash() and latitude()/longitude(). Furthermore, it is possible to use the **match** predicate with geospatial data in queries. -See the section about {ref}`searching geospatial data ` for details -on this. +See {ref}`Geo Search ` for details. ## Further Learning & Resources diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index 58e9197c..8faeb8a3 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -2,7 +2,7 @@ # Relational data CrateDB is a **distributed SQL database** that offers rich **relational data -modeling** with the flexibility of dynamic schemas and the scalability of NoSQL +modelling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary keys,** **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector search, and From 5addf404b4de35fded86ccf8135bd0594154fbe4 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Mon, 8 Sep 2025 16:29:31 +0200 Subject: [PATCH 35/58] Fixes from Coderabbit feedback --- docs/start/modelling/geospatial.md | 4 ++-- docs/start/modelling/json.md | 6 +++--- docs/start/modelling/relational.md | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 6acc6285..375b9623 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -20,7 +20,7 @@ The strength of CrateDB's support for geospatial data includes: CrateDB has two geospatial data types: -### geo_point +### GEO_POINT * Stores a single location via latitude/longitude. * Insert using @@ -30,7 +30,7 @@ CrateDB has two geospatial data types: * Must be declared explicitly; dynamic schema inference will not detect `geo_point` type. -### geo_shape +### GEO_SHAPE * Represents more complex 2D shapes defined via GeoJSON or WKT formats. * Supported geometry types: diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index d17626d3..cebbad65 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -8,7 +8,7 @@ diverse or evolving schemas. CrateDB’s support for dynamic objects, nested structures, and dot-notation querying brings the best of both relational and document-based data -modeling—without leaving the SQL world. +modelling — without leaving the SQL world. ## A Simple Table with JSON @@ -123,7 +123,7 @@ CrateDB allows full SQL-style aggregations on nested fields: ```sql SELECT AVG(payload['temperature']) AS avg_temp FROM events3 -WHERE payload['humidity'] > 20.0'; +WHERE payload['humidity'] > 20.0; ``` ## Combining Structured & Semi-Structured Data @@ -133,7 +133,7 @@ standard columns with JSON fields. 
This allows you to: -* Query by fixed attributes (`temerature`) +* Query by fixed attributes (`temperature`) * Flexibly store structured or unstructured metadata in `payload` * Add new fields on the fly without migrations diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index 8faeb8a3..37678a04 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -122,7 +122,7 @@ CREATE TABLE products ( name TEXT, price DOUBLE CHECK (price >= 0), tag TEXT INDEX OFF, - description TEXT INDEX using fulltext + description TEXT INDEX USING FULLTEXT ); ``` From 1f565ba8d2ab92bc29fa842ada0200b0923ddbe1 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Mon, 8 Sep 2025 16:49:55 +0200 Subject: [PATCH 36/58] Coderabbit fixes and minor adjustments of primary-key page in modelling --- docs/start/modelling/primary-key.md | 31 ++++++++++++++--------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md index 744224a7..4408eff2 100644 --- a/docs/start/modelling/primary-key.md +++ b/docs/start/modelling/primary-key.md @@ -37,7 +37,9 @@ sequences if you want to. This option involves declaring a column using `DEFAULT now()`. ```psql -BIGINT DEFAULT now() PRIMARY KEY +CREATE TABLE example ( + id BIGINT DEFAULT now() PRIMARY KEY +); ``` :Pros: @@ -52,7 +54,9 @@ BIGINT DEFAULT now() PRIMARY KEY This option involves declaring a column using `DEFAULT gen_random_text_uuid()`. ```psql -TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY +CREATE TABLE example2 ( + id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY +); ``` :Pros: @@ -69,11 +73,8 @@ TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY [UUIDv7] is a new format that preserves **temporal ordering**, making UUIDs better suited for inserts and range queries in distributed databases. -We can use these in CrateDB with an UDF with the code from [UUIDv7 in N languages]. - You can use [UUIDv7 for CrateDB] via a {ref}`User-Defined Function (UDF) ` -in JavaScript, or in your preferred programming language by using one of the -available UUIDv7 libraries. +in JavaScript, or use a [UUIDv7 library] in your application layer. :Pros: - Globally unique and **almost sequential** @@ -118,8 +119,8 @@ values even when many ingestion processes run in parallel. Create a table to keep the latest values for the sequences. ```psql CREATE TABLE sequences ( - name TEXT PRIMARY KEY, - last_value BIGINT + name TEXT PRIMARY KEY, + last_value BIGINT ) CLUSTERED INTO 1 SHARDS; ``` @@ -134,8 +135,8 @@ VALUES ('mysequence',0); Start an example with a newly defined table. ```psql CREATE TABLE mytable ( - id BIGINT PRIMARY KEY, - field1 TEXT + id BIGINT PRIMARY KEY, + field1 TEXT ); ``` @@ -145,9 +146,9 @@ Use optimistic concurrency control to generate unique, incrementing values even in parallel ingestion scenarios. The Python code below reads the last value used from the sequences table, and -then attempts an [optimistic UPDATE] with a `RETURNING` clause, if a +then attempts an [optimistic UPDATE] with a `RETURNING` clause. If a contending process already consumed the identity nothing will be returned so our -process will retry until a value is returned, then it uses that value as the new +process will retry until a value is returned. Then it uses that value as the new ID for the record we are inserting into the `mytable` table. ```python @@ -162,7 +163,6 @@ ID for the record we are inserting into the `mytable` table. 
# /// import time - import records db = records.Database("crate://") @@ -173,9 +173,7 @@ base_delay = 0.1 # 100 milliseconds for attempt in range(max_retries): select_query = """ - SELECT last_value, - _seq_no, - _primary_term + SELECT last_value, _seq_no, _primary_term FROM sequences WHERE name = :sequence_name; """ @@ -232,3 +230,4 @@ db.close() [udf]: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html [UUIDv7]: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-7 [UUIDv7 for CrateDB]: https://github.com/nalgeon/uuidv7/blob/main/src/uuidv7.cratedb +[UUIDv7 library]: https://github.com/nalgeon/uuidv7 From 65dd655293b86bc09711772bc8d4d073fab08b00 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Tue, 9 Sep 2025 22:21:57 +0200 Subject: [PATCH 37/58] small fixes in timeseries --- docs/start/modelling/timeseries.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index f75285c5..53e491ef 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -130,8 +130,8 @@ handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. For column-level control, use `OBJECT(DYNAMIC)` to -auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)`to accept -unknown keys without creating or indexing subcolumns. +auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)` to accept +unknown keys without creating or indexing subcolumns. You can also store: @@ -160,10 +160,10 @@ All types are supported within the same table or joined together. * **Documentation:** {ref}`Advanced Time Series Analysis `, {ref}`Time Series Long Term Storage ` * **Video:** [Time Series Data - Modeling](https://cratedb.com/resources/videos/time-series-data-modeling) – + Modelling](https://cratedb.com/resources/videos/time-series-data-modeling) – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. -* **CrateDB Academy:** [Advanced Time Series Modeling +* **CrateDB Academy:** [Advanced Time Series Modelling course](https://cratedb.com/academy/time-series/getting-started/introduction-to-time-series-data). * **Tutorial:** [Downsampling with LTTB algorithm](https://community.cratedb.com/t/advanced-downsampling-with-the-lttb-algorithm/1287) From 6df493ab7f1893939ae36e6d27936df5acaeac25 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Tue, 9 Sep 2025 22:24:15 +0200 Subject: [PATCH 38/58] remove advanced use-cases from timeseries. --- docs/start/modelling/timeseries.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index 53e491ef..3a66833e 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -148,13 +148,6 @@ All types are supported within the same table or joined together. * Supports long‑term retention with performant historic storage. * Columnar layout reduces storage footprint and accelerates aggregation queries. -## Advanced Use Cases - -* **Exploratory data analysis** (EDA), decomposition, and forecasting via - CrateDB’s SQL or by exporting to Pandas/Plotly. 
-* **Machine learning workflows**: time‑series features and anomaly detection - pipelines can be built using CrateDB + external tools - ## Further Learning & Resources * **Documentation:** {ref}`Advanced Time Series Analysis `, From 3e52799921d2583f2796e7c791513721766378b7 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:33:57 +0200 Subject: [PATCH 39/58] Add default now() to `created_at` column --- docs/start/modelling/relational.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index 37678a04..33e276b2 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -58,7 +58,7 @@ CREATE TABLE orders ( price DOUBLE ) ), - created_at TIMESTAMP + created_at TIMESTAMP DEFAULT now() ); ``` From 9f46dca5d2946a7bbd74619720e842982b7cda08 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:36:00 +0200 Subject: [PATCH 40/58] Small fix on index off --- docs/start/modelling/relational.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index 33e276b2..314b5fdb 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -114,14 +114,14 @@ CrateDB supports: In CrateDB every column is indexed by default, depending on the datatype a different index is used, indexing is controlled and maintained by the database, there is no need to `vacuum` or `re-index` like in other systems. Indexing can -be manually turned off. +be manually turned off with `INDEX OFF`. ```sql CREATE TABLE products ( id TEXT PRIMARY KEY, name TEXT, price DOUBLE CHECK (price >= 0), - tag TEXT INDEX OFF, + tag TEXT INDEX OFF, -- <------- INDEX WILL NOT BE CREATED description TEXT INDEX USING FULLTEXT ); ``` From c38a1f9634dc15044cecb7de94ea04cca323982e Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:46:58 +0200 Subject: [PATCH 41/58] fix hallucination --- docs/start/modelling/json.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index cebbad65..a15cd627 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -6,7 +6,7 @@ It enables you to store, query, and index **semi-structured JSON data** using **standard SQL**, making it an excellent choice for applications that handle diverse or evolving schemas. -CrateDB’s support for dynamic objects, nested structures, and dot-notation +CrateDB’s support for dynamic objects, nested structures, and bracket notation querying brings the best of both relational and document-based data modelling — without leaving the SQL world. @@ -85,7 +85,7 @@ WHERE payload['humidity'] > 70 ``` ```{note} -Dot-notation works for both explicitly and dynamically added fields. +Bracket notation works for both explicitly and dynamically added fields. ``` ## Querying DYNAMIC OBJECTs Safely From f09f42c8cbdabba94992426fb636be699a70baf8 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:48:19 +0200 Subject: [PATCH 42/58] object is dynamic by default --- docs/start/modelling/json.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index a15cd627..607de570 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -19,7 +19,7 @@ structures. 
CREATE TABLE events ( id TEXT PRIMARY KEY, timestamp TIMESTAMP, - payload OBJECT(DYNAMIC) + payload OBJECT ); ``` @@ -43,11 +43,11 @@ This allows inserting flexible, nested JSON data into `payload`: You can control how CrateDB handles unexpected fields in an object column: -| Column Policy | Behavior | -| ------------- | ----------------------------------------------------------- | -| `DYNAMIC` | New fields are automatically added to the schema at runtime | -| `STRICT` | Only explicitly defined fields are allowed | -| `IGNORED` | Extra fields are stored but not indexed or queryable | +| Column Policy | Behavior | +| ------------- |-----------------------------------------------------------------------| +| `DYNAMIC` | (Default) New fields are automatically added to the schema at runtime | +| `STRICT` | Only explicitly defined fields are allowed | +| `IGNORED` | Extra fields are stored but not indexed or queryable | Let’s evolve our table to restrict the structure of `payload`: From 3339e7cdbdb0d878d0482f643bac6e52f724586b Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:50:08 +0200 Subject: [PATCH 43/58] remove unnecessary line --- docs/start/modelling/json.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 607de570..ae8726b9 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -112,7 +112,6 @@ cr> SELECT payload['unknown'] FROM events; +-------------------+ | payload['unknown']| +-------------------+ -+-------------------+ SELECT 0 rows in set (0.051 sec) ``` From d9e420bc6ea1ef937d45ccf03293d8939109b0f5 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:50:52 +0200 Subject: [PATCH 44/58] Use `event` table consistently --- docs/start/modelling/json.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index ae8726b9..799c7104 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -121,7 +121,7 @@ CrateDB allows full SQL-style aggregations on nested fields: ```sql SELECT AVG(payload['temperature']) AS avg_temp -FROM events3 +FROM events WHERE payload['humidity'] > 20.0; ``` @@ -147,7 +147,7 @@ You can also explicitly define and index object fields. 
Let’s extend the paylo with a message field with full-text index, and also disable index for `humidity`: ```sql -CREATE TABLE events3 ( +CREATE TABLE events ( id TEXT PRIMARY KEY, timestamp TIMESTAMP, tags ARRAY(TEXT), From 5a3a849f8bb919d1ecbc8be137550a35cb3328fe Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:51:37 +0200 Subject: [PATCH 45/58] minor tweak --- docs/start/modelling/json.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 799c7104..81f06a45 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -134,7 +134,7 @@ This allows you to: * Query by fixed attributes (`temperature`) * Flexibly store structured or unstructured metadata in `payload` -* Add new fields on the fly without migrations +* Add new fields on the fly without altering a table, skipping migrations ## Indexing Behavior From afb38cb9f8ab3ed0e35ab261da04db7bfed06c69 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:56:08 +0200 Subject: [PATCH 46/58] remove 'schema explosion' its confusing --- docs/start/modelling/json.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 81f06a45..219ea4b2 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -160,8 +160,8 @@ CREATE TABLE events ( ``` ```{note} -Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` -if needed. +When using dynamic objects too many columns could be created, the default per table is 1000, more could impact performance. + Use `STRICT` or `IGNORED`if needed. ``` CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on From 65010db828c881dbbc3223ac49a0a296d919a532 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 11:57:29 +0200 Subject: [PATCH 47/58] tweak comment on object fields --- docs/start/modelling/json.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 219ea4b2..15fc020f 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -164,8 +164,8 @@ When using dynamic objects too many columns could be created, the default per ta Use `STRICT` or `IGNORED`if needed. ``` -CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on -object fields. +Object fields are treated as any other column, therefore **`GROUP BY`**, **`HAVING`**, and **window functions** +are supported. 
## Further Learning & Resources From e64c022399d7f39dea365f45dd3f958757090892 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 12:11:33 +0200 Subject: [PATCH 48/58] improve consistency --- docs/start/modelling/timeseries.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index 3a66833e..d7f51642 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -22,17 +22,17 @@ A typical time‑series schema looks like this: CREATE TABLE devices_readings ( ts TIMESTAMP WITH TIME ZONE, device_id TEXT, - battery OBJECT(DYNAMIC) AS ( + battery OBJECT AS ( level BIGINT, status TEXT, temperature DOUBLE PRECISION ), - cpu OBJECT(DYNAMIC) AS ( + cpu OBJECT AS ( avg_1min DOUBLE PRECISION, avg_5min DOUBLE PRECISION, avg_15min DOUBLE PRECISION ), - memory OBJECT(DYNAMIC) AS ( + memory OBJECT AS ( free BIGINT, used BIGINT ), @@ -62,7 +62,7 @@ CrateDB offers built‑in SQL functions tailor‑made for time‑series analyses * **`DATE_BIN(interval, timestamp, origin)`** for bucketed aggregations (down‑sampling). * **Window functions** like `LAG()` and `LEAD()` to detect trends or gaps. -* **`MAX_BY()`** returns the value from one column matching the min/max value of +* **`MAX_BY(returnField, SearchField)` / `MIN_BY(returnField, SearchField)` ** returns the value from one column matching the min/max value of another column in a group. **Example**: compute hourly average battery levels and join with metadata: @@ -111,14 +111,14 @@ ORDER BY ### Typical time-series functions -* **Time extraction:** date_trunc, extract, date_part, now(), current_timestamp -* **Time bucketing:** date_bin, interval, age -* **Window functions:** avg(...) OVER (...), stddev(...) OVER (...), lag, lead, - first_value, last_value, row_number, rank, WINDOW ... AS (...) +* **Time extraction:** `date_trunc(...)`, `extract(...)`, `date_part(...)`, `now()`, `current_timestamp` +* **Time bucketing:** `date_bin()`, `interval`, `age()` +* **Window functions:** `avg(...)`, `over(...)`, `lag(...)`, `lead(...)`, + `first_value(...)`, `last_value(...)`, `row_number()`, `rank()` , `WINDOW ... AS (...)` * **Null handling:** coalesce, nullif -* **Statistical aggregates:** percentile, correlation, stddev, variance, min, - max, sum -* **Advanced filtering & logic:** greatest, least, case when ... then ... end +* **Statistical aggregates:** `percentile(...)`, `stddev(...)`, `variance()`, `min()`, + `max(...)`, `sum(...)`, `topk(...)` +* **Advanced filtering & logic:** `greatest(...)`, `least(...)`, `case when ... then ... end` ## Downsampling & Interpolation From ab28efa97e0ccb7776749e8aa0522a4d06174274 Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 12:14:46 +0200 Subject: [PATCH 49/58] improve consistency again --- docs/start/modelling/geospatial.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 375b9623..0ff74f8e 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -74,8 +74,8 @@ SELECT name FROM parks WHERE within('POINT(10 10)'::geo_shape, area); ``` -CrateDB provides key scalar functions for spatial operations such as distance(), -within(), intersects(), area(), geohash() and latitude()/longitude(). 
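+
+Along the same lines, a hedged sketch (the coordinates are arbitrary) using the
+`distance()` scalar, which returns the distance between two points in meters:
+
+```sql
+SELECT distance('POINT (9.7438 47.4124)'::geo_point,
+                'POINT (13.4050 52.5200)'::geo_point) AS distance_in_meters;
+```
+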
+CrateDB provides key scalar functions for spatial operations such as `distance(...)`, +`within(...)`, `intersects(...)`, `area(...)`, `geohash(...)`, `latitude(...)` and `longitude(...)`. Furthermore, it is possible to use the **match** predicate with geospatial data in queries. From 7243df93a36ef41e9c3603c1a629ca4143caacdf Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 10 Sep 2025 12:22:54 +0200 Subject: [PATCH 50/58] minor tweak --- docs/start/modelling/primary-key.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md index 4408eff2..d809ec4b 100644 --- a/docs/start/modelling/primary-key.md +++ b/docs/start/modelling/primary-key.md @@ -24,7 +24,7 @@ CrateDB is designed for horizontal scalability and [high ingestion throughput]. To achieve this, operations must complete independently on each node—without central coordination. This design choice means CrateDB does **not** support traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL -or MySQL by default. +or MySQL. :::{rubric} Solutions ::: From 3a59299e38c06c6020af6f1b2d88eb2337aa26e6 Mon Sep 17 00:00:00 2001 From: Karyn Azevedo Date: Wed, 10 Sep 2025 12:15:09 +0100 Subject: [PATCH 51/58] Add `devices_info` table definition to timeseries page Added a simple table definition to allow the following join query to successfully run --- docs/start/modelling/timeseries.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index d7f51642..b2d66f62 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -38,6 +38,14 @@ CREATE TABLE devices_readings ( ), month TIMESTAMP GENERATED ALWAYS AS date_trunc('month', ts) ) PARTITIONED BY (month); + +CREATE TABLE devices_info ( + "device_id" TEXT, + "api_version" TEXT, + "manufacturer" TEXT, + "model" TEXT, + "os_name" TEXT +); ``` Key points: From 9ccf253ffe8b16997620b2b82c1470d37880c1a4 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Wed, 10 Sep 2025 18:25:23 +0200 Subject: [PATCH 52/58] fixed bug in reference --- docs/start/modelling/fulltext.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index c0b6438e..2e9ab927 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -152,9 +152,9 @@ constraints, all in one. * Reference Manual: * {ref}`Full-text indices `: Defining indices, extending builtin analyzers, custom analyzers. - * sql-analyzer>`: Builtin analyzers, tokenizers, token and char filters. + * {ref}`Full-text analyzers `: Builtin + analyzers, tokenizers, token and char filters. * {ref}`SQL MATCH predicate `: Details about MATCH predicate arguments and options. -* [**Hands‑On Academy - Course**](https://learn.cratedb.com/cratedb-fundamentals?lesson=fulltext-search): +* [**Hands‑On Academy Course**](https://learn.cratedb.com/cratedb-fundamentals?lesson=fulltext-search): explore FTS on real datasets (e.g. Chicago neighborhoods). 
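As a small illustration of how the `devices_info` table added above can be combined with `devices_readings` and the `MAX_BY()` aggregation mentioned earlier, a query sketch (assuming both tables are populated; this query is not part of the patches themselves) could look like this:

```sql
-- Latest battery level per device: MAX_BY picks the battery value that
-- belongs to the most recent timestamp within each group.
SELECT r.device_id,
       i.model,
       max(r.ts) AS last_seen,
       max_by(r.battery['level'], r.ts) AS last_battery_level
FROM devices_readings r
JOIN devices_info i ON i.device_id = r.device_id
GROUP BY r.device_id, i.model
ORDER BY last_seen DESC;
```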
From 2a1e24e99e73eed26f962f910b9e0200d6f29c72 Mon Sep 17 00:00:00 2001
From: Brian Munkholm
Date: Wed, 10 Sep 2025 18:26:35 +0200
Subject: [PATCH 53/58] Minor fixes in json modelling

---
 docs/start/modelling/json.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md
index 15fc020f..ea4b65b5 100644
--- a/docs/start/modelling/json.md
+++ b/docs/start/modelling/json.md
@@ -147,7 +147,7 @@ You can also explicitly define and index object fields.
 Let’s extend the paylo with a message field with full-text index, and also
 disable index for `humidity`:
 ```sql
-CREATE TABLE events (
+CREATE TABLE events3 (
     id TEXT PRIMARY KEY,
     timestamp TIMESTAMP,
     tags ARRAY(TEXT),
@@ -160,12 +160,13 @@ CREATE TABLE events (
 ```

 ```{note}
-When using dynamic objects too many columns could be created, the default per table is 1000, more could impact performance.
- Use `STRICT` or `IGNORED`if needed.
+When using dynamic objects, many new columns may be created; the default limit
+is 1000 columns per table, and going beyond that can impact performance.
+Use `STRICT` or `IGNORED` if needed.
 ```

-Object fields are treated as any other column, therefore **`GROUP BY`**, **`HAVING`**, and **window functions**
-are supported.
+Object fields are treated like any other column, therefore **`GROUP BY`**,
+**`HAVING`**, and **window functions** are supported.

From 7b5ec7b7808ab009229183132f147f1b6435f03f Mon Sep 17 00:00:00 2001
From: Brian Munkholm
Date: Wed, 10 Sep 2025 18:26:55 +0200
Subject: [PATCH 54/58] wording in vector.md

---
 docs/start/modelling/vector.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md
index cdea297a..4efc41ba 100644
--- a/docs/start/modelling/vector.md
+++ b/docs/start/modelling/vector.md
@@ -64,7 +64,7 @@ ORDER BY _score DESC;
 ```

 This ranks results by **vector similarity** to the vector supplied by searching
-max 2 nearest neighbours.
+for the top 2 nearest neighbours.

 ## Further Learning & Resources

From d8e29e5ca8416cf6e2c16d7ad301e9d1b3726ce4 Mon Sep 17 00:00:00 2001
From: surister
Date: Thu, 11 Sep 2025 11:32:12 +0200
Subject: [PATCH 55/58] Remove references to UUID4

---
 docs/start/modelling/primary-key.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md
index d809ec4b..daabb831 100644
--- a/docs/start/modelling/primary-key.md
+++ b/docs/start/modelling/primary-key.md
@@ -50,11 +50,11 @@ CREATE TABLE example (
 - Can result in gaps
 - Collisions possible if multiple records are created in the same millisecond

-## Using UUIDv4 identifiers
+## Using elasticflake identifiers

 This option involves declaring a column using `DEFAULT gen_random_text_uuid()`.
```psql -CREATE TABLE example2 ( +CREATE TABLE example ( id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY ); ``` @@ -218,7 +218,7 @@ db.close() | Strategy | Ordered | Unique | Scalable | Human-friendly | Range queries | Notes | |---------------------|----------| ------ | -------- |----------------|---------------| -------------------- | | Timestamp | ✅ | ⚠️ | ✅ | ✅ | ✅ | Potential collisions | -| UUIDv4 | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs | +| Elasticflake | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs | | UUIDv7 | ✅ | ✅ | ✅ | ❌ | ✅ | Requires UDF | | External system IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source | | Sequence table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic | From d5cfb9938db59bd0b51744d92fda9f3220f3f727 Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Thu, 11 Sep 2025 15:05:08 +0200 Subject: [PATCH 56/58] enum example table name --- docs/start/modelling/primary-key.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md index daabb831..ca36e203 100644 --- a/docs/start/modelling/primary-key.md +++ b/docs/start/modelling/primary-key.md @@ -54,7 +54,7 @@ CREATE TABLE example ( This option involves declaring a column using `DEFAULT gen_random_text_uuid()`. ```psql -CREATE TABLE example ( +CREATE TABLE example2 ( id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY ); ``` From 9973fe5c2e2ee834dc579da7a7e742ff75b83f8a Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Thu, 11 Sep 2025 16:08:30 +0200 Subject: [PATCH 57/58] fix index reference to model-promary-key --- docs/start/modelling/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/modelling/index.md b/docs/start/modelling/index.md index ae468744..3bf5a8c5 100644 --- a/docs/start/modelling/index.md +++ b/docs/start/modelling/index.md @@ -116,5 +116,5 @@ solutions are required instead. :maxdepth: 1 :hidden: -Primary key strategies +Primary key strategies ``` From bddacdb2d2c87bd6bd38b926c6e0adbc098e0b9c Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Thu, 11 Sep 2025 16:10:47 +0200 Subject: [PATCH 58/58] revert fix. --- docs/start/modelling/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/modelling/index.md b/docs/start/modelling/index.md index 3bf5a8c5..ae468744 100644 --- a/docs/start/modelling/index.md +++ b/docs/start/modelling/index.md @@ -116,5 +116,5 @@ solutions are required instead. :maxdepth: 1 :hidden: -Primary key strategies +Primary key strategies ```
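As a closing usage sketch for the primary-key examples above: the `id` column is filled in by the server, so inserts only need to supply the remaining columns (the extra `payload` column and the sample values are illustrative and not part of the patches):

```sql
-- gen_random_text_uuid() supplies the identifier; only payload is inserted.
CREATE TABLE IF NOT EXISTS example2 (
    id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY,
    payload TEXT
);

INSERT INTO example2 (payload) VALUES ('first event'), ('second event');

-- Make the freshly written rows visible to the following query.
REFRESH TABLE example2;

SELECT id, payload FROM example2;
```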