Commit 059d7ec

Indexing and storage: Cross linking
1 parent fea8b05 commit 059d7ec

5 files changed

+54
-58
lines changed

docs/explain/index.md

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,11 @@ about applications and use cases of CrateDB, trying to put things into a
 bigger picture and joining things together to help answer the question _why_?


+:::{rubric} 2021
+:::
+
+- {ref}`indexing-and-storage`
+
 :::{rubric} 2018
 :::

docs/feature/index/index.md

Lines changed: 6 additions & 14 deletions
@@ -84,23 +84,16 @@ with solutions from other vendors.



-::::{info-card}
-:::{grid-item}
-:columns: auto 9 9 9
-**Blog: Indexing and Storage in CrateDB**
-
-{{ '{}[Indexing and Storage in CrateDB]'.format(blog) }}
-
+::::{card} Blog: Indexing and Storage in CrateDB
+:link: indexing-and-storage
+:link-type: ref
 Learn about the fundamentals of the CrateDB storage layer,
 looking at the three main Lucene structures that are used within CrateDB:
-Inverted Indexes for text values, BKD-trees for numeric values, and Doc Values.
-:::
-:::{grid-item}
-:columns: auto 3 3 3
-{tags-primary}`Fundamentals` \
+Inverted indexes for text values, BKD trees for numeric values, and doc values.
++++
+{tags-primary}`Fundamentals`
 {tags-secondary}`Converged Indexing`
 {tags-secondary}`Deep Dive`
-:::
 ::::


@@ -159,7 +152,6 @@ bit thin.
 [Elasticsearch for Dummies]: https://dzone.com/articles/elasticsearch-for-dummies
 [Elasticsearch: Documents and Indices]: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
 [Independent comparison of CrateDB and MongoDB using Time Series Benchmark Suite]: https://blog.nyrkio.com/wp-content/uploads/2024/07/Nyrkio-comparison-of-CrateDB-and-MongoDB-with-TSBS-v2.pdf
-[Indexing and Storage in CrateDB]: https://cratedb.com/blog/indexing-and-storage-in-cratedb
 [Searching and Indexing With Apache Lucene]: https://dzone.com/articles/apache-lucene-a-high-performance-and-full-featured
 [Time Series Benchmark on CrateDB and MongoDB]: https://blog.nyrkio.com/2024/07/11/timeseries-benchmark-on-cratedb-and-mongodb/
 [TimescaleDB Time Series Benchmark Suite (TSBS)]: https://github.com/timescale/tsbs

docs/feature/search/fts/index.md

Lines changed: 9 additions & 27 deletions
@@ -61,6 +61,7 @@ of a [search engine].
 - {ref}`vector-search`
 - {ref}`hybrid-search`
 - {ref}`query`
+- {ref}`storage-layer`

 {tags-primary}`SQL`
 {tags-primary}`Full-Text Search`
@@ -301,41 +302,23 @@ by exploring how to manage a dataset of Netflix titles.
 :::{rubric} Explanations
 :::

-::::{info-card}
-:::{grid-item}
-:columns: auto 9 9 9
-**Indexing and Storage in CrateDB**
+:::{card} Indexing and Storage in CrateDB
+:link: indexing-and-storage
+:link-type: ref

 This article explores the internal workings of the storage layer in CrateDB,
 with a focus on Lucene's indexing strategies.

-{hyper-navigate}`Indexing and Storage in CrateDB <[Indexing and Storage in CrateDB]>`
-
 The CrateDB storage layer is based on Lucene indexes.
-Lucene offers scalable and high-performance indexing which enables efficient search
+Lucene offers scalable and high-performance indexing, which enables efficient search
 and aggregations over documents and rapid updates to the existing documents.
-We will look at the three main Lucene structures that are used within CrateDB:
-Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values.
-
-:Inverted Index:
-You will learn how inverted indexes are implemented in Lucene and CrateDB.
-
-:BKD Tree:
-Better understand the BKD tree, starting from KD trees, and how this data
-structure supports range queries in CrateDB.
-
-:Doc Values:
-This data structure supports more efficient querying document fields by id,
-performs column-oriented retrieval of data, and improves the performance of
-aggregation and sorting operations.
++++
+CrateDB uses three important Lucene data structures:
+Inverted indexes for text values, BKD trees for numeric values, and doc values.

-:::
-:::{grid-item}
-:columns: auto 3 3 3
-{tags-primary}`Introduction` \
+{tags-primary}`Introduction`
 {tags-secondary}`Lucene Indexing`
 :::
-::::


 :::{card} Indexing Text for Both Effective Search and Accurate Analysis
@@ -374,7 +357,6 @@ effective-search
 [BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
 [BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
 [full-text search]: https://en.wikipedia.org/wiki/Full_text_search
-[Indexing and Storage in CrateDB]: https://cratedb.com/blog/indexing-and-storage-in-cratedb
 [MATCH predicate]: inv:crate-reference#predicates_match
 [Okapi BM25]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/okapi_trec3.pdf
 [search engine]: https://en.wikipedia.org/wiki/Search_engine

docs/feature/storage/index.md

Lines changed: 17 additions & 8 deletions
@@ -1,3 +1,4 @@
+(storage-internals)=
 (storage-layer)=
 # Storage Layer

@@ -11,8 +12,8 @@ The CrateDB storage layer is based on Lucene.
 By default, all fields are indexed,
 nested or not, but the indexing can be turned off selectively.

-This page enumerates some concepts of Lucene. The article [Indexing and Storage in
-CrateDB] goes into more details by exploring its internal workings.
+This page enumerates some concepts of Lucene. The article {ref}`indexing-and-storage`
+goes into more details by exploring its internal workings.

 ## Lucene

@@ -49,7 +50,7 @@ Elasticsearch are building upon the same technologies.
 ## Data structures

 CrateDB uses three main data structures of Lucene:
-Inverted indexes for text values, BKD trees for numeric values, and DocValues.
+Inverted indexes for text values, BKD trees for numeric values, and doc values.

 - **Inverted index**

@@ -69,7 +70,7 @@ Inverted indexes for text values, BKD trees for numeric values, and DocValues.

 To optimize numeric range queries, Lucene uses an implementation of the Block KD (BKD)
 tree data structure. The BKD tree index structure is suitable for indexing large
-multi-dimensional point data sets. It is an I/O-efficient dynamic data structure based
+multidimensional point data sets. It is an I/O-efficient dynamic data structure based
 on the KD tree. Contrary to its predecessors, the BKD tree maintains its high space
 utilization and excellent query and update performance regardless of the number of
 updates performed on it.
@@ -78,17 +79,25 @@ Inverted indexes for text values, BKD trees for numeric values, and DocValues.
 including fields defined as `TIMESTAMP` types, supporting performant date range
 queries.

-- **DocValues**
+- **Doc values**

 Because Lucene's inverted index data structure implementation is not optimal for
 finding field values by given document identifier, and for performing column-oriented
-retrieval of data, the DocValues data structure is used for those purposes instead.
+retrieval of data, the doc values data structure is used for those purposes instead.

-DocValues is a column-based data storage built at document index time. They store
+Doc values is a column-based data storage built at document index time. They store
 all field values that are not analyzed as strings in a compact column, making it more
 effective for sorting and aggregations.

+## See also
+
+- {ref}`indexing-and-storage`
+
+
+:::{toctree}
+:hidden:
+indexing-and-storage
+:::


 [column-based store]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/storage.html
-[Indexing and Storage in CrateDB]: https://cratedb.com/blog/indexing-and-storage-in-cratedb

docs/feature/storage/indexing-and-storage.md

Lines changed: 17 additions & 9 deletions
@@ -1,5 +1,4 @@
 (indexing-and-storage)=
-(storage-internals)=

 # Indexing and storage in CrateDB

@@ -155,6 +154,9 @@ Lucene 6.0 adds an implementation of Block KD (BKD) tree data structure.

 ### BKD tree

+> Lucene's k-d tree geospatial data structure offers fast single- and
+multidimensional numeric range and geospatial point-in-shape filtering.
+
 To better understand the BKD tree data structure, let's begin with an
 introduction to KD trees. A KD tree is a binary tree for multidimensional
 queries. KD tree shares the same properties as binary search trees (BST), but
@@ -266,10 +268,6 @@ These values are quite fast to access at search time, since they are
 stored column-stride such that only the value for that one field needs
 to be decoded per row searched.

-:::{seealso}
--- [Document values with Apache Lucene]
-:::
-
 ### Column store

 CrateDB implements a {ref}`column store <crate-reference:ddl-storage-columnstore>`
@@ -305,9 +303,19 @@ the following:
 The use of a column store results in a small disk footprint, thanks to specialized
 compression algorithms such as delta encoding, bit packing, and GCD.

-Besides inverted indexes, the Lucene indexing strategy also relies on BKD trees
-and Doc Values that are successfully adopted by CrateDB as well as many popular
-search engines. With a better understanding of the storage layer, we move to
-another interesting topic: [Handling Dynamic Objects in CrateDB].
+## See also
+
+[Introducing Lucene Index Doc Values] is a technical deep dive into
+IndexDocValues introduced with Lucene 4.0.
+
+[Storing multidimensional points using BKD trees] is a comprehensive
+technical explanation about the benefits and design decisions
+behind the BKD tree geospatial data structure coming with Lucene 6.0.
+
+[Document values with Apache Lucene] highlights significant improvements to
+Apache Lucene 7.0 around how doc values are indexed and searched.
+

 [Document values with Apache Lucene]: https://www.elastic.co/blog/sparse-versus-dense-document-values-with-apache-lucene
+[Introducing Lucene Index Doc Values]: https://trifork.nl/blog/introducing-lucene-index-doc-values/
+[Storing multidimensional points using BKD trees]: https://www.elastic.co/blog/lucene-points-6-0
