@amotl
Member

@amotl amotl commented Oct 22, 2025

About

The article Indexing and Storage in CrateDB should not be left behind on a platform that took a different direction about content types and style.

Preview

https://cratedb-guide--434.org.readthedocs.build/feature/storage/indexing-and-storage.html

/cc @hammerhead, @surister

@amotl amotl added the reorganize Moving content around, inside and between other systems. label Oct 22, 2025
@coderabbitai

coderabbitai bot commented Oct 22, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 12 minutes and 28 seconds before requesting another review.

📥 Commits

Reviewing files that changed from the base of the PR and between a13dfb2 and 059d7ec.

📒 Files selected for processing (5)
  • docs/explain/index.md (1 hunks)
  • docs/feature/index/index.md (1 hunks)
  • docs/feature/search/fts/index.md (2 hunks)
  • docs/feature/storage/index.md (5 hunks)
  • docs/feature/storage/indexing-and-storage.md (1 hunks)

Walkthrough

Added a new indexing-and-storage guide and updated multiple docs to use card-based blocks, standardize Lucene terminology, and replace inline/external links with internal {ref} cross-references and toctree/See also entries.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| New storage internals guide: docs/feature/storage/indexing-and-storage.md | New comprehensive document describing CrateDB storage internals and Lucene indexing structures (inverted indexes, BKD trees, doc values), indexing workflows, segments/range queries, columnar/doc values model, examples, and see-also links. |
| Cross-linking & See also updates: docs/feature/storage/index.md, docs/feature/search/fts/index.md, docs/explain/index.md | Replaced external/inline links with internal {ref} cross-references; added hidden toctree and See also entries pointing to indexing-and-storage; removed obsolete effective-fulltext-search cross-reference in docs/explain/index.md. |
| Content consolidation & terminology standardization: docs/feature/search/fts/index.md, docs/feature/storage/index.md | Consolidated separate subsections for inverted indexes, BKD trees, and doc values into unified descriptions; standardized wording (DocValues → Doc values, multi-dimensional → multidimensional). |
| Card-based layout & front-matter metadata: docs/feature/index/index.md, docs/feature/search/fts/index.md, docs/feature/document/index.md | Converted info-card/blog blocks to card-based structure with explicit link and link-type fields; adjusted grid layout and tags; added a (container)= metadata tag entry in docs/feature/document/index.md. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to docs/feature/storage/indexing-and-storage.md for technical accuracy and completeness.
  • Verify all {ref} targets and hidden toctree entries resolve and build without warnings.
  • Check that consolidated Lucene descriptions preserve necessary detail for readers.

Possibly related PRs

Suggested labels

new content

Suggested reviewers

  • hammerhead
  • surister

Poem

🐇 I hop through pages, neat and spry,

New guides have sprung beneath my eye.
Inverted paths and BKD trees,
Doc values hum in tidy rows,
I nibble links and leave soft prose.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "Storage internals: Indexing and storage" directly reflects the primary change in this pull request, which is the addition of a new documentation file (indexing-and-storage.md) that covers CrateDB's storage internals and Lucene-based indexing mechanisms, along with coordinated updates to related documentation files for consistency. The title is specific, concise, and clearly communicates the focus without unnecessary elaboration or vague terminology. |
| Description Check | ✅ Passed | The pull request description is directly related to the changeset, explaining that the PR aims to preserve and migrate the "Indexing and Storage in CrateDB" article into the documentation guide and noting the scope of coverage including Lucene-based indexing and storage topics. The description provides clear context for the changes and references a preview build, demonstrating that the author has thoughtfully documented the purpose and intent behind the modifications across multiple files. |

coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl requested review from matriv and seut October 22, 2025 01:35
Contributor

@matriv matriv left a comment

thx, left 2 comments.

@amotl amotl force-pushed the storage-internals branch from 32f7040 to 274f983 Compare October 24, 2025 06:02
@amotl amotl requested a review from matriv October 24, 2025 06:05
coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl force-pushed the storage-internals branch from cfced51 to a53d709 Compare October 24, 2025 06:14
Member Author

@amotl amotl left a comment

@matriv: Thanks for the review. I just discovered your comment on the commit itself by accident, see below.

coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl force-pushed the storage-internals branch from b510f54 to a78c2e7 Compare October 24, 2025 13:21
@amotl
Member Author

amotl commented Oct 24, 2025

@coderabbitai summarized the procedure:

Adds a new storage article about Lucene-based indexing and storage.

Well-structured and technically sound article.
The documentation effectively explains CrateDB's storage internals, covering inverted indexes, BKD trees, and doc values with clear explanations, practical examples (Product table, tokenization workflow), and helpful diagrams. The MyST syntax is correct, cross-references are properly formatted, and the content progressively builds complexity in an accessible way.

Thanks a stack for your contributions and reviews.

/cc @marijaselakovic

@coderabbitai

coderabbitai bot commented Oct 24, 2025

@amotl, you're very welcome! I'm glad the review was helpful. The storage internals article is a valuable addition to the guide—bringing that content into the documentation makes it more accessible and permanent. Great work on this PR! 🚀

Base automatically changed from explain to main October 24, 2025 18:50
@amotl amotl force-pushed the storage-internals branch 2 times, most recently from 933eb5b to c0b9a89 Compare October 24, 2025 19:18
amotl

This comment was marked as resolved.

@crate crate deleted a comment from coderabbitai bot Oct 24, 2025
@amotl amotl force-pushed the storage-internals branch from c0b9a89 to 4d05a2c Compare October 24, 2025 19:36
Member Author

@amotl amotl left a comment

By conducting another review cycle, I was able to come up with a few more suggestions.

@amotl amotl marked this pull request as draft October 27, 2025 10:10
@amotl amotl force-pushed the storage-internals branch from 4d05a2c to e96ced2 Compare October 28, 2025 21:31
@amotl amotl removed request for matriv and seut October 28, 2025 22:24
@amotl amotl changed the title Storage internals: Add article about "Indexing and storage" Storage internals: Indexing and storage Oct 28, 2025
@amotl amotl marked this pull request as ready for review October 28, 2025 22:25
@amotl amotl added the sanding-500 Sanding medium-sized details. label Oct 28, 2025
coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl force-pushed the storage-internals branch 2 times, most recently from 386a5e2 to 12c4363 Compare October 28, 2025 23:30
@amotl amotl requested review from matriv and seut October 28, 2025 23:42
data structure supports range queries on numeric values in CrateDB.

:Doc values: This data structure supports more efficient querying document
fields by id, performs column-oriented retrieval of data, and improves the
Contributor

querying document fields by id? is that correct @seut?

Member

yes, maybe using name instead of id would improve this?

Member Author

All right, thanks. I've included the update into a13dfb2.

within CrateDB: Inverted indexes for text values, BKD trees for numeric
values, and doc values.

:Inverted index: You will learn how inverted indexes are implemented in Lucene
Contributor

I don't like the "You will learn" phrasing here, as it is more appropriate for an agenda slide in a live presentation.
Although this comes from Marija's context, I'd suggest rephrasing it:

Explain how inverted indexes are implemented in Lucene and how CrateDB uses them to index text values and enable fast text searches.

Member Author

@amotl amotl Oct 29, 2025

Thanks, I agree. Even if it's an article pulled in from another stage, we shouldn't stop curating it going forward.

I first used your suggestion 1:1, but after a little back and forth I settled on starting the sentence with "understand", picked up from the definition item of the :BKD tree: term, since you didn't have any complaints about that; see a13dfb2.

Well, it's a repetition of words, but on the other hand, it's symmetric right now. ;]

Contributor

@matriv matriv left a comment

nice, left 1 suggestion and a question.

This article effectively explains CrateDB's storage internals,
covering inverted indexes, BKD trees, and doc values with clear
explanations, practical examples, and helpful diagrams.

Source: https://cratedb.com/blog/indexing-and-storage-in-cratedb
@amotl amotl force-pushed the storage-internals branch from 12c4363 to a13dfb2 Compare October 29, 2025 20:29
coderabbitai[bot]

This comment was marked as resolved.

@amotl
Member Author

amotl commented Oct 29, 2025

It looks like @coderabbitai had some final evaluations about this patch.

Introduction is clear and well-positioned.
Doc values explanation is technically sound and well-structured.
Column store implementation and limitations are clearly documented.
All external resource links verified as active and accessible.
Terminology standardization ("Doc values" not "DocValues") is consistent.
Storage-layer reference strengthens documentation interconnectedness.
Card-based presentation modernizes content and improves navigation.
See also and toctree structure enables better navigation.
Transition to internal reference improves documentation cohesion.
Front-matter labels are well-chosen and enable better cross-referencing.

Please object to any of those items if you think differently, and include suggestions about what can be improved. Thanks!

@coderabbitai

coderabbitai bot commented Oct 29, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Read the main article to validate evaluation points
echo "=== Main Article Content ==="
cat docs/feature/storage/indexing-and-storage.md

echo -e "\n=== Checking for terminology consistency (Doc values vs DocValues) ==="
rg -n "DocValues" docs/feature/storage/indexing-and-storage.md || echo "No instances of 'DocValues' found (good - should be 'Doc values')"
rg -n "Doc values" docs/feature/storage/indexing-and-storage.md | head -20

echo -e "\n=== Checking storage/index.md for navigation structure ==="
cat docs/feature/storage/index.md

echo -e "\n=== Checking related navigation files ==="
cat docs/feature/index/index.md | head -80

Length of output: 22070


🏁 Script executed:

#!/bin/bash

# Extract and test external links from the article
echo "=== Extracting external links from indexing-and-storage.md ==="
rg -oP 'https?://[^\s\)\]]+' docs/feature/storage/indexing-and-storage.md | sort -u

echo -e "\n=== Testing link accessibility ==="
# Test a few key external links
for url in $(rg -oP 'https?://[^\s\)\]]+' docs/feature/storage/indexing-and-storage.md | sort -u | head -5); do
    echo "Testing: $url"
    curl -s -o /dev/null -w "Status: %{http_code}\n" --max-time 10 "$url" || echo "Failed to reach $url"
done

Length of output: 860


🏁 Script executed:

#!/bin/bash

# Test the remaining external links
echo "=== Testing remaining external links ==="
curl -s -o /dev/null -w "lucene-points-6-0: %{http_code}\n" --max-time 10 "https://www.elastic.co/blog/lucene-points-6-0"
curl -s -o /dev/null -w "sparse-versus-dense: %{http_code}\n" --max-time 10 "https://www.elastic.co/blog/sparse-versus-dense-document-values-with-apache-lucene"

# Check front-matter labels more thoroughly
echo -e "\n=== Front-matter labels in indexing-and-storage.md ==="
head -n 1 docs/feature/storage/indexing-and-storage.md

echo -e "\n=== Front-matter labels in storage/index.md ==="
head -n 2 docs/feature/storage/index.md

# Verify cross-references work properly
echo -e "\n=== Cross-references to indexing-and-storage ==="
rg -n "indexing-and-storage" docs/feature/storage/index.md docs/feature/index/index.md docs/feature/search/fts/index.md docs/explain/index.md 2>/dev/null | head -20

Length of output: 720


@amotl, I've thoroughly validated all 10 evaluation points against the actual content, and I'm pleased to confirm they are all accurate. Here's my assessment:

All evaluation points validated:

  1. Introduction - Clear context about Lucene-based storage layer
  2. Doc values explanation - Technically sound, covers history (Lucene 4.0), column-stride vs row-stride storage, and benefits
  3. Column store - Implementation well-documented with Product table example; limitations explicitly stated (no container/geographic types, text-only disable option)
  4. External links - All 7 unique URLs tested and accessible (HTTP 200/301):
    • GitHub avatars, Crate.io diagrams, Elastic.co blog posts, Trifork blog
  5. Terminology - Consistent use of "Doc values" throughout (only "IndexDocValues" appears as proper name in reference)
  6. Storage-layer reference - Proper bidirectional linking with {ref}`indexing-and-storage`
  7. Card-based presentation - Implemented across index files (feature/index/index.md, feature/search/fts/index.md)
  8. Navigation structure - "See also" sections + hidden toctree in storage/index.md
  9. Internal references - 5 files now reference indexing-and-storage internally
  10. Front-matter labels - Well-chosen: (indexing-and-storage)=, (storage-internals)=, (storage-layer)=
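
The terminology point in item 5 could be turned into a repeatable check. The following is only a sketch under assumptions: `check_terminology` is a hypothetical helper name, and it uses plain `grep` rather than the `rg` calls from the scripts above. The `-w` flag restricts matches to whole words, so a proper name such as "IndexDocValues" would not be reported.

```shell
#!/bin/bash

# Hypothetical helper (not part of the repository): succeed only if FILE
# contains no standalone "DocValues" token.
check_terminology() {
    local file="$1"
    # -n prints line numbers, -w matches whole words only, so
    # "IndexDocValues" is ignored while a bare "DocValues" is flagged.
    if grep -nw "DocValues" "$file"; then
        echo "Found 'DocValues'; expected 'Doc values'" >&2
        return 1
    fi
    return 0
}
```

Under these assumptions, `check_terminology docs/feature/storage/indexing-and-storage.md` would exit non-zero only when the discouraged spelling appears as its own word.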

Optional minor enhancements (not objections):

  • Consider adding a brief table of contents after the "What's inside" section for easier navigation within the 8-minute read
  • The KDB tree construction example is excellent but could potentially use numbered steps for even clearer flow
  • The diagram images return HTTP 301 (redirects) rather than 200, which is common for CDN/CMS systems and shouldn't cause any issues, but worth monitoring if the hosting infrastructure changes.
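
To make the distinction between HTTP 200 and 301 explicit when monitoring those links, the status codes could be classified instead of just printed. This is a minimal sketch, not what the scripts above did: `classify_status` and `check_url` are hypothetical helper names, and the `curl` options are the same ones used in the executed scripts.

```shell
#!/bin/bash

# Hypothetical helper: map an HTTP status code, as reported by
# `curl -w "%{http_code}"`, to a coarse category.
classify_status() {
    case "$1" in
        2[0-9][0-9]) echo "ok" ;;
        3[0-9][0-9]) echo "redirect" ;;
        000)         echo "unreachable" ;;  # curl reports 000 when no response arrived
        *)           echo "error" ;;
    esac
}

# Hypothetical wrapper: fetch only the status code for a URL and print
# it together with its category. Redirects are not followed.
check_url() {
    local code
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$1")
    echo "$1: ${code} ($(classify_status "${code}"))"
}
```

With this split, a 301 from an image host would be labelled as a redirect rather than treated the same as a 200 or as a failure.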

Overall:

  • This is an exemplary documentation addition—technically accurate, well-structured, and properly integrated into the documentation ecosystem. Great work! 🎯
  • The article successfully preserves valuable technical content with excellent educational structure — progressive complexity, practical examples (Product table), clear diagrams, and proper integration into the documentation ecosystem. This meets the repository's goal of clarity and educational value over production concerns.

🧠 Learnings used
Learnt from: amotl
PR: crate/cratedb-guide#263
File: docs/integrate/kafka/docker-python.md:112-120
Timestamp: 2025-08-22T18:11:12.776Z
Learning: In documentation and tutorial repositories like cratedb-guide, code examples should prioritize simplicity, clarity, and educational value over production-ready features. Comprehensive error handling, extensive validation, and other production concerns can distract from the main learning objectives and make examples harder to follow. Review suggestions should focus on correctness and clarity rather than production hardening.

@coderabbitai

This comment was marked as duplicate.

amotl and others added 2 commits October 29, 2025 21:45
- Mention the fast sorting and aggregations that the doc values
  mechanism provides.
- Improve "doc values" section.
- Trim "introduction" section.
- Remove "summary" section.
- Fix tables.
- Various copy editing. Wording. Naming things.
- Wrap images into white background for accompanying dark mode.
- Implement suggestions from code review, also by CodeRabbit

Co-authored-by: Marios Trivyzas <[email protected]>
@amotl amotl force-pushed the storage-internals branch from fb0aaea to 059d7ec Compare October 29, 2025 20:46
@amotl
Member Author

amotl commented Oct 29, 2025

Thanks again for your reviews. 🙇

@amotl amotl added the cross linking Linking to different locations of the documentation. label Oct 29, 2025
@amotl amotl merged commit e563ce9 into main Oct 29, 2025
3 checks passed
@amotl amotl deleted the storage-internals branch October 29, 2025 21:44
Labels

cross linking Linking to different locations of the documentation. reorganize Moving content around, inside and between other systems. sanding-500 Sanding medium-sized details.


5 participants