@amotl
Member

@amotl amotl commented Oct 28, 2025

About

  • Reorganize page for better overview and guidance.
  • Cross-link to relevant detailed information in the reference manual.
  • Add information about :Sharded storage:, :Segment merges:, and :Table refreshes:.

Preview

References

@coderabbitai

coderabbitai bot commented Oct 28, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 12 minutes and 45 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between f171c23 and 0077a6e.

📒 Files selected for processing (1)
  • docs/feature/storage/index.md (4 hunks)

Walkthrough

Reorganized storage docs: replaced a Lucene-centric layout with a "Data structures" section (Inverted index, BKD tree, Doc values, Column store) and added a "Storage process" section covering append-only segments, segment merges (TieredMergePolicy), refresh behavior, manual OPTIMIZE/REFRESH guidance, and updated anchors/cross-references.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Storage documentation restructuring and expansion: docs/feature/storage/index.md | Replaced Lucene-focused overview with a "Data structures" section (Inverted index, BKD tree, Doc values) and an explicit "Column store" subsection. Added "Storage process" covering sharded storage, append-only segments, segment merges (TieredMergePolicy), segment lifecycle, table refresh behavior, and manual OPTIMIZE TABLE/REFRESH TABLE guidance. Updated headers/anchors, related-links, and cross-references. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client (ingest / query)
    participant Coord as Coordinator
    participant Shard as Shard (segments)
    participant BG as Background merges

    rect #F0F9FF
    Note over Client,Coord: Write path
    Client->>Coord: INSERT / bulk ingest
    Coord->>Shard: append document to new segment (append-only)
    Shard-->>Client: ack (visible after refresh)
    end

    rect #FFF7F0
    Note over BG,Shard: Maintenance
    BG->>Shard: TieredMergePolicy triggers merges
    Shard-->>Shard: create merged segment
    Shard-->>Coord: merged segment becomes active
    end

    rect #F0FFF4
    Note over Client,Shard: Read path
    Client->>Coord: SELECT / search
    Coord->>Shard: search (inverted index, BKD tree, doc values / column store)
    Shard-->>Client: results (as of latest refresh)
    end
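
The write, maintenance, and read paths in the diagram above can be sketched as a toy model. This is a minimal illustration in Python, not CrateDB's or Lucene's implementation: the `ToyShard` class, its method names, and the merge rule (repeatedly fold the two smallest segments together) are assumptions made for this sketch and only loosely stand in for TieredMergePolicy.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShard:
    """Hypothetical model of one shard: immutable segments plus a write buffer."""

    segments: list[tuple[str, ...]] = field(default_factory=list)
    buffer: list[str] = field(default_factory=list)

    def insert(self, doc: str) -> None:
        # Write path: the document is accepted but not yet searchable.
        self.buffer.append(doc)

    def refresh(self) -> None:
        # Refresh: buffered documents become a new immutable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def merge(self, max_segments: int = 1) -> None:
        # Maintenance: simplified stand-in for a merge policy (assumption).
        # Fold the two smallest segments into one until the target is met.
        while len(self.segments) > max(1, max_segments):
            self.segments.sort(key=len)
            merged = self.segments[0] + self.segments[1]
            self.segments = [merged] + self.segments[2:]

    def search(self) -> list[str]:
        # Read path: only documents in refreshed segments are visible.
        return sorted(doc for segment in self.segments for doc in segment)


shard = ToyShard()
shard.insert("a")
shard.insert("b")
print(shard.search())  # [] because nothing has been refreshed yet
shard.refresh()
print(shard.search())  # ['a', 'b']
```

In this sketch, `refresh()` loosely corresponds to what the PR describes for REFRESH TABLE, and `merge()` to a manual OPTIMIZE TABLE; segments are never modified in place, only replaced by merged ones.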

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify accuracy of doc values → column store linkage and statements about sorting/aggregations.
  • Confirm anchors/cross-references for TieredMergePolicy, OPTIMIZE TABLE, and REFRESH TABLE.
  • Check clarity of storage lifecycle and refresh semantics.

Possibly related PRs

Suggested labels

sanding-500, refurbish, cross linking

Suggested reviewers

  • seut
  • matriv

Poem

🐇 I hopped through docs with nimble paws,

Mapped inverted trees and column laws.
Segments stack and merges play,
Refresh brings light to what I say.
A rabbit cheers — docs tidy today.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "Storage internals: Refurbish overview page" is directly related to the main changes in the pull request. The raw summary confirms that the changeset involves extensive reorganization and enhancement of the docs/feature/storage/index.md file, including restructuring subsections, adding new sections about the storage process, and expanding data-structure details. The title accurately and specifically identifies this as a refurbishment of the storage internals overview documentation, making it clear and informative for someone scanning the commit history. |
| Description Check | ✅ Passed | The pull request description clearly relates to the changeset provided in the raw summary. The author states the PR will "reorganize page for better overview and guidance," "cross link to relevant in-detail information," and "add information about Sharded storage, Segment merges, and Table refreshes." These claims are directly supported by the raw summary, which confirms the reorganization of sections, updates to cross-links and references, and the addition of a Storage process section covering Sharded storage, Segment merges, and Table refreshes. The description is specific enough to convey the PR's main objectives and accurately reflects the nature of the changes. |

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl force-pushed the storage-internals branch 2 times, most recently from 386a5e2 to 12c4363 Compare October 28, 2025 23:30
@amotl amotl changed the base branch from storage-internals to main October 28, 2025 23:38
@amotl amotl force-pushed the storage-layer-more-details branch from cd19761 to b7b569d Compare October 28, 2025 23:39
@amotl amotl added sanding-1200 Fine sanding. reorganize Moving content around, inside and between other systems. labels Oct 28, 2025
@amotl amotl requested review from matriv and seut October 28, 2025 23:42
Contributor

@matriv matriv left a comment

thx, left also a few comments/suggestions.

@amotl amotl force-pushed the storage-layer-more-details branch from b7b569d to ffce69e Compare October 29, 2025 22:45
@amotl amotl requested review from matriv and seut October 29, 2025 22:49
Member

@seut seut left a comment

Thanks, spotted something more.

@amotl amotl force-pushed the storage-layer-more-details branch from ffce69e to 622cd2a Compare October 30, 2025 10:38
@amotl amotl requested a review from seut October 30, 2025 10:39
coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl force-pushed the storage-layer-more-details branch from 98c9ad7 to c5467bb Compare October 30, 2025 12:04
@amotl amotl requested a review from matriv October 30, 2025 12:05
Member Author

@amotl amotl left a comment

Just another little bunch of suggestions I am not sure about.

@amotl amotl force-pushed the storage-layer-more-details branch from 8b2f9bd to 23c257c Compare October 30, 2025 16:39
Comment on lines 85 to 87
## Storage process

How CrateDB stores data using Lucene.
Member Author

@amotl amotl Oct 30, 2025

When starting this topic already with a quick introduction, it makes perfect sense to refer to the CrateDB reference manual here or at the bottom of the section or page. We've identified two candidates. wdyt?

Contributor

Both are fine, thanks!

Member Author

The first link was inlined using a separate :Sharded storage: term definition at the top of the "Storage process" section to improve the overall narrative per 442d39d, and the second one, not really fitting inside the article, was added to the bottom of the document into the "Related sections" section per 69d0b90.

Maybe someone will have suggestions for wording improvements, as this is just a v0.1. While those "teaser texts" are mostly based on ingredients of the sections they refer to, they are still "new" in the sense that they try to compress the whole gist into a few concise words. In that curation process, any support is very welcome.

@coderabbitai: Can you analyze the content of the two commits I was referring to and present suggestions how to improve them, when applicable? As always, please use concise language, active voice, and avoid yapping.

This comment was marked as resolved.

Member Author

@amotl amotl Oct 31, 2025

@coderabbitai: Your suggestion saves a few words. Applied per e82d746, thank you.
@matriv: I found the statements are still correct, and I hope you don't have any objections.

This comment was marked as spam.

@amotl amotl force-pushed the storage-layer-more-details branch from 23c257c to 69d0b90 Compare October 31, 2025 17:26
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23c257c and 69d0b90.

📒 Files selected for processing (1)
  • docs/feature/storage/index.md (4 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: amotl
Repo: crate/cratedb-guide PR: 0
File: :0-0
Timestamp: 2025-10-06T16:40:16.322Z
Learning: In the crate/cratedb-guide repository, documentation is organized by content type: files in `docs/feature/`, `docs/integrate/`, and `docs/ingest/` are reference material designed for quick access, while `docs/start/` contains learning path content. Reference material sections use top-right reference layouts for efficient navigation.
📚 Learning: 2025-10-06T16:40:16.322Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 0
File: :0-0
Timestamp: 2025-10-06T16:40:16.322Z
Learning: In the crate/cratedb-guide repository, documentation is organized by content type: files in `docs/feature/`, `docs/integrate/`, and `docs/ingest/` are reference material designed for quick access, while `docs/start/` contains learning path content. Reference material sections use top-right reference layouts for efficient navigation.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-08-09T11:00:03.578Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 236
File: docs/index.md:312-316
Timestamp: 2025-08-09T11:00:03.578Z
Learning: In the CrateDB Guide documentation, locally stale references (such as `(reference-architectures)=`, `(domain)=`, and similar anchor labels) should be retained even when they appear unused within the repository, as external projects might still reference them. This ensures backward compatibility for external documentation and projects.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-10-10T17:58:26.545Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 395
File: docs/install/cloud/azure/terraform.md:108-127
Timestamp: 2025-10-10T17:58:26.545Z
Learning: In the crate/cratedb-guide repository, during documentation format conversions (e.g., from reStructuredText to MyST/Markdown), markdownlint warnings such as MD053 (unused reference-style links) should be ignored and not flagged as issues.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-08-23T15:15:18.717Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 235
File: docs/start/query/ai-integration.md:265-273
Timestamp: 2025-08-23T15:15:18.717Z
Learning: In CrateDB documentation, avoid using "data lake" terminology when describing CrateDB's role, as it can create misconceptions. Prefer "real-time analytics store" or similar terms that better reflect CrateDB's actual capabilities as a real-time analytics database.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-08-23T15:09:38.537Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 235
File: docs/start/index.md:1-3
Timestamp: 2025-08-23T15:09:38.537Z
Learning: In the CrateDB Guide documentation, the `(use)=` label in `docs/start/index.md` is intentionally placed alongside `(getting-started)=` as part of the documentation architecture, even though it may appear to conflict with top-level "Use" section labeling. This is a deliberate design decision by the maintainers.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-08-23T04:22:52.597Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 235
File: docs/start/going-further.md:20-38
Timestamp: 2025-08-23T04:22:52.597Z
Learning: In the CrateDB Guide repository, the prefix `crate-reference:` in documentation links (like `crate-reference:index`) is an intersphinx reference that points to external content in the CrateDB Reference Manual documentation. These references are intentionally not pointing to local files within the repository.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-10-30T23:12:30.165Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 465
File: docs/admin/troubleshooting/system-tables.md:6-6
Timestamp: 2025-10-30T23:12:30.165Z
Learning: In the cratedb-guide repository, documentation headers and titles should use sentence case (e.g., "Diagnostics with system tables") rather than title case (e.g., "Diagnostics with System Tables"). This style choice enables headers to be referenced more fluently within sentences.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-09-25T19:31:54.320Z
Learnt from: bmunkholm
Repo: crate/cratedb-guide PR: 340
File: docs/home/index.md:84-97
Timestamp: 2025-09-25T19:31:54.320Z
Learning: In the CrateDB Guide docs (MyST), the CrateDB Cloud card on the homepage should link to `getting-started` using `:link-type: ref` instead of the previous `cloud:index` intersphinx target. This change was implemented in PR #340 to direct users to the getting started section rather than directly to the Cloud documentation.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-08-09T21:40:46.154Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 238
File: docs/integrate/streamsets/index.md:18-21
Timestamp: 2025-08-09T21:40:46.154Z
Learning: In the CrateDB Guide documentation, reStructuredText files (`.rst`) use the RST label syntax `.. _label:` while MyST Markdown files (`.md`) use the syntax `(label)=`. Both formats are used throughout the repository and labels can be cross-referenced between them.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-09-22T22:05:05.734Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 279
File: docs/integrate/airflow/data-retention-hot-cold.md:14-14
Timestamp: 2025-09-22T22:05:05.734Z
Learning: When searching for labels in the CrateDB Guide documentation, remember that reStructuredText files (.rst) use the syntax `.. _label:` while MyST Markdown files (.md) use the syntax `(label)=`. Cross-references work between formats, so {ref}`label` in MyST can reference an RST label defined with `.. _label:`.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-10-19T19:21:49.864Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 420
File: docs/connect/erlang.md:100-113
Timestamp: 2025-10-19T19:21:49.864Z
Learning: In the cratedb-guide repository, when documenting cloud or alternative connection methods, show only the differences from the canonical example (e.g., SSL configuration) rather than repeating the entire workflow (query execution, cleanup, etc.). This pattern is used consistently across pages to avoid repetition.

Applied to files:

  • docs/feature/storage/index.md
📚 Learning: 2025-09-30T14:50:33.725Z
Learnt from: amotl
Repo: crate/cratedb-guide PR: 296
File: docs/feature/query/recurrent.md:36-36
Timestamp: 2025-09-30T14:50:33.725Z
Learning: In CrateDB, `_id` is an internal system column available on each indexed document in doc schema tables and can be retrieved by SELECT queries or used in RETURNING clauses. Reference: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/system-columns.html

Applied to files:

  • docs/feature/storage/index.md
🪛 markdownlint-cli2 (0.18.1)
docs/feature/storage/index.md

130-130: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


132-132: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


133-133: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

🔇 Additional comments (3)
docs/feature/storage/index.md (3)

17-84: Data structures section is well-organized and technically sound.

The reorganization from Lucene-centric layout to CrateDB-specific data structures is clear. The progression (document → inverted index → BKD tree → doc values → column store) flows logically, and the explanations address past review feedback (e.g., clarifying doc values as the basis for column store, scope of column store support). Cross-references to external documentation are appropriately placed.


85-164: Storage process section adds valuable context on lifecycle and maintenance.

The new section effectively covers Lucene's storage mechanics (sharded, append-only, segment merges) and ties them to operational practicalities (TieredMergePolicy, manual OPTIMIZE, refresh intervals, idle shard behavior). The "tldr" framing works well for reference material. Past feedback (e.g., about reindexing on major versions, refresh behavior, OPTIMIZE/REFRESH commands) has been integrated. The footnote about table recreation on major upgrades is a helpful cross-reference.


165-182: Related sections link structure is clear.

Renaming "See also" to "Related sections" and updating the introductory text to reference indexing-and-storage and high-availability/consistency topics maintains good discoverability. External links (footnote to recreate tables, TieredMergePolicy) are well-positioned.

amotl and others added 27 commits November 3, 2025 12:08
@amotl amotl force-pushed the storage-layer-more-details branch from f171c23 to 0077a6e Compare November 3, 2025 11:11
@amotl amotl merged commit 6e212de into main Nov 3, 2025
3 checks passed
@amotl amotl deleted the storage-layer-more-details branch November 3, 2025 11:33
Labels

  • cross linking: Linking to different locations of the documentation.
  • refurbish: A more significant refurbishment.
  • reorganize: Moving content around, inside and between other systems.
  • sanding-1200: Fine sanding.

4 participants