@b41sh b41sh commented Nov 6, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Problem: With storage_format = native, queries that use an inverted or vector index could panic. The native format reads data page by page, with each data block split internally into multiple pages, whereas the row indices returned by the inverted and vector indexes are relative to the whole block. Applying those block-level indices directly to a single page can go out of bounds.

For example:

CREATE TABLE t_native (id int, content string, INVERTED INDEX idx1 (content)) storage_format = 'native' row_per_page = 2;

INSERT INTO t_native VALUES
(1, 'The quick brown fox jumps over the lazy dog'),
(2, 'A picture is worth a thousand words'),
(3, 'The early bird catches the worm'),
(4, 'Actions speak louder than words'),
(5, 'Time flies like an arrow; fruit flies like a banana'),
(6, 'Beauty is in the eye of the beholder'),
(7, 'When life gives you lemons, make lemonade'),
(8, 'Put all your eggs in one basket'),
(9, 'You can not judge a book by its cover'),
(10, 'An apple a day keeps the doctor away');

SELECT * FROM t_native WHERE query('content:book OR content:basket');
error: APIError: QueryFailed: [1104]index out of bounds: the len is 1 but the index is 1

This PR addresses the panic by introducing a mechanism to correctly transform the block-level row indices (idx) from the inverted and vector indexes into page-specific indices.
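
The translation from block-level to page-level indices can be illustrated with a minimal sketch. This is not the code from this PR: `to_page_indices` is a hypothetical helper, pages are assumed to cover contiguous, fixed-size row ranges, and the matching rows are assumed to be the 0-based block indices 7 and 8 (ids 8 and 9 in the example table).

```rust
use std::ops::Range;

/// Hypothetical helper: keep only the block-level row indices that fall
/// inside `page`, and rebase them so the first row of the page is index 0.
fn to_page_indices(block_indices: &[usize], page: Range<usize>) -> Vec<usize> {
    block_indices
        .iter()
        .copied()
        .filter(|idx| page.contains(idx))
        .map(|idx| idx - page.start)
        .collect()
}

fn main() {
    // 10 rows with 2 rows per page, as in the example table.
    let rows_per_page = 2;
    let num_rows = 10;
    // Assumed 0-based block-level matches: 'basket' (id 8) and 'book' (id 9).
    let matches = [7usize, 8];

    for page_no in 0..num_rows / rows_per_page {
        let start = page_no * rows_per_page;
        let page = start..start + rows_per_page;
        let local = to_page_indices(&matches, page.clone());
        println!("page {page_no} rows {page:?} -> local indices {local:?}");
    }
}
```

Without the rebasing step, index 7 or 8 would be applied to a page that holds only two rows, which is the kind of out-of-bounds access reported above.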

To efficiently manage the row offsets in each page, std::vec::Vec has been replaced with RoaringTreemap.

  • RoaringTreemap is a compressed bitmap that stores and queries sets of 64-bit integers efficiently.
  • It allows for rapid determination of whether a given idx (after conversion) exists within the page and enables quick calculation of the idx's relative offset.
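
The two operations described above (membership test and relative offset) can be sketched as follows. To keep the example dependency-free, `std::collections::BTreeSet<u64>` stands in for `RoaringTreemap`; `PageRows` and `local_offset` are hypothetical names, not identifiers from this PR. With the `roaring` crate, the equivalent calls would be, to my knowledge, `RoaringTreemap::contains` for the membership test and `RoaringTreemap::rank` (the count of values less than or equal to the argument) minus one for the offset.

```rust
use std::collections::BTreeSet;

/// Block-level row offsets of the rows held by one page.
/// `BTreeSet<u64>` is a std-only stand-in for `RoaringTreemap` in this sketch.
struct PageRows {
    offsets: BTreeSet<u64>,
}

impl PageRows {
    fn new(offsets: impl IntoIterator<Item = u64>) -> Self {
        Self { offsets: offsets.into_iter().collect() }
    }

    /// Position of `block_idx` within the page, or `None` if the page
    /// does not hold that row.
    fn local_offset(&self, block_idx: u64) -> Option<usize> {
        if !self.offsets.contains(&block_idx) {
            return None;
        }
        // Number of rows in the page that precede `block_idx`.
        Some(self.offsets.range(..block_idx).count())
    }
}

fn main() {
    // A page holding block rows 6 and 7.
    let page = PageRows::new([6, 7]);
    for idx in [7u64, 8] {
        match page.local_offset(idx) {
            Some(pos) => println!("block idx {idx} -> page offset {pos}"),
            None => println!("block idx {idx} is not in this page"),
        }
    }
}
```

Counting through a `BTreeSet` range is linear in the number of preceding rows; a rank query on a roaring bitmap avoids that scan, which is one reason a bitmap is a better fit than a plain `Vec` or tree set for this lookup.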

fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@b41sh b41sh requested review from sundy-li and zhyass November 6, 2025 05:38
@github-actions github-actions bot added the pr-bugfix this PR patches a bug in codebase label Nov 6, 2025
@b41sh b41sh merged commit 17fb7b4 into databendlabs:main Nov 6, 2025
87 checks passed