feat: support vacuum aggregating index #17231

Merged
merged 16 commits into from
Jan 11, 2025

Conversation

SkyFan2002
Member

@SkyFan2002 SkyFan2002 commented Jan 9, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

The DROP AGGREGATING INDEX statement cleans up the index's metadata but not the index's data. This PR adds a new table function, fuse_vacuum_drop_aggregating_index, that cleans up the data of dropped indexes whose retention period has expired.

-- Create database and table
CREATE
OR REPLACE DATABASE test_vacuum_drop_aggregating_index;

CREATE
OR REPLACE TABLE test_vacuum_drop_aggregating_index.agg (a INT, b INT, c INT);

-- Insert initial data
INSERT INTO
    test_vacuum_drop_aggregating_index.agg
VALUES
    (1, 1, 4),
    (1, 2, 1),
    (1, 2, 4);

-- Create aggregating index
CREATE
OR REPLACE AGGREGATING INDEX index AS
SELECT
    MIN(a),
    MAX(b)
FROM
    test_vacuum_drop_aggregating_index.agg;

-- Insert more data
INSERT INTO
    test_vacuum_drop_aggregating_index.agg
VALUES
    (2, 2, 5);

-- Refresh index
REFRESH AGGREGATING INDEX index;

-- Drop the index: this marks it as dropped but does not vacuum its data
DROP AGGREGATING INDEX index;

SET
    data_retention_time_in_days = 0;

-- Vacuum from specified table
SELECT
    *
FROM
    fuse_vacuum_drop_aggregating_index('test_vacuum_drop_aggregating_index', 'agg');
---
# table_id index_id num_removed_files
1840    1848    1

-- Or vacuum from all tables
SELECT
    *
FROM
    fuse_vacuum_drop_aggregating_index();

Implementation

A new key-value pair is added to the meta-service:

__fd_marked_deleted_index/<table_id>/<index_id> -> marked_deleted_index_meta

pub struct MarkedDeletedIndexMeta {
    pub dropped_on: DateTime<Utc>,
    pub index_type: MarkedDeletedIndexType,
}

When an index is dropped, the name->id->meta entries are removed and a __fd_marked_deleted_index key-value pair is added for the index.

When a vacuum is triggered, the meta-service is queried for the __fd_marked_deleted_index keys, and any index that is still within its retention period, as determined by MarkedDeletedIndexMeta.dropped_on, is skipped.

The vacuum then deletes the data of the indexes whose retention period has expired, identifying the index files by index id. After that, the corresponding __fd_marked_deleted_index/<table_id>/<index_id> entries are removed from the meta-service.
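The retention check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `indexes_to_vacuum` is a hypothetical helper, `std::time::SystemTime` stands in for `DateTime<Utc>` to keep the sketch dependency-free, the struct is reduced to its `dropped_on` field, and treating an index dropped exactly `retention` ago as expired is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Simplified stand-in for the PR's `MarkedDeletedIndexMeta`.
/// Assumption: `SystemTime` replaces `DateTime<Utc>`, and `index_type` is omitted.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MarkedDeletedIndexMeta {
    pub dropped_on: SystemTime,
}

/// Hypothetical helper: returns the ids of the indexes whose retention
/// period has elapsed at `now`, i.e. the ones a vacuum may clean up.
pub fn indexes_to_vacuum(
    marked: &[(u64, MarkedDeletedIndexMeta)],
    retention: Duration,
    now: SystemTime,
) -> Vec<u64> {
    marked
        .iter()
        .filter(|(_, meta)| match now.duration_since(meta.dropped_on) {
            // Dropped long enough ago: eligible for vacuum.
            Ok(elapsed) => elapsed >= retention,
            // `dropped_on` lies in the future (clock skew): keep the data.
            Err(_) => false,
        })
        .map(|(index_id, _)| *index_id)
        .collect()
}
```

With a retention of zero, every dropped index qualifies, which is consistent with the example above where `data_retention_time_in_days = 0` is set before calling the table function.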

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jan 9, 2025
@SkyFan2002 SkyFan2002 marked this pull request as ready for review January 10, 2025 00:54
@SkyFan2002 SkyFan2002 requested review from b41sh, dantengsky and drmingdrmer and removed request for drmingdrmer January 10, 2025 00:54
@drmingdrmer
Member

A new key-value pair is added to the meta-service:

__fd_marked_deleted_index/table_id->[(index_id_1,index_meta_1), (index_id_2,index_meta_2), ...]

When an index is dropped, along with removing the name->id->meta, the index id and meta are added to the relevant table's __fd_marked_deleted_index key, and IndexMeta.dropped_on is set to Some(timestamp).

Is it a Vec<(u64, IndexMeta)> in a single value?

It would be better to store it in

__fd_marked_deleted_index/<table_id>/<index_id_1> ->index_meta_1
__fd_marked_deleted_index/<table_id>/<index_id_2> ->index_meta_2
...

That way, each entry has a smaller locking scope, and there is no need to introduce a new container protobuf message to store the list.
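To illustrate the per-entry layout suggested here, a minimal sketch with plain string keys. The function names are hypothetical, and the real meta-service key encoding (for example tenant prefixes or escaping) is not shown in this conversation and may well differ.

```rust
/// Key prefix taken from the PR description.
const PREFIX: &str = "__fd_marked_deleted_index";

/// Hypothetical helper: builds the per-index key `<prefix>/<table_id>/<index_id>`.
pub fn marked_deleted_index_key(table_id: u64, index_id: u64) -> String {
    format!("{}/{}/{}", PREFIX, table_id, index_id)
}

/// Hypothetical helper: prefix for listing all marked-deleted indexes of one table.
pub fn table_prefix(table_id: u64) -> String {
    format!("{}/{}/", PREFIX, table_id)
}

/// Hypothetical helper: parses a key back into `(table_id, index_id)`.
/// Returns `None` when the key does not follow the layout.
pub fn parse_key(key: &str) -> Option<(u64, u64)> {
    let mut parts = key.split('/');
    if parts.next()? != PREFIX {
        return None;
    }
    let table_id: u64 = parts.next()?.parse().ok()?;
    let index_id: u64 = parts.next()?.parse().ok()?;
    // Reject keys with trailing segments.
    if parts.next().is_some() {
        return None;
    }
    Some((table_id, index_id))
}
```

Because every index has its own key, dropping or vacuuming one index touches a single entry, and all entries of a table can still be found with a prefix listing on `table_prefix(table_id)`.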

@SkyFan2002 SkyFan2002 force-pushed the vacuum_agg_index branch 2 times, most recently from 52e3f40 to 203e986 Compare January 10, 2025 12:14
@SkyFan2002 SkyFan2002 requested a review from b41sh January 10, 2025 12:54
@SkyFan2002
Member Author

All comments have been resolved. PTAL @b41sh @drmingdrmer. Thanks.

@drmingdrmer
Member

All comments have been resolved. PTAL @b41sh @drmingdrmer. Thanks.

The PR description should be updated too.

Member

@drmingdrmer drmingdrmer left a comment


LGTM. Some derived trait implementations look unnecessary.

Reviewed 11 of 24 files at r1, 13 of 14 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @b41sh, @dantengsky, and @SkyFan2002)


src/meta/app/src/schema/marked_deleted_index_id.rs line 16 at r5 (raw file):

#[derive(Clone, Debug, Copy, Default, Eq, PartialEq, PartialOrd, Ord)]
pub struct MarkedDeletedIndexId {

Does it need to be Default?


src/meta/app/src/schema/index.rs line 83 at r5 (raw file):

pub enum MarkedDeletedIndexType {
    #[default]
    AGGREGATING = 1,

It's weird to have a Default implementation for it. In every case the type should be specified explicitly AFAIK.

serde is not necessary either, is it?


src/meta/app/src/schema/index.rs line 88 at r5 (raw file):

#[derive(serde::Serialize, serde::Deserialize, Clone, Debug, Eq, PartialEq, Default)]
pub struct MarkedDeletedIndexMeta {

Does it really need to be serde? And Default does not seem necessary either.


src/meta/app/src/schema/index.rs line 189 at r5 (raw file):

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct GetMarkedDeletedIndexesReply {
    pub table_indexes: HashMap<u64, Vec<(u64, MarkedDeletedIndexMeta)>>,

Add doc comment explaining the key and value.
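A sketch of what such a doc comment could look like. The reading of the map (key is the table id, value is that table's `(index_id, meta)` pairs) is inferred from the field type and the key layout discussed in this PR, not confirmed by the source. `MarkedDeletedIndexMeta` is reduced to a placeholder here, and `group_by_table` is a hypothetical helper added only to show how the map would be filled.

```rust
use std::collections::HashMap;

/// Placeholder for the PR's `MarkedDeletedIndexMeta` (fields omitted in this sketch).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MarkedDeletedIndexMeta;

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct GetMarkedDeletedIndexesReply {
    /// Marked-deleted indexes grouped by table.
    ///
    /// Key: table id (assumed).
    /// Value: `(index_id, meta)` for every index of that table that has been
    /// dropped but not yet vacuumed (assumed).
    pub table_indexes: HashMap<u64, Vec<(u64, MarkedDeletedIndexMeta)>>,
}

/// Hypothetical helper: groups flat `(table_id, index_id, meta)` entries by table id,
/// preserving the input order within each table.
pub fn group_by_table(
    entries: Vec<(u64, u64, MarkedDeletedIndexMeta)>,
) -> GetMarkedDeletedIndexesReply {
    let mut table_indexes: HashMap<u64, Vec<(u64, MarkedDeletedIndexMeta)>> = HashMap::new();
    for (table_id, index_id, meta) in entries {
        table_indexes.entry(table_id).or_default().push((index_id, meta));
    }
    GetMarkedDeletedIndexesReply { table_indexes }
}
```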


src/meta/api/src/schema_api.rs line 171 at r1 (raw file):

        &self,
        table_id: Option<u64>,
        tenant: &Tenant,

By convention, always put tenant first.

Code quote:

    async fn get_marked_deleted_indexes(
        &self,
        table_id: Option<u64>,
        tenant: &Tenant,

Member

@drmingdrmer drmingdrmer left a comment


Reviewed 6 of 6 files at r6, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @dantengsky and @SkyFan2002)

@drmingdrmer
Member

@SkyFan2002 ready to merge?

@SkyFan2002
Member Author

@SkyFan2002 ready to merge?

Yes.

@SkyFan2002 SkyFan2002 added this pull request to the merge queue Jan 11, 2025
Merged via the queue into databendlabs:main with commit 77364e2 Jan 11, 2025
70 of 71 checks passed
@SkyFan2002 SkyFan2002 deleted the vacuum_agg_index branch January 11, 2025 08:38