Skip to content

feat: optimize indices#1841

Merged
eddyxu merged 11 commits intomainfrom
lei/opt_idx
Jan 17, 2024
Merged

feat: optimize indices#1841
eddyxu merged 11 commits intomainfrom
lei/opt_idx

Conversation

@eddyxu
Copy link
Member

@eddyxu eddyxu commented Jan 17, 2024

Allow users to speicfy how many delta indices to be merged

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions to help me understand how this operation works

Comment on lines +54 to +60
let mut indices = Vec::with_capacity(old_indices.len());
for idx in old_indices {
let index = dataset
.open_generic_index(&column.name, &idx.uuid.to_string())
.await?;
indices.push(index);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, could maybe use Vec::from_iter here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I change to

    let indices = stream::iter(old_indices.iter())
        .zip(repeat((dataset.clone(), col_name.clone())))
        .map(|(meta, (ds, col_name))| async move {
            ds.open_generic_index(&col_name, &meta.uuid.to_string())
                .await
        })
        .buffered(10)
        .try_collect::<Vec<_>>()
        .await?;

it complaints

error: higher-ranked lifetime error
   --> lance/src/index.rs:300:5
    |
300 |     #[instrument(skip_all)]
    |     ^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: could not prove `Pin<Box<{async block@lance/src/index.rs:300:5: 300:28}>>: CoerceUnsized<Pin<Box<(dyn futures::Future<Output = Result<(), lance_core::Error>> + std::marker::Send + 'i)>>>`
    = note: this error originates in the attribute macro `instrument` (in Nightly builds, run with -Z macro-backtrace for more info)

error: could not compile `lance` (lib) due to previous error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the return of the dreaded rust-lang/rust#102211

Don't worry too much about it but I I've had luck with something like this...

    let indices = stream::iter(old_indices.iter())
        .zip(repeat((dataset.clone(), col_name.clone())))
        .map(|(meta, (ds, col_name))| async move {
            ds.open_generic_index(&col_name, &meta.uuid.to_string())
                .await
        })
        .collect::<Vec<_>>();
   let indices = stream::iter(indices)
        .buffered(10)
        .try_collect::<Vec<_>>()
        .await?;

...or you can just remove the call to buffered (that's the method that usually introduces the bogus error).

Comment on lines +20 to +23
/// If `num_indices_to_merge` is 0, a new delta index will be created.
/// If `num_indices_to_merge` is 1, the delta updates will be merged into the latest index.
/// If `num_indices_to_merge` is more than 1, the delta updates and latest N indices
/// will be merged into one single index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to explain to the user why they would change this parameter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the doc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry too much. We can tweak this if/when the parameter becomes public. I'm guessing that there is some kind of cost-to-compute/accuracy-of-index tradeoff here? Or is it a cost-to-compute/cost-to-search tradeoff?

In other words, "why wouldn't I merge the indicies into one big index every time?" or "why wouldn't I make every index a delta index?"

I'm still not sure it is clear from the comment.

@eddyxu eddyxu merged commit 1671ea0 into main Jan 17, 2024
@eddyxu eddyxu deleted the lei/opt_idx branch January 17, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants