Skip to content

refactor!: index stats to handle error and bunch of bugs#1828

Merged
eddyxu merged 9 commits intomainfrom
lei/idx_stats
Jan 14, 2024
Merged

refactor!: index stats to handle error and bunch of bugs#1828
eddyxu merged 9 commits intomainfrom
lei/idx_stats

Conversation

@eddyxu
Copy link
Member

@eddyxu eddyxu commented Jan 13, 2024

BREAKING CHANGE: removed single-purpose stats API from public API and refactored DatasetIndexExt to lance-index.

Also, fixed a few places that unwrap() results.

@github-actions
Copy link
Contributor

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@eddyxu eddyxu changed the title refactor: refactor index stats to handle error and bunch of bugs. BREAKING CHANGE: refactor index stats to handle error and bunch of bugs. Jan 13, 2024
@eddyxu eddyxu changed the title BREAKING CHANGE: refactor index stats to handle error and bunch of bugs. BREAKING CHANGE: refactor index stats to handle error and bunch of bugs Jan 14, 2024
@eddyxu eddyxu changed the title BREAKING CHANGE: refactor index stats to handle error and bunch of bugs refactor!: index stats to handle error and bunch of bugs Jan 14, 2024
@eddyxu eddyxu marked this pull request as ready for review January 14, 2024 01:11
Comment on lines +37 to +39

/// Trait for a Lance Dataset
pub trait Dataset {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry a left over

}

async fn load_indices(&self) -> Result<Vec<IndexMetadata>> {
let manifest_file = self.manifest_file(self.version().version).await?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why self.manifest_file(...) and not self.manifest?

Copy link
Member Author

@eddyxu eddyxu Jan 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.manifest_file returns the file path, the index is lazy loaded , so it opens the file again and read the indices, iiuc.


async fn load_indices(&self) -> Result<Vec<IndexMetadata>> {
let manifest_file = self.manifest_file(self.version().version).await?;
read_manifest_indexes(&self.object_store, &manifest_file, &self.manifest).await
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me (#1657) we should cache this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah did not realize that we have to open this file for every query

Comment on lines +53 to +83
async fn load_index(&self, uuid: &str) -> Result<Option<Index>> {
self.load_indices()
.await
.map(|indices| indices.into_iter().find(|idx| idx.uuid.to_string() == uuid))
}

/// Loads a specific index with the given index name
async fn load_index_by_name(&self, name: &str) -> Result<Option<Index>> {
self.load_indices()
.await
.map(|indices| indices.into_iter().find(|idx| idx.name == name))
}

/// Loads a specific index with the given index name.
async fn load_scalar_index_for_column(&self, col: &str) -> Result<Option<Index>>;

/// Optimize indices.
async fn optimize_indices(&mut self) -> Result<()>;

/// Find index with a given index_name and return its serialized statistics.
async fn index_statistics(&self, index_name: &str) -> Result<Option<String>>;

/// Count the rows that are not indexed by the given index.
///
/// TODO: move to [DatasetInternalExt]
async fn count_unindexed_rows(&self, index_name: &str) -> Result<Option<usize>>;

/// Count the rows that are indexed by the given index.
///
/// TODO: move to [DatasetInternalExt]
async fn count_indexed_rows(&self, index_name: &str) -> Result<Option<usize>>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these methods return Result<Option<...>> which is a little confusing. We should document what None means in each case because it is slightly different depending on the function. Or, we could probably change most of these to just Result<...> and raise an error in the None case.

load_index_by_name -> No index exists with that name, maybe just error here?
load_scalar_index_for_column -> No index exists for that column, should definitely error here looking at usage
index_statistics -> No index exists with that name, we should error here
count_unindexed_rows -> An index exists, but we couldn't determine the row count because of old manifest version
count_indexed_rows -> An index exists, but we couldn't determine the row count because of old manifest version

For example, looking at the python version of count_unindexed_rows, it is wrong:

    fn count_unindexed_rows(&self, index_name: String) -> PyResult<Option<usize>> {
        let idx = RT.block_on(None, self.ds.load_index_by_name(index_name.as_str()))?;
        if let Some(index) = idx {
            RT.block_on(
                None,
                self.ds
                    .count_unindexed_rows(index.uuid.to_string().as_str()),
            )?
            .map_err(|err| PyIOError::new_err(err.to_string()))
        } else {
            THIS IS NOT THE CORRECT ERROR MESSAGE
            Err(PyIOError::new_err(format!(
                "Index {} not found",
                index_name
            )))
        }
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was copied from the previous code. I meant to fix them in the next follow up.

Also try to eliminate these single used APIs and just put them into the json blob in index_stats.

@eddyxu eddyxu merged commit 01b1f3d into main Jan 14, 2024
@eddyxu eddyxu deleted the lei/idx_stats branch January 14, 2024 18:02
eddyxu added a commit that referenced this pull request Jan 14, 2024
Clean pu leftovers from #1828
eddyxu added a commit that referenced this pull request Jan 16, 2024
BREAKING CHANGE: removed single-purpose stats API from public API and
refactored `DatasetIndexExt` to `lance-index`.

Also, fixed a few places that `unwrap()` results.
eddyxu added a commit that referenced this pull request Jan 16, 2024
Clean pu leftovers from #1828
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants