refactor!: index stats to handle error and bunch of bugs #1828
ACTION NEEDED Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action.
    /// Trait for a Lance Dataset
    pub trait Dataset {}
    }

    async fn load_indices(&self) -> Result<Vec<IndexMetadata>> {
        let manifest_file = self.manifest_file(self.version().version).await?;
Why self.manifest_file(...) and not self.manifest?
`self.manifest_file` returns the file path. The index is lazily loaded, so it opens the file again and reads the indices, iiuc.
    async fn load_indices(&self) -> Result<Vec<IndexMetadata>> {
        let manifest_file = self.manifest_file(self.version().version).await?;
        read_manifest_indexes(&self.object_store, &manifest_file, &self.manifest).await
Ah, I did not realize that we have to open this file for every query.
    async fn load_index(&self, uuid: &str) -> Result<Option<Index>> {
        self.load_indices()
            .await
            .map(|indices| indices.into_iter().find(|idx| idx.uuid.to_string() == uuid))
    }

    /// Loads a specific index with the given index name
    async fn load_index_by_name(&self, name: &str) -> Result<Option<Index>> {
        self.load_indices()
            .await
            .map(|indices| indices.into_iter().find(|idx| idx.name == name))
    }

    /// Loads a specific index with the given index name.
    async fn load_scalar_index_for_column(&self, col: &str) -> Result<Option<Index>>;

    /// Optimize indices.
    async fn optimize_indices(&mut self) -> Result<()>;

    /// Find index with a given index_name and return its serialized statistics.
    async fn index_statistics(&self, index_name: &str) -> Result<Option<String>>;

    /// Count the rows that are not indexed by the given index.
    ///
    /// TODO: move to [DatasetInternalExt]
    async fn count_unindexed_rows(&self, index_name: &str) -> Result<Option<usize>>;

    /// Count the rows that are indexed by the given index.
    ///
    /// TODO: move to [DatasetInternalExt]
    async fn count_indexed_rows(&self, index_name: &str) -> Result<Option<usize>>;
All of these methods return `Result<Option<...>>`, which is a little confusing. We should document what `None` means in each case because it is slightly different depending on the function. Or, we could probably change most of these to just `Result<...>` and raise an error in the `None` case.
- `load_index_by_name` -> No index exists with that name, maybe just error here?
- `load_scalar_index_for_column` -> No index exists for that column, should definitely error here looking at usage
- `index_statistics` -> No index exists with that name, we should error here
- `count_unindexed_rows` -> An index exists, but we couldn't determine the row count because of old manifest version
- `count_indexed_rows` -> An index exists, but we couldn't determine the row count because of old manifest version
For example, looking at the Python version of `count_unindexed_rows`, it is wrong:

    fn count_unindexed_rows(&self, index_name: String) -> PyResult<Option<usize>> {
        let idx = RT.block_on(None, self.ds.load_index_by_name(index_name.as_str()))?;
        if let Some(index) = idx {
            RT.block_on(
                None,
                self.ds
                    .count_unindexed_rows(index.uuid.to_string().as_str()),
            )?
            .map_err(|err| PyIOError::new_err(err.to_string()))
        } else {
            // THIS IS NOT THE CORRECT ERROR MESSAGE
            Err(PyIOError::new_err(format!(
                "Index {} not found",
                index_name
            )))
        }
    }
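
The suggestion above could look roughly like the following. This is a minimal sketch, not the Lance API: `StatsError`, `IndexMeta`, and `find_index` are hypothetical names made up for illustration, and the real trait methods are async and use the crate's own error type. The idea is that "no index with that name" becomes an error, so `None` keeps a single documented meaning (the index exists, but the row count is unknown).

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the real crate has its own error enum.
#[derive(Debug, PartialEq)]
enum StatsError {
    IndexNotFound(String),
}

impl fmt::Display for StatsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StatsError::IndexNotFound(name) => write!(f, "Index {} not found", name),
        }
    }
}

impl std::error::Error for StatsError {}

/// Hypothetical stand-in for index metadata.
#[derive(Debug, PartialEq)]
struct IndexMeta {
    name: String,
    /// `None` models an old manifest that does not record the row count.
    indexed_rows: Option<usize>,
}

/// A missing index is reported as an error instead of `Ok(None)`.
fn find_index<'a>(indices: &'a [IndexMeta], name: &str) -> Result<&'a IndexMeta, StatsError> {
    indices
        .iter()
        .find(|idx| idx.name == name)
        .ok_or_else(|| StatsError::IndexNotFound(name.to_string()))
}

/// `Ok(None)` now only means: the index exists, but its row count is unknown.
fn count_indexed_rows(indices: &[IndexMeta], name: &str) -> Result<Option<usize>, StatsError> {
    Ok(find_index(indices, name)?.indexed_rows)
}
```

With this shape, a caller such as the Python binding would not need a second lookup just to produce the "not found" message, since the error already carries it.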
This was copied from the previous code. I meant to fix them in the next follow-up.
I will also try to eliminate these single-use APIs and just put them into the JSON blob in index_stats.
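
As a rough illustration of folding the single-purpose counters into one stats blob, here is a sketch under stated assumptions: the `IndexStats` struct, its field names, and the `index_stats` helper are hypothetical, and the JSON is built by hand only to keep the example dependency-free (a real implementation would more likely use a JSON library).

```rust
/// Hypothetical stats record; the field names are illustrative only.
struct IndexStats {
    num_indexed_rows: usize,
    num_unindexed_rows: usize,
}

impl IndexStats {
    /// Serialize both counters into a single JSON object.
    fn to_json(&self) -> String {
        format!(
            "{{\"num_indexed_rows\":{},\"num_unindexed_rows\":{}}}",
            self.num_indexed_rows, self.num_unindexed_rows
        )
    }
}

/// Derive both counters from the dataset row count and the indexed row count.
/// `saturating_sub` avoids underflow if the inputs are inconsistent.
fn index_stats(total_rows: usize, indexed_rows: usize) -> IndexStats {
    IndexStats {
        num_indexed_rows: indexed_rows,
        num_unindexed_rows: total_rows.saturating_sub(indexed_rows),
    }
}
```

Callers that previously used separate count methods would then read both numbers from the one serialized string.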
BREAKING CHANGE: removed single-purpose stats API from public API and refactored `DatasetIndexExt` to `lance-index`. Also, fixed a few places that `unwrap()` results.