refactor!: index stats to handle error and bunch of bugs by eddyxu · Pull Request #1828 · lance-format/lance

eddyxu · 2024-01-13T22:29:18Z

BREAKING CHANGE: removed the single-purpose stats APIs from the public API and moved DatasetIndexExt to lance-index.

Also fixed a few places that called unwrap() on results.

github-actions · 2024-01-13T22:29:42Z

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

westonpace · 2024-01-14T14:24:56Z

rust/lance-core/src/lib.rs

+
+/// Trait for a Lance Dataset
+pub trait Dataset {}


Sorry, that's a leftover.

westonpace · 2024-01-14T14:36:39Z

rust/lance/src/index.rs

    }

+    async fn load_indices(&self) -> Result<Vec<IndexMetadata>> {
+        let manifest_file = self.manifest_file(self.version().version).await?;


Why self.manifest_file(...) and not self.manifest?

self.manifest_file returns the file path. The indices are lazily loaded, so it opens the file again and reads the indices, IIUC.

westonpace · 2024-01-14T14:53:32Z

rust/lance/src/index.rs


+    async fn load_indices(&self) -> Result<Vec<IndexMetadata>> {
+        let manifest_file = self.manifest_file(self.version().version).await?;
+        read_manifest_indexes(&self.object_store, &manifest_file, &self.manifest).await


This reminds me (#1657): we should cache this.

Ah, I did not realize that we have to open this file for every query.
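As a rough illustration of the caching idea raised here, the sketch below keeps one index list per dataset version so the manifest read happens only on a cache miss. Everything in it is an assumption made for illustration: `IndexMeta`, `IndexCache`, and `get_or_load` are hypothetical names, the loader closure stands in for the real manifest read, and the code is synchronous for brevity. It is not how Lance implements or plans to implement #1657.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the index metadata stored in a manifest.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexMeta {
    pub name: String,
}

/// Caches the index list per dataset version, so the manifest file is
/// read at most once per version instead of once per query.
#[derive(Default)]
pub struct IndexCache {
    by_version: Mutex<HashMap<u64, Arc<Vec<IndexMeta>>>>,
}

impl IndexCache {
    /// Returns the cached list for `version`, calling `load` only on a miss.
    /// `load` stands in for the real manifest read; failed loads are not cached.
    /// Simplification: the lock is held while loading, which is acceptable in
    /// this synchronous sketch but would need rethinking in async code.
    pub fn get_or_load<F>(&self, version: u64, load: F) -> Result<Arc<Vec<IndexMeta>>, String>
    where
        F: FnOnce() -> Result<Vec<IndexMeta>, String>,
    {
        let mut cache = self.by_version.lock().map_err(|e| e.to_string())?;
        if let Some(hit) = cache.get(&version) {
            return Ok(Arc::clone(hit));
        }
        let loaded = Arc::new(load()?);
        cache.insert(version, Arc::clone(&loaded));
        Ok(loaded)
    }
}
```

Keying by version relies on the assumption that the index list of a given dataset version never changes once written, so entries never need invalidation.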

westonpace · 2024-01-14T15:04:26Z

rust/lance-index/src/traits.rs

+    async fn load_index(&self, uuid: &str) -> Result<Option<Index>> {
+        self.load_indices()
+            .await
+            .map(|indices| indices.into_iter().find(|idx| idx.uuid.to_string() == uuid))
+    }
+
+    /// Loads a specific index with the given index name
+    async fn load_index_by_name(&self, name: &str) -> Result<Option<Index>> {
+        self.load_indices()
+            .await
+            .map(|indices| indices.into_iter().find(|idx| idx.name == name))
+    }
+
+    /// Loads a specific index with the given index name.
+    async fn load_scalar_index_for_column(&self, col: &str) -> Result<Option<Index>>;
+
+    /// Optimize indices.
+    async fn optimize_indices(&mut self) -> Result<()>;
+
+    /// Find index with a given index_name and return its serialized statistics.
+    async fn index_statistics(&self, index_name: &str) -> Result<Option<String>>;
+
+    /// Count the rows that are not indexed by the given index.
+    ///
+    /// TODO: move to [DatasetInternalExt]
+    async fn count_unindexed_rows(&self, index_name: &str) -> Result<Option<usize>>;
+
+    /// Count the rows that are indexed by the given index.
+    ///
+    /// TODO: move to [DatasetInternalExt]
+    async fn count_indexed_rows(&self, index_name: &str) -> Result<Option<usize>>;


All of these methods return Result<Option<...>> which is a little confusing. We should document what None means in each case because it is slightly different depending on the function. Or, we could probably change most of these to just Result<...> and raise an error in the None case.

load_index_by_name -> No index exists with that name; maybe just error here?
load_scalar_index_for_column -> No index exists for that column; should definitely error here, looking at usage.
index_statistics -> No index exists with that name; we should error here.
count_unindexed_rows -> An index exists, but we couldn't determine the row count because of an old manifest version.
count_indexed_rows -> An index exists, but we couldn't determine the row count because of an old manifest version.

For example, looking at the python version of count_unindexed_rows, it is wrong:

    fn count_unindexed_rows(&self, index_name: String) -> PyResult<Option<usize>> {
        let idx = RT.block_on(None, self.ds.load_index_by_name(index_name.as_str()))?;
        if let Some(index) = idx {
            RT.block_on(
                None,
                self.ds
                    .count_unindexed_rows(index.uuid.to_string().as_str()),
            )?
            .map_err(|err| PyIOError::new_err(err.to_string()))
        } else {
            // THIS IS NOT THE CORRECT ERROR MESSAGE
            Err(PyIOError::new_err(format!(
                "Index {} not found",
                index_name
            )))
        }
    }
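One possible shape for the suggestion above is sketched here: a missing index becomes an explicit error, which leaves Ok(None) with a single meaning (the index exists but its row count is unknown). The names `StatsError`, `IndexMeta`, and `find_index` are hypothetical and the functions are synchronous for brevity; this is a sketch under those assumptions, not the signature the PR or a follow-up adopted.

```rust
use std::fmt;

/// Hypothetical error type; the real crate has its own error enum.
#[derive(Debug, PartialEq)]
pub enum StatsError {
    IndexNotFound { name: String },
}

impl fmt::Display for StatsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StatsError::IndexNotFound { name } => write!(f, "index '{}' not found", name),
        }
    }
}

impl std::error::Error for StatsError {}

/// Hypothetical minimal index record.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexMeta {
    pub name: String,
    /// `None` models an old manifest that does not record row coverage.
    pub indexed_rows: Option<usize>,
}

/// Lookup that returns an error when the index is missing,
/// instead of folding that case into `Ok(None)`.
pub fn find_index<'a>(indices: &'a [IndexMeta], name: &str) -> Result<&'a IndexMeta, StatsError> {
    indices
        .iter()
        .find(|idx| idx.name == name)
        .ok_or_else(|| StatsError::IndexNotFound { name: name.to_string() })
}

/// `Ok(None)` now means exactly one thing: the index exists,
/// but the row count could not be determined.
pub fn count_indexed_rows(indices: &[IndexMeta], name: &str) -> Result<Option<usize>, StatsError> {
    Ok(find_index(indices, name)?.indexed_rows)
}
```

With this split, a binding layer can map `IndexNotFound` to its own exception and does not have to guess which condition a `None` stood for.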

This was copied from the previous code. I meant to fix them in the next follow-up.

I'll also try to eliminate these single-use APIs and just put them into the JSON blob in index_stats.
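A minimal sketch of what folding the two row counts into the statistics blob could look like is given below. The struct, the field names `num_indexed_rows` and `num_unindexed_rows`, and the hand-written JSON formatting are all assumptions made to keep the example dependency-free; a real implementation would more likely use a JSON library and whatever field names the project settles on.

```rust
/// Hypothetical stats record; the field names are illustrative only.
pub struct IndexStats {
    pub num_indexed_rows: Option<usize>,
    pub num_unindexed_rows: Option<usize>,
}

/// Renders an optional count as a JSON number, or `null` when unknown.
fn json_number(value: Option<usize>) -> String {
    value.map_or_else(|| "null".to_string(), |n| n.to_string())
}

impl IndexStats {
    /// Derives both counts from the dataset row total. When the indexed
    /// count is unknown (e.g. an old manifest), both fields stay `None`.
    pub fn from_counts(total_rows: usize, indexed_rows: Option<usize>) -> Self {
        Self {
            num_indexed_rows: indexed_rows,
            num_unindexed_rows: indexed_rows.map(|n| total_rows.saturating_sub(n)),
        }
    }

    /// Serializes the record by hand; only numbers and `null` are emitted,
    /// so no string escaping is needed.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"num_indexed_rows\":{},\"num_unindexed_rows\":{}}}",
            json_number(self.num_indexed_rows),
            json_number(self.num_unindexed_rows)
        )
    }
}
```

Callers that previously needed count_indexed_rows or count_unindexed_rows could then read both values from the one serialized result, with `null` marking the unknown case.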

buildable

e98e592

eddyxu changed the title ~~refactor: refactor index stats to handle error and bunch of bugs.~~ BREAKING CHANGE: refactor index stats to handle error and bunch of bugs. Jan 13, 2024

eddyxu added 6 commits January 13, 2024 15:18

move dataset ext trait to lance-index

dc7568a

build traits

fecf656

s

65d39c0

clean rust build

1c36c9e

make rust test pass

7b2c243

fmt

379e7e2

eddyxu changed the title ~~BREAKING CHANGE: refactor index stats to handle error and bunch of bugs.~~ BREAKING CHANGE: refactor index stats to handle error and bunch of bugs Jan 14, 2024

do not unwrap!!

38830b9

eddyxu changed the title ~~BREAKING CHANGE: refactor index stats to handle error and bunch of bugs~~ refactor!: index stats to handle error and bunch of bugs Jan 14, 2024

eddyxu requested review from westonpace and wjones127 January 14, 2024 01:08

eddyxu marked this pull request as ready for review January 14, 2024 01:11

fix clippy

79e91e9

westonpace approved these changes Jan 14, 2024

eddyxu merged commit 01b1f3d into main Jan 14, 2024

eddyxu deleted the lei/idx_stats branch January 14, 2024 18:02

eddyxu mentioned this pull request Jan 14, 2024

chore: remove empty trait #1830

Merged

eddyxu added a commit that referenced this pull request Jan 14, 2024

chore: remove empty trait (#1830)

cd14ec3

Clean up leftovers from #1828

eddyxu added a commit that referenced this pull request Jan 16, 2024

chore: remove empty trait (#1830)

27f5a47

Clean up leftovers from #1828

eddyxu merged 9 commits into main from lei/idx_stats