wip: support fields within FSL columns #1693
wjones127 wants to merge 7 commits into lance-format:main from
ACTION NEEDED: Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Force-pushed from 2fed5a4 to 2b64080.
Validated backwards compatibility locally. If we rewrite the new test table with these changes, I can still read this table using
Force-pushed from 32cbefd to 2553e13.
This PR is getting too big, and I think it would be difficult to finish without refactoring how the page table is handled. Therefore, I've created a separate issue for fixing the page table (#1809) and will pause work on this PR. I'll be looking for an alternative path to support bfloat16 vectors.
I've given up for now on supporting generic extension types in FSL (#1693), so for now we'll have a special case for bfloat16 where we have a specific string that refers to our extension type. This will unblock further development on vector search with bfloat16, but there is also still substantial UX work before we want to advertise this to users. Closes #1684.
This PR changes FixedSizeList data types to store child fields, like we do for List and LargeList. This is a breaking change to Lance schemas and page tables, so special handling is provided to keep compatibility with older versions whenever possible. However, if certain data types are used that can't be represented in the old schema system (such as fixed_size_list<bfloat16, N>), then we add a feature flag that tells older readers they can't process this dataset.
Changes to schema
Currently, List and LargeList store child fields, but FixedSizeList does not. This has several consequences:
When the new feature flag is present, we write these child fields.
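As a rough illustration of the schema change, the sketch below models a field tree in plain Python. The names `Field`, `fsl`, `OLD_REPRESENTABLE`, and `needs_feature_flag` are hypothetical and are not Lance APIs, and the set of element types the old schema could express is an assumption made for the example. It only shows the rule described above: the flag is needed when a FixedSizeList has a child that the old, childless representation could not describe.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption for illustration: element types the old (childless) FSL
# representation could express. The real set lives in the Lance schema code.
OLD_REPRESENTABLE = {"float16", "float32", "float64", "uint8"}


@dataclass
class Field:
    name: str
    dtype: str                 # e.g. "fixed_size_list", "struct", "float32"
    list_size: int = 0         # only meaningful for fixed_size_list
    children: List["Field"] = field(default_factory=list)


def fsl(name: str, value_type: str, size: int) -> Field:
    """Build a FixedSizeList field that stores its value type as a child field."""
    return Field(name, "fixed_size_list", size, [Field("item", value_type)])


def needs_feature_flag(f: Field) -> bool:
    """Return True if any FSL in the tree has a child the old schema cannot express."""
    if f.dtype == "fixed_size_list":
        child = f.children[0]
        # A nested child, or an element type outside the old set, requires the flag.
        if child.dtype not in OLD_REPRESENTABLE or child.children:
            return True
    return any(needs_feature_flag(c) for c in f.children)
```

Under these assumptions, `needs_feature_flag(fsl("vec", "bfloat16", 128))` returns `True`, while the same column with `float32` elements stays readable by older versions.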
Changes to page table
The page table contains a mapping of field ids to a list of pages. The field ids are contiguous, so if there are any gaps in the ids, then there will be blank pages (pages with zero length) in the page table. Struct columns are present but have blank pages as well.
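To make the blank-page behavior concrete, here is a minimal sketch of a dense page table in plain Python. The `(offset, length)` page representation, the `(0, 0)` blank page, and the `build_page_table` helper are assumptions for illustration, not the actual Lance on-disk format.

```python
from typing import Dict, List, Tuple

Page = Tuple[int, int]  # (offset, length); a length of 0 marks a blank page


def build_page_table(
    pages_by_field: Dict[int, List[Page]], num_batches: int
) -> List[List[Page]]:
    """Return one row per field id from the smallest to the largest id.

    Field ids with no data (gaps, or struct parents) get blank pages so that
    a row can be located by its position relative to the smallest id.
    """
    lo, hi = min(pages_by_field), max(pages_by_field)
    blank = [(0, 0)] * num_batches
    return [pages_by_field.get(fid, list(blank)) for fid in range(lo, hi + 1)]
```

For example, with data only for field ids 0 and 2, the row for id 1 is filled with zero-length pages.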
For List and LargeList, their entries in the page table point to their offset buffers. Meanwhile, the child fields point to the value buffers.
For FixedSizeList, there is no child field currently, but there are also no offsets. So the FixedSizeList page table entry points to the value buffers.
This PR changes this so that, if the feature flag is present, FixedSizeList list-level page table entries are empty (just like struct) and there are value fields that point to their buffers. This more flexible layout will allow nested types inside a fixed size list.
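The three layouts described in this section can be summarized in a small sketch. The `layout_pages` function, the `"parent"`/`"child"` keys, and the `(offset, length)` page tuples are hypothetical names chosen for this example; only the mapping of entries to buffers follows the description above.

```python
from typing import Dict, Optional, Tuple

Page = Tuple[int, int]  # (offset, length)
BLANK: Page = (0, 0)    # zero-length page, as used for struct columns


def layout_pages(
    kind: str,
    offsets: Optional[Page],
    values: Page,
    child_fields_flag: bool,
) -> Dict[str, Page]:
    """Return which buffer each page table entry points to for one list-like column."""
    if kind in ("list", "large_list"):
        # The list-level entry points to offsets, the child entry to values.
        return {"parent": offsets, "child": values}
    if kind == "fixed_size_list":
        if child_fields_flag:
            # New layout: list-level entry is blank, child entry holds values.
            return {"parent": BLANK, "child": values}
        # Old layout: no child field, so the list-level entry holds values.
        return {"parent": values}
    raise ValueError(f"unsupported kind: {kind}")
```

With the flag set, a FixedSizeList column looks like a struct with one child from the page table's point of view, which is what makes nested value types possible.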
Testing plan
test_data folder.
Closes #1684
Closes #1293