wip: support fields within FSL columns#1693

Closed
wjones127 wants to merge 7 commits intolance-format:mainfrom
wjones127:feat/support-fsl-subfields

@wjones127 wjones127 commented Dec 7, 2023

This PR changes FixedSizeList data types to store child fields, as List and LargeList already do. This is a breaking change to Lance schemas and page tables, so special handling is provided to keep compatibility with older versions whenever possible. However, if the data uses types that can't be represented in the old schema system (such as fixed_size_list<bfloat16, N>), we set a feature flag that tells older readers they can't process the dataset.

Changes to schema

Currently, List and LargeList store child fields, but FixedSizeList does not. This has several consequences:

  1. FixedSizeList elements cannot have any metadata, including extension metadata. This means they cannot contain extension types like bfloat16.
  2. There is no field id associated with the child field.

When the new feature flag is present, we write these child fields.
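
To make the schema difference concrete, here is a minimal pure-Python sketch. It is not Lance's actual data model: the `Field` class, the `assign_ids` helper, the type strings, and the `lance.bfloat16` extension name are all assumptions made for illustration (only the `ARROW:extension:name` metadata key comes from the Arrow format). It shows how giving the FSL a child field provides both a field id and a place to hang extension metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a schema field (not Lance's real class)."""
    name: str
    logical_type: str
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List["Field"] = field(default_factory=list)
    id: Optional[int] = None


def assign_ids(fields: List[Field], next_id: int = 0) -> int:
    """Assign ids depth-first and return the next unused id."""
    for f in fields:
        f.id = next_id
        next_id = assign_ids(f.children, next_id + 1)
    return next_id


# Old layout: the FSL is a leaf; the element type lives only in the type string,
# so there is nowhere to store per-element metadata.
old_schema = [Field("id", "int64"), Field("vec", "fixed_size_list:float:4")]

# New layout: the FSL owns a child field that can carry extension metadata.
# The storage type and extension name below are illustrative assumptions.
item = Field(
    "item",
    "fixed_size_binary:2",
    metadata={"ARROW:extension:name": "lance.bfloat16"},
)
new_schema = [Field("id", "int64"), Field("vec", "fixed_size_list:4", children=[item])]

old_count = assign_ids(old_schema)  # number of field ids in the old layout
new_count = assign_ids(new_schema)  # one more: the child field gets its own id
```

In this sketch the new layout uses one extra field id per FSL column, which is why the page table layout has to change as well.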

Changes to page table

The page table maps each field id to a list of pages. It covers a contiguous range of field ids, so any gap in the ids shows up as blank pages (pages with zero length) in the page table. Struct columns are present in the table but have blank pages as well.
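
As a rough illustration of the contiguous-id behaviour, the sketch below builds a dense table from a sparse mapping. The `(offset, length)` page representation, the fixed number of pages per field, and the `dense_page_table` helper are assumptions for this example, not Lance's on-disk format.

```python
from typing import Dict, List, Tuple

Page = Tuple[int, int]  # (file offset, length in bytes); length 0 means blank


def dense_page_table(pages: Dict[int, List[Page]], pages_per_field: int) -> List[List[Page]]:
    """Return one row per field id, from the smallest to the largest id used.

    Ids with no data (gaps, or struct parents) get a row of blank pages.
    """
    blank_row = [(0, 0)] * pages_per_field
    first, last = min(pages), max(pages)
    return [list(pages.get(fid, blank_row)) for fid in range(first, last + 1)]


# Field ids 0 and 3 hold data; ids 1 and 2 are unused, so they become blank rows.
table = dense_page_table(
    {0: [(0, 64), (64, 64)], 3: [(128, 32), (160, 32)]},
    pages_per_field=2,
)
```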

For List and LargeList, their entries in the page table point to their offset buffers. Meanwhile, the child fields point to the value buffers.

FixedSizeList currently has no child field, but it also has no offsets, so its page table entry points directly to the value buffers.

This PR changes this so that, if the feature flag is present, FixedSizeList list-level page table entries are empty (just like struct) and the value fields point to the value buffers. This more flexible layout will allow nested types inside a fixed size list.
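
The layouts described in this section can be summarized in a small lookup. This is only a sketch of the description above: the function name, the `"parent"`/`"child"` keys, and the string labels are hypothetical and do not correspond to Lance internals.

```python
from typing import Dict


def page_table_targets(column_type: str, fsl_child_fields: bool = False) -> Dict[str, str]:
    """Describe what each page table entry of a nested column points at.

    `fsl_child_fields` models whether the new feature flag is present.
    """
    if column_type in ("list", "large_list"):
        # The list-level entry holds offsets; the child field holds values.
        return {"parent": "offsets", "child": "values"}
    if column_type == "struct":
        # Struct entries are blank; each child field has its own entry.
        return {"parent": "blank"}
    if column_type == "fixed_size_list":
        if fsl_child_fields:
            # New layout: blank list-level entry, values on the child field.
            return {"parent": "blank", "child": "values"}
        # Old layout: no child field and no offsets, so the entry holds values.
        return {"parent": "values"}
    raise ValueError(f"not a nested column type: {column_type}")


old_fsl = page_table_targets("fixed_size_list")
new_fsl = page_table_targets("fixed_size_list", fsl_child_fields=True)
```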

Testing plan

  • We validate that newer versions of Lance correctly read data written by Lance version 0.8.19, which predates this schema change. This is tested with a new table in the test_data folder.
  • We validate that when writing compatible schemas, we write schemas in the old format.
  • We validate that when writing an incompatible type, we write schemas in the new format and set the feature flag.
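
The last two cases amount to a writer-side decision plus a reader-side check, sketched below. Everything named here is hypothetical: the flag constant and its bit value, the set of element types the old format can express, and both helper functions are assumptions for illustration, not the PR's implementation.

```python
from typing import Iterable, Tuple

# Hypothetical bit for "FSL columns store child fields".
FLAG_FSL_CHILD_FIELDS = 1 << 1

# Illustrative set of FSL element types the old schema format can represent.
LEGACY_FSL_ELEMENT_TYPES = {"float32", "float64"}


def choose_format(fsl_element_types: Iterable[str]) -> Tuple[str, int]:
    """Return (schema format, feature flags) for a dataset being written."""
    needs_new = any(t not in LEGACY_FSL_ELEMENT_TYPES for t in fsl_element_types)
    if needs_new:
        return "new", FLAG_FSL_CHILD_FIELDS
    # Compatible schemas keep the old format so older readers still work.
    return "old", 0


def can_read(dataset_flags: int, supported_flags: int) -> bool:
    """A reader must refuse datasets that set any flag it does not know."""
    return (dataset_flags & ~supported_flags) == 0


compatible = choose_format(["float32"])
incompatible = choose_format(["float32", "bfloat16"])
```

Under these assumptions, an older reader that does not know the flag rejects only the dataset written in the new format, while datasets with compatible schemas remain readable.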

Closes #1684
Closes #1293

github-actions bot commented Dec 7, 2023

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127 wjones127 force-pushed the feat/support-fsl-subfields branch from 2fed5a4 to 2b64080 Compare January 8, 2024 19:14
@wjones127 commented

Validated backwards compatibility locally. If we rewrite the new test table with these changes, I can still read this table using pylance==0.8.19. If I write out a table with a bfloat16 vector column, trying to read with that old pylance version gives the error:

>>> import lance
>>> ds = lance.dataset("test_data_bf16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/willjones/Documents/lance/python/test_env/lib/python3.10/site-packages/lance/__init__.py", line 92, in dataset
    ds = LanceDataset(
  File "/Users/willjones/Documents/lance/python/test_env/lib/python3.10/site-packages/lance/dataset.py", line 95, in __init__
    self._ds = _Dataset(
ValueError: Not supported: This dataset cannot be read by this version of Lance. Please upgrade Lance to read this dataset.
 Flags: 2, /Users/runner/work/lance/lance/rust/lance/src/dataset.rs:323:27

@wjones127 wjones127 force-pushed the feat/support-fsl-subfields branch from 32cbefd to 2553e13 Compare January 10, 2024 20:30
@wjones127 commented

This PR is getting too big, and I think it would be difficult to finish without refactoring how the page table is handled. Therefore, I've created a separate issue for fixing the page table (#1809) and will pause work on this PR.

I'll be looking for an alternative path to support bfloat16 vectors.

@wjones127 wjones127 closed this Jan 15, 2024
wjones127 added a commit that referenced this pull request Jan 15, 2024
I've given up for now on supporting generic extension types in FSL
(#1693), so for now we'll have a special case for bfloat16 where we have
a specific string that refers to our extension type.

This will unblock further development on vector search with bfloat16,
but there is also still substantial UX work before we want to advertise
this to users.

Closes #1684.
eddyxu pushed a commit with the same message that referenced this pull request Jan 16, 2024
@wjones127 wjones127 deleted the feat/support-fsl-subfields branch August 6, 2025 18:31
Linked issues: "Unsupported Logical Type when writing a FixedSizeList" and "Support bf16 vector columns"