New Binary/String type #13459
Comments
Hi @ritchie46, I believe this issue can be closed, as the functionality has been merged and released.
Yep, thanks.
Is this available in the Python API? I don't see any references to the type.
It is available. Our string column type is backed by this. Polars keeps it simple: we have one string type, and you use it automatically.
If I understand correctly, this type can/should intern large strings, providing some level of "dictionary encoding".

import polars as pl

large_string = 'a' * 10_000
data = [large_string] * 100_000

df = pl.DataFrame({'x': data}, schema={'x': pl.String})
print(df.estimated_size())  # 1000000000

df = pl.DataFrame({'x': data}, schema={'x': pl.Categorical})
print(df.estimated_size())  # 410000

Is this expected? Am I misunderstanding the type?
Yes, we can, though there is a cost to checking for interning upon construction. Currently we only do that if you extend internally. Later we will also look into checking for interning until a limited size is reached.
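To make the trade-off concrete, here is a minimal pure-Python sketch of what "interning checking until a limited size is reached" could look like. It is not how Polars implements it: the function name `build_with_bounded_interning`, the `max_tracked_bytes` parameter, and the choice to bound the lookup table by tracked bytes are all assumptions made for illustration.

```python
def build_with_bounded_interning(values, max_tracked_bytes=1 << 20):
    """Hypothetical builder: deduplicate strings while the lookup table is small.

    Returns (buffer, spans), where spans holds one (offset, length) pair per row.
    """
    buffer = bytearray()   # shared storage for all string bytes
    spans = []             # (offset, length) per row, pointing into buffer
    seen = {}              # bytes -> offset of the first stored copy
    tracked = 0            # bytes currently remembered in `seen`

    for value in values:
        data = value.encode("utf-8")
        offset = seen.get(data)            # the "interning check" (costs a hash lookup)
        if offset is None:
            offset = len(buffer)
            buffer += data
            # Only remember this string while the table is under budget
            # (assumed heuristic, not taken from Polars).
            if tracked + len(data) <= max_tracked_bytes:
                seen[data] = offset
                tracked += len(data)
        spans.append((offset, len(data)))
    return buffer, spans


# With a budget, 50 identical strings are stored once; with no budget, 50 times.
interned, _ = build_with_bounded_interning(["x" * 100] * 50)
not_interned, _ = build_with_bounded_interning(["x" * 100] * 50, max_tracked_bytes=0)
```

The hash lookup on every row is the construction cost mentioned above; capping the table keeps that cost and its memory bounded when the data turns out to have few duplicates.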
Indeed:

import polars as pl

large_string = 'a' * 10_000
data = [large_string] * 100_000

df = pl.DataFrame({'x': data}, schema={'x': pl.String})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 1000000000

df = pl.DataFrame({'x': data}, schema={'x': pl.Categorical})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 410000

s = pl.Series('x', [large_string], pl.String)
s = s.new_from_index(0, len(data))
df = pl.DataFrame({'x': s})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 10000

It's even better than a dictionary / categorical column! Agreed, it would be nice to do this heuristically when constructing, or at least to be able to force it on if I know my data has a lot of duplication of large strings.
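For readers wondering where the first two reported sizes come from, the arithmetic below reproduces them. The breakdown of the categorical figure (4-byte category codes plus one stored copy of the string) is an assumption that happens to match the reported number, not something confirmed in this thread.

```python
n_rows = 100_000
string_len = 10_000  # bytes in 'a' * 10_000

# Every row stores its own copy of the string bytes.
plain_string_size = n_rows * string_len

# Assumed layout: one 4-byte code per row plus a single copy of the string.
categorical_size = n_rows * 4 + string_len

print(plain_string_size)  # 1000000000
print(categorical_size)   # 410000
```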
The goal is to replace the current Arrow (Large)String type with a string type in which each value is a union of either an inlined small string or an offset to a string that is allocated somewhere else.
This would prevent the terrible performance we have when filtering/gathering large string data, as that currently forces a copy of all bytes. Second, this type also allows string interning: duplicates can be stored only once in the buffer, and we can then point to that string multiple times.
Relevant arrow discussion here: https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
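The layout described above can be sketched in pure Python. This is a toy model, not the polars-arrow implementation: the class `StringViewColumn` and its methods are hypothetical, and it uses a single data buffer for simplicity. The 16-byte view (a 4-byte length followed by either 12 inline bytes, or a 4-byte prefix, buffer index, and offset) follows the Arrow view layout.

```python
import struct

INLINE_MAX = 12  # bytes that fit inline in a 16-byte view


class StringViewColumn:
    """Toy view-based string column (hypothetical, for illustration only)."""

    def __init__(self):
        self.buffer = bytearray()  # storage for strings too long to inline
        self.views = []            # one 16-byte view per row
        self._interned = {}        # bytes -> offset, so duplicates are stored once

    def append(self, value: str) -> None:
        data = value.encode("utf-8")
        n = len(data)
        if n <= INLINE_MAX:
            # Short string: length + the bytes themselves, zero-padded to 12.
            view = struct.pack("<I12s", n, data)
        else:
            offset = self._interned.get(data)
            if offset is None:
                offset = len(self.buffer)
                self.buffer += data
                self._interned[data] = offset
            # Long string: length + 4-byte prefix + buffer index (always 0 here) + offset.
            view = struct.pack("<I4sII", n, data[:4], 0, offset)
        self.views.append(view)

    def get(self, i: int) -> str:
        view = self.views[i]
        (n,) = struct.unpack_from("<I", view)
        if n <= INLINE_MAX:
            return view[4:4 + n].decode("utf-8")
        _buffer_index, offset = struct.unpack_from("<II", view, 8)
        return bytes(self.buffer[offset:offset + n]).decode("utf-8")

    def gather(self, indices):
        """Select rows by copying 16-byte views only; string bytes are shared."""
        out = StringViewColumn()
        out.buffer = self.buffer
        out._interned = self._interned
        out.views = [self.views[i] for i in indices]
        return out


col = StringViewColumn()
big = "a" * 10_000
for _ in range(1_000):
    col.append(big)       # stored once, referenced 1000 times
col.append("short")       # stored inline in its view
picked = col.gather([0, 1_000, 5])
print(len(col.buffer))    # 10000
```

In this sketch a gather touches only the fixed-size views, which is the property that avoids copying all string bytes on filter/gather, and the `_interned` lookup is what lets duplicates share one stored copy.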
- polars-arrow
- feat: implement BinaryView and Utf8View in polars-arrow #13243
- BinaryView/Utf8View IPC support #13464
- BinaryView to parquet writer/reader. #13489
- Elide utf8-validation in OOC (we mmap)
- Optional: change avro to new type (otherwise pay conversion cost until implemented) (Pay cost)