New Binary/String type #13459

ritchie46 · 2024-01-05T12:30:01Z

The goal is to replace the current Arrow (Large)String type with a string type that allows a union between an inlined small string and an offset to a string that is allocated somewhere else.

This would prevent the terrible performance we have when filtering/gathering large string data, as that currently forces a copy of all bytes. Second, this type also allows string interning: duplicates need to be stored only once in the buffer, and we can then point to that string multiple times.
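To make the idea concrete, here is a minimal pure-Python sketch of such a view. It assumes the 16-byte layout from the Arrow discussion linked below: a 4-byte length, followed by either up to 12 inlined bytes, or a 4-byte prefix plus a buffer index and an offset. The names `make_view`, `read_view`, and `INLINE_MAX` are hypothetical and only for illustration; this is not the polars-arrow implementation.

```python
import struct

INLINE_MAX = 12  # assumed: payloads up to 12 bytes fit inline in a 16-byte view


def make_view(value: bytes, buffer: bytearray, buffer_idx: int = 0) -> bytes:
    """Encode one string as a 16-byte view; long payloads are appended to `buffer`."""
    n = len(value)
    if n <= INLINE_MAX:
        # length (u32) + inlined bytes, zero-padded to 12 bytes
        return struct.pack("<I12s", n, value)
    offset = len(buffer)
    buffer.extend(value)
    # length (u32) + 4-byte prefix + buffer index (u32) + offset (u32)
    return struct.pack("<I4sII", n, value[:4], buffer_idx, offset)


def read_view(view: bytes, buffers: list) -> bytes:
    """Decode a view back into its bytes, following the offset if it is not inlined."""
    (n,) = struct.unpack_from("<I", view, 0)
    if n <= INLINE_MAX:
        return view[4:4 + n]
    _prefix, buffer_idx, offset = struct.unpack_from("<4sII", view, 4)
    return bytes(buffers[buffer_idx][offset:offset + n])


buffer = bytearray()
values = [b"short", b"x" * 1000, b"tiny"]
views = [make_view(v, buffer) for v in values]

# A gather/filter only selects 16-byte views; the 1000-byte payload is not copied.
# Selecting index 1 twice also shows two views pointing at the same stored bytes.
taken = [views[i] for i in (1, 1, 2)]
result = [read_view(v, [buffer]) for v in taken]
```

In this sketch the gather touches 48 bytes of views regardless of how large the referenced strings are, which is the property the proposal is after.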

Relevant arrow discussion here: https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt

- Implement in polars-arrow: feat: implement BinaryView and Utf8View in polars-arrow #13243
- Use prefix in equality: feat: implement binview comparison kernels #13715 (prefix not used)
- Implement IPC: feat(rust): BinaryView/Utf8View IPC support #13464
- Implement Parquet: feat(rust): add BinaryView to parquet writer/reader. #13489
- Implement opt-in Polars flavor of IPC (until officially supported in Arrow and all issues are resolved).
  - Use the Polars IPC flavor in pickle: feat: add architecture for polars-flavored IPC #13734
  - Use the Polars IPC flavor in OOC: feat: add architecture for polars-flavored IPC #13734
  - ~~Elide utf8-validation in OOC~~ (we mmap)
- ~~Optional: change avro to new type (otherwise pay conversion cost until implemented)~~ (Pay cost)
- Migrate polars and all compute
- Check dataframe protocol correctness.
- Implement in polars-row: feat: implement binview for polars-row #13736
- Implement in polars-json: feat: implement binview for polars-json #13737


Steiniche · 2024-02-05T18:31:56Z

Hi @ritchie46 , I believe that this issue can be closed as the functionality is merged and released.

ritchie46 · 2024-02-05T19:59:12Z

Yeap, thanks

adriangb · 2024-08-08T13:57:54Z

Is this available in the Python API? I don't see any references to the type.

ritchie46 · 2024-08-08T15:56:05Z

It is available. Our string column type is backed by this. Polars keeps it simple: we have one string type, and you use it automatically.

adriangb · 2024-08-08T16:07:18Z

If I understand correctly this type can/should intern large strings, providing some level of "dictionary encoding".
On Polars 1.4.1 I tried comparing to a Categorical column and get a much smaller size for the categorical column:

import polars as pl

large_string = 'a' * 10_000
data = [large_string] * 100_000

df = pl.DataFrame({'x': data}, schema={'x': pl.String})
print(df.estimated_size())  # 1000000000

df = pl.DataFrame({'x': data}, schema={'x': pl.Categorical})
print(df.estimated_size())  # 410000

Is this expected? Am I misunderstanding the type?

ritchie46 · 2024-08-08T18:18:33Z

Yes, we can. There is a cost to checking for interning upon construction, though. Currently we only do that if you extend internally, for example via series.new_from_index, I believe.

Later we will also look into interning checking until a limited size is reached.
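As a rough illustration of that trade-off, the sketch below dedupes payloads with a hash lookup on every push, but stops tracking new values once a limit is reached. Everything here is hypothetical: `InterningBuilder`, `max_tracked`, and the `(offset, length)` pairs standing in for views are assumptions for the example, and this is only one possible reading of "until a limited size is reached", not how Polars does or will do it.

```python
class InterningBuilder:
    """Hypothetical builder that interns payloads for a bounded number of distinct values."""

    def __init__(self, max_tracked: int = 1024):
        self.buffer = bytearray()
        self.views = []  # (offset, length) pairs standing in for real views
        self.seen = {}   # payload -> offset of its first copy in the buffer
        self.max_tracked = max_tracked

    def push(self, value: bytes) -> None:
        # Hashing every incoming value is the construction cost mentioned above.
        offset = self.seen.get(value)
        if offset is None:
            offset = len(self.buffer)
            self.buffer.extend(value)
            # Only remember new values while the tracker is below its limit.
            if len(self.seen) < self.max_tracked:
                self.seen[value] = offset
        self.views.append((offset, len(value)))

    def get(self, i: int) -> bytes:
        offset, n = self.views[i]
        return bytes(self.buffer[offset:offset + n])


builder = InterningBuilder(max_tracked=1)
large = b"a" * 10_000
for _ in range(1000):
    builder.push(large)          # stored once, referenced 1000 times
size_after_large = len(builder.buffer)

other = b"b" * 20
builder.push(other)              # tracker is full, so this value is not remembered
builder.push(other)              # and is therefore stored a second time
size_after_other = len(builder.buffer)
```

The limit bounds the memory and bookkeeping spent on the lookup table, at the price of missing duplicates among values first seen after the limit was hit.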

adriangb · 2024-08-08T18:44:46Z

Indeed:

import polars as pl

large_string = 'a' * 10_000
data = [large_string] * 100_000

df = pl.DataFrame({'x': data}, schema={'x': pl.String})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 1000000000

df = pl.DataFrame({'x': data}, schema={'x': pl.Categorical})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 410000

s = pl.Series('x', [large_string], pl.String)
s = s.new_from_index(0, len(data))
df = pl.DataFrame({'x': s})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 10000

It's even better than a dictionary / categorical column!

Agreed it would be nice to heuristically do this when constructing, or at least to be able to force it on if I know my data has a lot of duplication of large strings.

ritchie46 added this to the 1.0.0 milestone Jan 5, 2024

ritchie46 mentioned this issue Jan 5, 2024

feat(rust): BinaryView/Utf8View IPC support #13464

Merged

stinodego added the accepted label Jan 5, 2024

ritchie46 added the python, rust, enhancement, and performance labels Jan 6, 2024

stinodego assigned ritchie46 Jan 8, 2024

ritchie46 mentioned this issue Jan 15, 2024

feat: new implementation for String/Binary type. #13748

Merged

ritchie46 closed this as completed Feb 5, 2024

benclmnt mentioned this issue Jul 17, 2024

German style strings benclmnt/til#32

Open
