When pyiceberg loads Iceberg table data into memory, dictionary encoding is not applied to columns
Apache Iceberg version
0.7.0
Please describe the bug 🐞
When I try to load a large dataset as below, string column data is stored in its original, uncompressed form, which increases memory usage and can lead to out-of-memory errors.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("default")
table = catalog.load_table("test.table1")

# Materialize the selected columns into a single PyArrow table.
tbl = table.scan(
    selected_fields=("id", "string_cols1", "string_cols2"),
).to_arrow()
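For reference, a quick way to confirm the symptom (assuming the scan result is bound to tbl as above): the string columns come back as plain string arrays rather than dictionary-encoded ones, and their per-column memory footprint can be inspected directly.

# Inspect how the string columns were materialized and how much memory they use.
print(tbl.schema)  # string_cols1 / string_cols2 appear as `string`, not `dictionary<...>`
for name in ("string_cols1", "string_cols2"):
    col = tbl.column(name)
    print(f"{name}: {col.type}, {col.nbytes / 1024**2:.1f} MiB")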
To avoid these issues, it's best to use dictionary encoding, especially for columns with low cardinality, where values are repeated frequently.
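As a possible workaround (a minimal sketch, assuming the scan result is bound to tbl as above), the string columns can be dictionary-encoded with PyArrow after loading. Note that this only shrinks the table once it has been materialized; it does not avoid the memory spike inside to_arrow() itself, so for data that never fits in memory a batch-based read would still be needed.

import pyarrow as pa
import pyarrow.compute as pc

# Dictionary-encode each string column so repeated values are stored once
# and referenced by small integer indices.
columns = []
for field in tbl.schema:
    col = tbl.column(field.name)
    if pa.types.is_string(field.type):
        col = pc.dictionary_encode(col)
    columns.append(col)

encoded = pa.table(dict(zip(tbl.schema.names, columns)))
print(encoded.schema)  # string columns now show dictionary<values=string, ...>
print(f"{encoded.nbytes / 1024**2:.1f} MiB after encoding")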