When pyiceberg loads Iceberg tables data into memory, dictionary encoding is not applied for columns #1205

Open
learningkeeda opened this issue Sep 25, 2024 · 0 comments
Apache Iceberg version

0.7.0

Please describe the bug 🐞

When I try to load a large dataset as below, string column data is stored in its original, uncompressed form, leading to increased memory usage and, ultimately, out-of-memory errors.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("default")
table = catalog.load_table("test.table1")
table.scan(
    selected_fields=("id", "string_cols1", "string_cols2"),
).to_arrow()
```

To avoid these issues, it would be best to apply dictionary encoding, especially for columns with low cardinality, where values repeat frequently.
