When pyiceberg loads Iceberg tables data into memory, dictionary encoding is not applied for columns #1205

Open
learningkeeda opened this issue Sep 25, 2024 · 0 comments
Apache Iceberg version

0.7.0

Please describe the bug 🐞

When I try to load a large dataset as below, string column data is stored in its original, uncompressed form, leading to increased memory usage and, ultimately, out-of-memory errors.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("default")
table = catalog.load_table("test.table1")
table.scan(
    selected_fields=("id", "string_cols1", "string_cols2"),
).to_arrow()
```

To avoid these issues, it would be best to apply dictionary encoding, especially for columns with low cardinality, where values repeat frequently.
