-
Notifications
You must be signed in to change notification settings - Fork 2.9k
ORC: Fix error when projecting nested indentity partition column #6327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
hi @shardulm94, it's seems less intrusive and better than #4599 |
@shardulm94 do we know why dropping those constant fields is required in the read path? I see that this was introduced by #1191 but I'm lacking some historical context and would appreciate your input here. |
Here we are construct the "readSchema" i.e. the schema that should be passed to the underlying file reader e.g. ORC when reading the data file. We already have the value of constant columns per file available via Iceberg, so reading them from the file is unnecessary. Doing so, we avoid the read & deserialization cost for the constant columns and also avoid the memory footprint to store the column vector in memory. |
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for fixing this, LGTM
|
@aokolnychyi Mind taking a look as well? This PR paves the way to fix #3192. In #3192 the |
|
Hey @rdblue, can you take a look at this? |
|
@shardulm94, yes. I'll put it in the queue. Thanks for pinging me. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Closes #4604
This is an alternate and arguably simpler implementation to #4599. The issue is that ORC read path did not support projecting nested structs which have just the partition columns selected. As part of the ORC read path, we drop constant fields from the projected schema before passing it to the ORC file reader. Example:
This step also results in dropping of structs which contain just the partition columns as they now become empty.
#4599 tries to fix this by not dropping nested struct containing partition columns, thus reading the partition values from the file. This PR instead takes a different approach by preserving empty struct when dropping constant fields. This allows the existing constant handling in the ORC read path to work as expected even for nested partition fields.