Skip to content

Conversation

@shardulm94
Copy link
Contributor

Closes #4604

This is an alternate and arguably simpler implementation to #4599. The issue is that ORC read path did not support projecting nested structs which have just the partition columns selected. As part of the ORC read path, we drop constant fields from the projected schema before passing it to the ORC file reader. Example:

Schema readSchemaWithoutConstantAndMetadataFields =
        TypeUtil.selectNot(
            readSchema, Sets.union(idToConstant.keySet(), MetadataColumns.metadataFieldIds()));

This step also results in dropping of structs which contain just the partition columns as they now become empty.

#4599 tries to fix this by not dropping nested struct containing partition columns, thus reading the partition values from the file. This PR instead takes a different approach by preserving empty struct when dropping constant fields. This allows the existing constant handling in the ORC read path to work as expected even for nested partition fields.

@shardulm94
Copy link
Contributor Author

@tprelle @kbendick Can you help take a look? This is an alternate implementation to #4599.

@shardulm94 shardulm94 requested a review from nastra December 1, 2022 07:33
@tprelle
Copy link
Contributor

tprelle commented Dec 5, 2022

hi @shardulm94, it's seems less intrusive and better than #4599

@nastra
Copy link
Contributor

nastra commented Dec 6, 2022

As part of the ORC read path, we drop constant fields from the projected schema before passing it to the ORC file reader.

@shardulm94 do we know why dropping those constant fields is required in the read path? I see that this was introduced by #1191 but I'm lacking some historical context and would appreciate your input here.

@shardulm94
Copy link
Contributor Author

@shardulm94 do we know why dropping those constant fields is required in the read path? I see that this was introduced by #1191 but I'm lacking some historical context and would appreciate your input here.

Here we are construct the "readSchema" i.e. the schema that should be passed to the underlying file reader e.g. ORC when reading the data file. We already have the value of constant columns per file available via Iceberg, so reading them from the file is unnecessary. Doing so, we avoid the read & deserialization cost for the constant columns and also avoid the memory footprint to store the column vector in memory.

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for fixing this, LGTM

@shardulm94
Copy link
Contributor Author

shardulm94 commented Dec 7, 2022

@aokolnychyi Mind taking a look as well? This PR paves the way to fix #3192. In #3192 the _partition struct gets dropped when performing the schema projection and hence the reader layer is not able to substitute it with a constant value. This fix will enable the substitution, but needs an additional change to fully fix #3192.

@shardulm94
Copy link
Contributor Author

Hey @rdblue, can you take a look at this?

@rdblue
Copy link
Contributor

rdblue commented Feb 9, 2023

@shardulm94, yes. I'll put it in the queue. Thanks for pinging me.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 24, 2024
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Orc : Bug when adding a inner struct field as partition field

5 participants