Use table schema from the table handle #14076
Conversation
Looks like there are some other places we do
I think only the first one really matters

@alexjo2144 this is actually a point that I wanted to bring up in this PR -

Yeah, I think we should prefer the schema in the Handle. The pattern I was looking for here was methods which call
let's make sure cleanups like this and the bug fix come in separate commits |
I created a separate PR to avoid cluttering the current changes |
Due to internal caching within the method `org.apache.iceberg.ManifestGroup.planFiles`, the returned file scan tasks may contain an invalid split schema string. Rely on the table schema from the table handle instead while reading from Avro data files.
When dealing with an Iceberg table whose structure evolves over time (columns are added/dropped), a snapshot/time-travel query should produce output whose schema matches the schema of the queried table snapshot.
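A hedged sketch of the fix described above, with simplified stand-in types rather than the actual Trino/Iceberg classes: previously the Avro reader was built from the (optionally present, possibly stale) schema JSON carried by the split; after the change it is always built from the schema JSON stored in the table handle.

```java
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: `schemaForRead`, its parameters, and the generic
// schema type S are assumptions, not the real Trino/Iceberg API.
final class AvroSchemaChoice
{
    static <S> S schemaForRead(
            Optional<String> splitSchemaJson,   // may be stale due to caching in planFiles
            String tableHandleSchemaJson,       // authoritative schema from the table handle
            Function<String, S> parse)          // e.g. SchemaParser::fromJson
    {
        // Before the fix, roughly:
        //   return splitSchemaJson.map(parse).orElseGet(() -> parse.apply(tableHandleSchemaJson));
        // After the fix, the handle schema is always used:
        return parse.apply(tableHandleSchemaJson);
    }
}
```

The point is that the split's schema string is ignored entirely, so a stale value produced by `planFiles` can no longer leak into the reader.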
Force-pushed from 5940942 to 009b735
      split.getPartitionDataJson(),
      split.getFileFormat(),
-     split.getSchemaAsJson().map(SchemaParser::fromJson),
+     SchemaParser.fromJson(table.getTableSchemaJson()),
Due to internal caching within the method `org.apache.iceberg.ManifestGroup.planFiles`, the returned file scan tasks may contain an invalid split schema string.
Is it testable?
Yes, it is testable through io.trino.plugin.iceberg.TestIcebergAvroConnectorTest.
I was reluctant to squash the two commits of this PR because they address different issues.
The test io.trino.plugin.iceberg.TestIcebergAvroConnectorTest covers both issues.
      ImmutableMap.Builder<String, ColumnHandle> columnHandles = ImmutableMap.builder();
-     for (IcebergColumnHandle columnHandle : getColumns(icebergTable.schema(), typeManager)) {
+     for (IcebergColumnHandle columnHandle : getColumns(SchemaParser.fromJson(table.getTableSchemaJson()), typeManager)) {
This is a good change.
However, it looks like we call SchemaParser.fromJson(tableHandle.getTableSchemaJson()) multiple times on one table handle. Am I right?
SchemaParser.fromJson does cache internally (on a static field).
This isn't ideal, and we could do better by caching within the table handle object. Not sure it matters, though -- it depends on how frequently this is called.
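The suggestion of caching within the table handle could look roughly like the following sketch. The class and method names (`MemoizingTableHandle`, `getSchema`) are illustrative assumptions, not the actual `IcebergTableHandle` API; the memoization pattern is the point.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Hypothetical sketch: memoize the parsed schema inside the handle so
// repeated getSchema() calls on the same handle parse the JSON at most
// once, instead of relying on SchemaParser's static cache.
final class MemoizingTableHandle<S>
{
    private final String tableSchemaJson;
    private final Function<String, S> parser; // e.g. SchemaParser::fromJson
    private final AtomicReference<S> cached = new AtomicReference<>();

    MemoizingTableHandle(String tableSchemaJson, Function<String, S> parser)
    {
        this.tableSchemaJson = tableSchemaJson;
        this.parser = parser;
    }

    S getSchema()
    {
        S schema = cached.get();
        if (schema == null) {
            // Benign race: at most one value wins; later calls reuse it.
            cached.compareAndSet(null, parser.apply(tableSchemaJson));
            schema = cached.get();
        }
        return schema;
    }
}
```

One wrinkle in the real code is that the handle is serialized between coordinator and workers, so such a cached field would need to be transient/ignored by JSON serialization.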
Should we switch to SchemaParser.fromJson(JsonNode), i.e.
fromJson(JsonUtil.mapper().readValue(jsonKey, JsonNode.class))?
Description
When dealing with an Iceberg table whose structure evolves over time (columns are added/dropped), a snapshot/time-travel query should produce output whose schema matches the schema of the queried table snapshot.
Fixes #14064
Relates to #12786
Non-technical explanation
In the context of time-travel queries, use the table schema corresponding to the queried snapshot of the table when retrieving the columns of the output.
Release notes
( ) This is not user-visible and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: