@HotSushi
Contributor

How does the SerDe work for Avro, ORC, and Parquet in Hive?

  • AvroSerDe and ParquetSerDe look at two properties that Hive sets, "columns" and "columns.types" (plus a few more when schema evolution is involved: "hive.exec.schema.evolution", "schema.evolution.columns", "schema.evolution.columns.types"). They construct object inspectors for just those columns, converting them to the file format's native schema first; for example, ORC does "strings -> TypeDescription -> oi" and Avro does "strings -> Schema -> oi". The returned oi contains only the schema that Hive expects.
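
To make the "strings -> schema" step concrete, here is a minimal sketch in plain Python (not Hive code). It assumes "columns" is a comma-separated list of names and "columns.types" is a colon- or comma-separated list of Hive type strings; `split_top_level` and `expected_schema` are hypothetical helper names, and the real SerDes go on to build a native schema and an object inspector from this result.

```python
def split_top_level(type_string):
    """Split a type list on ':' or ',' that sit outside <...> and (...)."""
    parts, depth, current = [], 0, []
    for ch in type_string:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        if ch in ":," and depth == 0:
            # A separator at depth 0 ends the current column type.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current).strip())
    return parts


def expected_schema(props):
    """Pair each expected column name with its type string."""
    names = [n.strip() for n in props.get("columns", "").split(",") if n.strip()]
    types = split_top_level(props.get("columns.types", ""))
    if len(names) != len(types):
        raise ValueError(
            "columns has %d entries but columns.types has %d" % (len(names), len(types))
        )
    return list(zip(names, types))


if __name__ == "__main__":
    props = {
        "columns": "id,name,address",
        "columns.types": "bigint:string:struct<city:string,zip:int>",
    }
    for name, type_string in expected_schema(props):
        print(name, type_string)
```

Only the columns listed in the properties come out of this step, which is why the resulting oi matches what Hive expects rather than everything in the file.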

Who sets "columns" and "columns.types", and how?

  • Hive sets them, deriving the values from the Hive table schema.

How does Iceberg work today?

  • It doesn't look at the properties set by Hive at all.
  • It doesn't look at the schema evolution properties.
  • It creates a raw object inspector out of whatever table schema is set.
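
The practical consequence can be sketched as follows, again in plain Python rather than Iceberg or Hive code. The function names are hypothetical, and the table schema is modeled as a simple list of (name, type) pairs purely for illustration. The sketch lists the columns an inspector built from the full table schema would expose even though Hive's "columns" property does not ask for them.

```python
def inspector_columns_from_table_schema(table_schema):
    """Columns exposed when the inspector is built from the full table schema."""
    return [name for name, _ in table_schema]


def columns_hive_does_not_expect(table_schema, props):
    """Columns in the table schema that are absent from Hive's "columns" property."""
    expected = {c.strip() for c in props.get("columns", "").split(",") if c.strip()}
    return [name for name, _ in table_schema if name not in expected]


if __name__ == "__main__":
    # Hypothetical table schema and Hive properties.
    table_schema = [("id", "long"), ("name", "string"), ("ts", "timestamp")]
    props = {"columns": "id,name"}
    print(inspector_columns_from_table_schema(table_schema))
    print(columns_hive_does_not_expect(table_schema, props))
```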

How do RecordReaders work for Avro, ORC, and Parquet in Hive?

  • In ORC and Avro (AvroContainerInputFormat), the record reader again looks at "columns", "columns.types", "hive.exec.schema.evolution", "schema.evolution.columns", and "schema.evolution.columns.types" to get the schema that Hive expects, and it reads the file using that schema as the projection.
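
A rough sketch of projection at read time, in plain Python with rows modeled as dictionaries. `expected_columns` and `read_projected` are hypothetical names, and filling a column that is missing from the file with None is an assumption made for the example, not a statement about how the ORC or Avro readers behave.

```python
def expected_columns(props):
    """Column names Hive expects, from the comma-separated "columns" property."""
    return [c.strip() for c in props.get("columns", "").split(",") if c.strip()]


def read_projected(file_rows, props):
    """Yield each row restricted to the expected columns, in Hive's order."""
    wanted = expected_columns(props)
    for row in file_rows:
        # Assumption: a column missing from the file is surfaced as None.
        yield {name: row.get(name) for name in wanted}


if __name__ == "__main__":
    # Hypothetical file contents; "extra" is not part of the expected schema.
    file_rows = [
        {"id": 1, "name": "a", "extra": "x"},
        {"id": 2},
    ]
    props = {"columns": "name,id"}
    for row in read_projected(file_rows, props):
        print(row)
```

Columns present in the file but absent from the properties never reach Hive, so the rows line up with the object inspector built from the same properties.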

@HotSushi
Contributor Author

This was just a prototype. Closing this, as a better solution is available here: #45

@HotSushi HotSushi closed this Oct 29, 2020
@HotSushi HotSushi deleted the hive-location-fix-with-new-serde-path branch November 20, 2020 00:47
