Add file metadata columns support for spark parquet #7880
gaoyangxiaozhu wants to merge 10 commits into facebookincubator:main from
Conversation
@majetideepak could you also help review? I saw that @aditi-pandit also submitted a similar PR #8800 addressing the Presto engine.
@gaoyangxiaozhu: I have a couple of high-level comments. There are several missing pieces: i) a TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in the HiveConnector TestBase classes. It might be simpler to change #8800 to use the metadata_columns parameter for HiveSplit instead. wdyt?
Thanks @aditi-pandit. For i): yes, a test is needed, and I saw your PR already has one. For iii): I didn't know what "synthesized" meant and couldn't find any code path that uses it, so I just added a kMetadata flag to easily mark a column as a metadata column. We could still use "synthesized", but I prefer the "metadata" naming since "synthesized" is unclear to me. Sure @aditi-pandit, it's fine if you update your PR to use the metadata_columns parameter and remove any hard-coded checks for whether a column is a metadata column. For the filter part, just reference the change.
…me' to be queried in SQL (#8800)
Summary: $file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has a number of such queryable synthesized columns (#7880). The columns are passed by the coordinator to the worker in the HiveSplit.
i) Velox HiveSplit needed to be enhanced to receive the file_size and file_modified_time metadata from Prestissimo in a generic (column name, value) map.
ii) These values should be populated by SplitReader into the TableScan operator's output buffers.
This also needs a Prestissimo change to populate the HiveSplit with this info sent in the fragment: prestodb/presto#21965
Fixes prestodb/presto#21867
gaoyangxiaozhu will have a follow-up PR for the Spark integration.
Pull Request resolved: #8800
Reviewed By: mbasmanova
Differential Revision: D54512245
Pulled By: Yuhta
fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e
Closed: @aditi-pandit's PR, which does the same thing, has been merged.


Spark supports querying file metadata such as file_size, file_name, file_path, file_modified_time, file_block_start, etc. for Hive tables as separate file metadata columns. See https://github.com/apache/spark/blob/081c7a7947a47bf0b2bfd478abdd4b78a1db3ddb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L183C2-L193C56 for all the file metadata columns Spark supports querying.

This PR extends HiveSplit with a new parameter, metadataColumns, that lets an upstream compute engine such as Spark pass the initialized constant file metadata columns (if any) to the Velox connector split at construction time. It fixes the issue of file metadata columns coming back as null when integrating Spark with Velox. See issue #8173 for detailed context.
This PR is also a dependency of the Gluten repository PR apache/gluten#3870.