Skip to content

Add file metadata columns support for spark parquet#7880

Closed
gaoyangxiaozhu wants to merge 10 commits intofacebookincubator:mainfrom
gayangya:gayangya/spark_metadatacolumns
Closed

Add file metadata columns support for spark parquet#7880
gaoyangxiaozhu wants to merge 10 commits intofacebookincubator:mainfrom
gayangya:gayangya/spark_metadatacolumns

Conversation

@gaoyangxiaozhu
Copy link
Copy Markdown
Contributor

@gaoyangxiaozhu gaoyangxiaozhu commented Dec 5, 2023

Spark support people query file metdata as file_size, file_name, file_path, file_modified_time, file_block_start etc for Hive tables as seperated file metadata column. Checking this for all file metadata supported query by spark https://github.com/apache/spark/blob/081c7a7947a47bf0b2bfd478abdd4b78a1db3ddb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L183C2-L193C56.

The PR extends the HiveSplit with a new parameter metadaColumns to let upsteram computing engine as spark to pass the initialized const file metadata columns (if have) to velox connector split when constructed, it use to fix file metadata columns null issue when intergrating with Velox using Spark.

  1. It let HiveSplit can get file metadata info from spark
  2. The values is populated by by SplitReader into TableScanOperator output.

Checking this issue #8173 for details context.

It is also a dependency of Gluten repository PR apache/gluten#3870

@netlify
Copy link
Copy Markdown

netlify bot commented Dec 5, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 6d8b7d8
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65be35fb7ccc460008a45ce9

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 5, 2023
@gaoyangxiaozhu gaoyangxiaozhu changed the title add metadatacolumns support for spark hive connector Add metadata columns support for spark parquet Dec 25, 2023
@gaoyangxiaozhu gaoyangxiaozhu changed the title Add metadata columns support for spark parquet Add file metadata columns support for spark parquet Dec 25, 2023
@yingsu00 yingsu00 requested review from Yuhta and yingsu00 January 18, 2024 15:27
@gaoyangxiaozhu gaoyangxiaozhu requested a review from Yuhta January 31, 2024 01:05
@gaoyangxiaozhu
Copy link
Copy Markdown
Contributor Author

gaoyangxiaozhu commented Feb 21, 2024

@majetideepak also help for review

and I saw looks @aditi-pandit also submit a similar PR #8800 for addressing presdo engine $file_path and $file_modify_time query issue. We can see if can make the PR cover both.

@aditi-pandit
Copy link
Copy Markdown
Collaborator

@gaoyangxiaozhu : Have couple of high-level comments. There are several missing pieces :

i) A TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in HiveConnector TestBase classes.
ii) The metadata columns can be used as filters in Presto queries as well. I would imagine Spark would allow this as well. The code should be enhanced for this.
iii) What is the difference between synthesized and metadata columns ? Its not clear to me.

It might be simpler to change #8800 to use metadata_columns parameters for HiveSplit instead. wdyt ?

@gaoyangxiaozhu
Copy link
Copy Markdown
Contributor Author

gaoyangxiaozhu commented Feb 22, 2024

@gaoyangxiaozhu : Have couple of high-level comments. There are several missing pieces :

i) A TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in HiveConnector TestBase classes. ii) The metadata columns can be used as filters in Presto queries as well. I would imagine Spark would allow this as well. The code should be enhanced for this. iii) What is the difference between synthesized and metadata columns ? Its not clear to me.

It might be simpler to change #8800 to use metadata_columns parameters for HiveSplit instead. wdyt ?

thanks @aditi-pandit

for i) Yes, a test is needed and I saw your PR already have it.
for ii) Actually I add filter part support and just remove it temporily with this commit df5e279, but we can added it back if it still need for making both presto and spark work.

image

for iii) I actually before don't know what's the synthesized means and also i don't found anywhere code path use it, so i just added a KMetadata to easily mark it is metadata column, but we can still use synthesized it should be ok, I'd like metadata naming due to I don't understand what synthesized means.

Sure @aditi-pandit , it's OK if you can update your PR to move using metadata_columns parameters and remove any hard code checking for know it is metadata or not. For filter part just reference change

image

if (auto name = subfield.toString();
        metadataColumns.count(name) == 0 && partitionKeys.count(name) == 0) {
      filterSubfields[getColumnName(subfield)].push_back(&subfield);
    }

facebook-github-bot pushed a commit that referenced this pull request Mar 5, 2024
…me' to be queried in SQL (#8800)

Summary:
$file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has bunch of such queryable synthesized columns (#7880).

The columns are passed by the co-ordinator to the worker in the HiveSplit.

i) Velox HiveSplit needed to be enhanced to get filesize and file_modified_time metadata in a generic map data-structure of (column name, value) from Prestissimo.
ii) These values should be populated by SplitReader into TableScanOperator output buffers.

This also needs a Prestissimo change to populate the HiveSplit with this info sent in the fragment prestodb/presto#21965

Fixes prestodb/presto#21867

gaoyangxiaozhu will have a follow up PR on the Spark integration.

Pull Request resolved: #8800

Reviewed By: mbasmanova

Differential Revision: D54512245

Pulled By: Yuhta

fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e
@gaoyangxiaozhu
Copy link
Copy Markdown
Contributor Author

closed with @aditi-pandit 's PR is merged do the same thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants