Allow HiveSplit info columns like '$file_size' and '$file_modified_time' to be queried in SQL#8800

Closed
aditi-pandit wants to merge 1 commit into main from hive_file_metadata

@aditi-pandit
Collaborator

@aditi-pandit aditi-pandit commented Feb 19, 2024

$file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has a number of such queryable synthesized columns (#7880).

The coordinator passes these columns to the worker in the HiveSplit.

i) The Velox HiveSplit needs to be extended to receive the $file_size and $file_modified_time metadata from Prestissimo in a generic map data structure of (column name, value).
ii) The SplitReader should populate these values into the TableScanOperator output buffers.
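
The two steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Velox implementation: `SplitSketch`, `lookupInfoColumn`, and `fillConstantColumn` are hypothetical names, and keeping the values as strings is an assumption made for the example.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for a Hive split (not the Velox class).
struct SplitSketch {
  std::string filePath;
  // Generic (column name, value) map for synthesized columns such as
  // "$file_size" and "$file_modified_time". Storing values as strings
  // is an assumption of this sketch.
  std::unordered_map<std::string, std::string> infoColumns;
};

// Returns the value of a synthesized column for this split, or
// std::nullopt if the split does not carry that column.
std::optional<std::string> lookupInfoColumn(
    const SplitSketch& split,
    const std::string& name) {
  auto it = split.infoColumns.find(name);
  if (it == split.infoColumns.end()) {
    return std::nullopt;
  }
  return it->second;
}

// Builds an output column holding numRows copies of value. Every row read
// from one split shares the same file-level metadata, so the column is
// constant within a split.
std::vector<std::string> fillConstantColumn(
    const std::string& value,
    std::size_t numRows) {
  return std::vector<std::string>(numRows, value);
}
```

A reader in this sketch would call `lookupInfoColumn(split, "$file_size")` once per split and then use `fillConstantColumn` for each output batch; a real engine would more likely use a constant-encoded vector than materialize the copies.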

This also requires a Prestissimo change that populates the HiveSplit with this info, which is sent in the fragment: prestodb/presto#21965

Fixes prestodb/presto#21867

@gaoyangxiaozhu will have a follow-up PR on the Spark integration.

@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 19, 2024
@netlify

netlify bot commented Feb 19, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 86ab66c
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65e63b3958c088000893f7ad

@gaoyangxiaozhu
Contributor

Hey @aditi-pandit, I also have a similar PR, #7880, that lets Velox query the file metadata columns the Spark engine supports for Hive tables (file_path, file_size, file_name, file_modify_time, file_block_start, file_block_end, etc.).

Maybe we can work together to see whether the change can support both engines, Presto and Spark?

@gaoyangxiaozhu
Contributor

Hey @aditi-pandit, maybe change the PR title to "Allow info columns for HiveSplits to be queried in SQL"?

@aditi-pandit aditi-pandit changed the title from "Allow '$file_size' and '$file_modified_time' for HiveSplits to be queried in SQL" to "Allow HiveSplit info columns like '$file_size' and '$file_modified_time' to be queried in SQL" Feb 27, 2024
@aditi-pandit
Collaborator Author

@Yuhta @majetideepak : PTAL.

@facebook-github-bot
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@aditi-pandit
Collaborator Author

@Yuhta: Do you need help with the linter error? Could you give me more info about it?

@facebook-github-bot
Contributor

@Yuhta merged this pull request in b9afa14.

@aditi-pandit aditi-pandit deleted the hive_file_metadata branch March 5, 2024 21:22
philo-he added a commit to philo-he/velox that referenced this pull request Mar 7, 2024
philo-he added a commit to philo-he/velox that referenced this pull request Mar 7, 2024
…file_modified_time' to be queried in SQL (facebookincubator#8800)""

This reverts commit d3dc172.
//
// Unfortunately, Presto happens to specify a filter for $path or
// $bucket column. This filter is redundant and needs to be removed.
// Unfortunately, Presto happens to specify a filter for $path, $file_size,
Contributor

Just wondering, is there an issue for this on the Presto side?

Labels

CLA Signed, Merged

Development

Successfully merging this pull request may close these issues.

[native] Hidden columns missing in Prestissimo Hive Connector

6 participants