Collect file stats #14618

Merged
highker merged 1 commit into prestodb:master from NikhilCollooru:metastoreStats on Jul 24, 2020

@NikhilCollooru NikhilCollooru commented Jun 5, 2020

Collect the partition file stats (file name, size) from the HiveWriter. They will later be stored in a location that is still TBD and used during scheduling to avoid the directory listing call.
In this PR we only collect the stats and track the blob size; the stats are then thrown away.
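
As a rough illustration of the "collect, measure, discard" flow described above, here is a minimal plain-Java sketch. It is not the code from this PR: `FileStat`, `PartitionFileStatsCollector`, and the size estimate (UTF-8 file name bytes plus 8 bytes for the size) are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value class: one entry per file a writer produced.
final class FileStat {
    final String fileName;
    final long sizeInBytes;

    FileStat(String fileName, long sizeInBytes) {
        this.fileName = fileName;
        this.sizeInBytes = sizeInBytes;
    }
}

// Hypothetical collector; the names and the size estimate are assumptions.
final class PartitionFileStatsCollector {
    // Partition name -> files written into that partition, in insertion order.
    private final Map<String, List<FileStat>> statsByPartition = new LinkedHashMap<>();
    private long estimatedBlobBytes;

    // Record one finished file and grow the running size estimate.
    void record(String partitionName, String fileName, long sizeInBytes) {
        statsByPartition
                .computeIfAbsent(partitionName, key -> new ArrayList<>())
                .add(new FileStat(fileName, sizeInBytes));
        // Assumed encoding: UTF-8 file name followed by an 8-byte size.
        estimatedBlobBytes += fileName.getBytes(StandardCharsets.UTF_8).length + Long.BYTES;
    }

    // Read-only view of the files recorded for one partition.
    List<FileStat> filesFor(String partitionName) {
        return Collections.unmodifiableList(
                statsByPartition.getOrDefault(partitionName, Collections.emptyList()));
    }

    long getEstimatedBlobBytes() {
        return estimatedBlobBytes;
    }

    // Report the tracked size, then throw the collected stats away.
    long discard() {
        long size = estimatedBlobBytes;
        statsByPartition.clear();
        estimatedBlobBytes = 0;
        return size;
    }
}
```

A caller would invoke `record` once per written file and `discard` at commit time, keeping only the returned size for monitoring.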

Depended on by https://github.com/facebookexternal/presto-facebook/pull/1084

@NikhilCollooru NikhilCollooru requested a review from highker June 5, 2020 23:09

@highker highker left a comment

comments on serde format

The return format should be a Page (or perhaps a list of Pages). Wrapping it is also fine, but the Page is the essential part, because it lets the coordinator write pages into manifests in ORC format.

Here is an example:

| file_prefix | file_name | file_suffix | rows | bytes | col1_min | col1_max | col1_non_null_rows | col2_min | col2_max | col2_non_null_rows |
|---|---|---|---|---|---|---|---|---|---|---|
| ws://ws.atn5/abc/ | 1 | .orc | 1000 | 100030 | ab | zzzd | 900 | 0 | 15 | 1000 |
| ws://ws.atn5/abc/ | 2 | .orc | 1200 | 110040 | ac | zzf | 1100 | 5 | 22 | 1200 |
| ws://ws.atn5/abc/ | 3 | .orc | 1100 | 120050 | bf | xwf | 1000 | 19 | 42 | 1100 |

The type of each column should be known beforehand. If not, types are serializable anyway, so they can be saved as List<TypeSignature>. For example, the above stats should come with column types: {varchar, varchar, varchar, bigint, bigint, varchar, varchar, bigint, integer, integer, bigint}.

Types and Blocks are the core data structures of Presto, so let's try to reuse them as much as possible.
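
To make the columnar layout suggested above concrete, here is a small sketch in plain Java. It deliberately does not use the real Presto `Page`, `Block`, or `TypeSignature` classes: `StatsPage` and `StatsPageExample` are hypothetical stand-ins that store one list per column next to a list of type names. The type list assumes that col2 is an integer column and that every row count is a bigint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a page: one value list ("block") per column.
final class StatsPage {
    private final List<String> columnNames;
    private final List<String> typeSignatures;
    private final List<List<Object>> blocks = new ArrayList<>();
    private int positionCount;

    StatsPage(List<String> columnNames, List<String> typeSignatures) {
        if (columnNames.size() != typeSignatures.size()) {
            throw new IllegalArgumentException("every column needs exactly one type");
        }
        this.columnNames = new ArrayList<>(columnNames);
        this.typeSignatures = new ArrayList<>(typeSignatures);
        for (int i = 0; i < columnNames.size(); i++) {
            blocks.add(new ArrayList<>());
        }
    }

    // Append one row, validating each value against its declared column type.
    void appendRow(Object... values) {
        if (values.length != columnNames.size()) {
            throw new IllegalArgumentException("expected " + columnNames.size() + " values");
        }
        for (int i = 0; i < values.length; i++) {
            checkType(typeSignatures.get(i), values[i], columnNames.get(i));
        }
        for (int i = 0; i < values.length; i++) {
            blocks.get(i).add(values[i]);
        }
        positionCount++;
    }

    // Assumed mapping: varchar -> String, bigint -> Long, integer -> Integer.
    private static void checkType(String type, Object value, String column) {
        boolean matches;
        switch (type) {
            case "varchar": matches = value instanceof String; break;
            case "bigint": matches = value instanceof Long; break;
            case "integer": matches = value instanceof Integer; break;
            default: matches = false;
        }
        if (!matches) {
            throw new IllegalArgumentException("bad value for " + column + " (" + type + ")");
        }
    }

    int getPositionCount() { return positionCount; }

    int getChannelCount() { return blocks.size(); }

    Object get(int channel, int position) { return blocks.get(channel).get(position); }

    List<String> getTypeSignatures() { return new ArrayList<>(typeSignatures); }
}

// Builds the three example rows from the comment above.
final class StatsPageExample {
    static StatsPage build() {
        StatsPage page = new StatsPage(
                Arrays.asList("file_prefix", "file_name", "file_suffix", "rows", "bytes",
                        "col1_min", "col1_max", "col1_non_null_rows",
                        "col2_min", "col2_max", "col2_non_null_rows"),
                Arrays.asList("varchar", "varchar", "varchar", "bigint", "bigint",
                        "varchar", "varchar", "bigint",
                        "integer", "integer", "bigint"));
        page.appendRow("ws://ws.atn5/abc/", "1", ".orc", 1000L, 100030L, "ab", "zzzd", 900L, 0, 15, 1000L);
        page.appendRow("ws://ws.atn5/abc/", "2", ".orc", 1200L, 110040L, "ac", "zzf", 1100L, 5, 22, 1200L);
        page.appendRow("ws://ws.atn5/abc/", "3", ".orc", 1100L, 120050L, "bf", "xwf", 1000L, 19, 42, 1100L);
        return page;
    }
}
```

Keeping the type names beside the column data is what would let a consumer decode the stats without knowing the table schema in advance. In Presto itself, the real Type and Block implementations would presumably take the place of these stand-ins.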

@NikhilCollooru NikhilCollooru requested a review from a team July 9, 2020 03:16
@NikhilCollooru NikhilCollooru force-pushed the metastoreStats branch 3 times, most recently from baa383b to 3a5a6f8 Compare July 16, 2020 04:49
@NikhilCollooru NikhilCollooru requested a review from highker July 17, 2020 17:06
@NikhilCollooru NikhilCollooru changed the title [WIP] Collect and persist file stats Collect and persist file stats Jul 17, 2020
@NikhilCollooru NikhilCollooru marked this pull request as ready for review July 17, 2020 22:07
@NikhilCollooru NikhilCollooru changed the title Collect and persist file stats Collect file stats Jul 17, 2020
@NikhilCollooru NikhilCollooru force-pushed the metastoreStats branch 2 times, most recently from 7807e50 to 69feafd Compare July 21, 2020 19:23
@NikhilCollooru NikhilCollooru requested review from a team and highker July 21, 2020 20:10

@highker highker left a comment

minor comments

@NikhilCollooru NikhilCollooru force-pushed the metastoreStats branch 2 times, most recently from 9622bb9 to 05df5ce Compare July 22, 2020 23:17
@NikhilCollooru NikhilCollooru requested a review from highker July 24, 2020 00:45

@highker highker left a comment

minor comments only

@highker highker self-assigned this Jul 24, 2020
@highker highker merged commit 8045dbe into prestodb:master Jul 24, 2020
@caithagoras caithagoras mentioned this pull request Jul 28, 2020
13 tasks
@NikhilCollooru NikhilCollooru deleted the metastoreStats branch July 30, 2020 00:17