Skip to content

Expose partition, file format and compression info of input tables as part of QueryCompletedEvent#12551

Merged
sopel39 merged 5 commits intotrinodb:masterfrom
gaurav8297:enhancements_for_telemetry
Jun 2, 2022
Merged

Expose partition, file format and compression info of input tables as part of QueryCompletedEvent#12551
sopel39 merged 5 commits intotrinodb:masterfrom
gaurav8297:enhancements_for_telemetry

Conversation

@gaurav8297
Copy link
Copy Markdown
Member

@gaurav8297 gaurav8297 commented May 25, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

How would you describe this change to a non-technical end user or system administrator?

Expose the following things as part of QueryCompletedEvent which can be later tracked through telemetry.

  • Whether the table partitioned?
  • What kind of compression input tables are using?
  • File format - Parquet/Orc

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label May 25, 2022
@gaurav8297 gaurav8297 changed the title Enhancements for telemetry Expose partition, file format and compression info of input tables as part of QueryCompletedEvent May 25, 2022
@gaurav8297 gaurav8297 requested a review from raunaqmorarka May 25, 2022 20:05
@gaurav8297 gaurav8297 changed the title Expose partition, file format and compression info of input tables as part of QueryCompletedEvent [WIP] Expose partition, file format and compression info of input tables as part of QueryCompletedEvent May 25, 2022
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some tests as well

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be optional, table will always have a file format

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm if the call to metastore.getTable is always satisfied by the transaction level metadata cache. Ideally, this shouldn't result in another metadata call to the underlying metastore.

@raunaqmorarka
Copy link
Copy Markdown
Member

raunaqmorarka commented May 26, 2022

@alexjo2144 for iceberg and delta changes

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partitioned table may have multiple file formats (each partition can be different)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, however I'm worried that that information may not be cheap to obtain. For our use case we're okay with just knowing about the current storage format configured for the table.
Would it be okay if we renamed fileFormat to tableFileFormat to be more explicit avoid the effort getting this info for all partitions ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be okay if we renamed fileFormat to tableFileFormat to be more explicit avoid the effort getting this info for all partitions ?

Seems like an OK workaround. Add code comment that's it's only as much useful for partitioned tables.

or call it tableDefaultFileFormat

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the format of the table and not necessarily the format encountered by the query, it’s an odd thing to include in the query event. You can just obtain that data by looking at table metadata with SHOW CREATE TABLE, etc.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martint we want to be able to do an offline analysis about the relative occurrence of each file format for reads. We're recording the default table format instead of extracting the storage format of every partition to avoid adding overheads for collecting this info and because knowing the table level format is good enough approximation for telemetry use cases.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have metics names, but we don't have metrics values.
What is the value supposed to mean?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the value to be the data length as per this comment: #12551 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Iceberg, can different files be of different format?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is possible that's why updated it to tableDefaultFileFormat.

@martint
Copy link
Copy Markdown
Member

martint commented May 27, 2022

Please, move the change to add connector metrics support to the query event to a separate commit.

Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka May 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're doing this only for testing with mock connector, can we keep the changes in MockConnectorPageSource ? You can merge the mock and delegate page source metrics in getMetrics

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sending dummy metrics via MockConnectorFactory, you can consider defining some actual metric like rowCount in MockConnectorPageSource if we get non-empty pages there during tests.

Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % minor comments

@gaurav8297 gaurav8297 changed the title [WIP] Expose partition, file format and compression info of input tables as part of QueryCompletedEvent Expose partition, file format and compression info of input tables as part of QueryCompletedEvent May 31, 2022
@gaurav8297 gaurav8297 marked this pull request as ready for review May 31, 2022 07:20
@gaurav8297 gaurav8297 requested a review from sopel39 May 31, 2022 14:31
@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Jun 1, 2022

approved, planning to land it tomorrow if there are not objections from @martint or @findepi

@sopel39 sopel39 merged commit 38ce359 into trinodb:master Jun 2, 2022
@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Jun 2, 2022

I don't think this needs release notes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants