@phd3 phd3 commented Mar 16, 2023

Description

Even when a scan accesses no columns, GenericHiveRecordCursor still pays the cost of deserializing each record, although the deserialized row is never used to read any values. This change skips that deserialization, which improves performance for count(const) or count(*) queries on formats that use GenericHiveRecordCursor.

In practice, a ~3x-4x improvement in CPU consumption was observed for some queries.

We still need to fetch the data as long as we depend on the RecordReader API, since it doesn't provide a way to get row counts without reading the records.
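To illustrate the idea, here is a minimal sketch. It is not the actual GenericHiveRecordCursor code. `SkippingCursor`, its fields, and `countRows` are hypothetical names, a plain `Iterator` stands in for the RecordReader, and a `Function` stands in for the Hive deserializer. It assumes only that the cursor knows how many columns are projected: every raw record is still fetched, but deserialization is skipped when nothing will read the row.

```java
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical stand-in for a record-at-a-time cursor; none of these names
// come from Trino or Hadoop.
final class SkippingCursor<R, V> {
    private final Iterator<R> reader;           // stands in for the RecordReader
    private final Function<R, V> deserializer;  // stands in for the row deserializer
    private final boolean hasProjectedColumns;
    private V currentRow;
    private long deserializedCount;

    SkippingCursor(Iterator<R> reader, Function<R, V> deserializer, int projectedColumnCount) {
        this.reader = reader;
        this.deserializer = deserializer;
        this.hasProjectedColumns = projectedColumnCount > 0;
    }

    boolean advanceNextPosition() {
        if (!reader.hasNext()) {
            currentRow = null;
            return false;
        }
        // The raw record must still be fetched: the reader cannot report
        // how many records remain without iterating over them.
        R raw = reader.next();
        if (hasProjectedColumns) {
            // Pay the deserialization cost only when some column will be read.
            currentRow = deserializer.apply(raw);
            deserializedCount++;
        }
        return true;
    }

    V currentRow() {
        return currentRow;
    }

    long deserializedCount() {
        return deserializedCount;
    }

    // Roughly what a count(*) scan does: advance through every position
    // without ever asking for a column value.
    static long countRows(SkippingCursor<?, ?> cursor) {
        long rows = 0;
        while (cursor.advanceNextPosition()) {
            rows++;
        }
        return rows;
    }
}
```

With zero projected columns, `countRows` still visits every record, but `deserializedCount()` stays at 0. With at least one projected column, the behavior is the same as before and every row is deserialized.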

Additional context and related issues

This is covered by the existing test BaseHiveConnectorTest#testReadNoColumns.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Hive
* Improve scan performance for count(*) queries on row-oriented formats. ({issue}`16595`)

This improves performance for count(const) or count(*) queries
on formats that use GenericHiveRecordCursor.
@cla-bot cla-bot bot added the cla-signed label Mar 16, 2023
@phd3 phd3 requested review from raunaqmorarka and sopel39 March 16, 2023 14:51
@github-actions github-actions bot added hive Hive connector tests:hive labels Mar 16, 2023
@electrum
Member

Note that as of now, all formats except Avro use the native Trino readers by default. @jklamer is working on Avro.

@phd3 phd3 merged commit b7258a6 into trinodb:master Mar 17, 2023
@github-actions github-actions bot added this to the 411 milestone Mar 17, 2023