Collect column statistics on write #10617
Changes from all commits
c1e3eb2
a8ea09a
ace432c
253a5a5
029873e
bef760c
b0e4405
8c85fd2
4c8d740
6147083
a7bc650
0651392
8914741
be35b5e
d27e219
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -111,9 +111,9 @@ security options in the Hive connector. | |
| Hive Configuration Properties | ||
| ----------------------------- | ||
|
|
||
| ================================================== ============================================================ ========== | ||
| ================================================== ============================================================ ============================= | ||
| Property Name Description Default | ||
| ================================================== ============================================================ ========== | ||
| ================================================== ============================================================ ============================= | ||
| ``hive.metastore.uri`` The URI(s) of the Hive metastore to connect to using the | ||
| Thrift protocol. If multiple URIs are provided, the first | ||
| URI is used by default and the rest of the URIs are | ||
|
|
@@ -175,7 +175,11 @@ Property Name Description | |
| ``hive.non-managed-table-writes-enabled`` Enable writes to non-managed (external) Hive tables. ``false`` | ||
|
|
||
| ``hive.non-managed-table-creates-enabled`` Enable creating non-managed (external) Hive tables. ``true`` | ||
| ================================================== ============================================================ ========== | ||
|
|
||
| ``hive.collect-column-statistics-on-write`` Enables automatic column level statistics collection ``ENABLED_FOR_MARKED_TABLES`` | ||
| on write. Possible values are ``ENABLED``, | ||
| ``ENABLED_FOR_MARKED_TABLES`` or ``DISABLED`` | ||
| ================================================== ============================================================ ============================= | ||
|
|
||
| Amazon S3 Configuration | ||
| ----------------------- | ||
|
|
@@ -334,6 +338,59 @@ the ``org.apache.hadoop.conf.Configurable`` interface from the Hadoop Java API, | |
| will be passed in after the object instance is created and before it is asked to provision or retrieve any | ||
| encryption keys. | ||
|
|
||
| Table Statistics | ||
| ---------------- | ||
|
|
||
| The Hive connector collects ``numRows``, ``rawDataSize``, ``totalSize``, ``numFiles`` statistics | ||
|
Contributor
How about:
The Hive connector automatically collects basic statistics
(``numFiles``, ``numRows``, ``rawDataSize``, ``totalSize``)
on ``INSERT`` and ``CREATE TABLE AS`` operations.
I shortened the names to match what we call them in the SQL Statement Syntax documentation.
||
| automatically on ``INSERT INTO`` and ``CREATE TABLE AS SELECT`` operations. | ||
|
|
||
| The Hive connector can also collect the column level statistics: | ||
|
Contributor
Remove "the"
||
|
|
||
| ============= ================================================================================================================ | ||
| Column Type Collectible Statistics | ||
| ============= ================================================================================================================ | ||
| ``TINYINT`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
|
Contributor
Let's remove all the blank lines between rows, since each row is only one line. That should make it a bit easier to read.
||
| ``SMALLINT`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``INTEGER`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``BIGINT`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``DOUBLE`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``REAL`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``BOOLEAN`` ``NUMBER_OF_NULLS``, ``NUMBER_OF_FALSE``, ``NUMBER_OF_TRUE`` | ||
|
|
||
| ``VARCHAR`` ``NUMBER_OF_NULLS``, ``NUMBER_OF_DISTINCT_VALUES``, ``MAX_VALUE_SIZE_IN_BYTES``, ``AVERAGE_VALUE_SIZE_IN_BYTES`` | ||
|
|
||
| ``CHAR`` ``NUMBER_OF_NULLS``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``VARBINARY`` ``NUMBER_OF_NULLS``, ``MAX_VALUE_SIZE_IN_BYTES``, ``AVERAGE_VALUE_SIZE_IN_BYTES`` | ||
|
|
||
| ``DATE`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``TIMESTAMP`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
|
|
||
| ``DECIMAL`` ``NUMBER_OF_NULLS``, ``MIN``, ``MAX``, ``NUMBER_OF_DISTINCT_VALUES`` | ||
| ============= ================================================================================================================ | ||
|
|
||
| Automatic column level statistics collection on write can be enabled by tuning the ``hive.collect-column-statistics-on-write`` | ||
|
||
| property: | ||
|
|
||
| * ``ENABLED``- Presto will collect the column level statistics for all the tables. | ||
|
Contributor
s/for all the tables/for all tables
Contributor
How about:
* ``ENABLED``: Always collect column level statistics when writing tables.
* ``ENABLED_FOR_MARKED_TABLES``: Collect column level statistics when writing
  to any table that was created with the ``collect_column_statistics_on_write_enabled``
  table property set to ``true``::

      CREATE TABLE automatically_collect_column_statistics (
          id bigint
      )
      WITH (collect_column_statistics_on_write_enabled = true)

* ``DISABLED``: Never collect column level statistics when writing tables.
Member
Author
Removed the entire section. The property now takes a boolean, which is pretty much self-explanatory.
||
| * ``ENABLED_FOR_MARKED_TABLES`` - Presto will collect the column level statistics for the tables | ||
| created with the ``collect_column_statistics_on_write_enabled`` set to ``true``: | ||
|
Contributor
s/with the/with
Member
Author
Removed the section
||
| :: | ||
|
|
||
| CREATE TABLE automatically_collect_column_statistics ( | ||
| a BIGINT | ||
| ) | ||
| WITH (collect_column_statistics_on_write_enabled = true) | ||
|
|
||
| * ``DISABLED`` - Presto will not collect the column level statistics for any table. | ||
|
|
||
| Schema Evolution | ||
| ---------------- | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| /* | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package com.facebook.presto.hive; | ||
|
|
||
| import com.facebook.presto.spi.statistics.ColumnStatisticType; | ||
| import com.facebook.presto.spi.type.Type; | ||
|
|
||
| import java.util.Set; | ||
|
|
||
| public interface CollectibleStatisticsProvider | ||
| { | ||
| Set<ColumnStatisticType> get(Type type); | ||
| } |
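To illustrate how a `CollectibleStatisticsProvider` might be satisfied, here is a minimal sketch that encodes the documentation table from this PR as a lookup. It is not the PR's `Statistics::getSupportedStatistics`. `StatisticKind` is a hypothetical stand-in for `ColumnStatisticType`, and the lookup is keyed by a type name string rather than a Presto `Type`, so that the example compiles on its own.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for com.facebook.presto.spi.statistics.ColumnStatisticType;
// the constants are the names listed in the documentation table of this PR.
enum StatisticKind
{
    NUMBER_OF_NULLS,
    MIN,
    MAX,
    NUMBER_OF_DISTINCT_VALUES,
    NUMBER_OF_TRUE,
    NUMBER_OF_FALSE,
    MAX_VALUE_SIZE_IN_BYTES,
    AVERAGE_VALUE_SIZE_IN_BYTES
}

final class CollectibleStatisticsSketch
{
    private CollectibleStatisticsSketch() {}

    // Maps a type name (an assumption; the real interface takes a Presto Type)
    // to the statistics the table lists for it. Unknown types get an empty set.
    static Set<StatisticKind> supportedStatistics(String typeName)
    {
        switch (typeName) {
            case "TINYINT":
            case "SMALLINT":
            case "INTEGER":
            case "BIGINT":
            case "DOUBLE":
            case "REAL":
            case "DATE":
            case "TIMESTAMP":
            case "DECIMAL":
                return EnumSet.of(
                        StatisticKind.NUMBER_OF_NULLS,
                        StatisticKind.MIN,
                        StatisticKind.MAX,
                        StatisticKind.NUMBER_OF_DISTINCT_VALUES);
            case "BOOLEAN":
                return EnumSet.of(
                        StatisticKind.NUMBER_OF_NULLS,
                        StatisticKind.NUMBER_OF_FALSE,
                        StatisticKind.NUMBER_OF_TRUE);
            case "VARCHAR":
                return EnumSet.of(
                        StatisticKind.NUMBER_OF_NULLS,
                        StatisticKind.NUMBER_OF_DISTINCT_VALUES,
                        StatisticKind.MAX_VALUE_SIZE_IN_BYTES,
                        StatisticKind.AVERAGE_VALUE_SIZE_IN_BYTES);
            case "CHAR":
                return EnumSet.of(
                        StatisticKind.NUMBER_OF_NULLS,
                        StatisticKind.NUMBER_OF_DISTINCT_VALUES);
            case "VARBINARY":
                return EnumSet.of(
                        StatisticKind.NUMBER_OF_NULLS,
                        StatisticKind.MAX_VALUE_SIZE_IN_BYTES,
                        StatisticKind.AVERAGE_VALUE_SIZE_IN_BYTES);
            default:
                return Collections.emptySet();
        }
    }
}
```

A caller would ask for one type at a time, for example `CollectibleStatisticsSketch.supportedStatistics("VARCHAR")`. Returning an empty set for unlisted types is a design choice of this sketch, not something the PR states.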
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,6 +37,7 @@ | |
| import java.util.TimeZone; | ||
| import java.util.concurrent.TimeUnit; | ||
|
|
||
| import static com.facebook.presto.hive.HiveClientConfig.CollectColumnStatisticsOnWriteOption.ENABLED_FOR_MARKED_TABLES; | ||
| import static io.airlift.units.DataSize.Unit.MEGABYTE; | ||
|
|
||
| @DefunctConfig({ | ||
|
|
@@ -134,6 +135,8 @@ public class HiveClientConfig | |
|
|
||
| private boolean tableStatisticsEnabled = true; | ||
|
|
||
| private CollectColumnStatisticsOnWriteOption collectColumnStatisticsOnWrite = ENABLED_FOR_MARKED_TABLES; | ||
|
Contributor
The enable-only-for-marked-tables feature seems like something transitional that we would later change to be always enabled. Does it make sense to split this into two booleans, so we can remove the marked table support later? Or do you expect we'll want that forever?
Member
Author
Actually, I think the plan has changed in the meantime, and we no longer need this. I will replace the enum with a simple boolean.
||
|
|
||
| public int getMaxInitialSplits() | ||
| { | ||
| return maxInitialSplits; | ||
|
|
@@ -1044,4 +1047,25 @@ public boolean isTableStatisticsEnabled() | |
| { | ||
| return tableStatisticsEnabled; | ||
| } | ||
|
|
||
| @NotNull | ||
| public CollectColumnStatisticsOnWriteOption getCollectColumnStatisticsOnWrite() | ||
| { | ||
| return collectColumnStatisticsOnWrite; | ||
| } | ||
|
|
||
| @Config("hive.collect-column-statistics-on-write") | ||
| @ConfigDescription("Enables automatic column level statistics collection on write") | ||
| public HiveClientConfig setCollectColumnStatisticsOnWrite(CollectColumnStatisticsOnWriteOption collectColumnStatisticsOnWrite) | ||
| { | ||
| this.collectColumnStatisticsOnWrite = collectColumnStatisticsOnWrite; | ||
| return this; | ||
| } | ||
|
|
||
| public enum CollectColumnStatisticsOnWriteOption | ||
| { | ||
| ENABLED, | ||
| ENABLED_FOR_MARKED_TABLES, | ||
|
Contributor
add a javadoc linking to
Member
Author
I removed the enum
||
| DISABLED | ||
|
Contributor
put DISABLED first, end all options with
Member
Author
I removed the enum
||
| } | ||
| } | ||
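The author says in the review that the enum will be replaced with a simple boolean. A rough sketch of what that setting could look like follows, written as plain Java so it runs without the airlift configuration library. The class name, the `fromProperties` helper, and the `false` default are assumptions made for illustration; only the property name `hive.collect-column-statistics-on-write` comes from the diff.

```java
import java.util.Properties;

// Plain-Java sketch. The real HiveClientConfig would keep the @Config and
// @ConfigDescription annotations shown in the diff instead of fromProperties().
final class ColumnStatisticsConfigSketch
{
    // Assumption: collection is off unless explicitly enabled.
    private boolean collectColumnStatisticsOnWrite;

    boolean isCollectColumnStatisticsOnWrite()
    {
        return collectColumnStatisticsOnWrite;
    }

    ColumnStatisticsConfigSketch setCollectColumnStatisticsOnWrite(boolean collectColumnStatisticsOnWrite)
    {
        this.collectColumnStatisticsOnWrite = collectColumnStatisticsOnWrite;
        return this;
    }

    // Hypothetical helper that reads the property the way a catalog file might supply it.
    static ColumnStatisticsConfigSketch fromProperties(Properties properties)
    {
        String value = properties.getProperty("hive.collect-column-statistics-on-write", "false");
        return new ColumnStatisticsConfigSketch()
                .setCollectColumnStatisticsOnWrite(Boolean.parseBoolean(value));
    }
}
```

With a boolean there is no `ENABLED_FOR_MARKED_TABLES` state to document, which matches the author's remark that the explanatory section was removed.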
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,6 +19,7 @@ | |
| import com.facebook.presto.hive.parquet.ParquetPageSourceFactory; | ||
| import com.facebook.presto.hive.parquet.ParquetRecordCursorProvider; | ||
| import com.facebook.presto.hive.rcfile.RcFilePageSourceFactory; | ||
| import com.facebook.presto.hive.util.Statistics; | ||
| import com.facebook.presto.spi.connector.ConnectorNodePartitioningProvider; | ||
| import com.facebook.presto.spi.connector.ConnectorPageSinkProvider; | ||
| import com.facebook.presto.spi.connector.ConnectorPageSourceProvider; | ||
|
|
@@ -60,6 +61,7 @@ public void configure(Binder binder) | |
| binder.bind(HiveConnectorId.class).toInstance(new HiveConnectorId(connectorId)); | ||
| binder.bind(TypeTranslator.class).toInstance(new HiveTypeTranslator()); | ||
| binder.bind(CoercionPolicy.class).to(HiveCoercionPolicy.class).in(Scopes.SINGLETON); | ||
| binder.bind(CollectibleStatisticsProvider.class).toInstance(Statistics::getSupportedStatistics); | ||
|
Contributor
We should return nothing when the Glue metastore is used.
Member
Author
I'm going to replace
||
|
|
||
| binder.bind(HdfsConfigurationUpdater.class).in(Scopes.SINGLETON); | ||
| binder.bind(HdfsConfiguration.class).to(HiveHdfsConfiguration.class).in(Scopes.SINGLETON); | ||
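One way to meet the reviewer's request that nothing be returned for the Glue metastore is to swap in a provider that always yields an empty set. The sketch below shows the idea in plain Java, without Guice. `ProviderSelectionSketch`, `forMetastore`, and the `"glue"` string are hypothetical names, and the author's actual fix is not visible in this excerpt.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.Function;

final class ProviderSelectionSketch
{
    private ProviderSelectionSketch() {}

    // Returns the full provider unless the metastore type is "glue" (an assumed
    // identifier), in which case every column type maps to no statistics.
    static Function<String, Set<String>> forMetastore(
            String metastoreType,
            Function<String, Set<String>> fullProvider)
    {
        if ("glue".equals(metastoreType)) {
            return typeName -> Collections.emptySet();
        }
        return fullProvider;
    }
}
```

In the real module the same choice would presumably be expressed as a different binding for `CollectibleStatisticsProvider`, so the write path itself needs no metastore-specific branches.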
|
|
||
Let's add another sentence:
(This syntax should work. I copied it from accumulo.rst)