[HUDI-4250][HUDI-4202] Optimizing performance of Column Stats Index reading in Data Skipping #5746
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -187,6 +187,26 @@ public final class HoodieMetadataConfig extends HoodieConfig { | |
| .sinceVersion("0.11.0") | ||
| .withDocumentation("Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed"); | ||
|
|
||
| public static final String COLUMN_STATS_INDEX_PROCESSING_MODE_IN_MEMORY = "in-memory"; | ||
| public static final String COLUMN_STATS_INDEX_PROCESSING_MODE_ENGINE = "engine"; | ||
|
|
||
| public static final ConfigProperty<String> COLUMN_STATS_INDEX_PROCESSING_MODE_OVERRIDE = ConfigProperty | ||
| .key(METADATA_PREFIX + ".index.column.stats.processing.mode.override") | ||
| .noDefaultValue() | ||
| .withValidValues(COLUMN_STATS_INDEX_PROCESSING_MODE_IN_MEMORY, COLUMN_STATS_INDEX_PROCESSING_MODE_ENGINE) | ||
| .sinceVersion("0.12.0") | ||
| .withDocumentation("By default Column Stats Index is automatically determining whether it should be read and processed either " | ||
| + "'in-memory' (w/in executing process) or using Spark (on a cluster), based on some factors like the size of the Index " | ||
|
Contributor
Don't forget to update the docs as well. |
||
| + "and how many columns are read. This config allows to override this behavior."); | ||
|
|
||
| public static final ConfigProperty<Integer> COLUMN_STATS_INDEX_IN_MEMORY_PROJECTION_THRESHOLD = ConfigProperty | ||
| .key(METADATA_PREFIX + ".index.column.stats.inMemory.projection.threshold") | ||
| .defaultValue(100000) | ||
|
Contributor
Any perf numbers to support this default threshold?
Contributor
Author
Yes, reading Column Stats with fewer than 100k rows is faster in-memory than using Spark (AFAIR it was roughly 1s vs 2s).
Contributor
How fast is it for in-memory? |
||
| .sinceVersion("0.12.0") | ||
| .withDocumentation("When reading Column Stats Index, if the size of the expected resulting projection is below the in-memory" | ||
| + " threshold (counted by the # of rows), it will be attempted to be loaded \"in-memory\" (ie not using the execution engine" | ||
| + " like Spark, Flink, etc). If the value is above the threshold execution engine will be used to compose the projection."); | ||
|
|
||
| public static final ConfigProperty<String> BLOOM_FILTER_INDEX_FOR_COLUMNS = ConfigProperty | ||
| .key(METADATA_PREFIX + ".index.bloom.filter.column.list") | ||
| .noDefaultValue() | ||
|
|
@@ -246,6 +266,14 @@ public List<String> getColumnsEnabledForColumnStatsIndex() { | |
| return StringUtils.split(getString(COLUMN_STATS_INDEX_FOR_COLUMNS), CONFIG_VALUES_DELIMITER); | ||
| } | ||
|
|
||
| public String getColumnStatsIndexProcessingModeOverride() { | ||
| return getString(COLUMN_STATS_INDEX_PROCESSING_MODE_OVERRIDE); | ||
| } | ||
|
|
||
| public Integer getColumnStatsIndexInMemoryProjectionThreshold() { | ||
| return getIntOrDefault(COLUMN_STATS_INDEX_IN_MEMORY_PROJECTION_THRESHOLD); | ||
| } | ||
|
|
||
| public List<String> getColumnsEnabledForBloomFilterIndex() { | ||
| return StringUtils.split(getString(BLOOM_FILTER_INDEX_FOR_COLUMNS), CONFIG_VALUES_DELIMITER); | ||
| } | ||
|
|
||
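The two getters added above treat an unset key differently: the processing-mode override has no default, so it stays absent unless a user sets it, while the projection threshold falls back to 100000. A minimal standalone sketch of that lookup behavior, using `java.util.Properties` rather than Hudi's `HoodieConfig`, might look like the following. The class name, the helper methods, and the value `"hoodie.metadata"` for `METADATA_PREFIX` are assumptions made for illustration (the diff shows only the key suffixes), and the validation at read time is illustrative rather than a description of when Hudi checks valid values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch; not part of Hudi. It mimics "no default" vs "default value" lookups.
final class ColumnStatsConfigSketch {
    // Assumption: the real METADATA_PREFIX constant resolves to "hoodie.metadata".
    static final String METADATA_PREFIX = "hoodie.metadata";
    static final String MODE_OVERRIDE_KEY =
        METADATA_PREFIX + ".index.column.stats.processing.mode.override";
    static final String THRESHOLD_KEY =
        METADATA_PREFIX + ".index.column.stats.inMemory.projection.threshold";

    // Values taken from the diff above.
    static final int DEFAULT_THRESHOLD = 100000;
    static final List<String> VALID_MODES = Arrays.asList("in-memory", "engine");

    // Returns null when the override is not set, mirroring noDefaultValue().
    static String getProcessingModeOverride(Properties props) {
        String value = props.getProperty(MODE_OVERRIDE_KEY);
        if (value != null && !VALID_MODES.contains(value)) {
            throw new IllegalArgumentException("Unsupported processing mode: " + value);
        }
        return value;
    }

    // Falls back to the default when the key is not set, mirroring defaultValue(100000).
    static int getInMemoryProjectionThreshold(Properties props) {
        String value = props.getProperty(THRESHOLD_KEY);
        return value == null ? DEFAULT_THRESHOLD : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("override (unset): " + getProcessingModeOverride(props));
        System.out.println("threshold (unset): " + getInMemoryProjectionThreshold(props));

        props.setProperty(MODE_OVERRIDE_KEY, "engine");
        props.setProperty(THRESHOLD_KEY, "50000");
        System.out.println("override (set): " + getProcessingModeOverride(props));
        System.out.println("threshold (set): " + getInMemoryProjectionThreshold(props));
    }
}
```

Returning `null` for the unset override keeps "no user preference" distinguishable from either explicit mode, which is what lets the automatic choice remain the default.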
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -170,7 +170,8 @@ Map<Pair<String, String>, HoodieMetadataColumnStats> getColumnStats(final List<P | |
| * @return {@link HoodieData} of {@link HoodieRecord}s with records matching the passed in key prefixes. | ||
| */ | ||
| HoodieData<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(List<String> keyPrefixes, | ||
| String partitionName); | ||
| String partitionName, | ||
| boolean shouldLoadInMemory); | ||
|
Contributor
Whether or not column statistics should be loaded in memory is the behavior of a metadata table.
Contributor
Author
Good call. I think that would definitely make sense if we consider, in the future, making all methods configurable to load either in-memory or using Spark. Right now only this method is configurable, and it would be very misleading if we elevated it to an instance-level config. |
||
|
|
||
| /** | ||
| * Get the instant time to which the metadata is synced w.r.t data timeline. | ||
|
|
||
Maybe .index.column.stats.processing.mode is ok.
Processing mode is not meant to be configured by default: this config is specifically to override this behavior.
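To make the "override" semantics in this thread concrete, here is a small standalone sketch of how an explicit override could take precedence over the automatic, threshold-based choice described in the config documentation. The class and method names are hypothetical and are not taken from the PR. The documentation only says "below" and "above" the threshold, so sending a projection of exactly the threshold size to the engine is an assumption of this sketch.

```java
// Hypothetical sketch; not the PR's implementation.
final class ColumnStatsProcessingModeSketch {
    // Mode names as defined in the diff.
    static final String MODE_IN_MEMORY = "in-memory";
    static final String MODE_ENGINE = "engine";

    /**
     * Decides whether the Column Stats Index projection should be read in-memory.
     *
     * @param modeOverride           explicit override, or null when the user did not set one
     * @param expectedProjectionRows expected size of the resulting projection, in rows
     * @param inMemoryThreshold      row-count threshold (100000 by default per the diff)
     */
    static boolean shouldReadInMemory(String modeOverride, long expectedProjectionRows, int inMemoryThreshold) {
        if (modeOverride != null) {
            // An explicit override always wins over the automatic choice.
            if (MODE_IN_MEMORY.equals(modeOverride)) {
                return true;
            }
            if (MODE_ENGINE.equals(modeOverride)) {
                return false;
            }
            throw new IllegalArgumentException("Unsupported processing mode: " + modeOverride);
        }
        // Automatic choice: small projections are read in-memory.
        // Assumption: a projection exactly at the threshold goes to the engine.
        return expectedProjectionRows < inMemoryThreshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldReadInMemory(null, 50_000L, 100_000));
        System.out.println(shouldReadInMemory(null, 500_000L, 100_000));
        System.out.println(shouldReadInMemory(MODE_ENGINE, 10L, 100_000));
        System.out.println(shouldReadInMemory(MODE_IN_MEMORY, 500_000L, 100_000));
    }
}
```

With no override set, the row count alone drives the decision, which matches the author's point that the key exists only to bypass the default behavior.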