Record source information of HBO stats #22234
7f1e575 to b576187
Does it make sense to encode the environment and version? The environment could identify the deployment that wrote the stats (including the eval engine used), and the version could be used to identify anomalies in the metrics that vary between versions. This is similar to how we embed the version in things like Parquet file metadata.
3c1aa54 to 6de650a
Currently I plan to include the worker type and the query ID of the queries that produce the historical statistics. I do not see an immediate need to add the environment and version, since they can be inferred from the query ID with proper logging. We can add them later if needed.
82abb6d to 01f9257
ObjectMapper newObjectMapper = objectMapper.copy().configure(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS, true);
this.planCanonicalInfoProvider = new CachingPlanCanonicalInfoProvider(historyBasedStatisticsCacheManager, newObjectMapper, metadata);
this.config = requireNonNull(config, "config is null");
this.isNativeExecution = featuresConfig.isNativeExecutionEnabled();
Check whether native execution is enabled, and record that information when writing stats to HBO.
if (predictedPlanStatistics.getConfidence() > 0) {
    return delegateStats.combineStats(
            predictedPlanStatistics,
            new HistoryBasedSourceInfo(entry.getKey().getHash(), inputTableStatistics, Optional.of(historicalPlanStatisticsEntry.get().getHistoricalPlanStatisticsEntryInfo())));
Add the source information to plan statistics returned from HBO.
HistoricalPlanStatisticsEntryInfo historicalPlanStatisticsEntryInfo = new HistoricalPlanStatisticsEntryInfo(
        isNativeExecution ? HistoricalPlanStatisticsEntryInfo.WorkerType.CPP : HistoricalPlanStatisticsEntryInfo.WorkerType.JAVA, queryInfo.getQueryId());
Record the worker type and query ID when recording the HBO stats.
Can we get this isNativeExecution property via the session directly, e.g. queryInfo.getSession().toSession(sessionPropertyManager)? That way you would not need to inject featuresConfig.
@jaystarshot Jay, we explicitly removed this property from the sessions because it doesn't make sense to allow this to be modified for individual queries. This is a cluster-wide property. I have a PR to actually delete it: #22183
CC: @tdcmeehan
I like the idea of recording the version. That way, if there is a problem with stats from some release, or a big change in an operator makes old stats irrelevant, we could programmatically exclude those stats from being used.
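To illustrate the exclusion idea, here is a minimal sketch of filtering historical entries by the version that wrote them. Everything in it is hypothetical: `VersionFilterSketch`, `Entry`, `excludeVersions`, and the version strings are made up for illustration and are not part of this PR or of Presto.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch, not Presto code: drops history entries written by excluded versions.
final class VersionFilterSketch
{
    // Assumption: a simplified stand-in for a stored HBO entry plus its source info.
    static final class Entry
    {
        final String serverVersion;
        final double outputRowCount;

        Entry(String serverVersion, double outputRowCount)
        {
            this.serverVersion = serverVersion;
            this.outputRowCount = outputRowCount;
        }
    }

    // Keeps only the entries whose recorded version is not in the excluded set.
    static List<Entry> excludeVersions(List<Entry> entries, Set<String> excludedVersions)
    {
        return entries.stream()
                .filter(entry -> !excludedVersions.contains(entry.serverVersion))
                .collect(Collectors.toList());
    }
}
```

A caller would pass the set of releases known to have produced unreliable stats and use only the returned entries for prediction.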
01f9257 to 8df83cc
this.planCanonicalInfoProvider = new CachingPlanCanonicalInfoProvider(historyBasedStatisticsCacheManager, newObjectMapper, metadata);
this.config = requireNonNull(config, "config is null");
this.isNativeExecution = featuresConfig.isNativeExecutionEnabled();
this.serverVersion = requireNonNull(nodeVersion, "nodeVersion is null").toString();
Add information for server version
Add server version per suggestion.
8df83cc to 16b95c8
...-spi/src/main/java/com/facebook/presto/spi/statistics/HistoricalPlanStatisticsEntryInfo.java
16b95c8 to 43593e1
Record the type of workers (CPP, JAVA) and query ID of the queries which produce these stats.
43593e1 to 3aa03b5
Description
This PR records more information about HBO stats, including which type of worker (currently C++ or Java) produced the stats and the ID of the query that generated them.
Motivation and Context
The worker type is recorded because HBO tracks the size of operator output, and that size can depend on the data structures and compaction algorithms used. Presto Java and Presto C++ are therefore expected to report different sizes, so this information needs to be logged in the HBO stats.
The query ID is recorded for debugging purposes: it helps quickly identify the query that populated the stats.
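As a rough illustration of the source information described above, the following sketch shows one way such a record could be shaped. It is not the actual `HistoricalPlanStatisticsEntryInfo` class from presto-spi: `EntryInfoSketch` and `forQuery` are hypothetical names, and a plain `String` stands in for the query ID type as an assumption.

```java
import java.util.Objects;

// Hypothetical sketch: the real HistoricalPlanStatisticsEntryInfo in presto-spi may differ.
final class EntryInfoSketch
{
    // Mirrors the two worker types named in the PR.
    enum WorkerType { JAVA, CPP }

    private final WorkerType workerType;
    private final String queryId;       // assumption: String stands in for the query ID type
    private final String serverVersion; // added later in the PR at a reviewer's suggestion

    EntryInfoSketch(WorkerType workerType, String queryId, String serverVersion)
    {
        this.workerType = Objects.requireNonNull(workerType, "workerType is null");
        this.queryId = Objects.requireNonNull(queryId, "queryId is null");
        this.serverVersion = Objects.requireNonNull(serverVersion, "serverVersion is null");
    }

    // Derives the worker type from the cluster-wide native-execution flag, as the diff does.
    static EntryInfoSketch forQuery(boolean isNativeExecution, String queryId, String serverVersion)
    {
        WorkerType type = isNativeExecution ? WorkerType.CPP : WorkerType.JAVA;
        return new EntryInfoSketch(type, queryId, serverVersion);
    }

    WorkerType getWorkerType()
    {
        return workerType;
    }

    String getQueryId()
    {
        return queryId;
    }

    String getServerVersion()
    {
        return serverVersion;
    }
}
```

Keeping the record immutable means it can be stored alongside each historical entry and returned with predicted stats without defensive copying.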
Impact
Improves HBO stats, making them more precise and easier to debug.
Test Plan
Ran an end-to-end test locally to confirm that these stats are available in logging.