Remove minimum_assigned_split_weight Hive and Iceberg session config#12656
Conversation
pettyjamesm
left a comment
There was a problem hiding this comment.
I mentioned this in an offline conversation with @findepi, but having this session property is extremely convenient for exploratory analysis to identify better weight minimums in any given environment and workload. There is also an additional guard-rail in place to reduce the potential for huge task update requests: node-scheduler.max-unacknowledged-splits-per-task (default: 500) sets an absolute maximum on the number of splits, regardless of their weights, that can be "pending and not delivered" to tasks that the scheduler respects.
I propose that instead of removing the session property, we could consider either or both of the following alternatives:
- Marking the session property as hidden to discourage users from tweaking the value, but leaving it there as a convenience for development and power user's convenience.
- Adding a validation on the session property to not allow it to bet set to a value lower than the cluster configured value
Ultimately, if you think the risk is still there and insist on removing it then I'm ok with that choice- just pointing out the alternative options that preserve the convenience it provides.
There was a problem hiding this comment.
If we decide to remove the session property, you could just pass a HiveSplitWeightProvider directly instead of the double value here instead.
split assignment tuning parameters should be configurable on the session level, should they? also, per my understanding, the "min weight" is a safety measure so shouldn't be configurable by the end user, should it?
that would address the safety issue, but then again, why would we want to keep it? |
I don't see why they shouldn't be, at least to a trusted power user. The parameter controls behavior that attempts to be adaptive to characteristics in the workload, so if someone knows something specific to their workload or wants to experiment with the setting in the context of that specific workload I think the ability to control that without restarting the cluster is valid. It's not ideal that someone could use it to impact the cluster health, but I would bet that malicious users with sufficient privileges to submit queries with or access to other session property overrides already have more than a few different ways to DoS a cluster. For instance, you could create similar payload bombs by creating a table with lots of columns and very long comments on each, and that's ignoring all the system session properties that can specify overrides on the per query stage limit, etc. |
The idea of having some minimum split weight is to prevent very tiny splits from overloading the cluster, or pumping up task scheduling request sizes. A safety toggle like this should not be configurable at session level. This remains configurable at the catalog property level for both Hive and Iceberg.
982f359 to
9c96f92
Compare
|
I agree that it is useful to have session properties to configure behaviors that are not necessarily commonly modified by non-power users. Actually we have plenty of those related to Tardigrade project (fault-tolerant execution). They often come helpful in prototyping, or when diagnosing performance issues with some specific query (it is often trial and error changing of configuration via session properties; doing that via config is PITA (and also not possible in some environments). Maybe instead of dropping the session property, we should provide some mechanism that would allow hiding specific properties from end-users? |
The idea of having some minimum split weight is to prevent very tiny
splits from overloading the cluster, or pumping up task scheduling
request sizes. A safety toggle like this should not be configurable at
session level.
This remains configurable at the catalog property level.