Implement benchmark scenario WeightedWorkloadOnTreeDataset #21
Conversation
pingtimeout left a comment:
Sorry for the delay! I completely missed this PR...
The new workload is pretty cool, and I especially like the visualization part. I made a few comments to improve the code, and to simplify it in some cases.
benchmarks/src/gatling/scala/org/apache/polaris/benchmarks/actions/TableActions.scala (outdated; resolved)
```diff
 ) {
   val nAryTree: NAryTreeBuilder = NAryTreeBuilder(nsWidth, nsDepth)
-  private val maxPossibleTables = nAryTree.numberOfLastLevelElements * numTablesPerNs
+  val maxPossibleTables = nAryTree.numberOfLastLevelElements * numTablesPerNs
```
The intent behind `maxPossibleTables` initially having private visibility was to only use `numTables`, so that the number of tables could be capped to a certain value.
I.e. if there are 2^16 max possible tables in your dataset, but you want to restrict a specific workload to the first 2000 tables, you would set `max-tables = 2000`. That way, you can reuse a larger dataset for more concentrated workloads.
It does not mean that all workloads have to support this, though. I just want to emphasize what the initial idea was.
If you want to keep using `maxPossibleTables`, would you mind adding a type annotation to it? IntelliJ displays a warning when a public field has no annotated type and is just `var xyz = ...`.
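For illustration, the annotated field might look like the following sketch. The stubbed values are assumptions for the sake of a self-contained example; only the field name and the multiplication come from the diff above:

```scala
object MaxTablesSketch {
  // Stand-ins for the real NAryTreeBuilder values (assumed for illustration)
  val numberOfLastLevelElements: Int = 64
  val numTablesPerNs: Int = 32

  // Public field with an explicit type annotation, as suggested in the review
  val maxPossibleTables: Int = numberOfLastLevelElements * numTablesPerNs
}
```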
Good callout on the annotation.
Initially I thought to create my own dataset for this workload (i.e. a flat namespace structure), but I realized it's fairly easy to use the existing one. `maxPossibleTables` was really convenient for getting the total number of tables, which is the main thing this workload cares about. So if possible I think it makes sense to continue to use it.
I thought for a while about how to make the sampling work with `numTablesMax` set, but gave up for now and decided to go with `maxPossibleTables` for the time being.
That's an interesting idea. Let's defer this to a future PR, as we can improve the code incrementally.
The benchmarks config currently uses `dataset.tree` as the prefix for the tree-related parameters. We could totally introduce a new dataset shape like:

```hocon
dataset {
  tree {
    ...
  }
  flat {
    ...
  }
}
```

I think Russel mentioned this shape a long time ago. So that would make total sense, especially if you are already using it today.
Yeah, we can defer that. Right now what I do is just set the ns width to 1, and then it becomes naturally flat :)
I needed this shape for the benchmark in this PR since I cared specifically about many tables in one namespace.
...scala/org/apache/polaris/benchmarks/parameters/WeightedWorkloadOnTreeDatasetParameters.scala (outdated; resolved)
...scala/org/apache/polaris/benchmarks/parameters/WeightedWorkloadOnTreeDatasetParameters.scala (resolved)
...scala/org/apache/polaris/benchmarks/parameters/WeightedWorkloadOnTreeDatasetParameters.scala (resolved)
.../gatling/scala/org/apache/polaris/benchmarks/simulations/WeightedWorkloadOnTreeDataset.scala (outdated; resolved)
.../gatling/scala/org/apache/polaris/benchmarks/simulations/WeightedWorkloadOnTreeDataset.scala (resolved)
.../gatling/scala/org/apache/polaris/benchmarks/simulations/WeightedWorkloadOnTreeDataset.scala (outdated; resolved)
```scala
.set("multipartNamespace", namespace.mkString(0x1f.toChar.toString))
.set("tableName", table)
.set("initialProperties", expectedProperties)
.set("location", expectedLocation)
```
This part does not feel right: in all the other workloads, it is implemented as a feeder in the TableActions class. What is the rationale behind having this in the simulation itself, in an exec block?
I am not a Gatling expert, but I did wrestle with this for some time before getting it working. When I implemented this in TableActions, I was not able to get the readers and writers to build their distributions at runtime based on the config.
OK, let's keep this version and improve it later.
np, happy to revisit this. I need to learn more about Gatling!
pingtimeout left a comment:
There is one nit and one question left, but nothing that must be addressed before merging. +1
This implements a new scenario, WeightedWorkloadOnTreeDataset, that supports the configuration of multiple distributions over which to weight reads & writes against the catalog. Compared with ReadUpdateTreeDataset, this allows us to understand how performance changes when reads or writes frequently hit the same tables.

Sampling
The distributions are defined in the config file like so:
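The config snippet that originally accompanied this description did not survive extraction. The following is a hedged sketch of the shape the surrounding text implies; only `count`, `mean`, and `variance` are named in the PR, and the enclosing key names are assumptions:

```hocon
# Hypothetical sketch; key names other than count/mean/variance are assumptions.
distributions = [
  # 8 threads sampling from a tight Gaussian near the top of the table range
  { count = 8, mean = 0.9, variance = 0.01 },
  # 2 threads sampling from a wide, nearly uniform distribution
  { count = 2, mean = 0.5, variance = 0.25 }
]
```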
`count` is simply the number of threads which will sample from the distribution, while `mean` and `variance` describe the Gaussian distribution to sample from. These values are generally expected to fall between 0 and 1.0; when a sample falls outside that range, the distribution is repeatedly resampled. For an extreme example, refer to the following:

In this case, about 50% of samples should fall below 0.0 and therefore be resampled. This allows us to create highly concentrated or uniform distributions as needed.
Once a value in [0, 1] is obtained, it is mapped to a table, where 1.0 corresponds to the highest table in the tree dataset (e.g. T_2048) and 0.0 corresponds to T_0.
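The resample-then-map behavior described above can be sketched in Scala. This is an illustrative sketch only, not the PR's actual implementation, and all names in it are hypothetical:

```scala
import scala.util.Random

object WeightedSamplingSketch {
  private val rng = new Random(42)

  // Draw from a Gaussian with the given mean/variance, discarding and
  // redrawing any sample that falls outside [0, 1].
  def sample(mean: Double, variance: Double): Double = {
    val stdDev = math.sqrt(variance)
    var x = mean + rng.nextGaussian() * stdDev
    while (x < 0.0 || x > 1.0) {
      x = mean + rng.nextGaussian() * stdDev
    }
    x
  }

  // Map a sample in [0, 1] linearly onto a table index, so that 0.0
  // corresponds to T_0 and 1.0 to the highest table in the dataset.
  def tableIndex(sample: Double, maxPossibleTables: Int): Int =
    math.min((sample * maxPossibleTables).toInt, maxPossibleTables - 1)
}
```

With a mean near 1.0 and a small variance, most samples land on the highest-numbered tables, which is how the concentrated (hot-table) workloads described above arise.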
To help developers understand the distributions they've defined, some information is printed when the new simulation is run: