@eric-maynard eric-maynard commented Apr 30, 2025

This implements a new scenario, WeightedWorkloadOnTreeDataset, that supports configuring multiple distributions used to weight reads and writes against the catalog.

Compared with ReadUpdateTreeDataset, this allows us to understand how performance changes when reads or writes frequently hit the same tables.

Sampling

The distributions are defined in the config file like so:

    # Distributions for readers
    # ...
    readers = [
      { count = 8, mean = 0.3, variance = 0.0278 }
    ]

count is the number of threads that will sample from the distribution, while mean and variance describe the Gaussian distribution to sample from. Sampled values are expected to fall between 0.0 and 1.0; when a sample falls outside that range, the distribution is resampled.

For an extreme example, refer to the following:
[Screenshot: an example distribution centered near 0.0]

In this case, about 50% of samples should fall below 0.0 and therefore be resampled. This allows us to create highly concentrated or uniform distributions as needed.

Once a value in [0, 1] is obtained, it is mapped to a table: 0.0 maps to T_0 and 1.0 maps to the highest table in the tree dataset (e.g. T_2048).
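The sampling-and-mapping described above could be sketched roughly like this. All names and signatures here are illustrative assumptions, not the PR's actual code:

```scala
import scala.util.Random

// Hypothetical sketch: draw from a Gaussian with the configured mean and
// variance, resample until the value lands in [0, 1], then map the value
// to a table index (0.0 -> T_0, 1.0 -> the highest table).
object WeightedSampler {
  def sample(mean: Double, variance: Double, rng: Random = new Random()): Double = {
    val stdDev = math.sqrt(variance)
    Iterator
      .continually(rng.nextGaussian() * stdDev + mean)
      .find(v => v >= 0.0 && v <= 1.0)
      .get // a value in [0, 1] is eventually drawn, so .get is safe
  }

  // Map a value in [0, 1] to a table index in [0, maxPossibleTables - 1].
  def toTableIndex(value: Double, maxPossibleTables: Int): Int =
    math.min((value * maxPossibleTables).toInt, maxPossibleTables - 1)
}
```

With mean = 0.3 and variance = 0.0278 (as in the config above), most samples would cluster around table indices near 30% of the table range.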

To help developers understand the distributions they've defined, some information is printed when the new simulation is run:

. . .

### Writer distributions ###
Summary for Distribution(2,0.7,0.0278):
  Range         | % of Samples | Visualization
  --------------|--------------|------------------
  [0.0 - 0.1) |   0.02%      | 
  [0.1 - 0.2) |   0.14%      | 
  [0.2 - 0.3) |   0.71%      | 
  [0.3 - 0.4) |   2.86%      | █
  [0.4 - 0.5) |   8.40%      | ████
  [0.5 - 0.6) |  16.36%      | ████████
  [0.6 - 0.7) |  23.44%      | ████████████
  [0.7 - 0.8) |  23.37%      | ████████████
  [0.8 - 0.9) |  16.56%      | ████████
  [0.9 - 1.0) |   8.15%      | ████

  The most frequently selected table was chosen in ~6% of samples
. . .
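A summary table like the one above could be produced with a simple bucketing pass over the samples. This is an illustrative sketch, not the PR's actual implementation:

```scala
// Hypothetical sketch: bucket samples into ten 0.1-wide ranges and render
// one bar per bucket, scaled so that 2% of samples equals one bar character.
object DistributionSummary {
  def summarize(samples: Seq[Double]): Seq[String] = {
    // Assign each sample to a bucket 0..9; values of exactly 1.0 go to bucket 9.
    val counts = samples
      .map(v => math.min((v * 10).toInt, 9))
      .groupBy(identity)
      .view.mapValues(_.size).toMap
    (0 until 10).map { bucket =>
      val lo = bucket / 10.0
      val hi = (bucket + 1) / 10.0
      val pct = 100.0 * counts.getOrElse(bucket, 0) / samples.size
      val bar = "█" * (pct / 2).toInt
      f"[$lo%.1f - $hi%.1f) | $pct%6.2f%% | $bar"
    }
  }
}
```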

eric-maynard added 2 commits April 30, 2025 01:33

@pingtimeout pingtimeout left a comment


Sorry for the delay! I completely missed this PR...

The new workload is pretty cool, and I especially like the visualization part. I made a few comments to improve the code, and to simplify it in some cases.

    ) {
      val nAryTree: NAryTreeBuilder = NAryTreeBuilder(nsWidth, nsDepth)
    - private val maxPossibleTables = nAryTree.numberOfLastLevelElements * numTablesPerNs
    + val maxPossibleTables = nAryTree.numberOfLastLevelElements * numTablesPerNs

The intent behind maxPossibleTables initially having private visibility was to only expose numTables, so that the number of tables could be capped to a certain value.

I.e. if there are 2^16 max possible tables in your dataset, but you want to restrict a specific workload to the first 2000 tables, you would set max-tables = 2000. That way, you can reuse a larger dataset for more concentrated workloads.

It does not mean that all workloads have to support this though. I just want to emphasize what the initial idea was.

If you want to keep using maxPossibleTables, would you mind adding a type annotation to it? IntelliJ displays a warning when a public field has no annotated type and is just var xyz = ....
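The annotated form the reviewer suggests might look like the following. The surrounding class and the tree arithmetic are simplified assumptions for illustration, not the actual NAryTreeBuilder internals:

```scala
// Minimal self-contained sketch of the type-annotation suggestion above.
class TreeDataset(nsWidth: Int, nsDepth: Int, numTablesPerNs: Int) {
  // Simplified assumption: a full n-ary tree of depth d has width^(d-1)
  // last-level namespaces.
  private val numberOfLastLevelElements: Int = math.pow(nsWidth, nsDepth - 1).toInt

  // Explicit type annotation on the public field avoids the IntelliJ warning.
  val maxPossibleTables: Int = numberOfLastLevelElements * numTablesPerNs
}
```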

@eric-maynard eric-maynard Jun 16, 2025


Good callout on the annotation.

Initially I thought about creating my own dataset for this workload (i.e. a flat namespace structure), but I realized it's fairly easy to use the existing one. maxPossibleTables was really convenient for getting the total number of tables, which is the main thing this workload cares about. So if possible I think it makes sense to continue using it.

I thought for a while about how to make the sampling work with numTablesMax set, but eventually gave up and decided to go with maxPossibleTables for the time being.


That's an interesting idea. Let's defer this to a future PR as we can improve the code incrementally.

The benchmarks config currently uses dataset.tree as the prefix for the tree-related parameters. We could totally introduce a new dataset shape like:

    dataset {
      tree {
        ...
      }
      flat {
        ...
      }
    }

I think Russel mentioned this shape a long time ago. So that would make total sense, especially if you are already using it today.

@eric-maynard eric-maynard Jun 17, 2025


Yeah, we can defer that. Right now what I do is just make the ns width as 1 and then it becomes naturally flat :)

I needed to use this shape for the benchmark in this PR since I cared specifically about having many tables in one ns.

    .set("multipartNamespace", namespace.mkString(0x1f.toChar.toString))
    .set("tableName", table)
    .set("initialProperties", expectedProperties)
    .set("location", expectedLocation)

This part does not feel right: in all the other workloads, it is implemented as a feeder in the TableActions class. What is the rationale behind having this in the simulation itself and in an exec block?


I am not a Gatling expert, but I did wrestle with this for some time before getting it working. When I implemented this in TableActions, I was not able to get the readers and writers to build their distributions at runtime based on the config.


Ok, let's keep this version and improve it later.


np, happy to revisit this. I need to learn more about Gatling!

@eric-maynard eric-maynard requested a review from pingtimeout June 16, 2025 19:06

@pingtimeout pingtimeout left a comment


There is one nit and one question left but nothing that must be addressed before merging. +1

@eric-maynard eric-maynard merged commit 564556d into apache:main Jul 10, 2025
2 checks passed