Merged
3 changes: 3 additions & 0 deletions benchmarks/.gitignore
@@ -0,0 +1,3 @@
.gradle
build
bin/
44 changes: 44 additions & 0 deletions benchmarks/.scalafmt.conf
@@ -0,0 +1,44 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Scalafmt is used to reformat Gatling benchmarks from the benchmarks/ directory

version = 3.9.3
runner.dialect = scala213

maxColumn = 100

preset = default
align.preset = some

assumeStandardLibraryStripMargin = true
align.stripMargin = true

rewrite.rules = [
  AvoidInfix
  RedundantBraces
  RedundantParens
  SortModifiers
  PreferCurlyFors
  Imports
]

rewrite.imports.sort = original
docstrings.style = Asterisk
docstrings.wrap = fold
242 changes: 242 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,242 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Polaris Benchmarks

Benchmarks for the Polaris service using Gatling.

## Available Benchmarks

### Dataset Creation Benchmark

The CreateTreeDataset benchmark creates a test dataset with a specific structure. It exists in two variants:

- `org.apache.polaris.benchmarks.simulations.CreateTreeDatasetSequential`: Creates entities one at a time
- `org.apache.polaris.benchmarks.simulations.CreateTreeDatasetConcurrent`: Creates up to 50 entities simultaneously

These are write-only workloads designed to populate the system for subsequent benchmarks.

### Read/Update Benchmark

The ReadUpdateTreeDataset benchmark tests read and update operations on an existing dataset. It exists in two variants:

- `org.apache.polaris.benchmarks.simulations.ReadUpdateTreeDatasetSequential`: Performs read/update operations one at a time
- `org.apache.polaris.benchmarks.simulations.ReadUpdateTreeDatasetConcurrent`: Performs up to 20 read/update operations simultaneously

These benchmarks can only be run after using CreateTreeDataset to populate the system.

## Parameters

All parameters are configured through the [benchmark-defaults.conf](src/gatling/resources/benchmark-defaults.conf) file located in `src/gatling/resources/`. The configuration uses the [Typesafe Config](https://github.com/lightbend/config) format. The reference configuration file contains default values as well as documentation for each parameter.

### Dataset Structure Parameters

These parameters must be consistent across all benchmarks and are configured under `dataset.tree`:

```hocon
dataset.tree {
  num-catalogs = 1          # Number of catalogs to create
  namespace-width = 2       # Width of the namespace tree
  namespace-depth = 4       # Depth of the namespace tree
  tables-per-namespace = 5  # Tables per namespace
  views-per-namespace = 3   # Views per namespace
  columns-per-table = 10    # Columns per table
  columns-per-view = 10     # Columns per view
  default-base-location = "file:///tmp/polaris" # Base location for datasets
  namespace-properties = 10 # Number of properties to add to each namespace
  table-properties = 10     # Number of properties to add to each table
  view-properties = 10      # Number of properties to add to each view
  max-tables = -1           # Cap on total tables (-1 for no cap). Must be less than N^(D-1) * tables-per-namespace
  max-views = -1            # Cap on total views (-1 for no cap). Must be less than N^(D-1) * views-per-namespace
}
```
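The `max-tables` / `max-views` constraint above can be checked mechanically. The following is a standalone sketch, not code from the benchmarks; the function names are illustrative:

```scala
// Illustrative sketch: validate the documented constraint that a cap must be
// -1 (uncapped) or strictly below N^(D-1) * per-leaf count, where N is the
// namespace tree width and D its depth.
def leafEntityTotal(width: Int, depth: Int, perLeaf: Int): Long =
  BigInt(width).pow(depth - 1).toLong * perLeaf

def isValidCap(cap: Long, width: Int, depth: Int, perLeaf: Int): Boolean =
  cap == -1L || cap < leafEntityTotal(width, depth, perLeaf)
```

With the defaults above (width 2, depth 4, 5 tables per leaf namespace), the uncapped total is 2^3 * 5 = 40 tables, so any cap from 0 to 39, or -1, is acceptable.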

### Connection Parameters

Connection settings are configured under `http` and `auth`:

```hocon
http {
  base-url = "http://localhost:8181" # Service URL
}

auth {
  client-id = null     # Required: OAuth2 client ID
  client-secret = null # Required: OAuth2 client secret
}
```

### Workload Parameters

Workload settings are configured under `workload`:

```hocon
workload {
  read-write-ratio = 0.8 # Ratio of reads (0.0-1.0)
}
```

## Running the Benchmarks

The benchmark uses [typesafe-config](https://github.com/lightbend/config) for configuration management. Default settings are in `src/gatling/resources/benchmark-defaults.conf`. This file should not be modified directly.

To customize the benchmark settings, create your own `application.conf` file and specify it using the `-Dconfig.file` parameter. Your settings will override the default values.

Example `application.conf`:
```hocon
auth {
  client-id = "your-client-id"
  client-secret = "your-client-secret"
}

http {
  base-url = "http://your-polaris-instance:8181"
}

workload {
  read-write-ratio = 0.8
}
```

Run benchmarks with your configuration:

```bash
# Sequential dataset creation
./gradlew gatlingRun --simulation org.apache.polaris.benchmarks.simulations.CreateTreeDatasetSequential \
  -Dconfig.file=./application.conf

# Concurrent dataset creation
./gradlew gatlingRun --simulation org.apache.polaris.benchmarks.simulations.CreateTreeDatasetConcurrent \
  -Dconfig.file=./application.conf
```

A message will show the location of the Gatling report:
```
Reports generated in: ./benchmarks/build/reports/gatling/<simulation-name>/index.html
```

### Example Polaris server startup

For repeated testing and benchmarking purposes, it is convenient to have a fixed client-ID/client-secret combination. **The following example is ONLY for testing and benchmarking against an airgapped Polaris instance.**

```bash
# Start Polaris with the fixed client-ID/secret admin/admin.
# NEVER USE THE FOLLOWING FOR ANY NON-AIRGAPPED POLARIS INSTANCE!
./gradlew :polaris-quarkus-server:quarkusBuild && java \
  -Dpolaris.bootstrap.credentials=POLARIS,admin,admin \
  -Djava.security.manager=allow \
  -jar quarkus/server/build/quarkus-app/quarkus-run.jar
```

With the above, you can run the benchmarks using a configuration file with `client-id = "admin"` and `client-secret = "admin"`. This is meant only as a convenience in a fully airgapped system.

# Test Dataset

The benchmarks use synthetic procedural datasets that are generated deterministically at runtime. This means that given the same input parameters, the exact same dataset structure will always be generated. This approach allows generating large volumes of test data without having to store it, while ensuring reproducible benchmark results across different runs.
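To illustrate the idea (a hypothetical sketch, not the benchmarks' actual generator): because an entity's name and position are derived purely from the tree parameters and the entity's index, regenerating with the same parameters always yields the same entities. The `NS_<i>` naming mirrors the diagrams below; in a complete `N`-ary tree with breadth-first indexing, node `i > 0` has parent `(i - 1) / N`:

```scala
// Hypothetical sketch of deterministic generation: the path of namespace i in
// a complete N-ary tree is computed from i alone, so the same parameters
// always produce the same dataset structure.
def namespacePath(index: Int, width: Int): List[String] =
  if (index == 0) List("NS_0")
  else namespacePath((index - 1) / width, width) :+ s"NS_$index"
```

For example, in a binary tree, namespace 3 always resolves to the path `NS_0 / NS_1 / NS_3`, no matter when or where it is generated.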

The diagrams below describe the datasets used in the benchmarks. Note that the benchmark dataset may not cover all Polaris features.

## Generation rules

The dataset has a tree shape. At the root of the tree is a Polaris realm that must exist before the dataset is created.

An arbitrary number of catalogs can be created under the realm. However, only the first catalog (`C_0`) is used for the rest of the dataset.

The namespace part of the dataset is a complete `N`-ary tree. That is, it starts with a root namespace (`NS_0`), and each namespace contains exactly `0` or `N` child namespaces. Both the width and the depth of the namespace tree are configurable. The total number of namespaces can be calculated with the following formula, where `N` is the tree width and `D` is the total tree depth, including the root:

$$\text{Total number of namespaces} =
\begin{cases}
\frac{N^{D} - 1}{N - 1} & \mbox{if } N \gt 1 \\
D & \mbox{if } N = 1
\end{cases}$$
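The piecewise formula is a geometric series for `N > 1`; it can be evaluated directly with a quick standalone sketch (exact integer arithmetic via `BigInt`):

```scala
// Total namespaces in a complete N-ary tree of depth D (root included):
// (N^D - 1) / (N - 1) for N > 1, and simply D for the degenerate 1-ary chain.
def totalNamespaces(n: Int, d: Int): BigInt =
  if (n == 1) BigInt(d)
  else (BigInt(n).pow(d) - 1) / (n - 1)
```

For instance, a binary tree of depth 3 yields (2^3 - 1) / (2 - 1) = 7 namespaces.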

The tables are created under the leaves of the tree, that is, under the namespaces with no child namespaces. The number of tables created under each leaf namespace is configurable. The total number of tables can be calculated with the following formula, where `N` is the tree width, `D` is the total tree depth, and `T` is the number of tables per leaf namespace:

$$\text{Total number of tables} = N^{D-1} \times T$$

The views are created alongside the tables. The number of views created under each leaf namespace is also configurable. The total number of views can be calculated with the following formula, where `N` is the tree width, `D` is the total tree depth, and `V` is the number of views per leaf namespace:

$$\text{Total number of views} = N^{D-1} \times V$$
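Tables and views share the same shape: a per-leaf count multiplied by the number of leaf namespaces, `N^(D-1)`. A standalone sketch (not part of the benchmarks):

```scala
// Total tables or views: number of leaf namespaces (N^(D-1)) times the
// configured per-leaf count (T for tables, V for views).
def totalLeafEntities(n: Int, d: Int, perLeaf: Int): BigInt =
  BigInt(n).pow(d - 1) * perLeaf
```

This matches the worked examples below, e.g. 2^(3-1) * 5 = 20 tables for the binary tree.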

## Binary tree example

The diagram below shows an example of a test dataset with the following properties:

- Number of catalogs: `3`
- Namespace tree width (`N`): `2` (a binary tree)
- Namespace tree depth (`D`): `3`
- Tables per namespace (`T`): `5`
- Views per namespace (`V`): `3`

![Binary tree dataset example with width 2, depth 3, and 5 tables per namespace](docs/dataset-shape-2-3-5.svg)

Using the formula from the previous section, we can calculate the total number of namespaces and the total number of tables as follows:

$$\text{Total number of namespaces} = \frac{2^{3} - 1}{2 - 1} = 7$$

$$\text{Total number of tables} = 2^{3-1} \times 5 = 20$$

## 10-ary tree example

The diagram below shows an example of a test dataset with the following properties:

- Number of catalogs: `1`
- Namespace tree width (`N`): `10`
- Namespace tree depth (`D`): `2`
- Tables per namespace (`T`): `3`
- Views per namespace (`V`): `3`

![10-ary tree dataset example with width 10, depth 2, and 3 tables per namespace](docs/dataset-shape-10-2-3.svg)

Using the formula from the previous section, we can calculate the total number of namespaces and the total number of tables as follows:

$$\text{Total number of namespaces} = \frac{10^{2} - 1}{10 - 1} = 11$$

$$\text{Total number of tables} = 10^{2-1} \times 3 = 30$$

## 1-ary tree example

The diagram below shows an example of a test dataset with the following properties:

- Number of catalogs: `1`
- Namespace tree width (`N`): `1`
- Namespace tree depth (`D`): `1000`
- Tables per namespace (`T`): `7`
- Views per namespace (`V`): `4`

![1-ary tree dataset example with width 1, depth 1000, and 7 tables per namespace](docs/dataset-shape-1-1000-7.svg)

Using the formula from the previous section, we can calculate the total number of namespaces and the total number of tables as follows:

$$\text{Total number of namespaces} = 1000$$

$$\text{Total number of tables} = 1^{1000-1} \times 7 = 7$$

## Size

The dataset size can be adjusted as well. Each namespace is associated with a configurable number of dummy properties. Similarly, each table is associated with a configurable number of dummy columns and properties.

The diagram below shows sample catalog, namespace, and table definitions given the following properties:

- Default base location: `file:///tmp/polaris`
- Number of namespace properties: `100`
- Number of columns per table: `999`
- Number of table properties: `59`

![Dataset size example showing catalog, namespace, and table definitions](docs/dataset-size.png)
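The dummy attributes in the example above can be sketched as deterministically generated key/value pairs, so that entity payload sizes are reproducible across runs. The key/value names here are illustrative, not necessarily the ones the benchmarks actually emit:

```scala
// Illustrative sketch: generate a fixed number of dummy properties
// deterministically from an index, keeping payload sizes reproducible.
def dummyProperties(count: Int): Map[String, String] =
  (0 until count).map(i => s"attribute_$i" -> s"value_$i").toMap
```

With `namespace-properties = 100`, each namespace would carry 100 such entries.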
29 changes: 29 additions & 0 deletions benchmarks/build.gradle.kts
@@ -0,0 +1,29 @@
plugins {
  scala
  id("io.gatling.gradle") version "3.13.5.2"
  id("com.diffplug.spotless") version "7.0.2"
}

description = "Polaris Iceberg REST API performance tests"

tasks.withType<ScalaCompile> {
  scalaCompileOptions.forkOptions.apply {
    jvmArgs = listOf("-Xss100m") // Scala compiler may require a larger stack size when compiling Gatling simulations
> **@RussellSpitzer** (Member, Apr 1, 2025): 👀 that's a big stack
>
> **Contributor Author:** I don't think it is necessary, tbh, given that the initial PR did not have that and the benchmarks were compiling. But this comes from the canonical example of the gatling-gradle plugin: https://github.com/gatling/gatling-gradle-plugin-demo-scala/blob/main/build.gradle. So I kept it.
  }
}

dependencies {
  gatling("com.typesafe.play:play-json_2.13:2.9.4")
> **Reviewer:** nit: do we want to use toml files like the main Polaris repo?
>
> **Member:** I have introduced toml in #1; once we merge it, we need to rebase this PR and use it.
>
> **Contributor Author:** I believe this comment applies to the line just after play-json (the typesafe-config line). Play-json is used to parse the payloads returned by Polaris and to ensure that maps (e.g. namespace properties) are equal. Given that there is no ordering guarantee between properties, a plain string comparison cannot be used, so play-json has to stay.
>
> We could move from typesafe-config to a toml file, but I would first like to double-check that we are talking about the same thing. Typesafe Config was initially preferred because it is already used to configure Gatling and offers the ability to have default parameter values for benchmarks that can then be overridden by users, either from the CLI or from a separate configuration (HOCON) file.
>
> AFAICT, in #1 toml files are used for the Gradle build. The equivalent of typesafe-config in #1 is picocli, not toml files. Am I missing something, @ajantha-bhat?
>
> **Reviewer:** I mean toml for gradle dependencies, not for benchmark config... sorry about the confusion 😅

  gatling("com.typesafe:config:1.4.3")
}

repositories {
  mavenCentral()
}

spotless {
  scala {
    // Use scalafmt for Scala formatting
    scalafmt("3.9.3").configFile(".scalafmt.conf")
  }
}
8 changes: 8 additions & 0 deletions benchmarks/gradle/wrapper/gradle-wrapper.properties
@@ -0,0 +1,8 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionSha256Sum=20f1b1176237254a6fc204d8434196fa11a4cfb387567519c61556e8710aed78
distributionUrl=https\://services.gradle.org/distributions/gradle-8.13-bin.zip
networkTimeout=10000
validateDistributionUrl=true
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists