
Conversation

@pingtimeout (Contributor) commented Apr 1, 2025

This pull request introduces a comprehensive set of benchmarks to Polaris. The current set includes:

  • A benchmark that populates an empty Polaris server with a dataset that has predefined attributes
  • A benchmark that issues only read queries over that dataset
  • A benchmark that issues read and write queries (entity updates) over that dataset, with a configurable read/write ratio
  • A benchmark that creates commits against existing tables

Documentation is provided in the README.md file, including examples of the different datasets that can be generated. The datasets are procedural, which means that given the same input parameters, the same datasets will be generated, enabling reproducible benchmarks.
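The reproducibility property can be sketched in a few lines of Scala (the object and parameter names here are hypothetical, not the PR's actual generator): every generated name is derived from a single seeded PRNG, so identical input parameters always yield the identical dataset.

```scala
import scala.util.Random

// Hypothetical sketch of procedural dataset generation: all randomness flows
// from one seed, so the same parameters always produce the same entities.
object ProceduralDataset {
  def generate(nCatalogs: Int, nTablesPerCatalog: Int, seed: Long): Seq[(String, Seq[String])] = {
    val rng = new Random(seed) // single source of randomness
    (0 until nCatalogs).map { _ =>
      val catalog = s"C_${rng.nextInt(1000000)}"
      val tables  = (0 until nTablesPerCatalog).map(_ => s"T_${rng.nextInt(1000000)}")
      (catalog, tables)
    }
  }
}
```

Running two benchmarks with the same parameters then operates on byte-identical datasets, which is what makes results comparable across runs.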

Note that there is one big change compared to the initial PR that was opened against apache/polaris: the benchmark configuration is now handled via a configuration file (see the README.md file). Using environment variables was brittle and could not be easily documented.

@pingtimeout (Contributor, Author)

@jbonofre just a heads up, I might need some help to get the license check and header check added to that project. I did not implement it, as I thought maybe we will have a parent project in this repository too, just like in apache/polaris. Let me know how you want to proceed.


tasks.withType<ScalaCompile> {
    scalaCompileOptions.forkOptions.apply {
        jvmArgs = listOf("-Xss100m") // Scala compiler may require a larger stack size when compiling Gatling simulations
@RussellSpitzer (Member) Apr 1, 2025

👀 that's a big stack

Contributor Author

I don't think it is necessary, tbh, given that the initial PR did not have it and the benchmarks were compiling. But this comes from the canonical example of the gatling-gradle plugin https://github.com/gatling/gatling-gradle-plugin-demo-scala/blob/main/build.gradle. So I kept it.

* @param maxRetries Maximum number of retry attempts for failed operations
* @param retryableHttpCodes HTTP status codes that should trigger a retry
*/
case class AuthenticationActions(
Member

Is this part of the benchmarks listed in the config? I don't remember seeing a doc in the readme for this

Contributor Author

I assume you are talking about the *Actions.scala classes. These classes group the elements that are needed to perform certain actions, based on either a feature (e.g. authenticating) or an entity (e.g. tables).

They are not listed in the README, indeed, as they are an implementation detail. This is also where the configuration is consumed, not defined.

Member

I mean specifically the AuthenticationActions; I didn't understand where that fits in the Write/Read benchmark.

Contributor Author

Ok I see. The explanation is that before any request can be issued, the user must first authenticate against the OAuth endpoint. This class takes care of that. It can feel a little over-engineered as, at this time, authentication is performed only once at the very beginning of the benchmark. Example from the CreateTreeDatasetConcurrent simulation:

    authenticate
      .inject(atOnceUsers(1))
      .andThen(createCatalogs.inject(atOnceUsers(50)))
      .andThen(
        ..
      )

But the OAuth token is only valid for 1h, which means that any benchmark that lasts longer than that will fail. I have separate modifications (not published yet) that renew this token periodically and make heavier use of the AuthenticationActions. Does that help clarify?
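As a rough illustration of that not-yet-published renewal logic (the class and parameter names here are hypothetical, not the actual AuthenticationActions code), the idea is to cache the token together with its expiry and re-authenticate shortly before the 1h deadline:

```scala
// Hypothetical sketch: a token cache that re-fetches the OAuth token shortly
// before it expires, so benchmarks longer than the token lifetime keep working.
final class TokenCache(
    fetchToken: () => (String, Long),    // returns (token, ttl in millis)
    refreshMarginMillis: Long = 60000L,  // renew this long before expiry
    now: () => Long = () => System.currentTimeMillis()) {

  private var token: String = ""
  private var expiresAt: Long = 0L       // forces a fetch on first use

  def current(): String = synchronized {
    if (now() + refreshMarginMillis >= expiresAt) {
      val (t, ttl) = fetchToken()        // re-authenticate against the OAuth endpoint
      token = t
      expiresAt = now() + ttl
    }
    token
  }
}
```

A scenario would then call `current()` before each request instead of capturing the token once at startup.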

Member

I believe we should be using the credential refresh hooks now? I haven't followed the discussion in OSS Iceberg closely enough, but there should be auto-refresh hooks now. That may just be for the Iceberg client, though?

Contributor Author

We could. But I am not sure what the value would be for OSS Polaris. I need to research that more. AFAICT, the OAuth endpoint is deprecated for removal, which means there could be a way to bypass authentication entirely later for internal testing purposes. Alternatively, if it only stays for development purposes, then we don't need to strictly follow the client credentials flow?


At some point Polaris will support external IdP (AFAIK), so auto-refresh hooks may not cover all cases.

* There is no limit to the number of users that can create catalogs concurrently.
*/
val createCatalog: ChainBuilder =
  retryOnHttpStatus(maxRetries, retryableHttpCodes, "Create catalog")(
Member

Looks good for now, but I was wondering if this should be using a Polaris client rather than doing String-based querying? Probably not important, because I assume we are going to keep these APIs static, but I am generally afraid of string constants.

Contributor Author

It is a good question. In my experience, Gatling benchmarks should have little, if any, external dependencies, so that they are easy to read. The risk is for Gatling benchmarks to become readable/editable only with a fully fledged IDE, instead of a simple text editor or even the GitHub UI.

We could extract those String constants into some sort of Polaris client. But that would be it. The connection logic and everything else that would make a complete Polaris client would not be there, as they are Gatling-specific. So I am not sure it would be a net positive.
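Separately from the client question, the retryOnHttpStatus helper visible in the excerpt wraps a request with bounded retries on selected status codes. A standalone model of that idea (the real helper operates on Gatling ChainBuilders; this sketch only mirrors the retry logic, with hypothetical names):

```scala
import scala.annotation.tailrec

// Hypothetical sketch of retry-on-HTTP-status: repeat the call while it
// returns a retryable status, up to maxRetries extra attempts.
object Retry {
  @tailrec
  def onStatus[A](maxRetries: Int, retryable: Set[Int], label: String)(
      call: () => (Int, A)): Either[String, A] = {
    val (status, body) = call()
    if (!retryable.contains(status)) Right(body)  // success or non-retryable failure
    else if (maxRetries <= 0) Left(s"$label: exhausted retries, last status $status")
    else onStatus(maxRetries - 1, retryable, label)(call)
  }
}
```

The label plays the same role as the "Create catalog" string in the excerpt: it identifies the operation in error reports.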

@RussellSpitzer (Member) left a comment

I did a general skim here and everything looks good to me. I won't have time to get through a detailed review in the near future but wanted to give my stamp of approval.

}

dependencies {
gatling("com.typesafe.play:play-json_2.13:2.9.4")

nit: do we want to use toml files like the main Polaris repo?

Member

I have introduced toml in #1; once we merge it, we need to rebase this PR and use it.

Contributor Author

I believe this comment applies to the line just after play-json (the typesafe-config line). Play-json is used to parse the payloads returned by Polaris and to ensure that maps (e.g. namespace properties) are equal. Given that there is no order guarantee between properties, a plain string comparison cannot be used. So play-json has to stay.

We could move from typesafe-config to a toml file. But I would first like to double check that we are talking about the same thing. Typesafe config was initially preferred as it is already used to configure Gatling and offers the ability to have default parameter values for benchmarks that can then be overridden by users either from the CLI or a separate configuration (HOCON) file.

AFAICT in #1, toml files are used for the Gradle build. The equivalent of typesafe-config in #1 is picocli, not toml files. Am I missing something, @ajantha-bhat?
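For illustration, this is the kind of default-plus-override layering typesafe-config provides (the keys below are hypothetical, not the PR's actual parameters): defaults live in a HOCON file, and with ConfigFactory.load(), JVM system properties (e.g. -Dbenchmark.users=100 on the CLI) take precedence over the file values.

```hocon
# Hypothetical benchmark defaults; actual keys may differ from the PR's config.
benchmark {
  users = 50             # concurrent virtual users
  read-write-ratio = 0.8 # fraction of operations that are reads
}
```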


I mean toml for gradle dependencies, not for benchmark config... sorry about the confusion 😅

Comment on lines +4 to +6
percentile1 = 25
percentile2 = 50
percentile3 = 75

IQR 🎉 ;)

Contributor Author

Yes that one is specifically for your box plots :-)


@pingtimeout (Contributor, Author)

I will wait for #1 to be merged as it contains the base build directives that could be leveraged in this PR. I might have to modify these base Gradle files though, in order to enable Scala projects. It looks like #1 assumes that all projects under this repository will be Java-based. To be continued.

@snazy (Member) left a comment

LGTM!

Looks like it doesn't conflict with #1, so +1 to merge.

Co-authored-by: Robert Stupp <[email protected]>
@pingtimeout (Contributor, Author)

Ok 👍, merging then

@jbonofre jbonofre merged commit 1859a16 into apache:main Apr 10, 2025
@pingtimeout pingtimeout deleted the benchmarks branch April 16, 2025 15:41
flyrain referenced this pull request in gh-yzou/polaris-tools Dec 10, 2025
flyrain pushed a commit to flyrain/polaris-tools that referenced this pull request Dec 14, 2025