
Conversation

@pingtimeout (Contributor) commented Apr 1, 2025

This pull request introduces a comprehensive set of benchmarks to Polaris. The current set includes:

  • A benchmark that populates an empty Polaris server with a dataset that has predefined attributes
  • A benchmark that issues only read queries over that dataset
  • A benchmark that issues read and write queries (entity updates) over that dataset, with a configurable read/write ratio
  • A benchmark that creates commits against existing tables

Documentation is provided in the README.md file, including examples of the different datasets that can be generated. The datasets are procedural, which means that given the same input parameters, the same datasets will be generated, enabling reproducible benchmarks.
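The reproducibility property can be sketched in a few lines of Scala (the object and parameter names here are hypothetical, not the PR's actual generator): every generated name is derived from a single seeded PRNG, so identical input parameters always yield the identical dataset.

```scala
import scala.util.Random

// Hypothetical sketch of procedural dataset generation: all randomness flows
// from one seed, so the same parameters always produce the same entities.
object ProceduralDataset {
  def generate(nCatalogs: Int, nTablesPerCatalog: Int, seed: Long): Seq[(String, Seq[String])] = {
    val rng = new Random(seed) // single source of randomness
    (0 until nCatalogs).map { _ =>
      val catalog = s"C_${rng.nextInt(1000000)}"
      val tables  = (0 until nTablesPerCatalog).map(_ => s"T_${rng.nextInt(1000000)}")
      (catalog, tables)
    }
  }
}
```

Running two benchmarks with the same parameters then operates on byte-identical datasets, which is what makes results comparable across runs.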

Note that there is one big change compared to the initial PR that was opened against apache/polaris: the benchmark configuration is now handled via a configuration file (see the README.md file). Using environment variables was brittle and could not be easily documented.

@pingtimeout (Contributor, Author)

@jbonofre just a heads up, I might need some help to get the license check and header check added to that project. I did not implement it, as I thought maybe we will have a parent project in this repository too, just like in apache/polaris. Let me know how you want to proceed.


tasks.withType<ScalaCompile> {
    scalaCompileOptions.forkOptions.apply {
        jvmArgs = listOf("-Xss100m") // Scala compiler may require a larger stack size when compiling Gatling simulations
@RussellSpitzer (Member) Apr 1, 2025

👀 that's a big stack

Contributor Author

I don't think it is necessary, tbh, given that the initial PR did not have it and the benchmarks were compiling. But this comes from the canonical example of the gatling-gradle plugin https://github.com/gatling/gatling-gradle-plugin-demo-scala/blob/main/build.gradle. So I kept it.

* @param maxRetries Maximum number of retry attempts for failed operations
* @param retryableHttpCodes HTTP status codes that should trigger a retry
*/
case class AuthenticationActions(
Member

Is this part of the benchmarks listed in the config? I don't remember seeing a doc in the readme for this

Contributor Author

I assume you are talking about the *Actions.scala classes. These classes group the elements that are needed to perform certain actions, based on either a feature (e.g. authenticating) or an entity (e.g. tables).

They are not listed in the README, indeed, as they are an implementation detail. This is also where the configuration is consumed, not defined.

Member

I mean specifically the AuthenticationActions; I didn't understand where that fits in the Write/Read benchmark.

Contributor Author

Ok I see. The explanation is that before any request can be issued, the user must first authenticate against the OAuth endpoint. This class takes care of that. It can feel a little over-engineered as, at this time, authentication is performed only once at the very beginning of the benchmark. Example from the CreateTreeDatasetConcurrent simulation:

    authenticate
      .inject(atOnceUsers(1))
      .andThen(createCatalogs.inject(atOnceUsers(50)))
      .andThen(
        ..
      )

But the OAuth token is only valid for 1h, which means that any benchmark that lasts longer than that will fail. I have separate modifications (not published yet) that renew this token periodically and make heavier use of the AuthenticationActions. Does that help clarify?
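As a rough illustration of that not-yet-published renewal logic (the class and parameter names here are hypothetical, not the actual AuthenticationActions code), the idea is to cache the token together with its expiry and re-authenticate shortly before the 1h deadline:

```scala
// Hypothetical sketch: a token cache that re-fetches the OAuth token shortly
// before it expires, so benchmarks longer than the token lifetime keep working.
final class TokenCache(
    fetchToken: () => (String, Long),    // returns (token, ttl in millis)
    refreshMarginMillis: Long = 60000L,  // renew this long before expiry
    now: () => Long = () => System.currentTimeMillis()) {

  private var token: String = ""
  private var expiresAt: Long = 0L       // forces a fetch on first use

  def current(): String = synchronized {
    if (now() + refreshMarginMillis >= expiresAt) {
      val (t, ttl) = fetchToken()        // re-authenticate against the OAuth endpoint
      token = t
      expiresAt = now() + ttl
    }
    token
  }
}
```

A scenario would then call `current()` before each request instead of capturing the token once at startup.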

Member

I believe we should be using the credential refresh hooks now? I haven't followed the discussion in OSS Iceberg closely enough, but there should be auto-refresh hooks now. That may just be for the Iceberg client, though?

Contributor Author

We could. But I am not sure what the value would be for OSS Polaris. I need to research that more. AFAICT, the OAuth endpoint is deprecated for removal, which means there could be a way to bypass authentication entirely later for internal testing purposes. Alternatively, if it only stays for development purposes, then we don't need to strictly follow the client credentials flow?


At some point Polaris will support external IdP (AFAIK), so auto-refresh hooks may not cover all cases.

* There is no limit to the number of users that can create catalogs concurrently.
*/
val createCatalog: ChainBuilder =
  retryOnHttpStatus(maxRetries, retryableHttpCodes, "Create catalog")(
Member

Looks good for now, but I was wondering if this should be using a Polaris client rather than doing String-based querying? Probably not important, because I assume we are going to keep these APIs static, but I am generally afraid of string constants.

Contributor Author

It is a good question. In my experience, Gatling benchmarks should have little, if any, external dependencies, so that they are easy to read. The risk is for Gatling benchmarks to become readable/editable only with a fully fledged IDE, instead of a simple text editor or even the GitHub UI.

We could extract those String constants into some sort of Polaris client. But that would be it. The connection logic and everything else that would make a complete Polaris client would not be there, as they are Gatling-specific. So I am not sure it would be a net positive.
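Separately from the client question, the retryOnHttpStatus helper visible in the excerpt wraps a request with bounded retries on selected status codes. A standalone model of that idea (the real helper operates on Gatling ChainBuilders; this sketch only mirrors the retry logic, with hypothetical names):

```scala
import scala.annotation.tailrec

// Hypothetical sketch of retry-on-HTTP-status: repeat the call while it
// returns a retryable status, up to maxRetries extra attempts.
object Retry {
  @tailrec
  def onStatus[A](maxRetries: Int, retryable: Set[Int], label: String)(
      call: () => (Int, A)): Either[String, A] = {
    val (status, body) = call()
    if (!retryable.contains(status)) Right(body)  // success or non-retryable failure
    else if (maxRetries <= 0) Left(s"$label: exhausted retries, last status $status")
    else onStatus(maxRetries - 1, retryable, label)(call)
  }
}
```

The label plays the same role as the "Create catalog" string in the excerpt: it identifies the operation in error reports.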

@RussellSpitzer (Member) left a comment

I did a general skim here and everything looks good to me. I won't have time to get through a detailed review in the near future but wanted to give my stamp of approval.

}

dependencies {
gatling("com.typesafe.play:play-json_2.13:2.9.4")

nit: do we want to use toml files like the main Polaris repo?

Member

I have introduced toml in #1; once we merge it, we need to rebase this PR and use it.

Contributor Author

I believe this comment applies to the line just after play-json (the typesafe-config line). Play-json is used to parse the payloads returned by Polaris and to ensure that maps (e.g. namespace properties) are equal. Given that there is no order guarantee between properties, a plain string comparison cannot be used. So play-json has to stay.

We could move from typesafe-config to a toml file. But I would first like to double check that we are talking about the same thing. Typesafe config was initially preferred as it is already used to configure Gatling and offers the ability to have default parameter values for benchmarks that can then be overridden by users either from the CLI or a separate configuration (HOCON) file.

AFAICT in #1, toml files are used for the Gradle build. The equivalent of typesafe-config in #1 is picocli, not toml files. Am I missing something, @ajantha-bhat?
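For illustration, this is the kind of default-plus-override layering typesafe-config provides (the keys below are hypothetical, not the PR's actual parameters): defaults live in a HOCON file, and with ConfigFactory.load(), JVM system properties (e.g. -Dbenchmark.users=100 on the CLI) take precedence over the file values.

```hocon
# Hypothetical benchmark defaults; actual keys may differ from the PR's config.
benchmark {
  users = 50             # concurrent virtual users
  read-write-ratio = 0.8 # fraction of operations that are reads
}
```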


I mean toml for gradle dependencies, not for benchmark config... sorry about the confusion 😅

Comment on lines +4 to +6
percentile1 = 25
percentile2 = 50
percentile3 = 75

IQR 🎉 ;)

Contributor Author

Yes that one is specifically for your box plots :-)


@pingtimeout (Contributor, Author)

I will wait for #1 to be merged as it contains the base build directives that could be leveraged in this PR. I might have to modify these base Gradle files though, in order to enable Scala projects. It looks like #1 assumes that all projects under this repository will be Java-based. To be continued.

@snazy (Member) left a comment

LGTM!

Looks like it doesn't conflict with #1, so +1 to merge.

Co-authored-by: Robert Stupp <[email protected]>
@pingtimeout (Contributor, Author)

Ok 👍, merging then

@jbonofre jbonofre merged commit 1859a16 into apache:main Apr 10, 2025
@pingtimeout pingtimeout deleted the benchmarks branch April 16, 2025 15:41
flyrain referenced this pull request in gh-yzou/polaris-tools Dec 10, 2025
flyrain pushed a commit to flyrain/polaris-tools that referenced this pull request Dec 14, 2025