-
Notifications
You must be signed in to change notification settings - Fork 27
Add Polaris benchmarks #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jbonofre just a heads up, I might need some help to get the license check and header check added to that project. I did not implement it as I thought, maybe we will have a parent project in this repository too, just like in |
|
|
||
| tasks.withType<ScalaCompile> { | ||
| scalaCompileOptions.forkOptions.apply { | ||
| jvmArgs = listOf("-Xss100m") // Scala compiler may require a larger stack size when compiling Gatling simulations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 that's a big stack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is necessary, tbh, given that the initial PR did not have that and the benchmarks were compiling. But this comes from the canonical example of gatling-gradle plugin https://github.com/gatling/gatling-gradle-plugin-demo-scala/blob/main/build.gradle. So I kept it.
benchmarks/src/gatling/scala/org/apache/polaris/benchmarks/NAryTreeBuilder.scala
Show resolved
Hide resolved
| * @param maxRetries Maximum number of retry attempts for failed operations | ||
| * @param retryableHttpCodes HTTP status codes that should trigger a retry | ||
| */ | ||
| case class AuthenticationActions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this part of the benchmarks listed in the config? I don't remember seeing a doc in the readme for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you are talking about the *Actions.scala classes. These classes regroup the elements that are needed to perform certain actions based on either a feature (e.g. authenticating) or an entity (e.g. tables).
They are not listed in the README, indeed, as they are an implementation detail. This is also where the configuration is consumed, not defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean specifically the Authentication Actions, I didn't understand where that fit in the Write/Read benchmark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see. The explanation is that before any request can be issued, the user must first authenticate against the OAuth endpoint. This class takes care of that. It can feel a little bit over-engineered as, at this time, authentication is performed only once at the very beginning of the benchmark. Example from the CreateTreeDatasetConcurrent simulation.
authenticate
.inject(atOnceUsers(1))
.andThen(createCatalogs.inject(atOnceUsers(50)))
.andThen(
..
)
But the OAuth token is only valid for 1h. Which means that any benchmark that lasts more than that will fail. I have separate modifications (not published yet) that renew this token periodically that make use of the AuthenticationActions more. Does that help clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should be using the credential refresh hooks now? I haven't followed this enough in the OSS Iceberg a discussion but there should be auto-refresh hooks now but that may just be for the Iceberg client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could. But I am not sure what the value would be for OSS Polaris. I need to research that more. AFAICT, the OAuth endpoint is deprecated for removal. Which means there could be a way to bypass authentication entirely later for internal testing purposes. Alternatively, if it only stays for development purposes, then we don't need to strictly follow the client credentials flow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point Polaris will support external IdP (AFAIK), so auto-refresh hooks may not cover all cases.
| * There is no limit to the number of users that can create catalogs concurrently. | ||
| */ | ||
| val createCatalog: ChainBuilder = | ||
| retryOnHttpStatus(maxRetries, retryableHttpCodes, "Create catalog")( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for now, but I was wondering if this should be using a Polaris client rather than doing String based querying? Probably not important because I assume we are going to keep these apis static but I generally am afraid of string constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a good question. In my experience, Gatling benchmarks should have little, if any, external dependencies, so that they are easy to read. The risk is for Gatling benchmarks to become only readable/editable with a fully fledged IDE, instead of a simple text editor or even Github UI.
We could extract those String constants into some sort of a Polaris client. But that would be it. The connection logic and all that would make a complete Polaris client would not be there, as they are Gatling specific. So I am not sure it would be net positive.
benchmarks/src/gatling/scala/org/apache/polaris/benchmarks/util/CircularIterator.scala
Outdated
Show resolved
Hide resolved
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a general skim here and everything looks good to me. I won't have time to get through a detailed review in the near future but wanted to give my stamp of approval.
| } | ||
|
|
||
| dependencies { | ||
| gatling("com.typesafe.play:play-json_2.13:2.9.4") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we want to use toml files like the main Polaris repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have introduced toml in #1, once we merge it we need to rebase this PR and use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this comment applies for the line just after play-json (the typesafe-config line). Play-json is used to parse the payloads returned by Polaris and ensure that maps (e.g. namespace properties) are equal. Given that there is no order guarantee between properties, a plain string comparison cannot be used. So play-json has to stay.
We could move from typesafe-config to a toml file. But I would first like to double check that we are talking about the same thing. Typesafe config was initially preferred as it is already used to configure Gatling and offers the ability to have default parameter values for benchmarks that can then be overridden by users either from the CLI or a separate configuration (HOCON) file.
AFAICT in #1, toml files are used for the Gradle build. The equivalent of typesafe-config in #1 is picocli, not toml files. And I missing something @ajantha-bhat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean toml for gradle dependencies, not for benchmark config... sorry about the confusion 😅
| percentile1 = 25 | ||
| percentile2 = 50 | ||
| percentile3 = 75 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IQR 🎉 ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that one is specifically for your box plots :-)
benchmarks/src/gatling/scala/org/apache/polaris/benchmarks/NAryTreeBuilder.scala
Outdated
Show resolved
Hide resolved
| * @param maxRetries Maximum number of retry attempts for failed operations | ||
| * @param retryableHttpCodes HTTP status codes that should trigger a retry | ||
| */ | ||
| case class AuthenticationActions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point Polaris will support external IdP (AFAIK), so auto-refresh hooks may not cover all cases.
benchmarks/src/gatling/scala/org/apache/polaris/benchmarks/util/CircularIterator.scala
Outdated
Show resolved
Hide resolved
|
I will wait for #1 to be merged as it contains the base build directives that could be leveraged in this PR. I might have to modify these base Gradle files though, in order to enable Scala projects. It looks like #1 assumes that all projects under this repository will be Java-based. To be continued. |
snazy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Looks like it doesn't conflict with #1, so +1 to merge.
Co-authored-by: Robert Stupp <[email protected]>
|
Ok 👍, merging then |
Feature/chat widget
This pull request introduces a comprehensive set of benchmarks to Polaris. The current set includes:
A documentation if provided in the README.md file, including examples of the different datasets that can be generated. The datasets are procedural, which means that given the same input parameters, the same datasets will be generated, thus enabling reproducible benchmarks.
Note that there is one big change compared to the initial PR that was opened against
apache/polaris: the benchmark configuration is now handled via a configuration file (see the README.md file). Using environment variables was brittle and could not be easily documented.