Conversation
...e/service/src/main/java/org/apache/polaris/service/persistence/PersistenceConfiguration.java
Outdated
Show resolved
Hide resolved
|
I don’t think this PR is moving in the right direction. NoSQL is one of the persistence impl., but this change introduces NoSQL specific concerns into the business logic module. This blurs the separation of concerns and makes the business layer dependent on a concrete persistence backend, which will make the architecture harder to evolve and maintain. |
flyrain
left a comment
There was a problem hiding this comment.
NoSQL persistence impl. should NOT leak to business module.
This change adds the NoSQL persistence to polaris-runtime-service.
fdd5958 to
029fe0b
Compare
|
@flyrain :
Could you, please clarify what you mean by "business module"? |
dimas-b
left a comment
There was a problem hiding this comment.
The latest state of this PR LGTM 👍
Changes in runtime/default properly flag the new NoSQL persistence as "beta". This should be sufficient for early adopters. Documentation for mainstream users TBD later.
| quarkus.micrometer.export.prometheus.enabled=true | ||
| quarkus.oidc.enabled=true | ||
| quarkus.otel.enabled=true | ||
| quarkus.mongodb.metrics.enabled=true |
There was a problem hiding this comment.
Should these properties also be commented out since they aren't applicable?
There was a problem hiding this comment.
@RussellSpitzer : what do you mean by "aren't applicable"?
There was a problem hiding this comment.
If I use this application.properties as is, this property won't do anything correct?
There was a problem hiding this comment.
True. However, if NoSQL + MongoDB is enabled, getting metrics automatically would be handy, right?
I'm ok commenting this out too.
There was a problem hiding this comment.
Definitely! I was mostly just thinking that we generally comment out properties that are available but don't do anything in the current config.
It may make sense to just group nosql or mongo params together too
There was a problem hiding this comment.
Thinking some more about this: runtime/defaults is basically for the default Server image (or tar). However, users still need to make decisions about monitoring and backend database parameters. So it does make sense to comment out anything that is not "activated" by default.
Helm charts can enable extra config at deployment time (e.g. if MongoDB is selected).
Users of tar distributions should still set up extra infra for monitoring, so it's not really seamless even if it's enabled by default.
There was a problem hiding this comment.
+1 to commenting out everything that's specific to enabling the particular persistence type. Someone reading through the config file trying to figure out could get confused whether a mongodb config setting is active or impacts runtime behaviors despite not turning on mongodb for anything.
Is there a way for the code-level default to already have these things enabled, and have a commented-out config basically reflect the code default (for these and the other similar mongodb configs discussed elsewhere).
I'm thinking of how things like hadoop-env.sh or spark-env.sh or spark-defaults.conf are basically just a big file of commented-out settings that mostly just reflect code-level defaults, but make it easy for a reader to mix in what they want to customize without getting overwhelmed about bare-minimum settings.
Are Quarkus application.properties "composeable" to easily define e.g. a mongodb-mixins.properties or spanner-mixins.properties to still make it easy to choose a persistence profile without ending up with a monolithic base config file?
If not, big commented-out sections still seems a reasonable way to scale this.
| # Available backends: | ||
| # - InMemory - for testing purposes | ||
| # - MongoDb - configure the via the Quarkus extension | ||
| polaris.persistence.nosql.backend=InMemory |
There was a problem hiding this comment.
Same comment here, shouldn't this be commented out since it's not in use?
There was a problem hiding this comment.
This is a reasonable default if the user selects NoSQL persistence via polaris.persistence.type.
It has no effect otherwise.
There was a problem hiding this comment.
That was basically my thought, I generally would keep properties unset unless they do something. So in this case we have a property that's set but doesn't actually do anything because other config stops it from being valid.
Like how we do in the cloudwatch config
# AWS CloudWatch event listener settings
# polaris.event-listener.type=aws-cloudwatch
# polaris.event-listener.aws-cloudwatch.log-group=polaris-cloudwatch-default-group
# polaris.event-listener.aws-cloudwatch.log-stream=polaris-cloudwatch-default-stream
# polaris.event-listener.aws-cloudwatch.region=us-east-1
# polaris.event-listener.aws-cloudwatch.synchronous-mode=false
There was a problem hiding this comment.
Fair enough from my POV (but I do not consider it a blocker 🙂 )
There was a problem hiding this comment.
side note: it might be worth setting polaris.persistence.type automatically if polaris.persistence.nosql.backend is set (future).
There was a problem hiding this comment.
Just reviewing and leaving some feedback I hope would be helpful :)
@flyrain Could you be more explicit in your concern here? I see the new config in the Application.conf which seems reasonable to me, and the test integration which also seems reasonable. This mirrors what we are doing with JDBC correct? Is there something else you are worried about? |
| quarkus.rds.sync-client.type=apache | ||
|
|
||
| ## MongoDB specific configuration | ||
| quarkus.mongodb.database=polaris |
There was a problem hiding this comment.
Do we need this config? I don't think we have to put it here as a default configs, given that NoSql is one of persistence options.
There was a problem hiding this comment.
This config setting could definitely use a comment explaining what it is and what the options are. At first glance it's not clear what the config does. Are the options, for example, "polaris" vs "glue" or "unity" ;)
Is it just the database name in mongodb to use when mongodb is used as the persistence layer?
I'd also prefer to comment out the configs like this one that aren't "active" by default (actually, I would've preferred to comment out the quarkus.rds ones as well and have a nice config section for all the quarkus.rds.* settings that makes it clear which config setting enables the block of rds configs too).
In my experience, it's nice to pare down the actual active config strings in the application.properties file to the true bare minimum for correct default out-of-the-box. Even the configs that match code-level defaults would be present but commented out as documentation so it's easy to see what can be adjusted, but without having to worry about whether it's a crucial input when the deployment mode isn't applicable for the config.
| quarkus.index-dependency.agrona.group-id=org.agrona | ||
| quarkus.index-dependency.agrona.artifact-id=agrona |
There was a problem hiding this comment.
Question, why do we need these two new configs?
There was a problem hiding this comment.
The comment just above (lines 298-299) explains it, does it not?
There was a problem hiding this comment.
indexing jars is generally required for proper CDI
There was a problem hiding this comment.
CDI has been there since we introduced Quarkus to Polaris without indexing jars. Why do we need them now?
There was a problem hiding this comment.
If I'm understanding correctly, indexing isn't the new thing, but agrona is the new library we need to index now in addition to e.g. avro, and guava, is that right?
I do agree a bit that agrona is less ubiquitous the others here, so could be worth a quick comment in here saying what agrona does (looks like it's a fancy utils library? https://github.com/aeron-io/agrona)
In my experience avro, guava, and protobuf are ubiquitous enough that no one thinks twice seeing it in deps, but agrona might be new to some folks (like me, or maybe I'm just old-fashioned :) )
I like to lean towards over-communicating rather than risk under-communicating in comments in these cases, since things like compliance audits, etc., will have folks looking through these types of configs and anything that can avoid unnecessary questions will help save time for everyone
There was a problem hiding this comment.
Actually, in my local env. I experimented with commenting out all of these index properties and I do not see any effect :)
I propose to open a GH issue for a more in-depth investigation and removal of them later, on main. I can take that on my TODO list. WDYT?
There was a problem hiding this comment.
thanks @dimas-b , that works for me. We can always add them back if it turns out useful.
There was a problem hiding this comment.
I did more local debugging and this indexing property does appear to be advised by Quarkus for the runtime/server build.
Here's the warning message from Quarkus when running without this property:
Unable to properly register the hierarchy of the following classes for reflection as they are not in the Jandex index:
- org.agrona.collections.Int2ObjectHashMap (source: JacksonProcessor annotated with @com.fasterxml.jackson.databind.annotation.JsonDeserialize > org.apache.polaris.persistence.nosql.authz.api.ImmutableAclEntry$Json > org.apache.polaris.persistence.nosql.authz.api.PrivilegeSet > org.apache.polaris.persistence.nosql.authz.api.Privileges > org.agrona.collections.Int2ObjectHashMap)
- org.agrona.collections.Object2ObjectHashMap (sources: JacksonProcessor annotated with @com.fasterxml.jackson.databind.annotation.JsonDeserialize > org.apache.polaris.persistence.nosql.authz.api.ImmutableAclEntry$Json > org.apache.polaris.persistence.nosql.authz.api.PrivilegeSet > org.apache.polaris.persistence.nosql.authz.api.Privileges > org.agrona.collections.Object2ObjectHashMap, JacksonProcessor annotated with @com.fasterxml.jackson.databind.annotation.JsonDeserialize > org.apache.polaris.persistence.nosql.authz.api.ImmutableAclEntry$Json > org.apache.polaris.persistence.nosql.authz.api.PrivilegeSet > org.apache.polaris.persistence.nosql.authz.api.Privileges > org.apache.polaris.persistence.nosql.authz.api.Acl$AclBuilder > org.agrona.collections.Object2ObjectHashMap, JacksonProcessor annotated with @com.fasterxml.jackson.databind.annotation.JsonDeserialize > org.apache.polaris.persistence.nosql.coretypes.acl.AclObj > org.apache.polaris.persistence.nosql.authz.api.Acl > org.agrona.collections.Object2ObjectHashMap)
Consider adding them to the index either by creating a Jandex index for your dependency via the Maven plugin, an empty META-INF/beans.xml or quarkus.index-dependency properties.
That said, I'm investigating the reason why this message does not show up during normal build (no debugger) under #3467.
I believe this resolves this comment thread. @flyrain : WDYT?
My concern is around #3396 (comment) seems to be resolved already), and NoSql specific configs in |
RussellSpitzer
left a comment
There was a problem hiding this comment.
Approved, I would recommend we comment out or group the parameters prior to merging though so it isn't confusing to users who might think that those parameters are always necessary or relevant.
|
@flyrain :
If your concern is resolved, would you mind lifting the "request changes" flag? If you think that something in current |
Hi @dimas-b, the Polaris "Code Contribution Guidelines" clarifies what's a blocker comment or not, https://polaris.apache.org/community/contributing-guidelines/. PR authors should honor this, do they? |
dennishuo
left a comment
There was a problem hiding this comment.
Generally I agree with trying to leave as much non-default-specific config commented out in the appliation.properties file as possible to manage cognitive load on service runners, especially as we scale.
I could be convinced otherwise for individual config settings, but for anything that is active but isn't supposed to be applicable without another opt-in, I'd prefer at least a comment in the file explaining the config sufficiently to put the reader at ease.
Some things seem to follow this convention reasonably well, like all the oidc stuff. Others like quarkus.rds we might want to consider also fixing forward along this convention if others agree with my take.
For example, as a developer wanting to test Polaris in various environments, seeing the rds settings not commented out in the properties file makes me worry whether Polaris will try to be too clever and discover RDS whenever I happen to run in an EC2 VM that has some environment variables enabled, for example, to talk to RDS via an instance role.
| quarkus.micrometer.export.prometheus.enabled=true | ||
| quarkus.oidc.enabled=true | ||
| quarkus.otel.enabled=true | ||
| quarkus.mongodb.metrics.enabled=true |
There was a problem hiding this comment.
+1 to commenting out everything that's specific to enabling the particular persistence type. Someone reading through the config file trying to figure out could get confused whether a mongodb config setting is active or impacts runtime behaviors despite not turning on mongodb for anything.
Is there a way for the code-level default to already have these things enabled, and have a commented-out config basically reflect the code default (for these and the other similar mongodb configs discussed elsewhere).
I'm thinking of how things like hadoop-env.sh or spark-env.sh or spark-defaults.conf are basically just a big file of commented-out settings that mostly just reflect code-level defaults, but make it easy for a reader to mix in what they want to customize without getting overwhelmed about bare-minimum settings.
Are Quarkus application.properties "composeable" to easily define e.g. a mongodb-mixins.properties or spanner-mixins.properties to still make it easy to choose a persistence profile without ending up with a monolithic base config file?
If not, big commented-out sections still seems a reasonable way to scale this.
| quarkus.rds.sync-client.type=apache | ||
|
|
||
| ## MongoDB specific configuration | ||
| quarkus.mongodb.database=polaris |
There was a problem hiding this comment.
This config setting could definitely use a comment explaining what it is and what the options are. At first glance it's not clear what the config does. Are the options, for example, "polaris" vs "glue" or "unity" ;)
Is it just the database name in mongodb to use when mongodb is used as the persistence layer?
I'd also prefer to comment out the configs like this one that aren't "active" by default (actually, I would've preferred to comment out the quarkus.rds ones as well and have a nice config section for all the quarkus.rds.* settings that makes it clear which config setting enables the block of rds configs too).
In my experience, it's nice to pare down the actual active config strings in the application.properties file to the true bare minimum for correct default out-of-the-box. Even the configs that match code-level defaults would be present but commented out as documentation so it's easy to see what can be adjusted, but without having to worry about whether it's a crucial input when the deployment mode isn't applicable for the config.
| # Available backends: | ||
| # - InMemory - for testing purposes | ||
| # - MongoDb - configure the via the Quarkus extension | ||
| polaris.persistence.nosql.backend=InMemory |
| quarkus.index-dependency.agrona.group-id=org.agrona | ||
| quarkus.index-dependency.agrona.artifact-id=agrona |
There was a problem hiding this comment.
If I'm understanding correctly, indexing isn't the new thing, but agrona is the new library we need to index now in addition to e.g. avro, and guava, is that right?
I do agree a bit that agrona is less ubiquitous the others here, so could be worth a quick comment in here saying what agrona does (looks like it's a fancy utils library? https://github.com/aeron-io/agrona)
In my experience avro, guava, and protobuf are ubiquitous enough that no one thinks twice seeing it in deps, but agrona might be new to some folks (like me, or maybe I'm just old-fashioned :) )
I like to lean towards over-communicating rather than risk under-communicating in comments in these cases, since things like compliance audits, etc., will have folks looking through these types of configs and anything that can avoid unnecessary questions will help save time for everyone
@dimas-b were you just asking for clarification on the preferred fix to the file as opposed to the nature of comments being blockers? Looks like the remaining open comment was about @flyrain I also commented on that config key as something that raised eyebrows for me - would my suggestion of adding an explanatory comment and possibly also commenting it out by default also address your concern? |
My question is to specifically flag what people consider as blockers on this PR. Some approvals exist. Comment threads are long... so it's not obvious for me what remains to be addressed. Well, the concerns about |
Address reviewer feedback from PR apache#3385 (comment r2700474938): - Introduce MetricsPersistence interface in polaris-core for backend-agnostic metrics persistence - Add MetricsContext immutable class to encapsulate metrics context information - Add NoOpMetricsPersistence for backends that don't support metrics persistence - Add JdbcMetricsPersistence implementation that wraps JdbcBasePersistenceImpl - Add MetricsPersistenceProducer CDI producer to select implementation at runtime - Refactor PersistenceMetricsProcessor to use MetricsPersistence instead of instanceof checks This design supports the upcoming NoSQL persistence backend (apache#3396) by allowing each backend to provide its own MetricsPersistence implementation or use the no-op implementation that silently discards metrics.
There was a problem hiding this comment.
Now that the other properties are commented out, I'm a +1
Although I notice
quarkus.mongodb.metrics.enabled=true
Was missed
Nit: The formatting is off a bit
#property
Instead of
# property
Also we should just add in the indexing dependencies in another PR unless there is a relationship to this PR that I don't see.
|
|
||
| quarkus.datasource.devservices.enabled=false | ||
| quarkus.keycloak.devservices.enabled=false | ||
| quarkus.mongodb.devservices.enabled=false |
There was a problem hiding this comment.
Do we need this config?
| runtimeOnly(project(":polaris-persistence-nosql-cdi-quarkus")) | ||
| runtimeOnly(project(":polaris-persistence-nosql-cdi-quarkus-distcache")) | ||
| runtimeOnly(project(":polaris-persistence-nosql-maintenance-impl")) | ||
| runtimeOnly(project(":polaris-persistence-nosql-metastore-maintenance")) |
There was a problem hiding this comment.
Since we already did this for JDBC, adding these makes sense.
That said, looking at the build.gradle.kts, we now have test dependencies for multiple persistence implementations in the service module. The main implementation already separates persistence backends from runtime/service, reflecting that the service layer logic is designed to be abstract and agnostic to specific persistence implementations. In the long term, it might be worth migrating the persistence-backend integration tests (both JDBC and NoSQL) to separate test modules to stay consistent with this separation.
Not a blocker
There was a problem hiding this comment.
I support the idea of doing integration tests separately in each major plugin module (JDBC, NoSQL, etc.). OPA does that already.
It will probably add CI time due to extra Quarkus build activity, but improves isolation.
This is certainly for future discussion / follow-up PRs.
There was a problem hiding this comment.
This is imho, a great use of a followup issue. The follow up would make things arguably more clean but requires a larger refactoring than the scope of this pr which currently just follows the existing model from JDBC.
| # - MongoDb - configure the via the Quarkus extension | ||
| # Configure the necessary MongoDB properties starting with 'quarkus.mongodb.' in this file. | ||
| # See https://quarkus.io/guides/mongodb#configuration-reference for details about these configurations. | ||
| #polaris.persistence.nosql.backend=InMemory |
| quarkus.rds.sync-client.type=apache | ||
|
|
||
| ## MongoDB specific configuration | ||
| #quarkus.mongodb.database=polaris |
RussellSpitzer
left a comment
There was a problem hiding this comment.
Formatting on the commented out properties still has the wrong spacing, this should just take a few seconds to fix so, I would just do that before merging.
I also don't see why we add the indexing things in this PR still, I'd much rather we just add them in a separate commit since they don't seem to be related to this change.
|
Re: "indexing things", I'm trying to do a holistic investigation for that in #3467 , but it turned out to be a tricky issue due to oddities with build-time logging. I propose to keep existing indexing properties in this PR and follow up with corrections / minimally required settings when the matter is clear. Indexing things that are not required will not harm runtime in any way, even though it might slow CI down. |
|
@dimas-b But how is it related to this pr? I don't like having a commit where we look at back and we can't figure out why a change was made. It's pretty simple to just make a quick 2 line PR that adds them in, and it would take less time than our discussion about those 2 lines |
I cannot say for sure, but if tests pass under this PR without the new indexing config, I suppose we can remove it from this PR and re-add later as an optimization. I did not personally test it though. @snazy WDYT? |
|
Another update about indexing properties: I debugged the build and I can confirm that the property is advisable to have. Please see my comment above for details. |
Is it necessitated by this PR? That's my only big question here, It's not obvious to me why the other changes in this PR caused this issue |
|
Re: indexing properties:
This PR integrates NoSQL code into the I'd say we ought to fix the warning from Quarkus in order to be sure that its introspection yields a complete picture for CDI to function properly. As I noted above, not having this property does not lead to any visible breakage, however, I think it is much safer to set it in order to benefit from Quarkus build-time checks to the full extent. |
We should definitely include it then. I just wasn't clear on whether this had been present before this PR or not, if the other changes pulls in this as a build warning then we should keep it in this PR. |
|
Removed my "request changes" as my comments are resolved. Feel free to move forward with this PR. Thanks a lot for working on it. |
|
Thanks everyone for contributions (core and reviews)! Merging. As usual, follow-up improvements are possible on |
* Replace custom token-bucket implementation with Guava's `RateLimiter` (apache#3507) Addresses the issues discussed on the dev mailing-list discussion https://lists.apache.org/thread/gkyw7m4fcbjbzhcrlrp4kcq5lr05r0m4, opting to use Guava as the easiest replacement here. * Move idempotency_records schema to v4 and add H2 support (apache#3386) * Move idempotency_records schema to v4 and add H2 support * address comments and fix test failures * fix format * add comment to resource_id * (nit): Getting started examples with mc/s5cmd to aws cli (apache#3526) * Switch mc/s3cmd to aws cli * Switch mc/s3cmd to aws cli * Add support for no KMS with s3-compatible backend (apache#3501) * chore(deps): update amazon/aws-cli docker tag to v2.33.7 (apache#3558) * Update doc for helm around rateLimiter (apache#3562) * Disable renoavte update for python version (apache#3560) * Fix the Keycloak getting-started example for 26.5+ (apache#3568) The example was failing because Keycloak 26.5 introduced stricter validation rules for session lifespan and timeout. * NoSQL: Add to runtime-service (apache#3396) * NoSQL: Add to runtime-service This change adds the NoSQL persistence to polaris-runtime-service. * chore(deps): update amazon/aws-cli docker tag to v2.33.8 (apache#3575) * Add spark sql integration test for Hudi (apache#3194) * Fix ozone getting started example (apache#3574) * Fix Ozone getting started example * Fix Ozone getting started example * Change AWS CLI image to weekly (apache#3578) * fix(deps): update dependency com.diffplug.spotless:spotless-plugin-gradle to v8.2.1 (apache#3576) * chore(deps): update registry.access.redhat.com/ubi9/openjdk-21-runtime docker tag to v1.24-2.1769108682 (apache#3588) * removed references of BEFORE/AFTER_COMMIT_VIEW (apache#3554) * nits - post-merge fixes * Last merged commit 2b0ca21 --------- Co-authored-by: Huaxin Gao <huaxin.gao11@gmail.com> Co-authored-by: Yong Zheng <yongzheng0809@gmail.com> Co-authored-by: Mend Renovate <bot@renovateapp.com> Co-authored-by: Alexandre Dutra <adutra@apache.org> Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com> Co-authored-by: Innocent Djiofack <djiofack007@gmail.com>
This change adds the NoSQL persistence to polaris-runtime-service.