Guides: fix setup scripts to yield correct exit code by snazy · Pull Request #3612 · apache/polaris

snazy · 2026-01-29T14:31:18Z

The statements in the shell scripts for the setup services are often chained using `;`. With `;`, the exit code of an earlier command is not propagated, so a setup service is considered successful even though one of its commands failed.

This change updates those scripts to chain the statements with `&&` instead.

"Final" setup services (aka "polaris-setup") now end with `sleep 120`. This is due to the behavior of `docker compose up --detach --wait`, which treats any service without dependents that exits, even with exit code 0, as a failure, so that docker compose command returns an error code. That would break the guides testing code (#3553). The `sleep 120` in the "polaris-setup" services does **not** delay the startup of the compose project; it is purely a workaround for Docker Compose not having a notion of "setup services".

To avoid merge conflicts, this change also:

* updates affected `curl` invocations (as in "Guides: add mandatory curl --fail option" #3610)
* removes the superfluous `restart: "no"`
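The difference between chaining the setup statements with `;` and with `&&` can be reproduced with a minimal POSIX `sh` sketch. `step_ok`, `step_fail` and `report` are hypothetical helpers standing in for real setup commands such as `apk add` or `curl`; the sketch only assumes that a container's exit code is the exit status of the script it runs.

```shell
#!/bin/sh
# Minimal sketch: how ';' vs '&&' changes the exit status of a setup script.
# step_ok, step_fail and report are hypothetical helpers, not Polaris code.

step_ok() {
  echo "step ok"
}

step_fail() {
  echo "step failed" >&2
  return 1
}

# Chained with ';': both steps run, and the overall status is that of the
# LAST step only, so the earlier failure is lost (returns 0).
setup_with_semicolons() {
  step_fail; step_ok
}

# Chained with '&&': the chain stops at the first failing step, and its
# non-zero status becomes the overall status (returns 1).
setup_with_and() {
  step_fail && step_ok
}

# Run a function and print its exit status; using '||' keeps this safe
# even if the calling shell has 'set -e' enabled.
report() {
  status=0
  "$1" >/dev/null 2>&1 || status=$?
  echo "$1 -> exit code $status"
}

report setup_with_semicolons   # prints: setup_with_semicolons -> exit code 0
report setup_with_and          # prints: setup_with_and -> exit code 1
```

With `;`, Docker Compose would see exit code 0 and consider the service successful; with `&&`, it sees the failing step's non-zero code and `step_ok` never runs.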

MonkeyCanCode · 2026-01-29T18:17:22Z

getting-started/keycloak/docker-compose.yml

-        apk add --no-cache jq && 
-        chmod +x /polaris/create-catalog.sh && 
+        apk add --no-cache jq &&
        token=$$(curl http://keycloak:8080/realms/iceberg/protocol/openid-connect/token --user client1:s3cr3t -d 'grant_type=client_credentials' | jq -r .access_token) && 


Should we also use --fail-with-body for curl here?

My assumption is that the token would then just be "invalid".
Also, the pipe "eliminates" curl's exit code anyway, but jq would fail in that case.

Fair enough. I just thought it would be more consistent, but it's not a blocker.
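For reference, a small POSIX `sh` sketch of the pipe point above: the exit status of a pipeline is that of its last command, so in `token=$$(curl ... | jq ...) && ...` only a failing last stage stops the `&&` chain. `fake_curl`, `fake_jq` and `fake_jq_strict` are hypothetical stubs so the sketch runs offline; whether the real `jq -r .access_token` fails on an empty or error response is not modeled here.

```shell
#!/bin/sh
# Minimal sketch: the exit status of a pipeline is that of its LAST command.
# fake_curl, fake_jq and fake_jq_strict are hypothetical stubs standing in
# for 'curl ... | jq -r .access_token', so that the sketch runs offline.

fake_curl() {
  # Simulate a failed request: no output and a non-zero exit code
  # (7 is curl's "failed to connect to host" code).
  return 7
}

fake_jq() {
  # Simulate a last stage that succeeds on empty input and prints nothing.
  cat >/dev/null
}

fake_jq_strict() {
  # Simulate a last stage that fails when it receives no input at all.
  IFS= read -r line || return 1
  printf '%s\n' "$line"
}

# curl's failure is masked: the assignment takes the pipeline's status,
# which is fake_jq's 0, so the '&&' chain continues with an empty token.
get_token() {
  token=$(fake_curl | fake_jq) && echo "continued with token='$token'"
}

# Only a failing last stage stops the '&&' chain.
get_token_strict() {
  token=$(fake_curl | fake_jq_strict) && echo "continued with token='$token'"
}
```

Here `get_token` prints `continued with token=''` despite the simulated curl failure, while `get_token_strict` returns a non-zero status. Adding `--fail` or `--fail-with-body` to curl does not change this on its own, since the pipe still discards curl's exit code.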

MonkeyCanCode · 2026-01-29T18:24:18Z

getting-started/ceph/docker-compose.yml

+        echo Creating Ceph bucket... &&
+        aws s3 mb s3://${S3_POLARIS_BUCKET} &&
+        aws s3 ls &&
+        echo Bucket setup complete.


This will return 0 as well, no?

So we have at most two setup services in these Docker Compose files, and you are adding sleep 120 to make CI pass with docker compose (since an exit with status code 0 is treated as a failure). This service will also exit with status code 0, though.

Ah, I think I know what you're referring to.
Those setup-bucket services behave in an "interesting" way, which is different from how the "final" polaris-setup services behave (in terms of exit behavior).
An "intermediate" setup service like setup-bucket, on which, for example, polaris depends via the service_completed_successfully condition, can exit without "breaking" docker compose up --detach --wait.
A "final" setup service like some polaris-setup, which no other service depends on, makes docker compose up --detach --wait fail when it exits.

The "crux" with those "final" setup services is that the behavior isn't immediately obvious and depends on the timing of the relevant services and on when exactly docker compose up --detach --wait "thinks" the whole compose setup is ready.

But all of the above is based on experiments; good documentation on this behavior is hard to find. However, the behavior of depends_on (... a setup service) with the service_completed_successfully condition makes sense.

MonkeyCanCode · 2026-01-29T18:26:18Z

getting-started/keycloak/docker-compose.yml

        /polaris/create-catalog.sh realm-external $$token && 
-        /polaris/create-catalog.sh realm-mixed $$token
+        /polaris/create-catalog.sh realm-mixed $$token &&
+        sleep 120


So yeah, while I was prototyping locally, I ran into this issue as well, where --exit-code-from was not happy with this. What I ended up doing was using tail -f /dev/null to keep the service up, and then using a health check on the service being ready instead of completed. sleep 120 here will work, but if for whatever reason these scripts take more than 120s (assuming they get much more complex later on), this will bite us.

Man, docker-compose isn't a good friend at all :(

Neither tail -f /dev/null nor sleep 120 is particularly great.
I wanted to stay away from the added complexity of health-checks for the setup-scripts though.
Let me think a bit about this.

It's probably best to let the polaris-setup services "tail forever" and have health checks. Added that.

Yes, that is what I ended up doing in my local prototyping: tail forever, since exiting with status code 0 is not okay for a compose service.
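The "tail forever" plus health-check approach could look roughly like the following POSIX `sh` sketch. `READY_FILE`, `run_setup`, `setup_main` and `setup_is_healthy` are hypothetical names, not taken from the Polaris guides: the service's command would run `setup_main`, and its health check would run the equivalent of `setup_is_healthy`.

```shell
#!/bin/sh
# Hypothetical sketch of a "final" setup service that never exits and
# reports readiness via a health check. READY_FILE, run_setup, setup_main
# and setup_is_healthy are illustrative names, not from the Polaris guides.

READY_FILE="${READY_FILE:-/tmp/setup.ready}"

run_setup() {
  # The real setup steps would go here, chained with '&&' so that the
  # first failure ends the script with a non-zero exit code.
  echo "Creating catalogs..." &&
  echo "Setup complete."
}

# Container command: run the setup, mark it as done, then block forever
# so that the container never exits with code 0.
setup_main() {
  rm -f "$READY_FILE" &&
  run_setup &&
  touch "$READY_FILE" &&
  exec tail -f /dev/null
}

# Health-check command: healthy only once the setup has completed.
setup_is_healthy() {
  test -f "$READY_FILE"
}
```

In `docker-compose.yml`, the service's `healthcheck` `test` would then check for the marker file (e.g. `test -f /tmp/setup.ready`). With a health check defined, `docker compose up --detach --wait` waits for the service to become healthy, and because the container keeps running, no exit is reported as a failure. If the setup fails, `setup_main` returns non-zero before reaching `tail`, so the failure still surfaces.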


snazy · 2026-02-03T10:33:26Z

@MonkeyCanCode mind taking another look?



github-project-automation bot added this to Basic Kanban Board Jan 29, 2026

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board Jan 29, 2026

snazy force-pushed the guides-hc-exit-code branch from 4c23076 to 3645bae on January 29, 2026 14:33

MonkeyCanCode reviewed Jan 29, 2026


snazy mentioned this pull request on Jan 31, 2026 in "Guides: compose dependencies / long-option" #3611 (Merged)

snazy added 3 commits January 31, 2026 11:04

* health-checks (be82d9c)
* update (974046c)

snazy force-pushed the guides-hc-exit-code branch from f2228d6 to 974046c on January 31, 2026 10:09

snazy requested a review from MonkeyCanCode February 2, 2026 12:15

MonkeyCanCode approved these changes Feb 3, 2026


github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Feb 3, 2026

snazy merged commit 890b33a into apache:main Feb 3, 2026
15 checks passed

github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Feb 3, 2026

snazy deleted the guides-hc-exit-code branch February 3, 2026 13:59


2 participants