Skip to content

Conversation

@adutra
Copy link
Contributor

@adutra adutra commented Jun 13, 2025

This is a simple attempt to reduce the CI execution time for pull requests. Instead of executing all tests in the same job, it parallelizes the tests in 4 jobs:

  • Code style & publishing checks
  • Unit tests
  • Quarkus tests
  • Integration tests

This could certainly be improved further, but I observed a total execution time around 10 minutes instead of 20 minutes before (at the cost of more compute power, of course).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the ci before runs whole check, build, and tests on JDK 21 because Polaris is currently guarantee the compatibility with java 21, and then we are only doing selected check with java 23.
If we just update the JDK version to 23, that means we will loose the test coverage for 21, right? if the intension of this PR is to just split the ci into multiple jobs, can we keep the java 21 setup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry that wasn't my intention. I can set the "Style check" job to 21, would that be enough? If not, I'm afraid we'd need a matrix job.

Also, tbh we should upgrade 23 to 24 but I don't want to introduce any unwanted changes in this PR.

Copy link
Contributor

@gh-yzou gh-yzou Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original CI have the following coverage for 21

run: ./gradlew --continue check

which I believe runs all tests, and on 23 it only does compilation and integration tests. However, with the current change, it seems we are doing full test coverage with java 23, but only style check with 21. Can we still retain the full test coverage with 21, limited test with 23 like before ? so that this is a pure refactoring task

If we also want to introduce full coverage with other jdk version like 23, we can do in follow up PR to introduce a test matrix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that will limit the amount of parallelism to just 2 jobs, but if that's what you want, OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back to 3 jobs, 2 with 21 and 1 with 23, because 2 jobs only was too slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All jobs on 21 now, as per @snazy suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, for the confusion. i don't really mean still run ./gradlew --continue check. what i mean is when we are breaking it into jobs, make sure the tasks runs on 21 still provides the fully test coverage, include unit test, initTest, style check, publishToMaven etc. For 23 it can remain just have one job with coverage for integration test.

However, it seems we are now removing the coverage on 23 now because 23 is going to be end of like, which i think should be fine.

If we want to further breakdown the job runs with 21 to more jobs, i am also fine with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

23 is EOL - and 24 won't work. Probably better to go back to 21?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update .asf.yaml as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're going to have a problem to merge this PR because the build job is still marked as required, but it doesn't exist anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the old build job around to make it possible to merge this PR. We'd need to remove that job in a separate PR.

gh-yzou
gh-yzou previously approved these changes Jun 17, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jun 17, 2025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we isolate quarkus test, but not intTest?

Copy link
Contributor

@dimas-b dimas-b Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd assume intTest takes the longest time in general, so it might be preferable to break it out if the goal is to reduce CI time 🤔

Copy link
Contributor Author

@adutra adutra Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we isolate quarkus test, but not intTest?

Because these two jobs are roughly taking 8 minutes each now, so that sounded like a balanced choice. In fact, the intTest job is not necessarily the longest the job currently 🤷‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: the job is named "Quarkus tests" for simplicitly but it actually runs all the tests in polaris-runtime-service and polaris-admin.

We could isolate only the tests annotated with @QuarkusTest or QuarkusMainTest but that's not a trivial thing to do with Gradle, so I went for simplicity. We can revisit that later if necessary.

dimas-b
dimas-b previously approved these changes Jun 17, 2025
Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general. Please consider my comments optional.

@snazy
Copy link
Member

snazy commented Jun 17, 2025

Just figured out that we should have another change to either this PR or a followup around Gradle caches.
Splitting up the Java CI into multiple parts "borks" the cached Gradle artifacts (cache stored from the last run on "main" wins). This results in unnecessary build effort and test runs. I'd be okay to live with that for a bit, but we should solve it.
NB: we have a mechanism for that in Nessie CI.

@adutra
Copy link
Contributor Author

adutra commented Jun 18, 2025

Splitting up the Java CI into multiple parts "borks" the cached Gradle artifacts (cache stored from the last run on "main" wins). This results in unnecessary build effort and test runs.

@snazy the caches seem to be working afaict; what makes you say they are "borked"?

@adutra adutra dismissed stale reviews from dimas-b and gh-yzou via feb5601 June 18, 2025 08:08
@snazy
Copy link
Member

snazy commented Jun 18, 2025

@snazy the caches seem to be working afaict; what makes you say they are "borked"?

"Last Gradle cache store" wins. The next job will read the state of that "last one". So you either have test, intTest or ... - but not a merged state of all. This is what Nessie CI does - merging the cache state of all jobs.

@adutra
Copy link
Contributor Author

adutra commented Jun 18, 2025

@snazy the caches seem to be working afaict; what makes you say they are "borked"?

"Last Gradle cache store" wins. The next job will read the state of that "last one". So you either have test, intTest or ... - but not a merged state of all. This is what Nessie CI does - merging the cache state of all jobs.

OK, I will tackle that in a follow-up since this one was approved already.

@adutra
Copy link
Contributor Author

adutra commented Jun 18, 2025

@snazy or @dimas-b do you mind re-approving please? I will look into cache improvements next.

@adutra adutra merged commit 5441bb6 into apache:main Jun 18, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jun 18, 2025
@eric-maynard
Copy link
Contributor

Hey @adutra, did this possibly break the tests? I notice that the new tests are stuck in a pending state and PRs can't be merged.

Screenshot 2025-06-18 at 11 29 51 AM

@adutra
Copy link
Contributor Author

adutra commented Jun 18, 2025

Hey @adutra, did this possibly break the tests? I notice that the new tests are stuck in a pending state and PRs can't be merged.

Screenshot 2025-06-18 at 11 29 51 AM

Yes, this was expected unfortunately. If you rebase the PR it should get unstuck.

adutra added a commit to adutra/polaris that referenced this pull request Jun 19, 2025
Since apache#1897, the jobs in gradle.yaml changed and the "build" job was split into many smaller jobs. But since it was a required job, it couldn't be removed immediately.
adutra added a commit that referenced this pull request Jun 19, 2025
Since #1897, the jobs in gradle.yaml changed and the "build" job was split into many smaller jobs. But since it was a required job, it couldn't be removed immediately.
@adutra adutra deleted the split-ci branch July 19, 2025 14:43
snazy added a commit to snazy/polaris that referenced this pull request Nov 20, 2025
* Cleanup unnecessary files in client/python (apache#1878)

Cleanup unnecessary files in `client/python`

* Bump version in version.txt

With the release/1.0.0 branch being cut, we should bump this to reflect the current state of main

* JDBC: Refactor DatabaseOps (apache#1843)

* removes the databaseType computation from JDBCMetastoreManagerFactory to DbOperations
* wraps the bootstrap in a transaction !
* refactor Production Readiness checks for Postgres

* Fix two wrong links in README.md (apache#1879)

* Avoid using org.testcontainers.shaded.** (apache#1876)

* main: Update dependency io.smallrye.config:smallrye-config-core to v3.13.2 (apache#1888)

* main: Update registry.access.redhat.com/ubi9/openjdk-21-runtime Docker tag to v1.22-1.1749462970 (apache#1887)

* main: Update dependency boto3 to v1.38.36 (apache#1886)

* fix(build): Fix deprecation warnings in PolarisIntegrationTestExtension (apache#1895)

* Enable patch version updates for maintained Polaris version (apache#1891)

Polaris 1.x will be a supported/maintained release. It is crucial to apply bug and security fixes to such release branches.

Therefore, this change enables patch-version updates for Polaris 1.*

* Add Polaris community meeting record for 2025-06-12 (apache#1892)

* Do not use relative path inside CLI script

Issue apache#1868 reported that the Polaris script can fail when it's run from an unexpected path. The recent addition of a reference to `./gradlew` looks incorrect here, and should be changed to use an absolute path.

Fixes apache#1868

* feat(build): Add Checkstyle plugin and an IllegalImport rule (apache#1880)

* Python CI: pin mypy version to avoid CI failure due to new release (apache#1903)

Mypy did a new release 1.16.1 and it cause our CI to fail for about 20 minutes due to missing wheel (upload not completed)
```
 | Unable to find installation candidates for mypy (1.16.1)
    | 
    | This is likely not a Poetry issue.
    | 
    |   - 14 candidate(s) were identified for the package
    |   - 14 wheel(s) were skipped as your project's environment does not support the identified abi tags
    | 
    | Solutions:
    | Make sure the lockfile is up-to-date. You can try one of the following;
    | 
    |     1. Regenerate lockfile: poetry lock --no-cache --regenerate
    |     2. Update package     : poetry update --no-cache mypy
    | 
    | If neither works, please first check to verify that the mypy has published wheels available from your configured source that are compatible with your environment- ie. operating system, architecture (x86_64, arm64 etc.), python interpreter.
    | 

```
This PR temporarily restrict the mypy version to avoid the similar issue.

We may consider bring poetry.lock back to git tracking so we won't automatically update test dependencies all the time

* Remove `.github/CODEOWNERS` (apache#1902)

As per this [dev-ML discussion](https://lists.apache.org/thread/jjr5w3hslk755yvxy8b3z45c7094cxdn)

* Rename quarkus as runtime (apache#1695)

* Rename runtime/test-commons to runtime/test-common (for consistency with module name) (apache#1906)

* docs: Add `Polaris Evolution` page (apache#1890)

---------

Co-authored-by: Eric Maynard <emaynard@apache.org>

* feat(ci): Split Java Gradle CI in many jobs to reduce execution time (apache#1897)

* Add webpage for Generic Table support (apache#1889)

* add change

* add comment

* address feedback

* update limitations

* update docs

* update doc

* address feedback

* Improve the parsing and validation of UserSecretReferenceUrns (apache#1840)

This change addresses all the TODOs found the org.polaris.core.secrets package. 
Main changes:

- Create a helper to parse, validate and build the URN strings. 
- Use Regex instead of `String.split()`.
- Add Precondition checks to ensure that the URN is valid and the UserSecretManager matches the expected type. 
- Remove the now unused `GLOBAL_INSTANCE` of the UnsafeInMemorySecretsManager.

Testing 
- Existing `UnsafeInMemorySecretsManagerTest` captures most of the functional changes. 
- Added `UserSecretReferenceUrnHelperTest` to capture the utilities exposed.

* Reuse shadowJar for spark client bundle jar maven publish (apache#1857)

* fix spark client

* fix test failure and address feedback

* fix error

* update regression test

* update classifier name

* address comment

* add change

* update doc

* update build and readme

* add back jr

* udpate dependency

* add change

* update

* update tests

* remove merge service file

* update readme

* update readme

* fix(ci): Remove dummy "build" job from Gradle CI (apache#1911)

Since apache#1897, the jobs in gradle.yaml changed and the "build" job was split into many smaller jobs. But since it was a required job, it couldn't be removed immediately.

* main: Update Quarkus Platform and Group to v3.23.3 (apache#1797)

* main: Update Quarkus Platform and Group to v3.23.3

* Adopt polaris-admin test invocation

---------

Co-authored-by: Robert Stupp <snazy@snazy.de>

* Feature: Rollback compaction on conflict (apache#1285)

Intention is make the catalog smarter, to revert the compaction commits in case of crunch to let the writers who are actually adding or removing the data to the table succeed. In a sense treating compaction as always a lower priority process.

Presently the rest catalog client creates the snapshot and asks the Rest Server to apply the snapshot and gives this in a combination of requirement and update.

Polaris could apply some basic inference and generate some updates to metadata given a property is enabled at a table level, by saying that It will revert back the commit which was created by compaction and let the write succeed.
I had this PR in OSS, which was essentially doing this at the client end, but we think its best if we do this as server end. to support more such clients.

How to use this
Enable a catalog level configuration : polaris.config.rollback.compaction.on-conflicts.enabled when this is enabled polaris will apply the intelligence of rollbacking those REPLACE ops snapshot which have the property of polaris.internal.rollback.compaction.on-conflict in their snapshot summary to resolve conflicts at the server end !
a sample use case is there is a deployment of a Polaris where this config is enabled and there is auto compaction (maintenance job) which is updating the table state, it adds the snapshot summary that polaris.internal.rollback.compaction.on-conflict is true now when a backfill process running for 8 hours want to commit but can't because the compaction job committed before so in this case it will reach out to Polaris and Polaris will see if the snapshot of compation aka replace snapshot has this property if yes roll it back and let the writer succeed !

Devlist: https://lists.apache.org/thread/8k8t77dgk1vc124fnb61932bdp9kf1lc

* NoSQL: nits

* `AutoCloseable` for `PersistenceTestExtension`
* checkstyle adoptions

* fix: unify bootstrap credentials and standardize POLARIS setup (apache#1905)

- unified formatting across docker, gradle
- reverted secret to s3cr3t
- updated docker-compose, README, conftest.py

use POLARIS for consistency across docker, gradle and others.

* Add doc for rollback config (apache#1919)

* Revert "Reuse shadowJar for spark client bundle jar maven publish (apache#1857)" (apache#1921)

…857)"

This reverts commit 1f7f127.

The shadowJar plugin actually stops publish the original jar, which is not what spark client intend to publish for the --package usage. 

Revert it for now, will follow up with a better way to reuse the shadow jar plugin, likely with a separate bundle project

* fix(build): Gradle caching effectively not working (apache#1922)

Using a `custom()` spotless formatter check effectively disables caching, see `com.diffplug.gradle.spotless.FormatExtension#custom(java.lang.String, com.diffplug.spotless.FormatterFunc)` using `globalState`, which is a `NeverUpToDateBetweenRuns`. This change refactors this to be cachable.

We also already have a errorprone rule, so we can get rid entirely of the spotless step.

* Update spark client to use the shaded iceberg-core in iceberg-spark-runtime to avoid spark compatibilities issue (apache#1908)

* add change

* add comment

* update change

* add comment

* add change

* add tests

* add comment

* clean up style check

* update build

* Revert "Reuse shadowJar for spark client bundle jar maven publish (apache#1857)"

This reverts commit 1f7f127.

* Reuse shadowJar for spark client bundle jar maven publish (apache#1857)

* fix spark client

* fix test failure and address feedback

* fix error

* update regression test

* update classifier name

* address comment

* add change

* update doc

* update build and readme

* add back jr

* udpate dependency

* add change

* update

* update tests

* remove merge service file

* update readme

* update readme

* update checkstyl

* rebase with main

* Revert "Reuse shadowJar for spark client bundle jar maven publish (apache#1857)"

This reverts commit 40f4d36.

* update checkstyle

* revert change

* address comments

* trigger tests

* Last merged commit 93938fd

---------

Co-authored-by: Honah (Jonas) J. <honahx@apache.org>
Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com>
Co-authored-by: Prashant Singh <35593236+singhpk234@users.noreply.github.com>
Co-authored-by: Yufei Gu <yufei@apache.org>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Alexandre Dutra <adutra@users.noreply.github.com>
Co-authored-by: JB Onofré <jbonofre@apache.org>
Co-authored-by: Eric Maynard <emaynard@apache.org>
Co-authored-by: Yun Zou <yunzou.colostate@gmail.com>
Co-authored-by: Pooja Nilangekar <poojan@umd.edu>
Co-authored-by: Seungchul Lee <scleefe01@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants