Initial integration for hudi tables within Polaris #1862

Merged
flyrain merged 2 commits into apache:main from rahil-c:rahil-c/polaris-hudi
Nov 25, 2025



@rahil-c rahil-c commented Jun 11, 2025

Motivation

Issue: #1896

The Polaris Spark client currently supports Iceberg and Delta tables. This PR aims to add support for Apache Hudi tables as Generic Tables.

Current behavior

Currently, the Polaris Spark client routes Iceberg table requests to Iceberg REST endpoints and Delta table requests to
Generic Table REST endpoints. This PR aims to give Hudi the same kind of support that Delta already has in Polaris.

Desired Behavior

Enable basic Hudi table operations through the Polaris Spark catalog by:

  • Adding Hudi table detection and routing logic

Changes Included

  • Core Implementation: Added HudiHelper utility class and enhanced PolarisCatalogUtils with Hudi-specific table loading
    logic
  • Catalog Integration: Modified SparkCatalog to detect and route Hudi table operations appropriately
  • Testing: Added unit tests for the Hudi integration
  • Documentation: Updated README with Hudi development support
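
The detection and routing described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `TableFormatRouter`, `Endpoint`, and `route` are hypothetical names, and treating a missing provider as Iceberg is an assumption made for the sketch.

```java
import java.util.Locale;

/** Hypothetical sketch of provider-based routing; not the actual SparkCatalog code. */
final class TableFormatRouter {

  /** The two families of REST endpoints mentioned in the PR description. */
  enum Endpoint {
    ICEBERG_REST,
    GENERIC_TABLE
  }

  private TableFormatRouter() {}

  /** Picks the endpoint family for a table based on its declared provider. */
  static Endpoint route(String provider) {
    // Assumption for this sketch: no explicit provider means Iceberg.
    if (provider == null || provider.trim().isEmpty()) {
      return Endpoint.ICEBERG_REST;
    }
    switch (provider.trim().toLowerCase(Locale.ROOT)) {
      case "iceberg":
        return Endpoint.ICEBERG_REST;
      case "delta":
      case "hudi":
        // Hudi follows the same path as Delta: it is stored as a generic table.
        return Endpoint.GENERIC_TABLE;
      default:
        // Other non-Iceberg formats are rejected in this sketch.
        throw new UnsupportedOperationException("Unsupported table provider: " + provider);
    }
  }
}
```

With this shape, `TableFormatRouter.route("hudi")` and `TableFormatRouter.route("delta")` both select the generic-table path, while a provider such as `csv` is rejected.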

Special note

Will follow up in another PR for integration and regression testing, as those tests will need to consume the latest Hudi point release artifact once some changes in Hudi land.


dimas-b commented Jun 11, 2025

Thanks for your contribution, @rahil-c! Would you mind opening a discussion for this feature on dev@polaris.apache.org?

@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from 37af09a to 98908b3 Compare June 13, 2025 15:27

rahil-c commented Jun 13, 2025

Thanks @dimas-b, will do so! I have raised an email on the dev list here: https://lists.apache.org/thread/66d39oqkc412kk262gy80bm723r9xmpm

@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from d0011d5 to 5445c48 Compare June 16, 2025 00:21
@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from 5b136d6 to 2bb83cd Compare July 1, 2025 07:20
@rahil-c rahil-c changed the title [DRAFT] Initial integration for hudi tables within Polaris Initial integration for hudi tables within Polaris Jul 1, 2025
@rahil-c rahil-c marked this pull request as ready for review July 1, 2025 07:20
@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from 2bb83cd to 6185ea6 Compare July 1, 2025 07:25

rahil-c commented Jul 1, 2025

cc @flyrain @gh-yzou @singhpk234

@flyrain flyrain requested a review from Copilot July 1, 2025 15:54

Copilot AI left a comment

Pull Request Overview

This PR introduces initial support for Hudi tables in the Polaris Spark catalog, enabling Hudi create/load operations alongside existing formats.

  • Extended parameterized tests to cover the new “hudi” format.
  • Added HudiHelper and HudiCatalogUtils for Hudi-specific catalog loading and namespace synchronization.
  • Updated SparkCatalog, PolarisCatalogUtils, and build configurations to wire in Hudi dependencies and behavior.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| plugins/spark/v3.5/spark/src/test/java/.../DeserializationTest.java | Updated parameterized tests to accept and assert on format |
| plugins/spark/v3.5/spark/src/test/java/.../SparkCatalogTest.java | Added static mocks and new Hudi namespace/table tests |
| plugins/spark/v3.5/spark/src/test/java/.../NoopHudiCatalog.java | Created a no-op Hudi catalog stub for tests |
| plugins/spark/v3.5/spark/src/main/java/.../PolarisCatalogUtils.java | Introduced useHudi, isHudiExtensionEnabled, Hudi load support, SQL builders |
| plugins/spark/v3.5/spark/src/main/java/.../HudiHelper.java | New helper for instantiating and delegating to Hudi Catalog |
| plugins/spark/v3.5/spark/src/main/java/.../HudiCatalogUtils.java | New utility for syncing namespace operations via SQL |
| plugins/spark/v3.5/spark/src/main/java/.../SparkCatalog.java | Routed create/alter/drop to Hudi catalog when appropriate |
| plugins/spark/v3.5/spark/src/main/java/.../PolarisSparkCatalog.java | Adjusted calls to pass Identifier through Hudi load API |
| plugins/spark/v3.5/spark/build.gradle.kts | Added Hudi dependencies and exclusions |
| plugins/spark/v3.5/integration/.../logback.xml | Enabled Hudi loggers for integration tests |
| plugins/spark/v3.5/integration/.../SparkHudiIT.java | New integration tests for basic and unsupported Hudi ops |
| plugins/spark/v3.5/integration/build.gradle.kts | Added Hive and Hudi bundles to integration dependencies |

@flyrain flyrain requested a review from gh-yzou July 1, 2025 15:57

gh-yzou commented Jul 8, 2025

@rahil-c sorry, I wrote my comments yesterday but forgot to push them. I have pushed them now and added some more; please let me know if you have more questions about this!
As we have discussed, there are two main concerns for this PR:

  1. The Hudi dependency introduced for the Spark client, which is caused by the usage of HoodieInternalV2Table. This can be resolved by loading a V1Table and then letting HudiCatalog's loadTable handle the final table result: https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala#L123
  2. The extra namespace creation for HudiCatalog. The Polaris Spark client reuses the whole Iceberg namespace, so ideally we do not want to maintain extra namespace creation just for a specific table format. The extra namespace creation is needed because HudiCatalog only works with the SparkSession catalog and HiveCatalog today (https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala#L198). However, since Polaris is a REST catalog, this will not work. We want to see if we can work with the Hudi community to improve the catalog implementation for third-party catalog plugins, similar to how Delta added a special case for Unity Catalog here: https://github.com/delta-io/delta/blob/2d89954008b6c53e49744f09435136c5c63b9f2c/spark/src/main/scala/org/apache/spark/sql/delta/catalog/DeltaCatalog.scala#L218
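
The idea behind the first point can be illustrated with a small, self-contained sketch. It uses no Spark or Hudi classes: `GenericTable` is a hypothetical stand-in for Spark's V1Table, and the `formatCatalog` function is a hypothetical stand-in for the format-specific catalog's loadTable step. The point is only that the client builds a format-agnostic table and leaves the final table result to the format's own catalog, so the client needs no compile-time dependency on the format.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of "load a generic table, then let the format catalog finish". */
final class V1TableDelegation {

  /** Format-agnostic table description; a stand-in for Spark's V1Table. */
  static final class GenericTable {
    final String name;
    final String provider;
    final String location;

    GenericTable(String name, String provider, String location) {
      this.name = name;
      this.provider = provider;
      this.location = location;
    }
  }

  private V1TableDelegation() {}

  /**
   * Builds the generic table from stored properties without touching any
   * format-specific class, then hands it to the format catalog, which
   * produces the final table object of its own type T.
   */
  static <T> T loadTable(
      String name, Map<String, String> properties, Function<GenericTable, T> formatCatalog) {
    GenericTable generic =
        new GenericTable(
            name, properties.getOrDefault("provider", "unknown"), properties.get("location"));
    return formatCatalog.apply(generic);
  }
}
```

In this sketch the caller supplies the format-specific step at runtime, for example a lambda that wraps the generic table in whatever type the format's catalog returns.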


rahil-c commented Jul 23, 2025

Thanks @gh-yzou, I have followed the recommendations above and updated the PR. Let me know if the approach looks good to you. If so, I can try to break this down into smaller PRs.

@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from 7777796 to 087b408 Compare July 24, 2025 07:03
@rahil-c rahil-c requested a review from gh-yzou July 24, 2025 07:04
@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from d6f3175 to 2b113c4 Compare July 26, 2025 05:29
gh-yzou previously approved these changes Jul 28, 2025

### Hudi Support
Currently support for Hudi tables within the Polaris catalog is still under development.
The Hudi community has made a change to integrate with Polaris, and is planning on doing a minor release.

-> hudi-spark-xxx is required for Hudi table support to work end to end, and it is still in the process of being released.

will add this line

}

/** Return a Spark V1Table for Hudi tables. */
public static Table loadV1SparkHudiTable(

Actually, this function doesn't seem very specific to Hudi. Maybe we can just call it loadV1SparkTable and mention in the comment that it is currently only used by Hudi.

will do so

@singhpk234 singhpk234 added this to the 1.3.0 milestone Sep 26, 2025
@eric-maynard eric-maynard self-requested a review October 5, 2025 17:24
@github-project-automation github-project-automation bot moved this from PRs In Progress to Done in Basic Kanban Board Oct 6, 2025

rahil-c commented Oct 7, 2025

Can we reopen this PR? I'm not sure why it was closed. @singhpk234 @flyrain @gh-yzou @eric-maynard

@flyrain flyrain reopened this Oct 7, 2025
@github-project-automation github-project-automation bot moved this from Done to PRs In Progress in Basic Kanban Board Oct 7, 2025

flyrain commented Oct 7, 2025

Hi @rahil-c, reopened. Please update as needed.

@rahil-c rahil-c dismissed stale reviews from flyrain and gh-yzou via e5d0651 November 23, 2025 23:28
@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from 001f23b to e5d0651 Compare November 23, 2025 23:28
@rahil-c rahil-c requested review from flyrain and gh-yzou November 23, 2025 23:35

@sfc-gh-ygu sfc-gh-ygu left a comment

+1

3) Rename a Delta table is not supported.
4) ALTER TABLE ... SET LOCATION is not supported for DELTA table.
5) For other non-Iceberg tables like csv, it is not supported today.
5) For other non-Iceberg tables like csv, it is not supported today. No newline at end of file

Not a blocker: This change is not needed, but I think we will need a doc change to claim that Hudi table is supported. I'm OK to add it in a followup PR.

flyrain previously approved these changes Nov 24, 2025

@flyrain flyrain left a comment

+1


rahil-c commented Nov 24, 2025

@flyrain @gh-yzou Now that Hudi 1.1.0 is out (see hudi release docs https://hudi.apache.org/releases/release-1.1.0) I think we can land this PR. Based on the original discussion we wanted to have a follow up PR for the IT test plan, which will follow a similar flow as the SparkDeltaIT: https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/integration/src/intTest/java/org/apache/polaris/spark/quarkus/it/SparkDeltaIT.java

Once we land this PR, I will raise the Hudi IT [PR](https://github.com/rahil-c/polaris/pull/10/files#diff-ec6cb0157edd392da9149d2627b6a928846e236050232678c525f0dc744c6963R37) (adding Hudi 1.1.0 as a test dependency, similar to Delta) and the test, which verifies DDL, DML, and DQL queries with Hudi as a generic table.

@rahil-c rahil-c force-pushed the rahil-c/polaris-hudi branch from ce9aa54 to d6b7e7a Compare November 24, 2025 18:14
@flyrain flyrain dismissed eric-maynard’s stale review November 25, 2025 00:03

The blocker has been resolved per this comment, #1862 (comment).

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Nov 25, 2025
@flyrain flyrain merged commit b479a2f into apache:main Nov 25, 2025
15 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Nov 25, 2025
snazy added a commit to snazy/polaris that referenced this pull request Feb 11, 2026
* Do not fail a release when markdown-link-check check fails as it is flaky (apache#3116)

* Source tarball reproducible (apache#3143)

`git --mtime` MUST use the time zone for reproducible builds.

* Skip release e-mail templates from svn dist copy (apache#3147)

* Make pom.xml always reproducible (apache#3145)

It turned out in practice that there's no guarantee that the `<parent>` element in `pom.xml` files always appears at the same place.

This change ensures that the `<parent>` element always appears at a deterministic location at the top of `pom.xml` files.

* Fix executable POSIX permission in archive files (apache#3146)

The PR apache#2819 accidentally _removed_ the executable POSIX file permission, assuming that not explicitly setting the attributes via `filePermissions` retains the file-system 'x' permission.

This change updates the logic to explicitly check the owner-executable bit and uses `755` or `644` respectively for each individual file in the archive.

* Spark: Initial integration for hudi tables within Polaris  (apache#1862)

* Update actions/setup-python digest to 83679a8 (apache#3157)

* Update actions/stale digest to 5611b9d (apache#3155)

* Fix LICENSE and NOTICE in the distributions and docker images. (apache#3125)

* Remove readEntity() call (apache#3111)

Calling readEntity() is not allowed server-side by some HTTP servers.

* Run CI on release branches (apache#3121)

The release workflows check whether CI passes for the required checks.
This would fail, because CI isn't configured to run on release branches.

This change lets CI run on `release/*` branches.

* adding support to use a kms key for s3 buckets data encryption (AWS only) (apache#2802)

Add catalog-level support for KMS with s3 buckets

* Update plugin jetbrains-changelog to v2.5.0 (apache#3166)

* Update quay.io/keycloak/keycloak Docker tag to v26.4.6 (apache#3163)

* NoSQL: Prepare admin-tool (apache#3134)

No functional changes.

1. Refactor the configuration property to a configuration type.
2. Make `BaseCommand` suitable for non-meta-store-factory use cases.

* Iceberg-Catalog: also set catalog-id for location overlap checks (apache#3136)

* Fix catalog-role creating in `PolarisTestMetaStoreManager` (apache#3122)

`testLookup()` attempts to check for a catalog-role on catalog ID 0, which is an illegal ID for a catalog.

Fix is to move the assertion below the catalog creation.

* Releasy: prepare for Helm 4 (helm package repro) (apache#3088)

Part of apache#3086

* Update Quarkus Platform and Group to v3.30.1 (apache#3168)

* Relax ARN validation logic (apache#3071)

Following up on apache#3005, which allowed a wide range of ARN values in the validation RegEx, remove an additional explicit check for `aws-cn` being present in the ARN as a sub-string.

Update existing unit tests to process `aws-cn` ARNs as common `aws` ARNs.

Note: the old validation code does not look correct because it used to check for `aws-cn` anywhere in the ARN string, not just in its "partition" component.

* docs: Add François as Mentor (apache#3162)

* docs: Add François as Mentor

* update mentor list according to ASF project info

* Event type IDs + event metadata incl. OTel context (apache#2998)

This PR implements the action items from the following discussion threads:

- https://lists.apache.org/thread/yx7pkgczl6k7bt4k4yzqrrq9gn7gqk2p
- https://lists.apache.org/thread/rl5cpcft16sn5n00mfkmx9ldn3gsqtfy
- https://lists.apache.org/thread/5dpyo0nn2jbnjtkgv0rm1dz8mpt132j9

Summary of changes:

- Introduced a `PolarisEventType` enum holding the 150+ event types.
- Introduced a `PolarisEventMetadata` interface as suggested by @adnanhemani, exposing: event ID, timestamp, realm ID, principal, request ID, and OTel context.
- Introduced a `PolarisEventMetadataFactory` to centralize the logic for gathering the various elements of an event metadata.
- Modified `PolarisEvent` to expose 3 new methods:
  - `PolarisEventType type()`
  - `PolarisEventMetadata metadata()`
- Persistence of OTel context is done in `additional_properties` as suggested by @flyrain.
- Added `InMemoryBufferEventListenerIntegrationTest` to verify that all contextual data is properly persisted.

* fix typo in management API yaml (apache#3172)

* Fix homepage Get Started button layout (apache#3169)

Wrap the Get Started button in a div container to prevent it from
becoming inline with text at certain screen widths. Follows Docsy
blocks/cover shortcode pattern.

* fix OPA javadoc referencing `OpaSchemaGenerator` (apache#3153)

`OpaSchemaGenerator` is not on the classpath of `opa/impl/main` so the javadoc tool is not able to resolve a `@link` to it.

Use `@code` instead to avoid build warnings like the following:

* Update dependency com.azure:azure-sdk-bom to v1.3.3 (apache#3179)

* Update dependency com.google.errorprone:error_prone_core to v2.45.0 (apache#3177)

* test: Add Some Spark Client Tests and Update Documentation on Generic Tables (apache#3152)

* Site: Make homepage image full-width (apache#3171)

Add CSS class to allow images to span full viewport width by
canceling out container padding. Apply to homepage hero image
using AsciiDoc role attribute.

* chore(enhancement): gitignore application-local.properties (apache#3175)

* Update registry.access.redhat.com/ubi9/openjdk-21-runtime Docker tag to v1.23-6.1764155306 (apache#3186)

* Update quay.io/keycloak/keycloak Docker tag to v26.4.7 (apache#3185)

* Update dependency software.amazon.awssdk:bom to v2.39.6 (apache#3184)

* Testing: increase visibility + make PCC/PMSM accessible (apache#3137)

* `BasePolarisMetaStoreManagerTest`: make `PolarisCallContext` + `PolarisMetaStoreManager` + `PolarisTestMetaStoreManager` accessible by subclasses
* Make constants of `PolarisRestCatalogMinIOIT` accessible

* Update docker.io/prom/prometheus Docker tag to v3.8.0 (apache#3191)

* Update helm/chart-testing-action action to v2.8.0 (apache#2982)

* chore(enhancement): make custom hidden tasks visible in ./gradlew tasks (apache#3176)

* fix type cast warning in PolarisCatalogUtils (apache#3178)

```
plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/utils/PolarisCatalogUtils.java:131: warning: [unchecked] unchecked cast
            scala.collection.immutable.Map$.MODULE$.apply(
                                                         ^
  required: Map<String,String>
  found:    Map
```

* chore(deps): update actions/stale digest to 9971854 (apache#3197)

* fix(deps): update dependency io.smallrye:jandex to v3.5.3 (apache#3193)

* chore(deps): update actions/checkout digest to 8e8c483 (apache#3192)

* added venv to the gitignore (apache#3199)

* CLI: Add Hive federation option (apache#2798)

* chore(deps): update docker.io/jaegertracing/all-in-one docker tag to v1.76.0 (apache#3201)

* chore(deps): update registry.access.redhat.com/ubi9/openjdk-21-runtime docker tag to v1.23-6.1764562148 (apache#3202)

* fix(deps): update quarkus platform and group to v3.30.2 (apache#3198)

* chore(deps): update dependency boto3 to ~=1.42.2 (apache#3126)

* NoSQL: CDI / Quarkus (apache#3135)

* fix(deps): update dependency com.adobe.testing:s3mock-testcontainers to v4.11.0 (apache#3208)

* Update dependency mypy to >=1.19, <=1.19.0 (apache#3180)

* chore(deps): update actions/setup-java digest to f2beeb2 (apache#3206)

* Fix spelling in comments (apache#3212)

* Make each task attempt run in a dedicated CDI request context (apache#3210)

* Make each task attempt run in a dedicated CDI request context

Currently, tasks inherit the CDI context from the requests that
submitted them, but run asynchronously. Therefore, if the original
request context ends, the task may not be able to use the expired
beans for that context.

This change makes each task run in its own dedicated CDI request
context with `RealmContext` explicitly propagated in `TaskExecutorImpl`.

Test-only error handlers are added to `TaskExecutorImpl` to facilitate
detecting task errors during CI.

Fixes apache#3203

* fix(deps): update dependency com.gradleup.shadow:shadow-gradle-plugin to v9.3.0 (apache#3218)

* Last merged commit be3c88b

---------

Co-authored-by: Pierre Laporte <pierre@pingtimeout.fr>
Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com>
Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: JB Onofré <jbonofre@apache.org>
Co-authored-by: Alexandre Dutra <adutra@apache.org>
Co-authored-by: fabio-rizzo-01 <fabio.rizzocascio@jpmorgan.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: Tamas Mate <50709850+tmater@users.noreply.github.com>
Co-authored-by: Adam Christian <105929021+adam-christian-software@users.noreply.github.com>
Co-authored-by: Artur Rakhmatulin <artur.rakhmatulin@gmail.com>
Co-authored-by: cccs-cat001 <56204545+cccs-cat001@users.noreply.github.com>
Co-authored-by: Yufei Gu <yufei@apache.org>
Co-authored-by: Yong Zheng <yongzheng0809@gmail.com>