Skip to content

Conversation

@Zouxxyy
Copy link
Contributor

@Zouxxyy Zouxxyy commented Nov 1, 2025

Purpose

  • Add spark push down transform predicate support (spark version >= 3.3)
  • Enable concat pushdown (spark version >= 3.4)

Tests

API and Format

Documentation

@Zouxxyy Zouxxyy force-pushed the dev/spark-transform branch 4 times, most recently from 8b37699 to fe5b4e9 Compare November 1, 2025 10:05
@Zouxxyy Zouxxyy force-pushed the dev/spark-transform branch from fe5b4e9 to 7b13471 Compare November 1, 2025 11:45
/** Represents a transform function. */
public interface Transform extends Serializable {

List<Object> inputs();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can introduce a Literal object in the future, so that our inputs will only be of either Literal or FieldRef type now.


/** Leaf node of a {@link Predicate} tree. Compares a field in the row with literals. */
public class LeafPredicate implements Predicate {
public class LeafPredicate extends TransformPredicate {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since LeafPredicate extends TransformPredicate, and we have

T visit(LeafPredicate predicate);
T visit(TransformPredicate predicate);

Maybe we can combine them in the future

@JingsongLi JingsongLi requested a review from Copilot November 2, 2025 00:42
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for pushing down CONCAT expressions to Paimon for Spark 3.4+. The implementation introduces a transform-based predicate system that can handle not only simple field references but also general expressions like CONCAT.

Key Changes:

  • Introduced a new Transform interface hierarchy to represent field transformations and expressions
  • Refactored predicate handling to work with Transform objects instead of just field indices
  • Added support for CONCAT expression pushdown in Spark filters

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated no comments.

Show a summary per file
File Description
SparkExpressionConverter.scala New utility to convert Spark expressions to Paimon transforms and literals
SparkV2FilterConverter.scala Refactored to use Transform-based predicates instead of field indices
PredicateBuilder.java Added Transform-based predicate builder methods alongside index-based ones
TransformPredicate.java Enhanced with factory method and field name extraction
Transform.java Added withNewInputs method for transform reconstruction
FieldTransform.java New class representing field extraction as a Transform
ConcatTransform.java Added withNewInputs implementation
ConcatWsTransform.java Added withNewInputs implementation
LeafPredicate.java Refactored to extend TransformPredicate
PaimonPushDownTestBase.scala Added test for CONCAT pushdown
PaimonCatalogUtils.scala Removed unused imports
ExpressionHelper.scala Removed unused import
PaimonBasePushDown.scala Removed unused import and improved code style
PaimonScan.scala Updated isSupportedRuntimeFilter call to use instance method
PaimonBaseScan.scala Removed unused imports
TransformPredicateTest.java Updated to use TransformPredicate.of factory method

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left minor comments.

}

@Override
public Transform withNewInputs(List<Object> inputs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy


/** Transform that extracts a field from a row. */
public class FieldTransform implements Transform {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add ser id.

public class LeafPredicate implements Predicate {
public class LeafPredicate extends TransformPredicate {

private static final long serialVersionUID = 1L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change ser id.

@JingsongLi
Copy link
Contributor

+1

@JingsongLi JingsongLi merged commit d442b88 into apache:master Nov 3, 2025
27 of 28 checks passed
gmdfalk added a commit to gmdfalk/paimon that referenced this pull request Nov 5, 2025
* master: (162 commits)
  [Python] Rename to BATCH_COMMIT_IDENTIFIER in snapshot.py
  [Python] Suppport multi prepare commit in the same TableWrite  (apache#6526)
  [spark] Fix drop temporary view (apache#6529)
  [core] skip validate main branch before orphan files cleaning (apache#6524)
  [core][spark] Introduce upper transform (apache#6521)
  [Python] Keep the variable names of Identifier consistent with Java (apache#6520)
  [core] Remove hash lookup to simplify interface (apache#6519)
  [core][format] Format Table plan partitions should ignore hidden & illegal dirs (apache#6522)
  [hotfix] Print partition spec and type when error in InternalRowPartitionComputer
  [hotfix] Add more informat to check partition spec in InternalRowPartitionComputer
  [hotfix] Use deleteDirectoryQuietly in TempFileCommitter.clean
  [core] format table: support write file in _temporary at first (apache#6510)
  [core] Support non null column with write type (apache#6513)
  [core][fix] Blob with rolling file failed (apache#6518)
  [core][rest] Support schema validation and infer for external paimon table (apache#6501)
  [hotfix] Correct visitors for TransformPredicate
  [hotfix] Rename to copy from withNewInputs in TransformPredicate
  [core][spark] Support push down transform predicate (apache#6506)
  [spark] Implement SupportsReportStatistics for PaimonFormatTableBaseScan (apache#6515)
  [docs] add docs for auto-clustering of historical partitions (apache#6516)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants