Closed

Changes from all commits (109 commits)
9a6b24b
[FLUSS-2262][lake] Improved Stability for Iceberg Log Table Compactio…
rionmonster Dec 27, 2025
8159dee
[FLUSS-2262][lake][server] Address lake table tiering stability durin…
rionmonster Jan 10, 2026
d901546
[FLUSS-2262][lake][server] Ensure pending tiering tables are removed …
rionmonster Jan 10, 2026
4447287
[lake/iceberg] Removal and clean-up for previous approaches
rionmonster Jan 28, 2026
b7a93c1
[lake/iceberg] Continued clean-up from previous iterations
rionmonster Jan 28, 2026
d980235
[common] switch to at.yawk.lz4:lz4-java due to CVE-2025-12183 (#2136)
sd4324530 Dec 28, 2025
ac06ca2
[lake] Support ADD COLUMN AT LAST for dataLake enabled tables (#2189)
buvb Dec 28, 2025
36e2a6f
[lake/paimon] Support NestedRow types for tiering paimon (#2260)
XuQianJin-Stars Dec 28, 2025
7cc346c
[docs] Use the "docker compose ps" to show the running containers (#2…
zuston Dec 28, 2025
cc18333
[server] Introduce new rebalance API (#1380)
swuferhong Dec 28, 2025
1d3b907
[Docs] Document the TypedAPI #2230 (#2261)
Prajwal-banakar Dec 28, 2025
aad81f3
[flink] Add RuntimeContext adapter for Flink 2.x compatibility (#2241)
sd4324530 Dec 29, 2025
224d5ff
[server] Support AddServerTag and RemoveServerTag (#1400)
swuferhong Dec 29, 2025
d886234
[client] Change SchemaNotExistException as retriable exception. (#2193)
loserwang1024 Dec 29, 2025
26eb8fb
[hotfix] Use CatalogTableAdapter rather than CatalogTable.newBuilder …
loserwang1024 Dec 29, 2025
c1a9bc6
[flink] Flink Source need to check tableId on recovery in case that t…
loserwang1024 Dec 29, 2025
dc90653
[common] Fix IndexOutOfBoundsException in ArrowArrayWriter when eleme…
Dec 12, 2025
2d0d9b0
[common] Improve ArrowArrayWriter performance by replacing hardcoded h…
platinumhamburg Dec 31, 2025
1ed2c58
[lake] Record a file path storing log offsets in lake snapshot proper…
luoyuxia Dec 31, 2025
716fbb7
[kv] Implement basic Aggregate Merge Engine (#2255)
platinumhamburg Dec 31, 2025
8115d13
[hotfix]: Build expected results with partitions value, not hard code…
SML0127 Jan 4, 2026
daec805
[docs] update datatypes & paimon datatype mapping document (#2283)
Jackeyzhe Jan 4, 2026
09c1cf4
[spark] support spark batch write (#2277)
YannByron Jan 4, 2026
5b864e4
[common] Introduce MAP type for ARROW, COMPACTED and INDEXED formats …
XuQianJin-Stars Jan 5, 2026
e1e22e2
[hotfix] Fix compile problem of InternalRow#getMap() interface in Spa…
wuchong Jan 5, 2026
1782c6d
[hotfix] Fix spotless checkstyle problems
wuchong Jan 5, 2026
44a8c86
[lake/iceberg] Support tier array type for iceberg (#2266)
XuQianJin-Stars Jan 5, 2026
4096672
[hotfix] Fix compilation error: implement missing getMap method in Ic…
XuQianJin-Stars Jan 5, 2026
a099126
[kv] Support to report rocksdb metrics (#2282)
platinumhamburg Jan 5, 2026
5965ce8
[docs] Move 'Merge Engine' to the top-level of 'Table Design' (#2300)
Aditya41150 Jan 5, 2026
a3195d6
[spark] Add unit tests for spark SHOW TABLES command (#2304)
Yohahaha Jan 5, 2026
9d44aa3
[docs] Document COMPACTED table format (#2264)
Priyamanjare54 Jan 5, 2026
1b8f7e4
[server] Not permitted to enable datalake for tables created prior to…
LiebingYu Jan 7, 2026
badde92
[lake/paimon] Support map types for tiering paimon (#2308)
XuQianJin-Stars Jan 7, 2026
894daa9
[server] Fix ConcurrentModificationException when access getTable con…
LiebingYu Jan 7, 2026
cf653c8
[server] Servers with PERMANENT_OFFLINE tag are no longer assigned re…
LiebingYu Jan 7, 2026
293664c
[lake] fix potential dirty commit on re-create table (#2316)
zuston Jan 7, 2026
18f7bd1
[hotfix] use ltz_millis as paimon system column for timestamp dataty…
luoyuxia Jan 8, 2026
5a14061
[hotfix] Fix union read can't restore issue (#2321)
luoyuxia Jan 8, 2026
6472010
[server] Support generate and execute rebalance plan (#1452)
swuferhong Jan 8, 2026
a36654f
[common] Add field_id for Nested Row. (#2322)
loserwang1024 Jan 8, 2026
85f0966
[flink] Refactor CALL procedure "sys.xxx_cluster_config" to support m…
sd4324530 Jan 8, 2026
e62373f
[hotfix] Fix the rebalance status cannot reach complete when the reba…
swuferhong Jan 8, 2026
7b5caf2
[docs] Add MAP type to data types and Paimon integration documentatio…
XuQianJin-Stars Jan 9, 2026
12ff308
[test] Add BucketingFunctionTest, this allows rust client side to als…
leekeiabstraction Jan 9, 2026
4f93ffa
[lake/iceberg] Support nested row types for Iceberg tiering (#2278)
SML0127 Jan 9, 2026
544cb4e
[hotfix] Add authentication for rebalance and controlled shutdown. (#…
LiebingYu Jan 9, 2026
b0a93fb
[docs]: updated data type mapping in iceberg (#2336)
SML0127 Jan 9, 2026
7600e8d
[NOTICE] Update copyright NOTICE year to 2026 (#2343)
wuchong Jan 11, 2026
e88e97e
[spark] Fix spark desc command output with partition info (#2313)
Yohahaha Jan 11, 2026
ef962be
[kv] include hidden potential internal flush time in KV flush latency…
zuston Jan 11, 2026
adbbd3d
[Task] Enhance tests for cluster rebalancing (#2315) (#2337)
Prajwal-banakar Jan 12, 2026
e4cb73c
[docs] Add alter table feature link in getting-started.md (#2348)
swuferhong Jan 12, 2026
1128946
[hotfix] Add authentication to the admin APIs that currently lack it …
swuferhong Jan 12, 2026
f45eabe
[doc] fix typo of 'table.datalake.enabled' in options.md (#2366)
zhaomin1423 Jan 14, 2026
44a1e16
[lake] Re-initial tiering splits for tables if reader failover happen…
beryllw Jan 15, 2026
6c1a155
[paimon] Add IT case for union read with evolved schema for Paimon (#…
ZuebeyirEser Jan 15, 2026
e3b094d
Rebalance procedure docs (#2355)
swuferhong Jan 15, 2026
6e15f53
[Flink] Support Partial Updates to the Flink Sink (#2042)
polyzos Jan 15, 2026
e3facdb
Change the ListRebalanceProcessProcedure result from String[] to row[…
sd4324530 Jan 16, 2026
d0a54b1
[spark] Add Spark show/add/drop partition support (#2314)
Yohahaha Jan 17, 2026
38ca1dd
[lake/lance] Refactor LanceArrowWriter to reuse fluss-common ArrowWri…
XuQianJin-Stars Jan 17, 2026
6e38036
[kv] Add basic implementation of AUTO_INCREMENT column (#2161)
beryllw Jan 6, 2026
2d86eed
[kv] Improve the implementation of AUTO_INCREMENT column
wuchong Jan 17, 2026
576c612
[server] Prevent remote log TTL expiration until tiering completes (#…
LiebingYu Jan 17, 2026
ed6040c
[spark] Add spark streaming write support (#2357)
Yohahaha Jan 18, 2026
513902c
Add RocksDB block cache configuration options for index and filter bl…
platinumhamburg Jan 18, 2026
9fc28b9
[test] Fix unstable test RebalanceITCase#testRebalanceForLogTable (#2…
swuferhong Jan 18, 2026
3075a41
[flink] Support PARTITION_DYNAMIC sink dynamic shuffle based on parti…
loserwang1024 Jan 15, 2026
d1464fd
[flink] Optimize sink statistics calculation to run only when necessary
wuchong Jan 17, 2026
c8d6f21
[flink] Fix the distribution mode support for FlussSink for DataStrea…
wuchong Jan 17, 2026
ed0f671
[tests] Cleanup ZK state in TableChangeWatcherTest to not let current…
vamossagar12 Jan 19, 2026
53f905a
[hotfix] Fix union read fail when from timestamp and projection enabl…
luoyuxia Jan 19, 2026
860e908
[docs] Add docs for rebalance (#2382)
swuferhong Jan 19, 2026
cff2a43
[lake] Support alter table.datalake.freshness (#2365)
zhaomin1423 Jan 20, 2026
e7309db
[flink] Flink integration with Aggregation Merge Engine (#2307)
platinumhamburg Jan 20, 2026
4a4670e
[kv] Add roaring bitmap aggregate function for aggregation merge engi…
platinumhamburg Jan 20, 2026
f4fdeaf
[client] move OffsetsInitializer to fluss-client (#2424)
YannByron Jan 21, 2026
071c572
[hotfix] Catalog related lake options should overwrite table lake rel…
luoyuxia Jan 21, 2026
f8e31dc
[test] Support manually trigger and wait for KV snapshots for FlussCl…
wuchong Jan 21, 2026
7035e24
[spark] support batch read from fluss cluster (#2377)
YannByron Jan 21, 2026
98e587f
[website] Update Fluss messaging (#2428)
polyzos Jan 22, 2026
540587a
[minor][flink] Remove unnecessary java notes (#2444)
leonardBang Jan 23, 2026
640485c
[docs] Add client feature support matrix documentation (#2445)
leekeiabstraction Jan 23, 2026
b10d912
[lake] report pending records even though no tiering is finished (#2453)
luoyuxia Jan 23, 2026
171875f
[hotfix][helm] Update image repository to use 'apache/fluss' instead …
morazow Jan 23, 2026
58bb4de
[coordinator] Refactor SchemaUpdate to delegate schema changes to Sch…
Prajwal-banakar Jan 23, 2026
fda1227
[flink] changelog read support for pk table without pushdown optimiza…
MehulBatra Jan 23, 2026
1a6e236
[lake] Make FlussLakeTiering pluggable to customize tiering job const…
luoyuxia Jan 24, 2026
eab70bb
[server] Fix FirstRowMergeEngine hang caused by empty log entries in …
loserwang1024 Jan 24, 2026
0909a08
[server] Update tablet server metadata cache when TableRegistration c…
LiebingYu Jan 24, 2026
d383c9f
[test] Fix failed testPkCompactedPollFromLatestNoRecords. (#2465)
SML0127 Jan 24, 2026
9a64fb5
[hotfix] Remove unstable "latest" scan startup mode test in Changelog…
wuchong Jan 25, 2026
5ab4f85
[flink] changelog read support for log table without pushdown optimiz…
MehulBatra Jan 25, 2026
f8eab26
[helm] Remove `tablet-server.id` from coordinator configuration optio…
morazow Jan 25, 2026
8a9a1e4
[test] Fix flaky AdjustIsrITCase#testIsrShrinkAndExpand by using acks…
ZuebeyirEser Jan 25, 2026
c711a5d
[doc/website] virtual tables documentation - $changelog for PK/Log Ta…
MehulBatra Jan 25, 2026
a9829e7
[test] Fix unstable test ReplicaTest.testRestore (#2469)
wuchong Jan 26, 2026
95feb64
[spark] Add Spark SQL ALTER table support for set/remove table proper…
Yohahaha Jan 26, 2026
09da97f
[hotfix] update messaging on github page (#2478)
polyzos Jan 26, 2026
a94c343
[build] Upgrade GitHub Actions for Node 24 compatibility (#2349)
sd4324530 Jan 26, 2026
5fd72ed
[lake] Tiering service support commit by time (#2185)
luoyuxia Jan 27, 2026
3f00b00
[kv] Add Producer Offset Snapshot for Exactly-Once semantics (#2434)
platinumhamburg Jan 27, 2026
10091bd
[hotfix] Fix unstable tiering source enumerator test (#2490)
luoyuxia Jan 27, 2026
2ad450f
[test] Add additional assertion to tests for TimestampLtz/Ntz to part…
leekeiabstraction Jan 27, 2026
90f5892
[fs/azure] Support Azure Blob Storage (#1941)
gkatzioura Jan 27, 2026
5c236d2
[flink] Fix Flink connector failure caused by unknown table options (…
LiebingYu Jan 28, 2026
95f4dd9
[spark] Add sparksql lake table catalog DDL suite for paimon (#2438)
Yohahaha Jan 28, 2026
51f516b
[s3] fix the size of fluss-fs-s3 is too big (#2419)
sd4324530 Jan 28, 2026
8 changes: 4 additions & 4 deletions .github/workflows/ci-template.yaml
@@ -40,9 +40,9 @@ jobs:
name: "${{ matrix.module }}"
steps:
- name: Checkout code
uses: actions/checkout@v2
uses: actions/checkout@v6
- name: Set up JDK
uses: actions/setup-java@v4
uses: actions/setup-java@v5
with:
java-version: ${{ inputs.java-version }}
distribution: 'temurin'
@@ -66,13 +66,13 @@ jobs:
ARTIFACTS_OSS_STS_ENDPOINT: ${{ secrets.ARTIFACTS_OSS_STS_ENDPOINT }}
ARTIFACTS_OSS_ROLE_ARN: ${{ secrets.ARTIFACTS_OSS_ROLE_ARN }}
- name: Upload build logs
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v6
if: ${{ failure() }}
with:
name: logs-test-${{ matrix.module }}-${{ github.run_number}}#${{ github.run_attempt }}
path: ${{ runner.temp }}/fluss-logs/*
- name: Upload JaCoCo coverage report
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v6
if: ${{ success() && github.ref == 'refs/heads/main' }}
with:
name: jacoco-report-${{ matrix.module }}-${{ github.run_number}}#${{ github.run_attempt }}
4 changes: 2 additions & 2 deletions .github/workflows/ci.yaml
@@ -40,9 +40,9 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v2
uses: actions/checkout@v6
- name: Set up JDK 8
uses: actions/setup-java@v4
uses: actions/setup-java@v5
with:
java-version: '8'
distribution: 'temurin'
6 changes: 3 additions & 3 deletions .github/workflows/docs-check.yaml
@@ -37,14 +37,14 @@ jobs:
run:
working-directory: ./website
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Generate versioned docs
run: ./build_versioned_docs.sh
- uses: actions/setup-node@v4
- uses: actions/setup-node@v6
with:
node-version: 20
node-version: 24
- name: Install dependencies
run: npm install
- name: Test build website
4 changes: 2 additions & 2 deletions .github/workflows/license-check.yml
@@ -32,10 +32,10 @@ jobs:
MVN_VALIDATION_DIR: "/tmp/fluss-validation-deployment"

steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6

- name: Set JDK
uses: actions/setup-java@v4
uses: actions/setup-java@v5
with:
java-version: 11
distribution: 'temurin'
3 changes: 3 additions & 0 deletions .gitignore
@@ -22,6 +22,9 @@ dependency-reduced-pom.xml
### VS Code ###
.vscode/

### claude code ###
.claude/

### Mac OS ###
.DS_Store

2 changes: 1 addition & 1 deletion NOTICE
@@ -1,5 +1,5 @@
Apache Fluss (incubating)
Copyright 2025 The Apache Software Foundation
Copyright 2025-2026 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
14 changes: 7 additions & 7 deletions README.md
@@ -38,7 +38,7 @@

## What is Apache Fluss (Incubating)?

Apache Fluss (Incubating) is a streaming storage built for real-time analytics which can serve as the real-time data layer for Lakehouse architectures.
Apache Fluss (Incubating) is a streaming storage built for real-time analytics & AI which can serve as the real-time data layer for Lakehouse architectures.

It bridges the gap between **data streaming** and **data Lakehouse** by enabling low-latency, high-throughput data ingestion and processing while seamlessly integrating with popular compute engines like **Apache Flink**, while
Apache Spark, and StarRocks are coming soon.
@@ -47,12 +47,12 @@ Apache Spark, and StarRocks are coming soon.

## Features

- **Sub-Second Latency**: Low-latency streaming reads/writes optimized for real-time applications with Apache Flink.
- **Columnar Stream**: 10x improvement in streaming read performance with efficient pushdown projections.
- **Streaming & Lakehouse Unification**: Unified data streaming and Lakehouse with low latencies for powerful analytics.
- **Real-Time Updates**: Cost-efficient partial updates for large-scale data without expensive join operations.
- **Changelog Generation**: Complete changelogs for streaming processors, streamlining analytics workflows.
- **Lookup Queries**: Ultra-high QPS for primary key lookups, enabling efficient dimension table serving.
- **Sub-Second Data Freshness**: Continuous ingestion and immediate availability of data enable low-latency analytics and real-time decision-making at scale.
- **Streaming & Lakehouse Unification**: Streaming-native storage with low-latency access on top of the lakehouse, using tables as a single abstraction to unify real-time and historical data across engines.
- **Columnar Streaming**: Based on Apache Arrow it allows database primitives on data streams and techniques like column pruning and predicate pushdown. This ensures engines read only the data they need, minimizing I/O and network costs.
- **Compute–Storage Separation**: Stream processors focus on pure computation while Fluss manages state and storage, with features like deduplication, partial updates, delta joins, and aggregation merge engines.
- **ML & AI–Ready Storage**: A unified storage layer supporting row-based, columnar, vector, and multi-modal data, enabling real-time feature stores and a centralized data repository for ML and AI systems.
- **Changelogs & Decision Tracking**: Built-in changelog generation provides an append-only history of state and decision evolution, enabling auditing, reproducibility, and deep system observability.

## Building

174 changes: 174 additions & 0 deletions fluss-client/src/main/java/org/apache/fluss/client/admin/Admin.java
@@ -22,9 +22,13 @@
import org.apache.fluss.client.metadata.KvSnapshots;
import org.apache.fluss.client.metadata.LakeSnapshot;
import org.apache.fluss.cluster.ServerNode;
import org.apache.fluss.cluster.rebalance.GoalType;
import org.apache.fluss.cluster.rebalance.RebalanceProgress;
import org.apache.fluss.cluster.rebalance.ServerTag;
import org.apache.fluss.config.ConfigOptions;
import org.apache.fluss.config.cluster.AlterConfig;
import org.apache.fluss.config.cluster.ConfigEntry;
import org.apache.fluss.exception.AuthorizationException;
import org.apache.fluss.exception.DatabaseAlreadyExistException;
import org.apache.fluss.exception.DatabaseNotEmptyException;
import org.apache.fluss.exception.DatabaseNotExistException;
@@ -35,10 +35,15 @@
import org.apache.fluss.exception.InvalidTableException;
import org.apache.fluss.exception.KvSnapshotNotExistException;
import org.apache.fluss.exception.LakeTableSnapshotNotExistException;
import org.apache.fluss.exception.NoRebalanceInProgressException;
import org.apache.fluss.exception.NonPrimaryKeyTableException;
import org.apache.fluss.exception.PartitionAlreadyExistsException;
import org.apache.fluss.exception.PartitionNotExistException;
import org.apache.fluss.exception.RebalanceFailureException;
import org.apache.fluss.exception.SchemaNotExistException;
import org.apache.fluss.exception.ServerNotExistException;
import org.apache.fluss.exception.ServerTagAlreadyExistException;
import org.apache.fluss.exception.ServerTagNotExistException;
import org.apache.fluss.exception.TableAlreadyExistException;
import org.apache.fluss.exception.TableNotExistException;
import org.apache.fluss.exception.TableNotPartitionedException;
@@ -58,8 +67,12 @@
import org.apache.fluss.security.acl.AclBinding;
import org.apache.fluss.security.acl.AclBindingFilter;

import javax.annotation.Nullable;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

/**
@@ -492,4 +505,165 @@ ListOffsetsResult listOffsets(
* @return A CompletableFuture indicating completion of the operation.
*/
CompletableFuture<Void> alterClusterConfigs(Collection<AlterConfig> configs);

/**
* Add a server tag to the specified tabletServers; each tabletServer can have only one serverTag.
*
<p>If adding the tag fails for any of the tabletServers, none of the tags will take effect.
*
<p>If a tabletServer already has a serverTag and it is the same as the existing one, this
operation will be ignored.
*
* <ul>
* <li>{@link AuthorizationException} If the authenticated user doesn't have cluster
* permissions.
* <li>{@link ServerNotExistException} If the tabletServer in {@code tabletServers} does not
* exist.
* <li>{@link ServerTagAlreadyExistException} If the server tag already exists for any one of
* the tabletServers, and the server tag is different from the existing one.
* </ul>
*
* @param tabletServers the tabletServers we want to add server tags.
* @param serverTag the server tag to be added.
*/
CompletableFuture<Void> addServerTag(List<Integer> tabletServers, ServerTag serverTag);

/**
* Remove server tag from the specified tabletServers.
*
<p>If removing the tag fails for any of the tabletServers, none of the tags will be removed.
*
<p>No exception will be thrown if a server currently has no server tag.
*
* <ul>
* <li>{@link AuthorizationException} If the authenticated user doesn't have cluster
* permissions.
* <li>{@link ServerNotExistException} If the tabletServer in {@code tabletServers} does not
* exist.
* <li>{@link ServerTagNotExistException} If the server tag does not exist for any one of the
* tabletServers.
* </ul>
*
* @param tabletServers the tabletServers we want to remove server tags.
*/
CompletableFuture<Void> removeServerTag(List<Integer> tabletServers, ServerTag serverTag);
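For orientation, a minimal usage sketch of the two server-tag calls above (illustrative only and not part of this diff; it assumes an Admin handle obtained elsewhere, arbitrary server ids, and the PERMANENT_OFFLINE tag referenced in the commit list):

    // Sketch: tag two tablet servers before maintenance, then clear the tag afterwards.
    // `admin` is assumed to be an org.apache.fluss.client.admin.Admin instance.
    void tagServersForMaintenance(Admin admin) throws Exception {
        List<Integer> servers = Arrays.asList(3, 4);
        // All-or-nothing: if tagging any of the servers fails, no tag takes effect.
        admin.addServerTag(servers, ServerTag.PERMANENT_OFFLINE).get();
        // ... drain replicas / perform maintenance ...
        admin.removeServerTag(servers, ServerTag.PERMANENT_OFFLINE).get();
    }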

/**
* Based on the provided {@code priorityGoals}, Fluss performs load balancing on the cluster's
* bucket load.
*
<p>In more detail, Fluss collects the cluster's load information and optimizes bucket placement
to perform load balancing according to the user-defined {@code priorityGoals}.
*
* <p>Currently, Fluss only supports one active rebalance task in the cluster. If an uncompleted
* rebalance task exists, Fluss will return the uncompleted rebalance task's progress.
*
* <p>If you want to cancel the rebalance task, you can use {@link #cancelRebalance(String)}
*
* <ul>
* <li>{@link AuthorizationException} If the authenticated user doesn't have cluster
* permissions.
<li>{@link RebalanceFailureException} If the rebalance failed, e.g. because an in-progress
execution already exists.
* </ul>
*
* @param priorityGoals the goals to be optimized.
* @return the rebalance id. If there is no rebalance task in progress, it will trigger a new
* rebalance task and return the rebalance id.
*/
CompletableFuture<String> rebalance(List<GoalType> priorityGoals);

/**
* List the rebalance progress.
*
* <ul>
* <li>{@link AuthorizationException} If the authenticated user doesn't have cluster
* permissions.
* <li>{@link NoRebalanceInProgressException} If there are no rebalance tasks in progress for
* the input rebalanceId.
* </ul>
*
@param rebalanceId the rebalance id to list progress for; if null, the progress of the
currently in-progress rebalance task is listed.
@return the rebalance progress.
*/
CompletableFuture<Optional<RebalanceProgress>> listRebalanceProgress(
@Nullable String rebalanceId);

/**
* Cancel the rebalance task.
*
* <ul>
* <li>{@link AuthorizationException} If the authenticated user doesn't have cluster
* permissions.
<li>{@link NoRebalanceInProgressException} If there are no rebalance tasks in progress or
the rebalance id does not exist.
* </ul>
*
@param rebalanceId the rebalance id to cancel; if null, the existing in-progress rebalance
task is cancelled. If the rebalanceId does not exist on the server, {@link
NoRebalanceInProgressException} will be thrown.
*/
CompletableFuture<Void> cancelRebalance(@Nullable String rebalanceId);
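Taken together, a rough client-side sketch of the rebalance lifecycle (illustrative only and not part of this diff; the Admin handle and the list of GoalType goals are assumed, since concrete goal constants are not shown here):

    // Sketch: trigger a rebalance, check its progress, and optionally cancel it.
    void runRebalance(Admin admin, List<GoalType> goals) throws Exception {
        // Triggers a rebalance optimized for the given goals; the id identifies the task.
        String rebalanceId = admin.rebalance(goals).get();
        Optional<RebalanceProgress> progress = admin.listRebalanceProgress(rebalanceId).get();
        progress.ifPresent(p -> System.out.println("Rebalance " + rebalanceId + ": " + p));
        // To abort, pass the id (or null for the single in-progress task):
        // admin.cancelRebalance(rebalanceId).get();
    }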

// ==================================================================================
// Producer Offset Management APIs (for Exactly-Once Semantics)
// ==================================================================================

/**
* Register producer offset snapshot.
*
* <p>This method provides atomic "check and register" semantics:
*
* <ul>
* <li>If snapshot does not exist: create new snapshot and return {@link
* RegisterResult#CREATED}
* <li>If snapshot already exists: do NOT overwrite and return {@link
* RegisterResult#ALREADY_EXISTS}
* </ul>
*
* <p>The atomicity is guaranteed by the server implementation. This enables the caller to
* determine whether undo recovery is needed based on the return value.
*
* <p>The snapshot will be automatically cleaned up after the configured TTL expires.
*
* <p>This API is typically used by Flink Operator Coordinator at job startup to register the
* initial offset snapshot before any data is written.
*
* @param producerId the ID of the producer (typically Flink job ID)
* @param offsets map of TableBucket to offset for all tables
* @return a CompletableFuture containing the registration result indicating whether the
* snapshot was newly created or already existed
* @since 0.9
*/
CompletableFuture<RegisterResult> registerProducerOffsets(
String producerId, Map<TableBucket, Long> offsets);

/**
* Get producer offset snapshot.
*
* <p>This method retrieves the registered offset snapshot for a producer. Returns null if no
* snapshot exists for the given producer ID.
*
* <p>This API is typically used by Flink Operator Coordinator at job startup to check if a
* previous snapshot exists (indicating a failover before first checkpoint).
*
* @param producerId the ID of the producer
* @return a CompletableFuture containing the producer offsets, or null if not found
* @since 0.9
*/
CompletableFuture<ProducerOffsetsResult> getProducerOffsets(String producerId);

/**
* Delete producer offset snapshot.
*
* <p>This method deletes the registered offset snapshot for a producer. This is typically
* called after the first checkpoint completes successfully, as the checkpoint state will be
* used for recovery instead of the initial snapshot.
*
* @param producerId the ID of the producer
* @return a CompletableFuture that completes when deletion succeeds
* @since 0.9
*/
CompletableFuture<Void> deleteProducerOffsets(String producerId);
}
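As a closing illustration of the producer-offset APIs added above, a hedged bootstrap sketch (not part of this diff; only the three Admin calls and the RegisterResult values come from the javadoc, while the control flow, job-id handling, and recovery step are assumptions):

    // Sketch: exactly-once bootstrap for a producer such as a Flink job.
    void bootstrapProducerOffsets(Admin admin, String jobId, Map<TableBucket, Long> offsets)
            throws Exception {
        RegisterResult result = admin.registerProducerOffsets(jobId, offsets).get();
        if (result == RegisterResult.ALREADY_EXISTS) {
            // A snapshot from a previous attempt exists: the job failed over before its
            // first checkpoint, so fetch the stored offsets and roll writes back to them.
            ProducerOffsetsResult previous = admin.getProducerOffsets(jobId).get();
            // ... application-specific undo recovery based on `previous` ...
        }
        // After the first successful checkpoint the snapshot is no longer needed:
        // admin.deleteProducerOffsets(jobId).get();
    }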