[HUDI-6806] Support Spark 3.5.0 #9717

CTTY · 2023-09-14T22:18:56Z

Change Logs

Support Spark 3.5.0

Impact

No public API changes

Risk level (write none, low, medium or high below)

Medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default values of configs are changed.
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

vinothchandar · 2023-09-16T16:45:58Z

Awesome, @CTTY!

@yihua can we figure out how to integrate with the native Spark reader on top?

yihua · 2023-09-19T05:57:48Z

Yes. We should be able to use the native Spark parquet reader from the file format.

CTTY · 2023-09-22T17:26:00Z

@hudi-bot run azure

.github/workflows/bot.yml

hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/DataFrameUtil.scala

hudi-common/src/test/java/org/apache/hudi/common/util/TestClusteringUtils.java

CTTY · 2023-11-03T00:21:52Z

...eg-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BaseValidateDatasetNode.java

-    return RowEncoder.apply(schema)
-        .resolveAndBind(JavaConverters.asScalaBufferConverter(attributes).asScala().toSeq(),
-            SimpleAnalyzer$.MODULE$);
+    return SparkAdapterSupport$.MODULE$.sparkAdapter().getEncoder(schema);


SPARK-44531 Encoder inference moved elsewhere in Spark 3.5.0
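
The change above routes encoder creation through the Spark adapter instead of calling a version-specific Spark API directly. As a rough illustration of that idea only, the sketch below picks an implementation from a version string. Every name in it (`EncoderFactory`, `EncoderFactories`, `forSparkVersion`) is hypothetical, and the string return values are stand-ins for real encoders; it is not how Hudi's SparkAdapter is actually implemented.

```java
// Sketch only: all names are hypothetical; this is not Hudi's SparkAdapter.
interface EncoderFactory {
  // Stand-in for "build a row encoder for this schema".
  String encoderFor(String schemaDdl);
}

final class EncoderFactories {
  private EncoderFactories() {}

  // Chooses an implementation from the Spark version string, so callers
  // never reference a version-specific encoder API directly.
  static EncoderFactory forSparkVersion(String sparkVersion) {
    if (sparkVersion.startsWith("3.5")) {
      return schema -> "spark-3.5 encoder for " + schema;
    }
    return schema -> "pre-3.5 encoder for " + schema;
  }

  public static void main(String[] args) {
    System.out.println(forSparkVersion("3.5.0").encoderFor("id INT"));
    System.out.println(forSparkVersion("3.4.1").encoderFor("id INT"));
  }
}
```

Call sites then depend only on the small interface, which is why a move like SPARK-44531 can be absorbed in one place.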

CTTY · 2023-11-03T00:22:29Z

hudi-spark-datasource/hudi-spark/pom.xml

+      <artifactId>parquet-hadoop-bundle</artifactId>
+      <version>${parquet.version}</version>
+      <scope>provided</scope>
+    </dependency>


Added parquet-hadoop-bundle to fix classpath issues:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.datasources.parquet.ParquetOptions$
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetOptions.scala:50)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetOptions.scala:40)
    at org.apache.spark.sql.execution.datasources.parquet.Spark34LegacyHoodieParquetFileFormat.buildReaderWithPartitionValues(Spark34LegacyHoodieParquetFileFormat.scala:150)

CTTY · 2023-11-03T00:22:49Z

...di-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala

+      case ae: AnalysisException if (ae.getMessage().startsWith("[INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA] Cannot write incompatible data for the table")
+        || ae.getMessage().startsWith("Cannot write incompatible data to table")) =>
+        planUtils.resolveOutputColumns(catalogTable.catalogTableName, sparkAdapter.toAttributes(expectedSchema), query, byName = false, conf)
    }


SPARK-42309 Error message changed in Spark 3.5.0
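
The guard in the diff matches the exception by message prefix, accepting both wordings so the same code path works across Spark versions. A minimal stdlib-only sketch of that check follows; the class and method names (`IncompatibleDataMessages`, `matches`) are hypothetical, and only the two prefix strings come from the diff.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: class and method names are hypothetical, not Hudi APIs.
final class IncompatibleDataMessages {
  // The two prefixes matched in the diff above.
  private static final List<String> KNOWN_PREFIXES = Arrays.asList(
      "[INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA] Cannot write incompatible data for the table",
      "Cannot write incompatible data to table");

  private IncompatibleDataMessages() {}

  // Returns true when the message starts with any known prefix; null-safe.
  static boolean matches(String message) {
    if (message == null) {
      return false;
    }
    return KNOWN_PREFIXES.stream().anyMatch(message::startsWith);
  }

  public static void main(String[] args) {
    System.out.println(matches("Cannot write incompatible data to table 'db.t'"));
    System.out.println(matches("Table not found"));
  }
}
```

Keeping the prefixes in one list makes it cheap to add another wording if a later Spark release changes the message again.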

CTTY · 2023-11-03T00:25:34Z

...di-spark3.5.x/src/main/scala/org/apache/spark/sql/HoodieSpark35CatalystExpressionUtils.scala

+        case DateDiff(_, OrderPreservingTransformation(attrRef)) => Some(attrRef)
+        case FromUnixTime(OrderPreservingTransformation(attrRef), _, _) => Some(attrRef)
+        case FromUTCTimestamp(OrderPreservingTransformation(attrRef), _) => Some(attrRef)
+        case ParseToDate(OrderPreservingTransformation(attrRef), _, _, _) => Some(attrRef)


Added an extra wildcard argument to the ParseToDate pattern because of the ParseToDate API change in SPARK-43779

CTTY · 2023-11-03T00:26:14Z

...rk3.5.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_5ExtendedSqlAstBuilder.scala

+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate.{First, Last}
+import org.apache.spark.sql.catalyst.parser.ParserUtils.{checkDuplicateClauses, checkDuplicateKeys, entry, escapedIdentifier, operationNotAllowed, source, string, stringWithoutUnescape, validate, withOrigin}
+import org.apache.spark.sql.catalyst.parser.{EnhancedLogicalPlan, ParseException, ParserInterface}


SPARK-44333, EnhancedLogicalPlan moved to a different package

CTTY · 2023-11-03T00:26:41Z

packaging/hudi-utilities-bundle/pom.xml

                  <include>com.github.ben-manes.caffeine:caffeine</include>
+                  <!-- SPARK-43489 Spark 3.5+ has marked protobuf as provided -->
+                  <include>com.google.protobuf:protobuf-java</include>
                  <include>com.twitter:bijection-avro_${scala.binary.version}</include>


This is needed; otherwise deltastreamer would fail with:

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/protobuf/Message
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2729)
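
The error above means `com.google.protobuf.Message` was not on the runtime classpath. One way to check for that kind of gap before launching a job is a small probe like the one below. `ClasspathProbe` is a hypothetical helper written for illustration; it is not part of Hudi or Spark.

```java
// Sketch only: ClasspathProbe is a hypothetical helper, not part of Hudi.
final class ClasspathProbe {
  private ClasspathProbe() {}

  // Returns true if the named class can be located by this class's loader.
  // initialize=false avoids running static initializers during the probe.
  static boolean isPresent(String className) {
    try {
      Class.forName(className, false, ClasspathProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("protobuf Message present: "
        + isPresent("com.google.protobuf.Message"));
  }
}
```

Running the probe with the bundle jar on the classpath shows whether the shaded bundle actually carries the class.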

packaging/hudi-utilities-slim-bundle/pom.xml

CTTY · 2023-11-03T00:29:15Z

pom.xml

+        </property>
+      </activation>
+    </profile>
+


I haven't changed the default Spark3 profile to Spark 3.5

.github/workflows/bot.yml

hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/DataFrameUtil.scala

hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala

yihua · 2023-11-08T21:30:07Z

hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroUtils.java

        + "{\"name\": \"timestamp\",\"type\": \"double\"},{\"name\": \"_row_key\", \"type\": \"string\"},"
        + "{\"name\": \"non_pii_col\", \"type\": \"string\"},"
-        + "{\"name\": \"pii_col\", \"type\": \"string\"}]},";
+        + "{\"name\": \"pii_col\", \"type\": \"string\"}]}";


Interesting. Did the test fail before with the comma at the end of the schema string?

No, it didn't fail before. Avro 1.11.2 enforces a stricter schema format.
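
To make the problem concrete: the old test string ended with a stray comma after the closing brace of the schema. The sketch below is a hypothetical stdlib-only pre-check that reports whatever text trails the first complete top-level JSON object. It is not Avro's parser, and `SchemaTextCheck` and `trailingText` are invented names.

```java
// Sketch only: a hypothetical pre-check, not Avro's parser.
final class SchemaTextCheck {
  private SchemaTextCheck() {}

  // Returns the trimmed text that follows the first complete top-level
  // JSON object; an empty result means nothing trails the closing brace.
  // Throws IllegalArgumentException if no complete object is found.
  static String trailingText(String json) {
    int depth = 0;
    boolean inString = false;
    boolean started = false;
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        if (c == '\\') {
          i++; // skip the escaped character
        } else if (c == '"') {
          inString = false;
        }
      } else if (c == '"') {
        inString = true;
      } else if (c == '{') {
        depth++;
        started = true;
      } else if (c == '}') {
        depth--;
        if (started && depth == 0) {
          return json.substring(i + 1).trim();
        }
      }
    }
    throw new IllegalArgumentException("no complete JSON object found");
  }

  public static void main(String[] args) {
    String fields = "[{\"name\": \"pii_col\", \"type\": \"string\"}]";
    String old = "{\"type\": \"record\", \"name\": \"r\", \"fields\": " + fields + "},";
    System.out.println("trailing: '" + trailingText(old) + "'");
  }
}
```

For a string shaped like the old test schema it returns `,`; with the trailing comma removed it returns an empty string.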

hudi-common/src/test/java/org/apache/hudi/common/util/TestClusteringUtils.java

yihua · 2023-11-08T22:25:22Z

hudi-spark-datasource/hudi-spark2/pom.xml

+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-core_${scala.binary.version}</artifactId>
+      <version>${spark2.version}</version>
+      <scope>provided</scope>
+      <optional>true</optional>
+    </dependency>


Any reason for adding this?

I remember seeing some classpath issues, but I can't find the exact error message. We can try reverting this change.

yihua · 2023-11-08T22:26:44Z

hudi-spark-datasource/hudi-spark3.0.x/pom.xml

+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-core_${scala.binary.version}</artifactId>
+      <version>${spark30.version}</version>
+      <scope>provided</scope>
+      <optional>true</optional>
+    </dependency>


Similar here and in the other poms.

...park3.0.x/src/test/java/org/apache/hudi/internal/HoodieBulkInsertInternalWriterTestBase.java

...park3.4.x/src/test/java/org/apache/hudi/internal/HoodieBulkInsertInternalWriterTestBase.java

packaging/hudi-utilities-slim-bundle/pom.xml

yihua

LGTM. I addressed all the minor comments.

hudi-bot · 2023-11-16T02:49:10Z

CI report:

9b8fdd2 UNKNOWN
afe70da Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua · 2023-11-16T05:56:55Z

Azure CI on master also fails on the fourth task. Merging this PR.


pan3793 · 2024-02-26T14:27:17Z

I didn't find the bundle jar for Spark 3.5 on Maven Central, am I missing something?

yihua · 2024-03-09T22:04:29Z

Spark 3.5 bundle jar will be added in Hudi 0.15.0 release.

ranwani · 2024-03-12T10:11:25Z

@yihua: We need to use Hudi with Spark 3.5. Can you let me know when the Hudi 0.15.0 release is planned?

yihua · 2024-03-18T19:24:39Z

The 0.15.0 release branch is planned to be cut this month once we verify engine integrations.

melin · 2024-04-02T05:41:50Z

When will it be released?

ranwani · 2024-04-22T10:22:28Z

@yihua Any estimated date for the release?

Gatsby-Lee · 2024-06-11T22:13:22Z

What is the release date for Hudi 0.15.0?

CTTY · 2024-06-11T22:54:11Z

Hi @Gatsby-Lee, it was released last week: https://github.com/apache/hudi/tree/release-0.15.0

Gatsby-Lee · 2024-06-12T00:08:05Z

Oh, so the docs have not been updated yet.

CTTY · 2024-10-07T23:27:30Z

We upgraded hive-storage-api to 2.8.1 in this PR, and we recently found that this may cause issues for HoodieStreamer with an ORC source. Please see: https://issues.apache.org/jira/browse/HUDI-8081

Why we added hive-storage-api to hudi-common and upgraded it:

A bunch of Hive dependency issues
* [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java:[36,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
  [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/common/util/OrcUtils.java:[37,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
  [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroOrcWriter.java:[37,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
  [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroOrcWriter.java:[38,45] package org.apache.hadoop.hive.ql.exec.vector does not exist
  [ERROR] /Users/yxchang/code/Aws157Hudi/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroOrcWriter.java:[55,17] cannot find symbol
* Fixed by adding hive-storage-api to hudi-common

testOrcIteratorReadData
* [ERROR] testOrcIteratorReadData Time elapsed: 2.171 s <<< ERROR!
  java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/exec/vector/DateColumnVector
      at org.apache.orc.TypeDescription.createRowBatch(TypeDescription.java:491)
      at org.apache.orc.TypeDescription.createRowBatch(TypeDescription.java:525)
      at
* This is because orc 1.9.1 depends on hive-storage-api 2.8.1
* Upgrading hive-storage-api in hudi-common to 2.8.1 solves the problem

CTTY force-pushed the ctty/hudi1x-spark35 branch from 533ea14 to 0a89361 Compare September 19, 2023 04:21

CTTY changed the title ~~[DNM] Support Spark 3.5.0~~ [HUDI-6806] Support Spark 3.5.0 Sep 19, 2023

CTTY marked this pull request as ready for review September 19, 2023 17:03

yihua mentioned this pull request Sep 23, 2023

[DNM] Run UT for Spark 3.5.0 on 0.14.0 #9570

Closed

yihua self-assigned this Oct 17, 2023

CTTY force-pushed the ctty/hudi1x-spark35 branch 2 times, most recently from 6ada390 to ef17855 Compare October 28, 2023 00:01

CTTY mentioned this pull request Oct 28, 2023

[HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 #9895

Merged

vinothchandar added the release-0.14.1 label Nov 3, 2023

yihua force-pushed the ctty/hudi1x-spark35 branch from b9c7684 to 94c46b0 Compare November 8, 2023 21:25

yihua reviewed Nov 8, 2023

yihua added the priority:blocker (Production down; release blocker) and release-1.0.0 labels Nov 8, 2023

yihua force-pushed the ctty/hudi1x-spark35 branch 2 times, most recently from 43253ea to 6aeea1a Compare November 9, 2023 21:41

CTTY added 8 commits November 15, 2023 10:30

d05e966 Support Spark 3.5.0
f0eaf5d Adjust pom version
eef5eaf see all test failures
c8841b5 Fix insert into statement related fix
2c0b0ad add hive-storage-api as provided in hudi-spark
1dfc1d5 Minor compilation fix
ae78542 minor
852fd5a Fix case sensitive test for Spark35

yihua added 7 commits November 15, 2023 11:53

3e1289b Fix bundle validation for Spark 3.5 and GH CI script
1efbe4e Move util methods from SparkAdapter to corresponding util classes
af28064 Address nits and add docs
6e762ec Fix pom
a83f451 Fix build
8ac9c76 Fix nits
017a375 Change Spark 3 profile to use Spark 3.5

yihua approved these changes Nov 16, 2023

afe70da Shade protobuf in utilities-slim bundle instead

yihua merged commit 874b5de into apache:master Nov 16, 2023

CTTY deleted the ctty/hudi1x-spark35 branch January 8, 2024 21:02

yihua added a commit that referenced this pull request Feb 27, 2024

ae80cbd [HUDI-6806] Support Spark 3.5.0 (#9717)
Co-authored-by: Shawn Chang, Y Ethan Guo

CTTY mentioned this pull request Mar 4, 2024

hudi 0.14.1 and hudi 0.14.0 build issue #10808

Closed

yihua added release-0.15.0 and removed release-0.14.1 labels Mar 5, 2024

CTTY mentioned this pull request Mar 15, 2024

[SUPPORT] hudi0.14.0: Insert data into hudi with spark or create a new table exception #10838

Closed

yihua added a commit that referenced this pull request May 3, 2024

8c121d9 [HUDI-6806] Support Spark 3.5.0 (#9717)
